This document provides a technical overview of the Data Quality Monitoring Tool, detailing its architecture, data processing pipeline, monitoring mechanisms, and deployment strategy within Google Cloud Platform (GCP).

The tool ensures data quality and integrity by leveraging Talend for data ingestion, BigQuery for storage and processing, and Dataplex for rule execution. The processed results are then visualized in Qlik Sense.

This documentation serves as a reference for understanding the system components, workflows, and operational best practices.

Architecture

Below is the high-level architecture of the Data Quality KPIs monitoring tool.


1. Data Ingestion, Processing and Transformation

The process begins with the ingestion of data for each domain from various SAP source systems:

Each table from each domain has its own dedicated Talend job responsible for ingesting and loading the data into the GCP BigQuery Data Ocean, specifically the following datasets:

In addition, views containing only the necessary data are created in the following datasets:

These views are the sole source for the quality checks performed by Dataplex.
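As a minimal sketch of how such a restricted view can be created with the google-cloud-bigquery Python client, see below. The project, dataset, table, and column names are placeholders, not the tool's actual names:

```python
from google.cloud import bigquery

# Placeholder project, dataset, table, and column names; the real Data Ocean
# and view datasets are the ones listed in this section.
client = bigquery.Client(project="my-gcp-project")

view_sql = """
CREATE OR REPLACE VIEW `my-gcp-project.dq_views_dev.vw_customer_master` AS
SELECT customer_id, country_code, created_date  -- only the columns the checks need
FROM `my-gcp-project.data_ocean_dev.customer_master`
"""

client.query(view_sql).result()  # .result() blocks until the DDL statement completes
```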

2. Data Quality Execution in Dataplex

Once the views are created, the data quality rules are executed using GCP Dataplex Service and the validation results are stored in the following BigQuery table:
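As an illustration of this step, the sketch below triggers a scan on demand with the Dataplex Python client and then reads the exported results back from BigQuery. The scan name, results table, and column names are assumptions rather than the tool's actual names:

```python
from google.cloud import bigquery, dataplex_v1

# Hypothetical scan resource name; real scans are defined per domain/view.
SCAN_NAME = "projects/my-gcp-project/locations/europe-west1/dataScans/dq-customer-scan"

dataplex = dataplex_v1.DataScanServiceClient()
response = dataplex.run_data_scan(name=SCAN_NAME)  # triggers one on-demand run
print("Started job:", response.job.name)

# The results table and its columns below are placeholders for the BigQuery
# table mentioned above, which stores one row per rule evaluation.
bq = bigquery.Client()
rows = bq.query("""
    SELECT rule_name, rule_passed, job_start_time
    FROM `my-gcp-project.dq_results_dev.dataplex_scan_results`
    ORDER BY job_start_time DESC
    LIMIT 20
""").result()
for row in rows:
    print(row.rule_name, row.rule_passed)
```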

3. Data Model Dimension Tables Population

After ingestion, two routines are executed to populate the Data Model (DM) Dimension tables:

4. Data Model Fact Tables Population

A routine is executed to populate the DM Fact tables:
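Assuming the routines are BigQuery stored procedures (the routine names appear in the Process Scheduling Details table below; the dataset name here is a placeholder), a minimal sketch of invoking the dimension and fact routines in order looks like this:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")

# Routine names are taken from the Process Scheduling Details table; the
# dataset name is a placeholder. Dimension routines run before the fact routine.
ROUTINES = (
    "insert_DIM_Domain",              # DM dimension table
    "insert_DIM_kpi_dimension",       # DM dimension table
    "RT_DPtoDMmapping_Datespecific",  # DM fact table
)

for routine in ROUTINES:
    client.query(f"CALL `my-gcp-project.dq_dm_dev.{routine}`();").result()
```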

5. Failed Records Handling & Export

A final Talend job, PL_DQ_BQ_to_Gshet_Selfservice, handles failed records:

[env] is one of the following: dev, test, ppd, prod
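The export itself is implemented in Talend and is not shown here; the sketch below only illustrates the kind of read it performs, with a hypothetical failed-records table and columns:

```python
from google.cloud import bigquery

client = bigquery.Client()

env = "dev"  # one of dev, test, ppd, prod, as described above

# Table and column names are illustrative, not the tool's actual schema.
query = f"""
    SELECT rule_name, failed_record_id, check_timestamp
    FROM `my-gcp-project.dq_dm_{env}.failed_records`
    WHERE check_timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
"""
for row in client.query(query).result():
    print(row.rule_name, row.failed_record_id, row.check_timestamp)
```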

6. Visualization in Qlik Sense

The processed and validated data is available for visualization and analysis in Qlik Sense. 

Data Model

The Data Model consists of a set of structured tables within BigQuery that store processed and validated data. These tables are designed to support efficient querying, data quality monitoring, and reporting.

The following tables are part of the Data Model schema:

[env] is one of the following: dev, test, ppd, prod

Each table plays a crucial role in storing metadata, data quality rules, validation results, and failed records for further analysis. The schema below shows the relationships between the DM tables:

Scheduling

The following processes are scheduled on a weekly basis.

1. Talend Ingestion Jobs

The Ingestion Jobs are scheduled to run within Talend every week, at the beginning of the process.

2. Data Quality Scans

Initially "On Demand" for testing purposes and then "Scheduled" to run every week within Dataplex.

3. Routines Execution - Data Model Dimension Tables

The two routines are triggered by weekly scheduled queries within BigQuery.

[env] is one of the following: dev, test, ppd, prod
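A minimal sketch of creating one such weekly scheduled query with the BigQuery Data Transfer Service Python client; the project, location, and dataset are placeholders, and schedule times are interpreted in UTC:

```python
from google.cloud import bigquery_datatransfer

client = bigquery_datatransfer.DataTransferServiceClient()
parent = "projects/my-gcp-project/locations/europe-west1"  # placeholder

transfer_config = bigquery_datatransfer.TransferConfig(
    display_name="Weekly insert_DIM_Domain",
    data_source_id="scheduled_query",  # identifies a BigQuery scheduled query
    schedule="every monday 05:00",     # schedule times are UTC
    params={"query": "CALL `my-gcp-project.dq_dm_dev.insert_DIM_Domain`();"},
)

config = client.create_transfer_config(
    parent=parent,
    transfer_config=transfer_config,
)
print("Created scheduled query:", config.name)
```

The fact-table routine in the next step is scheduled the same way, with its own query and time window.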

4. Routines Execution - Data Model Fact Tables

The routine is triggered by a weekly scheduled query within BigQuery.

[env] is one of the following: dev, test, ppd, prod

5. Talend Report Job

The Talend Job PL_DQ_BQ_to_Gshet_Selfservice is scheduled to run within Talend every week, at the end of the process.

6. Qlik Sense Refresh

The Qlik Sense refresh schedule is set by the Visualization Engineer within Qlik Sense.

Process Scheduling Details

Below is a table summarizing the processes, their frequency, duration window, and average duration.

Process | Frequency | Duration Window | Average Duration (min)
Talend Ingestion Jobs | Every Sunday | 21:00 CET |
Dataplex Data Quality Scans | Every Monday | 4:00 - 5:00 CET | 1
BigQuery Routine insert_DIM_Domain | Every Monday | 5:00 - 5:30 CET | 1
BigQuery Routine insert_DIM_kpi_dimension | Every Monday | 5:30 - 6:00 CET | 1
BigQuery Routine RT_DPtoDMmapping_Datespecific | Every Monday | 6:00 - 6:30 CET | 1
Talend Report Job PL_DQ_BQ_to_Gshet_Selfservice | Every Monday | 6:30 - 7:00 CET | 5
Qlik Sense Refresh | Every Monday | 7:30 CET | 1

Error Handling

To maintain the reliability of the data quality pipeline, a structured error handling procedure is in place for each scheduled process. 

In the event of a failure, it's crucial not only to resolve and rerun the failed step, but also to re-execute all subsequent steps in the pipeline — as they may have run on incomplete or outdated data.

For a full overview of how the data and processes flow together, please refer to the Architecture & Data Flow Diagram.
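To make the rerun rule concrete, here is an illustrative sketch (process names follow the Scheduling section) that, given a failed step, returns everything that must be re-executed:

```python
# The weekly steps in pipeline order, as listed in the Scheduling section.
PIPELINE_STEPS = [
    "Talend Ingestion Jobs",
    "Dataplex Data Quality Scans",
    "Routines Execution - Data Model Dimension Tables",
    "Routines Execution - Data Model Fact Tables",
    "Talend Report Job",
    "Qlik Sense Refresh",
]

def steps_to_rerun(failed_step: str) -> list[str]:
    """Return the failed step plus every downstream step, in execution order."""
    return PIPELINE_STEPS[PIPELINE_STEPS.index(failed_step):]

# Example: a failed Dataplex scan forces a rerun of everything after ingestion.
print(steps_to_rerun("Dataplex Data Quality Scans"))
```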

1. Talend Ingestion Jobs

What to check:

Next steps:

2. Data Quality Scans

What to check:

Next steps:

3. Routines Execution - Data Model Dimension Tables

What to check:

Next steps:

4. Routines Execution - Data Model Fact Tables

What to check:

Next steps:

5. Talend Report Job

What to check:

Next steps:

6. Qlik Sense Refresh

What to check:

Next steps:

Monitoring

To ensure the smooth operation of the data pipeline, monitoring is implemented using Google Cloud Platform monitoring tools.

These tools help track system performance, identify issues, and ensure data integrity throughout the process. The key monitoring tools used are:

Environments and Deployment

The data processing pipeline is deployed across four different Google Cloud Platform environments to ensure a structured and controlled rollout:

The deployment process involves migrating updates between these environments and is managed by the DataOps Team in collaboration with Data Engineers.

As of now, four deployments have been completed. Detailed documentation related to these deployments is available in the following Google Drive folder:

Known Bugs

Currently, no bugs have been identified in the system.

Roadmap

FSD

TSD