Architecture

Below is the high level architecture for the Data Quality KPIs monitoring tool.

Data, Analytics & AI Platform > Data Quality Monitoring Tool - Technical Documentation > architecture.png

1. Data Ingestion

A Master Talend job orchestrates the entire data processing pipeline. The process begins with the ingestion of data for each domain from various SAP source systems into Google Cloud Platform (GCP).

SAP Source Systems & Data Domains:

HR: Data is sourced from SAP SuccessFactors.
SSR: Data is sourced from SAP WP1 and SAP PF1.
FIN: Data is sourced from SAP BW, WP1, and PF1.
MRK: Data is sourced from SAP BW, WP1, and PF1.

Each table from each domain has its own dedicated Talend job responsible for ingesting and loading the data into the GCP BigQuery Service - ODS Dataset.

2. Data Processing and Transformation

After ingestion, several routines are executed to populate the Data Model (DM) tables:

Routine: DM.insert_DIM_Domain populates DM.DIM_domain table.
Routine: DM.insert_DIM_kpi_dimension populates DM.DIM_kpi_dimension table.

3. Data Quality Execution in Dataplex

The data quality rules are executed using GCP Dataplex Service.
The validation results are stored in a BigQuery Table: DM.Dataplex_quality.

4. Data Model Population

A routine is executed to populate the Data Model (DM) tables:

Routine: DM.RT_DPtoDMmapping_Datespecific populates the following tables:
- DM.DIM_DATE
- DM.DIM_quality_rule
- DM.FACT_data_quality
- DM.FACT_failed_records

5. Failed Records Handling & Export

A final Talend job - PL_DQ_BQ_to_Gshet_Selfservice - handles failed records:

Generates a CSV file with failed records.
Uploads the CSV file to a Google Drive folder.
Updates DM.FACT_failed_records with the URL to the CSV file.

6. Visualization in Qlik Sense

The processed and validated data is available for visualization and analysis in Qlik Sense.

Data Flow

Main Data Source	SAP (SuccessFactors, BW, PF1, WP1)
ETL and Preparation	Talend Jobs are used to ingest and load data from SAP source systems to the BigQuery Data Ocean, specifically the following datasets: prj-data-dm-hr-[env].ODS prj-data-dm-structure-[env].ODS prj-data-dm-finance-[env].ODS prj-data-dm-marketing-[env].ODS prj-data-dq-selfservice-[env].ODS _{[env] is one of the following: dev, test, ppd, prod}
Data Source for DQMT Project	After tables are loaded, views - including only the necessary data for the rules - are created in the following datasets: prj-data-dm-hr-[env].DS_prj_dqkpi prj-data-dm-structure-[env].DS_prj_sls_dataquality_kpi prj-data-dm-finance-[env].DS_prj_sls_dataquality_kpi prj-data-dm-marketing-[env].DS_prj_sls_dataquality_kpi prj-data-dq-selfservice-[env].DS_prj_sls_dataquality_kpi _{[env] is one of the following: dev, test, ppd, prod} This views are the sole source for the for the quality checks performed by Dataplex.
Data Residing in BigQuery	Views are created in the Dataset DataOcean_dataquality_kpi.prj-data-dq-selfservice-[env] Data residing on BigQuery brought in prj-data-dq-selfservice-dev project using the necessary views to be Scanned and checked by DataPlex, the generated metrics naturally reside in Big Query (Dataplex_quality)
Data Quality Metrics Ingestion	Populating the BigQuery fact table with DQ metrics obtained from the previously mentioned sources By Mapping the resulting table to the developed DM (Using a stored procedure)
Big Query Views	Creating the views necessary to answer Business requirements, providing them with degree of flexibility in a way that they can have more control over the data the need to query according to business changing requirements, to approach selfservice.
Visualization	Uses QlikSense that connects to bigQuery views to ingest data, Use Qlik Sense to Visualize, present, add alerting capabilities for different business domains Different KPIs and Different Business Rules. and Failed records
Google Drive	Used to store failed records Data, then URL for the sheet is inserted in FACT_failed_records associated with quality_rule_key.

Data Model

Data, Analytics & AI Platform > Data Quality Monitoring Tool - Technical Documentation > image-2024-12-30_12-55-37.png

Data Mapping:

The out put of Dataplex quality checks is Dataplex_quality table, It is mapped to the Data Model "in the previous section", on Big query to later be used as the sole data source for QlikSense visualization.

Data Mapping is detailed in this document.

Procedures

Dataplex to Data Model Mapping Stored Procedure:
Procedure Name:

Scheduling:

Scheduled Profile Scans: Runs on Dataplex and scheduled while created.
Scheduled Data Quality Rules Scan: Runs on Dataplex and scheduled while created.
Scheduled Stored Procedure Run: Routine is triggered using scheduled query on weekly Basis.
Scheduled QlikSense Refresh: set by the visualization engineer on qliksense.

Time of Runs and Duration Window:

Dataplex	4:00 - 5:00 CET
BigQuery Routine	5:00 - 6:00 CET
Talend	6:00 - 9:00 CET
QlikSense	9:00 CET

Monitoring

GCP Monitoring tools:

Dataplex Logs
Big Query Logs
Cloud Monitoring Dashboard

Error Handling

Failure alert are set in rule creation to alert stakeholders/users when a rule fails.
Stored procedure scheduling failure alert is sent in case the scheduled Routine, doesn't run as intended.

Known Bugs

No Identified Bugs.

Roadmap

FSD

TSD