Architecture

Below is the high level architecture for the Data Quality KPIs monitoring tool.


A master Talend job orchestrates the entire data processing pipeline:

1. Data Ingestion

The process begins with the ingestion of data for each domain from various SAP source systems:

  • HR data is sourced from SAP SuccessFactors
  • SSR data is sourced from SAP WP1 and SAP PF1

  • FIN data is sourced from SAP BW, WP1, and SAP PF1

  • MRK data is sourced from SAP BW, WP1, and SAP PF1

Each table from each domain has its own dedicated Talend job responsible for ingesting and loading the data into the GCP BigQuery Data Ocean, specifically the following datasets:

  • prj-data-dm-hr-[env].ODS
  • prj-data-dm-structure-[env].ODS
  • prj-data-dm-finance-[env].ODS
  • prj-data-dm-marketing-[env].ODS
  • prj-data-dq-selfservice-[env].ODS
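As an illustration of how the [env] placeholder in these names resolves, here is a minimal Python sketch. It assumes the placeholder takes one of the values dev, test, ppd, or prod, as stated further down this page. The function name `resolve_datasets` and the constants are hypothetical and are not part of the Talend jobs.

```python
# Illustrative sketch only: names below are hypothetical helpers,
# not part of the actual Talend jobs or BigQuery routines.
VALID_ENVS = ("dev", "test", "ppd", "prod")

# ODS dataset name templates, copied from the list above.
ODS_DATASET_TEMPLATES = (
    "prj-data-dm-hr-[env].ODS",
    "prj-data-dm-structure-[env].ODS",
    "prj-data-dm-finance-[env].ODS",
    "prj-data-dm-marketing-[env].ODS",
    "prj-data-dq-selfservice-[env].ODS",
)


def resolve_datasets(env, templates=ODS_DATASET_TEMPLATES):
    """Replace the [env] placeholder in every template with a concrete environment."""
    if env not in VALID_ENVS:
        raise ValueError(f"unknown environment {env!r}; expected one of {VALID_ENVS}")
    return [template.replace("[env]", env) for template in templates]


if __name__ == "__main__":
    for name in resolve_datasets("dev"):
        print(name)
```

Rejecting unknown environment values early avoids accidentally pointing a job at a project that does not exist.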

2. Data Processing and Transformation

After ingestion, two routines are executed to populate the Data Model (DM) Dimension tables:

  • Routine prj-data-dq-selfservice-[env].DM.insert_DIM_Domain populates prj-data-dq-selfservice-[env].DM.DIM_domain table

  • Routine prj-data-dq-selfservice-[env].DM.insert_DIM_kpi_dimension populates prj-data-dq-selfservice-[env].DM.DIM_kpi_dimension table

Also, views including only the necessary data are created in the following datasets:

  • prj-data-dm-hr-[env].DS_prj_dqkpi
  • prj-data-dm-structure-[env].DS_prj_sls_dataquality_kpi
  • prj-data-dm-finance-[env].DS_prj_sls_dataquality_kpi
  • prj-data-dm-marketing-[env].DS_prj_sls_dataquality_kpi
  • prj-data-dq-selfservice-[env].DS_prj_sls_dataquality_kpi

These views are the sole source for the quality checks performed by Dataplex.

3. Data Quality Execution in Dataplex

Once the views are created, the data quality rules are executed using GCP Dataplex Service and the validation results are stored in the following BigQuery table:

  • prj-data-dq-selfservice-[env].DM.Dataplex_quality

4. Data Model Population

A routine is executed to populate the DM Fact tables:

  • Routine prj-data-dq-selfservice-[env].DM.RT_DPtoDMmapping_Datespecific populates the following tables: 

    • prj-data-dq-selfservice-[env].DM.DIM_DATE

    • prj-data-dq-selfservice-[env].DM.DIM_quality_rule

    • prj-data-dq-selfservice-[env].DM.FACT_data_quality

    • prj-data-dq-selfservice-[env].DM.FACT_failed_records

5. Failed Records Handling & Export

A final Talend job, PL_DQ_BQ_to_Gshet_Selfservice, handles failed records:

    • Generates a CSV file with failed records.
    • Uploads the CSV file to a Google Drive folder.
    • Updates prj-data-dq-selfservice-[env].DM.FACT_failed_records with the URL to the CSV file, associated with the quality_rule_key.

Throughout this page, [env] is one of the following: dev, test, ppd, prod
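The three steps of the failed-records job can be sketched in Python as follows. This is not the Talend implementation, only an illustration of the data flow. It assumes one CSV file per quality_rule_key, which the page does not confirm. The `upload` callable stands in for the Google Drive upload, and the `record_id` column and file naming are hypothetical.

```python
import csv
import io
from collections import defaultdict


def export_failed_records(records, upload):
    """Render failed records to CSV and return {quality_rule_key: csv_url}.

    records: list of dicts that all share the same keys, including
             'quality_rule_key'.
    upload:  callable(filename, csv_text) -> url. A stand-in for the real
             Google Drive upload (hypothetical interface).
    """
    # Assumption: one CSV file is produced per quality rule.
    grouped = defaultdict(list)
    for record in records:
        grouped[record["quality_rule_key"]].append(record)

    urls = {}
    for rule_key, rows in grouped.items():
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
        # The returned URL is what would be written back to FACT_failed_records.
        urls[rule_key] = upload(f"failed_records_{rule_key}.csv", buffer.getvalue())
    return urls


if __name__ == "__main__":
    def fake_upload(filename, csv_text):
        # Placeholder: a real job would upload the file and return its Drive URL.
        return f"https://drive.example.invalid/{filename}"

    sample = [
        {"quality_rule_key": 1, "record_id": "A"},
        {"quality_rule_key": 2, "record_id": "B"},
    ]
    print(export_failed_records(sample, fake_upload))
```

Passing the upload step in as a callable keeps the CSV logic testable without access to Google Drive.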

6. Visualization in Qlik Sense

The processed and validated data is available for visualization and analysis in Qlik Sense. 

Data Model

Scheduling

The following processes are scheduled on a weekly basis.

1. Talend Ingestion Jobs

The Ingestion Jobs are scheduled to run within Talend every week, at the beginning of the process.

2. Data Quality Scans

The scans are initially set to "On Demand" for testing purposes and are then set to "Scheduled" so that they run every week within Dataplex.

3. Routines Execution

The following three routines are triggered by weekly scheduled queries within BigQuery:

  • prj-data-dq-selfservice-[env].DM.insert_DIM_Domain
  • prj-data-dq-selfservice-[env].DM.insert_DIM_kpi_dimension
  • prj-data-dq-selfservice-[env].DM.RT_DPtoDMmapping_Datespecific
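For illustration, the statements such scheduled queries might contain can be generated with a small Python sketch. It assumes the routines are stored procedures that take no arguments, which this page does not state. The helper name `build_call_statements` is hypothetical. The order follows the listing above.

```python
# Illustrative sketch only: assumes each routine is a BigQuery stored
# procedure with no parameters (not confirmed by this page).
VALID_ENVS = ("dev", "test", "ppd", "prod")

ROUTINES_IN_ORDER = (
    "insert_DIM_Domain",
    "insert_DIM_kpi_dimension",
    "RT_DPtoDMmapping_Datespecific",
)


def build_call_statements(env):
    """Return one CALL statement per routine for the given environment."""
    if env not in VALID_ENVS:
        raise ValueError(f"unknown environment {env!r}")
    project = f"prj-data-dq-selfservice-{env}"
    # Backticks are needed because the project ID contains hyphens.
    return [f"CALL `{project}.DM.{name}`();" for name in ROUTINES_IN_ORDER]


if __name__ == "__main__":
    for statement in build_call_statements("dev"):
        print(statement)
```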

4. Talend Report Job

The Talend Job PL_DQ_BQ_to_Gshet_Selfservice is scheduled to run within Talend every week, at the end of the process.

5. QlikSense Refresh

The QlikSense refresh schedule is set by the Visualization Engineer within QlikSense.

Process Scheduling Details

The table below summarizes each process with its frequency, duration window, and average duration.

| Process | Frequency | Duration Window | Average Duration (min) |
|---|---|---|---|
| Talend Ingestion Jobs | Every Sunday | 21:00 CET | |
| Dataplex Data Quality Scans | Every Monday | 4:00 - 5:00 CET | 1 |
| BigQuery Routine insert_DIM_Domain | Every Monday | 5:00 - 5:05 CET | 1 |
| BigQuery Routine insert_DIM_kpi_dimension | Every Monday | 5:05 - 5:10 CET | 1 |
| BigQuery Routine RT_DPtoDMmapping_Datespecific | Every Monday | 5:10 - 5:15 CET | 1 |
| Talend Report Job PL_DQ_BQ_to_Gshet_Selfservice | Every Monday | 6:00 - 7:00 CET | 5 |
| QlikSense | Every Monday | 8:00 CET | 1 |
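Because each step depends on the output of the previous one, the Monday windows in the table above must not overlap. The following Python sketch shows one way to check that. The times are copied from the table. The function name `find_overlaps` is hypothetical, and the Sunday ingestion is left out because only its start time is given.

```python
from datetime import time

# Monday windows (CET) copied from the table above, in pipeline order.
# QlikSense only has a start time, so its end is set equal to its start.
MONDAY_STEPS = [
    ("Dataplex Data Quality Scans", time(4, 0), time(5, 0)),
    ("insert_DIM_Domain", time(5, 0), time(5, 5)),
    ("insert_DIM_kpi_dimension", time(5, 5), time(5, 10)),
    ("RT_DPtoDMmapping_Datespecific", time(5, 10), time(5, 15)),
    ("PL_DQ_BQ_to_Gshet_Selfservice", time(6, 0), time(7, 0)),
    ("QlikSense", time(8, 0), time(8, 0)),
]


def find_overlaps(steps):
    """Return (previous, current) name pairs where a step starts before the previous one ends."""
    overlaps = []
    for (prev_name, _, prev_end), (name, start, _) in zip(steps, steps[1:]):
        if start < prev_end:
            overlaps.append((prev_name, name))
    return overlaps


if __name__ == "__main__":
    print(find_overlaps(MONDAY_STEPS))
```

A check like this only validates the planned windows. It does not detect a step that overruns its window at run time.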

Monitoring

GCP Monitoring tools:

  • Dataplex Logs
  • BigQuery Logs
  • Cloud Monitoring Dashboard 

Error Handling

  • Failure alerts are configured at rule creation to notify stakeholders and users when a rule fails.
  • A scheduling failure alert is sent if a scheduled routine (stored procedure) does not run as intended.

Known Bugs

No bugs identified.

Roadmap

FSD

TSD