| Table of Contents |
|---|
Author | Version | Comments | Date |
0.1 | Initial Version |
|
...
0.2 | Revisions to the original text |
|
...
1- Introduction
| Anchor | ||||
|---|---|---|---|---|
|
This document describes how to define data quality metrics on the data loaded into GCP.
| Anchor | ||||
|---|---|---|---|---|
|
This document is intended for the Data Architectures, Data Engineering and operational team.
It will be used as reference for any project or domain for the developments of the models.
| Anchor | ||||
|---|---|---|---|---|
|
| Anchor | ||||
|---|---|---|---|---|
|
Data quality is essential for any business to make informed decisions. Measuring the right Data Quality metrics ensures that we have most trustable data. Different types of metrics can be used to track and report the quality of data available in GCP. Each type of KPI has it's own importance and role.
| Anchor | ||||
|---|---|---|---|---|
|
Data quality is reliant upon the following dimensions:
Completeness: Incomplete data is as dangerous as inaccurate/wrong data. Data requirements should be clearly specified based on the information needs of the organization and data collection processes matched to these requirements.
...
- Degree to which specified data values are up to date between data change and processing
| Anchor | ||||
|---|---|---|---|---|
|
A data quality framework is a systematic process that consistently monitors data quality, implements a variety of data quality processes and triggers the alarms as soon as the quality of a particular KPI is observed below threshold.
| Anchor | ||||
|---|---|---|---|---|
|
| Anchor | ||||
|---|---|---|---|---|
|
To implement a data quality monitoring framework, a data quality data mart is needed.
Described below the data mart can be used to handle the data quality rules and store the result sets.
| Anchor | ||||
|---|---|---|---|---|
|
- DQ_RULE_CONFIG: A table to store all predefined data quality rules.
- DQ_CONFIG: A table to store the possible DQ_CHECK. Checks which can be executed at table and attribute level
- DQ_RULE_AUDIT_RESULTS: A table to store the execution results of data quality rules which can help to monitor the performance of the KPI's
| Anchor | ||||
|---|---|---|---|---|
|
| Anchor | ||||
|---|---|---|---|---|
|
# | Field Name | Description | Type | Example |
001 | DQ_CHECK_ID | Unique Identifier | Integer | E.g. 1001 |
002 | DQ_CHECK_NAME | Name of the DQ check | String | E.g. NOT_NULL_CHECK, CUSTOM_CHECK |
003 | DQ_CHECK_DESC | Details about the DQ check | String | E.g. This check will be to validate the number of nulls values present in the given attribute |
004 | CHECK_STAGE | Details weather the check can be applied at FIELD or TABLE level | String | E.g. FIELD, TABLE |
005 | INSERT_DATETIME | Date & Time the record was first inserted into the table | Timestamp | E.g. 2022-12-12 17:00:21 UTC |
006 | UPDATE_DATETIME | Date & Time the record was last updated | Timestamp | E.g. 2022-12-12 17:00:21 UTC |
| Anchor | ||||
|---|---|---|---|---|
|
# | Field Name | Description | Type | Example |
001 | RULE_ID | Unique Identifier | Integer | E.g. 1 |
002 | RULE_NAME | Name of the Rule | String | E.g. COMP_CD_NOT_NULL_CHECK |
003 | RULE_DESCR | Details about the rule | String | E.g. |
004 | DOMAIN_NAME | Name of the domain, the rule is related to . | String | E.g. Finance, HR, Industrial and R&I |
005 | DIMENSION_NAME | Dimension (Type) of DQ Check | String | E.g. Completeness, Conformity, Accuracy, Consistency, Timeliness, Integrity |
006 | PROJECT_NAME | Name of the GCP project | String | E.g. pcm_dev |
007 | TABLE_NAME | Name of the Table that holds the data on which Data Quality Rules need to execute | String | E.g. COMPANY_CODE |
008 | COLUMN_NAME | Column Name on which rule is executed | String | E.g. C_COMPCDE |
009 | DQ_CHECK_ID | Surrogate key to DQ_CONFIG.DQ_CHECK_ID attribute | Integer | E.g. 1001 |
010 | FREEQUENCY | Indicates how frequently the check is scheduled to execute | String | E.g. Daily, Weekly |
011 | FILTER_CONDITION | Provide the filter condition if the checks are related to rang check. Should be populated as 'NA' for regular checks like Not Null, custom checks. | String | E.g. where comp_cd in ('A','B') |
012 | SQL_STATEMENT | SQL statement can be used for custom DQ checks. | String | |
013 | SEVERITY | Severity level of the DQ check | String | E.g. High |
014 | RULE_EXECUTE_STATUS_CODE | Indicates the status of the code. 'Active' means the check will be scheduled to run. Inactive means the schedule will be skipped to execute. | String | E.g. Active, Inactive |
015 | LOWER_THRESHOLD_VALUE | Lower threshold value can be used to check the result of the DQ check | Integer | E.g. 96 |
016 | UPPER_THRESHOLD_VALUE | Upper threshold value can be used to check the result of the DQ check | Integer | E.g. 100 |
017 | INSERT_DATETIME | Date & Time the record was first inserted into the table | timestamp | E.g. 2022-12-12 17:00:21 UTC |
018 | UPDATE_DATETIME | Date & Time the record was last updated | timestamp | E.g. 2022-12-12 17:00:21 UTC |
| Anchor | ||||
|---|---|---|---|---|
|
# | Field Name | Description | Type | Example |
001 | DQ_RULE_RESULT_ID | Id of the run, this information come from Talend job run | Integer | E.g. 1 |
002 | RULE_ID | Reference to DQ_RULE_CONFOG.RULE_ID | Integer | E.g. 1 |
003 | DOMAIN_NAME | Name of the domain, the rule is related to . | String | E.g. Finance, HR, Industrial and R&I |
004 | DIMENSION_NAME | Dimension (Type) of DQ Check | String | E.g. Completeness, Conformity, Accuracy, Consistency, Timeliness, Integrity |
005 | TABLE_NAME | Name of the Table that holds the data on which Data Quality Rules need to execute | String | E.g. COMPANY_CODE |
006 | COLUMN_NAME | Column Name on which rule is executed | String | E.g. C_COMPCDE |
007 | STATUS | Status of the execution | String | E.g. Green, Amber, Red |
008 | RECORD_COUNT | Number of the records present in the table | Integer | E.g. 1543 |
009 | SUCCESS_PERCENTAGE | Count of the records that returned by the DQ check / RECORD_COUNT | Integer | E.g. 97 |
010 | FAILURE_PERCENTAGE | Count of the records that's not satisfied by the DQ check / RECORD_COUNT | Integer | E.g. 99 |
011 | INSERT_DATETIME | Date & Time the record was first inserted into the table | Timestamp | E.g. 2022-12-12 17:00:21 UTC |
012 | UPDATE_DATETIME | Date & Time the record was last updated | Timestamp | E.g. 2022-12-12 17:00:21 UTC |
| Anchor | ||||
|---|---|---|---|---|
|
Not Null validation rule:
Verify the company code C_COMPCDE attribute is populated without null values.
As the rules is at a attribute level it will have below entry in DQ_CONFIG table
...
DIMENSION_NAME | PROJECT_NAME | TABLE_NAME | COLUMN_NAME | DQ_CHECK_ID | FREQUENCY | SEVERITY | RULE_EXECUTE_STATUS_CODE | LOWER_THRESHOLD_VALUE | UPPER_THRESHOLD_VALUE |
predict-credit-mgt-v2-dev | COMPANY_CODE | C_COMPCDE | 1001 | Daily | High | Active | 100 | 100 |
| Anchor | ||||
|---|---|---|---|---|
|
Range check validation rule:
Verify the company code C_AUTHMA attribute has got values ECo or SCo only.
As the rules is at a attribute level it will have below entry in DQ_CONFIG table
...
DIMENSION_NAME | PROJECT_NAME | TABLE_NAME | COLUMN_NAME | DQ_CHECK_ID | FREQUENCY | SEVERITY | RULE_EXECUTE_STATUS_CODE | LOWER_THRESHOLD_VALUE | UPPER_THRESHOLD_VALUE |
predict-credit-mgt-v2-dev | COMPANY_CODE | C_AUTHMA | 1002 | Daily | High | Active | 98 | 100 |
| Anchor | ||||
|---|---|---|---|---|
|
Custom check validation rule:
Verify the referential integrity between credit_mgmt code.C_COMPCDE attribute and company_code.C_COMPCDE.
As the rules is at a attribute level it will have below entry in DQ_CONFIG table
...
DIMENSION_NAME | PROJECT_NAME | TABLE_NAME | COLUMN_NAME | DQ_CHECK_ID | FREQUENCY | SEVERITY | RULE_EXECUTE_STATUS_CODE | LOWER_THRESHOLD_VALUE | UPPER_THRESHOLD_VALUE |
predict-credit-mgt-v2-dev | COMPANY_CODE | C_AUTHMA | 1003 | Weekly | High | Active | 92 | 95 |
| Anchor | ||||
|---|---|---|---|---|
|
Reference:


