Author | Version | Comments | Date |
0.1 | Initial Version |
| |
0.2 | Revisions to the original text |
|
This document describes how to define data quality metrics on the data loaded into GCP.
This document is intended for the Data Architectures, Data Engineering and operational team.
It will be used as reference for any project or domain for the developments of the models.
Data quality is essential for any business to make informed decisions. Measuring the right Data Quality metrics ensures that we have most trustable data. Different types of metrics can be used to track and report the quality of data available in GCP. Each type of KPI has it's own importance and role.
Data quality is reliant upon the following dimensions:
Completeness: Incomplete data is as dangerous as inaccurate/wrong data. Data requirements should be clearly specified based on the information needs of the organization and data collection processes matched to these requirements.
Consistency: Data consistency ensures the data is the same across the organization no matter where it appears.
Conformity: Data conformity measures the data is following the set of standard data definitions like data type, size and format.
Accuracy: As the name implies, it refers to the exactness of the data. Data cannot have any erroneous elements and must convey the correct message without being misleading.
Integrity: Integrity ensures that all the data in a database can be traced and connected to other data
Timeliness: Data should be captured as quickly as possible after the event or activity and must be available quickly and frequently enough to support management decisions.
A data quality framework is a systematic process that consistently monitors data quality, implements a variety of data quality processes and triggers the alarms as soon as the quality of a particular KPI is observed below threshold. 
To implement a data quality monitoring framework, a data quality data mart is needed.
Described below the data mart can be used to handle the data quality rules and store the result sets.

# | Field Name | Description | Type | Example |
001 | DQ_CHECK_ID | Unique Identifier | Integer | E.g. 1001 |
002 | DQ_CHECK_NAME | Name of the DQ check | String | E.g. NOT_NULL_CHECK, CUSTOM_CHECK |
003 | DQ_CHECK_DESC | Details about the DQ check | String | E.g. This check will be to validate the number of nulls values present in the given attribute |
004 | CHECK_STAGE | Details weather the check can be applied at FIELD or TABLE level | String | E.g. FIELD, TABLE |
005 | INSERT_DATETIME | Date & Time the record was first inserted into the table | Timestamp | E.g. 2022-12-12 17:00:21 UTC |
006 | UPDATE_DATETIME | Date & Time the record was last updated | Timestamp | E.g. 2022-12-12 17:00:21 UTC |
# | Field Name | Description | Type | Example |
001 | RULE_ID | Unique Identifier | Integer | E.g. 1 |
002 | RULE_NAME | Name of the Rule | String | E.g. COMP_CD_NOT_NULL_CHECK |
003 | RULE_DESCR | Details about the rule | String | E.g. |
004 | DOMAIN_NAME | Name of the domain, the rule is related to . | String | E.g. Finance, HR, Industrial and R&I |
005 | DIMENSION_NAME | Dimension (Type) of DQ Check | String | E.g. Completeness, Conformity, Accuracy, Consistency, Timeliness, Integrity |
006 | PROJECT_NAME | Name of the GCP project | String | E.g. pcm_dev |
007 | TABLE_NAME | Name of the Table that holds the data on which Data Quality Rules need to execute | String | E.g. COMPANY_CODE |
008 | COLUMN_NAME | Column Name on which rule is executed | String | E.g. C_COMPCDE |
009 | DQ_CHECK_ID | Surrogate key to DQ_CONFIG.DQ_CHECK_ID attribute | Integer | E.g. 1001 |
010 | FREEQUENCY | Indicates how frequently the check is scheduled to execute | String | E.g. Daily, Weekly |
011 | FILTER_CONDITION | Provide the filter condition if the checks are related to rang check. Should be populated as 'NA' for regular checks like Not Null, custom checks. | String | E.g. where comp_cd in ('A','B') |
012 | SQL_STATEMENT | SQL statement can be used for custom DQ checks. | String | |
013 | SEVERITY | Severity level of the DQ check | String | E.g. High |
014 | RULE_EXECUTE_STATUS_CODE | Indicates the status of the code. 'Active' means the check will be scheduled to run. Inactive means the schedule will be skipped to execute. | String | E.g. Active, Inactive |
015 | LOWER_THRESHOLD_VALUE | Lower threshold value can be used to check the result of the DQ check | Integer | E.g. 96 |
016 | UPPER_THRESHOLD_VALUE | Upper threshold value can be used to check the result of the DQ check | Integer | E.g. 100 |
017 | INSERT_DATETIME | Date & Time the record was first inserted into the table | timestamp | E.g. 2022-12-12 17:00:21 UTC |
018 | UPDATE_DATETIME | Date & Time the record was last updated | timestamp | E.g. 2022-12-12 17:00:21 UTC |
# | Field Name | Description | Type | Example |
001 | DQ_RULE_RESULT_ID | Id of the run, this information come from Talend job run | Integer | E.g. 1 |
002 | RULE_ID | Reference to DQ_RULE_CONFOG.RULE_ID | Integer | E.g. 1 |
003 | DOMAIN_NAME | Name of the domain, the rule is related to . | String | E.g. Finance, HR, Industrial and R&I |
004 | DIMENSION_NAME | Dimension (Type) of DQ Check | String | E.g. Completeness, Conformity, Accuracy, Consistency, Timeliness, Integrity |
005 | TABLE_NAME | Name of the Table that holds the data on which Data Quality Rules need to execute | String | E.g. COMPANY_CODE |
006 | COLUMN_NAME | Column Name on which rule is executed | String | E.g. C_COMPCDE |
007 | STATUS | Status of the execution | String | E.g. Green, Amber, Red |
008 | RECORD_COUNT | Number of the records present in the table | Integer | E.g. 1543 |
009 | SUCCESS_PERCENTAGE | Count of the records that returned by the DQ check / RECORD_COUNT | Integer | E.g. 97 |
010 | FAILURE_PERCENTAGE | Count of the records that's not satisfied by the DQ check / RECORD_COUNT | Integer | E.g. 99 |
011 | INSERT_DATETIME | Date & Time the record was first inserted into the table | Timestamp | E.g. 2022-12-12 17:00:21 UTC |
012 | UPDATE_DATETIME | Date & Time the record was last updated | Timestamp | E.g. 2022-12-12 17:00:21 UTC |
Not Null validation rule:
Verify the company code C_COMPCDE attribute is populated without null values.
As the rules is at a attribute level it will have below entry in DQ_CONFIG table
DQ_CHECK_ID | DQ_CHECK_NAME | DQ_CHECK_DESC | CHECK_STAGE |
1001 | NOT_NULL_CHECK | This check will validate the number | FIELD |
Changes to DQ_RULE_CONFIG table:
The rule is "Active" and scheduled to run on a daily basis as the RULE_EXECUTE_STATUS_CODE field is set as "Active". The Threshold pass percentage is set to 100 (Both Lower and Upper threshold pass percentage is set to 100), meaning that even one record does not meet the criteria, Even one record is populated as NULL, the rule will fail and the STATUS in DQ_RULE_AUDIT_RESULTS table will be set as RED.
As the CHECK is related to NOT NULL, Both FILTER_CONDITION and SQL_STATEMENT field values can be set as NA
DIMENSION_NAME | PROJECT_NAME | TABLE_NAME | COLUMN_NAME | DQ_CHECK_ID | FREQUENCY | SEVERITY | RULE_EXECUTE_STATUS_CODE | LOWER_THRESHOLD_VALUE | UPPER_THRESHOLD_VALUE |
predict-credit-mgt-v2-dev | COMPANY_CODE | C_COMPCDE | 1001 | Daily | High | Active | 100 | 100 |
Range check validation rule:
Verify the company code C_AUTHMA attribute has got values ECo or SCo only.
As the rules is at a attribute level it will have below entry in DQ_CONFIG table
DQ_CHECK_ID | DQ_CHECK_NAME | DQ_CHECK_DESC | CHECK_STAGE |
1002 | RANGE_CHECK | This check will be used to validate the field is with in the range or list of values | FIELD |
Changes to DQ_RULE_CONFIG table:
The rule is "Active" and scheduled to run on a daily basis as the RULE_EXECUTE_STATUS_CODE field is set as "Active". The Threshold pass percentage is set to 100% (GREEN) if the success percentage is 100% and the status will be set to AMBER if the success percentage is >=98% and <100% and RED if it is <98%.
As the CHECK is related to RANGE CHECK, SQL_STATEMENT field values can be set as NA but FILTER_CONDITION field will be populated as "C_AUTHMA in ('ECo','SC0')"
DIMENSION_NAME | PROJECT_NAME | TABLE_NAME | COLUMN_NAME | DQ_CHECK_ID | FREQUENCY | SEVERITY | RULE_EXECUTE_STATUS_CODE | LOWER_THRESHOLD_VALUE | UPPER_THRESHOLD_VALUE |
predict-credit-mgt-v2-dev | COMPANY_CODE | C_AUTHMA | 1002 | Daily | High | Active | 98 | 100 |
Custom check validation rule:
Verify the referential integrity between credit_mgmt code.C_COMPCDE attribute and company_code.C_COMPCDE.
As the rules is at a attribute level it will have below entry in DQ_CONFIG table
DQ_CHECK_ID | DQ_CHECK_NAME | DQ_CHECK_DESC | CHECK_STAGE |
1003 | CUSTOM_CHECK | This check will be used to carry custom checks at field level | FIELD |
Changes to DQ_RULE_CONFIG table:
The rule is "Active" and scheduled to run on a weekly basis as the RULE_EXECUTE_STATUS_CODE field is set as "Active". The Threshold pass percentage is set to 100% (GREEN) if the success percentage is equal or above 95% and the status will be set to AMBER if the success percentage is >=92% and <95% and RED if it is <92%.
As the CHECK is related to CUSTOM CHECK, SQL_STATEMENT field will be populated with the query to be executed.
DIMENSION_NAME | PROJECT_NAME | TABLE_NAME | COLUMN_NAME | DQ_CHECK_ID | FREQUENCY | SEVERITY | RULE_EXECUTE_STATUS_CODE | LOWER_THRESHOLD_VALUE | UPPER_THRESHOLD_VALUE |
predict-credit-mgt-v2-dev | COMPANY_CODE | C_AUTHMA | 1003 | Weekly | High | Active | 92 | 95 |
Reference: