Requirements
The following requirements define the conditions a data set must meet to be considered of acceptable quality:
- Data completeness, e.g. % of populated data fields in ingested data, or required non-null data fields
- Data formatting, e.g. required format for date and time values, or precision for numeric values
- Data structure, e.g. expected data schema vs ingested data schema
- Naming convention of data files, data fields etc.
- Expected volume of data, e.g. minimum number of data rows per file
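
As an illustration, the requirement types above could be captured declaratively and checked per file. The following is a minimal sketch, not a prescribed implementation: the `QualityRequirements` class, the `check_file` function, and all field names are hypothetical, and rows are assumed to be plain dictionaries.

```python
import re
from dataclasses import dataclass, field


@dataclass
class QualityRequirements:
    """Hypothetical container for the requirement types listed above."""
    required_fields: set            # completeness: fields that must be non-null
    expected_schema: dict           # structure: field name -> expected Python type
    field_formats: dict = field(default_factory=dict)  # formatting: field name -> regex
    file_name_pattern: str = r".*"  # naming convention for data files
    min_rows: int = 1               # expected volume


def check_file(file_name, rows, req):
    """Return a list of human-readable requirement violations for one file."""
    violations = []
    if not re.fullmatch(req.file_name_pattern, file_name):
        violations.append(f"file name {file_name!r} violates naming convention")
    if len(rows) < req.min_rows:
        violations.append(f"expected at least {req.min_rows} rows, got {len(rows)}")
    for i, row in enumerate(rows):
        # Structure: the ingested field names must match the expected schema.
        if set(row) != set(req.expected_schema):
            violations.append(f"row {i}: schema mismatch")
            continue
        for name, value in row.items():
            if value is None:
                # Completeness: only required fields may not be null.
                if name in req.required_fields:
                    violations.append(f"row {i}: required field {name!r} is null")
                continue
            if not isinstance(value, req.expected_schema[name]):
                violations.append(f"row {i}: field {name!r} has wrong type")
            elif name in req.field_formats and not re.fullmatch(
                req.field_formats[name], str(value)
            ):
                violations.append(f"row {i}: field {name!r} has wrong format")
    return violations
```

An empty list from `check_file` means the file met every configured requirement. Keeping the requirements in data rather than in code lets them differ per source without changing the checking logic.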
Metrics
The following (non-exhaustive) list of metrics should be captured:
- Data coverage, i.e. % of populated data fields in a single data row
- Data field rejections, i.e. number of data fields in a single data row that did not meet data quality requirements
- Data ingest rejections, i.e. number of data ingest attempts that did not meet data quality requirements
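
The three metrics above could be computed along these lines. This is a sketch under stated assumptions: a row is a dictionary, a field counts as populated when it is neither `None` nor an empty string, and `validators` is a hypothetical mapping from field name to a function that returns `True` when the value meets the data quality requirements.

```python
def row_coverage(row):
    """Data coverage: % of populated fields in a single row."""
    if not row:
        return 0.0
    # Assumption: None and "" count as unpopulated; 0 and False are populated.
    populated = sum(1 for value in row.values() if value not in (None, ""))
    return 100.0 * populated / len(row)


def field_rejections(row, validators):
    """Data field rejections: number of fields in a row failing validation."""
    return sum(
        1 for name, is_valid in validators.items() if not is_valid(row.get(name))
    )


def ingest_rejections(ingest_results):
    """Data ingest rejections: number of ingest attempts that were rejected.

    `ingest_results` is an iterable of booleans, True meaning accepted.
    """
    return sum(1 for accepted in ingest_results if not accepted)
```

For example, a row `{"id": 1, "name": "", "price": None, "qty": 0}` has two populated fields out of four, so `row_coverage` returns `50.0`.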
KPIs
The following (non-exhaustive) list of KPIs should be calculated and made available for inclusion and visualization in reports:
- Data coverage per file/batch, i.e. % of populated data fields in a single data file or batch
- Data coverage per source, i.e. % of populated data fields of all data ingested from a single source
- Data field rejections per row, i.e. % of data fields in a single data row that did not meet data quality requirements
- Data field rejections per file/batch, i.e. % of data fields in a single data file or batch that did not meet data quality requirements
- Data field rejections per source, i.e. % of data fields of all data ingested from a single source that did not meet data quality requirements
- Data ingest rejection rate, i.e. % of data ingest attempts that did not meet data quality requirements
- Data ingest rejection rate per source, i.e. % of data ingest attempts from a single source that did not meet data quality requirements
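
The per-file, per-batch and per-source KPIs above can be derived by summing row-level counts within each group and only then converting to a percentage. The sketch below assumes a hypothetical record layout in which each row has already been reduced to counts (`fields`, `populated`, `rejected`) and tagged with its `file` and `source`; none of these names come from an existing system.

```python
from collections import defaultdict


def percentage(part, whole):
    """Return part/whole as a percentage, or 0.0 for an empty group."""
    return 100.0 * part / whole if whole else 0.0


def kpis_by(records, key):
    """Aggregate row-level counts into coverage and rejection KPIs.

    `key` selects the grouping level, e.g. "file" or "source".
    """
    totals = defaultdict(lambda: {"fields": 0, "populated": 0, "rejected": 0})
    for rec in records:
        group = totals[rec[key]]
        group["fields"] += rec["fields"]
        group["populated"] += rec["populated"]
        group["rejected"] += rec["rejected"]
    return {
        name: {
            "coverage_pct": percentage(t["populated"], t["fields"]),
            "field_rejection_pct": percentage(t["rejected"], t["fields"]),
        }
        for name, t in totals.items()
    }


def ingest_rejection_rate(attempts, source=None):
    """% of rejected ingest attempts, overall or for a single source.

    `attempts` is a list of (source, accepted) tuples.
    """
    selected = [ok for src, ok in attempts if source is None or src == source]
    rejected = sum(1 for ok in selected if not ok)
    return percentage(rejected, len(selected))
```

Summing counts before dividing, rather than averaging per-row percentages, keeps the KPI correct when rows within a group have different numbers of fields.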