Data curation is the process of collecting, organizing, and preserving data for future use. It is essential for ensuring the quality and usability of data, and it is becoming increasingly important as the volume and complexity of data continue to grow.
In the context of a data ocean, data curation is even more critical. A data ocean is a vast repository of data that is collected from a variety of sources. This data can be structured, unstructured, or semi-structured, and it can be of varying quality.
The goal of data curation in a data ocean is to ensure that the data is accurate, complete, consistent, and fit for use.
Data curation in a data ocean can be challenging, but it is essential for making the data valuable and accessible to users.
Data Normalization involves transforming data into a common format to enable seamless integration and analysis.
Data normalization is the process of organizing data in a consistent manner. This involves standardizing the data format, removing duplicate data, and identifying and correcting errors.
When data from various sources is aggregated, there is often a mismatch in formats, units, or encoding. Normalization resolves these disparities, ensuring consistency and reducing redundancy; this makes data integration and analytics more efficient and ensures that a single version of the truth exists within the Data Ocean.
Data normalization can improve the efficiency of data processing and analysis, and it can also help to improve the quality of data.
Capitalization: Uniformly capitalize textual data.
Date Formatting: Standardize date formats to YYYY-MM-DD (UTC).
Currency Conversion: Convert all currency values to a standard unit.
Measurement Unit Standardization: Convert all measurements to standard units (e.g., kilometers for distance), as illustrated in the sketch below.
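To make these rules concrete, the following Python sketch shows one possible implementation of the four practices above; the function names, the assumed source date format, and the conversion rates are illustrative assumptions rather than the actual ETL mapping rules.

```python
from datetime import datetime, timezone

# Hypothetical conversion rates to the standard currency unit (USD); a real
# pipeline would read these from a reference table or service.
USD_RATES = {"EUR": 1.08, "GBP": 1.26, "USD": 1.0}
MILES_TO_KM = 1.609344

def normalize_text(value: str) -> str:
    """Uniformly capitalize and trim textual data."""
    return value.strip().upper()

def normalize_date(value: str, source_fmt: str = "%d/%m/%Y") -> str:
    """Standardize dates to YYYY-MM-DD (UTC); the source format is an assumption."""
    return datetime.strptime(value, source_fmt).replace(tzinfo=timezone.utc).strftime("%Y-%m-%d")

def normalize_currency(amount: float, currency: str) -> float:
    """Convert a monetary amount to the standard unit (USD)."""
    return round(amount * USD_RATES[currency], 2)

def normalize_distance_miles(miles: float) -> float:
    """Convert distances to the standard unit (kilometers)."""
    return round(miles * MILES_TO_KM, 3)

if __name__ == "__main__":
    print(normalize_text("  acme corp "))    # ACME CORP
    print(normalize_date("31/12/2024"))      # 2024-12-31
    print(normalize_currency(100.0, "EUR"))  # 108.0
    print(normalize_distance_miles(10.0))    # 16.093
```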
Some relevant metrics to implement in a monitoring Data Quality Dashboard:
Data Consistency Ratio: The level of uniformity in the dataset after normalization procedures have been applied.
Efficiency Gained Post-Normalization: The improvement in data processing and management tasks after normalization has been implemented.
Data Redundancy Factor: The measure of duplicate data before and after normalization.
Normalization Time: The time required to normalize a dataset.
Normality Score: A composite score representing how well the data conforms to normalization rules.
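As a hedged illustration, two of these metrics could be computed over a pandas DataFrame as follows; the column names and the sample extract are assumptions for demonstration.

```python
import pandas as pd

def data_redundancy_factor(df: pd.DataFrame, key_columns: list[str]) -> float:
    """Share of rows that duplicate an earlier row on the given key columns."""
    return df.duplicated(subset=key_columns).mean()

def data_consistency_ratio(df: pd.DataFrame, column: str, pattern: str) -> float:
    """Share of values in a column that already match the normalized pattern."""
    return df[column].astype(str).str.fullmatch(pattern).mean()

# Hypothetical customer extract
raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "signup_date": ["2024-01-05", "05/01/2024", "2024-02-10", "2024-03-01"],
})
print(data_redundancy_factor(raw, ["customer_id"]))                      # 0.25
print(data_consistency_ratio(raw, "signup_date", r"\d{4}-\d{2}-\d{2}"))  # 0.75
```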
Data normalization is currently carried out via the ETL (Extract, Transform, Load) tool, tailored individually to the requirements of each case.
The specific normalization procedures are outlined within the mapping rules established for every Business Entity pertinent to a particular Domain (see Data Mapping Rules in each Domain).
Standard data normalization practices currently in operation include the rules listed above (capitalization, date formatting, currency conversion, and measurement unit standardization).
A comprehensive list of over 20 potential rules has been identified, with their implementation definitions outlined and ready for use.
Data Validation (DV) is the process that ensures data complies with defined formats, rules, standards, and business-specific constraints: it checks data for errors and omissions and ensures that the data is accurate, complete, and consistent.
This process is more concerned with validating data against specific criteria, such as format checks, value constraints, and relationships.
Data Validation can be achieved following several approaches:
Data Profiling: Profiling the incoming data to understand its structure, patterns, and anomalies. This includes examining data types, values, and ranges (see the profiling sketch after this list).
Rule-Based Validation: Defining and implementing validation rules that data should adhere to. These rules can include format checks, value constraints, and referential integrity.
For example, ensuring that dates are in the correct format or that numeric values fall within specific ranges.
Statistical Analysis: Utilizing statistical methods to identify outliers and unusual data patterns. This can help in detecting potential issues.
Data Schema Validation: Ensuring that the incoming data aligns with the predefined schema and metadata. Any variances should be flagged.
Automated Testing: Implementing automated testing processes to continuously validate data as it enters the DW. Automated tests can run regularly to detect issues promptly.
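As a lightweight illustration of the data profiling approach referenced above, the sketch below summarizes an incoming pandas DataFrame; the columns of the sample batch are illustrative assumptions.

```python
import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize structure, null counts, distinct values, and numeric ranges of an extract."""
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "null_count": df.isna().sum(),
        "distinct_values": df.nunique(),
    })
    numeric = df.select_dtypes("number")
    summary["min"] = numeric.min()
    summary["max"] = numeric.max()
    return summary

# Hypothetical incoming batch
batch = pd.DataFrame({
    "order_id": [101, 102, 103, 103],
    "amount": [25.0, -3.0, None, 40.0],
    "country": ["PT", "ES", "PT", None],
})
print(profile(batch))
```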
Data validation is crucial for building trust and reliability in data: unverified or incorrect data can lead to erroneous conclusions and misleading insights, which in turn can have a significant adverse impact on business decisions.
It involves validating the data against quality standards and identifying any errors or inconsistencies; this can be done manually or automatically using a variety of tools and techniques.
Rules and metrics that can be used for data validation include the following checks, illustrated in the sketch after the list:
Type Checks: Validate the data type (text, integer, float, etc.).
Format Checks: Validate text patterns such as email addresses, phone numbers, and dates.
Range Checks: Verify that numerical data lies within defined ranges.
Completeness Checks: Ensure all mandatory fields are filled.
Domain Checks: Ensure data belongs to a defined set of permissible values.
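The sketch below shows one possible shape for these checks in plain Python; the field names, email pattern, and set of permissible countries are assumptions for illustration only.

```python
import re

PERMITTED_COUNTRIES = {"PT", "ES", "FR"}  # hypothetical domain of permissible values
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def validate_record(record: dict) -> list[str]:
    """Return a list of validation errors for a single record (empty if valid)."""
    errors = []
    # Type check: amount must be numeric
    if not isinstance(record.get("amount"), (int, float)):
        errors.append("amount: wrong type")
    # Format check: email must match the expected pattern
    if not EMAIL_RE.match(str(record.get("email", ""))):
        errors.append("email: invalid format")
    # Range check: amount must lie within a defined range
    if isinstance(record.get("amount"), (int, float)) and not (0 <= record["amount"] <= 100_000):
        errors.append("amount: out of range")
    # Completeness check: mandatory fields must be filled
    for field in ("customer_id", "email", "amount"):
        if record.get(field) in (None, ""):
            errors.append(f"{field}: missing")
    # Domain check: country must belong to the permissible set
    if record.get("country") not in PERMITTED_COUNTRIES:
        errors.append("country: not a permissible value")
    return errors

print(validate_record({"customer_id": 7, "email": "a@b.com", "amount": 12.5, "country": "PT"}))  # []
```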
Data Validation practices from a Data Management perspective:
Detection: DV focuses on detecting errors, anomalies, and inconsistencies in the data. Detected issues are typically related to non-compliance with specific data standards and rules.
Compliance: It ensures that data adheres to defined rules and constraints, such as referential integrity checks.
DV rules include checks like format validation (e.g., email format), uniqueness validation (e.g., unique IDs), and structure validation (e.g., address format).
Immediate Feedback: When DV rules detect violations, the primary action is to provide immediate feedback, such as raising alerts or notifications and possibly rejecting or flagging the non-compliant data (illustrated in the sketch after this list).
Data Cleansing: DV may involve basic data cleansing steps to make the data conform to standards.
DV rules are often applied during data ingestion and initial processing phases to prevent incorrect data from entering the system.
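As a sketch of how detection, immediate feedback, and rejection could fit together at ingestion time, the snippet below reuses the validate_record helper sketched earlier, routes failing records to a quarantine list, and raises an alert; the alert hook and record layout are hypothetical.

```python
import logging

logger = logging.getLogger("data_validation")

def alert(message: str) -> None:
    """Hypothetical notification hook; a real system might call e-mail, chat, or a ticketing API."""
    logger.warning(message)

def ingest(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split a batch into accepted records and quarantined (rejected) records."""
    accepted, quarantined = [], []
    for record in records:
        errors = validate_record(record)  # check functions from the earlier sketch
        if errors:
            quarantined.append({"record": record, "errors": errors})
            alert(f"Validation failed for record {record.get('customer_id')}: {errors}")
        else:
            accepted.append(record)
    return accepted, quarantined
```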
Some relevant metrics to implement in a monitoring Data Quality Dashboard:
Data Validation Success Rate or Validation Accuracy: The percentage of records that have been validated correctly (that meet all validation rules) out of the total records processed.
Data Rejection Rate: The percentage of records that were rejected during validation due to errors or not meeting predefined criteria.
Time Taken for Validation: The total duration required to complete the validation process for a batch of data or a single record.
Number of Manual Interventions Required: The count of instances where human input or correction was necessary during the data validation process.
Field-Level Compliance Rate: The proportion of individual data fields across all records that pass validation checks.
Failed Validation Alerts: The total number of automated notifications generated when data does not pass the validation process.
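Two of these metrics can be derived directly from validation counters, as in the following sketch; how the counters are collected is left open.

```python
def validation_success_rate(validated_ok: int, total_processed: int) -> float:
    """Percentage of records that meet all validation rules."""
    return 100.0 * validated_ok / total_processed if total_processed else 0.0

def rejection_rate(rejected: int, total_processed: int) -> float:
    """Percentage of records rejected during validation."""
    return 100.0 * rejected / total_processed if total_processed else 0.0

# Example: 9,600 of 10,000 records passed, 400 were rejected
print(validation_success_rate(9_600, 10_000))  # 96.0
print(rejection_rate(400, 10_000))             # 4.0
```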
Data validation is an important part of data curation, as it helps to ensure that the data is accurate and complete.
Data validation within the Data Ocean framework is an ongoing area of research. Presently, it primarily relies on the ETL (Extract, Transform, Load) tool for real-time execution.
In this approach, data validation checks are seamlessly integrated into the ETL pipeline. This ensures that data quality issues are promptly detected and addressed during data ingestion and transformation. Real-time data validation enables immediate feedback and corrective actions, mitigating the impact of poor-quality data on downstream processes.
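A minimal sketch of what embedding validation inside an ETL pipeline could look like, reusing the ingest helper above; the stage functions and the source and target names are illustrative assumptions, not the configuration of the actual ETL tool.

```python
def extract(source: str) -> list[dict]:
    """Placeholder extract step; a real pipeline would read from the source system."""
    return [{"customer_id": 1, "email": "a@b.com", "amount": 12.5, "country": "PT"}]

def transform(records: list[dict]) -> list[dict]:
    """Placeholder transform step applying normalization rules."""
    return records

def load(records: list[dict], target: str) -> None:
    """Placeholder load step writing curated records to the target."""
    print(f"Loaded {len(records)} records into {target}")

def run_pipeline(source: str, target: str) -> None:
    raw = extract(source)
    accepted, quarantined = ingest(raw)  # validation embedded between extract and transform
    load(transform(accepted), target)
    if quarantined:
        print(f"{len(quarantined)} records quarantined for review")

run_pipeline("crm_extract", "data_ocean.curated_customers")
```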
Another approach under consideration involves the use of Batch Data Validation (Scheduled Moment in Time), where data undergoes scheduled checks outside the ETL process, typically on a weekly or monthly basis, depending on the organization's needs.
Rather than presenting an alternative, this approach is viewed as an additional layer of validation, providing a double-check to ensure thorough data validation. This process is likely to be integrated into the Data Quality process.
It's worth noting that a comprehensive implementation proposal has already been developed, which includes design and architectural considerations.
In addition to this implementation proposal, another avenue being explored is the utilization of Open-Source tools like "Great Expectations" and "Google Data Validation Tool (DVT)". While these tools offer robust capabilities, they do require a certain level of effort for learning and implementation. Nevertheless, their potential to significantly enhance data validation processes is acknowledged.
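For illustration, the snippet below uses the classic pandas-style Great Expectations API (0.x releases); newer releases organize validation around a Data Context, so this is a sketch of the idea rather than a drop-in implementation, and the file and column names are assumptions.

```python
import great_expectations as ge

# Load an extract as a Great Expectations-aware DataFrame (classic 0.x API)
df = ge.read_csv("orders.csv")

# Declare expectations corresponding to completeness, range, and format checks
df.expect_column_values_to_not_be_null("order_id")
df.expect_column_values_to_be_between("amount", min_value=0, max_value=100_000)
df.expect_column_values_to_match_regex("email", r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

# Validate the whole suite and inspect the outcome
results = df.validate()
print(results.success)
```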
Furthermore, the approach is tailored to the specific needs of each case. Detailed validation procedures are delineated within the mapping rules established for each Business Entity associated with a specific Domain. For more information on validation within a particular Domain, refer to the corresponding Data Mapping Rules Document.
Standard data validation practices currently in operation include:
A comprehensive list of over 20 potential rules has been identified, with their implementation definitions outlined and ready for use.
ETL or Data Transformation Phase:
Embed referential integrity checks within the ETL or data transformation phase of your Data Curation Engine.
Ensure that relationships between related data elements are maintained during data processing.
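A minimal sketch of such a referential integrity check, assuming pandas DataFrames and hypothetical table and column names:

```python
import pandas as pd

def check_referential_integrity(child: pd.DataFrame, child_key: str,
                                parent: pd.DataFrame, parent_key: str) -> pd.DataFrame:
    """Return child rows whose foreign key has no matching row in the parent table."""
    return child[~child[child_key].isin(parent[parent_key])]

# Hypothetical example: every order must reference an existing customer
customers = pd.DataFrame({"customer_id": [1, 2, 3]})
orders = pd.DataFrame({"order_id": [10, 11, 12], "customer_id": [1, 4, 2]})

orphans = check_referential_integrity(orders, "customer_id", customers, "customer_id")
print(orphans)  # the order referencing customer_id 4 violates referential integrity
```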
Data Quality is a broader evaluation of the overall health and fitness of the data.
Referential integrity checks in the Data Quality phase are part of a holistic assessment that examines data for issues like completeness, accuracy, consistency, and relationships.
This phase considers the quality of data in a more comprehensive manner, looking at not just specific rules but the impact of relationships on the data's usefulness and reliability.
In essence, while Data Validation checks specifically ensure that relationships between data elements are maintained according to predefined constraints, Data Quality assessments examine these relationships as one aspect of a more comprehensive evaluation of data's overall quality.
Data Validation (DV) is primarily about the detection of issues, ensuring that data meets predefined criteria and standards, while Data Quality (DQ) extends beyond detection to taking actions to maintain and improve the overall quality of the data. This distinction aligns with common practice in data management:
Data Quality Assessment:
Make referential integrity checks a key component of your overall data quality assessment within the Data Curation Engine.
Evaluate the correctness of relationships as one aspect of data quality.
Pre-Ingestion Checks:
Alerts and Notifications:
Set up alerts and notifications within the Data Curation Engine to trigger when referential integrity violations are detected.
Notify relevant stakeholders, including data stewards and source system owners, about issues in real-time.
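A short sketch of how such notifications might be wired, assuming hypothetical stakeholder addresses per source system; the notification transport is reduced to a logging call.

```python
import logging

logger = logging.getLogger("data_curation_engine")

# Hypothetical stakeholders per source system
STAKEHOLDERS = {"crm": ["data.steward@example.com", "crm.owner@example.com"]}

def notify_integrity_violation(source_system: str, table: str, violation_count: int) -> None:
    """Raise an alert for referential integrity violations and address the relevant stakeholders."""
    recipients = STAKEHOLDERS.get(source_system, [])
    logger.error(
        "Referential integrity violation: %d orphan rows in %s (source: %s); notifying %s",
        violation_count, table, source_system,
        ", ".join(recipients) or "no configured stakeholders",
    )

notify_integrity_violation("crm", "orders", violation_count=1)
```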
Scheduled Integrity Scans:
Logging and Reporting:
Error Handling and Recovery: