1. Introduction
Data curation is the process of collecting, organizing, and preserving data for future use. It is essential for ensuring the quality and usability of data, and it is becoming increasingly important as the volume and complexity of data continue to grow.
In the context of a data ocean, data curation is even more critical. A data ocean is a vast repository of data that is collected from a variety of sources. This data can be structured, unstructured, or semi-structured, and it can be of varying quality.
The goal of data curation in a data ocean is to ensure that the data is:
- Accurate: The data must be free of errors and omissions.
- Complete: The data must be comprehensive and cover all aspects of the domain of interest.
- Consistent: The data must be formatted and organized in a consistent manner.
- Reliable: The data must come from trustworthy sources and be dependable over time.
- Usable: The data must be easy to find, access, and understand.
Data curation in a data ocean can be challenging, but it is essential for making the data valuable and accessible to users.
2. Data Validation
Data validation is the process of checking data for errors and omissions, and of ensuring that the data is accurate, complete, and consistent.
It involves validating the data against quality standards and identifying any errors or inconsistencies.
This can be done manually or automatically using a variety of tools and techniques.
Some rules and metrics that can be used for data validation include:
- Completeness: Ensure that all required data fields are present and contain valid values.
- Consistency: Ensure that the data is consistent across all sources and that there are no conflicting values.
- Accuracy: Ensure that the data is accurate and reflects the real-world values it represents.
- Timeliness: Ensure that the data is up-to-date and reflects the latest information.
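The metrics above can be computed directly over a batch of records. The following is a minimal sketch, assuming records are Python dictionaries with a UTC timestamp field; the field names and thresholds are hypothetical, not part of the Data Ocean implementation:

```python
from datetime import datetime, timedelta, timezone

def completeness(records, required_fields):
    """Fraction of records in which every required field is present and non-empty."""
    if not records:
        return 0.0
    ok = sum(
        1 for r in records
        if all(r.get(f) not in (None, "") for f in required_fields)
    )
    return ok / len(records)

def timeliness(records, ts_field, max_age):
    """Fraction of records whose timestamp is no older than max_age."""
    if not records:
        return 0.0
    now = datetime.now(timezone.utc)
    fresh = sum(1 for r in records if now - r[ts_field] <= max_age)
    return fresh / len(records)

# Hypothetical sample batch: one complete and fresh record, one with a
# missing name and a stale timestamp.
records = [
    {"id": 1, "name": "Alice", "updated": datetime.now(timezone.utc)},
    {"id": 2, "name": "", "updated": datetime.now(timezone.utc) - timedelta(days=3)},
]
print(completeness(records, ["id", "name"]))           # 0.5
print(timeliness(records, "updated", timedelta(days=1)))  # 0.5
```

Scores like these can be tracked per source and per Business Entity, making degradations visible over time.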
Some common data validation rules include:
- The data must fall within a specified range (boundary values validation).
- The data must match a specific format.
- The data must be unique (uniqueness check).
- The data must be consistent with other data.
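To illustrate how such rules operate row by row, here is a minimal sketch in plain Python; the column names, range bounds, and email pattern are illustrative assumptions, not the actual Data Ocean rule set:

```python
import re

def validate(rows):
    """Apply simple validation rules; return a list of (row_index, error) pairs."""
    errors = []
    seen_ids = set()
    for i, row in enumerate(rows):
        # Boundary values: age must fall within 0..130.
        if not (0 <= row["age"] <= 130):
            errors.append((i, "age out of range"))
        # Format: email must match a basic pattern.
        if not re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", row["email"]):
            errors.append((i, "invalid email format"))
        # Uniqueness: id must not repeat across the batch.
        if row["id"] in seen_ids:
            errors.append((i, "duplicate id"))
        seen_ids.add(row["id"])
    return errors

rows = [
    {"id": 1, "age": 34, "email": "a@example.com"},
    {"id": 1, "age": 200, "email": "not-an-email"},  # violates all three rules
]
print(validate(rows))
```

In practice such checks would be declared as configuration rather than hard-coded, which is precisely what tools like Great Expectations provide.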
Data validation is an important part of data curation, as it helps to ensure that the data is accurate and complete.
Data validation within the Data Ocean framework is currently undergoing investigation. Notably, a comprehensive implementation proposal has already been developed, complete with design and architectural considerations.
Additionally, another avenue being explored involves the utilization of Open-Source tools such as "Great Expectations" and "Google Data Validation Tool (DVT)". While these tools offer robust capabilities, they do require a certain level of effort for learning and implementation. Despite this, their potential to greatly enhance data validation processes is recognized.
A comprehensive list of more than 20 candidate rules has been identified, with implementation definitions outlined; however, they have not yet been put into practice.
3. Data Normalization
Data normalization is the process of organizing data in a consistent manner. This involves standardizing the data format, removing duplicate data, and identifying and correcting errors.
Data normalization can improve the efficiency of data processing and analysis, and it can also help to improve the quality of data.
Data Normalization is presently being executed through the ETL tool and is applied on a case-by-case basis. These normalization processes are delineated in the mapping rules for each Business Entity.
Some common normalization rules currently in place:
- Cast or Data Type conversion
- Format Data, in particular Dates to UTC
- Upper/lower conversion
- Trim data
- SGK creation
- "Ghost" records insertion
- Derived Column Creation
- Substitution of NULL by default value
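Several of these rules can be sketched as a single record-level transformation. The following is an illustrative Python version, assuming ISO-8601 date strings and hypothetical field names (`amount`, `created`, `code`, `country`); the actual rules live in the ETL mapping definitions per Business Entity:

```python
from datetime import datetime, timezone

# Substitution of NULL by a default value (hypothetical defaults).
DEFAULTS = {"country": "UNKNOWN"}

def normalize(record):
    out = dict(record)
    # Cast / data type conversion: string amount -> float.
    out["amount"] = float(out["amount"])
    # Format dates to UTC (assumes ISO-8601 input with an offset).
    out["created"] = datetime.fromisoformat(out["created"]).astimezone(timezone.utc)
    # Upper-case conversion and trimming.
    out["code"] = out["code"].strip().upper()
    # Substitute NULLs with defaults.
    for field, default in DEFAULTS.items():
        if out.get(field) is None:
            out[field] = default
    # Derived column creation (illustrative bucketing rule).
    out["amount_bucket"] = "high" if out["amount"] >= 1000 else "low"
    return out

rec = {
    "amount": "1250.5",
    "created": "2024-05-01T10:00:00+02:00",
    "code": "  ab12 ",
    "country": None,
}
print(normalize(rec))
```

SGK creation and "ghost" record insertion are table-level operations and are therefore handled separately in the ETL flow rather than per record.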
A comprehensive list of more than 20 candidate rules has been identified, with implementation definitions outlined and ready to be used.
4. Data Quality
Data quality is a measure of the accuracy, completeness, and consistency of data.
High-quality data is essential for making informed decisions, and it is also important for ensuring the reliability of data-driven systems.
There are a number of factors that can affect data quality, including:
- The quality of the data collection process
- The quality of the data storage and processing systems
- The quality of the data governance processes
Data quality can be improved through a variety of measures, including:
- Implementing data validation and normalization procedures
- Enforcing data quality policies and standards
- Educating users about data quality
- Using data quality tools and techniques
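One common way to make data quality measurable is to roll individual dimension scores up into a single weighted indicator. This is a minimal sketch under assumed dimensions and weights; neither is prescribed by the Data Ocean framework:

```python
def quality_score(metrics, weights):
    """Weighted average of per-dimension quality scores, each in [0, 1]."""
    total = sum(weights.values())
    return sum(metrics[dim] * w for dim, w in weights.items()) / total

# Hypothetical dimension scores and weights.
metrics = {"completeness": 0.98, "consistency": 0.95,
           "accuracy": 0.90, "timeliness": 0.85}
weights = {"completeness": 2, "consistency": 1,
           "accuracy": 2, "timeliness": 1}

print(round(quality_score(metrics, weights), 3))  # 0.927
```

A score like this, tracked per source over time, is the kind of figure a Data Quality KPI dashboard typically surfaces.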
Data Quality validation is not within the immediate scope at this stage. Nonetheless, the intent remains to integrate these enhancements into the Data Curation layer in forthcoming phases. Notably, a separate project is already underway that actively addresses this particular aspect.
A comprehensive list of more than 20 candidate rules has been identified, with implementation definitions outlined; however, they have not yet been put into practice.
For more in-depth information on the existing solution, please refer to the Data Quality dashboard link.
For more information on the subject, please refer to the Data Quality link.
5. Future Actions
The following actions are proposed.
Implement Data Quality within the Data Ocean ecosystem
Taking a cue from the Data Quality KPI Dashboard, a potential step forward to enhance data curation within the Data Ocean context is the introduction of a centralized data quality initiative. This initiative would have the responsibility of overseeing data quality throughout the entire Data Ocean ecosystem. Its primary role would involve identifying and promptly alerting stakeholders about any data quality concerns.
Establishment of a data quality initiative at the operational level
Another prospective strategy to consider is the establishment of a data quality initiative at the operational level, geared towards real-time data analysis and rectification. This approach is significant for addressing data anomalies and discrepancies promptly, thereby maintaining the integrity of the information ecosystem.
This operational-level data quality initiative would involve deploying advanced algorithms and automated processes that continuously monitor incoming data streams. By leveraging real-time analytics, this system can instantaneously identify deviations from predefined data quality benchmarks. In the event of discrepancies, automated corrective measures can be applied, ranging from data enrichment through external sources to flagging erroneous entries for manual review.
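The monitor-then-correct-or-flag loop described above can be sketched in a few lines. Everything here is an illustrative assumption: the benchmark checks, the enrichment stub, and the record fields are placeholders, not the proposed production design:

```python
def monitor(stream, benchmarks):
    """Check each incoming record against benchmarks; yield (record, status)."""
    for record in stream:
        issues = [name for name, check in benchmarks.items() if not check(record)]
        if not issues:
            yield record, "ok"
        elif "missing_price" in issues and record.get("price") is None:
            # Automated corrective measure: enrichment stub (would call an
            # external source in a real system).
            record["price"] = 0.0
            yield record, "corrected"
        else:
            # Deviations we cannot auto-correct are flagged for manual review.
            yield record, "flagged: " + ", ".join(issues)

# Hypothetical benchmarks over a toy record stream.
benchmarks = {
    "missing_price": lambda r: r.get("price") is not None,
    "positive_qty": lambda r: r.get("qty", 0) > 0,
}
stream = [
    {"qty": 3, "price": 9.99},
    {"qty": 5, "price": None},
    {"qty": -1, "price": 1.0},
]
for rec, status in monitor(stream, benchmarks):
    print(status)  # ok / corrected / flagged: positive_qty
```

In a real deployment the generator would sit on a message queue or streaming platform, and the "flagged" branch would raise the timely alerts described above.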
A critical aspect of this initiative would be its proactive nature. Instead of relying solely on retrospective audits, it would function in an anticipatory mode, precluding the propagation of erroneous data into downstream processes. Timely alerts would be generated for immediate corrective actions, minimizing the risk of inaccurate insights, faulty decision-making, or downstream process disruptions.
Furthermore, such an operational-level data quality initiative would synergize with the existing data curation practices, forming a robust defense against data inconsistencies. This approach not only aligns with best practices in data governance but also positions the Data Ocean architecture for greater reliability and value generation.
To execute this initiative effectively, collaboration across cross-functional teams, including data engineers, analysts, and domain experts, is crucial. Additionally, the establishment of clear workflows, data quality metrics, and continuous performance monitoring mechanisms will be pivotal to ensure its success. By integrating real-time data quality assurance into the Data Ocean, this initiative can significantly elevate the overall data ecosystem's reliability and usability.
The existing SAP Information Steward could be used for this purpose.
Select and implement a Data Validation Tool
Conclude the thorough analysis of the identified tools and choose one for conducting a Proof of Concept (POC).
References
- Data Curation: https://en.wikipedia.org/wiki/Data_curation
- Data Validation: https://en.wikipedia.org/wiki/Data_validation
- Data Normalization: https://en.wikipedia.org/wiki/Data_normalization
- Data Quality: https://en.wikipedia.org/wiki/Data_quality
- "Great Expectations": https://greatexpectations.io/
- "Google Data Validation Tool (DVT)": https://cloud.google.com/blog/products/databases/automate-data-validation-with-dvt