1. Introduction

Data curation is the process of collecting, organizing, and preserving data for future use. It is essential for ensuring the quality and usability of data, and it becomes increasingly important as the volume and complexity of data continue to grow.

In the context of a data ocean, data curation is even more critical. A data ocean is a vast repository of data that is collected from a variety of sources. This data can be structured, unstructured, or semi-structured, and it can be of varying quality.

The goal of data curation in a data ocean is to ensure that the data is accurate, complete, consistent, and fit for its intended use.

Data curation in a data ocean can be challenging, but it is essential for making the data valuable and accessible to users.

2. Data Normalization

2.1. Definition

Data normalization is the process of transforming data into a common, consistent format to enable seamless integration and analysis. It involves standardizing data formats, removing duplicate records, and identifying and correcting errors.

2.2. Importance

When data from various sources is aggregated, formats, units, and encodings often do not match. Normalization resolves these disparities, ensuring consistency and reducing redundancy; this makes data integration and analytics more efficient and maintains a single version of truth within the Data Ocean.

Data normalization improves the efficiency of data processing and analysis and raises the overall quality of the data.

2.3. Typical Rules and Actions

  1. Capitalization: Uniformly capitalize textual data.

  2. Date Formatting: Standardize date formats to YYYY-MM-DD (UTC).

  3. Currency Conversion: Convert all currency to a standard unit.

  4. Measurement Unit Standardization: Convert all measurements to a standard unit (e.g., kilometers, kilograms); a sketch applying these rules follows this list.
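
The sketch below illustrates how such rules might be applied to a single record. It is a minimal illustration only: the field names, source formats, and the fixed exchange rate are assumptions, not Data Ocean definitions.

    # Minimal sketch of the normalization rules above; field names, source
    # formats, and the exchange rate are illustrative assumptions.
    from datetime import datetime

    EUR_TO_USD = 1.10          # assumed static rate, for illustration only
    MILES_TO_KM = 1.609344

    def normalize_record(record: dict) -> dict:
        out = dict(record)
        # 1. Capitalization: uniform casing for textual data
        out["customer_name"] = record["customer_name"].strip().upper()
        # 2. Date formatting: standardize to YYYY-MM-DD
        parsed = datetime.strptime(record["order_date"], "%d/%m/%Y")
        out["order_date"] = parsed.strftime("%Y-%m-%d")
        # 3. Currency conversion: convert all amounts to USD
        if record["currency"] == "EUR":
            out["amount"] = round(record["amount"] * EUR_TO_USD, 2)
            out["currency"] = "USD"
        # 4. Measurement unit standardization: miles -> kilometers
        if record["distance_unit"] == "mi":
            out["distance"] = round(record["distance"] * MILES_TO_KM, 3)
            out["distance_unit"] = "km"
        return out

    print(normalize_record({
        "customer_name": " acme corp ",
        "order_date": "31/01/2024",
        "amount": 100.0, "currency": "EUR",
        "distance": 5.0, "distance_unit": "mi",
    }))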

2.4. Metrics and KPIs

Some relevant metrics to implement in a Data Quality monitoring dashboard (a computation sketch follows the list):

  1. Data Consistency Ratio: The level of uniformity in the dataset after normalization procedures have been applied.

  2. Efficiency Gained Post-Normalization: The improvement in data processing and management tasks after normalization has been implemented.

  3. Data Redundancy Factor: Measure of duplicate data before and after normalization.

  4. Normalization Time: Time required to normalize a dataset.

  5. Normality Score: A composite score representing how well the data conforms to normalization rules.
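
As a minimal sketch, the snippet below shows how two of these metrics, the Data Redundancy Factor and the Data Consistency Ratio, might be computed, together with a simple Normalization Time measurement. The formulas are one plausible interpretation, not official Data Ocean definitions.

    # Sketch of possible KPI computations; the formulas are assumptions,
    # not the Data Ocean's official metric definitions.
    import time

    def redundancy_factor(rows: list) -> float:
        # Data Redundancy Factor: share of rows that are exact duplicates.
        return 1 - len(set(rows)) / len(rows) if rows else 0.0

    def consistency_ratio(values: list, conforms) -> float:
        # Data Consistency Ratio: fraction of values matching the target format.
        return sum(1 for v in values if conforms(v)) / len(values) if values else 1.0

    rows = [("A", "2024-01-31"), ("A", "2024-01-31"), ("B", "31/01/2024")]
    dates = [r[1] for r in rows]
    iso_like = lambda d: len(d) == 10 and d[4] == "-" and d[7] == "-"

    start = time.perf_counter()
    # ... the normalization procedures would run here ...
    normalization_time = time.perf_counter() - start   # Normalization Time KPI

    print("Data Redundancy Factor:", round(redundancy_factor(rows), 2))
    print("Data Consistency Ratio:", round(consistency_ratio(dates, iso_like), 2))
    print("Normalization Time (s):", normalization_time)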

2.5. Data Ocean Enforced Rules

Data normalization is currently carried out in the ETL tool and applied on a case-by-case basis.

These normalization processes are delineated in the mapping rules for each Business Entity in a Domain.

Common normalization rules are already in place, and a comprehensive list of over 20 additional potential rules has been identified, with their implementation definitions outlined and ready to be used.
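
Purely as an illustration of how such mapping rules could be organized per Business Entity, the sketch below uses invented entity names, fields, and rule identifiers; it does not reflect the actual ETL mapping definitions.

    # Hypothetical structure for per-entity normalization mapping rules;
    # entities, fields, and rule names are invented for illustration.
    MAPPING_RULES = {
        "Customer": {                     # Business Entity within a Domain
            "name":       ["trim", "uppercase"],
            "birth_date": ["date_iso_8601"],
        },
        "Order": {
            "amount":     ["currency_to_usd"],
            "order_date": ["date_iso_8601"],
        },
    }

    def rules_for(entity: str, field: str) -> list:
        # Look up the normalization rules declared for one field of an entity.
        return MAPPING_RULES.get(entity, {}).get(field, [])

    print(rules_for("Customer", "name"))   # ['trim', 'uppercase']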

3. Data Validation

3.1. Definition

Data validation is the process that ensures data complies with defined formats, rules, and business-specific constraints. It involves checking data for errors and omissions and ensuring that the data is accurate, complete, and consistent.

3.2. Importance

Data validation is crucial for building trust and reliability in data.

Unverified or incorrect data can lead to erroneous conclusions and misleading insights, which in turn can have a significant adverse impact on business decisions.

It involves validating the data against quality standards and identifying any errors or inconsistencies.

This can be done manually or automatically using a variety of tools and techniques.

3.3. Typical Rules and Actions

  1. Type Checks: Validate the data type (text, integer, float, etc.).

  2. Format Checks: Verify that data matches the expected pattern (e.g., dates, e-mail addresses, postal codes).

  3. Range Checks: Verify that numerical data lies within defined ranges.

  4. Completeness Checks: Ensure all mandatory fields are filled.

  5. Uniqueness Checks: Verify that primary keys or unique identifiers do not have duplicates.

  6. Consistency Checks: Verify that related fields do not contradict each other (e.g., a start date precedes its end date).

  7. Domain Checks: Ensure data belongs to a defined set of permissible values (a sketch of these checks follows this list).
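
A minimal sketch of how these checks might be expressed in code is shown below; the field names, range bounds, and permissible values are assumptions for illustration, not the Data Ocean's actual validation rules.

    # Sketch of the check types above; fields, bounds, and permitted values
    # are illustrative assumptions.
    import re

    ALLOWED_COUNTRIES = {"BE", "FR", "NL"}   # domain check: permissible values
    seen_ids = set()                         # uniqueness check across records

    def validate_record(record: dict) -> list:
        errors = []
        # Completeness check: all mandatory fields present and non-empty
        for field in ("id", "email", "age", "country"):
            if record.get(field) in (None, ""):
                errors.append("missing mandatory field: " + field)
        # Type check: age must be an integer
        if not isinstance(record.get("age"), int):
            errors.append("age is not an integer")
        # Range check: numerical data within a defined range
        elif not 0 <= record["age"] <= 120:
            errors.append("age out of range 0-120")
        # Format check: e-mail matches an expected pattern
        if record.get("email") and not re.match(r"[^@\s]+@[^@\s]+\.[^@\s]+$", record["email"]):
            errors.append("email format invalid")
        # Domain check: country restricted to a permitted set
        if record.get("country") not in ALLOWED_COUNTRIES:
            errors.append("country not in permitted set")
        # Uniqueness check: primary key must not repeat
        if record.get("id") in seen_ids:
            errors.append("duplicate id")
        seen_ids.add(record.get("id"))
        return errors

    print(validate_record({"id": 1, "email": "a@b.co", "age": 30, "country": "BE"}))  # []

A consistency check, for example verifying that a start date precedes its end date, would follow the same pattern.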

3.4. Metrics and KPIs

Some relevant metrics to implement in a Data Quality monitoring dashboard (a sample KPI calculation follows the list):

  1. Data Validation Success Rate or Validation Accuracy: The percentage of records that have been validated correctly (that meet all validation rules) out of the total records processed.

  2. Data Rejection Rate: The percentage of records that were rejected during validation due to errors or not meeting predefined criteria.

  3. Time Taken for Validation: The total duration required to complete the validation process for a batch of data or a single record.

  4. Number of Manual Interventions Required: The count of instances where human input or correction was necessary during the data validation process.

  5. Field-Level Compliance Rate: The proportion of individual data fields across all records that pass validation checks.

  6. Failed Validation Alerts: The total number of automated notifications generated when data does not pass the validation process.
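
As a minimal sketch, assuming each record has already been tagged with its list of validation errors (for example by a routine such as the one sketched in 3.3), several of these KPIs could be derived as follows; the formulas are illustrative interpretations, not official definitions.

    # Sketch of validation KPI calculations; the formulas are assumed
    # interpretations of the metrics listed above.
    def validation_kpis(results: list, duration_s: float) -> dict:
        total = len(results)
        passed = sum(1 for errors in results if not errors)
        return {
            # Data Validation Success Rate: records meeting all validation rules
            "success_rate": passed / total if total else 1.0,
            # Data Rejection Rate: records rejected due to at least one error
            "rejection_rate": (total - passed) / total if total else 0.0,
            # Failed Validation Alerts: here assumed to be one alert per failing record
            "failed_alerts": total - passed,
            # Time Taken for Validation, for the whole batch
            "validation_time_s": duration_s,
        }

    batch_results = [[], ["age out of range 0-120"], [], ["duplicate id"]]
    print(validation_kpis(batch_results, duration_s=0.42))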

3.5. Data Ocean Enforced Rules

Data validation is an important part of data curation, as it helps to ensure that the data is accurate and complete.

Data validation within the Data Ocean framework is currently under investigation. Notably, a comprehensive implementation proposal has already been developed, complete with design and architectural considerations.