Introduction
Purpose of the Document
This wiki page serves as a comprehensive guide outlining the components and functionalities of the Data Ocean Architecture.
Following the structure proposed in the "Reference Architecture," this document will delve into how the architecture supports a scalable, organized, and secure system for handling a variety of data needs across the organization.
Target Audience
This document is intended for multiple audiences within the organization, including but not limited to:
Data Engineers: For understanding the workflow and where they contribute to the architecture.
Data Scientists: To comprehend how to access and interact with the data for analytical purposes.
Business Analysts: For knowing how the data flows and where they can extract the information they need for reports and dashboards.
Data Architects: Who are responsible for the overall structure and integrity of the data environment.
Technical Business Users: To gain insights into what data is available, how to access it, and under what conditions.
Data Governance Teams: For ensuring that the organization of data aligns with company policies and standards.
2. Data Flow
The Data Ocean Architecture is a sophisticated and comprehensive framework that facilitates efficient data processing and information dissemination within the organization.
This chapter delves into the intricate data flow within the Data Ocean Architecture, drawing insights from the Reference Architecture.
2.1. Overview
The architecture is composed of multiple interconnected blocks, each serving a distinct purpose in the data processing pipeline.
The Data Ocean Reference Architecture is designed to facilitate the smooth movement of data across various layers and components. The data flow within the Data Ocean Architecture traverses through a series of logically connected blocks, each contributing to the transformation of raw data into valuable insights. These blocks are strategically designed to ensure modularity, data integrity, and scalability.
This movement is guided by a well-structured flow that ensures data availability, quality, and accessibility throughout the pipeline, enabling timely information dissemination and serving diverse business needs within the organization.
The Data Flow within the Data Ocean Reference Architecture is structured to ensure efficient data processing. It emphasizes extracting, transforming, and loading (ETL) data effectively, using technology to its fullest potential, while also stressing the need to maintain control over data, to persist it reliably, and to keep it accessible throughout the entire process.
2.1.1. General Requirements: ETL Process Reliability and Data Consistency
Reliable Repeatability: ETL processes must prioritize achieving a high level of reliability and repeatability. This focus ensures consistent and dependable data handling across all stages of the process.
Data Integrity and Consistency: The ETL process must uphold data integrity and consistency throughout its execution. At no point should the process allow data to become inconsistent, so that data quality is maintained at all times.
Transactional Execution: ETL processes must operate in a transactional manner. This means that the process should either succeed entirely, resulting in complete data transformation and loading, or it should fail completely, leaving no partial or erroneous data in its wake (a minimal sketch of this pattern follows this list).
Criticality for End-User Data: The significance of adhering to transactional execution becomes even more pronounced when handling data intended for end-user consumption. Allowing for partial or inconsistent data in such scenarios is unacceptable, as it could lead to misinformation or operational disruptions.
Uncompromised Business Objects: The ETL process must safeguard business objects from the risk of empty or inconsistent states. The execution of ETL activities should never result in a business object being devoid of meaningful data or falling into a state of inconsistency.
End-User Facing Data: ETL activities involving data exposed to end users must adhere to the highest standards of reliability and data consistency. This ensures that the data presented to end users is accurate, trustworthy, and aligned with business objectives.
By adhering to these requirements, the ETL processes can maintain a high degree of reliability, transactional consistency, and data integrity across various stages, ultimately contributing to accurate decision-making and seamless business operations.
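To make the transactional-execution requirement above concrete, the following is a minimal sketch, assuming a relational target and using SQLite purely as a stand-in for the actual platform; the table names and the staging-then-swap pattern are illustrative, not prescribed by the architecture.

```python
import sqlite3

def transactional_load(conn: sqlite3.Connection, rows: list[tuple]) -> None:
    """Load rows so that readers never observe a partial or empty target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (id INTEGER, amount REAL)")
    conn.execute("DROP TABLE IF EXISTS sales_staging")
    conn.execute("CREATE TABLE sales_staging (id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO sales_staging VALUES (?, ?)", rows)
    conn.commit()  # staging data is in place; the target table is still untouched

    # The swap runs in one transaction: either both statements take effect,
    # or an error rolls everything back and the previous data survives intact.
    with conn:
        conn.execute("DELETE FROM sales")
        conn.execute("INSERT INTO sales SELECT * FROM sales_staging")

    conn.execute("DROP TABLE sales_staging")
    conn.commit()

if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    transactional_load(connection, [(1, 10.5), (2, 99.0)])
    print(connection.execute("SELECT COUNT(*) FROM sales").fetchone()[0])  # -> 2
```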
For more specific and complete rules, please see Data Engineer Guidelines.
2.2. Key Components of Data Flow
The data flow within the Reference Architecture involves several key components, as described in the Reference Architecture, including:
- Data Sources: Both structured and unstructured data from various domains are extracted for processing.
- Data Capturing: Data is captured through batch and streaming processes.
- Lake House Architecture: Data undergoes storage, curation, and provisioning.
- Data Science and Machine Learning: Analytical processes are conducted on the data.
- Self-service ETL: The architecture accommodates self-service ETL capabilities but does not explicitly endorse them, given their unique characteristics and their close ties to established best practices in Software Development and Architecture.
- Data Management: Data undergoes cataloging and validation in an orchestrated way.
- Operations: Data security, workload management, environment management, and monitoring are applied.
- Data Consumers: Data is accessed by BI tools and portals.
Data Sources
This block encompasses all external data sources, such as databases, APIs, web scraping, and files from both internal and external entities. The architecture strongly recommends direct access to these sources whenever possible, to maintain control and governance over data quality and security.
About Files
The company should work to reduce its reliance on files. Many departments depend on them excessively, leading to an overwhelming proliferation of files throughout the organization.
Ingesting files can be harmful if they are not completely managed by a job within the context of a specific known extraction process.
Some files come from unknown or difficult-to-identify sources, which can lead to compromised data quality, inconsistent data formats, and unanticipated changes in file structures and layouts, adding to the difficulty of efficiently troubleshooting issues. These inconsistencies might cause errors throughout the data extraction, transformation, and loading stages, resulting in incorrect data processing, data loss, or incomplete data sets.
Also, inaccuracies that occur can have a domino effect, influencing downstream analytics, reporting, and decision-making processes. In the end, this can have an influence on data accuracy and reliability, as well as create security and compliance issues.
It is important to note that relying on such files weakens the integration process by adding instability and risks that can result in failures and disruptions in the ingestion pipeline. At the very least, concrete actions should be taken to mitigate these issues and ensure the resilience and robustness of the solutions.
The goal should be to have robust pipelines and processes.
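One concrete mitigation is to validate incoming files against their expected layout before they enter the pipeline and to reject them early when the structure has drifted. The sketch below is illustrative only: the expected column list, the delimiter, and the quarantine behaviour are assumptions, not agreed standards.

```python
import csv
from pathlib import Path

# Hypothetical expected layout for one known file-based source.
EXPECTED_COLUMNS = ["order_id", "order_date", "customer_id", "amount"]

def validate_layout(path: Path, delimiter: str = ",") -> list[str]:
    """Return a list of layout problems; an empty list means the file may be ingested."""
    problems: list[str] = []
    with path.open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        header = next(reader, None)
        if header is None:
            return ["file is empty"]
        if header != EXPECTED_COLUMNS:
            problems.append(f"unexpected header: {header}")
        for line_no, row in enumerate(reader, start=2):
            if len(row) != len(EXPECTED_COLUMNS):
                problems.append(f"line {line_no}: expected {len(EXPECTED_COLUMNS)} fields, got {len(row)}")
    return problems

def ingest_or_quarantine(path: Path) -> None:
    """Reject structurally broken files before they can pollute downstream layers."""
    issues = validate_layout(path)
    if issues:
        # In a real pipeline this would move the file to a quarantine location and alert.
        raise ValueError(f"{path.name} rejected: {issues[:3]}")
    print(f"{path.name} passed layout checks and can be ingested")
```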
Data Extraction
Data is extracted from diverse sources using ETL processes.
The data extraction process involves retrieving data from various sources as outlined in the Reference Architecture. It's crucial to adhere to specific time constraints to access internal business applications, ensuring minimal impact on business systems. These extraction jobs should prioritize simplicity and speed, operating without dependencies beyond agreed-upon timing and source system limitations.
The preferred approach is to extract all available data (a full table load), which helps maintain data integrity, unless constraints such as resource limitations or time considerations arise. While techniques like Change Data Capture (CDC) can be considered for obtaining deltas, triggers or control columns in source tables are generally not deemed secure.
Data Validation and Extraction Strategy:
If full data extraction is feasible but a potential impact is identified (such as dealing with large files), options for calculating a delta are explored. This could involve comparing the previous and current files before loading, or using platform functionality after loading. In the absence of restrictions or adverse impacts, the full data file is loaded onto the platform. Maintaining low data latency remains essential; real-time or streaming use cases are not currently in play, but the Data Ocean solution should support them in the future.
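Where comparing the previous and current files before loading is the chosen option, a delta can be derived with a straightforward key-based comparison. The following minimal sketch assumes CSV extracts and a hypothetical `customer_id` business key; the file names and column choice are placeholders.

```python
import csv
from pathlib import Path

def read_snapshot(path: Path, key: str) -> dict[str, dict]:
    """Index a full extract by its business key."""
    with path.open(newline="", encoding="utf-8") as handle:
        return {row[key]: row for row in csv.DictReader(handle)}

def compute_delta(previous: dict[str, dict], current: dict[str, dict]) -> dict[str, list[dict]]:
    """Classify records as inserted, updated, or deleted between two full extracts."""
    inserted = [row for k, row in current.items() if k not in previous]
    deleted = [row for k, row in previous.items() if k not in current]
    updated = [row for k, row in current.items() if k in previous and row != previous[k]]
    return {"inserted": inserted, "updated": updated, "deleted": deleted}

# Example usage (file names are placeholders):
# delta = compute_delta(read_snapshot(Path("customers_prev.csv"), "customer_id"),
#                       read_snapshot(Path("customers_curr.csv"), "customer_id"))
```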
Please refer back to the detailed recommendations provided in the section on Extracting Data from Known Sources for more comprehensive guidance on this matter.
Data Capturing and Ingestion
The extracted data is directed into Cloud Storage, which serves as a designated landing zone. This zone is composed of controlled buckets that effectively prevent duplication and guarantee the secure storage of error-free data.
The integration process ensures that exceptions or errors are properly managed.
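As an illustration of how a controlled landing zone can prevent duplication, the sketch below derives a deterministic object key from domain, source, extraction date, and a content hash, so a re-delivered file cannot create a second copy. The path convention and the local-filesystem stand-in for Cloud Storage are assumptions.

```python
import hashlib
from datetime import date
from pathlib import Path

LANDING_ROOT = Path("/tmp/landing")  # stand-in for a cloud storage bucket

def landing_key(domain: str, source: str, extraction_date: date, payload: bytes) -> Path:
    """Build a deterministic landing path: identical content on the same day lands on the same key."""
    digest = hashlib.sha256(payload).hexdigest()[:16]
    return LANDING_ROOT / domain / source / extraction_date.isoformat() / f"{digest}.csv"

def land_file(domain: str, source: str, payload: bytes) -> Path:
    """Write the payload to the landing zone unless an identical object is already there."""
    target = landing_key(domain, source, date.today(), payload)
    if target.exists():
        return target  # duplicate delivery: nothing to do
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)
    return target
```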
At present, the architecture's primary emphasis is on batch processing: handling data in groups at scheduled intervals. However, the architecture is built with flexibility in mind and is well prepared to support streaming data, which is processed continuously as it arrives. While the architecture currently leans towards batch processing, it is designed to accommodate streaming capabilities smoothly whenever they become relevant or necessary.
Streaming data, characterized by its continuous, real-time nature, holds immense value in scenarios where timely insights and rapid actions based on dynamic data changes are critical. Applications spanning real-time monitoring, Internet of Things (IoT) devices, social media sentiment analysis, financial transactions, and more can greatly benefit from the immediate processing of streaming data.
The modular nature of the Data Ocean Architecture ensures that streaming data pipelines can be integrated without causing disruption to existing batch processing flows. This adaptability showcases the architecture's forward-looking approach, positioning it to seamlessly embrace emerging technologies and evolving data processing requirements.
While not the primary objective, this layer does offer the option of adopting a standard Data Lakehouse approach as a potential solution.
RAW Block
Afterward, the data is loaded into the RAW layer. This layer keeps the data in its original format, which means it can handle different types of data, including structured, unstructured, and semi-structured data.
For a more comprehensive understanding, it is recommended to refer to Chapter 4, with a specific focus on the section that pertains to the subject of data organization.
Curated Data Layer
Data transitions from the RAW layer to the Curated Data Layer, operating in tandem with the curation block, and undergoes essential processes such as normalization, validation, and integration. This block serves as a vital checkpoint, ensuring data accuracy, conformity to standards, and alignment with technical prerequisites such as integrity and quality.
In specific scenarios, Self-Service ETL and Data Science Tools have access to this layer for particular use cases.
For a more comprehensive understanding, it is recommended to refer to Chapter 4, with a specific focus on the section that pertains to the subject of data organization.
Provisioning
This block focuses on provisioning detailed data models in an environment tailored for specific use cases.
Self-Service ETL, Data Science Tools, and in special circumstances, technical users can access and interact with this layer.
Typically, these processes should operate at the Data Product (DP) level. However, if the outcomes are relevant to the entire enterprise and should be accessible across the organization, they could be made available at the Domain level.
Domain Data Models
The Domain Data Models layer represents the enterprise-level, Domain-oriented data models.
For a more comprehensive understanding, it is recommended to refer to Chapter 4, with a specific focus on the section that pertains to the subject of data organization.
Data Products (DP)
Data flows to the Data Products block, meaning that Data Products are crafted from Domain data, providing tailored insights for specific business/user needs.
Each Data Product operates independently, ensuring flexibility and minimizing dependencies.
Self-Service ETL, Data Science Tools, BI tools through presentation views, and technical business users can access and interact with this layer for actionable insights.
For a more comprehensive understanding, it is recommended to refer to Chapter 4, with a specific focus on the section that pertains to the subject of data organization.
Data Management
Data Management is out of scope for this document.
Operations
Operations is out of scope for this document.
3. Data Integration: Principles and Best Practices
3.1. What is Data Integration?
Data integration refers to the process of combining data from disparate sources into a unified and coherent view. It involves harmonizing data formats, structures, and semantics to ensure consistent and accurate information is available for analysis, reporting, and decision-making.
This chapter provides an overview of Data integration from a general and best practices standpoint. For a more comprehensive understanding, please consult the section on Data Engineer Guidelines.
3.2. Goals of Data Integration
The primary goals of data integration in the Data Ocean architecture include:
- Ensuring data consistency and accuracy across different sources.
- Facilitating seamless data movement between various layers.
- Providing a unified view of data for efficient analysis.
- Reducing data redundancy and ensuring a single source of truth.
3.3. Benefits of Effective Data Integration
Effective data integration offers several benefits:
- Enhanced data quality and reliability.
- Improved decision-making based on accurate insights.
- Increased operational efficiency through streamlined data processes.
- Faster access to integrated data, saving time and effort.
3.4. Best Practices for Data Integration
To achieve successful data integration within the Data Ocean architecture, consider these best practices:
- Define clear data integration goals and objectives.
- Establish data governance to ensure data consistency and quality.
- Utilize standardized data formats and semantics.
- Implement data transformation processes to harmonize data.
- Leverage automation tools and platforms for efficient integration.
- Continuously monitor and validate integrated data for accuracy.
3.5. Data Integration Challenges and Solutions
Challenges in data integration may include data silos, schema differences, and varying data velocities. Solutions include:
- Creating a data integration strategy and roadmap.
- Implementing data virtualization to access data without physically moving it.
- Adopting data cataloging and metadata management for better data discovery.
3.6. ETL Process Restructuring and Autonomy
Several key principles guide the restructuring of ETL processes to achieve greater autonomy and efficiency:
Extraction processes are designed to be independent of other pipeline processes and feed into the RAW and subsequent layers.
Transformation processes are decoupled from Extraction and Loading, enabling execution at any time of day for data preparation.
Loading processes for final EDW data models or Data Products are restructured to be independent of Transformation processes.
Extraction processes can be autonomously executed and shared across multiple downstream chains.
Transformation processes can operate independently of Loading processes, allowing heavy data preparation whenever needed, with data available in the EDW as required.
Load and data refresh occur by domain and are independent processes.
The ETL processes are restructured to be independent of other data pipeline processes, promoting autonomy, reusability, and process efficiency.
3.6.1. Idempotent ETL
The aim is to design ETL processes to be idempotent, ensuring that if the same operation is executed multiple times, the outcome remains consistent and doesn't produce unintended side effects.
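A common way to achieve idempotency is to scope each run to a batch key (for example, the extraction date) and replace that slice entirely on every execution, so re-running the job yields the same end state. The sketch below assumes a `load_date` column and uses SQLite only as a stand-in for the actual platform.

```python
import sqlite3

def idempotent_load(conn: sqlite3.Connection, load_date: str, rows: list[tuple]) -> None:
    """Delete-then-insert for a single batch key: running this twice leaves the table identical."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_orders (order_id INTEGER, amount REAL, load_date TEXT)"
    )
    with conn:  # one transaction: the slice is replaced atomically
        conn.execute("DELETE FROM fact_orders WHERE load_date = ?", (load_date,))
        conn.executemany(
            "INSERT INTO fact_orders VALUES (?, ?, ?)",
            [(order_id, amount, load_date) for order_id, amount in rows],
        )

if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    batch = [(1, 10.0), (2, 25.0)]
    idempotent_load(db, "2024-01-01", batch)
    idempotent_load(db, "2024-01-01", batch)  # re-run: still exactly two rows for this date
    print(db.execute("SELECT COUNT(*) FROM fact_orders").fetchone()[0])  # -> 2
```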
3.7. ETL Management and Monitoring
The restructured ETL processes operate under orchestration and generate audit information. This information can be used to obtain job status and is supervised by operations in a monitored and secure environment.
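As a hedged sketch of what such audit information might look like, the wrapper below records start time, end time, status, and row count for each job run; the in-memory audit list stands in for whatever audit table or logging backend the orchestration actually provides.

```python
import traceback
from datetime import datetime, timezone
from typing import Callable

AUDIT_LOG: list[dict] = []  # stand-in for an audit table maintained by the orchestrator

def run_with_audit(job_name: str, job: Callable[[], int]) -> None:
    """Execute a job and record an audit entry with its status and processed row count."""
    entry = {"job": job_name, "started_at": datetime.now(timezone.utc).isoformat()}
    try:
        entry["rows_processed"] = job()
        entry["status"] = "SUCCESS"
    except Exception:
        entry["status"] = "FAILED"
        entry["error"] = traceback.format_exc(limit=1)
        raise
    finally:
        entry["finished_at"] = datetime.now(timezone.utc).isoformat()
        AUDIT_LOG.append(entry)

if __name__ == "__main__":
    run_with_audit("load_fact_orders", lambda: 1250)
    print(AUDIT_LOG[-1]["status"], AUDIT_LOG[-1]["rows_processed"])
```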
4. Data Organization
Data Organization serves as a crucial element of the Reference Architecture, designed to align data processes across various business units effectively. This solution is structured around a layered approach, demarcating distinct stages of data processing, all orchestrated through the prism of transversal Domains or subjects.
The architecture's structure adheres to a Domain-Driven approach that aligns well with Solvay Atom's transformation goals and fosters data culture and accountability.
This chapter outlines the core components, the layered architecture, and the nine specific business Domains.
4.1. Architectural Layers
The reference architecture delineates a coherent framework composed of five distinctive layers. The Data Ocean architecture includes several layers to manage the complexity and demands of a data-driven organization. Each layer signifies a significant phase in the data journey, spanning from data ingestion to the creation of actionable insights:
4.1.1. Raw Data Layer (Ingestion Layer)
Also known as the Landing Area, this layer is the initial repository where data is ingested swiftly and efficiently. Data retains its native format at this stage, with no transformations applied, which ensures that the data can serve as a point-in-time archive.
The layer is hierarchically organized based on subject areas, data sources, and time of ingestion.
Access to this layer is restricted to prevent unauthorized or incorrect usage.
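To make this hierarchical organization concrete, the following sketch composes a RAW-layer object path from subject area, data source, and ingestion time; the exact ordering and naming of the path segments are assumptions for illustration only.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath

def raw_layer_path(subject_area: str, data_source: str, file_name: str,
                   ingested_at: datetime | None = None) -> PurePosixPath:
    """Compose a RAW-layer key organized by subject area, source, and ingestion time."""
    ts = ingested_at or datetime.now(timezone.utc)
    return PurePosixPath("raw") / subject_area / data_source / ts.strftime("%Y/%m/%d") / file_name

# e.g. raw/finance/sap_fi/2024/01/02/gl_balances.csv
print(raw_layer_path("finance", "sap_fi", "gl_balances.csv",
                     datetime(2024, 1, 2, tzinfo=timezone.utc)))
```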
4.1.2. Normalized Data Layer (Staging - STG)
This layer serves as an intermediary to enhance performance in transferring data from the Raw layer to the Curated layer. The data is loaded in near raw format, primarily used for layout validation, basic data checks and control, housing data that is not yet ready for direct consumption.
Complex files, such as XML or JSON, are not processed in this layer.
To simplify data access, a relational layer is in place: simple file formats, such as CSV, are converted one-to-one into table format, with additional control information and metadata. More complex files are loaded into a table format using specialized column data types, supplemented by the same control information and metadata.
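The sketch below illustrates this staging approach under stated assumptions: simple CSV rows map one-to-one onto staging columns, a complex JSON payload is kept whole in a single column, and both are supplemented with the same control metadata. The column names and the SQLite backend are placeholders.

```python
import csv
import io
import json
import sqlite3
from datetime import datetime, timezone

def stage_csv(conn: sqlite3.Connection, source_file: str, content: str) -> None:
    """One-to-one CSV staging: source columns plus control metadata."""
    conn.execute("""CREATE TABLE IF NOT EXISTS stg_orders
                    (order_id TEXT, amount TEXT, _source_file TEXT, _loaded_at TEXT)""")
    loaded_at = datetime.now(timezone.utc).isoformat()
    with conn:
        for row in csv.DictReader(io.StringIO(content)):
            conn.execute("INSERT INTO stg_orders VALUES (?, ?, ?, ?)",
                         (row["order_id"], row["amount"], source_file, loaded_at))

def stage_json(conn: sqlite3.Connection, source_file: str, content: str) -> None:
    """Complex payloads are staged whole in a single column, to be resolved in curation."""
    conn.execute("""CREATE TABLE IF NOT EXISTS stg_orders_json
                    (payload TEXT, _source_file TEXT, _loaded_at TEXT)""")
    with conn:
        conn.execute("INSERT INTO stg_orders_json VALUES (?, ?, ?)",
                     (json.dumps(json.loads(content)), source_file,
                      datetime.now(timezone.utc).isoformat()))
```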
4.1.3. Curated / Cleansed Data Layer (ODS)
Also referred to as the Conformed Layer, Curated Data is transformed into consumable datasets, structured for specific purposes and possibly partitioned at a more granular level. Cleansing, validation, and normalization may be applied to enhance data quality. Preliminary transformations often take place at this stage to ensure high-quality, consistent data.
Within the Data Curation layer, the data undergoes a series of cleansing and transformation activities. These processes encompass vital steps including normalization, validation, and integration. The primary objectives are to ensure data accuracy, adherence to standards, and alignment with technical prerequisites such as data integrity and quality.
The comprehensive treatment of data involves several tasks. These tasks comprise standardizing dates to a consistent format, aligning string values for uniformity, validating and generating keys for efficient linking, and resolving complex structures, such as those found in JSON or XML hierarchies.
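As an illustration of a few of these curation tasks on a single record, the sketch below standardizes a date, aligns string values, and generates a surrogate key from the business key; the field names and the key-generation scheme are illustrative assumptions rather than agreed conventions.

```python
import hashlib
from datetime import datetime

def standardize_date(value: str) -> str:
    """Accept a few known source formats and emit ISO-8601 dates."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y", "%Y%m%d"):
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {value!r}")

def curate_record(raw: dict) -> dict:
    """Apply normalization, validation, and key generation to one staged record."""
    customer_id = raw["customer_id"].strip().upper()           # align string values
    if not customer_id:
        raise ValueError("customer_id must not be empty")       # basic validation
    return {
        "customer_key": hashlib.sha256(customer_id.encode()).hexdigest()[:16],  # surrogate key
        "customer_id": customer_id,
        "order_date": standardize_date(raw["order_date"]),      # consistent date format
    }

print(curate_record({"customer_id": " c-001 ", "order_date": "02/01/2024"}))
```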
Despite the current absence of Data Quality verification within this stage, as it lies beyond the current scope, the intention remains to integrate these improvements into the Data Curation layer in the foreseeable future. It is worth noting, however, that an independent project is already actively addressing this particular aspect.
For more in-depth information, please refer to the section on Data Curation.
It's noteworthy that in cases where simple file formats like CSV are utilized, they maintain a direct one-to-one correspondence with their respective source data. However, when dealing with intricate file formats, a one-to-many relationship can emerge between the source and their table formats, due to the complexities involved.
Where possible, the curated data follows the same column naming convention as the source data.
In specific scenarios, for particular use cases, operational reporting could be created, when a more agile and expedient solution is required, or when it simplifies complexity and reduces the strain on operational system resources. Importantly, this alternative approach is designed to minimize any potential impact on operational systems.
4.1.4. Use-Case Oriented Layers (Domain/Product)
These specialized layers apply additional business logic to the data. They source data from the Cleansed layer and are enforced with any needed business logic or security measures. Data models to address business analytical requirements are also created here.
4.1.4.1. Use-Case Oriented: Domain (DM)
This layer presents a central enterprise-level repository, structured around specific Domains or subjects. Each Domain represents a key area of business relevance. It acts as an Enterprise Data Warehouse (EDW), housing a subject-oriented, integrated, time-variant, and nonvolatile collection of data that serves as a single version of truth for the organization. It adheres to a data-oriented philosophy: it reflects the true relationships among operational entities, maps business entities closely to operational systems, and remains independent of particular user needs and of changing business requirements.
These models will be structured based on Dimensional Modeling principles in a snowflake flavor. This design does not prioritize performance, usage simplicity, or specific user/business prerequisites; instead, it accentuates data-oriented attributes and facilitates cross-domain analysis. This approach enhances the resilience of the represented entities and the resulting Data Model, while reducing the costs and effort associated with developing and maintaining Data Processes (ETL) and Data Pipelines.
For more in-depth information, please refer to the section on Data Modeling at Domain level.
4.1.4.2. Use-Case Oriented: Data Product (DP)
These are the final insights derived from the architecture. Data Products are subject-specific, optimized for performance and tailored to cater to particular user or business needs.
This layer facilitates informed decision-making and empowers users with actionable insights.
The data model should be aligned with the needs of Data Visualization Teams and adhere to any particular constraints associated with the BI tools employed. It is advisable for the model to be straightforward, easily comprehensible, and designed for performance and for specific user/business requirements.
Each Data Product operates independently, ensuring flexibility and minimizing dependencies.
Data is served through presentation views to the BI tools, providing tailored insights for specific business/user needs.
For more in-depth information, please refer to the section on Data Modeling at DP level.
4.1.4.3. Use-Case Oriented Sandbox
This optional layer is suggested for advanced analysts and data scientists to run experiments. It's a space where they can conduct tests to find patterns, correlations, or validate machine learning models.
It can also serve the purpose of conducting complex analyses for technical Business Analysts who are restricted from directly altering Data Ocean Schemas: no one except the project initiatives within the scope of the Data Ocean implementation is permitted to create, modify, or write to Data Ocean Schemas.
4.2. Domain-Driven Architecture
The data organization within the architecture is designed around nine business domains:
4.2.1. Business Domains
- HR (Human Resources): Focuses on employee data, covering aspects like recruitment, payroll, and performance metrics.
- Procurement: Centralizes data related to vendor management, contracts, and procurement cycles.
- Finance: Manages financial records, including budgets, income, expenses, and other fiscal reports.
- Marketing & Sales: Addresses customer interaction data, sales metrics, and market analysis.
- Supply Chain: Deals with logistical data concerning supply chain management, inventories, and distribution.
- Structure & Shared Domain: Contains data shared across various business units and aspects related to the organizational structure.
- Industrial: Houses data related to manufacturing, equipment health, and quality control.
- R&I (Research & Innovation): Maintains data on R&D projects, patents, and scientific research.
4.2.2. Additional Domains
- Technical Domain: This is where system metadata, context, and technical details are stored.
- Common Domain: For data that is shared across all business units, such as common referential information.
4.2.3. Domain Responsibilities
- Domains are responsible for creating and maintaining quality datasets.
- Each domain must ensure their data meets specified standards such as being discoverable, addressable, and trustworthy.
- Each Domain has an associated Data Architect.
4.2.4. Roles Within Domains
- Data Product Owner: Responsible for consumer satisfaction, quality of the domain datasets, and overall data lifecycle management.
- Data Team: Focused on platform enhancements, monitoring, automation, and alerting.
4.2.5. Value and Benefits
- Centers data acquisition, processing, and serving with domain experts
- Decreases common data pain points like data cleansing and orientation
- Supports the emergence of a data product focus
4.2.6. Capabilities and Infrastructure
The architecture is equipped with:
- Scalable, secure, and governed storage
- Encryption standards
- Metadata management
- Data pipeline orchestration
- Unified Access Control
- Monitoring, alerting, and logging
- Self-service capabilities
4.2.7. Governance and Team Structure
- Aims to reduce duplication of effort across domains
- Provides essential shared services and tools
- Focuses on delivering value while adhering to security and governance protocols
By understanding the layers and the domain-driven approach, you can appreciate how the Data Ocean Architecture enables data to be effectively managed, secured, and leveraged for organizational success.
Domain-Centric Approach
Within each Domain, the architecture embraces a standardized structure, implemented through distinct layers or schemas:
Staging/RAW (STG): Raw Data ingested from various sources is stored here in its original form. This layer is crucial for quick validation and control, housing data that is not yet ready for direct consumption.
Curated (ODS): Data transitions to the Curated Layer, where it's prepared for consumption. Cleansing, validation, and normalization may be applied to enhance data quality.
Domain Models/EDW (DM): The Enterprise Data Warehouse (EDW) embodies this layer, featuring comprehensive data models that adhere to a data-oriented philosophy. These models, structured based on Dimensional Modeling principles, facilitate cross-domain analysis.
Working Data Layer (WDL): This schema accommodates temporary tables and objects that support specific analytical or operational processes.
Data Presentation Layer (DPL): Designed as a pivotal abstraction layer, DPL is realized through views, decoupling data usage from its physical representation. These views offer security, access control, and a consistent data consumption experience.
DS_<DP_Identification>: These schemas ensure isolation and control for Data Products (DPs). They are designed for specific Data Products, granting them a secure and controlled environment to operate within.
Reference Data (RFD): This schema holds common reference data, fostering consistency and shared understanding across the organization.
DS_APP_<Application Code>: This schema accommodates configuration information tailored to specific web applications, ensuring a smooth consumption experience.
Data Access and Usage
The architecture offers carefully orchestrated access to data products:
- Views for Data Access: Data Products interact with data through views, granting controlled access to data in Domains. These views enable column masking, filtering, and structural concealment to ensure data security (see the sketch after this list).
- Domain-Specific Schemas: Security and isolation are upheld by associating each Domain with specific schemas, granting tailored access to technical users and authorized stakeholders.
- DS_DataOcean Schema: Simplifying data access, this schema hosts views that grant a seamless interaction with data, negating the need to provide full table or view names.
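As referenced in the first bullet above, the following sketch creates a presentation view that hides one column entirely and masks another; the table, view, and masking rule are placeholders, and SQLite again stands in for the actual warehouse. The underlying physical table is left unchanged; only the view controls what consumers can see.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dm_hr_employee (employee_id INTEGER, full_name TEXT, salary REAL)")
conn.execute("INSERT INTO dm_hr_employee VALUES (1, 'Ada Lovelace', 70000)")

# Presentation view: structural concealment (no salary column exposed) plus column masking.
conn.execute("""
    CREATE VIEW v_employee_presentation AS
    SELECT employee_id,
           substr(full_name, 1, 1) || '***' AS full_name_masked
    FROM dm_hr_employee
""")

print(conn.execute("SELECT * FROM v_employee_presentation").fetchall())
# -> [(1, 'A***')]
```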
Conclusion
In conclusion, the Data Flow within the Data Ocean Reference Architecture is structured around a multi-layered approach, encompassing distinct phases of data processing, transversal Domains, and domain-centric schemas. This meticulously organized framework ensures data integrity, security, and tailored access, fostering data-driven insights and informed decision-making.

