Introduction

Purpose of the Document

This wiki page describes the components and functions of the Data Ocean Architecture.

Following the structure proposed in the "Reference Architecture," this document explains how the architecture supports a scalable, organized, and secure system for handling the organization's varied data needs.

Target Audience

This document is intended for multiple audiences within the organization, including but not limited to:


2. Data Flow 

2.1. Data Flow Overview

The Data Ocean Reference Architecture is designed to facilitate the smooth movement of data across various layers and components.

This movement follows a structured flow that keeps data available, accessible, and of consistent quality throughout the pipeline, so that information reaches the business units that need it in a timely manner.

The data flow within the Data Ocean Reference Architecture is orchestrated for efficient processing. It emphasizes improving how data is extracted, transformed, and loaded (ETL) and using the available technology to its full potential. Equally important are maintaining control over data, persisting it reliably, and keeping it accessible throughout the entire process.

ETL processes should be as reliably repeatable as possible: running the same step again on the same input should produce the same result. This improves data consistency and dependability at every step.
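One common way to make a load step repeatable is to key every record on a stable business key and overwrite rather than append. The sketch below illustrates that idea only; the function name `apply_batch` and the `id` key are assumptions made for the example, not part of the Data Ocean architecture.

```python
def apply_batch(target, batch, key="id"):
    """Return a copy of `target` with every record in `batch` upserted by `key`.

    `target` maps a business key to a record (hypothetical in-memory stand-in
    for a table). Re-applying the same batch changes nothing, so a failed or
    repeated run can simply be executed again.
    """
    merged = dict(target)            # never mutate the caller's data
    for record in batch:
        merged[record[key]] = record  # overwrite on key match, insert otherwise
    return merged


batch = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]
once = apply_batch({}, batch)
twice = apply_batch(once, batch)     # second run of the same batch
```

Because `once` and `twice` are equal, the step can be retried without producing duplicates.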

An ETL process must never leave data in an inconsistent state and should behave transactionally: it either succeeds completely or fails completely. This matters most for operations on data that is consumed by, or otherwise exposed to, end users. A Business Object must never become empty or inconsistent as a result of ETL activity.
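The all-or-nothing behavior described above can be sketched with a database transaction. The example uses Python's built-in `sqlite3` module purely for illustration; the `customer` table, its columns, and the function `publish_customers` are hypothetical, and the actual Data Ocean platform may rely on a different mechanism.

```python
import sqlite3


def publish_customers(conn, rows):
    """Replace the contents of the (hypothetical) `customer` table atomically.

    Either every row in `rows` is published, or the table keeps its
    previous contents.
    """
    if not rows:
        # Guard against publishing an empty Business Object.
        raise ValueError("refusing to publish an empty business object")
    # Used as a context manager, the connection commits if the block
    # succeeds and rolls back if any statement raises.
    with conn:
        conn.execute("DELETE FROM customer")
        conn.executemany(
            "INSERT INTO customer (id, name) VALUES (?, ?)", rows
        )
```

If any insert fails, for example because of a duplicate key, the `DELETE` is rolled back as well, so consumers never observe an empty or half-loaded table.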


2.2. Key Components of Data Flow

The data flow within the Reference Architecture involves several key components, including:

3. Data Integration: Principles and Best Practices

3.1. What is Data Integration?

Data integration refers to the process of combining data from disparate sources into a unified and coherent view. It involves harmonizing data formats, structures, and semantics to ensure consistent and accurate information is available for analysis, reporting, and decision-making.
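As a minimal illustration of harmonizing formats and structures, the sketch below maps records from two sources onto one shared shape. The source names (`crm`, `erp`), field names, and date formats are invented for the example and do not describe actual Data Ocean sources.

```python
from datetime import date, datetime


def from_crm(rec):
    """Map a hypothetical CRM record onto the unified customer shape."""
    return {
        "customer_id": str(rec["CustomerID"]),
        "name": rec["FullName"].strip(),
        # This assumed source uses day/month/year strings.
        "since": datetime.strptime(rec["Since"], "%d/%m/%Y").date(),
    }


def from_erp(rec):
    """Map a hypothetical ERP record onto the unified customer shape."""
    return {
        "customer_id": str(rec["cust_no"]),
        "name": rec["cust_name"].strip(),
        # This assumed source already uses ISO 8601 dates.
        "since": date.fromisoformat(rec["created"]),
    }


crm_view = from_crm({"CustomerID": 42, "FullName": " Ada Lovelace ", "Since": "15/03/2024"})
erp_view = from_erp({"cust_no": "42", "cust_name": "Ada Lovelace", "created": "2024-03-15"})
```

Both calls yield the same unified record, which is the property downstream analysis and reporting depend on.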

3.2. Goals of Data Integration

The primary goals of data integration in the Data Ocean architecture include:

3.3. Benefits of Effective Data Integration

Effective data integration offers several benefits:

3.4. Best Practices for Data Integration

To achieve successful data integration within the Data Ocean architecture, consider these best practices:

3.5. Data Integration Challenges and Solutions

Challenges in data integration may include data silos, schema differences, and varying data velocities. Solutions include:

2: Data Organization

Data Organization serves as a crucial element of the Reference Architecture, designed to align data processes across various business units effectively.

The architecture's structure adheres to a Domain-Driven approach that aligns well with Solvay Atom's transformation goals and fosters data culture and accountability.

This chapter outlines the core components, the layered architecture, and the nine specific business Domains.

2.1 Architectural Layers

The Data Ocean architecture includes several layers to manage the complexity and demands of a data-driven organization:

2.1.1 Raw Data Layer (Ingestion Layer)

This is the foundational layer where data in its native format is ingested into the architecture. No data transformations occur at this stage, which ensures that the data can serve as a point-in-time archive. The layer is hierarchically organized based on subject areas, data sources, and time of ingestion. Access to this layer is restricted to prevent unauthorized or incorrect usage.
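The hierarchy of subject area, data source, and ingestion time can be sketched as a path-building helper. The `raw/` root, the year/month/day granularity, and the function name `raw_path` are assumptions for illustration; the actual folder convention may differ.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def raw_path(subject_area, source, ingested_at, filename):
    """Build a Raw-layer location: subject area / source / ingestion date / file.

    The file itself is stored unchanged; only its location encodes metadata.
    """
    return (
        PurePosixPath("raw")
        / subject_area
        / source
        / f"{ingested_at:%Y}"
        / f"{ingested_at:%m}"
        / f"{ingested_at:%d}"
        / filename
    )


example = raw_path(
    "finance", "sap", datetime(2024, 3, 15, tzinfo=timezone.utc), "invoices.csv"
)
# str(example) == "raw/finance/sap/2024/03/15/invoices.csv"
```

Partitioning by ingestion date keeps each load separate from the next, which is what allows the layer to act as a point-in-time archive.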

2.1.2 Normalized Data Layer (Staging)

This optional layer serves as an intermediary to enhance performance in transferring data from the Raw layer to the Curated layer. It stores data in an optimized format suitable for data cleansing and possible partitioning to a more granular level.

2.1.3 Cleansed Data Layer (ODS / Curated)

Here, data is transformed into consumable datasets, available as either files or tables. Before reaching this layer, the data undergoes a series of cleansing and transformation activities. This is also the most complex part of the architecture, because data is denormalized and different objects may be consolidated here.

2.1.4 Use-Case Oriented Layers (Domain/Product)

These specialized layers apply additional business logic or machine learning models to the data. They source data from the Cleansed layer and enforce any required business logic and security measures. Data models that address business analytical requirements are also created here.

2.1.5 Use-Case Oriented Sandbox

This optional layer lets advanced analysts and data scientists conduct experiments, such as testing for patterns and correlations or validating machine learning models.
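The layer relationships described in this section can be summarized as a small lookup. This is a sketch under stated assumptions: the names `Layer`, `ALLOWED_SOURCES`, and `can_feed` are invented for illustration, and the Sandbox is left without a source because the text does not say which layer feeds it.

```python
from enum import Enum


class Layer(Enum):
    RAW = "raw"
    NORMALIZED = "normalized"   # optional staging layer
    CLEANSED = "cleansed"       # ODS / Curated
    USE_CASE = "use_case"       # Domain / Product
    SANDBOX = "sandbox"         # optional experimentation layer


OPTIONAL_LAYERS = {Layer.NORMALIZED, Layer.SANDBOX}

# Which layers may feed each target layer, as read from the descriptions above.
ALLOWED_SOURCES = {
    Layer.NORMALIZED: {Layer.RAW},
    # Normalized is optional, so Cleansed may also be fed directly from Raw.
    Layer.CLEANSED: {Layer.RAW, Layer.NORMALIZED},
    Layer.USE_CASE: {Layer.CLEANSED},
}


def can_feed(source, target):
    """Return True if `source` is a documented upstream of `target`."""
    return source in ALLOWED_SOURCES.get(target, set())
```

A check such as `can_feed(Layer.RAW, Layer.USE_CASE)` returns `False`, reflecting that use-case layers read from the Cleansed layer rather than from raw data.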

2.2 Domain-Driven Architecture

The data organization within the architecture is designed around nine business domains:

2.2.1 Business Domains

  1. HR (Human Resources): Focuses on employee data, covering aspects like recruitment, payroll, and performance metrics.
  2. Procurement: Centralizes data related to vendor management, contracts, and procurement cycles.
  3. Finance: Manages financial records, including budgets, income, expenses, and other fiscal reports.
  4. Marketing & Sales: Addresses customer interaction data, sales metrics, and market analysis.
  5. Supply Chain: Deals with logistical data concerning supply chain management, inventories, and distribution.
  6. Structure & Shared Domain: Contains data shared across various business units and aspects related to the organizational structure.
  7. Industrial: Houses data related to manufacturing, equipment health, and quality control.
  8. R&I (Research & Innovation): Maintains data on R&D projects, patents, and scientific research.

2.2.2 Additional Domains

2.2.3 Domain Responsibilities

2.2.4 Roles Within Domains

2.2.5 Value and Benefits

2.2.6 Capabilities and Infrastructure

The architecture is equipped with:

2.2.7 Governance and Team Structure

By understanding the layers and the domain-driven approach, you can appreciate how the Data Ocean Architecture enables data to be effectively managed, secured, and leveraged for organizational success.