
Introduction

Purpose of the Document

This wiki page describes the components and functionality of the Data Ocean Architecture.

Following the structure proposed in the "Reference Architecture," it explains how the architecture supports a scalable, organized, and secure system for handling the organization's varied data needs.

Target Audience

This document is intended for multiple audiences within the organization, including but not limited to:

Data Engineers: To understand the workflow and where they contribute to the architecture.
Data Scientists: To understand how to access and interact with the data for analytical purposes.
Business Intelligence Analysts: To know how the data flows and where they can extract the information they need for reports and dashboards.
Data Architects: To maintain the overall structure and integrity of the data environment, for which they are responsible.
Technical Business Users: To learn what data is available, how to access it, and under what conditions.
Data Governance Teams: To ensure that the organization of data aligns with company policies and standards.

2. Data Flow

2.1. Data Flow Overview

The Data Ocean Reference Architecture is designed to facilitate the smooth movement of data across various layers and components.

This movement follows a structured flow that ensures data availability, quality, and accessibility throughout the data pipeline, so that information reaches the organization's diverse business consumers on time.

The Data Flow within the Data Ocean Reference Architecture is orchestrated for efficient data processing. It emphasizes improving how data is extracted, transformed, and loaded (ETL) and using the available technology to its full potential. It also stresses maintaining control over data, persisting it reliably, and keeping it accessible throughout the entire process.

ETL processes should be as repeatable as possible: running the same process again on the same input should produce the same result. This improves data consistency and dependability at every step.
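As an illustration of a repeatable load, here is a minimal sketch using Python's built-in `sqlite3` module. The `customers` table, its columns, and the `load_customers` helper are hypothetical and are not part of the Data Ocean platform; the point is only that a load keyed on a primary key can be re-run without creating duplicates.

```python
import sqlite3

def load_customers(conn, rows):
    """Repeatable load: running it again with the same rows changes nothing."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customers (id INTEGER PRIMARY KEY, name TEXT)"
    )
    # INSERT OR REPLACE keys on the primary key, so a re-run overwrites
    # existing rows instead of appending duplicates.
    conn.executemany(
        "INSERT OR REPLACE INTO customers (id, name) VALUES (?, ?)", rows
    )
    conn.commit()

conn = sqlite3.connect(":memory:")  # hypothetical in-memory target
batch = [(1, "Acme"), (2, "Globex")]
load_customers(conn, batch)
load_customers(conn, batch)  # second run of the same batch
row_count = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
print(row_count)
```

A plain `INSERT` in the same position would fail or duplicate rows on the second run, which is the behavior a repeatable process is meant to avoid.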

An ETL process must never leave data in an inconsistent state. It should behave transactionally: either it succeeds completely or it fails entirely. This matters most for operations on data that is consumed by, or otherwise exposed to, end users. A Business Object must never become empty or inconsistent as a result of ETL activity.
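The all-or-nothing behavior can be sketched with a database transaction. The example below is an assumption-laden illustration, not the platform's actual mechanism: the `sales_summary` table and the `refresh_business_object` helper are hypothetical, and `sqlite3` stands in for whatever store holds the Business Object.

```python
import sqlite3

def refresh_business_object(conn, new_rows):
    """Replace the table contents in one transaction (all or nothing)."""
    # `with conn` commits if the block succeeds and rolls back if it raises,
    # so the table is never left empty or half-loaded after a failure.
    with conn:
        conn.execute("DELETE FROM sales_summary")
        conn.executemany(
            "INSERT INTO sales_summary (region, total) VALUES (?, ?)", new_rows
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_summary (region TEXT NOT NULL, total REAL NOT NULL)")
conn.execute("INSERT INTO sales_summary VALUES ('EMEA', 100.0)")
conn.commit()

try:
    # The second row violates NOT NULL, so the whole refresh is rolled back,
    # including the DELETE that preceded it.
    refresh_business_object(conn, [("APAC", 50.0), ("AMER", None)])
except sqlite3.IntegrityError:
    pass

surviving = conn.execute("SELECT region, total FROM sales_summary").fetchall()
print(surviving)
```

After the failed refresh, the previously published row is still in place, which is the guarantee the paragraph above asks for.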

[Figure: Data Ocean Architecture (slide screenshot)]

2.2. Key Components of Data Flow

The data flow within the Reference Architecture involves several key components, including:

Data Sources: Both structured and unstructured data from various domains are extracted for processing.
Data Capturing: Data is captured through batch and streaming processes.
Lake House Architecture: Data undergoes storage, curation, and provisioning.
Data Science and Machine Learning: Analytical processes are conducted on the data.
Data Management: Data undergoes cataloging, validation, and orchestration.
Operations: Data security, workload management, environment management, and monitoring are applied.
Data Consumers: Data is accessed by BI tools and portals.

3. Data Integration: Principles and Best Practices

3.1. What is Data Integration?

Data integration refers to the process of combining data from disparate sources into a unified and coherent view. It involves harmonizing data formats, structures, and semantics to ensure consistent and accurate information is available for analysis, reporting, and decision-making.
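As a small illustration of harmonizing formats, structures, and semantics, the sketch below maps two differently shaped sources onto one target schema. The source names (`crm`, `erp`), the field names, and the target schema are all hypothetical.

```python
from datetime import date, datetime

# Two hypothetical sources describing the same kind of entity differently.
crm_rows = [{"CustomerID": "17", "Name": "Acme", "Created": "2023-08-29"}]
erp_rows = [{"cust_no": 42, "cust_name": "GLOBEX ", "created_on": "29/08/2023"}]

def from_crm(row):
    """Map a CRM record to the unified schema."""
    return {
        "customer_id": int(row["CustomerID"]),
        "name": row["Name"].strip().title(),
        "created": date.fromisoformat(row["Created"]),
        "source": "crm",
    }

def from_erp(row):
    """Map an ERP record to the unified schema."""
    return {
        "customer_id": int(row["cust_no"]),
        "name": row["cust_name"].strip().title(),
        # The ERP uses day/month/year, so it is parsed explicitly.
        "created": datetime.strptime(row["created_on"], "%d/%m/%Y").date(),
        "source": "erp",
    }

unified = [from_crm(r) for r in crm_rows] + [from_erp(r) for r in erp_rows]
for record in unified:
    print(record)
```

Both records end up with the same keys, the same types, and the same date representation, so downstream consumers can treat them as one dataset.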

3.2. Goals of Data Integration

The primary goals of data integration in the Data Ocean architecture include:

Ensuring data consistency and accuracy across different sources.
Facilitating seamless data movement between various layers.
Providing a unified view of data for efficient analysis.
Reducing data redundancy and ensuring a single source of truth.

3.3. Benefits of Effective Data Integration

Effective data integration offers several benefits:

Enhanced data quality and reliability.
Improved decision-making based on accurate insights.
Increased operational efficiency through streamlined data processes.
Faster access to integrated data, saving time and effort.

3.4. Best Practices for Data Integration

To achieve successful data integration within the Data Ocean architecture, consider these best practices:

Define clear data integration goals and objectives.
Establish data governance to ensure data consistency and quality.
Utilize standardized data formats and semantics.
Implement data transformation processes to harmonize data.
Leverage automation tools and platforms for efficient integration.
Continuously monitor and validate integrated data for accuracy.
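The last practice, validating integrated data, could look like the following minimal sketch. The `validate` function and its two rules (required fields and a unique key) are illustrative assumptions; a real deployment would likely rely on a dedicated data quality tool.

```python
def validate(rows, required, unique_key):
    """Return a list of problems; an empty list means the batch passed."""
    problems = []
    seen = set()
    for i, row in enumerate(rows):
        # Rule 1: every required field must be present and non-empty.
        for field in required:
            if row.get(field) in (None, ""):
                problems.append(f"row {i}: missing {field}")
        # Rule 2: the key must not repeat within the batch.
        key = row.get(unique_key)
        if key in seen:
            problems.append(f"row {i}: duplicate {unique_key}={key}")
        seen.add(key)
    return problems

rows = [
    {"customer_id": 1, "name": "Acme"},
    {"customer_id": 1, "name": ""},
]
issues = validate(rows, required=["customer_id", "name"], unique_key="customer_id")
print(issues)
```

Returning the full list of problems, rather than stopping at the first one, makes the result usable for monitoring as well as for gating a load.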

3.5. Data Integration Challenges and Solutions

Challenges in data integration may include data silos, schema differences, and varying data velocities. Solutions include:

Creating a data integration strategy and roadmap.
Implementing data virtualization to access data without physically moving it.
Adopting data cataloging and metadata management for better data discovery.

2: Data Organization

Data Organization serves as a crucial element of the Reference Architecture, designed to align data processes across various business units effectively.

The architecture's structure adheres to a Domain-Driven approach that aligns well with Solvay Atom's transformation goals and fosters data culture and accountability.

This chapter outlines the core components, the layered architecture, and the nine specific business Domains.

[Figure: Data Ocean Architecture (slide screenshot)]

2.1 Architectural Layers

The Data Ocean architecture includes several layers to manage the complexity and demands of a data-driven organization:

2.1.1 Raw Data Layer (Ingestion Layer)

This is the foundational layer where data in its native format is ingested into the architecture. No data transformations occur at this stage, which ensures that the data can serve as a point-in-time archive. The layer is hierarchically organized based on subject areas, data sources, and time of ingestion. Access to this layer is restricted to prevent unauthorized or incorrect usage.
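The hierarchical organization could be expressed as a path convention. The sketch below is one possible layout, not the documented one: the `raw` root, the `year/month/day` partitioning, and the example names (`finance`, `sap`, `invoices.csv`) are assumptions.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath

def raw_path(subject_area, source, ingested_at, filename):
    """Build a raw-layer path: subject area / source / ingestion date / file."""
    return PurePosixPath(
        "raw",
        subject_area,
        source,
        f"{ingested_at:%Y/%m/%d}",  # time of ingestion, one folder per day
        filename,
    )

path = raw_path(
    "finance", "sap", datetime(2023, 8, 29, 9, 5, tzinfo=timezone.utc), "invoices.csv"
)
print(path)
```

Because each ingestion lands in its own dated folder and files are never rewritten, the layer can act as the point-in-time archive described above.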

2.1.2 Normalized Data Layer (Staging)

This optional layer serves as an intermediary to enhance performance in transferring data from the Raw layer to the Curated layer. It stores data in an optimized format suitable for data cleansing and possible partitioning to a more granular level.

2.1.3 Cleansed Data Layer (ODS / Curated)

Here, data is transformed into consumable datasets, available either in files or tables. Before reaching this layer, the data undergoes a series of cleansing and transformation activities. It's also the most complex part of the architecture, as data is denormalized and different objects may be consolidated here.
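To make the cleansing and denormalization step concrete, here is a minimal sketch that joins orders with customer attributes into one flat, consumable record. The datasets, the field names, and the choice to set aside unmatched orders are hypothetical.

```python
customers = {1: {"name": " acme ", "country": "be"}}
orders = [
    {"order_id": 10, "customer_id": 1, "amount": "125.50"},
    {"order_id": 11, "customer_id": 99, "amount": "20.00"},  # unknown customer
]

def curate(orders, customers):
    """Cleanse values and denormalize each order with its customer attributes."""
    curated, rejected = [], []
    for order in orders:
        customer = customers.get(order["customer_id"])
        if customer is None:
            # Keep rejects for review instead of silently dropping them.
            rejected.append(order)
            continue
        curated.append({
            "order_id": order["order_id"],
            "amount": float(order["amount"]),                    # text to number
            "customer_name": customer["name"].strip().title(),   # trim and normalize case
            "customer_country": customer["country"].upper(),
        })
    return curated, rejected

curated, rejected = curate(orders, customers)
print(curated)
print(rejected)
```

The curated record carries the customer attributes inline, so a consumer does not need to perform the join again.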

2.1.4 Use-Case Oriented Layers (Domain/Product)

These specialized layers apply additional business logic or machine learning models to the data. They source data from the Cleansed layer and enforce any required business rules and security measures. Data models that address business analytical requirements are also created here.

2.1.5 Use-Case Oriented Sandbox

This optional layer is for advanced analysts and data scientists to conduct experiments. Here, they can run tests to find patterns and correlations or to validate machine learning models.

2.2 Domain-Driven Architecture

The data organization within the architecture is designed around nine business domains:

2.2.1 Business Domains

HR (Human Resources): Focuses on employee data, covering aspects like recruitment, payroll, and performance metrics.
Procurement: Centralizes data related to vendor management, contracts, and procurement cycles.
Finance: Manages financial records, including budgets, income, expenses, and other fiscal reports.
Marketing & Sales: Addresses customer interaction data, sales metrics, and market analysis.
Supply Chain: Deals with logistical data concerning supply chain management, inventories, and distribution.
Structure & Shared Domain: Contains data shared across various business units and aspects related to the organizational structure.
Industrial: Houses data related to manufacturing, equipment health, and quality control.
R&I (Research & Innovation): Maintains data on R&D projects, patents, and scientific research.

2.2.2 Additional Domains

Technical Domain: This is where system metadata, context, and technical details are stored.
Common Domain: For data that is shared across all business units, such as common referential information.

2.2.3 Domain Responsibilities

Domains are responsible for creating and maintaining quality datasets.
Each domain must ensure its data meets specified standards such as being discoverable, addressable, and trustworthy.
Each domain has an associated Data Architect.
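One way a domain might check the discoverable, addressable, and trustworthy standards is with a dataset descriptor. The sketch below is purely illustrative: the `DatasetDescriptor` class, its fields, and the rule that maps each field to a standard are assumptions, not definitions taken from the Reference Architecture.

```python
from dataclasses import dataclass, field

@dataclass
class DatasetDescriptor:
    """Hypothetical metadata a domain publishes alongside a dataset."""
    name: str
    domain: str
    address: str = ""        # where consumers read it (addressable)
    description: str = ""    # what a catalog search would show (discoverable)
    owner: str = ""          # who answers for its quality (trustworthy)
    quality_checks: list = field(default_factory=list)

    def unmet_standards(self):
        """List the standards this descriptor does not yet satisfy."""
        unmet = []
        if not self.description:
            unmet.append("discoverable")
        if not self.address:
            unmet.append("addressable")
        if not (self.owner and self.quality_checks):
            unmet.append("trustworthy")
        return unmet

draft = DatasetDescriptor(
    name="invoices", domain="Finance", address="curated/finance/invoices"
)
gaps = draft.unmet_standards()
print(gaps)
```

A check like this could run before a dataset is published, so that gaps are reported to the domain rather than discovered by consumers.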

2.2.4 Roles Within Domains

Data Product Owner: Responsible for consumer satisfaction, quality of the domain datasets, and overall data lifecycle management.
Data Team: Focused on platform enhancements, monitoring, automation, and alerting.

2.2.5 Value and Benefits

Centers data acquisition, processing, and serving with domain experts
Decreases common data pain points like data cleansing and orientation
Supports the emergence of a data product focus

2.2.6 Capabilities and Infrastructure

The architecture is equipped with:

Scalable, secure, and governed storage
Encryption standards
Metadata management
Data pipeline orchestration
Unified Access Control
Monitoring, alerting, and logging
Self-service capabilities

2.2.7 Governance and Team Structure

Aims to reduce duplication of effort across domains
Provides essential shared services and tools
Focuses on delivering value while adhering to security and governance protocols

By understanding the layers and the domain-driven approach, you can appreciate how the Data Ocean Architecture enables data to be effectively managed, secured, and leveraged for organizational success.