1. Introduction

This page provides a comprehensive view of the reference architecture of the Data Ocean solution. It offers insights into the high-level block architecture diagram, the key components involved, and their interactions.

Understanding the reference architecture is crucial for gaining a holistic view of how the Data Ocean operates and supports the organization's data analytics initiatives.

2. Overview of the Data Ocean Reference Architecture

2.1. What is Reference Architecture?

Reference Architecture serves as a blueprint that outlines the structure and components of a system or solution.

In the context of the Data Ocean, the reference architecture provides a bird's eye view of the system's design and the relationships between its various components.

2.2. Benefits of Understanding the Reference Architecture

Understanding the reference architecture of the Data Ocean is vital for several reasons:

2.2.1. Benefits

Implementing the Reference Architecture offers numerous benefits for organizations:

3. High-Level Block Architecture Diagram

The high-level block architecture diagram (Figure) provides an overview of the Data Ocean's key components and their interactions.

It showcases the major building blocks of the system and illustrates how data flows through the various stages of ingestion, processing, and serving.

Fig: The Data Ocean vision is materialized on the Data Platform validated in the Apollo Project.

4. Key Components of the Data Ocean Solution

4.1. Data Sources (Outside of the Reference Architecture)

Data sources in a corporate data and analytics solution primarily comprise operational applications that directly support the business. These sources can include internal systems like SAP and CRM systems, as well as internal files and other systems.

Additionally, external databases, websites (including web scraping), APIs, JSON and XML files, and other internal or external files may also serve as data sources within this context.

Business analysis and source system analysis play a crucial role in the success of the Data Ocean solution.

Combined, business analysis and source system analysis empower the organization to align its data requirements with business needs and optimize data integration into the Data Ocean.

4.2. Data Consumers (Outside of the Reference Architecture)

Data consumers refer to the various BI tools, dashboard applications, and analytic applications that utilize the data within the Data Ocean.

These tools are outside the scope of the reference architecture but play a crucial role in data consumption and analysis.

The inclusion of a semantic layer could significantly enhance data adoption by providing a unified and standardized view of data across the organization.

4.3. Data Capturing

The Data Capturing and Ingestion block is a crucial component of the Data Ocean solution, responsible for collecting and ingesting data from diverse sources into the system.

This process involves extracting data from source systems, managing files, and loading them onto Cloud Storage for efficient storage management. It plays a vital role in ensuring the availability and accessibility of data within the Data Ocean.

It involves two primary approaches: batch processing and streaming processing.
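The extract, file management, and load steps described above can be sketched in a few lines. This is a minimal illustration only: an in-memory SQLite database stands in for a source system, a local directory stands in for Cloud Storage, and the `source/dataset/load_date=...` folder layout and the function names are assumptions made for this example, not part of the reference architecture.

```python
import csv
import sqlite3
from datetime import date
from pathlib import Path


def landing_path(root: Path, source: str, dataset: str, load_date: date) -> Path:
    """Build a date-partitioned landing path (hypothetical layout)."""
    partition = f"load_date={load_date.isoformat()}"
    return root / source / dataset / partition / f"{dataset}.csv"


def extract_to_landing(conn: sqlite3.Connection, table: str, root: Path,
                       source: str, load_date: date) -> Path:
    """Extract every row of `table` and land it as a CSV file under `root`."""
    # Table names cannot be bound as SQL parameters, so accept plain identifiers only.
    if not table.isidentifier():
        raise ValueError(f"unsafe table name: {table!r}")
    cursor = conn.execute(f"SELECT * FROM {table}")
    header = [column[0] for column in cursor.description]

    target = landing_path(root, source, table, load_date)
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)   # keep column names with the extract
        writer.writerows(cursor)  # stream rows straight from the source cursor
    return target
```

Partitioning the landing area by source, dataset, and load date keeps each extract separate, which makes it easier to reload or audit a single day without touching the others.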

4.3.1. Batch Processing

In a traditional company, batch processing is often the more common approach for data capturing. It involves collecting and processing data in large volumes at scheduled intervals.

Batch processing is well-suited for scenarios where data can be collected over a period of time and doesn't require real-time analysis.

Use cases for batch processing in a traditional business might involve analyzing sales data, customer demographics, inventory levels, or financial transactions. These use cases often rely on historical data and trends to inform strategic decision-making, as immediate insights are not as critical.

Examples of batch data sources include end-of-day extracts, internal files, and relational databases.

Batch processing offers numerous benefits, including:

Overall, batch processing offers a cost-effective, scalable, and efficient approach for handling large volumes of data, simplifying data integration, optimizing resource usage, managing extraction windows, and supporting failure recovery.
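As a rough illustration of a scheduled batch run with an extraction window and safe recovery, the sketch below aggregates sales for one window and publishes the result keyed by that window. The record fields (`product`, `amount`, `sold_at`), the function names, and the dictionary used as a result store are all assumptions made for this example.

```python
from collections import defaultdict
from datetime import datetime


def run_batch(records, window_start: datetime, window_end: datetime) -> dict:
    """Total sales per product for records inside [window_start, window_end)."""
    totals = defaultdict(float)
    for record in records:
        # The half-open window avoids counting a midnight record in two batches.
        if window_start <= record["sold_at"] < window_end:
            totals[record["product"]] += record["amount"]
    return dict(totals)


def publish(store: dict, window_start: datetime, totals: dict) -> None:
    """Overwrite the result for this window so a rerun after a failure is safe."""
    store[window_start.date().isoformat()] = totals
```

Because `publish` replaces the result for a window instead of appending to it, rerunning a failed or partial batch for the same window leaves the store in the same state as a single successful run.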

4.3.2. Streaming Processing

While batch processing is common in traditional organizations, streaming processing has gained popularity with the rise of real-time data analytics and the Internet of Things (IoT).

Streaming processing is a data processing approach that involves capturing and analyzing data in real-time or near real-time as it is generated. This method is well-suited for situations that require immediate insights and responses, such as real-time monitoring, fraud detection, or predictive maintenance.

Streaming processing is particularly beneficial for applications that require monitoring and control of production lines and the factory floor, enabling timely actions and optimizations. It allows for the continuous analysis of data streams, facilitating rapid decision-making and proactive measures in industrial environments.

Streaming processing offers several advantages, including:

Examples of streaming data sources include IoT sensors, social media feeds, clickstream data, and real-time transaction data. In a traditional business, streaming processing might be applied to monitor production lines, track supply chain logistics, or identify predictive maintenance needs in real time.
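To make the contrast with batch processing concrete, the following sketch processes sensor readings one at a time and raises an alert as soon as the rolling average crosses a threshold. The window size, the threshold, and the function name are assumptions chosen for illustration; a production setup would typically read from a message broker and not from a Python iterable.

```python
from collections import deque
from typing import Iterable, Iterator, Tuple


def rolling_alerts(readings: Iterable[float], window: int,
                   threshold: float) -> Iterator[Tuple[int, float]]:
    """Yield (index, rolling_mean) whenever the mean of the last `window`
    readings exceeds `threshold`."""
    recent = deque(maxlen=window)  # old readings fall out automatically
    for index, value in enumerate(readings):
        recent.append(value)
        if len(recent) == window:
            mean = sum(recent) / window
            if mean > threshold:
                yield index, mean
```

The function is a generator, so each reading is evaluated as it arrives and only the last `window` readings are held in memory, regardless of how long the stream runs.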

It's important to note that while streaming processing offers real-time insights, not all business processes and teams require this level of immediacy. 

For many businesses, end-of-day extracts and batch processing provide sufficient data. This approach is particularly useful for monitoring long-term trends: by analyzing data in batches, organizations gain insight into overall performance over time and can adjust their long-term strategies and goals accordingly.

The choice between batch and streaming processing depends on the specific business needs and the importance of real-time insights in driving decision-making processes.

4.4. Lake House

The Lake House includes the following components:

4.4.1. Storage

The Data Storage block in the Data Ocean serves as a repository for housing the raw data captured from various sources. It preserves the data in its original format (before it is integrated and transformed), enabling future integration and transformation. With its scalable storage capacity, it can accommodate and handle large volumes of raw data while ensuring data fidelity and security.

The raw data stored in the Data Storage block can come from various sources, such as operational systems, external data feeds, APIs, files, or streaming data sources, and it may include structured, unstructured, and semi-structured data.
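One simple way to preserve raw data "in its original format" while protecting its fidelity is to store the bytes unchanged and keep a checksum beside them. The sketch below assumes a local directory in place of cloud storage; the manifest file name and its fields are hypothetical conventions for this example.

```python
import hashlib
import json
from pathlib import Path


def store_raw(root: Path, name: str, payload: bytes) -> dict:
    """Write the payload unchanged and record a SHA-256 digest next to it."""
    root.mkdir(parents=True, exist_ok=True)
    (root / name).write_bytes(payload)  # no parsing, no transformation
    manifest = {
        "name": name,
        "size": len(payload),
        "sha256": hashlib.sha256(payload).hexdigest(),
    }
    (root / f"{name}.manifest.json").write_text(json.dumps(manifest), encoding="utf-8")
    return manifest


def verify_raw(root: Path, name: str) -> bool:
    """Return True when the stored bytes still match the recorded digest."""
    manifest = json.loads((root / f"{name}.manifest.json").read_text(encoding="utf-8"))
    actual = hashlib.sha256((root / name).read_bytes()).hexdigest()
    return actual == manifest["sha256"]
```

Because the payload is handled as opaque bytes, the same two functions work for structured, semi-structured, and unstructured inputs alike.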

The Data Storage block is a crucial component within the Data Ocean solution: it maintains raw data integrity and availability throughout the entire data lifecycle. Its primary function is to securely store the raw data, making it easily accessible and ready for subsequent processing and analysis. Additionally, cloud-based storage supports the Data Lakehouse approach, which combines the benefits of data lakes and data warehouses and allows the stored data to be analyzed directly, without extensive transformation or pre-defined schemas. This flexibility and scalability let organizations uncover hidden patterns in their data and make data-driven decisions.

In the context of the Data Ocean, it is essential to recognize that after the raw data is stored in the Data Storage block, it undergoes subsequent processing stages. These stages, which take place in separate components or blocks within the Data Ocean solution, include data integration, transformation, and normalization. These processes refine the raw data, ensuring its quality and consistency, and prepare it for further analysis and consumption. By going through these subsequent stages, the data becomes more structured and suitable for effective analysis and utilization within the Data Ocean framework.

The storage component involves the management and organization of data within the Data Ocean. It encompasses the following:

4.4.2. Curation

The curation component of the Data Ocean solution encompasses various activities aimed at transforming, enriching, and preparing raw data for further analysis.

It includes the following key elements:

By incorporating these curation activities, the Data Ocean aims to ensure that the data is of high quality, reliable, and well-prepared for subsequent analysis and decision-making processes.

For more detail, please read the Data Curation chapter.
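The specific curation elements are covered in the Data Curation chapter and are not listed here, so the sketch below only illustrates the general idea with some commonly used steps: standardizing values, casting types, removing duplicates, and setting aside rows that fail validation. The field names and rules are assumptions made for this example.

```python
from datetime import date
from typing import Iterable, List, Tuple


def curate(raw_rows: Iterable[dict]) -> Tuple[List[dict], List[dict]]:
    """Split raw rows into curated rows and rejected rows."""
    curated, rejected, seen = [], [], set()
    for row in raw_rows:
        try:
            clean = {
                "customer_id": row["customer_id"].strip().upper(),          # standardize
                "order_date": date.fromisoformat(row["order_date"].strip()),  # cast
                "amount": round(float(row["amount"]), 2),                   # cast
            }
        except (KeyError, ValueError, TypeError, AttributeError):
            rejected.append(row)  # keep the original row for later inspection
            continue
        key = (clean["customer_id"], clean["order_date"])
        if not clean["customer_id"] or key in seen:
            rejected.append(row)  # empty identifier or duplicate
            continue
        seen.add(key)
        curated.append(clean)
    return curated, rejected
```

Returning the rejects alongside the curated rows, instead of silently dropping them, keeps data quality issues visible and traceable back to the raw input.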

4.4.3. Provisioning

The provisioning component in the Data Ocean architecture focuses on ensuring the accessibility and usage of integrated, curated, and consumption-ready data.

It takes a use case-driven approach, allowing the organization to tailor data provisioning strategies to meet specific needs, requirements and objectives. This includes exposing optimized data structures and processing methods that are tailored to specific analytical needs or use cases. This approach facilitates efficient data consumption, exploration, and analysis, allowing the organization to address diverse business challenges and make data-driven decisions.

The generic use cases of provisioning include batch processing, real-time processing, and optimized data storage.

4.4.3.1. Domains and Data Products

The Data Ocean architecture includes pre-determined provisions for two distinct use cases: the Domain data layer and the Data Product.

4.4.3.1.1. Domain

The Domain data layer in the Data Ocean architecture is a centralized, reliable, and authoritative data source that is subject-oriented, data-oriented, integrated, time-variant, and nonvolatile. As a foundational layer, it ensures data consistency, reliability, and governance across the organization.

This domain-specific approach enables structured analysis, decision-making, and reporting, making it ideal for standardized and repeatable processes.

It adheres to the internal organizational structure and principles of Domain-Driven Design (DDD).

Major characteristics: 

In summary, the Domain data layer in the Data Ocean architecture serves as a centralized, reliable, and authoritative data source, aligning with the internal organization and adhering to the principles of Domain-Driven Design. It ensures that data related to each domain is consolidated, promotes a shared understanding of the data, and facilitates effective data management and integration within the organization.
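Two of the properties named above, time-variant and nonvolatile, can be illustrated with an append-only history: new versions of a record are added with a validity date, existing versions are never updated or deleted, and any past state can be queried. The class and method names below are hypothetical and shown only as a minimal sketch, not as the Data Ocean's actual data model.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Version:
    valid_from: date
    attributes: dict


class DomainHistory:
    """Append-only store: versions are added, never updated or deleted."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[Version]] = {}

    def record(self, key: str, valid_from: date, attributes: dict) -> None:
        versions = self._versions.setdefault(key, [])
        if versions and valid_from <= versions[-1].valid_from:
            raise ValueError("versions must be appended in chronological order")
        versions.append(Version(valid_from, dict(attributes)))

    def as_of(self, key: str, when: date) -> Optional[dict]:
        """Return the attributes that were valid on `when`, or None."""
        current = None
        for version in self._versions.get(key, []):
            if version.valid_from <= when:
                current = version.attributes
        return dict(current) if current is not None else None
```

For example, after recording a customer as `retail` from 2023 and `wholesale` from 2024, `as_of` returns `retail` for any date in 2023, so reports over past periods remain reproducible.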

4.4.3.1.2. Data Product

On the other hand, the Data Product use case emphasizes a more exploratory and iterative approach to data exploration and analysis. It is driven by user-defined requirements and focuses on delivering specific insights and solutions tailored to the needs of different stakeholders.

Data Products are designed to be more flexible and adaptable, accommodating evolving business needs and user preferences. They may be more volatile in nature, depending on the continuous interest and relevance of the insights they provide, and can be decommissioned when they no longer serve their purpose.

Data Products should focus more on performance, simplicity, and user accessibility.
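As a hedged illustration of these traits, the sketch below derives a small, flat, pre-aggregated Data Product from domain records and lets it be decommissioned once it is no longer needed. The product name, its fields, and the `decommission` method are assumptions for this example only.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DataProduct:
    name: str
    owner: str
    rows: List[dict] = field(default_factory=list)
    active: bool = True

    def decommission(self) -> None:
        """Retire the product; the domain data it was built from is untouched."""
        self.active = False
        self.rows.clear()


def build_revenue_by_segment(domain_orders: List[dict], owner: str) -> DataProduct:
    """Pre-aggregate domain orders into a flat, easy-to-query shape."""
    totals: Dict[str, float] = defaultdict(float)
    for order in domain_orders:
        totals[order["segment"]] += order["amount"]
    rows = [{"segment": segment, "revenue": total}
            for segment, total in sorted(totals.items())]
    return DataProduct("revenue_by_segment", owner, rows)
```

The product holds only derived rows, so it can be rebuilt or retired at any time without affecting the authoritative Domain data layer.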

4.4.3.2. Conclusion

Data Provisioning is about building the Data Models to support the Domain and the Data Products.

It includes:

By incorporating both the Domain data layer and the Data Product use cases, the provisioning component of the Data Ocean architecture provides a comprehensive solution that meets the diverse data management needs of the organization. It enables a centralized, reliable, and authoritative data layer for structured analysis and decision-making while also facilitating the development of agile and adaptable data products that support exploratory and iterative approaches to uncovering insights and driving innovation.


-------------


The design patterns, organization, and standards enforced by the Lake House Architecture are crucial for achieving the desired scalability, reliability, and performance of the Data Ocean solution. By following these guidelines, organizations can ensure a solid foundation for their data management processes, enabling seamless data integration, advanced analytics, and data-driven decision-making.

Additionally, the Lake House Architecture addresses important aspects such as data quality and data security. Through its standardized processes and governance mechanisms, the architecture ensures that data is validated, cleansed, and secured, minimizing the risk of errors, inconsistencies, and unauthorized access.

Overall, the Reference Architecture provides a comprehensive blueprint for organizations to establish a robust and scalable data management solution. By adhering to the design patterns, organization, and standards set forth by the architecture, organizations can unlock the full potential of the Data Ocean and leverage data as a strategic asset for driving business growth and innovation.





---


Data Ingestion Layer

The data ingestion layer (Figure 1) consists of components responsible for collecting data from diverse sources and bringing it into the Data Ocean. It includes connectors, data pipelines, and integration tools that facilitate data acquisition and transformation.


Standardization and clear guidelines are essential for the success of the Lake House Architecture.

By establishing design patterns, organizational structure, and standards, the architecture ensures that the data solution is maintainable, scalable, optimized, well-governed, easily accessible, and leveraged for organizational advantage.

The architecture encompasses three key components: Curation, Storage, and Provisioning. By following the guidelines and best practices outlined in this reference architecture, projects and initiatives can ensure the success of their Data Ocean implementation.


Conclusion

The Reference Architecture provides organizations with a comprehensive and scalable framework for building their Data Ocean solution. By following the guidelines and best practices outlined in this reference architecture, organizations can ensure data quality, security, and scalability, while enabling advanced analytics and data-driven decision-making. The architecture's modular and flexible nature allows for customization and adaptation to meet specific business requirements.