1. Introduction
This page provides a comprehensive view of the reference architecture of the Data Ocean solution. It offers insights into the high-level block architecture diagram, the key components involved, and their interactions.
Understanding the reference architecture is crucial for gaining a holistic understanding of how the Data Ocean operates and supports the company's data analytics initiatives.
2. Overview of the Data Ocean Reference Architecture
2.1. What is Reference Architecture?
Reference Architecture serves as a blueprint that outlines the structure and components of a system or solution.
In the context of the Data Ocean, the reference architecture provides a bird's eye view of the system's design and the relationships between its various components.
2.2. Benefits of Understanding the Reference Architecture
Understanding and implementing the Reference Architecture offers numerous benefits for the company.
3. High-Level Block Architecture Diagram
The high-level block architecture diagram (see the figure below) provides an overview of the Data Ocean's key components and their interactions.
It showcases the major building blocks of the system and illustrates how data flows through the various stages of ingestion, processing, and serving.
Fig: The Data Ocean vision is materialised on the Data Platform validated in the Apollo Project.
4. Key Components of the Data Ocean Solution
4.2. Data Consumers (Outside of Reference Architecture)
4.3. Data Capturing
4.3.1. Batch Processing
In a traditional company, batch processing is often the more common approach for data capturing. It involves collecting and processing data in large volumes at scheduled intervals.
Batch processing is well-suited for scenarios where data can be collected over a period of time and doesn't require real-time analysis.
Overall, batch processing offers a cost-effective, scalable, and efficient approach for handling large volumes of data, simplifying data integration, optimising resource usage, managing extraction windows, and supporting failure recovery.
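The scheduled-interval pattern described above can be sketched as a single pass over a window of collected records. This is a minimal illustration, not the platform's actual ingestion job; the CSV fields and aggregation are assumptions made for the example.

```python
import csv
import io

def run_batch(raw_rows):
    """Process one scheduled batch of raw CSV rows in a single pass.

    `raw_rows` stands in for the files an extraction window would
    collect; a real pipeline would read them from a landing zone.
    """
    reader = csv.DictReader(io.StringIO(raw_rows))
    total = 0.0
    count = 0
    for row in reader:
        total += float(row["amount"])  # aggregate the whole window at once
        count += 1
    return {"rows": count, "sum_amount": total}

# One batch collected over an extraction window (illustrative data).
batch = "order_id,amount\n1,10.5\n2,4.5\n3,5.0\n"
print(run_batch(batch))  # {'rows': 3, 'sum_amount': 20.0}
```

Because the whole window is processed in one run, failures can be handled by simply re-running the batch, which is part of the failure-recovery benefit noted above.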
4.3.2. Streaming Processing
While batch processing is common in traditional organisations, streaming processing has gained popularity with the rise of real-time data analytics and the Internet of Things (IoT).
Streaming processing is a data processing approach that involves capturing and analysing data in real-time or near real-time as it is generated.
This method is well-suited for situations that require immediate insights and responses, such as real-time monitoring, fraud detection, or predictive maintenance.
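In contrast to a batch run, a streaming consumer handles each event as it arrives and keeps only a small amount of running state. The sketch below, with made-up thresholds and sensor values, shows the real-time monitoring idea: each reading is compared against a sliding window of recent readings the moment it is ingested.

```python
from collections import deque

class StreamMonitor:
    """Flags readings that deviate sharply from a sliding-window average;
    a minimal stand-in for real-time anomaly detection on a stream."""

    def __init__(self, window=5, threshold=2.0):
        self.window = deque(maxlen=window)  # recent readings only
        self.threshold = threshold          # relative deviation allowed

    def ingest(self, value):
        """Process one event as it arrives; return True if anomalous."""
        anomalous = False
        if len(self.window) == self.window.maxlen:
            avg = sum(self.window) / len(self.window)
            anomalous = abs(value - avg) > self.threshold * max(avg, 1e-9)
        self.window.append(value)
        return anomalous

monitor = StreamMonitor(window=3, threshold=1.0)
flags = [monitor.ingest(v) for v in [10, 11, 9, 100, 10]]
print(flags)  # [False, False, False, True, False]
```

The key contrast with batch processing is that the decision is made per event, with no waiting for a scheduled window to close.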
4.4.1. Storage
The Data Storage block in the Data Ocean serves as a repository for housing the raw data captured from various sources. It preserves the data in its original format (before it is integrated and transformed), enabling future integration and transformation. With its scalable storage capacity, it can accommodate and handle large volumes of raw data while ensuring data fidelity and security.
The raw data stored in the Data Storage block can come from various sources, such as operational systems, external data feeds, APIs, files, or streaming data sources, and it may include structured, unstructured, and semi-structured data.
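The "preserve the original format" principle can be sketched as follows: the payload is written byte-for-byte as received, and only the storage path carries metadata. The source/date partition layout here is an illustrative convention, not a prescribed one.

```python
import json
import tempfile
import time
from pathlib import Path

def land_raw(payload: bytes, source: str, base: Path) -> Path:
    """Persist a raw payload unchanged, partitioned by source and date.

    No parsing or transformation happens here, so later curation
    steps can always be re-run against the original bytes.
    """
    t = time.gmtime()
    target = (base / source / f"year={t.tm_year}"
              / f"month={t.tm_mon:02d}" / f"day={t.tm_mday:02d}")
    target.mkdir(parents=True, exist_ok=True)
    out = target / f"{int(time.time() * 1000)}.raw"
    out.write_bytes(payload)  # stored exactly as received
    return out

base = Path(tempfile.mkdtemp())
p = land_raw(json.dumps({"sensor": 7, "value": 3.2}).encode(), "plant_a", base)
print(p.read_bytes())
```

Keeping the raw layer immutable is what enables the "future integration and transformation" mentioned above: curation logic can change without any loss of source data.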
4.4.2. Curation
The curation component of the Data Ocean solution encompasses various activities aimed at transforming, enriching, and preparing raw data for further analysis.
It includes the following key elements:
- Data quality checks: conducting thorough assessments to ensure the accuracy, consistency, and reliability of the data. By implementing data validation techniques, the company can identify and rectify data anomalies or inconsistencies, ensuring the integrity of the data.
- Data validation: applying validation rules and checks to ensure the accuracy, completeness, and consistency of the data, verifying that it meets predefined criteria and conforms to expected formats, structures, and business rules.
- Data cleansing: duplicates, errors, and inconsistencies in the data can hinder accurate analysis. Cleansing techniques are applied to remove such issues and ensure the data is clean, complete, and free from redundancies or errors.
- Data standardisation: data often originates from different sources with varying formats and structures. Standardisation techniques transform the data into a consistent format, making it easier to integrate, compare, and analyse across different datasets.
- Data enrichment: to enhance the value and context of the data, integration with external sources and data augmentation techniques are applied. This involves incorporating additional information, such as external data sets or third-party data sources, to enrich the existing data and provide a more comprehensive view for analysis.
By incorporating these curation activities, the Data Ocean aims to ensure that the data is of high quality, reliable, and well-prepared for subsequent analysis and decision-making processes.
For more detail, please read the Data Curation chapter.
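A curation pass combining several of these activities can be sketched in a few lines. The field names, validation rules, and standardisation conventions below are illustrative assumptions, not the platform's actual schema.

```python
def curate(records):
    """Minimal curation pass over raw records: validate, cleanse
    duplicates, and standardise formats."""
    seen = set()
    curated, rejected = [], []
    for rec in records:
        # Data validation: required fields and expected types.
        if not rec.get("id") or not isinstance(rec.get("amount"), (int, float)):
            rejected.append(rec)
            continue
        # Data cleansing: drop duplicate records by key.
        if rec["id"] in seen:
            continue
        seen.add(rec["id"])
        # Data standardisation: one casing and precision convention.
        curated.append({
            "id": rec["id"],
            "country": str(rec.get("country", "")).strip().upper(),
            "amount": round(float(rec["amount"]), 2),
        })
    return curated, rejected

raw = [
    {"id": "a1", "country": " de ", "amount": 10.0},
    {"id": "a1", "country": "DE", "amount": 10.0},  # duplicate key
    {"id": "", "country": "FR", "amount": 5},       # fails validation
]
clean, bad = curate(raw)
print(clean, len(bad))
```

Rejected records are returned rather than silently dropped, so quality issues stay visible to the data quality checks described above.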
4.4.3. Provisioning
The provisioning component in the Data Ocean architecture focuses on ensuring the accessibility and usage of integrated, curated, and consumption-ready data.
It takes a use case-driven approach, allowing the company to tailor data provisioning strategies to meet specific needs, requirements and objectives. This includes exposing optimised data structures and processing methods that are tailored to specific analytical needs or use cases. This approach facilitates efficient data consumption, exploration, and analysis, allowing the company to address diverse business challenges and make data-driven decisions.
4.4.3.1. Domains and Data Products
The Data Ocean architecture includes pre-determined provisions for two distinct use cases: the Domain data layer and the Data Product.
4.4.3.1.1. Domain
The Domain data layer in the Data Ocean architecture serves as a centralised, reliable, and authoritative data source that is subject-oriented, integrated, time-variant, and non-volatile. It acts as a foundational layer that ensures data consistency, reliability, and governance across the company.
This domain-specific approach enables structured analysis, decision-making, and reporting, making it ideal for standardised and repeatable processes.
It adheres to the internal organisational structure and principles of Domain-Driven Design (DDD).
In summary, the Domain data layer in the Data Ocean architecture serves as a centralised, reliable, and authoritative data source, aligning with the internal organisation and adhering to the principles of Domain-Driven Design. It ensures that data related to each domain is consolidated, promotes a shared understanding of the data, and facilitates effective data management and integration within the company.
4.4.3.1.2. Data Product
On the other hand, the Data Product use case emphasises a more exploratory and iterative approach to data exploration and analysis. It is driven by user-defined requirements and focuses on delivering specific insights and solutions tailored to the needs of different stakeholders.
Data Products are designed to be more flexible and adaptable, accommodating evolving business needs and user preferences. They may be more volatile in nature, depending on the continuous interest and relevance of the insights they provide, and can be decommissioned when they no longer serve their purpose.
Data Products should prioritise performance, simplicity, and user accessibility.
4.4.3.2. Conclusion
Data Provisioning is about building the Data Models to support the Domain and the Data Products.
By incorporating both the Domain data layer and the Data Product use cases, the provisioning component of the Data Ocean architecture provides a comprehensive solution that meets the diverse data management needs of the company. It enables a centralised, reliable, and authoritative data layer for structured analysis and decision-making while also facilitating the development of agile and adaptable data products that support exploratory and iterative approaches to uncovering insights and driving innovation.
4.5. Data Science and Machine Learning
4.5.1. Data Mining
Data mining is the process of uncovering patterns, hidden relationships, trends, correlations, and valuable insights within extensive datasets. It encompasses a range of techniques and algorithms that enable the extraction of meaningful information from data sources of different types: structured, unstructured, or semi-structured.
Data mining also involves supporting activities such as data exploration, preprocessing, and visualisation, which together underpin the overall analysis.
Implementing Data Mining as part of the Data Ocean initiative can provide significant business advantages for the company.
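As one concrete illustration of "uncovering hidden relationships", the sketch below counts how often item pairs co-occur across transactions: the first step of classic association-rule mining. The item names are made up for the example.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(transactions, min_support=2):
    """Count co-occurring item pairs across transactions and keep
    those appearing at least `min_support` times."""
    counts = Counter()
    for items in transactions:
        # Sort so ("bolt", "nut") and ("nut", "bolt") count as one pair.
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

orders = [
    ["bolt", "nut", "washer"],
    ["bolt", "nut"],
    ["nut", "washer"],
]
print(frequent_pairs(orders))  # {('bolt', 'nut'): 2, ('nut', 'washer'): 2}
```

Pairs that clear the support threshold are candidates for rules such as "orders containing bolts usually also contain nuts", which is the kind of insight the text above refers to.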
4.5.2. Machine Learning
Machine learning is a subset of data science that focuses on building algorithms and models that can learn from data and make predictions or take actions. It enables the Data Ocean to leverage the power of artificial intelligence and automation, allowing for the identification of patterns, anomalies, and trends in large volumes of data.
It covers model training, model evaluation, and model deployment processes, highlighting the integration of machine learning capabilities within the Data Ocean architecture. It can apply different machine learning techniques, such as supervised learning, unsupervised learning, and reinforcement learning.
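The train/evaluate steps of the supervised-learning cycle can be shown in miniature with a nearest-centroid classifier written from scratch. The feature values and labels are invented for the example; a production model would use a proper ML library.

```python
def train(samples):
    """Fit a nearest-centroid model from labelled (features, label) pairs."""
    sums, counts = {}, {}
    for x, label in samples:
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    # The model is simply the mean feature vector per label.
    return {lbl: [v / counts[lbl] for v in acc] for lbl, acc in sums.items()}

def predict(model, x):
    """Assign the label whose centroid is closest to x."""
    def dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(x, centroid))
    return min(model, key=lambda lbl: dist(model[lbl]))

def evaluate(model, samples):
    """Model evaluation: accuracy on held-out labelled data."""
    hits = sum(predict(model, x) == y for x, y in samples)
    return hits / len(samples)

train_set = [([1.0, 1.0], "ok"), ([1.2, 0.9], "ok"), ([5.0, 5.1], "faulty")]
test_set = [([0.9, 1.1], "ok"), ([4.8, 5.0], "faulty")]
model = train(train_set)
print(evaluate(model, test_set))  # 1.0
```

Deployment, the third stage mentioned above, would amount to serving the fitted `model` behind an interface where new feature vectors arrive for `predict`.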
4.5.3. Data Science
Data science is the practice of utilising scientific methods, algorithms, and statistical models to derive knowledge and valuable insights from data, enabling informed decision-making based on data. It encompasses activities such as data exploration, preprocessing, and modelling, utilising techniques like data mining and predictive analytics.
The complete lifecycle, including problem conceptualisation, data preparation, model creation, and evaluation, is covered by data science. The incorporation of data science frameworks and tools may also be explored in order to support sophisticated analytics and predictive modelling capabilities.
4.5.4. Conclusion and Use Cases
Data Science, Machine Learning, and Data Mining are interconnected fields that contribute to extracting knowledge and insights from data.
By incorporating advanced capabilities into the Data Ocean initiative, the company can leverage advanced analytics and automation to gain a competitive edge, drive innovation, and unlock new business opportunities in manufacturing, scientific investigations, sales and other business or production areas.
4.5.4.1. Use Cases
4.6. Data Management
In the context of a Data Ocean, data management involves the strategic and organised handling of vast amounts of data from diverse sources. It encompasses data organisation, storage, protection, governance, and lifecycle management.
4.6.1. Data Catalog
The data catalog is an important component of the Data Ocean architecture, providing a comprehensive inventory of available data assets within the company. It serves as a centralised repository for metadata, allowing users to discover, understand, and access the data they need.
The data catalogue provides detailed information about the data sources, data models, data lineage, and other relevant attributes. It enables data governance and facilitates data discovery, promoting data reuse and reducing redundancy.
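The catalogue's role as a searchable metadata repository can be sketched with a small registry. The entry fields mirror the attributes listed above (source, lineage); the dataset names and tags are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    """One metadata record in a minimal data catalogue sketch."""
    name: str
    source: str
    owner: str
    lineage: list = field(default_factory=list)  # upstream dataset names
    tags: list = field(default_factory=list)

class DataCatalog:
    def __init__(self):
        self._entries = {}

    def register(self, entry: CatalogEntry):
        self._entries[entry.name] = entry

    def find(self, tag: str):
        """Data discovery: look up dataset names by tag."""
        return [e.name for e in self._entries.values() if tag in e.tags]

catalog = DataCatalog()
catalog.register(CatalogEntry("sales_daily", "erp", "finance",
                              lineage=["sales_raw"], tags=["sales", "curated"]))
catalog.register(CatalogEntry("sales_raw", "erp", "ingest",
                              tags=["sales", "raw"]))
print(catalog.find("curated"))  # ['sales_daily']
```

Because lineage is recorded alongside each entry, a consumer of `sales_daily` can trace it back to `sales_raw`, which is exactly the reuse-and-transparency benefit described above.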
4.6.2. Data Quality/Workflow
Data quality and workflow play a crucial role in ensuring the accuracy, consistency, and reliability of data within the Data Ocean.
It encompasses the processes, methodologies, and tools required to establish and maintain high-quality data throughout its lifecycle, so that the Data Ocean solution operates on trustworthy data.
Data quality work focuses on establishing quality standards, profiling and assessing the data to understand its quality and characteristics, and applying cleansing and validation processes to address anomalies and verify data integrity.
Ongoing monitoring and improvement efforts are essential for maintaining data quality.
In summary, Data quality is key for a data-driven culture, reliable analytics, operational efficiency, and informed decision-making. By prioritising data quality, organisations empower stakeholders with trustworthy data, facilitating seamless workflows and enhancing decision-making capabilities. Robust data quality practices ensure data reliability and usability, enabling actionable insights and optimised operations.
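Profiling, the assessment step mentioned above, can be illustrated with a small column summary so that quality rules are set against measured facts rather than assumptions. The column and row values here are invented.

```python
def profile(rows, column):
    """Data profiling sketch: summarise completeness and cardinality
    of one column across a list of record dicts."""
    values = [r.get(column) for r in rows]
    non_null = [v for v in values if v not in (None, "")]
    return {
        "rows": len(values),
        # Share of missing or empty values in the column.
        "null_rate": round(1 - len(non_null) / len(values), 2) if values else 0.0,
        # Number of distinct non-null values.
        "distinct": len(set(non_null)),
    }

rows = [{"plant": "A"}, {"plant": "B"}, {"plant": ""}, {"plant": "A"}]
print(profile(rows, "plant"))  # {'rows': 4, 'null_rate': 0.25, 'distinct': 2}
```

A quality rule such as "null rate below 5%" can then be monitored continuously, supporting the ongoing-improvement point above.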
4.6.3. Orchestration
Data orchestration refers to the coordination and management of various data processing tasks within the Data Ocean. It involves the use of orchestration tools and frameworks to automate and streamline data workflows. It covers the scheduling, sequencing, and dependency management of data processing tasks, ensuring efficient data movement and processing across different stages of the data pipeline.
Effective orchestration enhances the scalability, reliability, and performance of data processing operations.
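Dependency management, the core of orchestration, reduces to ordering tasks so each runs only after its upstream dependencies. The sketch below uses the standard library's topological sorter; the task names are illustrative, not the platform's real pipeline.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Each task maps to the set of tasks it depends on.
pipeline = {
    "ingest": set(),
    "curate": {"ingest"},
    "build_domain": {"curate"},
    "build_data_product": {"curate"},
    "publish_catalog": {"build_domain", "build_data_product"},
}

# static_order() yields a valid execution sequence; independent tasks
# (the two build steps) could also run in parallel.
order = list(TopologicalSorter(pipeline).static_order())
print(order)
```

A real orchestrator (Airflow, Dagster, and similar tools) adds scheduling, retries, and parallel execution on top of exactly this dependency model.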
4.6.4. Data Audit
Data audit involves tracking and monitoring data activities within the Data Ocean. It focuses on establishing robust audit trails, logging mechanisms, and monitoring tools to ensure data integrity and regulatory compliance. It includes the tracking of data access, data modifications, and data lineage, providing visibility into data usage and changes. Data audit enables the company to identify and rectify data anomalies, maintain data governance, and adhere to data privacy and security regulations.
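An audit trail of the kind described can be sketched as an append-only log in which each record is hash-chained to its predecessor, making tampering with history detectable. The event fields (user, action, dataset) are illustrative.

```python
import hashlib
import json
import time

class AuditTrail:
    """Append-only audit log sketch with hash chaining."""

    def __init__(self):
        self._log = []
        self._prev = "0" * 64  # genesis hash

    def record(self, user, action, dataset):
        """Append one audited event, chained to the previous entry."""
        entry = {"user": user, "action": action, "dataset": dataset,
                 "ts": time.time(), "prev": self._prev}
        digest = hashlib.sha256(
            json.dumps(entry, sort_keys=True).encode()).hexdigest()
        entry["hash"] = digest
        self._prev = digest
        self._log.append(entry)

    def verify(self):
        """Re-walk the chain; return False if any entry was altered."""
        prev = "0" * 64
        for e in self._log:
            body = {k: v for k, v in e.items() if k != "hash"}
            expected = hashlib.sha256(
                json.dumps(body, sort_keys=True).encode()).hexdigest()
            if e["prev"] != prev or expected != e["hash"]:
                return False
            prev = e["hash"]
        return True

trail = AuditTrail()
trail.record("alice", "read", "sales_daily")
trail.record("bob", "update", "sales_daily")
print(trail.verify())  # True
```

Editing any historical entry breaks the chain and makes `verify()` return False, which is the integrity property audit trails rely on.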
4.7. Operations
Operations in the context of the Data Ocean cover managing and maintaining the infrastructure, security, and performance of the data ecosystem, with a focus on ensuring the smooth operation and reliability of the Data Ocean solution.
4.7.1. Data Security
Data security is a critical aspect of the Data Ocean architecture. It involves the implementation of data security measures to protect sensitive data from unauthorised access, loss, or breach. It covers encryption, access controls, authentication, and data privacy techniques, ensuring the confidentiality, integrity, and availability of data within the Data Ocean.
4.7.2. Workload Management
Workload and workflow management are critical aspects of efficient data processing in a data ecosystem. Workload management encompasses strategies for distributing tasks across computing resources, optimising resource allocation, and enhancing performance. Workflow management involves various activities such as data transformation, integration, and enrichment. Data transformation ensures standardisation and compatibility, while integration enables seamless data merging for comprehensive analysis. Data enrichment enhances data by incorporating additional relevant information. Effective workflow management streamlines processes, maximising data asset utilisation, and delivering timely, accurate, and valuable insights to end-users.
4.7.3. Environment Management
Environment management focuses on managing the infrastructure and software environments required for the Data Ocean. This sub-chapter covers aspects such as infrastructure provisioning, configuration management, and version control of software components. It ensures that the Data Ocean environment remains stable, up-to-date, and properly configured to support data processing and analytics operations.
4.7.4. Backup
Data backup is an essential component of data management, ensuring the protection and recoverability of data in the event of data loss or system failures. This sub-chapter discusses backup strategies, including regular backups, incremental backups, and off-site storage, to mitigate the risk of data loss and ensure data resilience.
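The incremental strategy mentioned above can be sketched by hashing file contents and copying only what changed since the last run. The directory layout and state handling are simplified assumptions; real setups add retention policies and off-site copies.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path

def incremental_backup(src: Path, dst: Path, state: dict) -> list:
    """Copy only files whose content changed since the previous run,
    tracked via content hashes kept in `state`."""
    copied = []
    dst.mkdir(parents=True, exist_ok=True)
    for f in sorted(src.rglob("*")):
        if not f.is_file():
            continue
        digest = hashlib.sha256(f.read_bytes()).hexdigest()
        rel = str(f.relative_to(src))
        if state.get(rel) != digest:  # new or modified file
            target = dst / rel
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(f, target)
            state[rel] = digest
            copied.append(rel)
    return copied

src, dst = Path(tempfile.mkdtemp()), Path(tempfile.mkdtemp())
(src / "a.txt").write_text("v1")
state = {}
print(incremental_backup(src, dst, state))  # ['a.txt']
print(incremental_backup(src, dst, state))  # []  (nothing changed)
```

Unchanged files are skipped entirely on subsequent runs, which is what keeps incremental backups cheap relative to full ones.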
4.7.5. CI/CD
Continuous Integration and Continuous Deployment (CI/CD) practices are employed to streamline the development, testing, and deployment of data-related components within the Data Ocean. This sub-chapter explores the integration of CI/CD pipelines to automate and accelerate the deployment of data pipelines, data transformations, and other data-related processes.
4.7.6. Monitoring
Monitoring plays a crucial role in ensuring the health, performance, and availability of the Data Ocean infrastructure and data processing workflows. This sub-chapter focuses on implementing monitoring tools and techniques to track system performance, detect anomalies, and proactively address potential issues. It covers aspects such as real-time monitoring, log analysis, and alerting mechanisms to ensure the reliability and stability of the Data Ocean environment.
By addressing these critical aspects, the organisation can establish a robust and efficient Data Ocean solution that supports their data-driven initiatives and enables them to derive maximum value from their data assets.
5. Conclusion
The Reference Architecture provides the organisation with a comprehensive and scalable framework for building their Data Ocean solution.
By following the guidelines and best practices outlined in this reference architecture, the organisation can ensure data quality, security, and scalability, while enabling advanced analytics and data-driven decision-making.
The architecture's modular and flexible nature allows for customisation and adaptation to meet specific business requirements.
