A Data Architecture consists of the following layers:

  1. Business value (Reports, Exploratory Research, Insights, Actions): Data without direct or indirect business value is worthless. The cost of making data available, however small, must be offset by the business value it creates.
  2. Transformation (Rules): From raw data to information. In most cases raw data has little value on its own. Only once it is distilled into key performance indicators (“KPIs”) and well-documented attributes can its true value be realized.
  3. Data model (Navigation and self service): Defines the relationships between data sets and helps the user navigate between them. The business value of customer master data alone is small, and so is the value of a bare list of sales orders. Looking at the relationship between the two uncovers new information, e.g. which region is prospering. A well-designed and documented data model is the enabler for creating business value.
  4. Processing (Transformation flexibility): How sources provide data and how it is processed on its way to the final consumer. The chosen architecture determines many qualities of data consumption. For example, an Event Driven Architecture enables low-latency business decisions, while a classic ETL-based architecture is cheaper to build. The architecture also defines what kinds of data and processing capabilities can be used: a SQL-based architecture can cope with structured data only, a big data architecture with any kind of data.
  5. Consumption (APIs and Interfaces): How and for what purposes the data can be consumed. When building a Data Warehouse, the sole purpose is to enable self service analytics. With an Event Driven design, it can also be used for integrating systems asynchronously with each other, for alerting, for business workflows, and more.
  6. Metadata (Data Governance): Data without context has no value. Data can only be used if one knows what a field means, how a KPI is defined, where to find certain data, and what its business meaning is. Metadata can be read from the system (e.g. field name and data type), written down (e.g. documentation, business glossary), or created as a by-product (e.g. the list of transformation steps a record went through, or how often the data has been used). Metadata can be of a technical or business nature, and can also cover data sensitivity and permissions (a sketch of such a catalog entry follows this list).
  7. Incentives (Data Producers): Why would somebody provide data or use it? Although there are cases where somebody genuinely needs certain data, providing data is usually a chore. Using data to generate new insights cannot be enforced either, even though it would help to make the right decisions. Hence it is important to have incentives for producing and consuming the data: it should be easy to do, and there should be self-motivation to be part of the process.
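
To make the Metadata layer concrete, the following is a minimal sketch of what a machine-readable catalog entry covering all three kinds of metadata could look like. It is written in Python for illustration only; all names are assumptions, not the schema of any particular Data Governance tool.

    # Hypothetical catalog entry: technical metadata is read from the schema,
    # business metadata is written down, lineage and usage are by-products.
    from dataclasses import dataclass, field

    @dataclass
    class FieldMetadata:
        name: str              # technical: read from the source schema
        data_type: str         # technical: read from the source schema
        description: str = ""  # business: written down by the producer

    @dataclass
    class DatasetMetadata:
        name: str
        fields: list                                 # list of FieldMetadata
        glossary_terms: list                         # written: business glossary links
        lineage: list = field(default_factory=list)  # by-product: transformation steps
        usage_count: int = 0                         # by-product: how often it was used
        sensitivity: str = "internal"                # governance: sensitivity/permissions

    # Example: the customer master data mentioned in the Data model layer.
    customer_master = DatasetMetadata(
        name="customer_master",
        fields=[FieldMetadata("customer_id", "string", "Unique customer identifier"),
                FieldMetadata("city", "string", "Standardized city of the address")],
        glossary_terms=["Customer", "Buying Location"],
        lineage=["extracted from SAP", "addresses standardized"],
    )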

These layers build on each other.

  1. Providing the data takes the most effort (company-wide), but without data there is no business value. Access to data is the foundation of everything. At Syensqo we currently have the situation that projects need data from other teams but fail to properly motivate them to provide it → Incentives.
  2. Once all data is available, the result is a huge swamp of data. It is imperative to provide mechanisms for users to find the desired data set. This is achieved via a searchable catalog, with the producers providing the information → Metadata.
  3. The best available data does not help if it cannot be consumed as desired. The access method should support the desired style (pushing changes to inform consumers vs. pulling data to query all or parts of it) and be an open interface like SQL to support as many tools as possible → Consumption.
  4. The same applies to the types of transformations possible. If the customer master data should be cleansed and addresses standardized, but there is no transformation option for that, the task cannot be achieved: the cleansed data cannot be provided, and a query filtering on a city name does not return all records due to different spellings, falsifying the conclusions drawn from it. Normal transformations must be simple, complex ones possible → Processing.
  5. Another way to navigate within the pool of data sets is via relationships. The sales order has a relationship with the customer and hence with the buying location and other customer master data. The data should also be prepared in a way that helps the query engine retrieve it faster, lowering processing costs. Both are important properties for consumers. If the answer to a simple question takes 20 minutes, the question will probably not be asked again - to the disadvantage of the company. The same holds if the data can be navigated via searches only (Metadata) but not by moving along the relationships between data sets → Data Model.
  6. The current approach for many teams is for the business user to describe in a Jira ticket what he wants, a data architect translates it into a mapping document, and the data engineer implements it. These are disjoint process steps with a high potential for failure due to miscommunication or misunderstandings. Furthermore, even the business user does not know all the details right from the start. On the other hand, the higher the data quality, the more insights will be derived for the benefit of the company. The goal should be to enable the business user to provide the transformation rules in a form that can be executed directly as code (a minimal sketch of such rules follows this list) → Transformation.
  7. All of the above are steps that enable users to make fact-based decisions → Business value.
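
As an illustration of point 6, here is a minimal sketch of transformation rules that a business user could maintain and the platform could execute directly, without the Jira-ticket and mapping-document detour. The rule format and helper functions are hypothetical, not a specific product feature.

    # Declarative rules, simple enough for a business user to edit.
    RULES = [
        {"target": "city",    "source": "city_raw",  "transform": "standardize_city"},
        {"target": "revenue", "source": "net_value", "transform": "to_float"},
    ]

    # The platform supplies the library of safe, reusable transforms.
    TRANSFORMS = {
        "standardize_city": lambda v: v.strip().title(),  # " paris " -> "Paris"
        "to_float": float,
    }

    def apply_rules(record, rules=RULES):
        """Execute the user-provided rules directly on one raw record."""
        return {r["target"]: TRANSFORMS[r["transform"]](record[r["source"]])
                for r in rules}

    print(apply_rules({"city_raw": " paris ", "net_value": "1250.50"}))
    # -> {'city': 'Paris', 'revenue': 1250.5}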


Most importantly, the Data Architecture must match the company culture and goals.

During the interviews, certain statements were heard repeatedly:

  • We need data, e.g. SAP master data, and it is hard to get at the moment
  • Low latency is important
  • We have the unique requirements of …
  • We want to do some special processing
  • Things change often and quickly

These statements point towards an Event Driven architecture with a strong focus on Data Governance and a high degree of freedom for the users.
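
To illustrate what such event-driven consumption could look like, here is a minimal Python sketch assuming the kafka-python client; the broker address and topic name are hypothetical placeholders, not a platform decision.

    import json
    from kafka import KafkaConsumer  # assumption: kafka-python is installed

    # Subscribe to a hypothetical topic carrying SAP customer master changes.
    consumer = KafkaConsumer(
        "sap.masterdata.customer",
        bootstrap_servers="localhost:9092",
        value_deserializer=lambda m: json.loads(m.decode("utf-8")),
    )

    # Changes are pushed as they happen, enabling low-latency reactions;
    # the same stream can feed analytics, alerting, and system integration.
    for message in consumer:
        customer = message.value
        print(f"customer {customer['customer_id']} changed")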

Why Data projects fail

There are typical pitfalls when it comes to data projects. Knowing them helps to avoid them and to make the project successful, both short- and long-term.

  • Over-engineering: There are so many interesting technical solutions: databases, ETL tools, data access options, cataloging tools, BI tools. All have pros and cons, so one method is used, then another is added. What started as something simple quickly grows into something unmanageable.
    Solution: Provide few, but powerful, interfaces to work with the data.
  • Forcing source systems: If a central data team requires the source systems to provide (access to the) data, why would they comply? This means additional work and risk for them. What is their benefit in helping here? Does the central data team even have the leverage to request that? Usually this works to some extent, but the results are sub-par.
    Solution: Create situations where the source system wants to provide the data. One lever is that they provide the data only once, and from there all other teams can consume it, instead of being bothered by other teams on a daily basis.
  • Data sprawl: Team 1 reads data from a source system and uses it for something. Team 2 reads the same data and does something else with it. Team 3 requires the data as well. In the end, one data element can be found in different places for different use cases. But there is no coherent view, data might be out of sync, and the work is repeated.
    Solution: Enable use cases, don't provide data per use case.
  • Low reaction time: If a user requires data, he wants to use it immediately. If the response is that it will take two months, the user will either abandon the idea - to the disadvantage of the company - or work around the IT team.
    Solution: Organize the teams, the data and the process to react to requests quickly.
  • Data swamp: Because the Enterprise Data Warehouse is so rigid, Data Lakes have been built: simply upload all raw data into a central place and let the users do whatever they like. The result is that nobody knows what has already been provided, hence data exists multiple times, nobody can find anything, and even when they do, there is no trust in the data - it sounds right, but does it really contain the data needed?
    Solution: Provide lots of metadata and a system to find the data.
  • Data Governance chores: Providing the data is more than enough work, and now there is even more paperwork about its metadata? Where the data comes from, how frequently it gets updated, documentation at table and field level,... All of that has to be entered into the Data Governance tool - nobody has time for that.
    Solution: A lot can be created automatically, e.g. SAP provides table and field comments. The creation of metadata should be part of the code loading the data (see the sketch after this list).
  • Limited processing capabilities: The entire space of processing engines and tools is in upheaval. If the platform requires everybody to use a single tool, e.g. the ETL tool from vendor A, users are bound to be unhappy.
    Solution: Provide standardized interfaces to read and load data, e.g. SQL. Then every tool that supports that interface can be used.
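
As a sketch of the "metadata as part of the loading code" idea: the job that loads a table also registers what it knows about the table in the catalog. Here, source and catalog_client stand for hypothetical clients; the API shown is illustrative, not that of a specific tool.

    import datetime

    def load_table(source, catalog_client, table_name):
        rows = source.read(table_name)         # the actual data load
        columns = source.describe(table_name)  # name, type, comment from the source

        # Register the technical metadata automatically as a by-product,
        # so no separate Data Governance paperwork is needed.
        catalog_client.register(
            table=table_name,
            columns=[{"name": c.name, "type": c.type, "comment": c.comment}
                     for c in columns],
            loaded_at=datetime.datetime.now(datetime.timezone.utc).isoformat(),
            row_count=len(rows),
        )
        return rows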

Project Charter

Hence the following project charter shall be defined.

Syensqo seeks to modernize its data platform and address the current issues: data governance, finding the data, and making more data available to all.

Guiding principles

  1. The platform should help, not create restrictions. It should be fun and easy to use.
  2. The platform should be useful for all scenarios, from reporting and analytics to application integration. One system for all patterns.
  3. The platform should offer incentives to provide data. The minimum is to show how often data has been used by others (see the sketch below).
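
A minimal sketch of the usage-count incentive from principle 3, assuming the platform keeps a query log of (data set, consuming team) events; the log format shown is a hypothetical illustration.

    from collections import Counter

    # Hypothetical access log collected by the platform.
    access_log = [
        ("customer_master", "supply_chain"),
        ("customer_master", "finance"),
        ("sales_orders",    "finance"),
        ("customer_master", "marketing"),
    ]

    usage = Counter(dataset for dataset, _ in access_log)
    teams = {ds: {t for d, t in access_log if d == ds} for ds in usage}

    # Show producers how widely their data is used.
    for dataset, count in usage.most_common():
        print(f"{dataset}: {count} reads by {len(teams[dataset])} teams")
    # customer_master: 3 reads by 3 teams
    # sales_orders: 1 reads by 1 teams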