Status: WIP

Stakeholders: Antoine Roy
Outcome:
Due Date:
Owner: Ira Banerjee, Emma Glasson
Solution/Domain/Data Architect:

The new CSRD regulation significantly impacts both the quantity and the auditability of environmental KPI reporting.

An estimated 140 additional person-days per year if the current way of working continues, plus an increased risk of non-compliance.

Better data transparency and quality for Water, Waste & Air Emissions at each Syensqo site. Increased efficiency and auditability of the CSRD data collected from sites.

Fewer FTEs spent on mandatory reporting each year.

For GBU SPP and the Sustainability Corporate Function,

who include the Industrial Function, GBUs, and DT Digital Operations/Operational Performance,

this solution is a Power BI dashboard and a Power Apps application

that will provide automated SERF report KPIs on a daily basis.

Unlike the existing SERF report, which is fed manually by each site on a yearly basis,

our solution structures data visualization and input, enhancing data automation and data transparency.

Indicators to deliver: Water, Waste & Air Emissions KPIs, Syensqo One Planet KPIs.

High Level Design (Phase I)

Logical layered view | Application boundaries view



Low Level Design

https://app.diagrams.net/#G1UKjr26qZvsofBSolmenH_exOav_Yo5fO#%7B%22pageId%22%3A%22MNNbgHG-fPXtdiaO8sDC%22%7D 

<<Contribution from Azure Architect: Scott O'Neill scott.oneill-ext@syensqo.com>>

  • Add IP addresses/ports
    • origin IP range from Azure
    • destination IP range for StarTek
    • destination IP range for GCP/Labware
  • Public/private endpoints
  • Network Gateway - Backbone
  • Security Group
  • Service Account
  • VPCs
  • Local Networks
  • Workspace names


Components - Responsibility

  1. Dataverse: storage acting as a virtual layer over the Lakehouse data sources (Labware & StarTek), also persisting user content (comments, value-change proposals, PDF data);
  2. MS Fabric: data engine responsible for fetching data from the data sources (GCP: Labware & AWS: StarTek) and combining it with user content in order to generate the KPIs;
  3. Lakehouse: centralized data storage for all structured, semi-structured, and unstructured data, leveraging the Medallion architecture for data organization so that data engines can read/write data at scale;
  4. Power Apps: custom application framework used to create workflow automation and asynchronous processing, ease integration with the Azure tech stack, and serve as the user interface for triggering such processing;
  5. Power BI: tool that enables users to evaluate the data for analytics purposes;
  6. AI Builder: SaaS Azure solution that provides purpose-specific AI models for evaluating input data and generating the expected output. In this case, a pre-trained model ingests user PDFs and extracts specific data points to be persisted in Dataverse. The PDFs to be uploaded can be seen here; their raw size is up to 1 MB, and the relevant content for extraction is the tables.

Architecturally Significant Requirements <<Q&A, assessment>>


ASR
1. What persona types will interact with the system?
	1.1 Is there a need to implement any ACL (authorization) on the features?
	1.2 I.e., a need to differentiate between user profiles to grant access.
	1.3 Define personas; evaluate usage of AccessFlagTool vs AD/Entra
		1.3.1 4 personas *** see document

2. User file (PDF) uploading
	2.1 The time needed to extract data from an uploaded PDF cannot be predicted. Given that, what is the maximum acceptable time for data extraction?
	2.2 Do users need to evaluate/validate the extracted data immediately after the file is uploaded and processed?
	2.3 Are PDF metric values also available for users to change? Is there a need to make users aware of the origin/data source being changed?
	2.4 Is it possible to foresee other kinds of files (unstructured data) of interest to ingest for the KPI processing? Whether via user upload or system-to-system integration.
		2.4.1 +Data source - user upload, manual trigger: invoices, Excel; +SAP
	2.5 PDF: emission report analysis from specialized companies; more granular data complementary to the Labware & StarTek emission data.
	2.6 Is there any usage of the original file after the data points are extracted? Even from applications other than CSRD?
	2.7 Batch load: planned for invoices in the future

3. Users can add comments on the records next to the metrics they change when validating a batch. Not for retrieval, comment only. Only visible in the Power App.
	3.1 Is there any intention to process/parse/extract meaning from the users' comments?
	3.2 Is there any intention to perform full-text search on the comments?
	3.3 Should comments be shown both in the table UI (Power Apps) and in the reporting (Power BI)?
	3.4 Is there any need to control who can see which comments from whom?

4. Users need to evaluate the loaded/processed data and manually change/replace result values with correct ones
	4.1 Does it make sense to show users the source (PDF, Labware, StarTek)?

5. Users define Substitution/Aggregation rules (Boolean syntax) to automate changes on the metrics
	5.1 Should users be notified when a match happens and when the processing is finished?
	5.2 Should users be aware of which data is affected by the business-rule automation changes?
	5.3 What type of user will be responsible for designing such business rules?
	5.4 What if the automation performs wrong changes on the data?
	5.5 Should users be able to compare versions?
		5.5.1 Roll back a version or re-process?
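To make ASR 5 concrete, here is a minimal, hypothetical sketch of how a Substitution rule engine could match records via Boolean conditions and record every change for user awareness (5.2) and auditing. Field names and the rule structure are illustrative assumptions, not the actual CSRD implementation.

```python
# Hypothetical sketch of a Substitution rule engine. Field names and
# rule syntax are assumptions for illustration only.

def matches(record: dict, conditions: list) -> bool:
    """A record matches when ALL (field, op, value) conditions hold."""
    ops = {
        "==": lambda a, b: a == b,
        ">": lambda a, b: a > b,
        "<": lambda a, b: a < b,
    }
    return all(ops[op](record[field], value) for field, op, value in conditions)

def apply_rules(records: list, rules: list) -> list:
    """Return (record id, old value, new value) for every change, so users
    can be notified and can audit which data the automation affected."""
    changes = []
    for record in records:
        for rule in rules:
            if matches(record, rule["when"]):
                changes.append((record["id"], record["value"], rule["set_value"]))
                record["value"] = rule["set_value"]
    return changes

records = [
    {"id": 1, "site": "Tavaux", "metric": "water_m3", "value": -5.0},
    {"id": 2, "site": "Tavaux", "metric": "water_m3", "value": 120.0},
]
# Example rule: negative flowmeter readings are invalid; substitute 0.
rules = [{"when": [("metric", "==", "water_m3"), ("value", "<", 0)], "set_value": 0.0}]

print(apply_rules(records, rules))  # [(1, -5.0, 0.0)]
```

Returning the change list (rather than mutating silently) directly supports 5.2, 5.4, and 5.5: the list can be persisted as a version trail and replayed or rolled back.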

6. Should KPIs be processed only on user-validated batches?
	6.1 Then the final KPIs in Power BI reflect what users have already validated?
	6.2 Then user validation triggers the KPI processing for the final output

7. Latency of data fetched from Labware and StarTek
	7.1 What is the minimum acceptable time for data fetching from the data sources (minute, hour, week, ...)?

8. Visualization should accommodate data context for data points that lack a fixed schema - pivoting rows to columns
	8.1 Power Apps is not able to draw the schemaless structure for substances dynamically: static templates are used (David/Power Apps specialist to validate)
		8.1.1 What similar use cases could come in the near future? (Emma, David)
		8.1.2 One Lakehouse table per substance is not a storage-efficient approach. (David to review)
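The "pivot rows to columns" need in ASR 8 can be sketched as follows: each sample row carries (sample id, substance, value) in long format, and the visualization layer wants one column per substance, without a fixed schema. Field names here are assumptions, not the actual Labware schema.

```python
# Illustrative pivot of schemaless substance data from long format
# (one row per substance measurement) to wide format (one column per
# substance). Field names are assumptions, not the Labware schema.

def pivot_substances(rows):
    """Collapse long-format rows into one dict per sample, with a column
    per substance. Substances not measured simply stay absent (no fixed schema)."""
    samples = {}
    for row in rows:
        samples.setdefault(row["sample_id"], {"sample_id": row["sample_id"]})
        samples[row["sample_id"]][row["substance"]] = row["value"]
    return list(samples.values())

rows = [
    {"sample_id": "S1", "substance": "NOx", "value": 0.8},
    {"sample_id": "S1", "substance": "SO2", "value": 0.1},
    {"sample_id": "S2", "substance": "NOx", "value": 1.2},
]
print(pivot_substances(rows))
# [{'sample_id': 'S1', 'NOx': 0.8, 'SO2': 0.1}, {'sample_id': 'S2', 'NOx': 1.2}]
```

One long table like `rows` above also avoids the storage inefficiency flagged in 8.1.2 (one Lakehouse table per substance), since new substances only add rows, not tables.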

9. Reporting (Power BI)
	9.1 Is there any auditing need for which Power BI would be the interface used?
	9.2 Should logs of user actions and system actions on the data (the lineage) be shown as a report?
	9.3 Will the processed output KPI data be used by any other system/application outside the CSRD context?
	9.4 Should the reports show the most granular data, for instance the comments from users?
	9.5 How complex can the reports be in terms of allowing users to dynamically combine columns for aggregation/sorting/filtering - complex analysis?
		9.5.1 Consider any opportunity for pre-processing at scale to avoid excessive complexity and resource consumption at the analytics level.
	9.6 What kind of processing/pre-processing?

10. Notification System
	10.1 Users should be notified about background/asynchronous processing affecting the KPI data

11. Where are the user changes persisted?
	11.1 Comments, value changes:
	11.2 AI Builder/PDF output:
	11.3 What kind of processing is done on user content?

12. Users need to download spreadsheets from reports/validation
	12.1 Download via Power BI.

13. How should auditors perform auditing on the data?
	13.1 By querying tables with validated data via Dataverse.
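The auditor flow in ASR 13 (combined with ASR 6: auditors should only see user-validated batches) can be sketched with a plain SQL query. Here `sqlite3` stands in for the Dataverse/SQL endpoint; table and column names (`metrics`, `batch_status`) are assumptions for illustration.

```python
# Sketch of an auditor query over validated data only. sqlite3 is a
# stand-in for the real Dataverse/SQL endpoint; schema is assumed.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE metrics (site TEXT, metric TEXT, value REAL, batch_status TEXT)")
con.executemany("INSERT INTO metrics VALUES (?, ?, ?, ?)", [
    ("Tavaux", "water_m3", 118.5, "validated"),
    ("Tavaux", "water_m3", 120.0, "pending"),
])

# Auditors only query batches users have already validated (ASR 6 / 13).
rows = con.execute(
    "SELECT site, metric, value FROM metrics WHERE batch_status = 'validated'"
).fetchall()
print(rows)  # [('Tavaux', 'water_m3', 118.5)]
```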


Sites impacted

  • Tavaux

Data Model

  • Labware: sample results, product, specifications
  • StarTek: flowmeters and quantity totalizers
  • System trail:
    • User action logs: timestamp, user id, action
    • System automation action logs: timestamp, user id, action
  • User Content
    • Uploaded PDF files:
    • Comments
    • Metric values changes proposal 
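The system-trail records above can be sketched as a minimal data structure. The fields follow the bullet list (timestamp, user id, action); the helper names and the service-account convention for automation entries are assumptions for illustration.

```python
# Minimal sketch of the system-trail data model. Fields follow the
# bullet list above; helper names are illustrative assumptions.
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class TrailEntry:
    timestamp: str
    user_id: str   # for automation entries, e.g. a service-account id
    action: str

def log_action(trail: list, user_id: str, action: str) -> None:
    """Append one immutable trail entry with a UTC timestamp."""
    trail.append(TrailEntry(datetime.now(timezone.utc).isoformat(), user_id, action))

trail: list = []
log_action(trail, "some.user", "value_change_proposal")       # user action log
log_action(trail, "svc-kpi-pipeline", "substitution_applied")  # system automation log
print([e.action for e in trail])
# ['value_change_proposal', 'substitution_applied']
```

Keeping user and automation entries in the same shape (distinguished only by the id) directly serves ASR 9.2, where the lineage of both kinds of actions should be reportable.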


Design Decisions


  • Sensitivity Points
    • The volume of data (Labware, Startek)
  • Risks
  • Non Risks
  • Architectural Approaches
    • Multi-model storage
    • Medallion architecture
      • Shared ingestion mechanisms: job_raw_labware | topic_raw_labware; job_raw_startek | topic_raw_startek
      • Shared storage artifacts: bronze.raw_labware; bronze.raw_startek
  • Trade-offs (alternatives)
    • Questions/Concerns
      • Does it make sense to store user content in Dataverse? Does it make sense to replicate user content from Dataverse to the Lakehouse? Is there any strategic benefit (at the application roadmap level, sustainability domain level, or enterprise architecture level) to leveraging Dataverse for this purpose?
        • Keeping more than one data repository for the same business context introduces architectural and data-governance complexity.
      • How can users keep track of background processing affecting their own data work context?
      • What is the usage of the user content data (comments, change proposals, uploaded files)? Is there any usage outside the CSRD application context? (Some data points trigger the KPI pipeline; others do not.)
    • Alternatives
      • Message-driven for user content: Could Power Apps read the original data from the Lakehouse for user evaluation, collect the user change proposal associated with the original record ID, and send it as a message to Event Hub, with a consumer then sinking it into the Lakehouse next to the original records?
        • Pros:
        • Cons:   
      • Multi-model storage - separation of concerns/responsibility segregation: For user actions that do NOT trigger (or affect) the KPI pipeline processing, such as adding comments and/or annotations, managing the workflow approval process, logging trail activities, or applying ACL workspace control across user profiles (personas should be defined), a relational database (Cosmos DB, Azure SQL?) at the Gold layer, serving as the final stage for KPI results visualization (via Power BI) in combination with those user actions, could be a better fit.
        • Pros:
        • Cons:
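The message-driven alternative above hinges on the change proposal being a self-contained message keyed by the original record ID. A hedged sketch of such a payload follows; the event shape, field names, and record-ID format are assumptions for discussion, not an agreed contract, and publishing to Event Hub itself is out of scope here.

```python
# Sketch of a self-contained change-proposal event for the message-driven
# alternative. Payload shape and field names are assumptions, not an
# agreed contract; actual publishing to Event Hub is omitted.
import json
from datetime import datetime, timezone

def build_change_proposal(record_id: str, field: str, old_value, new_value,
                          user_id: str) -> str:
    """Serialize one user change proposal as a self-contained JSON event."""
    return json.dumps({
        "event_type": "metric_change_proposal",
        "record_id": record_id,        # ties the proposal to the original record
        "field": field,
        "old_value": old_value,
        "new_value": new_value,
        "proposed_by": user_id,
        "proposed_at": datetime.now(timezone.utc).isoformat(),
    })

msg = build_change_proposal("bronze.raw_labware:42", "water_m3", 120.0, 118.5, "some.user")
print(json.loads(msg)["event_type"])  # metric_change_proposal
```

Carrying both old and new values in the event would let the downstream consumer sink the proposal next to the original record and also feed the system trail and notification concerns (ASR 3, 10) without an extra lookup.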


Target Architecture (Phase II)



https://lucid.app/lucidchart/b7c02cfa-265a-4c9c-97ef-beb40bd7ef84/edit?page=0_0#

Core Building Blocks - Responsibility

  1. Event Hub: native data-streaming service responsible for receiving data from producers, either as self-contained messages or as triggers for a data collector to perform the data logistics and make the data available to be sunk into the Data Platform for data consumers.
  2. Notification Hubs: component responsible for observing synchronous/asynchronous system and user event actions and making them readable for users, bringing awareness and improving traceability of data changes from the user perspective;
  3. Schema Registry: ensures data quality, consistency, and safe evolution in event-driven or streaming systems.
  4. API Gateway: uniform interface that isolates internal complexity and supports DevSecOps practices by centralizing access control and observability. In this architecture it plays a core role in making it transparent which technology fulfils each responsibility from the user perspective, easing a plug-and-play approach for components - for instance the "AI Doc Extractor".
  5. AI Doc Extractor: component responsible for extracting specified data points from a document uploaded by the user. AI capability allows fine-tuning the model to capture data points with better accuracy.
  6. SQL Endpoint: interface for fetching data via a SQL query engine.
  7. Power BI: analytics tool that allows users to evaluate structured, cleaned data for visual analysis and report download.
  8. Power App: application context for custom capabilities deployed with low-code tooling, serving as the end-user interface and handling mid-level scheduled workload processing tasks.

References

Detailed Data Architecture

Data & Integration & Network

Dataverse integration with Microsoft Fabric https://www.youtube.com/watch?v=bgcNsqp92YE

PowerApps/Dataverse https://learn.microsoft.com/en-us/power-apps/maker/data-platform/data-platform-intro 

AI Builder https://learn.microsoft.com/en-us/ai-builder/overview

Tech Spec diagrams https://app.diagrams.net/#G1xhYLK5o4cYyKx6jod3TGL_WFdx9bu2Zi#%7B%22pageId%22%3A%22RwbzW0ZmPPmPHFFOWm3V%22%7D

SIP Process https://docs.google.com/spreadsheets/d/1j1i-8BkX8YWf2xtX-JaDT4R827aBO59L/edit?gid=1422442667#gid=1422442667

Target Architecture Presentation https://docs.google.com/presentation/d/1KosHyUYtOmcDwbKHYxspv6O_RdMUjdJ5PMsY33yeD_I/edit#slide=id.g31035d80c67_0_0

Dataverse and Azure SQL https://community.dynamics.com/blogs/post/?postid=22ce1a68-bb3f-4139-ae19-2a1d660286b1

One-Pager | IT12892 -CSRD Reporting Automation  | One-Pagers https://docs.google.com/presentation/d/1YSiN71DA8U8ekkRJeV_HAyrV3DXU7G_3wXNoYy2Vt7E/edit#slide=id.g307aa5ad372_0_638

Medallion architecture https://learn.microsoft.com/en-us/azure/databricks/lakehouse/medallion

Power Apps integration with Event Hub https://www.carlosag.net/PowerApps/Connectors/Event-Hubs




