Architecture Data Flow
Schema showing the different steps of the application flow, with the data involved at each step
DataApp Flow
Steps descriptions
Describe the data and process involved at each step
Data Transformation
Description
Changes made in the data transformation step since October 23, 2023:
- In Python Recipe: Calc_out_to_CHC → C0142ASolvay_SajaXAINVRIO is now fed from the GN_IOT_LT00139.PV tag on the MES server.
- In Python Recipe: compute_MES_raw_water_prep → C0143ASolvay_HispavicXAINCANL is now fed from the GN_IOT_LT00137.PV tag on the MES server. This value was also updated: the old version of the production data applied a -1.42 adjustment. The value is no longer corrected; the -1.42 has been removed in the current version (request WO0000000483037).
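The remapping above can be sketched as a small pandas step (the tag and column names come from this section; the DataFrame layout and recipe structure are assumptions, not the actual recipe code):

```python
import pandas as pd

# Hypothetical MES extract: one row per timestamp, one column per tag.
mes = pd.DataFrame({
    "GN_IOT_LT00139.PV": [12.30, 12.45],
    "GN_IOT_LT00137.PV": [8.10, 8.22],
})

out = pd.DataFrame()
# Calc_out_to_CHC: C0142ASolvay_SajaXAINVRIO now comes straight from the MES tag.
out["C0142ASolvay_SajaXAINVRIO"] = mes["GN_IOT_LT00139.PV"]
# compute_MES_raw_water_prep: the legacy -1.42 adjustment was dropped
# (request WO0000000483037), so the tag value passes through unchanged.
out["C0143ASolvay_HispavicXAINCANL"] = mes["GN_IOT_LT00137.PV"]  # was: value - 1.42
```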
ETL
Tools
Dataiku
Access rights
Dataiku sFTP connections are defined.
Web scraping doesn't require any access validation. → Since October 23, 2023, all the data comes directly from MES, so the web-scraping step is no longer applicable.
An MES service account is required. The service account used in CL/FL Torrelavega can be reused for this project.
Validation process
Multiple checks are performed on the input datasets:
- Existence of data on the website: error handling is defined to ensure we always have the data.
- Data content from sFTP is not validated, as it is provided by an external company.
- Existence of data from sFTP is validated: if no data is received for more than 24 hours, an email notification is sent to the external company.
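The 24-hour existence check can be sketched as follows (only the threshold comes from this section; the function name and the wiring into the notification are assumptions):

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(hours=24)

def sftp_data_is_stale(last_received: datetime, now: datetime = None) -> bool:
    """True when no sFTP data has arrived for more than 24 hours."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - last_received > STALE_AFTER

# When this returns True, the scenario would send the email notification
# to the external company (SMTP details omitted here).
```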
Configuration
The process is fully automated; a date argument is fetched automatically day by day. No configuration is required.
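A minimal sketch of how such a daily date argument might be derived (assuming each run processes the previous calendar day; the actual scenario variable in Dataiku may differ):

```python
from datetime import date, timedelta

def daily_date_argument(today: date = None) -> str:
    """Date argument fetched day by day; assumed here to be the previous day."""
    if today is None:
        today = date.today()
    return (today - timedelta(days=1)).isoformat()
```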
Source locations
Data comes directly from MES and sFTP. Everything is computed in Dataiku.
Source formats
AspenTech SQL, CSV files from sFTP, and CSV/HTML from web scraping.
Destination locations
sFTP server and Google spreadsheet.
Destination formats
CSV files and a spreadsheet table.
Sizing
2.46 KB per file, one file every hour.
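At one 2.46 KB file per hour, the resulting volume stays small; a quick back-of-the-envelope check:

```python
FILE_KB = 2.46          # size of one extracted file (from this section)
FILES_PER_DAY = 24      # one file per hour

daily_kb = FILE_KB * FILES_PER_DAY   # 59.04 KB per day
yearly_mb = daily_kb * 365 / 1024    # roughly 21 MB per year
```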
Scheduling
Extraction is done every hour; reporting is done as soon as the transformation is finished.
Timing
There is no full process; the incremental process takes around 1.5 minutes.
Criticality
High
Logging
Dataiku
River Data Collection
Description
Collects river-level data by web scraping every 15 minutes and publishes it to a Google spreadsheet.
ETL
Tools
Dataiku
Access rights
None
Validation process
Existence of data on the website: error handling is defined to ensure we always have the data.
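The error handling for the web source can be sketched as a fetch with retries (the URL, timeout, and retry policy are assumptions, not the project's actual settings):

```python
import time
import urllib.error
import urllib.request

def fetch_with_retry(url: str, attempts: int = 3, backoff_s: float = 5.0) -> bytes:
    """Fetch the river-level page, retrying on transient errors so the
    flow always ends up with data."""
    last_err = None
    for attempt in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                return resp.read()
        except urllib.error.URLError as err:
            last_err = err
            time.sleep(backoff_s * (attempt + 1))
    raise RuntimeError("river data unavailable after %d attempts" % attempts) from last_err
```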
Configuration
The process is fully automated.
Source locations
Source formats
CSV/HTML from web scraping.
Destination locations
Google spreadsheet.
Destination formats
pandas table (DataFrame)
Sizing
River level: 671 data points, 15.9+ KB.
Scheduling
Extraction is done every 15 minutes.
Timing
20 seconds
Criticality
High
Logging
Dataiku
