Architecture Data Flow
Schema showing the different steps of the application flow, with the data involved at each step
DataApp Flow
Steps descriptions
Describe the data and process involved at each step
Data Transformation
Description
Changes made in the data transformation step since October 23, 2023:
- In Python Recipe: Calc_out_to_CHC → C0142ASolvay_SajaXAINVRIO is now fed from the GN_IOT_LT00139.PV tag on the MES server.
- In Python Recipe: compute_MES_raw_water_prep → C0143ASolvay_HispavicXAINCANL is now fed from the GN_IOT_LT00137.PV tag on the MES server. This value was also updated: the old version of the production data applied a -1.42 adjustment. The value is no longer corrected; the -1.42 has been removed in the current version (request WO0000000483037).
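The remapping above can be sketched as a small pandas step (the tag and column names come from this section; the DataFrame layout and recipe structure are assumptions, not the actual recipe code):

```python
import pandas as pd

# Hypothetical MES extract: one row per timestamp, one column per tag.
mes = pd.DataFrame({
    "GN_IOT_LT00139.PV": [12.30, 12.45],
    "GN_IOT_LT00137.PV": [8.10, 8.22],
})

out = pd.DataFrame()
# Calc_out_to_CHC: C0142ASolvay_SajaXAINVRIO now comes straight from the MES tag.
out["C0142ASolvay_SajaXAINVRIO"] = mes["GN_IOT_LT00139.PV"]
# compute_MES_raw_water_prep: the legacy -1.42 adjustment was dropped
# (request WO0000000483037), so the tag value passes through unchanged.
out["C0143ASolvay_HispavicXAINCANL"] = mes["GN_IOT_LT00137.PV"]  # was: value - 1.42
```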
ETL
Tools
Dataiku
Access rights
Dataiku sFTP connections are defined.
Web scraping doesn't require any access validation. → Since October 23, 2023, all the data comes directly from MES, so the web-scraping step is no longer applicable.
An MES service account is required. The service account used in CL/FL Torrelavega can be reused for this project.
Validation process
Multiple checks are performed on the input datasets:
- Existence of data on the website: error handling is defined to ensure we always have the data.
- Data content from sFTP is not validated, as it is provided by an external company.
- Existence of data from sFTP is validated: if no data is received for more than 24 hours, an email notification is sent to the external company.
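The 24-hour existence check can be sketched as follows (only the threshold comes from this section; the function name and the wiring into the notification are assumptions):

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(hours=24)

def sftp_data_is_stale(last_received: datetime, now: datetime = None) -> bool:
    """True when no sFTP data has arrived for more than 24 hours."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - last_received > STALE_AFTER

# When this returns True, the scenario would send the email notification
# to the external company (SMTP details omitted here).
```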
Configuration
The process is fully automated; a date argument is fetched automatically day by day. No configuration is required.
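A minimal sketch of how such a daily date argument might be derived (assuming each run processes the previous calendar day; the actual scenario variable in Dataiku may differ):

```python
from datetime import date, timedelta

def daily_date_argument(today: date = None) -> str:
    """Date argument fetched day by day; assumed here to be the previous day."""
    if today is None:
        today = date.today()
    return (today - timedelta(days=1)).isoformat()
```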
Source locations
Data comes directly from MES and sFTP. Everything is computed in Dataiku.
Source formats
AspenTech SQL, CSV files from sFTP, and CSV/HTML from web scraping.
Destination locations
sFTP server and Google spreadsheet.
Destination formats
CSV files and a spreadsheet table.
Sizing
2.46 KB per file, one file every hour.
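At one 2.46 KB file per hour, the resulting volume stays small; a quick back-of-the-envelope check:

```python
FILE_KB = 2.46          # size of one extracted file (from this section)
FILES_PER_DAY = 24      # one file per hour

daily_kb = FILE_KB * FILES_PER_DAY   # 59.04 KB per day
yearly_mb = daily_kb * 365 / 1024    # roughly 21 MB per year
```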
Scheduling
Extraction is done every hour; reporting is done as soon as the transformation is finished.
Timing
There is no full process; the incremental process takes around 1.5 minutes.
Criticality
High
Logging
Dataiku
River Data Collection
Description
Collects river-level data by web scraping every 15 minutes and publishes it to a Google spreadsheet.
ETL
Tools
Dataiku
Access rights
None
Validation process
Existence of data on the website: error handling is defined to ensure we always have the data.
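The error handling for the web source can be sketched as a fetch with retries (the URL, timeout, and retry policy are assumptions, not the project's actual settings):

```python
import time
import urllib.error
import urllib.request

def fetch_with_retry(url: str, attempts: int = 3, backoff_s: float = 5.0) -> bytes:
    """Fetch the river-level page, retrying on transient errors so the
    flow always ends up with data."""
    last_err = None
    for attempt in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                return resp.read()
        except urllib.error.URLError as err:
            last_err = err
            time.sleep(backoff_s * (attempt + 1))
    raise RuntimeError("river data unavailable after %d attempts" % attempts) from last_err
```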
Configuration
The process is fully automated.
Source locations
Source formats
CSV/HTML from web scraping.
Destination locations
Google spreadsheet.
Destination formats
pandas table (DataFrame)
Sizing
River level: 671 data points, 15.9+ KB.
Scheduling
Extraction is done every 15 minutes.
Timing
20 seconds
Criticality
High
Logging
Dataiku
