Architecture Data Flow
DataPrep Flow
Diagram showing the different STEPS of the application flow, with the data involved at each step
Step descriptions
Describes the data and process involved at each step
Weather Forecast Extraction
Description
What is it?
Tools
The tool used
Docker + Python script + ECMWF GRIB library
Access rights
Are any credentials used? Where are they stored?
No credentials.
Data are publicly available for the current day ONLY.
To access past forecasts, one needs a premium subscription.
Source
Location
Where is this data collected?
Endpoint: public data from Meteo France
https://donneespubliques.meteofrance.fr/?fond=produit&id_produit=130&id_rubrique=51
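As a small illustration of how the dataset is identified, the endpoint above carries the product in its query string; a minimal stdlib sketch (no credentials needed, per the Access rights note above):

```python
from urllib.parse import urlparse, parse_qs

# Public Meteo France endpoint quoted above; the query parameters
# identify the product (id_produit) and section (id_rubrique).
ENDPOINT = ("https://donneespubliques.meteofrance.fr/"
            "?fond=produit&id_produit=130&id_rubrique=51")

params = {k: v[0] for k, v in parse_qs(urlparse(ENDPOINT).query).items()}
print(params)  # {'fond': 'produit', 'id_produit': '130', 'id_rubrique': '51'}
```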
Format
The format of the source data
GRIB file: https://confluence.ecmwf.int/display/CKB/What+are+GRIB+files+and+how+can+I+read+them
Destination
Location
Where is this data stored?
In the Data Bank: prj-bda-databank-dev.weather_forecasts.arpege_01 (the table queried in the Assessment check below)
Format
The format of the data once extracted
Time series with Date, Hour, Latitude, Longitude, and multiple weather parameters for each point
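A sketch of what one extracted row could look like; the Date/Hour/Latitude/Longitude fields come from this document, while the parameter names (e.g. "t2m") are purely illustrative assumptions, not the actual table schema:

```python
from dataclasses import dataclass, field

# Illustrative row shape only; "t2m" and the params dict layout are assumptions.
@dataclass
class ForecastRow:
    date: str        # e.g. "2022-04-15"
    hour: int        # forecast hour, 0-23
    latitude: float
    longitude: float
    params: dict = field(default_factory=dict)  # weather parameters per point

row = ForecastRow("2022-04-15", 6, 48.85, 2.35, {"t2m": 285.3})
print(row.date, row.params)
```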
Sizing
Expected data volume for:
full process
incremental process
Full process: ~100GB for ~778M rows
Incremental process (daily): ~10MB for ~570k rows
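A back-of-envelope consistency check on the figures above (a rough sketch, not an official sizing calculation): dividing the full-load row count by the daily increment gives the history depth the full load corresponds to.

```python
# Sizing figures quoted above.
FULL_ROWS = 778_000_000   # ~778M rows (full process)
DAILY_ROWS = 570_000      # ~570k rows (incremental, daily)

days = FULL_ROWS / DAILY_ROWS
print(f"full load ~= {days:.0f} daily increments (~{days / 365:.1f} years)")
```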
Assessment
How to validate that the generated output is valid
~571k rows should be stored each day:
SELECT count(*) FROM `prj-bda-databank-dev.weather_forecasts.arpege_01` WHERE Date = "2022-04-15"
-- => 571,291 rows
Scheduling
Is there an automatic schedule? At what frequency? What is the trigger?
Yes
GitLab triggers the extraction: https://gitlab.solvay.com/solvay-it-dataops/data-ingestion/ses-agregat-dataprep/environments/dataprep_pipeline_test_env/-/pipeline_schedules
Timing
The average time expected for:
full process
incremental process
Full process: not possible (past forecasts are not publicly available, so a full reload cannot be run)
Incremental process: ~70 minutes
Criticality
High / Medium / Low
High
Logging
Logging location
Energy Data Extraction
Description
What is it?
Energy data collected internally at Solvay
Tools
The tool used
Dataiku
Access rights
Are any credentials used? Where are they stored?
Connection credentials are provided by the Dataiku Operation Team
Source
Location
Where is this data collected?
Energy Databases:
Oracle IRM database
MS SQL ENERGY database
Format
The format of the source data
SQL databases
Destination
Location
Where is this data stored?
In the Data Bank:
Format
The format of the data once extracted
energy.epex_price: SQL database, time series
history_production: SQL database, time series, can be updated retroactively
energy_db_mapping: SQL database
rte_imbalance: SQL database, time series
Sizing
Expected data volume for:
full process
incremental process
energy.epex_price
Full: ~81k rows
history_production
Full: ~1M rows
Incremental:
>= 5,000 rows per day
On Sundays, reprocess the last 31 days to capture any retroactive changes
energy_db_mapping
Full: ~8k rows
rte_imbalance
Full: ~87k rows
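The history_production incremental rule above (daily load, plus a 31-day look-back on Sundays to catch retroactive changes) can be sketched as follows; this is an illustration of the rule as stated, not the actual pipeline code:

```python
from datetime import date, timedelta

def incremental_window(run_date: date) -> tuple:
    """Date range to (re)process for history_production.

    Normally just the run day; on Sundays the last 31 days are
    reprocessed to pick up any change made in the past.
    """
    if run_date.weekday() == 6:  # Sunday
        return run_date - timedelta(days=31), run_date
    return run_date, run_date

# 2022-04-17 was a Sunday, so it triggers the 31-day look-back:
print(incremental_window(date(2022, 4, 17)))
```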
Assessment
How to validate that the generated output is valid
energy.epex_price
Column DATEUCT: the most recent date must be the current date + 1 day, i.e. tomorrow
history_production
Column Date: the most recent date must be the current date
energy_db_mapping
Column VersionDate: the most recent date must be the current date
rte_imbalance
Record count should be close to ~87k rows
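The freshness and volume checks above can be sketched as two small helpers; the dates, lead times, and the ~87k expected count come from this document, while the function names and the ±10% tolerance are assumptions for illustration:

```python
from datetime import date, timedelta

def check_freshness(most_recent: date, today: date, lead_days: int = 0) -> bool:
    """epex_price needs today + 1 day (lead_days=1); the other tables need today."""
    return most_recent == today + timedelta(days=lead_days)

def check_volume(row_count: int, expected: int = 87_000, tolerance: float = 0.10) -> bool:
    """rte_imbalance: record count within an assumed +/-10% of ~87k rows."""
    return abs(row_count - expected) <= expected * tolerance

today = date(2022, 4, 15)
print(check_freshness(date(2022, 4, 16), today, lead_days=1))  # epex_price: True
print(check_volume(86_500))  # True
```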
Scheduling
Is there an automatic schedule? At what frequency? What is the trigger?
Yes
GitLab triggers the extraction: https://gitlab.solvay.com/solvay-it-dataops/data-ingestion/ses-agregat-dataprep/environments/dataprep_pipeline_test_env/-/pipeline_schedules
Timing
The average time expected for:
full process
incremental process
Full process: less than 10 minutes
Incremental process: not possible
Criticality
High / Medium / Low
Medium
Logging
Logging location
