Schema showing the different STEPS of the application flow - with the data involved at each step

Describe the data and process involved at each step
What is it ?
The tool used
Docker + Python script + ECMWF GRIB library
Is there any credentials used ? Where are they stored ?
No credentials.
Data are publicly available for the current day ONLY; accessing past forecasts requires a premium subscription.
Where is this data collected ?
Public data endpoint from Météo-France:
https://donneespubliques.meteofrance.fr/?fond=produit&id_produit=130&id_rubrique=51
The format of the source data
Grib file : https://confluence.ecmwf.int/display/CKB/What+are+GRIB+files+and+how+can+I+read+them
Where is this data stored ?
In the Data Bank :
The format of the data once extracted
Time series with Date, Hour, Latitude, Longitude, and multiple weather parameters for each point
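As an illustration, one extracted row could be modelled as follows (a sketch only; the exact column names and the weather parameters shown are assumptions, not the actual schema):

```python
from dataclasses import dataclass

@dataclass
class ForecastRow:
    """One extracted time-series row (illustrative field names)."""
    date: str              # forecast date, e.g. "2022-04-15"
    hour: int              # forecast hour, 0-23
    latitude: float
    longitude: float
    temperature_k: float   # example weather parameter (assumed)
    wind_speed_ms: float   # example weather parameter (assumed)

row = ForecastRow("2022-04-15", 12, 48.85, 2.35, 289.4, 4.2)
print(row.date, row.hour, row.latitude, row.longitude)
```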
Expected data volume for :
full process
incremental process
full process : ~100GB for ~778M rows
Incremental (daily): ~10MB for ~570k rows
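A quick back-of-the-envelope check (a sketch, assuming the daily load is representative) shows the two row counts are consistent with roughly 3.7 years of accumulated history:

```python
# Rough consistency check between full and incremental row volumes.
full_rows = 778_000_000   # full process: ~778M rows
daily_rows = 570_000      # incremental process: ~570k rows/day

days_of_history = full_rows / daily_rows
print(round(days_of_history))            # ~1365 days
print(round(days_of_history / 365, 1))   # ~3.7 years
```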
How to validate that the generated output is valid
~571k rows should be stored each day :
SELECT count(*) FROM `prj-bda-databank-dev.weather_forecasts.arpege_01` WHERE Date = "2022-04-15"
-- => 571 291 rows
Is there an automatic schedule ? At what frequency ? What is the trigger ?
Yes
Gitlab triggers the extraction : https://gitlab.solvay.com/solvay-it-dataops/data-ingestion/ses-agregat-dataprep/environments/dataprep_pipeline_test_env/-/pipeline_schedules
The average time expected for :
full process
incremental process
Full process : Not possible (past forecasts are not publicly available)
Incremental process : ~70 minutes
High / Medium / Low
High
Logging location
What is it ?
Energy data collected internally at Solvay
The tool used
Dataiku
Is there any credentials used ? Where are they stored ?
Connection credentials are provided by the Dataiku Operations Team
Where is this data collected ?
Energy Databases :
Oracle IRM database
MS SQL ENERGY database
The format of the source data
SQL databases
Where is this data stored ?
In the Data Bank :
The format of the data once extracted
energy.epex_price : SQL database, time series
history_production : SQL database, time series, past values can be updated
energy_db_mapping : SQL database
rte_imbalance : SQL database, time series
Expected data volume for :
full process
incremental process
energy.epex_price
Full : ~81k rows
history_production
Full : ~1M rows
energy_db_mapping
Full : ~8k rows
rte_imbalance
Full : ~87k rows
How to validate that the generated output is valid
energy.epex_price
column DATEUCT : Most recent date must be current date + 1 day (i.e. tomorrow)
history_production
column Date : Most recent date must be current date
energy_db_mapping
column VersionDate : Most recent date must be current date
rte_imbalance
Record count must stay close to ~87k rows
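The freshness and volume rules above could be automated along these lines (a minimal sketch; the function names and the count tolerance are assumptions, not the actual pipeline code):

```python
from datetime import date, timedelta

def check_epex_price(most_recent: date, today: date) -> bool:
    """energy.epex_price.DATEUCT: most recent date must be tomorrow."""
    return most_recent == today + timedelta(days=1)

def check_is_current_date(most_recent: date, today: date) -> bool:
    """history_production.Date / energy_db_mapping.VersionDate:
    most recent date must be the current date."""
    return most_recent == today

def check_rte_imbalance(row_count: int, expected: int = 87_000,
                        tol: float = 0.1) -> bool:
    """rte_imbalance: record count must stay close to ~87k rows
    (10% tolerance is an assumption)."""
    return abs(row_count - expected) <= expected * tol

today = date(2022, 4, 15)
print(check_epex_price(date(2022, 4, 16), today))     # expected: True
print(check_is_current_date(date(2022, 4, 15), today))
print(check_rte_imbalance(86_500))
```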
Is there an automatic schedule ? At what frequency ? What is the trigger ?
Yes
Gitlab triggers the extraction : https://gitlab.solvay.com/solvay-it-dataops/data-ingestion/ses-agregat-dataprep/environments/dataprep_pipeline_test_env/-/pipeline_schedules
The average time expected for :
full process
incremental process
Full process : Less than 10 minutes
Incremental : Not possible
High / Medium / Low
Medium
Logging location