Schema showing the different STEPS of the application flow - with the data involved at each step

Describe the data and process involved at each step
What is it ?
The tool used
Docker + Python script + ECMWF GRIB library
Is there any credentials used ? Where are they stored ?
No credentials.
Data are publicly available for the current day ONLY; accessing past forecasts requires a premium subscription.
Where is this data collected ?
Public data endpoint from Météo-France:
https://donneespubliques.meteofrance.fr/?fond=produit&id_produit=130&id_rubrique=51
The format of the source data
Grib file : https://confluence.ecmwf.int/display/CKB/What+are+GRIB+files+and+how+can+I+read+them
Where is this data stored ?
In the Data Bank :
The format of the data once extracted
Time series with Date, Hour, Latitude, Longitude, and multiple weather parameters for each point
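As an illustration, one extracted row could be modelled as follows (a sketch only; the exact column names and the weather parameters shown are assumptions, not the actual schema):

```python
from dataclasses import dataclass

@dataclass
class ForecastRow:
    """One extracted time-series row (illustrative field names)."""
    date: str              # forecast date, e.g. "2022-04-15"
    hour: int              # forecast hour, 0-23
    latitude: float
    longitude: float
    temperature_k: float   # example weather parameter (assumed)
    wind_speed_ms: float   # example weather parameter (assumed)

row = ForecastRow("2022-04-15", 12, 48.85, 2.35, 289.4, 4.2)
print(row.date, row.hour, row.latitude, row.longitude)
```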
Expected data volume for :
full process
incremental process
full process : ~100GB for ~778M rows
Incremental (daily): ~10MB for ~570k rows
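A quick back-of-the-envelope check (a sketch, assuming the daily load is representative) shows the two row counts are consistent with roughly 3.7 years of accumulated history:

```python
# Rough consistency check between full and incremental row volumes.
full_rows = 778_000_000   # full process: ~778M rows
daily_rows = 570_000      # incremental process: ~570k rows/day

days_of_history = full_rows / daily_rows
print(round(days_of_history))            # ~1365 days
print(round(days_of_history / 365, 1))   # ~3.7 years
```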
How to validate that the generated output is valid
~571k rows should be stored each day :
SELECT count(*) FROM `prj-bda-databank-dev.weather_forecasts.arpege_01` WHERE Date = "2022-04-15"
-- => 571 291 rows
Is there an automatic schedule ? At what frequency ? What is the trigger ?
Yes
Gitlab triggers the extraction : https://gitlab.solvay.com/solvay-it-dataops/data-ingestion/ses-agregat-dataprep/environments/dataprep_pipeline_test_env/-/pipeline_schedules
The average time expected for :
full process
incremental process
Full process : Not possible (past forecasts are not publicly available)
Incremental process : ~70 minutes
High / Medium / Low
High
Logging location
What is it ?
Energy data collected internally at Solvay
The tool used
Dataiku
Is there any credentials used ? Where are they stored ?
Connection credentials are provided by the Dataiku Operations Team
Where is this data collected ?
Energy Databases :
Oracle IRM database
MS SQL ENERGY database
The format of the source data
SQL databases
Where is this data stored ?
In the Data Bank :
The format of the data once extracted
energy.epex_price : SQL database, time series
history_production : SQL database, time series, past values can be updated
energy_db_mapping : SQL database
rte_imbalance : SQL database, time series
Expected data volume for :
full process
incremental process
energy.epex_price
Full : ~81k rows
history_production
Full : ~1M rows
energy_db_mapping
Full : ~8k rows
rte_imbalance
Full : ~87k rows
How to validate that the generated output is valid
energy.epex_price
column DATEUCT : Most recent date must be current date + 1 day (i.e. tomorrow)
history_production
column Date : Most recent date must be current date
energy_db_mapping
column VersionDate : Most recent date must be current date
rte_imbalance
Record count must stay close to ~87k rows
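The freshness and volume rules above could be automated along these lines (a minimal sketch; the function names and the count tolerance are assumptions, not the actual pipeline code):

```python
from datetime import date, timedelta

def check_epex_price(most_recent: date, today: date) -> bool:
    """energy.epex_price.DATEUCT: most recent date must be tomorrow."""
    return most_recent == today + timedelta(days=1)

def check_is_current_date(most_recent: date, today: date) -> bool:
    """history_production.Date / energy_db_mapping.VersionDate:
    most recent date must be the current date."""
    return most_recent == today

def check_rte_imbalance(row_count: int, expected: int = 87_000,
                        tol: float = 0.1) -> bool:
    """rte_imbalance: record count must stay close to ~87k rows
    (10% tolerance is an assumption)."""
    return abs(row_count - expected) <= expected * tol

today = date(2022, 4, 15)
print(check_epex_price(date(2022, 4, 16), today))     # expected: True
print(check_is_current_date(date(2022, 4, 15), today))
print(check_rte_imbalance(86_500))
```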
Is there an automatic schedule ? At what frequency ? What is the trigger ?
Yes
Gitlab triggers the extraction : https://gitlab.solvay.com/solvay-it-dataops/data-ingestion/ses-agregat-dataprep/environments/dataprep_pipeline_test_env/-/pipeline_schedules
The average time expected for :
full process
incremental process
Full process : Less than 10 minutes
Incremental : Not possible
High / Medium / Low
Medium
Logging location