Architecture Data Flow
DataPrep Flow
Diagram showing the different STEPS of the application flow, with the data involved at each step
Step descriptions
Describes the data and process involved at each step
Weather Forecast Extraction
Description
What is it?
Tools
The tool used
Docker + Python script + ECMWF GRIB library
Access rights
Are any credentials used? Where are they stored?
No credentials.
Data are publicly available for the current day ONLY.
To access past forecasts, one needs a premium subscription.
Source
Location
Where is this data collected?
Endpoint: public data from Meteo France
https://donneespubliques.meteofrance.fr/?fond=produit&id_produit=130&id_rubrique=51
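As a small illustration of how the dataset is identified, the endpoint above carries the product in its query string; a minimal stdlib sketch (no credentials needed, per the Access rights note above):

```python
from urllib.parse import urlparse, parse_qs

# Public Meteo France endpoint quoted above; the query parameters
# identify the product (id_produit) and section (id_rubrique).
ENDPOINT = ("https://donneespubliques.meteofrance.fr/"
            "?fond=produit&id_produit=130&id_rubrique=51")

params = {k: v[0] for k, v in parse_qs(urlparse(ENDPOINT).query).items()}
print(params)  # {'fond': 'produit', 'id_produit': '130', 'id_rubrique': '51'}
```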
Format
The format of the source data
GRIB file: https://confluence.ecmwf.int/display/CKB/What+are+GRIB+files+and+how+can+I+read+them
Destination
Location
Where is this data stored?
In the Data Bank: prj-bda-databank-dev.weather_forecasts.arpege_01 (the table queried in the Assessment check below)
Format
The format of the data once extracted
Time series with Date, Hour, Latitude, Longitude, and multiple weather parameters for each point
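A sketch of what one extracted row could look like; the Date/Hour/Latitude/Longitude fields come from this document, while the parameter names (e.g. "t2m") are purely illustrative assumptions, not the actual table schema:

```python
from dataclasses import dataclass, field

# Illustrative row shape only; "t2m" and the params dict layout are assumptions.
@dataclass
class ForecastRow:
    date: str        # e.g. "2022-04-15"
    hour: int        # forecast hour, 0-23
    latitude: float
    longitude: float
    params: dict = field(default_factory=dict)  # weather parameters per point

row = ForecastRow("2022-04-15", 6, 48.85, 2.35, {"t2m": 285.3})
print(row.date, row.params)
```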
Sizing
Expected data volume for:
full process
incremental process
Full process: ~100GB for ~778M rows
Incremental process (daily): ~10MB for ~570k rows
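A back-of-envelope consistency check on the figures above (a rough sketch, not an official sizing calculation): dividing the full-load row count by the daily increment gives the history depth the full load corresponds to.

```python
# Sizing figures quoted above.
FULL_ROWS = 778_000_000   # ~778M rows (full process)
DAILY_ROWS = 570_000      # ~570k rows (incremental, daily)

days = FULL_ROWS / DAILY_ROWS
print(f"full load ~= {days:.0f} daily increments (~{days / 365:.1f} years)")
```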
Assessment
How to validate that the generated output is valid
~571k rows should be stored each day:
SELECT count(*) FROM `prj-bda-databank-dev.weather_forecasts.arpege_01` WHERE Date = "2022-04-15"
-- => 571,291 rows
Scheduling
Is there an automatic schedule? At what frequency? What is the trigger?
Yes
GitLab triggers the extraction: https://gitlab.solvay.com/solvay-it-dataops/data-ingestion/ses-agregat-dataprep/environments/dataprep_pipeline_test_env/-/pipeline_schedules
Timing
The average time expected for:
full process
incremental process
Full process: not possible (past forecasts are not publicly available, so a full reload cannot be run)
Incremental process: ~70 minutes
Criticality
High / Medium / Low
High
Logging
Logging location
Energy Data Extraction
Description
What is it?
Energy data collected internally at Solvay
Tools
The tool used
Dataiku
Access rights
Are any credentials used? Where are they stored?
Connection credentials are provided by the Dataiku Operation Team
Source
Location
Where is this data collected?
Energy Databases:
Oracle IRM database
MS SQL ENERGY database
Format
The format of the source data
SQL databases
Destination
Location
Where is this data stored?
In the Data Bank:
Format
The format of the data once extracted
energy.epex_price: SQL database, time series
history_production: SQL database, time series, can be updated retroactively
energy_db_mapping: SQL database
rte_imbalance: SQL database, time series
Sizing
Expected data volume for:
full process
incremental process
energy.epex_price
Full: ~81k rows
history_production
Full: ~1M rows
Incremental:
>= 5,000 rows per day
On Sundays, reprocess the last 31 days to capture any retroactive changes
energy_db_mapping
Full: ~8k rows
rte_imbalance
Full: ~87k rows
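The history_production incremental rule above (daily load, plus a 31-day look-back on Sundays to catch retroactive changes) can be sketched as follows; this is an illustration of the rule as stated, not the actual pipeline code:

```python
from datetime import date, timedelta

def incremental_window(run_date: date) -> tuple:
    """Date range to (re)process for history_production.

    Normally just the run day; on Sundays the last 31 days are
    reprocessed to pick up any change made in the past.
    """
    if run_date.weekday() == 6:  # Sunday
        return run_date - timedelta(days=31), run_date
    return run_date, run_date

# 2022-04-17 was a Sunday, so it triggers the 31-day look-back:
print(incremental_window(date(2022, 4, 17)))
```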
Assessment
How to validate that the generated output is valid
energy.epex_price
Column DATEUCT: the most recent date must be the current date + 1 day, i.e. tomorrow
history_production
Column Date: the most recent date must be the current date
energy_db_mapping
Column VersionDate: the most recent date must be the current date
rte_imbalance
Record count should be close to ~87k rows
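The freshness and volume checks above can be sketched as two small helpers; the dates, lead times, and the ~87k expected count come from this document, while the function names and the ±10% tolerance are assumptions for illustration:

```python
from datetime import date, timedelta

def check_freshness(most_recent: date, today: date, lead_days: int = 0) -> bool:
    """epex_price needs today + 1 day (lead_days=1); the other tables need today."""
    return most_recent == today + timedelta(days=lead_days)

def check_volume(row_count: int, expected: int = 87_000, tolerance: float = 0.10) -> bool:
    """rte_imbalance: record count within an assumed +/-10% of ~87k rows."""
    return abs(row_count - expected) <= expected * tolerance

today = date(2022, 4, 15)
print(check_freshness(date(2022, 4, 16), today, lead_days=1))  # epex_price: True
print(check_volume(86_500))  # True
```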
Scheduling
Is there an automatic schedule? At what frequency? What is the trigger?
Yes
GitLab triggers the extraction: https://gitlab.solvay.com/solvay-it-dataops/data-ingestion/ses-agregat-dataprep/environments/dataprep_pipeline_test_env/-/pipeline_schedules
Timing
The average time expected for:
full process
incremental process
Full process: less than 10 minutes
Incremental process: not possible
Criticality
High / Medium / Low
Medium
Logging
Logging location
