Architecture Data Flow

DataPrep Flow

Diagram showing the different steps of the application flow, with the data involved at each step

Step descriptions

Describes the data and processes involved at each step

Weather Forecast Extraction

Description

What is it?

Weather forecasts for France

Tools

The tool used

Docker + Python script + ECMWF GRIB library

Access rights

Are any credentials used? Where are they stored?

No credentials

Data are publicly available ONLY for the current day.

Accessing past forecasts requires a premium subscription.

Source

Location

Where is this data collected?

Endpoint: public data from Meteo France

https://donneespubliques.meteofrance.fr/?fond=produit&id_produit=130&id_rubrique=51
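The product page above is addressed through URL query parameters. As a minimal sketch (the parameter names are simply those visible in the URL, not a documented API), the link can be built programmatically before scraping the actual GRIB download link from the page:

```python
from urllib.parse import urlencode

# Base endpoint for Meteo France public data (from this document).
BASE_URL = "https://donneespubliques.meteofrance.fr/"

def build_product_url(id_produit: int = 130, id_rubrique: int = 51) -> str:
    """Build the query URL for a Meteo France public product page.

    Defaults match the ARPEGE product referenced above; other
    product/rubrique IDs are assumptions.
    """
    query = urlencode({"fond": "produit",
                       "id_produit": id_produit,
                       "id_rubrique": id_rubrique})
    return f"{BASE_URL}?{query}"

print(build_product_url())
```

The actual GRIB file link is found on that page; downloading it could then be done with `urllib.request.urlretrieve`.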

Format

The format of the source data

GRIB file: https://confluence.ecmwf.int/display/CKB/What+are+GRIB+files+and+how+can+I+read+them

Destination

Location

Where is this data stored?

In the Data Bank: https://console.cloud.google.com/bigquery?project=prj-bda-databank-dev

Format

The format of the data once extracted

Time series with Date, Hour, Latitude, Longitude, and multiple weather parameters for each point
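As an illustrative sketch of that target shape (field and parameter names here are assumptions, not the actual BigQuery schema), each decoded GRIB field, a lat/lon grid of one weather parameter, is flattened into one row per grid point:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ForecastRow:
    """One time-series row as described above: Date, Hour, Latitude,
    Longitude, plus weather parameters. Names are illustrative."""
    date: date
    hour: int
    latitude: float
    longitude: float
    parameters: dict  # e.g. {"temperature_2m": 281.4}

def flatten_field(msg_date, hour, lats, lons, values, name):
    """Flatten one decoded GRIB field into per-point rows.

    In the real pipeline the coordinate and value arrays would come
    from a GRIB reader such as ECMWF's ecCodes (assumption).
    """
    return [ForecastRow(msg_date, hour, lat, lon, {name: val})
            for lat, lon, val in zip(lats, lons, values)]

rows = flatten_field(date(2022, 4, 15), 12,
                     [48.85, 43.60], [2.35, 1.44],
                     [283.1, 287.9], "temperature_2m")
```

One row per (date, hour, point, parameter-set) is what makes the daily row count predictable, which the assessment query below relies on.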

Sizing

Expected data volume for :

Assessment

How to validate that the generated output is valid

~571k rows should be stored each day:

SELECT count(*) FROM `prj-bda-databank-dev.weather_forecasts.arpege_01` WHERE Date = "2022-04-15"
-- => 571 291 rows
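A minimal sketch of how that check could be automated, assuming the row count is fetched from BigQuery first (the 1% tolerance is an assumption, not a documented threshold):

```python
def daily_count_is_valid(row_count: int,
                         expected: int = 571_291,
                         tolerance: float = 0.01) -> bool:
    """Return True if the day's row count is within `tolerance`
    (relative) of the expected count from the query above."""
    return abs(row_count - expected) <= expected * tolerance

# In practice row_count would come from running the COUNT(*) query
# above through the BigQuery client.
print(daily_count_is_valid(571_291))
```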

Scheduling

Is there an automatic schedule? At what frequency? What is the trigger?

Yes

GitLab triggers the extraction: https://gitlab.solvay.com/solvay-it-dataops/data-ingestion/ses-agregat-dataprep/environments/dataprep_pipeline_test_env/-/pipeline_schedules
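The schedules themselves are defined in the GitLab UI at the URL above; on the pipeline side, a job restricted to scheduled runs typically looks like the following sketch (job name, image, and script are hypothetical, not taken from the actual repository):

```yaml
extract_weather_forecasts:
  image: registry.example.com/dataprep/grib-extractor:latest
  script:
    - python extract_forecasts.py
  rules:
    # Run only when the pipeline was started by a schedule
    - if: '$CI_PIPELINE_SOURCE == "schedule"'
```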

Timing

The average time expected for :

Criticality

High / Medium / Low

High

Logging

Logging location

GitLab CI: https://gitlab.solvay.com/solvay-it-dataops/data-ingestion/ses-agregat-dataprep/environments/dataprep_pipeline_test_env/-/pipelines

Energy Data Extraction

Description

What is it?

Energy data collected internally at Solvay

Tools

The tool used

Dataiku

Access rights

Are any credentials used? Where are they stored?

Connection credentials are provided by the Dataiku Operation Team

Source

Location

Where is this data collected?

Energy databases:

Format

The format of the source data

SQL databases

Destination

Location

Where is this data stored?

In the Data Bank: https://console.cloud.google.com/bigquery?project=prj-bda-databank-dev

Format

The format of the data once extracted

Sizing

Expected data volume for :

Assessment

How to validate that the generated output is valid

Scheduling

Is there an automatic schedule? At what frequency? What is the trigger?

Yes

GitLab triggers the extraction: https://gitlab.solvay.com/solvay-it-dataops/data-ingestion/ses-agregat-dataprep/environments/dataprep_pipeline_test_env/-/pipeline_schedules

Timing

The average time expected for :

Criticality

High / Medium / Low

Medium

Logging

Logging location

GitLab CI: https://gitlab.solvay.com/solvay-it-dataops/data-ingestion/ses-agregat-dataprep/environments/dataprep_pipeline_test_env/-/pipelines