Architecture Data Flow

DataPrep Flow

Diagram showing the different steps of the application flow, with the data involved at each step

Step descriptions

Description of the data and process involved at each step

Weather Forecast Extraction

Description

What is it?

Weather forecasts for France

Tools

The tool used

Docker + Python script + ECMWF GRIB library

Access rights

Are any credentials used? Where are they stored?

No credentials

Data is ONLY publicly available for the current day.

To access past forecasts, one needs a premium subscription.

Source

Location

Where is this data collected ?

Public data endpoint from Météo-France:

https://donneespubliques.meteofrance.fr/?fond=produit&id_produit=130&id_rubrique=51

Format

The format of the source data

GRIB file: https://confluence.ecmwf.int/display/CKB/What+are+GRIB+files+and+how+can+I+read+them

Destination

Location

Where is this data stored ?

In the Data Bank: https://console.cloud.google.com/bigquery?project=prj-bda-databank-dev

Format

The format of the data once extracted

Time series with Date, Hour, Latitude, Longitude, and multiple weather parameters for each point
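To illustrate this shape, here is a minimal sketch of flattening a forecast grid into such time-series rows. The field names (`temperature`, `wind_speed`) and the `ForecastRow` structure are illustrative assumptions, not the real table schema:

```python
from dataclasses import dataclass

@dataclass
class ForecastRow:
    # One row per (date, hour, grid point); weather fields are examples only.
    date: str          # e.g. "2022-04-15"
    hour: int          # forecast hour
    latitude: float
    longitude: float
    temperature: float  # example weather parameter
    wind_speed: float   # example weather parameter

def flatten_grid(date, hour, lats, lons, temps, winds):
    """Flatten a lat/lon grid into one ForecastRow per grid point."""
    rows = []
    for i, lat in enumerate(lats):
        for j, lon in enumerate(lons):
            rows.append(ForecastRow(date, hour, lat, lon,
                                    temps[i][j], winds[i][j]))
    return rows

rows = flatten_grid("2022-04-15", 0, [48.0, 48.1], [2.0, 2.1],
                    [[10.5, 10.6], [10.7, 10.8]],
                    [[3.0, 3.1], [3.2, 3.3]])
print(len(rows))  # 2 lats x 2 lons -> 4 rows
```

Each GRIB parameter grid unrolls the same way, which is why a single forecast day expands into hundreds of thousands of rows.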

Sizing

Expected data volume for:

  • full process

  • incremental process

  • Full process: ~100 GB for ~778M rows

  • Incremental (daily): ~10 MB for ~570k rows

Assessment

How to validate that the generated output is valid

~571k rows should be stored each day:

SELECT count(*) FROM `prj-bda-databank-dev.weather_forecasts.arpege_01` WHERE Date = "2022-04-15"
-- => 571 291 rows
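This check can be automated once the count is fetched. A sketch of a pure-Python guard, where the reference count comes from the query above but the 5% tolerance is an assumption (fetching the count from BigQuery is left out):

```python
EXPECTED_DAILY_ROWS = 571_291  # reference count observed for 2022-04-15
TOLERANCE = 0.05               # assumed acceptable relative deviation

def daily_count_ok(row_count: int,
                   expected: int = EXPECTED_DAILY_ROWS,
                   tolerance: float = TOLERANCE) -> bool:
    """Return True if the day's row count is within tolerance of the expected count."""
    return abs(row_count - expected) <= expected * tolerance

print(daily_count_ok(571_291))  # True
print(daily_count_ok(100_000))  # False: far below the expected daily volume
```

A tolerance band is used rather than an exact match because the grid coverage can vary slightly between runs.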

Scheduling

Is there an automatic schedule? At what frequency? What is the trigger?

Yes

GitLab triggers the extraction: https://gitlab.solvay.com/solvay-it-dataops/data-ingestion/ses-agregat-dataprep/environments/dataprep_pipeline_test_env/-/pipeline_schedules

Timing

The average time expected for:

  • full process

  • incremental process

  • Full process: not possible (past forecasts are not publicly available)

  • Incremental process: ~70 minutes

Criticality

High / Medium / Low

High

Logging

Logging location

GitLab CI: https://gitlab.solvay.com/solvay-it-dataops/data-ingestion/ses-agregat-dataprep/environments/dataprep_pipeline_test_env/-/pipelines

Energy Data Extraction

Description

What is it?

Energy data collected internally at Solvay

Tools

The tool used

Dataiku

Access rights

Are any credentials used? Where are they stored?

Connection credentials are provided by the Dataiku Operation Team

Source

Location

Where is this data collected ?

Energy databases:

  • Oracle IRM database

  • MS SQL ENERGY database

Format

The format of the source data

SQL databases

Destination

Location

Where is this data stored ?

In the Data Bank: https://console.cloud.google.com/bigquery?project=prj-bda-databank-dev

Format

The format of the data once extracted

  • energy.epex_price: SQL database, time series

  • history_production: SQL database, time series, past values can be updated

  • energy_db_mapping: SQL database

  • rte_imbalance: SQL database, time series

Sizing

Expected data volume for:

  • full process

  • incremental process

  • energy.epex_price

    • Full: ~81k rows

  • history_production

    • Full: ~1M rows

    • Incremental
      • >= 5,000 rows per day
      • on Sundays, the last 31 days are reprocessed to capture any changes to past values
  • energy_db_mapping

    • Full: ~8k rows

  • rte_imbalance

    • Full: ~87k rows
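The Sunday reprocessing rule for history_production can be sketched as a window computation. `incremental_window` is a hypothetical helper, not part of the actual pipeline code:

```python
from datetime import date, timedelta

def incremental_window(run_date: date) -> tuple:
    """Return the (start, end) dates of the history_production incremental load.

    On Sundays the last 31 days are reprocessed to pick up changes to past
    values; on any other day only the previous day is loaded.
    """
    days_back = 31 if run_date.weekday() == 6 else 1  # Monday=0 .. Sunday=6
    return (run_date - timedelta(days=days_back), run_date)

# A Sunday run covers 31 days back; a weekday run covers just one day:
print(incremental_window(date(2022, 4, 17)))  # 2022-04-17 was a Sunday
print(incremental_window(date(2022, 4, 18)))  # Monday: one-day window
```

Deriving the window from the run date keeps the schedule stateless: no bookmark of the last successful load needs to be stored.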

Assessment

How to validate that the generated output is valid

  • energy.epex_price

    • column DATEUCT: the most recent date must be the current date + 1 day, i.e. tomorrow

  • history_production

    • column Date: the most recent date must be the current date

  • energy_db_mapping

    • column VersionDate: the most recent date must be the current date

  • rte_imbalance

    • the record count must be close to ~87k rows
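The date-based freshness rules above can be sketched as a single check function. `freshness_ok` is a hypothetical helper, and fetching the actual max dates from BigQuery is out of scope here (the rte_imbalance count check is covered separately above):

```python
from datetime import date, timedelta

def freshness_ok(table: str, most_recent: date, today: date) -> bool:
    """Apply the per-table freshness rule described in the assessment."""
    if table == "energy.epex_price":
        # Day-ahead prices: the latest DATEUCT must be tomorrow.
        return most_recent == today + timedelta(days=1)
    if table in ("history_production", "energy_db_mapping"):
        # Latest Date / VersionDate must be the current date.
        return most_recent == today
    raise ValueError(f"no date-based freshness rule for {table}")

today = date(2022, 4, 15)
print(freshness_ok("energy.epex_price", date(2022, 4, 16), today))    # True
print(freshness_ok("history_production", date(2022, 4, 14), today))   # False: one day stale
```

Running such a check right after the extraction makes a stale source fail loudly instead of silently propagating old data downstream.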

Scheduling

Is there an automatic schedule? At what frequency? What is the trigger?

Yes

GitLab triggers the extraction: https://gitlab.solvay.com/solvay-it-dataops/data-ingestion/ses-agregat-dataprep/environments/dataprep_pipeline_test_env/-/pipeline_schedules

Timing

The average time expected for:

  • full process

  • incremental process

  • Full process: less than 10 minutes

  • Incremental: not possible

Criticality

High / Medium / Low

Medium

Logging

Logging location

GitLab CI: https://gitlab.solvay.com/solvay-it-dataops/data-ingestion/ses-agregat-dataprep/environments/dataprep_pipeline_test_env/-/pipelines
