Architecture Data Flow

DataApp Flow

Diagram showing the different steps of the application flow, with the data involved at each step



Step descriptions

Describe the data and process involved at each step

SES Aggregat

Description

What does it do?




Forecasts energy production for solar and wind parks.

  1. Processing step: all data sources are imported from the Data Bank into the project:

  2. Modeling step:

    1. Solar: data is prepared to build a training dataset for solar energy production forecasting. The full production history is used to train a model, which is then applied to the current day's data to predict the next 48 hours of energy production.

    2. Wind: the same principle applies to wind energy production. The solar and wind models are executed in parallel.

  3. Aggregation step: a sequence of aggregation and stacking operations is performed per park.

  4. Dashboard step: data is prepared to be sent to a dashboard for visualization.

  5. Output step: the final data is sent to the output database.
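The steps above can be sketched as a minimal Python pipeline. This is an illustrative sketch only: the function names, park identifiers, and the trivial mean-based "model" are assumptions standing in for the real Dataiku recipes and forecasting models.

```python
from concurrent.futures import ThreadPoolExecutor

def train_model(history):
    """Hypothetical training step: here just the mean of the production history."""
    return sum(history) / len(history)

def forecast_48h(model, current_day):
    """Hypothetical prediction step: 48 hourly values from today's data."""
    return [(model + current_day) / 2 for _ in range(48)]

def run_energy(history, current_day):
    # Modeling step: train on the full history, then predict the next 48 h.
    model = train_model(history)
    return forecast_48h(model, current_day)

def pipeline(parks):
    """parks: {park_name: {"solar": (history, today), "wind": (history, today)}}"""
    results = {}
    with ThreadPoolExecutor() as pool:
        for park, sources in parks.items():
            # Solar and wind models are executed in parallel.
            futures = {kind: pool.submit(run_energy, *data)
                       for kind, data in sources.items()}
            forecasts = {kind: f.result() for kind, f in futures.items()}
            # Aggregation step: stack solar + wind per park, hour by hour.
            results[park] = [sum(hour) for hour in zip(*forecasts.values())]
    return results

parks = {"park_A": {"solar": ([10, 12, 11], 12), "wind": ([5, 7, 6], 6)}}
print(pipeline(parks)["park_A"][0])  # → 17.5
```

The parallel execution of the solar and wind models maps here onto a thread pool; in the actual project this parallelism is handled by the Dataiku flow itself.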

Tools

The tools used

Dataiku DSS, Python, GitLab

Access rights

Are any credentials used? Where are they stored?

No credentials are used except for Data Bank access. The database connection settings are already configured in Dataiku DSS.

Validation process

How to validate that the output matches business expectations

Multiple checks are performed on the output dataset:

Configuration

What kind of information or configuration does this step use?

The process steps are fully automated; a date argument is fetched automatically day by day. No configuration is required.
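A minimal sketch of the automated daily date argument (the real project fetches it through Dataiku's scheduling mechanisms, which are not shown here; `run_date` and `forecast_window` are hypothetical helpers):

```python
from datetime import date, timedelta

def run_date(today=None):
    """Return the date argument for the current run (defaults to today)."""
    return (today or date.today()).isoformat()

def forecast_window(today):
    """The forecast covers the next 48 hours from the run date."""
    return today, today + timedelta(hours=48)

start, end = forecast_window(date(2024, 1, 2))
print(run_date(date(2024, 1, 2)))  # → 2024-01-02
```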

Source locations

Where does the data come from?

Data comes from the Data Bank on GCP. The input tables are pre-computed automatically by the DataPrep project. There are also two configuration files coming from Google Sheets, where business rules are defined.

Source formats

The format of the source datasets

SQL database from GBQ, in the connection CX_GBQ_A_PRJ-BDA-DATABANK-DEV_ALL:

CSV files for the Google Sheets sources:
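Parsing the business-rules sheets can be illustrated with a small CSV reader; the column names and values below are hypothetical, since the actual sheet layout is not documented here.

```python
import csv
import io

# Hypothetical business-rules CSV, as it might come from a Google Sheets export.
RULES_CSV = """park,rule,value
park_A,max_capacity_mw,20
park_A,curtailment_pct,5
"""

def load_rules(text):
    """Parse the business-rules CSV into {park: {rule: value}}."""
    rules = {}
    for row in csv.DictReader(io.StringIO(text)):
        rules.setdefault(row["park"], {})[row["rule"]] = float(row["value"])
    return rules

print(load_rules(RULES_CSV)["park_A"]["max_capacity_mw"])  # → 20.0
```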

Destination locations

Where is the data stored ?

Output data is stored in GBQ.

Destination formats

The format of the data output

SQL database from GBQ, in the connection CX_GBQ_A_SOLVAY-ENERGY-AGGREGATION-DEV_DATAIKU:

Sizing

Expected data volume

Data volume:

Scheduling

Is there an automatic schedule? At what frequency? What is the trigger?

Two automatic schedules:

Timing

The average time expected for:

There are only incremental scenarios:

Criticality

High / Medium / Low

Medium

Logging

Logging location