High Level Project Architecture

Here is a suggested template: 

https://drive.google.com/file/d/1R9Y2e5TI_pmrghmnEquAOS-1Lq7Pdofw1O61ZHr4g1s/view


Architecture Data Flow

Here is a suggested template for Data Model + Data Mapping :

https://docs.google.com/spreadsheets/d/1bD8AIgsNUI2sgANoOEKTuHBlkxhsNVTD8cmOPYEloLw


DataPrep Flow

Schema showing the different STEPS of the application flow - with the data involved at each step


  

Steps descriptions

Describe the data and process involved at each step

Data Transformation

Description

Data from Sinfin, MES and CHCantabrico are aggregated and sent to the Spanish Government through an sFTP server. → Since October 23, 2023, all the data comes directly from MES, so the web scraping step is no longer applicable.

Tools

WebMethods provides the sFTP server. Dataiku transforms the data and sends it to the sFTP server.

Access rights

Source

Location
Format

Destination

Location
Format

csv files.
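
As an illustration of the csv output, here is a minimal sketch of serialising aggregated rows to CSV text before upload. The function name and the column names in the usage note are hypothetical; the real columns are defined in the Data Model + Data Mapping sheet. The sFTP upload itself is left out, since it depends on the client configured in Dataiku.

```python
import csv
import io


def rows_to_csv(rows, fieldnames):
    """Serialise a list of dicts to CSV text with a header row.

    `fieldnames` fixes the column order; the names are illustrative only.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

For example, rows_to_csv([{"timestamp": "2023-10-23T00:00", "value": 1.5}], ["timestamp", "value"]) yields a header line followed by one data line.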

Sizing

Expected data volume for :

Full process: doesn't exist. 

River level: 671 data points, 15.9+ KB

Incremental process: 

Assessment

MES data: data preparation is done only when new data is available. 

sFTP data: data preparation is done only when new data is uploaded. If some timestamps are missing, the corresponding values are interpolated. If no data is uploaded for more than a day, an email notification is sent daily to the developers and SINFIN representatives.
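
The two sFTP rules above can be sketched as follows. This is a minimal illustration, not the Dataiku recipe itself: it assumes linear interpolation and a fixed expected spacing between readings (neither is stated in this document), and the function names are hypothetical.

```python
from datetime import datetime, timedelta


def fill_missing_timestamps(readings, step):
    """Return readings with gaps filled by linear interpolation.

    readings: list of (datetime, float) tuples sorted by time.
    step: expected spacing between consecutive readings (an assumption).
    """
    if not readings:
        return []
    filled = [readings[0]]
    for (t0, v0), (t1, v1) in zip(readings, readings[1:]):
        t = t0 + step
        while t < t1:
            # Position of the missing timestamp between its two neighbours.
            fraction = (t - t0) / (t1 - t0)
            filled.append((t, v0 + fraction * (v1 - v0)))
            t += step
        filled.append((t1, v1))
    return filled


def needs_stale_alert(last_upload, now, max_age=timedelta(days=1)):
    """True when no data has been uploaded for more than max_age."""
    return now - last_upload > max_age
```

A daily scheduled check could call needs_stale_alert with the timestamp of the latest upload and send the email when it returns True.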

Chcantabrico data: We collect data from the website and append it to the historical data in the Dataiku dataset. When data is not available directly, it is extracted from another page on the website. As a last resort, we try again later that day. → Not needed anymore since October 23, 2023. The data comes directly from MES.

Scheduling

Extraction is done every hour. Reporting is done as soon as the transformation is finished.

Timing

The average time expected for :

Full process doesn't exist. Incremental process takes around 1.5 minutes.

Criticality

High

Logging

Dataiku logs are kept for 2 days.

River Data Collection

Description

Data from Chcantabrico.es is collected and stored in a Google Sheet.

Tools

Dataiku 

Access rights

Dataiku collects the data. The Google Service Account is stored in the Dataiku folder.

Source

Location

Website. The source page may change, although this is unlikely. If it does, the project must be updated.

Format

pandas DataFrame

Destination

Location

GSheets: 1Sja2VlbUmya2Fa340mD3uT9DEIh3wjq_vTxu26IkVQc

Format

table

Sizing

Expected data volume for :

River level: 671 data points, 15.9+ KB

Assessment

Chcantabrico data: We collect data from the website and append it to the historical data in the Dataiku dataset. When data is not available directly, it is extracted from another page on the website. As a last resort, we try again later that day. → Not needed anymore since October 23, 2023. The data comes directly from MES.
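
The collection order described above (main page, then another page, then a later retry) can be sketched as follows. This is a hedged illustration of the behaviour before October 23, 2023, not the actual Dataiku code: the two fetch callables and both function names are hypothetical, and skipping timestamps already present in the history is an assumption.

```python
def fetch_with_fallback(fetch_primary, fetch_alternate):
    """Try the primary page first, then the alternate page.

    Both arguments are hypothetical zero-argument callables returning a
    list of (timestamp, value) readings. Returns (readings, source);
    source is None when both pages fail, which the caller can treat as
    "try again later that day".
    """
    for source, fetch in (("primary", fetch_primary), ("alternate", fetch_alternate)):
        try:
            readings = fetch()
        except Exception:
            # Broad catch for brevity in this sketch: any failure moves on.
            continue
        if readings:
            return readings, source
    return [], None


def append_new(history, readings):
    """Append readings to history, skipping timestamps already stored."""
    seen = {timestamp for timestamp, _ in history}
    return history + [r for r in readings if r[0] not in seen]
```

With the 15-minute schedule, a run that gets source None simply leaves the history unchanged and the next scheduled run acts as the retry.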

Scheduling

Extraction is done every 15 minutes.

Timing

The average time expected for :

Full process takes 20 seconds.

Criticality

High

Logging

Dataiku logging.