
4.1 - Talend Integration

Source data integration with Talend ETL tool

  • Xtract Server (WBP):

    • Talend connects to the Xtract server and triggers extraction jobs using HTTP calls.
    • Xtract generates the results as CSV files held in memory.
    • Talend retrieves the files, applies some transformations, and then loads them into Google Cloud Storage.
  • Google Sheets (Hedges, wap solid fuels, CO2):

    • Talend integrates with Google Sheets to extract static files that are used to apply mappings or lookups to BW tables.
    • It retrieves this data directly from Google Sheets.

Data Transformation and Loading to Google BigQuery:

  • Once data from all sources is available in Google Cloud Storage as files, Talend proceeds with data transformation and loading.
  • Talend performs data transformations as needed, including cleansing, mapping, and structuring the data for consistency.
  • The transformed data is loaded into various stages, operational data stores (ODS), and data mart tables within Google BigQuery.
  • These tables are organized to facilitate efficient querying and reporting for energy optimization purposes.

By utilizing Talend for data extraction, transformation, and loading (ETL), the web app ensures that data from diverse sources is collected, processed, and structured for analysis and reporting within Google BigQuery, enabling users to make informed decisions based on up-to-date and accurate data.
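The transformation stage described above (reading CSV output, cleansing values, mapping columns) can be sketched in plain Python. The column names and mapping below are illustrative only; the real mappings live in the Talend jobs:

```python
import csv
import io

# Illustrative column mapping (the real mappings are configured in Talend).
COLUMN_MAP = {"plant": "PLANT_CODE", "qty": "QUANTITY"}

def transform(csv_text: str) -> list:
    """Read a source CSV, rename columns per the mapping, and strip
    whitespace from values (a minimal cleansing step)."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        rows.append({COLUMN_MAP.get(k, k): v.strip() for k, v in row.items()})
    return rows

source = "plant,qty\n FR01 ,10\n"
print(transform(source))  # [{'PLANT_CODE': 'FR01', 'QUANTITY': '10'}]
```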

4.2 - Source Data Extraction

Main jobs for source extraction
  • F100_GSC_DATA_EXTRACT

  •  J130_mainExtract_Files_Src_to_GCS

  •  J120_GSC_mainExtract_BWQ_Data

  •  J120_GSC_Extract_BW_data_Source_14_EXT

  •  J110_GSC_mainExtract_Gsheet_data


Job description by steps

  1. In the job F100_PUR_DATA_EXTRACT:
    1. A parallel extraction of all data sources is launched

The flow triggers the extraction of all sources needed for the Global Supply Chain dashboard.


Main jobs for source extraction
  •  J110_GSC_mainExtract_Gsheet_data


Job description by steps

  1. Parallel extraction of all Google Sheet sources:
    1. A parallel extraction of all Google Sheets is launched
    2. Simple transformations are applied to align with the specifications
  2. Upload files to GCS
    1. The files are uploaded into the GSC folder of the target bucket


Main jobs for source extraction
  •  J130_GSC_mainExtract_Files_Src_to_GCS


Job description by steps

  1. Filename creation:
    1. A filename following our standard naming convention is created to properly upload the files into Cloud Storage
  2. Parallel GCS upload
    1. Each Google Sheet table is read
    2. Simple transformations are applied to align with the specifications, resulting in a CSV stored in the Tmp folder
    3. Each file is uploaded into the GSC folder of the target bucket
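A naming convention of this kind could look like the following sketch. The exact pattern (source name, business date, sequence number) is an assumption for illustration, not the documented standard:

```python
from datetime import date

def gcs_filename(source: str, business_date: date, seq: int = 1) -> str:
    """Hypothetical standard name: <SOURCE>_<YYYYMMDD>_<NNN>.csv"""
    return f"{source.upper()}_{business_date:%Y%m%d}_{seq:03d}.csv"

print(gcs_filename("hedges", date(2024, 1, 31)))  # HEDGES_20240131_001.csv
```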


Main jobs for source extraction
  •  J120_GSC_Extract_BW_data_Source_14_EXT


Job description by steps

  1. Login to Salesforce:
    1. The job starts by establishing a connection to Salesforce using the account sbs-is-appli-qlikview.support@solvay.com.crm
  2. Extraction of the Salesforce objects
    1. The list of Salesforce objects is extracted in sequence
    2. Depending on the source, some mappings may be applied
    3. Each source is uploaded to GCS as a CSV file


Main jobs for source extraction
  •  J120_GSC_mainExtract_BWQ_Data_Source


Job description by steps

  1. Parallel execution of Xtract jobs:
    1. The job triggers in parallel all the jobs needed to extract data from BW.
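The parallel trigger can be sketched with a thread pool. The job names and the `run_extract` stub below are purely illustrative stand-ins for the Xtract HTTP calls:

```python
from concurrent.futures import ThreadPoolExecutor

def run_extract(job_name: str) -> str:
    """Stub standing in for one Xtract HTTP extraction call."""
    return f"{job_name}: done"

# Illustrative BW extraction job names.
jobs = ["BW_SALES", "BW_DELIVERIES", "BW_STOCK"]
with ThreadPoolExecutor(max_workers=len(jobs)) as pool:
    results = list(pool.map(run_extract, jobs))
print(results)  # ['BW_SALES: done', 'BW_DELIVERIES: done', 'BW_STOCK: done']
```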



4.3 - Load Staging and ODS

Main jobs for loading Staging
  • F200_GSC_Gcs_To_Stg
  • J200_GSC_Gcs_To_Stg

Job description by steps

  1. First the job retrieves the business date from the business_date.csv file. This is used to target only the files of the current execution in GCS.
  2. The flow then calls the job J200_GSC_Gcs_To_Stg.

  • J200_GSC_Gcs_To_Stg
  1. The job launches a parallel load per source type into the Staging tables
    1. A list of the possible CSV files, with the associated Staging table name and separator, is used to create a sequence
    2. Each staging table is truncated before a new upload
    3. The job gets all the files starting with the desired prefix and, for each of them, calls the standard job J001_GCS_to_STAGING_LOCAL
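Selecting only the files of the current run by prefix (step 3) amounts to a filter like the sketch below; the filename pattern is an assumption, not taken from the job:

```python
def files_for_run(all_files, prefix, business_date):
    """Keep only files of the current execution, assuming names
    start with <PREFIX>_<business_date> (hypothetical convention)."""
    wanted = f"{prefix}_{business_date}"
    return [name for name in all_files if name.startswith(wanted)]

listing = ["HEDGES_20240131_001.csv", "HEDGES_20240130_001.csv", "CO2_20240131_001.csv"]
print(files_for_run(listing, "HEDGES", "20240131"))  # ['HEDGES_20240131_001.csv']
```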

Main jobs for loading the ODS
  • F300_GSC_STG_TO_ODS

  • J300_GSC_STG_TO_ODS


Job description by steps


  1. The flow is used to call the job J300_GSC_STG_TO_ODS




  1. Parallel loading of the ODS tables by source type
    1. A parallel loading of the ODS tables is done by source type (BW, Google Sheets, Salesforce)
  2. Load in sequence all ODS tables of a source
    1. Given a source type, the list of staging and ODS tables is hardcoded.
    2. The ODS table is truncated
    3. The standard job J001_STAGING_TO_ODS is called to load the ODS table


Since the staging and ODS tables are identical and the tables are always truncated, this step is unnecessary.
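The truncate-then-copy step amounts to two SQL statements per table; this sketch just builds them. The table names are illustrative, and the real work is done by the Talend job J001_STAGING_TO_ODS:

```python
def staging_to_ods_sql(staging: str, ods: str) -> list:
    """Truncate-then-insert statements for an identical-schema copy."""
    return [
        f"TRUNCATE TABLE {ods}",
        f"INSERT INTO {ods} SELECT * FROM {staging}",
    ]

# Hypothetical table names for illustration.
for stmt in staging_to_ods_sql("STG.stg_gsc_sales", "ODS.ods_gsc_sales"):
    print(stmt)
```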

4.4 - Prepare DM files

Main jobs for building DM
  • F400_GSC_ODS_to_Prep_DM_Files

    F400_PURCHASING_STEP
    • F410_GSC_ODS_to_DTM
    • F420_GSC_ODS_to_DTM_Part_1
    • F420_GSC_ODS_to_DTM_Part_5
    • F420_GSC_ODS_to_DTM_Part_6
    • F430_GSC_ODS_to_DTM

Job description by steps

  1. This job is used to call a series of subjobs that create the files needed to build the dimension and fact tables

Prepare DM file
  • F410_GCS_ODS_to_DTM

Job description by steps

The job computes steps 19-28 of the specification document

  1. The job reads data from the ODS table and applies the necessary mappings;
  2. The job outputs PIVOT and FACT files which are stored on the remote engine and will be uploaded to the DM in the J500 jobs.
  3. The job creates the table FACT_gsc_target


Prepare DM file
  • F430_GCS_ODS_to_DTM


Job description by steps

The job computes steps 41-49 of the specification document

  1. The job reads data from the ODS table and applies the necessary mappings;
  2. The job outputs PIVOT and FACT files which are stored on the remote engine and will be uploaded to the DM in the J500 jobs.
  3. The job also builds the files necessary to create the airfreight KPI following these specifications


Prepare DM file
  • F420_GSC_ODS_to_DTM_PART_1


Job description by steps

The job computes steps 33,34, and 35-49 of the specification document

  1. The job reads data from the ODS table and applies the necessary mappings;
  2. The job outputs PIVOT and FACT files which are stored on the remote engine and will be uploaded to the DM in the J500 jobs.

Prepare DM file
  • F420_GSC_ODS_to_DTM_PART_5


Job description by steps

The job computes steps 33,34, and 60-69 of the specification document

  1. The job reads data from the ODS table and applies the necessary mappings;
  2. The job outputs PIVOT and FACT files which are stored on the remote engine and will be uploaded to the DM in the J500 jobs.

The job uses the tUnpivotRow component.

4.5 - Compute Perfect Order Rate

Main job for computing the Perfect Order Rate
  • F420_GSC_ODS_to_DTM_PART_6


Job description by steps

The job creates the table FACT_POR, which is displayed by the Perfect Order Rate dashboard.

  1. The job reads data from the ODS tables and applies the necessary mappings. The main ODS tables are:
    1. ODS_0000_F025_F_W_GSC_MVSDSO45_SD0001
    2. ODS_0000_F026_F_W_GSC_Mvsdso51_0001
    3. ODS_0000_F020_F_W_GSC_Sf_Complaint_Case
  2. Each step outputs a FACT file which is stored in the Tmp folder
  3. The files are stacked to create a single bulk output file which is used to create the FACT_POR table.
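Stacking several FACT files into one bulk file (step 3) is essentially a header-aware concatenation, as in this sketch (in-memory strings stand in for the Tmp-folder files):

```python
import csv
import io

def stack_fact_files(csv_texts):
    """Concatenate FACT CSVs sharing the same header into one bulk file,
    writing the header only once."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    header_written = False
    for text in csv_texts:
        reader = csv.reader(io.StringIO(text))
        header = next(reader)        # skip the per-file header
        if not header_written:
            writer.writerow(header)  # keep it only for the first file
            header_written = True
        writer.writerows(reader)
    return out.getvalue()

parts = ["id,qty\n1,10\n", "id,qty\n2,20\n"]
print(stack_fact_files(parts))  # id,qty\n1,10\n2,20\n
```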

4.6 - Load to DM (calculations and transformations)


Main jobs for loading the DM
  • F501_GSC_Files_To_FACT

  • J501_GSC_ODS_to_DTM_FACT

Job description by steps

This job is used to call the job J501_GSC_ODS_to_DTM_FACT

J501_GSC_ODS_to_DTM_FACT
  1. First the job truncates the table DM.FACT_gsc_tmp
  2. The job reads all files whose name starts with FACT_FILE inside the InOut folder and uploads them into the FACT_gsc_tmp table. If a file is successfully inserted, the local file is deleted from the remote engine
  3. The job truncates the table DM.FACT_gsc
  4. The job copies the data from FACT_gsc_tmp into FACT_gsc
  5. The job truncates the table DM.FACT_gsc_tmp
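The five steps above follow a load-via-temporary-table pattern; the ordering can be sketched as follows (pseudo-SQL strings only — the real job uses Talend BigQuery components):

```python
def load_via_tmp(tmp, final, files):
    """Ordered statements for the tmp-table pattern: stage all files in a
    temporary table, then swap the content into the final table."""
    stmts = [f"TRUNCATE TABLE {tmp}"]
    stmts += [f"LOAD '{f}' INTO {tmp}" for f in files]  # stands in for the bulk upload
    stmts += [
        f"TRUNCATE TABLE {final}",
        f"INSERT INTO {final} SELECT * FROM {tmp}",
        f"TRUNCATE TABLE {tmp}",
    ]
    return stmts

for s in load_via_tmp("DM.FACT_gsc_tmp", "DM.FACT_gsc", ["FACT_FILE_1.csv"]):
    print(s)
```

Staging into a tmp table first keeps the final table consistent if any single file load fails.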

Main jobs for loading the DM
  • J500_PURCHASING_PIVOT_TO_DTM


Job description by steps

  1. The job starts by truncating the table BDG_purchasing_tmp
  2. Read Pivot files:
    • The job iterates over all files of the Pivot folder (whose name starts with PIVOT_TABLE_).
    • Each file is processed, and a lookup is done with the two following Google Sheets to get a uniform GBU text.
    • A BigQueryBulkOutput file is created
  3. Upload data into the BDG_purchasing_tmp table:
    1. The job uploads all bulk files into Cloud Storage, and each file is appended to the BDG_purchasing_tmp table
  4. The job truncates the table BDG_purchasing.
  5. The job copies the data from the BDG_purchasing_tmp table into BDG_purchasing. During the copy the columns mont_year and vc_year are converted into strings.
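The per-row processing (GBU lookup plus the string conversion of mont_year and vc_year) can be sketched like this — the lookup content and the gbu column name are assumptions for illustration:

```python
def normalize_pivot_row(row, gbu_lookup):
    """Apply the GBU lookup and cast mont_year/vc_year to strings
    (column names as described above; lookup content is illustrative)."""
    out = dict(row)
    if "gbu" in out:
        out["gbu"] = gbu_lookup.get(out["gbu"], out["gbu"])
    for col in ("mont_year", "vc_year"):
        if col in out:
            out[col] = str(out[col])
    return out

lookup = {"SPECIALTY POLY.": "Specialty Polymers"}  # hypothetical mapping
row = {"gbu": "SPECIALTY POLY.", "mont_year": 202401, "vc_year": 2024}
print(normalize_pivot_row(row, lookup))
```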

Main jobs for loading the DM
  • J510_PURCHASING_FACT_TO_DTM

Job description by steps

  1. The job starts by truncating the table FACT_purchasing_tmp
  2. Read Fact files:
    • The job iterates over all files of the Pivot folder (whose name starts with FACT_TABLE_).
    • Each file is processed to match the desired schema of the fact table
    • A BigQueryBulkOutput file is created
  3. Upload data into the FACT_purchasing_tmp table:
    1. The job uploads all bulk files into Cloud Storage, and each file is appended to the FACT_purchasing_tmp table
  4. The job truncates the table FACT_purchasing.
  5. The job copies the data from the FACT_purchasing_tmp table into FACT_purchasing

Main jobs for loading the DM
  • J520_PURCHASING_SPEND_TO_DTM


Job description by steps

  1. Read the SPEND_TARGET_1 file:
    • The job reads the file SPEND_TARGET_1.csv in the folder Spend_Targe
    • A BigQueryBulkOutput file is created
  2. Upload the bulk file to Google Cloud Storage
  3. The job uses the bulk file to overwrite the values in the table DIM_spend_target.


4.7 - Scheduling and Automation

TMC - PL_QV_TO_TABLEAU_PURCHASING - Daily at 5 AM.

4.8 - Data Validation


Responsible & contact points:

  • Application Owner - Mathieu Pourqué
  • Delivery Manager - Donia Rachidi
  • Project Manager - Vitaly
  • Data Architect - Joao Fonseca
  • Tableau Dev - Maxime Marboeuf
  • Data Engineer - Virgil Lissassi; replaced by Matteo Menghetti

