
4.1 - Talend Integration

Source data integration with the Talend ETL tool

  • FTP Server (Meteologica):

    • Talend connects to the FTP server where files containing current and future electricity prices for Spain, Italy, France, and Germany are located.
    • Talend retrieves these files from the FTP server.
    • The retrieved files are then loaded into Google Cloud Storage.
  • PostgreSQL Database (Vendohm):

    • Talend establishes a direct connection with the Vendohm PostgreSQL database that stores sensor values identified by specific curve IDs.
    • It extracts relevant data from the PostgreSQL database.
    • This extracted data is loaded into Google Cloud Storage as files.
  • Google Sheets (Hedges, Marginal cost, CO2):

    • Talend integrates with Google Sheets, where marginal costs, CO2 costs, and hedging information are stored.
    • It retrieves this data from Google Sheets.
    • Similar to the other sources, this data is also loaded into Google Cloud Storage as files.

Data Transformation and Loading to Google BigQuery:

  • Once data from all three sources (FTP server, PostgreSQL database, Google Sheets) is available in Google Cloud Storage as files, Talend proceeds with data transformation and loading.
  • Talend performs data transformations as needed, including cleansing, mapping, and structuring the data for consistency.
  • The transformed data is loaded into stage, operational data store (ODS), and data mart tables within Google BigQuery.
  • These tables are organized to facilitate efficient querying and reporting for energy optimization purposes.

By utilizing Talend for data extraction, transformation, and loading (ETL), the web app ensures that data from diverse sources is collected, processed, and structured for analysis and reporting within Google BigQuery, enabling users to make informed decisions based on up-to-date and accurate data.
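
As an illustration of the load path into BigQuery, the sketch below shows how a CSV file in Google Cloud Storage could be loaded into a stage table and then pushed on into an ODS table with the google-cloud-bigquery Python client. All project, dataset, table, and bucket names are placeholders, not the project's actual objects, and the transformations performed by the real Talend jobs are omitted.

```python
from google.cloud import bigquery

# Placeholder names -- the real project, datasets and tables are defined in the Talend jobs.
PROJECT = "energy-optimization-example"
STAGE_TABLE = f"{PROJECT}.stage.meteologica_prices"
ODS_TABLE = f"{PROJECT}.ods.meteologica_prices"

client = bigquery.Client(project=PROJECT)

# 1. Load a CSV file from Google Cloud Storage into a stage table (truncate and load).
load_job = client.load_table_from_uri(
    "gs://example-bucket/meteologica/spain_prices.csv",   # placeholder GCS URI
    STAGE_TABLE,
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    ),
)
load_job.result()  # wait for the load to complete

# 2. Push the staged rows on into the ODS table (the real jobs add transformations here).
client.query(f"INSERT INTO `{ODS_TABLE}` SELECT * FROM `{STAGE_TABLE}`").result()
```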

4.2 - Source Data Extraction

Main jobs for source extraction
  • J001_FTP_to_GCS-METEOLOGICA_FRANCE
  • J002_FTP_to_GCS-METEOLOGICA_ITALY
  • J003_FTP_to_GCS-METEOLOGICA_GERMANY
  • J004_FTP_to_GCS-METEOLOGICA_SPAIN
  • J005_FTP_to_GCS-METEOLOGICA_SPAIN_ENS
  • J006_FTP_to_GCS-METEOLOGICA_SPAIN_OBS

Job description by steps

  1. FTP Server Connection:
    1. Establishes a secure connection to the Meteologica FTP server.
    2. Retrieves files from a predefined folder located on the FTP server.
  2. File Consolidation and Renaming:
    1. After fetching the files, the job consolidates and renames them according to predefined naming conventions.
    2. This step ensures uniformity and consistency in file naming for further processing.
  3. Creating File Backups:
    1. As a best practice for data integrity, the job generates copies of the original files.
    2. These copies are stored in an archive folder on the remote engine for backup and historical reference.
  4. Checking for Previously Processed Files:
    1. The job performs a check to determine if the pulled files have already been processed.
    2. This verification is crucial to prevent the reprocessing of files from previous days, as older files may still exist in the folder.
  5. Data Extraction and Transformation:
    1. For files that have not been processed earlier, the job proceeds with extraction and transformation.
    2. It extracts a datetime value from each file, located at a specified position within the file.
    3. Using this datetime information, the job calculates a new column, enabling precise timestamping of the data.
  6. CSV Output and Google Storage:
    1. The processed data, now enriched with the new timestamp column, is converted into a CSV format.
    2. The resulting CSV output files are then pushed to Google Cloud Storage for further storage and access.
  7. File Deletion:

    1. Once the files have been successfully processed and their data stored in Google Cloud Storage, they are deleted from the remote engine.
    2. This deletion step helps manage storage resources efficiently and ensures that only processed data is retained.
This Talend job ensures the efficient handling of data from the Meteologica FTP server, including consolidation, backup, and validation steps. It also extracts and enriches the data, making it ready for use in downstream processes, ultimately contributing to the accuracy and usability of the energy optimization project's data.
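
For orientation, the following Python sketch mirrors the main steps of these jobs: FTP pull, archive copy, timestamp enrichment, upload to Google Cloud Storage, and local cleanup. The host, credentials, folder paths, bucket name, and the position of the datetime value inside the file are all assumptions made for illustration; the actual logic lives in the Talend jobs themselves.

```python
import csv
import ftplib
from pathlib import Path

from google.cloud import storage

# All host names, credentials, paths and bucket names below are placeholders for illustration.
FTP_HOST = "ftp.example.com"
REMOTE_DIR = "/meteologica/spain"
LOCAL_DIR = Path("/tmp/meteologica")
ARCHIVE_DIR = Path("/tmp/meteologica_archive")
BUCKET = "example-gcs-bucket"

LOCAL_DIR.mkdir(parents=True, exist_ok=True)
ARCHIVE_DIR.mkdir(parents=True, exist_ok=True)

# 1. Connect to the FTP server and pull every file from the predefined folder.
with ftplib.FTP(FTP_HOST, user="user", passwd="secret") as ftp:
    ftp.cwd(REMOTE_DIR)
    for name in ftp.nlst():
        with open(LOCAL_DIR / name, "wb") as fh:
            ftp.retrbinary(f"RETR {name}", fh.write)

bucket = storage.Client().bucket(BUCKET)

for path in sorted(LOCAL_DIR.glob("*.csv")):
    # 2. Keep an archive copy for backup and historical reference.
    (ARCHIVE_DIR / path.name).write_bytes(path.read_bytes())

    # 3. Read the datetime value from a fixed position (row 1, column 1 is an assumption)
    #    and append it to every data row as a new timestamp column.
    with path.open(newline="") as fh:
        rows = list(csv.reader(fh))
    file_datetime = rows[0][0]
    enriched = [row + [file_datetime] for row in rows[1:]]

    out_path = LOCAL_DIR / f"processed_{path.name}"
    with out_path.open("w", newline="") as fh:
        csv.writer(fh).writerows(enriched)

    # 4. Push the enriched CSV to Google Cloud Storage, then delete the local copies.
    bucket.blob(f"meteologica/{out_path.name}").upload_from_filename(str(out_path))
    out_path.unlink()
    path.unlink()
```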

Main jobs for source extraction
  • J007_GSheet_to_GCS_HEDGING
  • J008_GSheet_to_GCS_MARGINAL_COST


Job description by steps

  1. Google Sheets Connection:

    1. The job initiates a connection to a specified Google Sheets file.
    2. It targets a specific sheet within the Google Sheets document, identified by its unique ID.
  2. Data Extraction and Formatting:

    1. The job extracts data from the designated Google Sheets sheet.
    2. The extracted data is converted into a CSV format with a predefined fixed schema.
    3. The schema format is consistent and predefined to ensure data uniformity and structure.
    4. The resulting CSV file is given a predefined name that incorporates the date of extraction.
  3. Copying to Google Storage:

    1. After the successful extraction and formatting of the data, the job copies the CSV file to Google Cloud Storage.
    2. This step serves as a secure and reliable means of storing the processed data in the cloud.
  4. File Deletion:

    1. Following the successful copying of the CSV file to Google Cloud Storage, the original file located on the remote engine is deleted.
    2. This cleanup step helps manage storage resources efficiently and maintains data integrity.

This Talend job efficiently extracts, formats, and securely stores data from a Google Sheets document into Google Cloud Storage, ensuring that the data is readily available for further analysis and processing while adhering to a fixed schema and naming conventions.
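
A rough equivalent of this flow in Python, using the gspread and google-cloud-storage libraries, could look like the sketch below. The spreadsheet ID, sheet name, bucket, and file-name pattern are placeholders; the real values are configured in the Talend job.

```python
import csv
import os
from datetime import date

import gspread
from google.cloud import storage

# Placeholder identifiers -- the real spreadsheet ID, sheet name and bucket are set in the job.
SPREADSHEET_ID = "placeholder-spreadsheet-id"
SHEET_NAME = "Hedging"
BUCKET = "example-gcs-bucket"

# 1. Connect to the Google Sheets file and target the specific sheet.
gc = gspread.service_account()                # reads a service-account credentials file
sheet = gc.open_by_key(SPREADSHEET_ID).worksheet(SHEET_NAME)

# 2. Extract the rows and write them out as CSV; the file name embeds the extraction date.
rows = sheet.get_all_values()
file_name = f"hedging_{date.today().isoformat()}.csv"
with open(file_name, "w", newline="") as fh:
    csv.writer(fh).writerows(rows)

# 3. Copy the CSV to Google Cloud Storage, then delete the local copy.
storage.Client().bucket(BUCKET).blob(f"gsheet/{file_name}").upload_from_filename(file_name)
os.remove(file_name)
```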

Main jobs for source extraction
  • J009_GSheet_to_GCS_MARGINAL_COST


Job description by steps


  1. Google Sheets Connection:

    1. The job establishes a connection to a specified Google Sheets file.
    2. It focuses on a particular sheet within the Google Sheets document, identified by its unique ID.
  2. Data Extraction and Formatting:

    1. The job extracts data from the designated Google Sheets sheet.
    2. The extracted data is transformed into a CSV format, adhering to a predefined, fixed schema.
    3. The schema format is consistent and predefined to ensure uniformity and data structure.
    4. The resulting CSV file follows a predefined naming convention, which incorporates the date of extraction.
  3. Data Reading:

    1. The job involves two distinct data reading steps.
      1. First Reading: The job reads only a single cell within the file, specifically the cell containing the year value. This is set to capture year-related data.
      2. Second Reading: The second read of the file is reserved for capturing the remaining data, defined by the schema and format. This step comprehensively reads the data specified in the schema.
  4. Copying to Google Storage:

    1. After successfully extracting and formatting the data, the job copies the CSV file to Google Cloud Storage.
    2. This step serves as a secure and reliable means of storing the processed data in the cloud.
  5. File Deletion:

    1. Following the successful copying of the CSV file to Google Cloud Storage, the original file located on the remote engine is deleted.
    2. This cleanup step helps manage storage resources efficiently and maintains data integrity.

This Talend job effectively extracts, formats, and securely stores data from a Google Sheets document into Google Cloud Storage. It features specialized data reading steps, ensuring the capture of year-related data separately, while adhering to predefined schemas and naming conventions. The deletion of the original file enhances data management and resource optimization.
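
The two-read pattern can be sketched as follows with gspread; the cell holding the year (A1 here) and the data range are assumptions made purely for illustration.

```python
import csv
from datetime import date

import gspread

# Placeholder spreadsheet ID, sheet name, cell and range -- the real ones are configured in the job.
SPREADSHEET_ID = "placeholder-spreadsheet-id"
SHEET_NAME = "CO2"

gc = gspread.service_account()
ws = gc.open_by_key(SPREADSHEET_ID).worksheet(SHEET_NAME)

# First reading: a single cell holding the year value (cell A1 is an assumption).
year = ws.acell("A1").value

# Second reading: the remaining data defined by the fixed schema (range is an assumption).
data_rows = ws.get("A3:M100")

# Combine both reads into one CSV, stamping the year onto every data row.
file_name = f"co2_{date.today().isoformat()}.csv"
with open(file_name, "w", newline="") as fh:
    writer = csv.writer(fh)
    for row in data_rows:
        writer.writerow([year] + row)
```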

Main jobs for source extraction
  • J010_GSheet_to_GCS_VENDOHM


Job description by steps

  1. PostgreSQL Database Connection:

    1. The job establishes a secure connection to a PostgreSQL database.
    2. This connection is secured using certificates and a client key provided on the remote engine.
  2. Data Extraction from PostgreSQL:

    1. The job retrieves data from a specified table in the PostgreSQL database.
    2. The schema for this data extraction process is predefined and consistent.
  3. Data Formatting and CSV Output:

    1. The extracted data is formatted and transformed into CSV format.
    2. The resulting CSV file is named according to a predefined convention, which includes the table name and the date of extraction.
    3. This structured naming ensures clarity and organization.
  4. Storage in Google Cloud Storage (GCS):

    1. The formatted CSV file is securely stored in Google Cloud Storage.
    2. It is placed within a designated "Vendohm" folder in the storage location.
  5. File Deletion and Connection Closure:

    1. After a successful data extraction and storage process, the original file located on the remote engine is deleted.
    2. Additionally, the connection to the PostgreSQL database is closed, ensuring proper resource management.

This Talend job effectively connects to a PostgreSQL database, retrieves data from a specified table, and securely stores the formatted data in CSV format within Google Cloud Storage. It employs robust security measures and follows strict naming conventions for data organization, with an emphasis on data integrity and resource optimization.
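
A simplified Python equivalent of this job, using psycopg2 with certificate-based SSL and the google-cloud-storage client, might look like the sketch below. Host, table, certificate paths, and bucket are placeholders, not the real Vendohm connection details.

```python
import csv
import os
from datetime import date

import psycopg2
from google.cloud import storage

# Placeholder connection details -- the real host, table and certificate paths live on the remote engine.
conn = psycopg2.connect(
    host="vendohm-db.example.com",
    dbname="vendohm",
    user="etl_user",
    sslmode="verify-full",
    sslrootcert="/etc/ssl/certs/vendohm-ca.pem",
    sslcert="/etc/ssl/certs/vendohm-client.pem",
    sslkey="/etc/ssl/private/vendohm-client.key",
)

TABLE = "sensor_values"                                   # placeholder table name
file_name = f"{TABLE}_{date.today().isoformat()}.csv"     # table name + extraction date

# Extract the rows and write them as CSV with a header row matching the fixed schema.
with conn, conn.cursor() as cur:
    cur.execute(f"SELECT * FROM {TABLE}")
    with open(file_name, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow([col[0] for col in cur.description])
        writer.writerows(cur.fetchall())
conn.close()  # close the PostgreSQL connection once extraction is done

# Store the CSV in the "Vendohm" folder in Google Cloud Storage, then clean up locally.
storage.Client().bucket("example-gcs-bucket").blob(f"Vendohm/{file_name}").upload_from_filename(file_name)
os.remove(file_name)
```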

4.3 - Data Loading to Google BigQuery

Main jobs for data loading
  • J001_Extraction_til_ODS_METEOLOGICA_FRANCE
  • J002_Extraction_til_ODS_METEOLOGICA_ITALY
  • J003_Extraction_til_ODS_METEOLOGICA_GERMANY
  • J004_Extraction_til_ODS_METEOLOGICA_SPAIN
  • J005_Extraction_til_ODS_METEOLOGICA_SPAIN_ENS
  • J006_Extraction_til_ODS_METEOLOGICA_SPAIN_OBS


Job description by steps

Job Initialization and Logging:

  1. Job Run Initialization (Subjob 1):
    1. The job starts with the first subjob, where the job ID is retrieved, and the extraction date is calculated.
    2. This crucial step initializes the job and captures essential metadata.

Data Extraction and Processing:

  1. Files Extraction from FTP Server (Subjob 2):

    1. In the next subjob, data files are extracted from the Meteologica FTP server.
  2. Error Handling and Logging (Subjob 3 - On Failure):

    1. If the data extraction subjob (Subjob 2) encounters an issue or does not finish successfully, the job transitions to a subjob for writing error logs.
    2. Logging is essential for tracking and diagnosing issues in the data extraction process.
  3. Data Loading and Transformation (Subjob 4 - On Success):

    1. When the data extraction subjob (Subjob 2) successfully completes, the job proceeds with data loading and transformation.
    2. In this stage, each extracted file is processed one by one.
    3. For each file, a subjob is called to load data from the CSV file into a stage table, preparing it for further processing.
    4. Subsequently, data from the stage table is loaded into an operational data store (ODS) table.
    5. All necessary parameters, such as table names and connection parameters, are provided to ensure accurate data extraction and loading.

Logging and Reporting:

  1. Logging (Subjob 5):
    1. At the end of the data loading and transformation process, a subjob is called to write logs.
    2. Logging captures critical information about the data processing, ensuring transparency and traceability.

This Talend job orchestrates the data extraction, loading, and transformation process. It begins with initialization and metadata capture, proceeds to extract data from the FTP server, handles errors if they occur, and performs data loading and transformation for energy optimization purposes. Detailed logging and error handling mechanisms enhance job monitoring and maintain data integrity.
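
The subjob flow described above can be summarised with a small, self-contained Python sketch. The functions below are stand-ins for the Talend subjobs (they only log what the real subjobs would do) and the table names are placeholders; the point is the on-failure / on-success branching and the per-file stage-to-ODS sequence.

```python
import logging
from datetime import date
from typing import List

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("J00x_Extraction_til_ODS")

# Stub functions standing in for the Talend subjobs -- they only log what the real subjobs do.
def extract_from_ftp(prefix: str) -> List[str]:
    log.info("Subjob 2: pulling %s files from the FTP server", prefix)
    return [f"{prefix}_prices.csv"]            # placeholder file list

def load_csv_to_stage(csv_file: str, stage_table: str) -> None:
    log.info("Subjob 4a: loading %s into stage table %s", csv_file, stage_table)

def load_stage_to_ods(stage_table: str, ods_table: str) -> None:
    log.info("Subjob 4b: loading %s into ODS table %s", stage_table, ods_table)

def write_log(job_id: str, status: str) -> None:
    log.info("Subjob 5: job %s finished with status %s", job_id, status)

def run_job(job_id: str, files_prefix: str) -> None:
    extraction_date = date.today()             # Subjob 1: capture job id and extraction date
    log.info("Subjob 1: job %s started, extraction date %s", job_id, extraction_date)
    try:
        files = extract_from_ftp(files_prefix)
    except Exception:
        write_log(job_id, status="ERROR")      # Subjob 3: write error logs on failure
        raise
    for csv_file in files:                     # Subjob 4: per-file stage then ODS load
        load_csv_to_stage(csv_file, stage_table="stage.meteologica")
        load_stage_to_ods("stage.meteologica", "ods.meteologica")
    write_log(job_id, status="OK")

run_job("J004", "meteologica_spain")
```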

Main jobs for data loading
  • J007_Extraction_til_ODS_GSHEET_HEDGING
  • J008_Extraction_til_ODS_GSHEET_MARGINAL
  • J009_Extraction_til_ODS_METEOLOGICA_CO2


Job description by steps

Job Initialization and Logging:

  1. File Name Configuration (Global Variable):

    1. The job begins by configuring the file name as a global variable.
    2. This variable is used throughout the job to ensure consistency in file handling.
  2. Job Run Initialization (Subjob 1):

    1. In the first subjob, the job ID is retrieved, and the extraction date is calculated.
    2. This step sets the stage for the data processing and captures essential metadata.

Data Extraction from Google Sheets:

  1. Files Extraction from Google Sheets (Subjob 2):

    1. The next subjob is responsible for extracting data from Google Sheets.
    2. It targets the specified file based on the global variable (file name) configured earlier.
  2. Error Handling and Logging (Subjob 3 - On Failure):

    1. In case the data extraction subjob (Subjob 2) encounters any issues or doesn't finish successfully, the job transitions to a subjob for writing error logs.
    2. Logging is crucial for tracking and diagnosing problems during the data extraction process.

Data Loading and Transformation:

  1. Data Loading and Transformation (Subjob 4 - On Success):
    1. When the data extraction subjob (Subjob 2) successfully completes for the provided file name, the job proceeds with data loading and transformation.
    2. A subjob is called to load data from the CSV file into a stage table, preparing it for further processing.
    3. Subsequently, data from the stage table is loaded into an operational data store (ODS) table.
    4. All necessary parameters, such as table names and connection parameters, are provided to ensure accurate data extraction and loading.

Logging and Reporting:

  1. Logging (Subjob 5):
    1. At the end of the data loading and transformation process, a subjob is called to write logs.
    2. Logging captures crucial information about the data processing, enhancing transparency and traceability.

This Talend job streamlines the data extraction, loading, and transformation process. It begins with file name configuration, proceeds with metadata capture, extracts data from Google Sheets, handles errors if they occur, and performs data loading and transformation, all while maintaining detailed logs for monitoring and maintaining data integrity.
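
The "write error logs" subjob mentioned above can be pictured as appending one row to a log table in BigQuery. The sketch below is hypothetical: the project, dataset, table, and field names are assumptions, and the actual log format used by the Talend jobs may differ.

```python
from datetime import datetime, timezone

from google.cloud import bigquery

# Hypothetical error-log writer -- all names below are placeholders, not the project's real log table.
def write_error_log(job_id: str, file_name: str, message: str) -> None:
    client = bigquery.Client(project="energy-optimization-example")   # placeholder project
    errors = client.insert_rows_json(
        "energy-optimization-example.logs.etl_errors",                # placeholder log table
        [{
            "job_id": job_id,
            "file_name": file_name,
            "message": message,
            "logged_at": datetime.now(timezone.utc).isoformat(),
        }],
    )
    if errors:
        raise RuntimeError(f"failed to write error log: {errors}")

write_error_log("J007", "hedging_extract.csv", "Google Sheets extraction did not finish successfully")
```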

Main jobs for data loading
  • J010_Extraction_til_ODS_VENDOHM


Job description by steps

Global Variable Configuration:

  1. File Names and Table Names Configuration (Global Variables):
    1. The job starts by configuring global variables for file names and table names.
    2. These variables are populated by reading a list of all tables and files that subjobs need to load.
    3. This dynamic configuration ensures that the job adapts to the specific files and tables required for processing.

Data Extraction and Processing in Iterations:

  1. Iteration Loop (One by One):

    1. The job processes tables and files iteratively, one by one, based on the configured global variables.
  2. Job Run Initialization (Subjob 1):

    1. Within each iteration, the first subjob is called to retrieve the job ID and calculate the extraction date.
    2. This initialization step captures vital metadata for the ongoing data processing.
  3. Files Extraction from PostgreSQL Database (Subjob 2):

    1. In the next subjob, data files are extracted from the PostgreSQL database.
    2. Each iteration targets a specific file based on the global variables (file names) configured earlier.
  4. Error Handling and Logging (Subjob 3 - On Failure):

    1. If the data extraction subjob (Subjob 2) encounters any issues or doesn't finish successfully, the job transitions to a subjob for writing error logs.
    2. Logging is crucial for tracking and diagnosing problems during the data extraction process.
  5. Data Loading and Transformation (Subjob 4 - On Success):

    1. When the data extraction subjob (Subjob 2) successfully completes for the provided file name, the job proceeds with data loading and transformation.
    2. A subjob is called to load data from the CSV file into a stage table, preparing it for further processing.
    3. Subsequently, data from the stage table is loaded into an operational data store (ODS) table.
    4. All necessary parameters, such as table names and connection parameters, are provided to ensure accurate data extraction and loading.

Logging and Reporting:

  1. Logging (Subjob 5):
    1. At the end of each iteration, a subjob is called to write logs.
    2. Logging captures critical information about the data processing, ensuring transparency and traceability for each processed file and table.

This Talend job is designed for dynamic data extraction and processing, adapting to different tables and files. It initializes each iteration with metadata capture, extracts data from the PostgreSQL database, handles errors when necessary, performs data loading and transformation with parameterized configurations, and maintains detailed logs for monitoring and data integrity across multiple files and tables.
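
The iteration pattern can be summarised with the short sketch below; the table/file list and the per-iteration function are placeholders for the Talend context variables and subjobs described above.

```python
from datetime import date

# Placeholder list -- in the real job this is read at start-up and stored in global variables.
TABLES_AND_FILES = [
    ("sensor_values", "sensor_values.csv"),
    ("curve_metadata", "curve_metadata.csv"),
]

def process_table(table_name: str, file_name: str) -> None:
    extraction_date = date.today()         # Subjob 1: job id and extraction date per iteration
    print(f"[{extraction_date}] extracting {table_name} -> {file_name}")
    # Subjob 2 (extract from PostgreSQL), Subjob 4 (stage then ODS load) and Subjob 5 (logging)
    # would run here, with the table and file name passed in as parameters.

# Iteration loop: tables and files are processed one by one, driven by the configured list.
for table_name, file_name in TABLES_AND_FILES:
    process_table(table_name, file_name)
```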

4.4 - Load to DM (calculations and transformations)


4.5 - Error Handling


4.6 - Scheduling and Automation


4.7 - Performance Optimization


4.8 - Data Validation


4.9 - Dependencies and Order of Operations


4.10 - Versioning and Change Management


Responsible & contact points:

  • Alessandro Mainardi - Project Owner
  • Simon Bourguignon - Delivery Manager
  • Alba Carrero / Gaetan Frenoy - Product Owner
  • Rui Ferraz - Project Manager

