ELN API • Oracle Views • Datalab → Talend → GCP BigQuery
Project Name | Each ELN product follows a distinct IDBS access rights workflow |
ETL | Talend |
Status | Deployed in PROD |
1. Project Overview
This document serves as the official technical handover wiki for the ETL pipeline that ingests data from multiple source systems.
The ELN API, Oracle Database Views, and the Datalab Platform loads the processed data into Google Cloud Platform (GCP) BigQuery tables using Talend as the ETL orchestration tool.
Objectives
- Centralise laboratory and analytical data from heterogeneous source systems into a unified GCP data warehouse.
- Enforce project-level access control for ELN data by routing each collaborator's data to their designated pipeline scope.
- Provide a reliable, maintainable, and documented ETL process for the incoming team to facilitate the flow migration from Talend to Microsoft Fabric.
Scope
- All active Talend jobs connecting to ELN API, Oracle Views, and DataLab.
- Access management rules governing ELN products assignment per collaborator.
- GCP BigQuery datasets and tables that serve as the final landing zone.
2. Architecture & Data Flow
The ETL pipeline follows a classic ETL pattern orchestrated by Talend. The high-level data flow is described below:
[ELN API] ─┐
[Oracle Views ] ──► [Talend ETL Jobs] ──► [GCP BigQuery Tables]
[Datalab Platform ]─┘
Flow Steps
- Extract: Talend connects to each source (ELN REST API, Oracle JDBC, Datalab (GCP) ) and extracts raw data according to the job configuration.
- Filter / Route: For ELN sources, the flow applies a project-based access filter, ensuring each collaborator's data is scoped to their assigned ELN project(s) only.
- Transform: Data is cleaned, typed, and mapped to the target GCP schema within the Talend job components.
- Load: Transformed data is written to the corresponding BigQuery dataset and table using the GCP BigQuery connector.
3. Source Systems
Source System | Type | Description |
ELN API | REST API | Electronic Lab Notebook – exposes project data per collaborator access level |
Oracle Views | Relational DB Views | Pre-aggregated relational data extracted via JDBC connections |
Datalab Platform | Internal Data Platform | Analytical datasets and processed outputs from the Datalab environment |
3.1 ELN API
Authentication:
- REST / HTTP: Protocol
- Authentication Type : Basic HTTP
- URL: https://mylab.syensqo.com:8443
- Login: (Contact Admin)
- Pasword: (Contact Admin)
Note : Get the containers IDs list from the Oracle views and use it as a parameter variable.
3.2 Oracle DB
Authentication:
(TEST)
- IDBS_DB_Database = ELNTEST.eua.solvay.com
- IDBS_DB_Host = ACEW1TELNBDB02.eua.solvay.com
- IDBS_DB_Port = 1521
- IDBS_DB_UserName = (Contact Admin)
- IDBS_DB_Password = (Contact Admin)
(PROD)
- IDBS_DB_Database = ELNPROD.eua.solvay.com
- IDBS_DB_Host = ACEW1TELNBDB02.eua.solvay.com
- IDBS_DB_Port = 1521
- IDBS_DB_UserName = (Contact Admin)
- IDBS_DB_Password = (Contact Admin)
View:
- IDBS_EWB_SEC.HISTORY_ENTRIES.TARGET_ENTITY_ID AS Experiment_ID
- IDBS_EWB_SEC.HISTORY_ENTRIES.ENTITY_PATH AS FOLDERPATH
- IDBS_EWB_SOLVAY_CUSTOM.DM_UserAccessPath_Full
- DBS_EWB_SOLVAY_CUSTOM.DM_EntityRolePermissions_Full
Note:
The Oracle database connection is enabled for Talend Remote Engines. However, when connecting to the database via a VDI, the connection may be disabled.
3.3 Datalab
- Source: GCP Bigquery table
- Project: datagrow
- Table and filter: application_use WHERE user_status='active' AND ROLE IN ('admin', 'lab_manager')
3.4 GCP
4. ELN Projects & Collaborator Access Control
The ELN system hosts multiple independent research projects. The Talend ETL flow enforces a project-level access policy: each collaborator is assigned to one or more specific ELN projects, and the pipeline only extracts and loads the data belonging to that collaborator's assigned projects.
6. Target – GCP BigQuery
Error Handling Strategy
- All jobs implement a tDie component on the error branch to terminate gracefully and log a meaningful message.
- API connectivity failures trigger a tWarn followed by a retry loop.
- Data validation errors (null primary keys, type mismatches) are routed to a reject tLogRow component and written to a dedicated GCP error table.
- Job execution logs are visible in TMC.
Scheduling
- ELN → GCP job: Daily at 02:00 AM (server local time)
- Oracle → GCP job: Daily at 03:00 AM
- Datalab → GCP job: Daily at 03:00 AM