ELN API • Oracle Views • Datalab → Talend → GCP BigQuery
| Field | Value |
|---|---|
| Project Name | Each ELN product follows a distinct IDBS access rights workflow |
| ETL | Talend |
| Status | Deployed in PROD |
1. Project Overview
This document serves as the official technical handover wiki for the ETL pipeline that ingests data from multiple source systems.
Data from the ELN API, Oracle Database Views, and the Datalab Platform is processed and loaded into Google Cloud Platform (GCP) BigQuery tables, with Talend as the ETL orchestration tool.
Objectives
- Centralise laboratory and analytical data from heterogeneous source systems into a unified GCP data warehouse.
- Enforce project-level access control for ELN data by routing each collaborator's data to their designated pipeline scope.
- Provide a reliable, maintainable, and documented ETL process for the incoming team, easing the migration of these flows from Talend to Microsoft Fabric.
Scope
- All active Talend jobs connecting to the ELN API, Oracle Views, and Datalab.
- Access management rules governing ELN project assignment per collaborator.
- GCP BigQuery datasets and tables that serve as the final landing zone.
2. Architecture & Data Flow
The pipeline follows a classic extract-transform-load pattern orchestrated by Talend. The high-level data flow is shown below:
[ELN API]          ─┐
[Oracle Views]      ├──► [Talend ETL Jobs] ──► [GCP BigQuery Tables]
[Datalab Platform] ─┘
Flow Steps
- Extract: Talend connects to each source (ELN REST API, Oracle via JDBC, Datalab on GCP) and extracts raw data according to the job configuration.
- Filter / Route: For ELN sources, the flow applies a project-based access filter, ensuring each collaborator's data is scoped to their assigned ELN project(s) only.
- Transform: Data is cleaned, typed, and mapped to the target GCP schema within the Talend job components.
- Load: Transformed data is written to the corresponding BigQuery dataset and table using the GCP BigQuery connector.
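The four flow steps above can be sketched as a minimal Python pipeline. All function names and row shapes here are illustrative assumptions, not actual Talend job internals:

```python
# Minimal sketch of the extract -> filter/route -> transform -> load flow.
# Everything here is a placeholder standing in for Talend components.

def run_pipeline(source, assigned_projects, target_dataset):
    rows = extract(source)                                                # Extract
    rows = [r for r in rows if r.get("project_id") in assigned_projects]  # Filter / Route
    rows = [transform(r) for r in rows]                                   # Transform
    return load(rows, target_dataset)                                     # Load

def extract(source):
    # Placeholder for an ELN REST call, Oracle JDBC query, or Datalab read.
    return source.read()

def transform(row):
    # Placeholder for cleaning, type casting, and schema mapping.
    return {k.lower(): v for k, v in row.items()}

def load(rows, dataset):
    # Placeholder for the BigQuery connector write; returns what was written.
    return dataset, rows
```

The project filter runs before the transform, mirroring how the Talend jobs scope data to a collaborator's assigned ELN projects before any mapping happens.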
3. Source Systems
| Source System | Type | Description |
|---|---|---|
| ELN API | REST API | Electronic Lab Notebook – exposes project data per collaborator access level |
| Oracle Views | Relational DB Views | Pre-aggregated relational data extracted via JDBC connections |
| Datalab Platform | Internal Data Platform | Analytical datasets and processed outputs from the Datalab environment |
3.1 ELN API
Authentication:
Note: fetch the list of container IDs from the Oracle views and pass it to the ELN API as a parameter variable.
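The note above can be illustrated with a small sketch. The base URL, endpoint path, and `containerIds` query parameter are assumptions about the ELN API, not confirmed names:

```python
# Sketch: container IDs retrieved from the Oracle views are joined into a
# query parameter on the ELN API request. The URL pattern and parameter
# name are illustrative assumptions.
import urllib.parse

def build_eln_request(container_ids, base_url="https://eln.example.com/api/v1"):
    """Build an ELN API request URL carrying the container IDs as a parameter."""
    param = ",".join(str(c) for c in container_ids)
    return f"{base_url}/entries?" + urllib.parse.urlencode({"containerIds": param})
```

In the Talend job, the equivalent is a context variable filled from the Oracle extract and substituted into the tRESTClient URL.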
3.2 Oracle DB
Authentication:
(TEST)
(PROD)
View:
- IDBS_EWB_SEC.HISTORY_ENTRIES.TARGET_ENTITY_ID AS Experiment_ID
- IDBS_EWB_SEC.HISTORY_ENTRIES.ENTITY_PATH AS FOLDERPATH
- IDBS_EWB_SOLVAY_CUSTOM.DM_UserAccessPath_Full
- IDBS_EWB_SOLVAY_CUSTOM.DM_EntityRolePermissions_Full
Note:
3.3 Datalab Platform
- Access method: [Datalab API / SFTP export / direct DB connector – specify]
- Authentication: [Service account / credentials location]
- Data format: [CSV / Parquet / JSON – specify per dataset]
3.4 GCP
4. ELN Projects & Collaborator Access Control
The ELN system hosts multiple independent research projects. The Talend ETL flow enforces a project-level access policy: each collaborator is assigned to one or more specific ELN projects, and the pipeline only extracts and loads the data belonging to that collaborator's assigned projects.
4.2 Access Control Mapping
The table below maps each collaborator to their accessible ELN projects, Oracle schemas, and the resulting GCP target dataset:
| Collaborator / Role | ELN Project(s) | Oracle Schemas | GCP Target Dataset |
|---|---|---|---|
| [Collaborator A] | ELN-PRJ-001 | [Schema_X] | gcp_dataset_001 |
| [Collaborator B] | ELN-PRJ-001 | [Schema_Y] | gcp_dataset_001 |
| [Collaborator C] | ELN-PRJ-002 | [Schema_Z] | gcp_dataset_002 |
4.3 How Access Is Enforced in Talend
- A context variable (COLLABORATOR_PROJECT_ID) is set at job launch time — either via the job scheduler or a parent orchestration job.
- The tELTInput / tRESTClient component includes a dynamic filter parameter (?project_id=<COLLABORATOR_PROJECT_ID>) in the API request URL.
- The GCP target dataset is derived from the same context variable, ensuring data isolation at write time.
- Attempting to run a job without a valid project assignment will raise a controlled exception and halt execution.
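The enforcement steps above boil down to one lookup: derive the target dataset from the project context variable and fail fast when the assignment is missing. A minimal sketch, with the mapping mirroring the placeholder values of section 4.2:

```python
# Sketch of the project-scoping logic: the same context variable both
# filters the extract and selects the write target. Mapping values are
# placeholders taken from the access control table.
PROJECT_DATASETS = {
    "ELN-PRJ-001": "gcp_dataset_001",
    "ELN-PRJ-002": "gcp_dataset_002",
}

def resolve_target_dataset(collaborator_project_id):
    """Derive the BigQuery dataset from COLLABORATOR_PROJECT_ID; abort if unassigned."""
    if collaborator_project_id not in PROJECT_DATASETS:
        raise ValueError(
            f"No valid project assignment for {collaborator_project_id!r}; halting job."
        )
    return PROJECT_DATASETS[collaborator_project_id]
```

Raising on an unknown project ID reproduces the "controlled exception" behaviour: a job launched without a valid assignment never reaches the write phase.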
5. Talend ETL Jobs
5.1 Job Inventory
| Job Name | Source | Target | Schedule / Trigger |
|---|---|---|---|
| JOB_ELN_TO_GCP | ELN API | GCP BigQuery | Daily – 02:00 AM |
| JOB_ORA_VIEWS_TO_GCP | Oracle Views | GCP BigQuery | Daily – 03:00 AM |
| JOB_DATALAB_TO_GCP | Datalab Platform | GCP BigQuery | On-demand / Event |
5.2 Job Structure & Key Components
- tRESTClient: used for ELN API calls; configured with dynamic URL parameters for project filtering.
- tJDBCInput: used for Oracle Views; JDBC URL and credentials stored in Talend Context Groups.
- tMap: core transformation component – handles data mapping, type casting, and business rules.
- tBigQueryOutput: target writer; configured with GCP service account JSON and dataset/table parameters.
- tLogRow / tWarn: logging components present in all jobs for runtime diagnostics.
- tFlowToIterate + tLoop: used to paginate ELN API responses and process large result sets.
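The pagination pattern implemented with tFlowToIterate + tLoop can be sketched in plain Python. The offset/limit style and page size are assumptions about the ELN API, not confirmed behaviour:

```python
# Sketch of paginated extraction: keep requesting pages until a short page
# signals the end of the result set. fetch_page is any callable taking
# (offset, limit) -- e.g. a wrapper around the ELN API client.

def fetch_all_pages(fetch_page, page_size=100):
    """Accumulate all rows from a paged source, one page per iteration."""
    offset, rows = 0, []
    while True:
        page = fetch_page(offset, page_size)
        rows.extend(page)
        if len(page) < page_size:   # short page -> no more data
            return rows
        offset += page_size
```

This mirrors the Talend loop: each iteration fetches one page, and the loop exits when the API returns fewer rows than the requested page size.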
5.3 Context Groups
All sensitive parameters (credentials, URLs, project IDs) are managed through Talend Context Groups. Do not hardcode credentials in job components.
- ENV_DEV: Development environment context.
- ENV_PROD: Production environment context.
- GCP_CONTEXT: GCP project ID, dataset names, service account path.
- ELN_CONTEXT: ELN API base URL, API key, endpoint paths.
- ORACLE_CONTEXT: JDBC URL, schema, read-only credentials.
6. Target – GCP BigQuery
8. Error Handling & Monitoring
8.1 Error Handling Strategy
- All jobs implement a tDie component on the error branch to terminate gracefully and log a meaningful message.
- API connectivity failures trigger a tWarn followed by a retry loop (max 3 retries, 30s backoff).
- Data validation errors (null primary keys, type mismatches) are routed to a reject tLogRow component and written to a dedicated GCP error table.
- Job execution logs are stored at [log path / GCP Cloud Logging bucket].
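The retry policy above (max 3 retries, 30 s backoff on API connectivity failures) can be sketched as follows; the exception type and call signature are illustrative assumptions:

```python
# Sketch of the retry loop described above: up to 3 retries with a fixed
# 30-second backoff, re-raising once the budget is exhausted.
import time

def call_with_retry(call, max_retries=3, backoff_seconds=30):
    """Invoke a flaky source call, retrying on connectivity errors."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except ConnectionError:
            if attempt == max_retries:
                raise               # budget exhausted: surface the failure
            time.sleep(backoff_seconds)
```

In the Talend jobs the equivalent is the tWarn + loop branch; a terminal failure falls through to the tDie branch.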
8.2 Monitoring & Alerting
- Job execution status is tracked via [Talend Management Console / custom log table in BigQuery].
- Alerts are sent to [email / Slack channel] on job failure.
- A daily reconciliation query compares row counts between source and target to detect data loss.
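The daily reconciliation check can be sketched as a simple count comparison. The SQL below is an illustrative shape only: the project, dataset, table, and timestamp column names are assumptions:

```python
# Sketch of the daily reconciliation: count rows loaded today in the target
# and compare against the source-side count. Query identifiers are placeholders.
RECONCILIATION_QUERY = """
SELECT COUNT(*) AS row_count
FROM `gcp_project.gcp_dataset_001.eln_entries`
WHERE DATE(load_timestamp) = CURRENT_DATE()
"""

def reconcile(source_count, target_count):
    """Return True when counts match; otherwise report the gap for alerting."""
    if source_count == target_count:
        return True
    print(f"Row count mismatch: source={source_count}, target={target_count}")
    return False
```

A mismatch here is what would feed the [email / Slack channel] alert path.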
8.3 Common Errors & Resolutions
- ELN 401 Unauthorized: API key expired – rotate key in Secret Manager and update context.
- ORA-12170 TNS Connect Timeout: Oracle host unreachable – verify network / VPN connectivity.
- GCP 403 Forbidden: Service account permissions issue – verify BigQuery IAM roles.
- Talend OutOfMemoryError: Reduce tMap buffer size or enable tBufferOutput pagination.
9. Deployment & Scheduling
9.1 Deployment Steps
- Export the Talend job as a standalone archive (.zip) from Talend Studio.
- Transfer the archive to the execution server at [/opt/talend/jobs/].
- Set the appropriate Context Group (ENV_PROD) and verify all context variables.
- Run a smoke test with a single ELN project before full production run.
- Register jobs in the scheduler (cron / Talend Management Console).
9.2 Scheduling
- ELN → GCP job: Daily at 02:00 AM (server local time)
- Oracle → GCP job: Daily at 03:00 AM
- Datalab → GCP job: on-demand / event-triggered (per the job inventory in 5.1)