...
ELN API • Oracle Views • Datalab → Talend → GCP BigQuery
Project Name | Each ELN product follows a distinct IDBS access rights workflow |
ETL | Talend |
Status | Deployed in PROD |
1. Project Overview
This document serves as the official technical handover wiki for the ETL pipeline that ingests data from multiple source systems.
The ELN API, Oracle Database Views, and the Datalab Platform loads the processed data into Google Cloud Platform (GCP) BigQuery tables using Talend as the ETL orchestration tool.
Objectives
...
Objectives
The ELN system hosts multiple independent research projects. The Talend ETL flow enforces a project-level access policy: each collaborator is assigned to one or more specific ELN projects, and the pipeline only extracts and loads the data belonging to that collaborator's assigned projects
...
.
Scope
- All active Talend jobs connecting to ELN API, Oracle Views, and DataLab.
- Access management rules governing ELN products assignment per collaborator.
- GCP BigQuery datasets and tables that serve as the final landing zone.
2. Architecture & Data Flow
The ETL pipeline follows a classic ETL pattern orchestrated by Talend. The high-level data flow is described below:
[ELN API] ─┐
[Oracle Views ] ──► [Talend ETL Jobs] ──► [GCP BigQuery Tables]
[Datalab Platform ]─┘
Flow Steps
- Extract: Talend connects to each source (ELN REST API, Oracle JDBC, Datalab (GCP) ) and extracts raw data according to the job configuration.
- Filter / Route: For ELN sources, the flow applies a project-based access filter, ensuring each collaborator's data is scoped to their assigned ELN project(s) only.
- Transform: Data is cleaned, typed, and mapped to the target GCP schema within the Talend job components.
- Load: Transformed data is written to the corresponding BigQuery dataset and table using the GCP BigQuery connector.
3. Source Systems
Source System | Type | Description |
ELN API | REST API | Electronic Lab Notebook – exposes project data per collaborator access level |
Oracle Views | Relational DB Views | Pre-aggregated relational data extracted via JDBC connections |
Datalab Platform | Internal Data Platform | Analytical datasets and processed outputs from the Datalab environment |
3.1 ELN API
Authentication:
(TEST)
- REST / HTTP: Protocol
- Authentication Type : Basic HTTP
- URL: https://mylab-test.syensqo.com:8443
- Login: (Contact Admin)
- Password: (Contact Admin)
(PROD)
- REST / HTTP: Protocol
- Authentication Type : Basic HTTP
- URL: https://mylab.syensqo.com:8443
- Login: (Contact Admin)
- PaswordPassword: (Contact Admin)
Note :
- Get the
...
- list of container IDs
...
- from the Oracle
...
- view, then iterate over it as
...
- an input variable in the URL.
- Container IDs are should be available in this view :IDBS_EWB_SEC.HISTORY_ENTRIES.ENTITY_PATH
Exemple :
- https://mylab-test.syensqo.com:8443/ewb/services/1.0/feeds/entities/ (ContainerID_List)
3.2 Oracle DB
Authentication:
...
- IDBS_EWB_SEC.HISTORY_ENTRIES.TARGET_ENTITY_ID AS Experiment_IDID
- IDBS_EWB_SEC.HISTORY_ENTRIES.ENTITY_PATH AS FOLDERPATH
- IDBS_EWB_SOLVAY_CUSTOM.DM_UserAccessPath_Full AS PERMISSIONS
- DBS_EWB_SOLVAY_CUSTOM.DM_EntityRolePermissions_Full AS ROLES
Note:
The Oracle database connection is enabled for Talend Remote Engines. However, when connecting to the database via a VDI, the connection may be disabled.
3.3
...
- [Datalab API / SFTP export / direct DB connector – specify]: Access method
- [Service account / credentials location]: Authentication
- [CSV / Parquet / JSON – specify per dataset]: Data format
3.4 GCP
4. ELN Projects & Collaborator Access Control
The ELN system hosts multiple independent research projects. The Talend ETL flow enforces a project-level access policy: each collaborator is assigned to one or more specific ELN projects, and the pipeline only extracts and loads the data belonging to that collaborator's assigned projects.
4.2 Access Control Mapping
The table below maps each collaborator to their accessible ELN projects, Oracle schemas, and the resulting GCP target dataset:
Collaborator / Role | ELN Project(s) | Oracle Schemas | GCP Target Dataset |
[Collaborator A] | ELN-PRJ-001 | [Schema_X] | gcp_dataset_001 |
[Collaborator B] | ELN-PRJ-001 | [Schema_Y] | gcp_dataset_001 |
[Collaborator C] | ELN-PRJ-002 | [Schema_Z] | gcp_dataset_002 |
4.3 How Access Is Enforced in Talend
- A context variable (COLLABORATOR_PROJECT_ID) is set at job launch time — either via the job scheduler or a parent orchestration job.
- The tELTInput / tRESTClient component includes a dynamic filter parameter (?project_id=<COLLABORATOR_PROJECT_ID>) in the API request URL.
- The GCP target dataset is derived from the same context variable, ensuring data isolation at write time.
- Attempting to run a job without a valid project assignment will raise a controlled exception and halt execution.
5. Talend ETL Jobs
5.1 Job Inventory
Job Name | Source | Target | Schedule / Trigger |
JOB_ELN_TO_GCP | ELN API | GCP BigQuery | Daily – 02:00 AM |
JOB_ORA_VIEWS_TO_GCP | Oracle Views | GCP BigQuery | Daily – 03:00 AM |
JOB_DATALAB_TO_GCP | Datalab Platform | GCP BigQuery | On-demand / Event |
5.2 Job Structure & Key Components
- Used for ELN API calls; configured with dynamic URL parameters for project filtering.: tRESTClient
- Used for Oracle Views; JDBC URL and credentials stored in Talend Context Groups.: tJDBCInput
- Core transformation component – handles data mapping, type casting, and business rules.: tMap
- Target writer; configured with GCP service account JSON and dataset/table parameters.: tBigQueryOutput
- Logging components present in all jobs for runtime diagnostics.: tLogRow / tWarn
- Used to paginate ELN API responses and process large result sets.: tFlowToIterate + tLoop
5.3 Context Groups
Datalab
- Source: GCP Bigquery table
- Project: gcp-sqo-datalab-(*)
- Dataset: bq_ds_datagrow_dev_ads_static
- Table: application_use
- filter: user_status='active' AND ROLE IN ('admin', 'lab_manager')
3.4 Target table:
Bigquery: DM.ELN_IDBS_AccessRights
Talend tasks:
(Master)
(Get Experiments)
(Cross experiments with collaborators and roles )
Error Handling
All sensitive parameters (credentials, URLs, project IDs) are managed through Talend Context Groups. Do not hardcode credentials in job components.
- ENV_DEV: Development environment context.
- ENV_PROD: Production environment context.
- GCP_CONTEXT: GCP project ID, dataset names, service account path.
- ELN_CONTEXT: ELN API base URL, API key, endpoint paths.
- ORACLE_CONTEXT: JDBC URL, schema, read-only credentials.
6. Target – GCP BigQuery
8. Error Handling & Monitoring
...
Strategy
- All jobs implement a tDie component on the error branch to terminate gracefully and log a meaningful message.
- API connectivity failures trigger a tWarn followed by a retry loop
(max 3 retries, 30s backoff).Data validation errors (null primary keys, type mismatches) are routed to a reject tLogRow component and written to a dedicated GCP error table- .
- Job execution logs are
stored at [log path / GCP Cloud Logging bucket].
8.2 Monitoring & Alerting
- Job execution status is tracked via [Talend Management Console / custom log table in BigQuery].
- Alerts are sent to [email / Slack channel] on job failure.
- A daily reconciliation query compares row counts between source and target to detect data loss.
8.3 Common Errors & Resolutions
- ELN 401 Unauthorized: API key expired – rotate key in Secret Manager and update context.
- ORA-12170 TNS Connect Timeout: Oracle host unreachable – verify network / VPN connectivity.
- GCP 403 Forbidden: Service account permissions issue – verify BigQuery IAM roles.
- Talend OutOfMemoryError: Reduce tMap buffer size or enable tBufferOutput pagination.
9. Deployment & Scheduling
9.1 Deployment Steps
- Export the Talend job as a standalone archive (.zip) from Talend Studio.
- Transfer the archive to the execution server at [/opt/talend/jobs/].
- Set the appropriate Context Group (ENV_PROD) and verify all context variables.
- Run a smoke test with a single ELN project before full production run.
- Register jobs in the scheduler (cron / Talend Management Console).
...
- visible in TMC.
Scheduling
- ELN → GCP job: Daily at 02:00 AM (server local time)
- Oracle → GCP job: Daily at
03- 02:00 AM
- Datalab → GCP job: Daily at
03- 02:00 AM
4. Contacts & responsibilities:
- antoine.rey@syensqo.com / ELN Administrator
- prasanth.gnanasekar@syensqo.com / Data Engineering - Flow Maintenance
5. Annexe document:
This annex document could be helpful if you are looking for more details on access rights related to ELN. (link)


