ELN API • Oracle Views • Datalab → Talend → GCP BigQuery
| Field | Value |
|---|---|
| Project Name | Each ELN product follows a distinct IDBS access rights workflow |
| ETL | Talend |
| Status | Deployed in PROD |
1. Project Overview
This document serves as the official technical handover wiki for the ETL pipeline that ingests data from multiple source systems.
Data from the ELN API, Oracle Database Views, and the Datalab Platform is processed and loaded into Google Cloud Platform (GCP) BigQuery tables, with Talend as the ETL orchestration tool.
Objectives
- Centralise laboratory and analytical data from heterogeneous source systems into a unified GCP data warehouse.
- Enforce project-level access control for ELN data by routing each collaborator's data to their designated pipeline scope.
- Provide a reliable, maintainable, and documented ETL process for the incoming team, easing the migration of these flows from Talend to Microsoft Fabric.
Scope
- All active Talend jobs connecting to the ELN API, Oracle Views, and Datalab.
- Access management rules governing ELN project assignment per collaborator.
- GCP BigQuery datasets and tables that serve as the final landing zone.
2. Architecture & Data Flow
The pipeline follows a classic extract-transform-load pattern orchestrated by Talend. The high-level data flow is shown below:
[ELN API]          ─┐
[Oracle Views]      ├──► [Talend ETL Jobs] ──► [GCP BigQuery Tables]
[Datalab Platform] ─┘
Flow Steps
- Extract: Talend connects to each source (ELN REST API, Oracle via JDBC, Datalab on GCP) and extracts raw data according to the job configuration.
- Filter / Route: For ELN sources, the flow applies a project-based access filter, ensuring each collaborator's data is scoped to their assigned ELN project(s) only.
- Transform: Data is cleaned, typed, and mapped to the target GCP schema within the Talend job components.
- Load: Transformed data is written to the corresponding BigQuery dataset and table using the GCP BigQuery connector.
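The four flow steps above can be sketched as a minimal Python pipeline. All function names and row shapes here are illustrative assumptions, not actual Talend job internals:

```python
# Minimal sketch of the extract -> filter/route -> transform -> load flow.
# Everything here is a placeholder standing in for Talend components.

def run_pipeline(source, assigned_projects, target_dataset):
    rows = extract(source)                                                # Extract
    rows = [r for r in rows if r.get("project_id") in assigned_projects]  # Filter / Route
    rows = [transform(r) for r in rows]                                   # Transform
    return load(rows, target_dataset)                                     # Load

def extract(source):
    # Placeholder for an ELN REST call, Oracle JDBC query, or Datalab read.
    return source.read()

def transform(row):
    # Placeholder for cleaning, type casting, and schema mapping.
    return {k.lower(): v for k, v in row.items()}

def load(rows, dataset):
    # Placeholder for the BigQuery connector write; returns what was written.
    return dataset, rows
```

The project filter runs before the transform, mirroring how the Talend jobs scope data to a collaborator's assigned ELN projects before any mapping happens.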
3. Source Systems
| Source System | Type | Description |
|---|---|---|
| ELN API | REST API | Electronic Lab Notebook – exposes project data per collaborator access level |
| Oracle Views | Relational DB Views | Pre-aggregated relational data extracted via JDBC connections |
| Datalab Platform | Internal Data Platform | Analytical datasets and processed outputs from the Datalab environment |
3.1 ELN API
Authentication:
Note: fetch the list of container IDs from the Oracle views and pass it to the ELN API as a parameter variable.
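The note above can be illustrated with a small sketch. The base URL, endpoint path, and `containerIds` query parameter are assumptions about the ELN API, not confirmed names:

```python
# Sketch: container IDs retrieved from the Oracle views are joined into a
# query parameter on the ELN API request. The URL pattern and parameter
# name are illustrative assumptions.
import urllib.parse

def build_eln_request(container_ids, base_url="https://eln.example.com/api/v1"):
    """Build an ELN API request URL carrying the container IDs as a parameter."""
    param = ",".join(str(c) for c in container_ids)
    return f"{base_url}/entries?" + urllib.parse.urlencode({"containerIds": param})
```

In the Talend job, the equivalent is a context variable filled from the Oracle extract and substituted into the tRESTClient URL.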
3.2 Oracle DB
Authentication:
(TEST)
(PROD)
View:
- IDBS_EWB_SEC.HISTORY_ENTRIES.TARGET_ENTITY_ID AS Experiment_ID
- IDBS_EWB_SEC.HISTORY_ENTRIES.ENTITY_PATH AS FOLDERPATH
- IDBS_EWB_SOLVAY_CUSTOM.DM_UserAccessPath_Full
- IDBS_EWB_SOLVAY_CUSTOM.DM_EntityRolePermissions_Full
Note:
3.3 Datalab Platform
- Access method: [Datalab API / SFTP export / direct DB connector – specify]
- Authentication: [Service account / credentials location]
- Data format: [CSV / Parquet / JSON – specify per dataset]
3.4 GCP
4. ELN Projects & Collaborator Access Control
The ELN system hosts multiple independent research projects. The Talend ETL flow enforces a project-level access policy: each collaborator is assigned to one or more specific ELN projects, and the pipeline only extracts and loads the data belonging to that collaborator's assigned projects.
4.2 Access Control Mapping
The table below maps each collaborator to their accessible ELN projects, Oracle schemas, and the resulting GCP target dataset:
| Collaborator / Role | ELN Project(s) | Oracle Schemas | GCP Target Dataset |
|---|---|---|---|
| [Collaborator A] | ELN-PRJ-001 | [Schema_X] | gcp_dataset_001 |
| [Collaborator B] | ELN-PRJ-001 | [Schema_Y] | gcp_dataset_001 |
| [Collaborator C] | ELN-PRJ-002 | [Schema_Z] | gcp_dataset_002 |
4.3 How Access Is Enforced in Talend
- A context variable (COLLABORATOR_PROJECT_ID) is set at job launch time — either via the job scheduler or a parent orchestration job.
- The tELTInput / tRESTClient component includes a dynamic filter parameter (?project_id=<COLLABORATOR_PROJECT_ID>) in the API request URL.
- The GCP target dataset is derived from the same context variable, ensuring data isolation at write time.
- Attempting to run a job without a valid project assignment will raise a controlled exception and halt execution.
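The enforcement steps above boil down to one lookup: derive the target dataset from the project context variable and fail fast when the assignment is missing. A minimal sketch, with the mapping mirroring the placeholder values of section 4.2:

```python
# Sketch of the project-scoping logic: the same context variable both
# filters the extract and selects the write target. Mapping values are
# placeholders taken from the access control table.
PROJECT_DATASETS = {
    "ELN-PRJ-001": "gcp_dataset_001",
    "ELN-PRJ-002": "gcp_dataset_002",
}

def resolve_target_dataset(collaborator_project_id):
    """Derive the BigQuery dataset from COLLABORATOR_PROJECT_ID; abort if unassigned."""
    if collaborator_project_id not in PROJECT_DATASETS:
        raise ValueError(
            f"No valid project assignment for {collaborator_project_id!r}; halting job."
        )
    return PROJECT_DATASETS[collaborator_project_id]
```

Raising on an unknown project ID reproduces the "controlled exception" behaviour: a job launched without a valid assignment never reaches the write phase.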
5. Talend ETL Jobs
5.1 Job Inventory
| Job Name | Source | Target | Schedule / Trigger |
|---|---|---|---|
| JOB_ELN_TO_GCP | ELN API | GCP BigQuery | Daily – 02:00 AM |
| JOB_ORA_VIEWS_TO_GCP | Oracle Views | GCP BigQuery | Daily – 03:00 AM |
| JOB_DATALAB_TO_GCP | Datalab Platform | GCP BigQuery | On-demand / Event |
5.2 Job Structure & Key Components
- tRESTClient: used for ELN API calls; configured with dynamic URL parameters for project filtering.
- tJDBCInput: used for Oracle Views; JDBC URL and credentials stored in Talend Context Groups.
- tMap: core transformation component – handles data mapping, type casting, and business rules.
- tBigQueryOutput: target writer; configured with GCP service account JSON and dataset/table parameters.
- tLogRow / tWarn: logging components present in all jobs for runtime diagnostics.
- tFlowToIterate + tLoop: used to paginate ELN API responses and process large result sets.
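The pagination pattern implemented with tFlowToIterate + tLoop can be sketched in plain Python. The offset/limit style and page size are assumptions about the ELN API, not confirmed behaviour:

```python
# Sketch of paginated extraction: keep requesting pages until a short page
# signals the end of the result set. fetch_page is any callable taking
# (offset, limit) -- e.g. a wrapper around the ELN API client.

def fetch_all_pages(fetch_page, page_size=100):
    """Accumulate all rows from a paged source, one page per iteration."""
    offset, rows = 0, []
    while True:
        page = fetch_page(offset, page_size)
        rows.extend(page)
        if len(page) < page_size:   # short page -> no more data
            return rows
        offset += page_size
```

This mirrors the Talend loop: each iteration fetches one page, and the loop exits when the API returns fewer rows than the requested page size.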
5.3 Context Groups
All sensitive parameters (credentials, URLs, project IDs) are managed through Talend Context Groups. Do not hardcode credentials in job components.
- ENV_DEV: Development environment context.
- ENV_PROD: Production environment context.
- GCP_CONTEXT: GCP project ID, dataset names, service account path.
- ELN_CONTEXT: ELN API base URL, API key, endpoint paths.
- ORACLE_CONTEXT: JDBC URL, schema, read-only credentials.
6. Target – GCP BigQuery
8. Error Handling & Monitoring
8.1 Error Handling Strategy
- All jobs implement a tDie component on the error branch to terminate gracefully and log a meaningful message.
- API connectivity failures trigger a tWarn followed by a retry loop (max 3 retries, 30s backoff).
- Data validation errors (null primary keys, type mismatches) are routed to a reject tLogRow component and written to a dedicated GCP error table.
- Job execution logs are stored at [log path / GCP Cloud Logging bucket].
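The retry policy above (max 3 retries, 30 s backoff on API connectivity failures) can be sketched as follows; the exception type and call signature are illustrative assumptions:

```python
# Sketch of the retry loop described above: up to 3 retries with a fixed
# 30-second backoff, re-raising once the budget is exhausted.
import time

def call_with_retry(call, max_retries=3, backoff_seconds=30):
    """Invoke a flaky source call, retrying on connectivity errors."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except ConnectionError:
            if attempt == max_retries:
                raise               # budget exhausted: surface the failure
            time.sleep(backoff_seconds)
```

In the Talend jobs the equivalent is the tWarn + loop branch; a terminal failure falls through to the tDie branch.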
8.2 Monitoring & Alerting
- Job execution status is tracked via [Talend Management Console / custom log table in BigQuery].
- Alerts are sent to [email / Slack channel] on job failure.
- A daily reconciliation query compares row counts between source and target to detect data loss.
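The daily reconciliation check can be sketched as a simple count comparison. The SQL below is an illustrative shape only: the project, dataset, table, and timestamp column names are assumptions:

```python
# Sketch of the daily reconciliation: count rows loaded today in the target
# and compare against the source-side count. Query identifiers are placeholders.
RECONCILIATION_QUERY = """
SELECT COUNT(*) AS row_count
FROM `gcp_project.gcp_dataset_001.eln_entries`
WHERE DATE(load_timestamp) = CURRENT_DATE()
"""

def reconcile(source_count, target_count):
    """Return True when counts match; otherwise report the gap for alerting."""
    if source_count == target_count:
        return True
    print(f"Row count mismatch: source={source_count}, target={target_count}")
    return False
```

A mismatch here is what would feed the [email / Slack channel] alert path.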
8.3 Common Errors & Resolutions
- ELN 401 Unauthorized: API key expired – rotate key in Secret Manager and update context.
- ORA-12170 TNS Connect Timeout: Oracle host unreachable – verify network / VPN connectivity.
- GCP 403 Forbidden: Service account permissions issue – verify BigQuery IAM roles.
- Talend OutOfMemoryError: Reduce tMap buffer size or enable tBufferOutput pagination.
9. Deployment & Scheduling
9.1 Deployment Steps
- Export the Talend job as a standalone archive (.zip) from Talend Studio.
- Transfer the archive to the execution server at [/opt/talend/jobs/].
- Set the appropriate Context Group (ENV_PROD) and verify all context variables.
- Run a smoke test with a single ELN project before full production run.
- Register jobs in the scheduler (cron / Talend Management Console).
9.2 Scheduling
- ELN → GCP job: Daily at 02:00 AM (server local time)
- Oracle → GCP job: Daily at 03:00 AM
- Datalab → GCP job: on-demand / event-triggered (per the job inventory in 5.1)