You are viewing an old version of this page. View the current version.

Compare with Current View Page History

« Previous Version 16 Next »

This page presents the general architecture for data development. It's responsable to describe the data extraction, ingestion, preparation, integration and share. Not all the domains are using this architecture. Please, refers to the subpages to know the details and the differences. 

Summary

Macro Diagram - ETL Phases 

1 - Data Ingestion

Data ingestion is the process of transporting data from one or more sources to a target site for further processing and analysis. For this project data is extracted or copied to Talend server as files using Talend.

Data Sources

ELN

Many spreadsheets are extracted from Electronic Laboratory Notebook (ELN) which is responsable to gather lab documents and result tests. It gives scientists a common view of data across disparate research areas, enabling complete visibility of research information.

ELN data is extracted in JSON format using an available API. 

Related documents: 

ZIFO - API Documentation for extracting E-Workbook Spreadsheet Data

Labware

LabWare LIMS is a laboratory information management system (LIMS) that is "configurable or pre-configured for laboratories of all types and sizes." The software consists of the core LIMS application with access to LabWare's library of LIMS software modules.

Instruments/Cyclers

The instruments files are available in different formats (txt, csv, xml, etc.) usually on Lab server folders or Google drive. The rules to structure the files in columns are specific for every workflow/method. 

2 - Data Preparation

Data preparation is the process of cleaning and transforming raw data prior to processing and analysis. Often involves reformatting data and making corrections to data.

Talend jobs are responsable for cleaning and structuring the data. 


In this process, we also define if a source file should go to Accepted Files or Rejected Files bucket. The files with the expected structure go to Accepted and the files that we cannot load on BigQuery for a structure or data type issue go to Rejected.  The process will not stop in case of rejected files.

3 - Data Integration and Enrich

Data enrichment refers to the process of appending or otherwise enhancing collected data with relevant context obtained from additional sources. 

4 - Data Presentation

All the reports, dashboards and users should access the data using this layer. Data marts (DM) with views will be created by subject, limiting access to non treated data. 

5- Data Visualization

Check with the Data Viz team the specific documentation.

  • No labels