...
The following document was created at the beginning of the project and needs to be updated, but the core idea is still valid:
Raw data extraction
Raw data files for the Analytical Lab tests are extracted from the servers listed in the Asset Tag Spreadsheet: https://docs.google.com/spreadsheets/d/1z5AndNQxOsRI8O5TI3m16hkF7uN47ntAPFBuJ9A0CLA/edit#gid=0
The table below presents all the Analytical tests that are currently included in the Datalake, the source of data and the raw data files format for each test:
| Test | Data Sources | File format |
|---|---|---|
| CPS | Raw data files | csv |
| Granulometrie | Raw data files | xls |
| | | csv |
| | | xls |
| Porohg | Raw data files | xls |
| Sedigraph | Raw data files | xls |
| Mesure Carbone | Raw data files | |
Python scripts and Talend jobs
Talend jobs
Data is processed using Talend and Python. The Talend jobs for the Analytical Lab can be found in the Talend project RnI_Silica_Analytical.
The job F000_Orch_Flow_Synthesis is the orchestrator: it calls all the other jobs successively, in a determined order.
Python scripts
A copy of the Python scripts for downloading, parsing and computing new fields and columns for the Analytical Lab raw data and ELN data can be found on Google Drive:
https://drive.google.com/drive/folders/1nO0jMmGDLLAOvKUUFB92To63ak3RHYG1F7ESi1oZFuis3kq9d_
Talend data preparation for the Analytical Lab has the same structure as for Synthesis Lab.
...
However, it is recommended to use the latest version of the scripts, available in the production path: D:\DATA\PROD\RnI\Silica\Permanent. For the moment, Git is not in place for this project.
Analytical Test files
CPS
For each raw data file, the Python script parse_CPS_analysis.py extracts or calculates the following columns and generates a .csv file:
| Column name | Formula |
|---|---|
| sample_id | extracted from the raw data file |
| date_measure | extracted from the raw data file name. It is the date on which the file was created/generated |
| diameter_microns | extracted from the raw data file |
| frequency | extracted from the raw data file |
| frequency_normalized | We first compute max_frequency, the maximum value of [frequency]. Then: [frequency_normalized] = [frequency] * 100 / max_frequency |
| area | For each row we compute min_frequency and max_frequency, the smaller and larger of [frequency][idx] and [frequency][idx+1]. Then: area = ((([diameter_microns][idx] - [diameter_microns][idx+1]) * min_frequency) + (([diameter_microns][idx] - [diameter_microns][idx+1]) * (max_frequency - min_frequency))) / 2 |
| aggregate | aggregate = [area][idx] + [aggregate][idx-1] |
| aggregate_normalized | We first compute max_aggregate, the maximum value of [aggregate]. Then: aggregate_normalized = 100 - [aggregate][idx] * 100 / max_aggregate |
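As an illustration, the computed columns can be sketched in pandas. This is one reading of the formulas above, not the production parse_CPS_analysis.py, and the input numbers are made up:

```python
import pandas as pd

def add_cps_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Add the computed CPS columns to a frame holding the extracted
    diameter_microns and frequency columns (illustrative sketch)."""
    out = df.copy()
    out["frequency_normalized"] = out["frequency"] * 100 / out["frequency"].max()

    # Per-row area between consecutive points, following the documented formula.
    d = out["diameter_microns"]
    f = out["frequency"]
    delta_d = d - d.shift(-1)  # [diameter_microns][idx] - [diameter_microns][idx+1]
    pair = pd.concat([f, f.shift(-1)], axis=1)
    min_f, max_f = pair.min(axis=1), pair.max(axis=1)
    out["area"] = (delta_d * min_f + delta_d * (max_f - min_f)) / 2

    # aggregate is the running sum of area; aggregate_normalized rescales it.
    out["aggregate"] = out["area"].cumsum()
    out["aggregate_normalized"] = 100 - out["aggregate"] * 100 / out["aggregate"].max()
    return out

# Tiny made-up input for illustration:
sample = pd.DataFrame({"diameter_microns": [3.0, 2.0, 1.0],
                       "frequency": [10.0, 20.0, 5.0]})
result = add_cps_columns(sample)
```

The last row has no successor, so its area (and the tail of the cumulative sum) is left as NaN rather than guessed.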
...
| Value | Formula |
|---|---|
| sample_id | extracted from the raw data file name |
| date_measure | extracted from the raw data file. It is the date on which the file was created/generated |
| d10 | Inferred using linear regression, taking the values of the [aggregate_normalized] column nearest to, and smaller than, 10 |
| d16 | Inferred using linear regression, taking the values of the [aggregate_normalized] column nearest to, and smaller than, 16 |
| d50 | Inferred using linear regression, taking the values of the [aggregate_normalized] column nearest to, and smaller than, 50 |
| d84 | Inferred using linear regression, taking the values of the [aggregate_normalized] column nearest to, and smaller than, 84 |
| d90 | Inferred using linear regression, taking the values of the [aggregate_normalized] column nearest to, and smaller than, 90 |
| ld | (d84 - d16) / d50 |
| mode | We first compute diameter_at_max_frequency, the value of [diameter_microns] at which [frequency] is maximal. Then: mode = 1000 * diameter_at_max_frequency |
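One plausible reading of the dXX inference, sketched with NumPy. The number of points used in the fit (two here) is an assumption; the production script may use a different count:

```python
import numpy as np

def infer_dx(diameter, aggregate_normalized, target, n_points=2):
    """Estimate the diameter at which aggregate_normalized reaches `target`
    by fitting a line through the n_points values just below the target.
    Illustrative sketch; n_points=2 is an assumption."""
    diameter = np.asarray(diameter, dtype=float)
    agg = np.asarray(aggregate_normalized, dtype=float)
    below = np.where(agg < target)[0]                      # candidates under the target
    nearest = below[np.argsort(target - agg[below])][:n_points]
    # Regress diameter on aggregate_normalized, then evaluate at the target.
    slope, intercept = np.polyfit(agg[nearest], diameter[nearest], 1)
    return slope * target + intercept

# Made-up curve for illustration:
d50 = infer_dx([1.0, 2.0, 3.0, 4.0], [20.0, 45.0, 48.0, 60.0], 50)
```

The same helper would be called with targets 10, 16, 84 and 90 for the other values.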
...
Kinetics files are Google spreadsheets that are listed in the Google spreadsheet “Dispersibilité 190W”, in the sheet called “Samples”:
https://docs.google.com/spreadsheets/d/1kiPlW5-_a-Eo8fTyP9LmnOsKEZt3gkjw4fYk1TRpHM4/edit#gid=4663305
Sample IDs are taken from the column “Liste des échantillons analysés” and the unique key corresponding to the sample is taken from the column “URL”.
...
- record_nb
- date_heure
- Dx_10
- Dx_16
- Dx_50
- Dx_84
- Dx_90
- D_4_3
- D_3_2
- mode
- results_below_1
- results_below_40
- results_below_50
- results_below_75
- mode_0
- mode_1
- mode_2
- mode_count
- mode_percentage_0
- mode_percentage_1
- mode_percentage_2
- obscuration_laser_percentage
- obscuration_bleue_percentage
ICP (processes not in production as at 22/09/2022)
For both ICP MS and ICP OES, along with the raw data files, we need to extract data from the Google spreadsheet called “Registre Appareil - ANA-RIC-Paris Balance/micro-onde N°0003531” (feuille 1):
...
| Column | Formula |
|---|---|
| sample_id | extracted from the raw data file name |
| FD | extracted from the raw data file |
| concentration_1 | extracted from the raw data file |
| concentration_2_read | extracted from the raw data file |
| masse_echantillon | |
| volume_liquide | |
| concentration_2_computed | ([concentration_1] * [FD] * volume_liquide * masse_echantillon) / 1000 |
| low_limit | 0.3 * the value of the first etalon |
| high_limit | 1.65 * the value of the last etalon |
| warning | For each element: |
The table ICP_MS_summary contains, for each sample, the value of “concentration_2_computed” for each measured element (Al, Ti, Ca, Fe, Mg, Na, S, K, Zn, Sn, Li, P, As, Cd, Ce, Co, Cr, Cu, Hg, Mn, Ni, Pb, Sb, Se, W, Zr, Mo, La, Ba, Bi, V, Be, B, Si, Sc, Ga, Ge, Rb, Sr, Y, Nb, Ru, Rh, Pd, Ag, In, Te, Cs, Hf, Ta, Re, Os, Ir, Pt, Au, Tl, Pr, Nd, Pm, Sm, Eu, Gd, Tb, Dy, Ho, Er, Tm, Yb, Lu).
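A sketch of the derived ICP fields per element. The warning rule shown (flagging a value outside [low_limit, high_limit]) is an assumption, since the exact formula is not given above:

```python
def icp_computed_fields(concentration_1, fd, volume_liquide, masse_echantillon,
                        etalons):
    """Compute the derived ICP fields for one element (illustrative sketch).
    `etalons` is the ordered list of calibration standard values.
    The warning rule below is an assumption, not the documented formula."""
    concentration_2_computed = (concentration_1 * fd * volume_liquide
                                * masse_echantillon) / 1000
    low_limit = 0.3 * etalons[0]      # 0.3 * value of the first etalon
    high_limit = 1.65 * etalons[-1]   # 1.65 * value of the last etalon
    # Assumed rule: warn when the computed concentration falls outside the range.
    warning = not (low_limit <= concentration_2_computed <= high_limit)
    return concentration_2_computed, low_limit, high_limit, warning

# Made-up values for illustration:
c2, low, high, warn = icp_computed_fields(2.0, 10, 5.0, 1.0, etalons=[1.0, 100.0])
```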
...
For each sample, the Python script parse_InfraRouge.py creates a .csv file containing the following columns extracted from the raw data file:
- wave_length
- absorbance
Mesure Carbone (not being used as at 22/09/22)
For each sample, the Python script parse_Mesure_Carbone.py creates a .csv file containing the following columns extracted from the raw data file:
...
| Column | Formula |
|---|---|
| sample_id | extracted from the raw data file name |
| date_measure | extracted from the raw data file. It is the date on which the file was created/generated |
| total_volume | Value of the column [cumulative_pore_volume] at the end of the first intrusion step |
| v1_micron | We first compute cumulative_pore_volume_m. Then: v1_micron = total_volume - cumulative_pore_volume_m |
| v1_nano | We first compute cumulative_pore_volume_n. Then: v1_nano = total_volume - cumulative_pore_volume_n |
| total_volume2 | Value of the column [cumulative_pore_volume] at the end of the second intrusion step |
| vol_agg_micron | Inferred using linear regression with the two points of [diameter] that are closest to 1000 (just before and just after) during the second intrusion |
| vol_agg_nano | Inferred using linear regression with the two points of [diameter] that are closest to 100 (just before and just after) during the second intrusion |
| v2_micro | total_volume2 - vol_agg_micron |
| v2_nano | total_volume2 - vol_agg_nano |
| mode1 | Maximum value of the column [smoothed_pore_volume] where the values of the column [diameter] are <= 100 during the first intrusion |
| mode2 | Maximum value of the column [smoothed_pore_volume] where the values of the column [diameter] are <= 100 during the second intrusion |
| A1 | |
| B1 | |
| ldp1 | (B1 - A1) / mode1 |
| A2 | |
| B2 | |
| ldp2 | (B2 - A2) / mode2 |
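A minimal sketch of the mode1/mode2 rule, assuming the parsed file exposes intrusion_step, diameter and smoothed_pore_volume columns (these names are assumptions about the parsed data):

```python
import pandas as pd

def compute_modes(df: pd.DataFrame) -> dict:
    """mode1 / mode2: the maximum smoothed_pore_volume where diameter <= 100,
    taken separately for each intrusion step (illustrative sketch)."""
    modes = {}
    for step, label in [(1, "mode1"), (2, "mode2")]:
        mask = (df["intrusion_step"] == step) & (df["diameter"] <= 100)
        modes[label] = df.loc[mask, "smoothed_pore_volume"].max()
    return modes

# Made-up measurements for illustration:
poro = pd.DataFrame({"intrusion_step": [1, 1, 2, 2],
                     "diameter": [50, 200, 80, 90],
                     "smoothed_pore_volume": [0.2, 0.9, 0.4, 0.6]})
modes = compute_modes(poro)
```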
...
For each sample, the Python script parse_sedigraph_analysis.py creates a .csv file containing the following columns extracted or computed from the raw data file:
| Column | Formula |
|---|---|
| sample_id | extracted from the raw data file |
| date_measure | extracted from the raw data file. It is the date on which the file was created/generated |
| diameter | extracted from the raw data file |
| mass_percent | extracted from the raw data file |
| differential | [mass_percent][idx] - [mass_percent][idx+1] |
For each sample, the Python script compute_sedigraph_analysis.py creates one table containing the following computed values:
- d1_micro → The value of the column [mass_percent] for which the value of the column [diameter] is closest to 1.
- d05_micro → The value of the column [mass_percent] for which the value of the column [diameter] is closest to 0.5.
- d03_micro → The value of the column [mass_percent] for which the value of the column [diameter] is closest to 0.3.
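The closest-diameter lookup used for d1_micro, d05_micro and d03_micro can be sketched as follows (an illustration, not the production compute_sedigraph_analysis.py):

```python
import pandas as pd

def mass_percent_at(df: pd.DataFrame, target_diameter: float) -> float:
    """Return the [mass_percent] of the row whose [diameter] is closest
    to target_diameter (illustrative sketch)."""
    idx = (df["diameter"] - target_diameter).abs().idxmin()
    return df.loc[idx, "mass_percent"]

# Made-up sedigraph points for illustration:
sedigraph = pd.DataFrame({"diameter": [1.1, 0.48, 0.29],
                          "mass_percent": [80.0, 55.0, 30.0]})
d1_micro = mass_percent_at(sedigraph, 1.0)
d05_micro = mass_percent_at(sedigraph, 0.5)
d03_micro = mass_percent_at(sedigraph, 0.3)
```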
Handling duplicate data in BigQuery
(NEEDS CORRECTION)
For the Analytical Lab, we extract the raw data files that were added in the last 24 hours.
If an old file was replaced by a new one, the newly added file will also be processed and added to BigQuery as if it were new. Therefore, we need to handle duplicate data by suppressing old versions of the same file.
To do so, all the tables of the Analytical Lab should contain a column named “processing_date” (the date when the raw data file was processed and added to the corresponding BigQuery table).
A simple way to suppress the old version of a file is to create a new table in BigQuery with the following query, adapted to the concerned table:
...
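The intended keep-latest behaviour can be sketched in pandas; the grouping key sample_id is an assumption, and the real BigQuery query must be adapted to each table's schema:

```python
import pandas as pd

def drop_old_versions(df: pd.DataFrame, key: str = "sample_id") -> pd.DataFrame:
    """Keep, for each key, only the row with the most recent processing_date.
    The grouping key is an assumption; adapt it to the concerned table."""
    return (df.sort_values("processing_date")
              .groupby(key)
              .tail(1)              # last row per key after sorting = latest version
              .reset_index(drop=True))

# Made-up versions of the same samples for illustration:
versions = pd.DataFrame({"sample_id": ["a", "a", "b"],
                         "processing_date": ["2022-01-01", "2022-02-01", "2022-01-15"],
                         "value": [1, 2, 3]})
deduped = drop_old_versions(versions)
```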