Specific technical documentation for the Analytical Lab tests
Summary
Talend Documentation
The following document was created at the beginning of the project and needs to be updated, but the core idea is still valid:
Raw data extraction
Raw data for the tests of Analytical lab is extracted from the servers presented in the following table:
https://docs.google.com/spreadsheets/d/1z5AndNQxOsRI8O5TI3m16hkF7uN47ntAPFBuJ9A0CLA/edit#gid=0
The table below presents all the Analytical tests currently included in the Datalake, together with the data source and the raw data file format for each test:
Test | Data Sources | File format |
CPS | Raw data files | |
Granulométrie | Raw data files | |
ICP MS | Raw data files | |
ICP OES | Raw data files | |
Porohg | Raw data files | |
Sedigraph | Raw data files | |
Mesure Carbone | Raw data files | |
Infra Rouge | Raw data files | out of scope |
Python scripts and Talend jobs
Python scripts for downloading, parsing and computing new fields and columns for the Analytical Lab can be found here:
https://drive.google.com/drive/folders/1nO0jMmGDLLAOvK_UUFB92To63ak3RHYG
Talend data preparation for the Analytical Lab has the same structure as for Synthesis Lab.
Analytical Tests
CPS
For each raw data file, the Python script parse_CPS_analysis.py extracts or calculates the following columns and generates a .csv file:
Column name | Formula |
sample_id | extracted from the raw data file |
date_measure | extracted from the raw data file; the date on which the file was created/generated |
diameter_microns | extracted from the raw data file |
frequency | extracted from the raw data file |
frequency_normalized | We first compute max_frequency, the maximum value of the [frequency] column. Then: [frequency_normalized] = [frequency] * 100 / max_frequency |
area | For each row, we first compute min_frequency and max_frequency, the smaller and larger of [frequency][idx] and [frequency][idx+1]. Then: area = ((([diameter_microns][idx] - [diameter_microns][idx+1]) * min_frequency) + (([diameter_microns][idx] - [diameter_microns][idx+1]) * (max_frequency - min_frequency))) / 2 |
aggregate | aggregate = [area][idx] + [aggregate][idx-1] |
aggregate_normalized | We first compute max_aggregate, the maximum value of the [aggregate] column. Then: aggregate_normalized = 100 - [aggregate][idx] * 100 / max_aggregate |
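The derived columns above can be sketched as follows. This is a minimal illustration only, assuming the rows are sorted by decreasing [diameter_microns] and that the columns arrive as plain Python lists; parse_CPS_analysis.py remains the authoritative implementation.

```python
def compute_cps_columns(diameter_microns, frequency):
    """Sketch of the derived CPS columns; assumes rows sorted by decreasing diameter."""
    max_frequency = max(frequency)
    frequency_normalized = [f * 100 / max_frequency for f in frequency]

    # Area between consecutive rows, following the documented formula.
    area = []
    for idx in range(len(frequency) - 1):
        width = diameter_microns[idx] - diameter_microns[idx + 1]
        lo = min(frequency[idx], frequency[idx + 1])
        hi = max(frequency[idx], frequency[idx + 1])
        area.append((width * lo + width * (hi - lo)) / 2)
    area.append(0.0)  # the last row has no successor

    # aggregate[idx] = area[idx] + aggregate[idx-1], i.e. a running sum.
    aggregate = []
    total = 0.0
    for a in area:
        total += a
        aggregate.append(total)

    max_aggregate = max(aggregate)
    aggregate_normalized = [100 - x * 100 / max_aggregate for x in aggregate]
    return frequency_normalized, area, aggregate, aggregate_normalized
```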
For each sample, the Python script compute_CPS_analysis.py computes the following values (more details can be found in the Python script's comments):
Value | Formula |
sample_id | extracted from the raw data file |
date_measure | extracted from the raw data file; the date on which the file was created/generated |
d10 | Inferred using linear regression over the values of the [aggregate_normalized] column nearest to, and smaller than, 10 |
d16 | Inferred using linear regression over the values of the [aggregate_normalized] column nearest to, and smaller than, 16 |
d50 | Inferred using linear regression over the values of the [aggregate_normalized] column nearest to, and smaller than, 50 |
d84 | Inferred using linear regression over the values of the [aggregate_normalized] column nearest to, and smaller than, 84 |
d90 | Inferred using linear regression over the values of the [aggregate_normalized] column nearest to, and smaller than, 90 |
ld | (d84 - d16) / d50 |
mode | We first compute diameter_at_max_frequency, the value of [diameter_microns] at the row where [frequency] reaches its maximum. Then: mode = 1000 * diameter_at_max_frequency |
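The dX inference can be illustrated with a minimal sketch that fits a straight line through the two points bracketing the target value. This is an assumption about the exact fitting procedure; compute_CPS_analysis.py may regress over more points.

```python
def infer_dx(diameter_microns, aggregate_normalized, target):
    """Interpolate the diameter at which aggregate_normalized crosses `target`.

    Illustration only: assumes aggregate_normalized decreases monotonically
    along the rows; returns None when the target is never crossed.
    """
    for idx in range(len(aggregate_normalized) - 1):
        y0, y1 = aggregate_normalized[idx], aggregate_normalized[idx + 1]
        if y0 >= target >= y1:
            x0, x1 = diameter_microns[idx], diameter_microns[idx + 1]
            if y0 == y1:
                return x0
            # Straight line through the two bracketing points.
            return x0 + (target - y0) * (x1 - x0) / (y1 - y0)
    return None
```

With d16, d50 and d84 obtained this way, ld follows directly as (d84 - d16) / d50.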
Granulométrie
For granulométrie, two types of raw data files are included in the datalake:
Kinetics
Kinetics files are Google spreadsheets listed in the Google spreadsheet “Dispersibilité 190W”, in the sheet called “Samples”:
https://docs.google.com/spreadsheets/d/1kiPlW5-_a-Eo8fTyP9LmnOsKEZt3gkjw4fYk1TRpHM4/edit#gid=4663305
Sample IDs are taken from the column “Liste des échantillons analysés” and the unique key corresponding to the sample is taken from the column “URL”.
Data for each sample is extracted from the corresponding Google spreadsheet; once the data has been extracted, the corresponding cell of column H of the spreadsheet “Dispersibilité 190W” is updated with the value “done”. If an error occurred for one of the samples and the corresponding file needs to be reprocessed, delete the string “done” in column H and rerun the code.
For each sample, the Python script parse_granulo_cinetique.py extracts the following columns and creates a .csv file:
- sample_id
- record_nb
- time
- energy_eff_g_silica_J_g
- Dx_10
- Dx_16
- Dx_50
- Dx_84
- Dx_90
- D_4_3
- D_3_2
- mode
- results_below_1
- results_below_40
- results_below_50
- results_below_75
- mode_0
- mode_1
- mode_2
- mode_count
- mode_percentage_0
- mode_percentage_1
- mode_percentage_2
- obscuration_laser_percentage
- obscuration_bleue_percentage
Differential and Cumulative Curves
For each sample, the Python script parse_granulo_curbes.py extracts the following columns and creates three different tables:
First Table: Differential curves – from the first sheet of the Excel file, we extract the following columns:
- no_enregistrement (4_avg, 1, 2, 3)
- classes_de_taille
- densite_volume
Second Table: Cumulative curves – from the second sheet of the Excel file, we extract the following columns:
- no_enregistrement (4_avg, 1, 2, 3)
- classes_de_taille
- densite_volume
Third Table: Table of results – from the third sheet of the Excel file, we extract the following columns. Note: the Table of results has the same structure as the Kinetics table; these two tables are therefore merged in BigQuery.
- record_nb
- date_heure
- Dx_10
- Dx_16
- Dx_50
- Dx_84
- Dx_90
- D_4_3
- D_3_2
- mode
- results_below_1
- results_below_40
- results_below_50
- results_below_75
- mode_0
- mode_1
- mode_2
- mode_count
- mode_percentage_0
- mode_percentage_1
- mode_percentage_2
- obscuration_laser_percentage
- obscuration_bleue_percentage
ICP
For both ICP MS and ICP OES, along with the raw data files, we need to extract data from the Google spreadsheet called “Registre Appareil - ANA-RIC-Paris Balance/micro-onde N°0003531” (sheet 1):
https://docs.google.com/spreadsheets/d/1JRPpjUQdxHKyer7RfvotVq1vTmW1zCD-mykWoqQAND4/edit#gid=0
This allows us to extract, for each sample_id, the fields volume_liquide and masse_echantillon. These values are used to compute the concentration of each element in each sample.
ICP MS
For each raw data file, the Python script parse_ICP_masse.py creates two .csv tables: ICP_MS_details and ICP_MS_summary.
The table ICP_MS_details contains the following fields:
Column | Formula |
sample_id | extracted from the raw data file name |
FD | extracted from the raw data file |
concentration_1 | extracted from the raw data file |
concentration_2_read | extracted from the raw data file |
masse_echantillon | extracted from the Google spreadsheet “Registre Appareil - ANA-RIC-Paris Balance/micro-onde N°0003531” |
volume_liquide | extracted from the Google spreadsheet “Registre Appareil - ANA-RIC-Paris Balance/micro-onde N°0003531” |
concentration_2_computed | ([concentration_1] * [FD] * volume_liquide * masse_echantillon) /1000 |
low_limit | 0.3 * the value of the first etalon |
high_limit | 1.65 * the value of the last etalon |
warning | For each element, flagged when the measured concentration falls outside the interval [low_limit, high_limit] |
The table ICP_MS_summary contains, for each sample, the value of “concentration_2_computed” for each analyzed element (Al, Ti, Ca, Fe, Mg, Na, S, K, Zn, Sn, Li, P, As, Cd, Ce, Co, Cr, Cu, Hg, Mn, Ni, Pb, Sb, Se, W, Zr, Mo, La, Ba, Bi, V, Be, B, Si, Sc, Ga, Ge, Rb, Sr, Y, Nb, Ru, Rh, Pd, Ag, In, Te, Cs, Hf, Ta, Re, Os, Ir, Pt, Au, Tl, Pr, Nd, Pm, Sm, Eu, Gd, Tb, Dy, Ho, Er, Tm, Yb, Lu).
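The per-row derivations for ICP_MS_details can be sketched as below. The warning rule is an assumption on our part (the original formula did not survive in this document): here a row is flagged when its measured concentration falls outside [low_limit, high_limit]; parse_ICP_masse.py is the reference.

```python
def icp_ms_row(concentration_1, FD, volume_liquide, masse_echantillon,
               first_etalon, last_etalon):
    """Sketch of the computed ICP_MS_details columns for one row."""
    # Documented formula for the computed concentration.
    concentration_2_computed = (concentration_1 * FD * volume_liquide
                                * masse_echantillon) / 1000
    # Limits derived from the first and last calibration standards (etalons).
    low_limit = 0.3 * first_etalon
    high_limit = 1.65 * last_etalon
    # Assumed warning rule: measured value outside the calibrated range.
    warning = not (low_limit <= concentration_1 <= high_limit)
    return concentration_2_computed, low_limit, high_limit, warning
```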
ICP OES
For each raw data file, the Python script parse_ICP_optique.py creates two .csv tables: ICP_OES_details and ICP_OES_summary, by following the same rules as for ICP MS:
The table ICP_OES_details contains the same fields as the table ICP_MS_details, with one exception: In the ICP_OES_details table, we have a supplementary column called “percentage”. This column is filled only for two elements (i.e., Y and Co), whose values are expressed as a percentage.
The table ICP_OES_summary contains the same fields as the table ICP_MS_summary.
InfraRouge (removed from project scope)
For each sample, the Python script parse_InfraRouge.py creates a .csv file containing the following columns extracted from the raw data file:
- wave_length
- absorbance
Mesure Carbone
For each sample, the Python script parse_Mesure_Carbone.py creates a .csv file containing the following columns extracted from the raw data file:
- carbon_concentration
- sulfur_concentration
Porohg
For each sample, the Python script parse_porohg_analysis.py creates a .csv file containing the following columns extracted from the raw data file:
- sample_id
- diameter
- cumulative_pore_volume
- smoothed_pore_volume
- cumulative_pore_area
For each sample, the Python script compute_porohg_analysis.py creates two tables:
The First Table contains the columns extracted from the raw data files and the following computed column:
- The step column is computed by applying the following rules:
- “first intrusion” – from the beginning of the file and as long as the diameter values are decreasing;
- “extrusion” – from the end of the first intrusion step and as long as the diameter values are increasing;
- “second intrusion” – from the end of the extrusion step and as long as the diameter values are decreasing
- “end” – from the end of the second intrusion and until the end of the file
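The step rules above can be sketched as a small state machine over the diameter column. This is a minimal illustration (equality of consecutive diameters is treated as the end of the current step, an assumption); compute_porohg_analysis.py is the reference.

```python
def label_steps(diameters):
    """Assign a porosimetry step to each row, following the documented rules."""
    steps, state = [], "first intrusion"
    for idx, d in enumerate(diameters):
        if idx > 0:
            prev = diameters[idx - 1]
            # Each step ends when the diameter stops moving in its direction.
            if state == "first intrusion" and d >= prev:
                state = "extrusion"
            elif state == "extrusion" and d <= prev:
                state = "second intrusion"
            elif state == "second intrusion" and d >= prev:
                state = "end"
        steps.append(state)
    return steps
```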
The Second Table contains the following computed values:
Column | Formula |
sample_id | extracted from the raw data file |
date_measure | extracted from the raw data file; the date on which the file was created/generated |
total_volume | Value of the column [cumulative_pore_volume] at the end of the first intrusion step |
v1_micron | We first infer cumulative_pore_volume_m by linear regression, using the two points of “diameter” closest to 1000 (just before and just after) during the first intrusion. Then: v1_micron = total_volume - cumulative_pore_volume_m |
v1_nano | We first infer cumulative_pore_volume_n by linear regression, using the two points of “diameter” closest to 100 (just before and just after) during the first intrusion. Then: v1_nano = total_volume - cumulative_pore_volume_n |
total_volume2 | Value of the column [cumulative_pore_volume] at the end of the second intrusion step |
vol_agg_micron | We infer the value of vol_agg_micron by using linear regression with the two points of “diameter” that are closest to 1000 (just before and just after) during the second intrusion. |
vol_agg_nano | We infer the value of vol_agg_nano by using linear regression with the two points of “diameter” that are closest to 100 (just before and just after) during the second intrusion. |
v2_micro | total_volume2 - vol_agg_micron |
v2_nano | total_volume2 - vol_agg_nano |
mode1 | Maximum value of the column [smoothed_pore_volume] where the values of the column [diameter] are <= 100 during the first intrusion |
mode2 | Maximum value of the column [smoothed_pore_volume] where the values of the column [diameter] are <= 100 during the second intrusion |
A1 | |
B1 | |
ldp1 | (B1 – A1) / mode1 |
A2 | |
B2 | |
ldp2 | (B2 – A2) / mode2 |
Sedigraph
For each sample, the Python script parse_sedigraph_analysis.py creates a .csv file containing the following columns extracted or computed from the raw data file:
Column | Formula |
sample_id | extracted from the raw data file |
date_measure | extracted from the raw data file; the date on which the file was created/generated |
diameter | extracted from the raw data file |
mass_percent | extracted from the raw data file |
differential | [mass_percent][idx] - [mass_percent][idx+1] |
For each sample, the Python script compute_sedigraph_analysis.py creates one table containing the following computed values:
- d1_micro → The value of the column [mass_percent] for which the value of the column [diameter] is closest to 1.
- d05_micro → The value of the column [mass_percent] for which the value of the column [diameter] is closest to 0.5.
- d03_micro → The value of the column [mass_percent] for which the value of the column [diameter] is closest to 0.3.
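The d1/d05/d03 lookups all reduce to the same nearest-diameter search, which can be sketched as follows (illustration only; compute_sedigraph_analysis.py is the reference):

```python
def value_at_closest_diameter(diameter, mass_percent, target):
    """Return mass_percent at the row whose diameter is closest to `target`."""
    idx = min(range(len(diameter)), key=lambda i: abs(diameter[i] - target))
    return mass_percent[idx]
```

d1_micro, d05_micro and d03_micro are then this function applied with targets 1, 0.5 and 0.3.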
Handling duplicate data in BigQuery
(NEEDS CORRECTION)
For Analytical Lab, we extract the raw data files that were added in the last 24 hours.
If an older file was replaced by a new one, the newly added file will also be processed and added to BigQuery as if it were a new file. Therefore, we need to handle duplicate data by removing the older versions of the same file.
In order to do so, all the tables of the Analytical Lab should contain a column named “processing_date” (the date when the raw data file was processed and added to the corresponding BigQuery table).
A simple way to remove the older versions of a file is to create a new table in BigQuery by using a deduplication query adapted to the concerned table:
(See tables “test_processing_date” and “test_processing_date_last” from the dataset “raw_data_test” for an example).
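The deduplication logic can be sketched in Python as keeping, per key, the row with the most recent processing_date. The grouping key (sample_id here) is an assumption and must be adapted to each table; in BigQuery the same effect is commonly obtained with a window function such as ROW_NUMBER() OVER (PARTITION BY sample_id ORDER BY processing_date DESC).

```python
def keep_latest(rows):
    """Keep, for each sample_id, only the row with the newest processing_date.

    `rows` is a list of dicts with at least "sample_id" and "processing_date"
    (ISO-formatted dates compare correctly as strings).
    """
    latest = {}
    for row in rows:
        key = row["sample_id"]
        if key not in latest or row["processing_date"] > latest[key]["processing_date"]:
            latest[key] = row
    return list(latest.values())
```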