...
The following document was created at the beginning of the project and needs to be updated, but the core idea is still valid:
Raw data extraction
Raw data files for the Analytical Lab tests are extracted from the servers listed in the Asset Tag Spreadsheet: https://docs.google.com/spreadsheets/d/1z5AndNQxOsRI8O5TI3m16hkF7uN47ntAPFBuJ9A0CLA/edit#gid=0
The table below presents all the Analytical tests that are currently included in the Datalake, the source of data and the raw data files format for each test:
| Test | Data Sources | File format |
|---|---|---|
| CPS | Raw data files | csv |
| Granulometrie | Raw data files | xls |
| | | csv |
| | | xls |
| Porohg | Raw data files | xls |
| Sedigraph | Raw data files | xls |
| Mesure Carbone | Raw data files | |
Python scripts and Talend jobs
Talend jobs
Data is processed using Talend and Python. The Talend jobs for the Analytical Lab can be found in the Talend project RnI_Silica_Analytical.
The job F000_Orch_Flow_Synthesis is the orchestrator: it calls all the other jobs successively, in a determined order.
Python scripts
A copy of the Python scripts for downloading, parsing and computing new fields and columns for the Analytical Lab raw data and ELN data can be found on Google Drive:
https://drive.google.com/drive/folders/1nO0jMmGDLLAOvKUUFB92To63ak3RHYG1F7ESi1oZFuis3kq9d_
Talend data preparation for the Analytical Lab has the same structure as for Synthesis Lab.
...
However, it is recommended to use the latest version of the scripts, available in the production path: D:\DATA\PROD\RnI\Silica\Permanent. For the moment, Git is not in place for this project.
Analytical Test files
CPS
For each raw data file, the Python script parse_CPS_analysis.py extracts or calculates the following columns and generates a .csv file:
| Column name | Formula |
|---|---|
| sample_id | extracted from the raw data file |
| date_measure | extracted from the raw data file name. It is the date on which the file was created/generated |
| diameter_microns | extracted from the raw data file |
| frequency | extracted from the raw data file |
| frequency_normalized | We first compute max_frequency, the maximum value of [frequency]. Then: [frequency_normalized] = [frequency] * 100 / max_frequency |
| area | For each row we compute min_frequency and max_frequency, the smaller and larger of [frequency][idx] and [frequency][idx+1]. Then: area = ((([diameter_microns][idx] - [diameter_microns][idx+1]) * min_frequency) + (([diameter_microns][idx] - [diameter_microns][idx+1]) * (max_frequency - min_frequency))) / 2 |
| aggregate | aggregate = [area][idx] + [aggregate][idx-1] |
| aggregate_normalized | We first compute max_aggregate, the maximum value of [aggregate]. Then: aggregate_normalized = 100 - [aggregate][idx] * 100 / max_aggregate |
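As an illustration, the computed columns can be sketched in pandas. This is one reading of the formulas above, not the production parse_CPS_analysis.py, and the input numbers are made up:

```python
import pandas as pd

def add_cps_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Add the computed CPS columns to a frame holding the extracted
    diameter_microns and frequency columns (illustrative sketch)."""
    out = df.copy()
    out["frequency_normalized"] = out["frequency"] * 100 / out["frequency"].max()

    # Per-row area between consecutive points, following the documented formula.
    d = out["diameter_microns"]
    f = out["frequency"]
    delta_d = d - d.shift(-1)  # [diameter_microns][idx] - [diameter_microns][idx+1]
    pair = pd.concat([f, f.shift(-1)], axis=1)
    min_f, max_f = pair.min(axis=1), pair.max(axis=1)
    out["area"] = (delta_d * min_f + delta_d * (max_f - min_f)) / 2

    # aggregate is the running sum of area; aggregate_normalized rescales it.
    out["aggregate"] = out["area"].cumsum()
    out["aggregate_normalized"] = 100 - out["aggregate"] * 100 / out["aggregate"].max()
    return out

# Tiny made-up input for illustration:
sample = pd.DataFrame({"diameter_microns": [3.0, 2.0, 1.0],
                       "frequency": [10.0, 20.0, 5.0]})
result = add_cps_columns(sample)
```

The last row has no successor, so its area (and the tail of the cumulative sum) is left as NaN rather than guessed.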
...
| Value | Formula |
|---|---|
| sample_id | extracted from the raw data file name |
| date_measure | extracted from the raw data file. It is the date on which the file was created/generated |
| d10 | Inferred using linear regression, taking the values of the [aggregate_normalized] column nearest to, and smaller than, 10 |
| d16 | Inferred using linear regression, taking the values of the [aggregate_normalized] column nearest to, and smaller than, 16 |
| d50 | Inferred using linear regression, taking the values of the [aggregate_normalized] column nearest to, and smaller than, 50 |
| d84 | Inferred using linear regression, taking the values of the [aggregate_normalized] column nearest to, and smaller than, 84 |
| d90 | Inferred using linear regression, taking the values of the [aggregate_normalized] column nearest to, and smaller than, 90 |
| ld | (d84 - d16) / d50 |
| mode | We first compute diameter_at_max_frequency, the value of [diameter_microns] at which [frequency] is maximal. Then: mode = 1000 * diameter_at_max_frequency |
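One plausible reading of the dXX inference, sketched with NumPy. The number of points used in the fit (two here) is an assumption; the production script may use a different count:

```python
import numpy as np

def infer_dx(diameter, aggregate_normalized, target, n_points=2):
    """Estimate the diameter at which aggregate_normalized reaches `target`
    by fitting a line through the n_points values just below the target.
    Illustrative sketch; n_points=2 is an assumption."""
    diameter = np.asarray(diameter, dtype=float)
    agg = np.asarray(aggregate_normalized, dtype=float)
    below = np.where(agg < target)[0]                      # candidates under the target
    nearest = below[np.argsort(target - agg[below])][:n_points]
    # Regress diameter on aggregate_normalized, then evaluate at the target.
    slope, intercept = np.polyfit(agg[nearest], diameter[nearest], 1)
    return slope * target + intercept

# Made-up curve for illustration:
d50 = infer_dx([1.0, 2.0, 3.0, 4.0], [20.0, 45.0, 48.0, 60.0], 50)
```

The same helper would be called with targets 10, 16, 84 and 90 for the other values.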
...
Kinetics files are Google spreadsheets that are listed in the Google spreadsheet “Dispersibilité 190W”, in the sheet called “Samples”:
https://docs.google.com/spreadsheets/d/1kiPlW5-_a-Eo8fTyP9LmnOsKEZt3gkjw4fYk1TRpHM4/edit#gid=4663305
Sample IDs are taken from the column “Liste des échantillons analysés” and the unique key corresponding to the sample is taken from the column “URL”.
...
- record_nb
- date_heure
- Dx_10
- Dx_16
- Dx_50
- Dx_84
- Dx_90
- D_4_3
- D_3_2
- mode
- results_below_1
- results_below_40
- results_below_50
- results_below_75
- mode_0
- mode_1
- mode_2
- mode_count
- mode_percentage_0
- mode_percentage_1
- mode_percentage_2
- obscuration_laser_percentage
- obscuration_bleue_percentage
ICP (processes not in production as at 22/09/2022)
For both ICP MS and ICP OES, along with the raw data files, we need to extract data from the Google spreadsheet called “Registre Appareil - ANA-RIC-Paris Balance/micro-onde N°0003531” (feuille 1):
...
| Column | Formula |
|---|---|
| sample_id | extracted from the raw data file name |
| FD | extracted from the raw data file |
| concentration_1 | extracted from the raw data file |
| concentration_2_read | extracted from the raw data file |
| masse_echantillon | |
| volume_liquide | |
| concentration_2_computed | ([concentration_1] * [FD] * volume_liquide * masse_echantillon) / 1000 |
| low_limit | 0.3 * the value of the first etalon |
| high_limit | 1.65 * the value of the last etalon |
| warning | For each element: |
The table ICP_MS_summary contains, for each sample, the value of “concentration_2_computed” for each measured element (Al, Ti, Ca, Fe, Mg, Na, S, K, Zn, Sn, Li, P, As, Cd, Ce, Co, Cr, Cu, Hg, Mn, Ni, Pb, Sb, Se, W, Zr, Mo, La, Ba, Bi, V, Be, B, Si, Sc, Ga, Ge, Rb, Sr, Y, Nb, Ru, Rh, Pd, Ag, In, Te, Cs, Hf, Ta, Re, Os, Ir, Pt, Au, Tl, Pr, Nd, Pm, Sm, Eu, Gd, Tb, Dy, Ho, Er, Tm, Yb, Lu).
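A sketch of the derived ICP fields per element. The warning rule shown (flagging a value outside [low_limit, high_limit]) is an assumption, since the exact formula is not given above:

```python
def icp_computed_fields(concentration_1, fd, volume_liquide, masse_echantillon,
                        etalons):
    """Compute the derived ICP fields for one element (illustrative sketch).
    `etalons` is the ordered list of calibration standard values.
    The warning rule below is an assumption, not the documented formula."""
    concentration_2_computed = (concentration_1 * fd * volume_liquide
                                * masse_echantillon) / 1000
    low_limit = 0.3 * etalons[0]      # 0.3 * value of the first etalon
    high_limit = 1.65 * etalons[-1]   # 1.65 * value of the last etalon
    # Assumed rule: warn when the computed concentration falls outside the range.
    warning = not (low_limit <= concentration_2_computed <= high_limit)
    return concentration_2_computed, low_limit, high_limit, warning

# Made-up values for illustration:
c2, low, high, warn = icp_computed_fields(2.0, 10, 5.0, 1.0, etalons=[1.0, 100.0])
```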
...
For each sample, the Python script parse_InfraRouge.py creates a .csv file containing the following columns extracted from the raw data file:
- wave_length
- absorbance
Mesure Carbone (not being used as at 22/09/22)
For each sample, the Python script parse_Mesure_Carbone.py creates a .csv file containing the following columns extracted from the raw data file:
...
| Column | Formula |
|---|---|
| sample_id | extracted from the raw data file name |
| date_measure | extracted from the raw data file. It is the date on which the file was created/generated |
| total_volume | Value of the column [cumulative_pore_volume] at the end of the first intrusion step |
| v1_micron | We first compute cumulative_pore_volume_m. Then: v1_micron = total_volume - cumulative_pore_volume_m |
| v1_nano | We first compute cumulative_pore_volume_n. Then: v1_nano = total_volume - cumulative_pore_volume_n |
| total_volume2 | Value of the column [cumulative_pore_volume] at the end of the second intrusion step |
| vol_agg_micron | Inferred using linear regression with the two points of [diameter] that are closest to 1000 (just before and just after) during the second intrusion |
| vol_agg_nano | Inferred using linear regression with the two points of [diameter] that are closest to 100 (just before and just after) during the second intrusion |
| v2_micro | total_volume2 - vol_agg_micron |
| v2_nano | total_volume2 - vol_agg_nano |
| mode1 | Maximum value of the column [smoothed_pore_volume] where the values of the column [diameter] are <= 100 during the first intrusion |
| mode2 | Maximum value of the column [smoothed_pore_volume] where the values of the column [diameter] are <= 100 during the second intrusion |
| A1 | |
| B1 | |
| ldp1 | (B1 - A1) / mode1 |
| A2 | |
| B2 | |
| ldp2 | (B2 - A2) / mode2 |
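A minimal sketch of the mode1/mode2 rule, assuming the parsed file exposes intrusion_step, diameter and smoothed_pore_volume columns (these names are assumptions about the parsed data):

```python
import pandas as pd

def compute_modes(df: pd.DataFrame) -> dict:
    """mode1 / mode2: the maximum smoothed_pore_volume where diameter <= 100,
    taken separately for each intrusion step (illustrative sketch)."""
    modes = {}
    for step, label in [(1, "mode1"), (2, "mode2")]:
        mask = (df["intrusion_step"] == step) & (df["diameter"] <= 100)
        modes[label] = df.loc[mask, "smoothed_pore_volume"].max()
    return modes

# Made-up measurements for illustration:
poro = pd.DataFrame({"intrusion_step": [1, 1, 2, 2],
                     "diameter": [50, 200, 80, 90],
                     "smoothed_pore_volume": [0.2, 0.9, 0.4, 0.6]})
modes = compute_modes(poro)
```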
...
For each sample, the Python script parse_sedigraph_analysis.py creates a .csv file containing the following columns extracted or computed from the raw data file:
| Column | Formula |
|---|---|
| sample_id | extracted from the raw data file |
| date_measure | extracted from the raw data file. It is the date on which the file was created/generated |
| diameter | extracted from the raw data file |
| mass_percent | extracted from the raw data file |
| differential | [mass_percent][idx] - [mass_percent][idx+1] |
For each sample, the Python script compute_sedigraph_analysis.py creates one table containing the following computed values:
- d1_micro → The value of the column [mass_percent] for which the value of the column [diameter] is closest to 1.
- d05_micro → The value of the column [mass_percent] for which the value of the column [diameter] is closest to 0.5.
- d03_micro → The value of the column [mass_percent] for which the value of the column [diameter] is closest to 0.3.
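The closest-diameter lookup used for d1_micro, d05_micro and d03_micro can be sketched as follows (an illustration, not the production compute_sedigraph_analysis.py):

```python
import pandas as pd

def mass_percent_at(df: pd.DataFrame, target_diameter: float) -> float:
    """Return the [mass_percent] of the row whose [diameter] is closest
    to target_diameter (illustrative sketch)."""
    idx = (df["diameter"] - target_diameter).abs().idxmin()
    return df.loc[idx, "mass_percent"]

# Made-up sedigraph points for illustration:
sedigraph = pd.DataFrame({"diameter": [1.1, 0.48, 0.29],
                          "mass_percent": [80.0, 55.0, 30.0]})
d1_micro = mass_percent_at(sedigraph, 1.0)
d05_micro = mass_percent_at(sedigraph, 0.5)
d03_micro = mass_percent_at(sedigraph, 0.3)
```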
Handling duplicate data in BigQuery
(NEEDS CORRECTION)
For the Analytical Lab, we extract the raw data files that were added in the last 24 hours.
If an old file was replaced by a new one, the newly added file will also be processed and added to BigQuery as if it were new. Therefore, we need to handle duplicate data by suppressing old versions of the same file.
To do so, all the tables of the Analytical Lab should contain a column named “processing_date” (the date when the raw data file was processed and added to the corresponding BigQuery table).
A simple way to suppress the old version of a file is to create a new table in BigQuery with the following query, adapted to the concerned table:
...
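The intended keep-latest behaviour can be sketched in pandas; the grouping key sample_id is an assumption, and the real BigQuery query must be adapted to each table's schema:

```python
import pandas as pd

def drop_old_versions(df: pd.DataFrame, key: str = "sample_id") -> pd.DataFrame:
    """Keep, for each key, only the row with the most recent processing_date.
    The grouping key is an assumption; adapt it to the concerned table."""
    return (df.sort_values("processing_date")
              .groupby(key)
              .tail(1)              # last row per key after sorting = latest version
              .reset_index(drop=True))

# Made-up versions of the same samples for illustration:
versions = pd.DataFrame({"sample_id": ["a", "a", "b"],
                         "processing_date": ["2022-01-01", "2022-02-01", "2022-01-15"],
                         "value": [1, 2, 3]})
deduped = drop_old_versions(versions)
```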