Before moving on to the data encoding stage, we first impute the missing feature values. The imputation rules are stored as a dictionary in the Dataiku variables and are read by the "simple_imput()" function in the Python encoding recipe.
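As a rough illustration of dictionary-driven imputation, here is a minimal sketch in plain Python. It is not the project's "simple_imput()" implementation: the function name `impute_rows`, the rule names ("median", "mode", "constant") and the feature names are all assumptions made for the example, and rows are represented as plain dictionaries to keep the sketch dependency-free.

```python
from statistics import median, mode

# Hypothetical imputation dictionary (feature name -> rule). The feature names
# and rule names are assumptions made for this sketch only.
IMPUTATION_RULES = {
    "volume": "median",
    "region": "mode",
    "incoterms": {"constant": "UNKNOWN"},
}


def impute_rows(rows, rules):
    """Fill missing (None) values feature by feature.

    Returns a new list of rows plus the number of imputed cells per feature.
    The input rows are left untouched.
    """
    # First pass: compute one fill value per feature from the observed values.
    fill_values = {}
    for feature, rule in rules.items():
        observed = [row[feature] for row in rows if row.get(feature) is not None]
        if isinstance(rule, dict):
            fill_values[feature] = rule["constant"]
        elif rule == "median":
            fill_values[feature] = median(observed)
        elif rule == "mode":
            fill_values[feature] = mode(observed)
        else:
            raise ValueError(f"Unknown imputation rule for {feature!r}: {rule!r}")

    # Second pass: copy each row and fill the gaps, counting imputed cells.
    imputed_rows = []
    imputed_counts = {feature: 0 for feature in rules}
    for row in rows:
        new_row = dict(row)
        for feature, value in fill_values.items():
            if new_row.get(feature) is None:
                new_row[feature] = value
                imputed_counts[feature] += 1
        imputed_rows.append(new_row)
    return imputed_rows, imputed_counts
```

Returning the per-feature counts alongside the imputed rows makes it straightforward to derive imputation percentages afterwards, which is the kind of figure the monitoring step relies on.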
The details of the imputation method by GBU and by feature are described here:
To monitor the percentage of imputations per feature and per CPC, we added a specific zone in the flow named "Imputation_analysis":
In this zone, we created three separate datasets, as follows:
==> The checks carried out on this dataset compare these metrics with fixed thresholds: 0.4 for the first metric and 0.5 for the second (both values can be modified in the check parameters).
==> The check raises a warning or an error (depending on the message type defined in the Dataiku variables) if "pct_imputed_cols" is greater than the threshold (set to 0.5 in the check parameters).
==> The check verifies whether the imputation percentage for a given feature is greater than the defined threshold (set to 0.5 in the check parameters).
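The per-feature threshold check described above can be sketched as follows. This is an illustrative stand-alone function, not the Dataiku check itself: the name `check_imputation_rate`, its input format (one list of imputed/not-imputed flags per feature) and the status strings are assumptions. Only the 0.5 default threshold and the warning-or-error behaviour come from the text.

```python
def check_imputation_rate(imputed_flags, threshold=0.5, severity="WARNING"):
    """Compare the share of imputed values per feature with a threshold.

    imputed_flags: dict mapping feature name -> list of booleans
                   (True when the value was imputed). Hypothetical format.
    severity:      status returned when the threshold is exceeded,
                   e.g. "WARNING" or "ERROR".
    """
    results = {}
    for feature, flags in imputed_flags.items():
        # Share of imputed values; an empty feature counts as 0% imputed.
        pct = sum(flags) / len(flags) if flags else 0.0
        # Strictly greater than the threshold triggers the message.
        status = severity if pct > threshold else "OK"
        results[feature] = (pct, status)
    return results


report = check_imputation_rate(
    {
        "region": [True, False, False, False],
        "incoterms": [True, True, True, False],
    },
    threshold=0.5,
    severity="ERROR",
)
for feature, (pct, status) in report.items():
    print(f"{feature}: {pct:.0%} imputed -> {status}")
```

Keeping the severity as a parameter mirrors the idea that the message type is configured outside the check rather than hard-coded in it.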
In the Python encoding recipe, we also collapse rare categorical modalities using the "collapse_categorical()" function: any modality with fewer occurrences than a given threshold (5 by default) is collapsed.
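A minimal sketch of this kind of collapsing is shown below. It is not the project's "collapse_categorical()" code: the function name `collapse_rare_modalities` and the replacement label "OTHER" are assumptions, and the sketch assumes that rare modalities are grouped under a single common label.

```python
from collections import Counter


def collapse_rare_modalities(values, threshold=5, other_label="OTHER"):
    """Replace modalities seen fewer than `threshold` times by `other_label`.

    `other_label` is a hypothetical placeholder chosen for this sketch.
    """
    counts = Counter(values)
    return [v if counts[v] >= threshold else other_label for v in values]


incoterms = ["DDP"] * 6 + ["CIF"] * 5 + ["COL"] * 2 + ["PPD"]
collapsed = collapse_rare_modalities(incoterms, threshold=5)
print(Counter(collapsed))
```

Collapsing before encoding limits the number of categories whose encoded value would otherwise be estimated from only a handful of rows.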
Vocabulary: target CPC vs. target of the model
To follow this section, it is important to distinguish two usages of the word "target". The machine learning part of the project chains two successive models, and "target" means something different for each of them.
___________________
For our models to work and compute distances, all inputs must be numeric. Since our original data also contains categorical features (e.g. region and incoterms), we have to convert them to numerical values.
This is a common step in machine learning called "encoding". In our case, we use a specific encoder called the "target encoder".
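For readers unfamiliar with the idea, the sketch below shows target encoding in its simplest textbook form: each category is replaced by the mean of a numeric target observed for that category. It is only an illustration under assumptions. The function names and the toy data are hypothetical, and the exact variant used in the project (which target, any smoothing, handling of unseen categories) is not specified here.

```python
from collections import defaultdict


def fit_target_encoding(categories, targets):
    """Return a mapping category -> mean target value for that category."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    return {category: sums[category] / counts[category] for category in sums}


def apply_target_encoding(categories, mapping, default):
    """Encode categories; unseen ones fall back to `default`."""
    return [mapping.get(category, default) for category in categories]


# Toy data, invented for this sketch only.
incoterms = ["PPD", "DDP", "PPD", "CIF"]
target = [1.0, 0.0, 0.5, 1.0]

mapping = fit_target_encoding(incoterms, target)
global_mean = sum(target) / len(target)
print(apply_target_encoding(["PPD", "DDP", "COL"], mapping, default=global_mean))
```

Identical categories always receive the same encoded value, and falling back to the global mean for unseen categories is one common choice among several.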
Here is how it works:
Example on the "Incoterms" feature for the Sulfosuccinate_Sulfosuccinamate family:
| Initial value | Encoded value |
|---|---|
| PPD | 0.72 |
| DDP | 0.41 |
| COL | 0.80 |
| PPD | 0.72 |
| CIF | 0.64 |
In the example above: