Before moving on to the data encoding stage, we impute the missing values in the features. The imputation rules are stored as a dictionary in the Dataiku variables and applied by the "simple_imput()" function in the Python encoding recipe.
The imputation method for each feature with missing values is summarized in the following table:
| GBU | Feature | Imputation method | Details |
|---|---|---|---|
| CS | product_group | most frequent | applied for all families |
| CS | end_use | most frequent | applied for all families |
| CS | gbu_customer_seg | most frequent | applied for all families |
| CS | manual_region | most frequent | applied for all families |
| CS | lip2 | most frequent | applied only for Alkoxylates |
| CS | historical_unit_price_coalesce_ratio_on_12 | constant | value equal to 1 for all families |
| CS | historical_sales_coalesce_ratio_on_12 | constant | value equal to 1 for all families |
| CS | historical_unit_price_ratio_3_on_12_month | constant | value equal to 1 for all families |
| CS | n_competitors | constant | value equal to 1, applied only for Sulfosuccinate_Sulfosuccinamate and Sulfosuccinates_Healthcare |
| CS | COMPONENT_ratio | mean | applied for all families |
| CS | IMPURITY_ratio | mean | applied for all families |
| CS | SOLVENT_ratio | mean | applied for all families |
| CS | n_components | mean | applied for all families |
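
As an illustration of dictionary-driven imputation, here is a minimal sketch in plain Python. It is not the actual "simple_imput()" implementation. The names `IMPUTATION_CONFIG` and `impute_rows`, and the shape of the dictionary, are assumptions made for this example. The real dictionary in the Dataiku variables may be organised differently, and the per-family restrictions listed in the table are left out for brevity.

```python
from collections import Counter
from statistics import mean

# Hypothetical config shape: feature -> (method, constant value or None).
# The real dictionary stored in the Dataiku variables may look different.
IMPUTATION_CONFIG = {
    "product_group": ("most_frequent", None),
    "historical_sales_coalesce_ratio_on_12": ("constant", 1),
    "n_components": ("mean", None),
}


def impute_rows(rows, config):
    """Return a copy of rows with None values filled according to config."""
    fill_values = {}
    for feature, (method, constant) in config.items():
        observed = [r[feature] for r in rows if r.get(feature) is not None]
        if method == "constant":
            fill_values[feature] = constant
        elif not observed:
            # No observed value to learn from: leave the feature untouched.
            continue
        elif method == "most_frequent":
            fill_values[feature] = Counter(observed).most_common(1)[0][0]
        elif method == "mean":
            fill_values[feature] = mean(observed)
        else:
            raise ValueError(f"Unknown imputation method: {method}")

    imputed = []
    for row in rows:
        new_row = dict(row)  # do not mutate the input rows
        for feature, value in fill_values.items():
            if new_row.get(feature) is None:
                new_row[feature] = value
        imputed.append(new_row)
    return imputed


rows = [
    {"product_group": "A", "historical_sales_coalesce_ratio_on_12": 0.8, "n_components": 2},
    {"product_group": "A", "historical_sales_coalesce_ratio_on_12": None, "n_components": None},
    {"product_group": None, "historical_sales_coalesce_ratio_on_12": 1.2, "n_components": 4},
]
imputed_rows = impute_rows(rows, IMPUTATION_CONFIG)
```

Keeping the method per feature in one dictionary means a rule can be changed from the project variables without touching the recipe code.
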
To monitor the percentage of imputations per feature and per CPC, we added a specific zone in the flow named "Imputation_analysis":
In this zone we created three separate datasets, as follows:
==> The checks carried out in this dataset compare these metrics with fixed thresholds: 0.4 for the first metric and 0.5 for the second (both values can be modified in the check parameters).
==> The check raises a warning or an error (depending on the message type defined in the Dataiku variables) if "pct_imputed_cols" is greater than a threshold (set to 0.5 in the check parameters).
==> The check verifies whether the imputation percentage for a given feature is greater than the defined threshold (set to 0.5 in the check parameters).
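
The per-feature threshold check can be sketched as follows. This is a simplified stand-in written in plain Python, not the Dataiku check itself. The function names `imputation_rate` and `check_imputation`, the `severity` argument, and the sample rows are hypothetical. Only the 0.5 threshold and the warning/error distinction come from the description above.

```python
def imputation_rate(rows, feature):
    """Fraction of rows where `feature` is missing (None) before imputation."""
    if not rows:
        return 0.0
    missing = sum(1 for r in rows if r.get(feature) is None)
    return missing / len(rows)


def check_imputation(rows, features, threshold=0.5, severity="WARNING"):
    """Return {feature: (status, rate)}.

    status is `severity` when the rate is strictly greater than the
    threshold, and "OK" otherwise.
    """
    results = {}
    for feature in features:
        rate = imputation_rate(rows, feature)
        status = severity if rate > threshold else "OK"
        results[feature] = (status, rate)
    return results


# Illustrative data only: end_use is missing in 3 of 4 rows, lip2 in 1 of 4.
rows = [
    {"end_use": None, "lip2": "x"},
    {"end_use": None, "lip2": None},
    {"end_use": "coatings", "lip2": "y"},
    {"end_use": None, "lip2": "x"},
]
report = check_imputation(rows, ["end_use", "lip2"], threshold=0.5, severity="ERROR")
```

In this sketch `severity` plays the role of the message type read from the Dataiku variables, so the same check can be switched between a warning and an error without code changes.
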
In the Python encoding recipe we also collapse categorical features using the "collapse_categorical()" function. It takes a threshold and collapses the modalities that occur fewer times than this threshold (the default value is 5).
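
A minimal sketch of this kind of collapsing is shown below. It is not the actual "collapse_categorical()" code. The function name `collapse_rare_categories`, the replacement label "OTHER", and the choice to leave missing values untouched are assumptions for this example. The sketch also assumes that the threshold is compared against the number of occurrences of each modality.

```python
from collections import Counter


def collapse_rare_categories(values, threshold=5, other_label="OTHER"):
    """Replace modalities seen fewer than `threshold` times with `other_label`.

    Missing values (None) are left as-is so that imputation can handle them.
    """
    counts = Counter(v for v in values if v is not None)
    return [
        other_label if v is not None and counts[v] < threshold else v
        for v in values
    ]


# Illustrative data: LATAM appears only twice and is therefore collapsed.
values = ["EMEA"] * 6 + ["APAC"] * 5 + ["LATAM"] * 2 + [None]
collapsed = collapse_rare_categories(values, threshold=5)
```

A modality that occurs exactly `threshold` times is kept, since only strictly smaller counts are collapsed.
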