Imputation:
Before moving on to the data encoding stage, we impute the missing values in the features. The imputation method for each feature is defined in a dictionary stored in the Dataiku variables, which is read by the "simple_imput()" function in the Python encoding recipe.
The imputation method for each feature with missing values is summarized in the following table:
| GBU | Feature | Imputation method | Details |
|---|---|---|---|
| CS | product_group | most frequent | applied for all families |
| CS | end_use | most frequent | applied for all families |
| CS | gbu_customer_seg | most frequent | applied for all families |
| CS | manual_region | most frequent | applied for all families |
| CS | lip2 | most frequent | applied only for Alkoxylates |
| CS | historical_unit_price_coalesce_ratio_on_12 | constant | value equal to 1 for all families |
| CS | historical_sales_coalesce_ratio_on_12 | constant | value equal to 1 for all families |
| CS | historical_unit_price_ratio_3_on_12_month | constant | value equal to 1 for all families |
| CS | n_competitors | constant | value equal to 1, applied only for Sulfosuccinate_Sulfosuccinamate and Sulfosuccinates_Healthcare |
| CS | COMPONENT_ratio | mean | applied for all families |
| CS | IMPURITY_ratio | mean | applied for all families |
| CS | SOLVENT_ratio | mean | applied for all families |
| CS | n_components | mean | applied for all families |
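
The table above can be read as a configuration: each feature maps to a method, an optional constant value and an optional list of families. The following is a minimal sketch of that idea with pandas, not the project's actual "simple_imput()" implementation. The dictionary layout, the function name `impute_sketch`, the `family` column name and the choice to compute statistics over the in-scope rows only are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical configuration mirroring a few rows of the table above.
# The real dictionary lives in the Dataiku variables; its structure is
# not shown in this document, so this layout is an assumption.
IMPUTATION_CONFIG = {
    "product_group": {"method": "most_frequent"},
    "n_components": {"method": "mean"},
    "historical_sales_coalesce_ratio_on_12": {"method": "constant", "value": 1},
    "lip2": {"method": "most_frequent", "families": ["Alkoxylates"]},
}


def impute_sketch(df, config, family_col="family"):
    """Fill missing values per feature according to `config` (illustrative only)."""
    out = df.copy()
    for feature, rule in config.items():
        # Restrict to the listed families, or use every row when none are listed.
        families = rule.get("families")
        if families:
            in_scope = out[family_col].isin(families)
        else:
            in_scope = pd.Series(True, index=out.index)
        target = in_scope & out[feature].isna()
        if not target.any():
            continue
        observed = out.loc[in_scope, feature].dropna()
        if rule["method"] == "constant":
            fill = rule["value"]
        elif observed.empty:
            continue  # no observed value to derive a statistic from
        elif rule["method"] == "mean":
            fill = observed.mean()
        elif rule["method"] == "most_frequent":
            fill = observed.mode().iloc[0]
        else:
            raise ValueError(f"Unknown imputation method: {rule['method']}")
        out.loc[target, feature] = fill
    return out
```

Keeping the rules in a dictionary rather than in the code means a method or family restriction can be changed from the project variables without editing the recipe.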
To monitor the percentage of imputations per feature and per CPC, we added a dedicated zone in the flow named "Imputation_analysis".
In this zone we created three separate datasets, as follows:
- Imputed_features_by_cpc: Contains, for each CPC, the imputed values (using the methods defined above) of the features included in the corresponding family. Two metrics are monitored in this dataset. The first, "tot_features_weights", measures the impact of the imputed features for each CPC based on the average weights of these features (taken from the first model). The second, "pct_imputed_cols", is the percentage of imputed features relative to the features included in the corresponding CPC family.
==> The checks carried out in this dataset compare these metrics with a fixed threshold of 0.4 for the first metric and 0.5 for the second (these values can be modified in the check parameters).
- Imputed_HBs_by_cpc: Contains imputation statistics only for the features included as hard boundaries (number and percentage of imputed hard boundaries for each family by CPC).
==> The check raises a warning or an error (depending on the message type defined in the Dataiku variables) if "pct_imputed_cols" is greater than a threshold (set to 0.5 in the check parameters).
- Imputation_pct_by_feature: Calculates the percentage of imputation for each feature (including HBs) by family.
==> The check verifies whether the imputation percentage for a given feature is greater than the defined threshold (set to 0.5 in the check parameters).
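
The two metrics of Imputed_features_by_cpc and their checks can be sketched as follows. This is an illustration under stated assumptions, not the code of the flow: the function name `imputation_metrics`, the input layout (one boolean column per feature, one row per CPC) and the `check_failed` column are hypothetical, "tot_features_weights" is assumed to be the sum of the average weights of the imputed features, and the comparison is assumed to be strictly greater than the threshold.

```python
import pandas as pd


def imputation_metrics(was_imputed, weights, weight_threshold=0.4, pct_threshold=0.5):
    """Compute per-CPC imputation metrics (illustrative only).

    was_imputed: boolean DataFrame indexed by CPC, one column per feature
                 included in the family (True where the value was imputed).
    weights:     Series of average feature weights, indexed by feature name.
    """
    metrics = pd.DataFrame(index=was_imputed.index)
    # Share of the family's features that were imputed for each CPC.
    metrics["pct_imputed_cols"] = was_imputed.mean(axis=1)
    # Assumed definition: sum of the average weights of the imputed features.
    aligned = weights.reindex(was_imputed.columns).fillna(0.0)
    metrics["tot_features_weights"] = (
        was_imputed.astype(float).mul(aligned, axis=1).sum(axis=1)
    )
    # Flag CPCs where either metric exceeds its threshold.
    metrics["check_failed"] = (
        (metrics["tot_features_weights"] > weight_threshold)
        | (metrics["pct_imputed_cols"] > pct_threshold)
    )
    return metrics
```

The same row-wise mean, restricted to the hard-boundary columns, would give the "pct_imputed_cols" of Imputed_HBs_by_cpc, and a column-wise mean grouped by family would give the per-feature percentage of Imputation_pct_by_feature.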
Collapse categorical:
In the Python encoding recipe we also collapse categorical features using the "collapse_categorical()" function. It takes a threshold and collapses the modalities that occur fewer times than this threshold (the default value is 5).
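
As an illustration of the collapsing step, the sketch below groups rare modalities of a pandas Series into a single label. It is not the project's "collapse_categorical()" function: the name `collapse_rare_categories`, the replacement label `"other"` and the choice to leave missing values untouched are assumptions.

```python
import pandas as pd


def collapse_rare_categories(series, threshold=5, other_label="other"):
    """Replace modalities seen fewer than `threshold` times with `other_label`."""
    counts = series.value_counts()  # missing values are not counted
    rare = counts[counts < threshold].index
    # Keep frequent modalities (and missing values) as they are.
    return series.where(~series.isin(rare), other_label)
```

For example, in a column where "a" appears six times, "b" twice and "c" once, the default threshold of 5 keeps "a" and maps both "b" and "c" to "other".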
