Imputation:

Before the data encoding stage, we impute the missing values in the features. The imputation rules are stored as a dictionary in the Dataiku variables, and this dictionary is read by the "simple_imput()" function in the Python encoding recipe.

The table below summarizes the imputation method applied to each feature with missing values:

| GBU | Feature | Imputation method | Details |
| --- | --- | --- | --- |
| CS | product_group | most frequent | applied for all families |
| CS | end_use | most frequent | applied for all families |
| CS | gbu_customer_seg | most frequent | applied for all families |
| CS | manual_region | most frequent | applied for all families |
| CS | lip2 | most frequent | applied only for Alkoxylates |
| CS | historical_unit_price_coalesce_ratio_on_12 | constant | value equal to 1 for all families |
| CS | historical_sales_coalesce_ratio_on_12 | constant | value equal to 1 for all families |
| CS | historical_unit_price_ratio_3_on_12_month | constant | value equal to 1 for all families |
| CS | n_competitors | constant | value equal to 1; applied for Sulfosuccinate_Sulfosuccinamate & Sulfosuccinates_Healthcare |
| CS | COMPONENT_ratio | mean | applied for all families |
| CS | IMPURITY_ratio | mean | applied for all families |
| CS | SOLVENT_ratio | mean | applied for all families |
| CS | n_components | mean | applied for all families |
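
The dictionary-driven imputation summarized above can be illustrated with a minimal pandas sketch. This is not the actual "simple_imput()" code: the dictionary layout (`method`, `value`, `families`), the `family` column name, and the function `impute_from_rules` are all assumptions made for illustration. It also assumes that the fill value (mean or most frequent) is computed over the rows the rule applies to.

```python
import pandas as pd

# Hypothetical rule dictionary. The real one lives in the Dataiku variables
# and its exact schema is not shown in this document.
IMPUTATION_RULES = {
    "product_group": {"method": "most_frequent"},
    "historical_sales_coalesce_ratio_on_12": {"method": "constant", "value": 1},
    "n_components": {"method": "mean"},
    "lip2": {"method": "most_frequent", "families": ["Alkoxylates"]},
}


def impute_from_rules(df, rules, family_col="family"):
    """Return a copy of df with missing values filled according to rules."""
    out = df.copy()
    for feature, rule in rules.items():
        # Restrict the rule to the listed families, or apply it to every row.
        families = rule.get("families")
        if families:
            in_scope = out[family_col].isin(families)
        else:
            in_scope = pd.Series(True, index=out.index)
        target = in_scope & out[feature].isna()
        if not target.any():
            continue
        observed = out.loc[in_scope, feature].dropna()
        if rule["method"] == "constant":
            fill = rule["value"]
        elif observed.empty:
            continue  # no observed value to derive a fill value from
        elif rule["method"] == "mean":
            fill = observed.mean()
        elif rule["method"] == "most_frequent":
            fill = observed.mode().iloc[0]
        else:
            raise ValueError(f"Unknown imputation method: {rule['method']}")
        out.loc[target, feature] = fill
    return out


# Small made-up sample, only to exercise the function.
df = pd.DataFrame({
    "family": ["Alkoxylates", "Alkoxylates", "Other", "Other"],
    "product_group": ["A", "A", None, "B"],
    "historical_sales_coalesce_ratio_on_12": [0.5, None, 2.0, None],
    "n_components": [2.0, None, 4.0, 6.0],
    "lip2": ["x", None, None, "y"],
})
result = impute_from_rules(df, IMPUTATION_RULES)
```

In this sketch the "lip2" rule only touches rows whose family is Alkoxylates, so a missing "lip2" value in another family stays missing, which mirrors the "applied only for" entries in the table.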


To monitor the percentage of imputations per feature and per CPC, we added a specific zone in the flow named "Imputation_analysis":

In this zone we created three separate datasets, as follows:

          ==> The checks on this dataset compare these metrics against fixed thresholds: 0.4 for the first metric and 0.5 for the second (both values can be changed in the check parameters).

          ==> The check raises a warning or an error (depending on the message type defined in the Dataiku variables) if "pct_imputed_cols" exceeds a threshold (set to 0.5 in the check parameters).

          ==> The check verifies whether the imputation percentage for a given feature exceeds the defined threshold (set to 0.5 in the check parameters).
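
As an illustration of the per-feature check, the sketch below computes the share of imputed values per feature on the data before imputation and compares it with a threshold. It is written in plain pandas and does not use the Dataiku metrics and checks API; the function names `imputation_share` and `check_imputation`, and the status strings, are assumptions.

```python
import pandas as pd


def imputation_share(raw_df, features):
    """Share of missing (and therefore imputed) values per feature, in [0, 1]."""
    return raw_df[features].isna().mean()


def check_imputation(shares, threshold=0.5, severity="WARNING"):
    """Map each feature to 'OK', or to severity if its share exceeds threshold."""
    return {
        feature: (severity if share > threshold else "OK")
        for feature, share in shares.items()
    }


# Small made-up sample: 3 of 4 "end_use" values and 1 of 4 "n_components"
# values are missing before imputation.
raw = pd.DataFrame({
    "end_use": ["a", None, None, None],
    "n_components": [1.0, 2.0, None, 4.0],
})
shares = imputation_share(raw, ["end_use", "n_components"])
statuses = check_imputation(shares, threshold=0.5)
```

The `severity` argument stands in for the message type (warning or error) that the checks read from the Dataiku variables.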

Collapse categorical:

In the Python encoding recipe, we also collapse categorical features with the "collapse_categorical()" function. It takes a threshold and collapses the modalities that occur fewer times than this threshold (the default value is 5).
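
A minimal sketch of this kind of collapsing is shown below. It is not the actual "collapse_categorical()" implementation: the function name `collapse_rare_modalities` and the replacement label "OTHER" are assumptions, since the document does not state what rare modalities are collapsed into.

```python
import pandas as pd


def collapse_rare_modalities(series, threshold=5, other_label="OTHER"):
    """Replace modalities seen fewer than threshold times with other_label."""
    counts = series.value_counts()
    rare = counts[counts < threshold].index
    # Keep values that are not rare; missing values are left untouched.
    return series.where(~series.isin(rare), other_label)


# Made-up sample: "A" occurs 6 times, "B" twice, "C" once.
s = pd.Series(["A"] * 6 + ["B"] * 2 + ["C"])
collapsed = collapse_rare_modalities(s, threshold=5)
```

With a threshold of 5, "B" and "C" are merged into a single modality while "A" is kept, which reduces the number of columns or levels produced by the later encoding step.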