Imputation:

Before the data encoding stage, missing feature values are imputed. The imputation rules are stored as a dictionary in the Dataiku variables and applied by the "simple_imput()" function in the Python encoding recipe.

The imputation method used for each feature with missing values is summarized in the following table:

| GBU | Feature | Imputation method | Details |
|-----|---------|-------------------|---------|
| CS | product_group | most frequent | applied for all families |
| CS | end_use | most frequent | applied for all families |
| CS | gbu_customer_seg | most frequent | applied for all families |
| CS | manual_region | most frequent | applied for all families |
| CS | lip2 | most frequent | applied only for Alkoxylates |
| CS | historical_unit_price_coalesce_ratio_on_12 | constant | value equal to 1 for all families |
| CS | historical_sales_coalesce_ratio_on_12 | constant | value equal to 1 for all families |
| CS | historical_unit_price_ratio_3_on_12_month | constant | value equal to 1 for all families |
| CS | n_competitors | constant | applied for Sulfosuccinate_Sulfosuccinamate & Sulfosuccinates_Healthcare; value equal to 1 |
| CS | COMPONENT_ratio | mean | applied for all families |
| CS | IMPURITY_ratio | mean | applied for all families |
| CS | SOLVENT_ratio | mean | applied for all families |
| CS | n_components | mean | applied for all families |
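
The three methods used here (most frequent, constant, mean) can be illustrated with a minimal sketch. The layout of IMPUTATION_RULES, the helper names and the sample rows are assumptions made for illustration; the actual structure of the Dataiku variable and the body of "simple_imput()" are not shown on this page. The sketch uses only the standard library and ignores the family-specific rules (for example lip2 being imputed only for Alkoxylates).

```python
from collections import Counter
from statistics import mean

# Hypothetical layout of the imputation dictionary (an assumption,
# not the real content of the Dataiku variable).
IMPUTATION_RULES = {
    "end_use": {"method": "most_frequent"},
    "n_competitors": {"method": "constant", "value": 1},
    "n_components": {"method": "mean"},
}


def fill_value(values, rule):
    """Return the replacement for missing entries according to one rule."""
    if rule["method"] == "constant":
        return rule["value"]
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot derive a fill value from an all-missing column")
    if rule["method"] == "most_frequent":
        return Counter(observed).most_common(1)[0][0]
    if rule["method"] == "mean":
        return mean(observed)
    raise ValueError(f"unknown imputation method: {rule['method']}")


def impute_rows(rows, rules):
    """Return a copy of rows (a list of dicts) with None replaced per rule."""
    fills = {
        col: fill_value([row.get(col) for row in rows], rule)
        for col, rule in rules.items()
    }
    return [
        {col: (fills[col] if col in fills and val is None else val)
         for col, val in row.items()}
        for row in rows
    ]


# Made-up sample data, for illustration only.
rows = [
    {"end_use": "coatings", "n_competitors": None, "n_components": 2},
    {"end_use": None, "n_competitors": 3, "n_components": None},
    {"end_use": "coatings", "n_competitors": None, "n_components": 4},
]
imputed = impute_rows(rows, IMPUTATION_RULES)
```

In the recipe itself the same logic would typically be applied column by column to the input dataset; keeping the rules in a dictionary means a method can be changed per feature without touching the code.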


To monitor the percentage of imputations per feature and per CPC, we added a dedicated zone to the flow, named "Imputation_analysis".

In this zone we created three separate datasets, as follows:

  • Imputed_features_by_cpc: Contains, for each CPC, the imputed values (using the methods defined above) of the features included in the corresponding family. Two metrics are monitored in this dataset. The first, "tot_features_weights", measures the impact of the imputed features for each CPC, based on the average weights of these features (taken from the first model). The second, "pct_imputed_cols", is the percentage of imputed features relative to the features included in the corresponding CPC family.

          ==> The checks on this dataset compare each metric with a fixed threshold: 0.4 for the first metric and 0.5 for the second (these values can be modified in the check parameters).

  • Imputed_HBs_by_cpc: Contains imputation statistics only for the features included as hard boundaries (number and percentage of imputed hard boundaries for each family by CPC).

          ==> The check raises a warning or an error (depending on the message type defined in the Dataiku variables) if "pct_imputed_cols" is greater than a threshold (set to 0.5 in the check parameters).

          ==> The check verifies whether the imputation percentage for a given feature is greater than the defined threshold (set to 0.5 in the check parameters).
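
As an illustration of the two metrics and their threshold checks, here is a minimal sketch for a single CPC. The function names, the input layout and the sample weights are hypothetical, and treating "tot_features_weights" as the sum of the average weights of the imputed features is an assumption; the page does not give the exact formula.

```python
def imputation_metrics(imputed_flags, feature_weights):
    """Compute the two monitoring metrics for one CPC.

    imputed_flags: feature name -> True if the value was imputed for this CPC
    feature_weights: feature name -> average weight from the first model
    """
    n_features = len(imputed_flags)
    imputed = [name for name, flag in imputed_flags.items() if flag]
    return {
        # Assumption: impact = sum of the weights of the imputed features.
        "tot_features_weights": sum(
            feature_weights.get(name, 0.0) for name in imputed
        ),
        # Share of imputed features among the features of the family.
        "pct_imputed_cols": len(imputed) / n_features if n_features else 0.0,
    }


def run_checks(metrics, weight_threshold=0.4, pct_threshold=0.5):
    """Return the names of the metrics that exceed their threshold."""
    failed = []
    if metrics["tot_features_weights"] > weight_threshold:
        failed.append("tot_features_weights")
    if metrics["pct_imputed_cols"] > pct_threshold:
        failed.append("pct_imputed_cols")
    return failed


# Made-up flags and weights, for illustration only.
flags = {"end_use": True, "manual_region": False,
         "n_components": True, "lip2": False}
weights = {"end_use": 0.25, "manual_region": 0.25,
           "n_components": 0.25, "lip2": 0.25}

metrics = imputation_metrics(flags, weights)
failed = run_checks(metrics)
```

With these sample values, half of the features are imputed and they carry half of the total weight, so only the weight check (threshold 0.4) is exceeded; the percentage check is not, because 0.5 is not strictly greater than 0.5.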

Collapse categorical:

In the Python encoding recipe, we also collapse categorical features using the "collapse_categorical()" function. It takes a threshold and collapses the modalities that occur fewer times than this threshold (the default value is 5).
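
A minimal sketch of this kind of collapsing, assuming that rare modalities are merged into a single catch-all label. The function name collapse_rare, the "other" label and the sample values are hypothetical; the actual behaviour of "collapse_categorical()" is not detailed on this page.

```python
from collections import Counter


def collapse_rare(values, threshold=5, other_label="other"):
    """Replace modalities seen fewer than `threshold` times by `other_label`."""
    counts = Counter(values)
    return [v if counts[v] >= threshold else other_label for v in values]


# Made-up sample column, for illustration only.
regions = ["EU"] * 6 + ["NA"] * 5 + ["APAC"] * 2 + ["LATAM"]
collapsed = collapse_rare(regions)
```

Here "EU" and "NA" are kept because they reach the threshold of 5, while "APAC" and "LATAM" are merged into the catch-all label, which reduces the number of columns produced by the subsequent encoding.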
