Imputation:

Before encoding the data, we impute the missing values in the features. The imputation method for each feature is defined in a dictionary stored in the Dataiku variables, which is read by the "simple_imput()" function in the Python encoding recipe.

The imputation method for each feature with missing values is summarized in the table below:

GBU	Feature	Imputation method	Details
CS	product_group	most frequent	applied for all families
CS	end_use	most frequent	applied for all families
CS	gbu_customer_seg	most frequent	applied for all families
CS	manual_region	most frequent	applied for all families
CS	lip2	most frequent	applied only for Alkoxylates
CS	historical_unit_price_coalesce_ratio_on_12	constant	value equal to 1 for all families
CS	historical_sales_coalesce_ratio_on_12	constant	value equal to 1 for all families
CS	historical_unit_price_ratio_3_on_12_month	constant	value equal to 1 for all families
CS	n_competitors	constant	value equal to 1, applied only for Sulfosuccinate_Sulfosuccinamate & Sulfosuccinates_Healthcare
CS	COMPONENT_ratio	mean	applied for all families
CS	IMPURITY_ratio	mean	applied for all families
CS	SOLVENT_ratio	mean	applied for all families
CS	n_components	mean	applied for all families
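
As an illustration of the three methods in the table, here is a minimal sketch of a dictionary-driven imputation with pandas. It is not the actual "simple_imput()" implementation: the dictionary layout, the function name `impute_from_rules` and the sample data are assumptions made for the example, and the per-family restrictions listed in the Details column are left out for brevity.

```python
import pandas as pd

# Hypothetical rule dictionary (feature -> imputation rule); the real
# structure of the Dataiku variable is not shown in this page.
IMPUTATION_RULES = {
    "end_use": {"method": "most_frequent"},
    "n_competitors": {"method": "constant", "value": 1},
    "n_components": {"method": "mean"},
}


def impute_from_rules(df: pd.DataFrame, rules: dict) -> pd.DataFrame:
    """Return a copy of df with missing values filled according to rules."""
    out = df.copy()
    for feature, rule in rules.items():
        if feature not in out.columns:
            continue  # feature not present in this dataset
        method = rule["method"]
        if method == "most_frequent":
            modes = out[feature].mode(dropna=True)
            if modes.empty:
                continue  # column is entirely missing, nothing to learn from
            fill_value = modes.iloc[0]
        elif method == "constant":
            fill_value = rule["value"]
        elif method == "mean":
            fill_value = out[feature].mean()
        else:
            raise ValueError(f"Unknown imputation method: {method}")
        out[feature] = out[feature].fillna(fill_value)
    return out


# Made-up sample data, for illustration only.
sample = pd.DataFrame(
    {
        "end_use": ["coatings", None, "coatings", "agro"],
        "n_competitors": [2.0, None, 3.0, None],
        "n_components": [1.0, 3.0, None, 2.0],
    }
)
imputed = impute_from_rules(sample, IMPUTATION_RULES)
print(imputed)
```

Keeping the rules in a dictionary means a method can be changed per feature without touching the recipe code.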

To monitor the percentage of imputations per feature and per CPC, we added a dedicated zone in the flow named "Imputation_analysis".

In this zone we created three separate datasets, as follows:

Imputed_features_by_cpc: Contains, for each CPC, the imputed values (using the methods defined above) of the features included in the corresponding family. Two metrics are checked in this dataset. The first, "tot_features_weights", measures the impact of the imputed features for each CPC based on the average weights of these features (taken from the first model). The second, "pct_imputed_cols", is the percentage of imputed features relative to the features included in the corresponding CPC family.

==> The checks on this dataset compare these metrics against fixed thresholds: 0.4 for "tot_features_weights" and 0.5 for "pct_imputed_cols" (both values can be modified in the check parameters).
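
The two metrics can be sketched as follows. This is only an illustration under stated assumptions: "tot_features_weights" is taken here to be the sum of the average weights of the imputed features, a check is assumed to fail when a metric exceeds its threshold, and the function name, flag layout and weight values are all hypothetical.

```python
import pandas as pd

WEIGHT_THRESHOLD = 0.4  # threshold for tot_features_weights
PCT_THRESHOLD = 0.5     # threshold for pct_imputed_cols


def imputation_metrics(imputed_flags: pd.DataFrame, weights: dict) -> pd.DataFrame:
    """Compute both metrics per CPC.

    imputed_flags: one row per CPC, one boolean column per feature of the
    family (True when the value was imputed).
    weights: average weight per feature (assumed layout).
    """
    w = pd.Series(weights).reindex(imputed_flags.columns).fillna(0.0)
    metrics = pd.DataFrame(index=imputed_flags.index)
    # Assumption: impact = sum of the weights of the imputed features.
    metrics["tot_features_weights"] = (
        imputed_flags.astype(float).mul(w, axis=1).sum(axis=1)
    )
    # Share of the family's features that were imputed for this CPC.
    metrics["pct_imputed_cols"] = imputed_flags.mean(axis=1)
    # Assumption: the check passes while the metric stays at or below the threshold.
    metrics["weights_check_ok"] = metrics["tot_features_weights"] <= WEIGHT_THRESHOLD
    metrics["pct_check_ok"] = metrics["pct_imputed_cols"] <= PCT_THRESHOLD
    return metrics


# Made-up flags and weights, for illustration only.
flags = pd.DataFrame(
    {
        "end_use": [True, True],
        "n_competitors": [False, True],
        "n_components": [False, True],
        "lip2": [False, False],
    },
    index=["cpc_1", "cpc_2"],
)
avg_weights = {"end_use": 0.1, "n_competitors": 0.3, "n_components": 0.2, "lip2": 0.4}
metrics = imputation_metrics(flags, avg_weights)
print(metrics)
```

In this made-up example, cpc_2 has three of its four features imputed, so it exceeds both thresholds, while cpc_1 passes both checks.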

Imputed_HBs_by_cpc: Contains imputation statistics only for the features used as hard boundaries (HBs): the number and percentage of imputed hard boundaries for each family, by CPC.

==> The check raises a warning or an error (depending on the message type defined in the Dataiku variables) if "pct_imputed_cols" is greater than a threshold (set to 0.5 in the check parameters).
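
The logic of this check can be sketched as a small function. The function name, the parameter names and the way the message type is passed in are assumptions for illustration; the actual check is configured in Dataiku.

```python
def check_hb_imputation(
    pct_imputed_cols: float,
    threshold: float = 0.5,
    message_type: str = "WARNING",
) -> str:
    """Return "OK", or the configured message type when the threshold is exceeded.

    message_type stands in for the value read from the Dataiku variables
    (assumed here to be either "WARNING" or "ERROR").
    """
    if message_type not in {"WARNING", "ERROR"}:
        raise ValueError(f"Unsupported message type: {message_type}")
    # Strictly greater than the threshold, as described above.
    return message_type if pct_imputed_cols > threshold else "OK"


print(check_hb_imputation(0.6))                        # above threshold, default type
print(check_hb_imputation(0.6, message_type="ERROR"))  # above threshold, error type
print(check_hb_imputation(0.5))                        # exactly at threshold
```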

Imputation_pct_by_feature: Calculates the percentage of imputed values for each feature (including HBs), by family.

==> The check verifies whether the imputation percentage for a given feature is greater than the defined threshold (set to 0.5 in the check parameters).
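
A minimal sketch of this computation with pandas, assuming the share of imputed values equals the share of missing values before imputation. The column name "family", the function name and the sample data are hypothetical.

```python
import pandas as pd

PCT_THRESHOLD = 0.5


def imputation_pct_by_feature(
    df: pd.DataFrame, family_col: str, features: list
) -> pd.DataFrame:
    """Share of missing (to be imputed) values per feature, by family."""
    return df[features].isna().groupby(df[family_col]).mean()


# Made-up pre-imputation data, for illustration only.
raw = pd.DataFrame(
    {
        "family": ["A", "A", "B", "B"],
        "end_use": [None, "x", "y", "z"],
        "n_components": [None, None, 1.0, None],
    }
)
pct = imputation_pct_by_feature(raw, "family", ["end_use", "n_components"])
# Flag every (family, feature) pair whose imputation share exceeds the threshold.
flagged = pct > PCT_THRESHOLD
print(pct)
print(flagged)
```

Here only n_components in family A is flagged, since all of its values are missing; a share exactly equal to 0.5 does not trigger the check.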

Collapse categorical:

In the Python encoding recipe we also collapse categorical features using the "collapse_categorical()" function. It takes a threshold and collapses the modalities that occur fewer times than this threshold (the default value is 5).
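
The idea can be sketched as follows. This is not the actual "collapse_categorical()" code: the function name `collapse_rare_categories` and the replacement label "OTHER" are assumptions, since this page does not say which value the collapsed modalities receive.

```python
import pandas as pd


def collapse_rare_categories(
    series: pd.Series, threshold: int = 5, other_label: str = "OTHER"
) -> pd.Series:
    """Replace modalities occurring fewer than `threshold` times by other_label."""
    counts = series.value_counts(dropna=True)
    rare = counts[counts < threshold].index
    # Keep frequent modalities (and missing values) untouched.
    return series.where(~series.isin(rare), other_label)


# Made-up data: "a" occurs 6 times, "b" 5 times, "c" twice, "d" once.
values = pd.Series(["a"] * 6 + ["b"] * 5 + ["c"] * 2 + ["d"])
collapsed = collapse_rare_categories(values)
print(collapsed.value_counts())
```

With the default threshold of 5, "a" and "b" are kept, while "c" and "d" are merged into a single modality.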
