The objective of this page is to provide a simple explanation for the business (with few technical details) of all the steps of the modeling process.
In our original data, we have numeric and categorical features (region, product taxonomy features, …).
For the machine learning model, and to compute a similarity distance between CPCs, we need only numeric features.
So we transform the categorical features into numeric ones by applying a "Target Encoding":
![]()
From a categorical feature with no information about order or proximity between modalities, we obtain an ordered numeric variable usable by the machine learning model and for the similarity distance calculation.
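The encoding step above can be sketched as follows: each modality is replaced by the mean of the target observed for that modality. The column names ("region", "price_log") and the values are illustrative, not the real data.

```python
import pandas as pd

# Minimal target-encoding sketch: replace each modality of a categorical
# feature by the mean "price log" observed for that modality.
# Column names and values are illustrative.
df = pd.DataFrame({
    "region": ["EU", "EU", "US", "US", "ASIA"],
    "price_log": [4.2, 4.0, 5.1, 4.9, 3.5],
})
encoding = df.groupby("region")["price_log"].mean()
df["region_encoded"] = df["region"].map(encoding)
print(df[["region", "region_encoded"]])
```

The encoded column is numeric and ordered, so modalities with similar prices end up close to each other, which is exactly what the similarity distance needs.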
One model is created for each family.
The objective is to predict the target (price with a "log" transformation) from all the selected numeric features. To do this, some CPCs are used to train the model, and others to test its performance. An optimization is performed to find the best model parameters for each family.
We use the R² metric to measure model performance; it generally lies between 0 (bad) and 1 (perfect). In general, a value between 0.4 and 0.9 is considered good.
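A minimal sketch of the train/test logic and of the R² metric, assuming synthetic data and a simple least-squares model as a stand-in for the real per-family model:

```python
import numpy as np

# Illustrative data: encoded numeric features and a "price log" target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([0.5, -0.2, 0.1]) + rng.normal(scale=0.1, size=200)

# Some CPCs train the model, the others test its performance.
X_train, X_test = X[:140], X[140:]
y_train, y_test = y[:140], y[140:]

# Fit by least squares (stand-in for the real optimized model).
coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
y_pred = X_test @ coef

# R² = 1 - residual variance / total variance, computed on the test set.
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
print(round(r2, 3))
```

Because R² is computed on CPCs the model never saw during training, it reflects how the model would behave on the new data arriving each month.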
The objective is not to have a perfect model: in that case we would probably fit our current data too well, and the model would not generalize to the new data arriving each month.
But if R² is too low, the model does not capture the price drivers well enough for its outputs to be trusted.
The modeling is only a first step. Our objective is not to predict the price as accurately as possible, but to obtain coherent feature importances and volume curves that can be used to compute the similarity between CPCs.
The next section describes the model outputs that should be reviewed to validate the modeling step.
It measures the prediction performance of the model. The objective is to compare it with the previous campaign and check whether it is stable or has decreased.
If there is a significant decrease, the models have to be retrained with a grid search to find optimal parameters. If there is no decrease, this should still be done once a year.
![]()
Example for Amodel
Example for Halar

To find comparables, we compute a similarity distance between all CPCs, pair by pair. We use a "cosine" distance, which is very common in data science.
More precisely, we compute a "weighted cosine distance" using the feature weights defined by the model.
For each target, all comparables are ranked by similarity, and we exclude those that do not respect the hard boundaries.
Finally, we keep the 10 closest comparables.
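The weighted cosine distance can be sketched as below. The feature weights and vectors are illustrative; in the real pipeline the weights come from the model's feature importances.

```python
import numpy as np

def weighted_cosine_distance(a, b, w):
    """1 - weighted cosine similarity between feature vectors a and b."""
    num = np.sum(w * a * b)
    den = np.sqrt(np.sum(w * a * a)) * np.sqrt(np.sum(w * b * b))
    return 1.0 - num / den

w = np.array([0.5, 0.3, 0.2])        # feature weights from the model (illustrative)
target = np.array([1.0, 2.0, 3.0])   # encoded features of the target CPC
candidate = np.array([1.1, 1.9, 3.2])

d = weighted_cosine_distance(target, candidate, w)
print(round(d, 4))
```

A distance of 0 means the two CPCs are identical on the weighted features; candidates are then sorted by this distance and the 10 closest ones (after the hard-boundary filter) are kept.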
Once the 10 comparables are selected, they can have very different volumes, as volume is not part of the selection criteria.
The objective is therefore to adjust the comparables' prices to answer the question: "What would their price be if their sales volume were identical to the volume of the target?"
First, four steps are used to understand the price variation, in percentage, around the mean price of the family.
Example for Amodel
![]()


Then, based on this curve, three steps compute the adjustment that has to be applied to each comparable.
(To do: insert the dashboard graph for this explanation once the data are up to date.)
This operation is done for each comparable CPC, not for the target.
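A hedged sketch of the idea (the exact three steps are not detailed here, and the curve points below are invented): given the curve of price variation (%) around the family mean as a function of volume, a comparable's price is moved from the variation expected at its own volume to the variation expected at the target's volume.

```python
import numpy as np

# Illustrative volume curve: price variation (%) around the family mean.
volumes = np.array([10, 100, 1000, 10000])
variation_pct = np.array([0.20, 0.05, -0.05, -0.15])

def price_variation(volume):
    # Interpolate the curve on a log-volume scale.
    return np.interp(np.log10(volume), np.log10(volumes), variation_pct)

comp_price, comp_volume, target_volume = 50.0, 100, 5000

# Rescale the comparable's price to the variation expected at the target's volume.
adjusted = comp_price * (1 + price_variation(target_volume)) / (1 + price_variation(comp_volume))
print(round(adjusted, 2))
```

Here the comparable sells much less than the target, so its price is adjusted downward toward what it would be at the target's (higher) volume.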
The group volume adjustment is not applied to all families, but only to 13 of the 16. This list can evolve according to the analysis results, as explained below.
As seen in section "3.2 Features importance: Shap value", for the Amodel example, the group sales volume has an impact.
![]()
This variable is encoded in 5 modalities.
First, we use the Shap values to compute the median per modality, and then we draw a boxplot.
Example for Amodel:
![]()
These values are the modifications that will be applied to our target "price log", if we validate the group volume adjustment for the Amodel family.
![]()
Example for Kalix:
![]()
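The computation described above can be sketched as follows. The modality names and Shap values are illustrative, and the sign convention of the applied correction is an assumption.

```python
import pandas as pd

# Per-CPC Shap values of the group-volume variable (illustrative data).
shap_df = pd.DataFrame({
    "modality": ["very_low", "low", "medium", "low", "very_low", "medium"],
    "shap": [0.12, 0.05, -0.03, 0.06, 0.10, -0.04],
})

# Median Shap value per modality: these medians are the corrections
# applied to the target "price log" for the family.
median_per_modality = shap_df.groupby("modality")["shap"].median()
print(median_per_modality)

# Applying the correction to one CPC (sign convention is an assumption:
# here we subtract the median to neutralize the group-volume effect).
price_log = 4.0
adjusted_price_log = price_log - median_per_modality["low"]
print(adjusted_price_log)
```

The boxplot mentioned above is simply the distribution of these Shap values per modality; the medians summarize it into one correction per modality.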
Finally, to create the price recommendation, we take the median price of the 10 adjusted comparables.
The maximum increase is capped at 30%.
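The final step can be sketched as below. The prices are illustrative, and `current_price` (the price the 30% increase cap is measured against) is an assumption for illustration.

```python
import statistics

# Adjusted prices of the 10 comparables (illustrative values).
adjusted_prices = [42.0, 45.5, 48.0, 50.0, 51.0, 52.5, 53.0, 55.0, 58.0, 60.0]
current_price = 38.0  # assumed reference price for the cap

# Recommendation = median of the 10 adjusted comparables...
recommendation = statistics.median(adjusted_prices)
# ...with the increase vs the current price capped at 30%.
recommendation = min(recommendation, current_price * 1.30)
print(recommendation)
```

In this example the raw median (51.75) would exceed a 30% increase, so the cap brings the recommendation down to 1.30 × the current price.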