The objective of this page is to provide simple explanations so that the business can understand all the steps of the modeling, using as few technical terms as possible.
Vocabulary : target CPC vs. target of the model
To understand this section, it is important to clarify two different usages of the word "target". Since the machine learning part of the project includes two successive models, the word "target" means something different for each of them.
___________________
For our models to work and compute the distance, the input must contain numeric values only. Since our original data also contains categorical features (e.g. region and product taxonomy), we have to transform them into numerical data.
This is a common step in machine learning called "encoding". In our case we use a specific encoder named the "target encoder".
Here is how it works :
Example on the "Region" feature for Amodel family :
![]()
In the example above :
From a categorical feature with no notion of order or proximity between modalities, we obtain an ordered numeric variable usable by the machine learning model and by the similarity distance calculation.
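The encoding step above can be sketched in a few lines. This is a minimal illustration with hypothetical region names and log prices, not the project's actual code: each modality is replaced by the mean of the target observed for that modality.

```python
import pandas as pd

# Hypothetical data: "region" is categorical, "price_log" is the numeric target.
df = pd.DataFrame({
    "region": ["North", "South", "North", "East", "South", "East"],
    "price_log": [4.2, 3.1, 4.6, 3.8, 2.9, 4.0],
})

# Target encoding: replace each modality by the mean of the target
# observed for that modality, turning the category into an ordered number.
encoding = df.groupby("region")["price_log"].mean()
df["region_encoded"] = df["region"].map(encoding)
print(encoding.to_dict())  # e.g. North -> 4.4, South -> 3.0, East -> 3.9
```

In production, the encoding is learned on training data only, to avoid leaking the target into the features.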
Note: one model is built for each product family; every family is totally independent.
The model used here is a regression model (LightGBM specifically).
The usual objective of a regression model is to predict the value of the numeric target (price with a "log" transformation) using the input numeric features (the result of the encoding step explained above).
To do this, some CPCs are used to train the model, and others to test its performance. An optimization (also called fine-tuning) is done to find the best model parameters for each family.
The R² metric is used to measure model performance; the result generally lies between 0 (bad) and 1 (perfect). We consider the model good enough between 0.4 and 0.9.
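The train/test split and the R² measurement can be sketched as below. This is a toy illustration on synthetic data; scikit-learn's gradient boosting is used as a stand-in for LightGBM so the example stays self-contained.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Synthetic encoded features and a synthetic log-price target (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = X @ np.array([0.5, -0.3, 0.2, 0.1]) + rng.normal(scale=0.3, size=500)

# Some CPCs train the model, others measure its performance.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)

# R² on the held-out CPCs: in the pipeline, 0.4-0.9 is considered good enough.
r2 = r2_score(y_test, model.predict(X_test))
print(round(r2, 2))
```

The key point is that R² is always computed on CPCs the model has never seen, which is what makes it a measure of generalization rather than memorization.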
Note : the objective is not to have a perfect model, because fitting too well the training data often leads to a bad generalization.
In simpler words, the model learns patterns that are too specific and detailed: while this improves performance on the training set, the predictions will fail with any slight change in the new data provided at each campaign.
If R² is too low, this could mean two things:
In this use-case, we do not use the output of the model to predict prices directly.
What we are interested in is understanding the importance of each of our pricing levers (features) in predicting the price.
This is done by computing SHAP values when running the model, which is also why an average R² score is acceptable.
This feature importance will then be used by the second model as coefficients to find neighbors for each CPC.
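The idea of turning model explanations into lever weights can be sketched as follows. The actual pipeline computes mean absolute SHAP values; this toy example uses the model's built-in gain-based importances as a simpler stand-in, normalized so the weights sum to 1 for use in the distance.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic encoded features; feature 0 drives the price the most by construction.
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 3))
y = 1.0 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(scale=0.1, size=400)

model = GradientBoostingRegressor(random_state=0).fit(X, y)

# Stand-in for mean |SHAP| per feature: gain-based importances,
# normalized so the lever weights sum to 1.
weights = model.feature_importances_ / model.feature_importances_.sum()
print(weights.round(2))
```

Whatever the attribution method, the output is the same shape: one non-negative weight per pricing lever, consumed by the similarity distance in the next step.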
Next section describes the model outputs that should be reviewed to validate the modelization step.
Measures the prediction performance of the model. We can compare it with the previous campaign to check that it is stable.
Any big decrease could indicate a data issue. If the data seems clean and a decrease still appears, the models have to be retrained (with a grid search) to find optimal parameters.
If there is no decrease, retraining and optimizing parameters should still be done once a year.
![]()

SHAP values, sorted by importance
SHAP values are then used to define the weight (or importance) of every pricing lever.
![]()
Final feature importance (weight).

Let's observe the "product_coating_type" feature :
We can see two distinct clusters.
The impact difference between the groups is very high; they must therefore be considered not comparable with each other.
This leads to the creation of what we call a "hard boundary".
When trying to find comparables in the next steps, we will filter out from the comparable set any CPC that has a different value than the target CPC for product_coating_type.
To find comparables, we compute a similarity distance between all pairs of CPCs. This computation is based on the numeric values of the CPC features, to which we apply the feature weights from the first model as factors.
For each target CPC, all comparables are ranked according to the similarity distance, and we exclude comparables outside of the hard boundaries.
In the end, we keep the 10 closest comparables (the ones with the lowest similarity distances) for final price calculation and display in the WebApp.
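The comparable search described above can be sketched as a weighted distance plus a hard-boundary filter. All names and values here are hypothetical; the exact distance formula used in the pipeline is assumed to be a weighted Euclidean distance for this illustration.

```python
import numpy as np

# Hypothetical encoded features for 50 CPCs (rows) and lever weights from model 1.
rng = np.random.default_rng(2)
features = rng.normal(size=(50, 4))
weights = np.array([0.4, 0.3, 0.2, 0.1])
coating = rng.integers(0, 2, size=50)  # hard-boundary feature (e.g. coating type)

target = 0  # index of the target CPC

# Weighted Euclidean distance between the target and every other CPC.
diff = features - features[target]
dist = np.sqrt((weights * diff**2).sum(axis=1))

# Hard boundary: drop CPCs with a different coating type, and the target itself.
mask = (coating == coating[target]) & (np.arange(50) != target)
candidates = np.where(mask)[0]

# Keep the 10 closest comparables (lowest similarity distances).
top10 = candidates[np.argsort(dist[candidates])][:10]
print(top10)
```

The hard boundary acts as a filter before ranking, so a CPC with a different coating type can never appear among the 10 comparables, however close it is on the other levers.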
Since volume is excluded from the comparable selection, there can be big gaps between a target and its comparables. That is why we need a specific volume adjustment step.
The objective is to adjust the prices of comparables to answer the following question: "what would the comparable CPC price be if it was sold at the same volume as the target CPC?"
Example for Amodel :
First step: computing the adjustment function for each family
![]()


Then, every comparable is adjusted using the fitted function.
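These two steps can be sketched as below. The functional form of the adjustment is an assumption for illustration (a linear fit of log price against log volume); the pipeline's actual fitted function may differ.

```python
import numpy as np

# Hypothetical family data: prices tend to decrease with volume (log-log relation).
rng = np.random.default_rng(3)
volume = rng.uniform(10, 1000, size=200)
price_log = 5.0 - 0.15 * np.log(volume) + rng.normal(scale=0.05, size=200)

# Step 1: fit the adjustment function on the whole family
# (assumed log-linear form, one fit per family).
slope, intercept = np.polyfit(np.log(volume), price_log, deg=1)

# Step 2: adjust a comparable's log price to the target's volume.
def adjust(price_log_comp, vol_comp, vol_target):
    """What would the comparable's log price be at the target's volume?"""
    return price_log_comp + slope * (np.log(vol_target) - np.log(vol_comp))

print(round(slope, 2))  # negative: higher volume, lower unit price
```

Note that adjusting a comparable to its own volume leaves its price unchanged, which is a useful sanity check on any fitted adjustment function.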
These steps are simulated in the "Volume adjustment simulations" tab of the dashboard for a better understanding based on real values of the current run.
To use the tab, select a family and a CPC that will be used as a comparable. Then, enter the volume you want to adjust to; this simulates the volume of the target CPC.
By changing the volume, you can see how the volume adjusted price is affected.
Amodel CPC example Z58-19179_557270 :

This volume adjustment is applied to every comparable of every target CPC; no adjustment is applied to the target itself!
Group volume adjustment is not applied to all families, but only to 13 of the 16. This can evolve according to the analysis results, as explained below.
As we have seen in section "3.2 Features importance: SHAP value", for the Amodel example, group sales volume has an impact.
![]()
This variable is coded in 5 modalities :
First, we use the SHAP values to compute the median per modality, then we draw a boxplot.
Example for Amodel:
![]()
These values are the modifications that will be applied to our target "price log" if we validate that a group volume adjustment should be applied for the Amodel family, but they are not interpretable without transformation.
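The transformation is straightforward: since the SHAP values are offsets on the log price, exponentiating them turns each offset into a multiplicative effect on the price itself. The modality names and median values below are hypothetical, for illustration only.

```python
import numpy as np

# Hypothetical median SHAP values (log-price offsets) per group-volume modality.
median_shap = {"very_low": 0.08, "low": 0.03, "medium": 0.0,
               "high": -0.04, "very_high": -0.09}

# An offset d on log(price) corresponds to a multiplicative effect exp(d)
# on the price itself, i.e. a (exp(d) - 1) * 100 % change.
pct_effect = {k: round((np.exp(v) - 1) * 100, 1) for k, v in median_shap.items()}
print(pct_effect)  # e.g. very_low -> +8.3 %, very_high -> -8.6 %
```

After this transformation, the values read directly as percentage price effects, which is what the business reviews to decide whether the adjustment should be applied for a family.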
![]()
Example for Kalix:
![]()
Then, for each comparable, we check whether it belongs to the same group volume modality as the target:
Finally, to create the price recommendation, we take the median price of the 10 comparables.
The maximum increase is capped at 30%.
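The final recommendation rule can be sketched in a few lines. The prices below are hypothetical; the point is the two-step logic: median of the comparables, then a cap at +30% of the current price.

```python
import numpy as np

# Hypothetical volume-adjusted prices of the 10 retained comparables.
comparable_prices = np.array([10.2, 11.0, 9.8, 10.5, 10.9,
                              15.0, 10.1, 10.4, 10.7, 10.3])
current_price = 7.5  # hypothetical current price of the target CPC

# Step 1: the recommendation is the median price of the 10 comparables,
# which is robust to an outlier comparable (here, 15.0).
recommendation = np.median(comparable_prices)

# Step 2: the increase versus the current price is capped at +30%.
recommendation = min(recommendation, current_price * 1.30)
print(recommendation)
```

Using the median rather than the mean means a single extreme comparable cannot drag the recommendation up or down, while the 30% cap keeps recommended increases commercially acceptable.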