
The objective of this page is to provide a simple explanation for the business (with few technical items) of all the steps of the modeling.


1. Data Encoding

In our original data, we have numeric and categorical features (region, product taxonomy features, …).

For a machine learning model, or to compute a similarity distance between CPCs, we need to have only numeric features.

So we transform categorical features into numeric ones by applying a "Target Encoding":

  • we replace each modality of a variable with a numeric value, which is the average of the Target (price with "log" transformation) for that modality.
  • An example on the region for the Amodel family:

    • This gives us several pieces of information:
    • The average price in Americas is higher than in the other regions.
    • EMEA and Other APAC will be considered close in the similarity distance computation, while Americas is farthest from EMEA.

From a categorical feature with no information about order or proximity between modalities, we obtain an ordered numeric variable usable for the machine learning model and the similarity distance calculation.
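As an illustration, target encoding can be sketched as follows; the CPC rows, regions, and log prices below are toy values, not the real Amodel data:

```python
from statistics import mean

# Toy dataset: hypothetical CPCs with a region and a log-price target.
cpcs = [
    {"region": "Americas",   "log_price": 2.30},
    {"region": "Americas",   "log_price": 2.10},
    {"region": "EMEA",       "log_price": 1.70},
    {"region": "EMEA",       "log_price": 1.50},
    {"region": "Other APAC", "log_price": 1.65},
]

def target_encode(rows, feature, target):
    """Replace each modality by the average target value of that modality."""
    groups = {}
    for row in rows:
        groups.setdefault(row[feature], []).append(row[target])
    return {modality: mean(values) for modality, values in groups.items()}

encoding = target_encode(cpcs, "region", "log_price")
# Americas -> 2.20, EMEA -> 1.60, Other APAC -> 1.65: Americas is clearly
# higher, while EMEA and Other APAC end up close, so they will look
# similar in the distance computation.
print(encoding)
```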



2. Modeling of the Target (price with log transformation)

One model is created for each family.

The objective is to predict the target (price with "log" transformation) from all the selected numeric features. To do this, some CPCs are used to train the model, and others to test its performance. An optimization is done to find the best parameters of the model for each family.

We use the R² metric to measure model performance; it generally lies between 0 (bad) and 1 (perfect). In general, we are good if we are between 0.4 and 0.9.

The objective is not to have a perfect model: in that case we would probably fit our current data too well, and the model would not generalize well to the new data that arrive each month.

But if R² is too low, it means that:

  • some levers that could explain the price dispersion between CPCs may not be available;
  • or the dispersion cannot be explained (due to human behavior, price negotiation with different customers, …):
    • for example, if we have 2 CPCs with exactly the same values for all features, but with different prices.


The modeling is only a first step. Our objective is not to predict the price as well as possible, but to obtain coherent feature importances and volume curves that can be used to compute similarity between CPCs.

The next section describes the model outputs that should be reviewed to validate the modeling step.
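The train/test evaluation idea can be sketched as follows; a trivial one-feature linear fit on toy numbers stands in for the real model, and R² is computed on the held-out CPCs:

```python
# Minimal sketch: train on some points, evaluate R² on held-out points.
def fit_line(xs, ys):
    """Least-squares line through (xs, ys); returns a predict function."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return lambda x: my + slope * (x - mx)

def r2_score(y_true, y_pred):
    """R² = 1 - (residual sum of squares) / (total sum of squares)."""
    my = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - my) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Some CPCs train the model, others test its performance (toy numbers).
train_x, train_y = [1, 2, 3, 4], [1.0, 1.9, 3.1, 4.0]
test_x,  test_y  = [5, 6], [5.2, 5.9]
model = fit_line(train_x, train_y)
r2 = r2_score(test_y, [model(x) for x in test_x])
print(round(r2, 2))  # lands in the "good" 0.4–0.9 range
```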


3. Modeling outputs

3.1 R²

It measures the prediction performance of the model. The objective is to compare it with the previous campaign, to see whether it is stable or has decreased.

If there is a significant decrease, the models have to be retrained with a grid search to find optimal parameters. If there is no decrease, this should still be done once a year.
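The periodic grid search can be sketched as an exhaustive loop over parameter combinations, keeping the one with the best held-out R²; the parameter names and the toy scoring function below are illustrative, not the real training code:

```python
from itertools import product

def grid_search(param_grid, evaluate):
    """Try every parameter combination, keep the best-scoring one."""
    best_params, best_score = None, float("-inf")
    for combo in product(*param_grid.values()):
        params = dict(zip(param_grid.keys(), combo))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical grid; the scoring lambda stands in for
# "retrain the model with these parameters and return the test R²".
grid = {"max_depth": [3, 5, 7], "learning_rate": [0.05, 0.1]}
score = lambda p: (0.8
                   - 0.01 * abs(p["max_depth"] - 5)
                   - abs(p["learning_rate"] - 0.1))
best, best_r2 = grid_search(grid, score)
print(best, best_r2)
```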

3.2 Feature importance: SHAP values

Example for Amodel

  • On the left, we have the SHAP values, sorted by importance
    • each point is a CPC
    • red means the CPC has a high value on the feature, blue a low value.
    • The SHAP value is the impact of a feature on the model, explaining a deviation from the global price average of the family.
      • it isolates the impact of a variable.
      • as we model the "price log", and not directly the "price", it cannot be interpreted as a "% price variation" on this graph; we need to apply a transformation to interpret it as a "% price variation" versus the average price of the family
      • for example, on the region:
        • when a CPC has a low value for the region (APAC and EMEA) (categorical variables are converted into numeric features during Data Encoding), the price impact is 0.05 lower on the graph (-11% after transformation) than the family average.
        • when a CPC has a high value (Americas), the price is around 0.05 higher (+12% after transformation).
      • For the volume, we can check that a high volume (in red) can explain a lower price.


  • On the right, we have the feature importance.
    • it is an aggregation of the SHAP value graphs
    • For each feature, we take the mean absolute value of the SHAP values.
    • Then it is normalized so that the sum equals 1.
      • it can be interpreted as a percentage
      • for example, the region has the most importance: it explains 18% of the dispersion around the family price average.

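The transformation and the normalization above can be sketched as follows. A base-10 log is assumed for the target because it reproduces the percentages quoted in this section (±0.05 → -11% / +12%); the real pipeline may use a different base:

```python
def shap_to_pct(shap_value):
    """Convert a SHAP value on the log10-price axis to a % price variation
    versus the family average (base-10 log assumed)."""
    return 10 ** shap_value - 1

def feature_importance(shap_values_by_feature):
    """Mean absolute SHAP value per feature, normalized to sum to 1."""
    raw = {f: sum(abs(v) for v in vals) / len(vals)
           for f, vals in shap_values_by_feature.items()}
    total = sum(raw.values())
    return {f: v / total for f, v in raw.items()}

print(f"{shap_to_pct(0.05):+.0%}")   # high region value (Americas)
print(f"{shap_to_pct(-0.05):+.0%}")  # low region value (APAC / EMEA)

# Toy SHAP values for two hypothetical features:
imp = feature_importance({"region": [0.05, -0.05, 0.04],
                          "volume": [0.02, -0.02, 0.0]})
print(imp)  # shares sum to 1 and read as percentages
```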

Example for Halar

  • We observe for the product coating feature that we obtain 2 clusters: one (the "not relevant" modality) with an impact of -0.225 (-40%) on the price, and the other of +0.125 (+33%) for the "Primer" / "Top coat" / "Standalone" modalities.
  • So these 2 groups of product coatings are not comparable.
  • Based on this chart, with the business we decided to create a hard boundary on this variable, so as not to compare products with different coating types.



4. Find comparables

To find comparables, we compute a similarity distance between every pair of CPCs. We use a "cosine" distance, which is very common in data science.

But we compute a "weighted cosine distance", using the feature weights defined thanks to the model.

  • If 2 CPCs are close on an important variable, it is more impactful than being very close on an unimportant variable.
  • The volume feature is excluded from the similarity distance calculation: it has an important weight, but the objective is not to find comparables with a similar volume, only with similar characteristics.
    • A volume adjustment is therefore applied as a next step.

For each target, all comparables are ranked according to similarity, and we exclude the ones that do not respect the hard boundaries.

Finally, we keep the 10 closest comparables.
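A sketch of the comparable search: a weighted cosine distance with the model's feature weights (volume already excluded from the vectors), plus a hypothetical `coating` hard boundary; all names and numbers are illustrative:

```python
import math

def weighted_cosine_distance(a, b, weights):
    """1 - weighted cosine similarity between two feature vectors."""
    num = sum(w * x * y for x, y, w in zip(a, b, weights))
    na = math.sqrt(sum(w * x * x for x, w in zip(a, weights)))
    nb = math.sqrt(sum(w * y * y for y, w in zip(b, weights)))
    return 1 - num / (na * nb)

def find_comparables(target, candidates, weights, same_boundary, k=10):
    """Drop hard-boundary violations, rank by similarity, keep the k closest."""
    kept = [c for c in candidates if same_boundary(target, c)]
    kept.sort(key=lambda c: weighted_cosine_distance(
        target["features"], c["features"], weights))
    return kept[:k]

target = {"features": [1.0, 0.0], "coating": "Primer"}
candidates = [
    {"id": "A", "features": [1.0, 0.1], "coating": "Primer"},
    {"id": "B", "features": [0.0, 1.0], "coating": "Primer"},
    {"id": "C", "features": [1.0, 0.0], "coating": "not relevant"},
]
same_coating = lambda t, c: t["coating"] == c["coating"]
best = find_comparables(target, candidates, [0.7, 0.3], same_coating, k=2)
print([c["id"] for c in best])  # C is excluded by the hard boundary
```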


5. Volume adjustment

Once we have selected the 10 comparables, they can have very different volumes, as the volume is not included in the selection criteria.

So the objective is to adjust the comparables' prices to answer the question: "what would their price be if their sales volume were identical to the volume of the target?"


First, we follow 4 steps to understand the price variation, in percentage, around the mean price of the family.

Example for Amodel 

  • We start from the SHAP values of the volume, seen previously.
    • on the graph scale, we go from -0.09 to +0.06

  • We add the volume as a second dimension, on the X-axis.
    • on the Y-axis, we retrieve our SHAP values with the scale between -0.09 and +0.06

  • We fit a curve that models the price impact according to the volume.
  • Finally, we apply the transformation to convert the Y-axis, and obtain a result that we can interpret as a "% price variation" due to volume, versus the average price of the family.
    • the curve doesn't change, only the Y-axis is modified

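These 4 steps can be sketched as follows; a simple least-squares line stands in for the fitted curve, the volume/SHAP pairs are toy values, and the base-10 log transformation is assumed as in the SHAP section:

```python
def fit_volume_curve(volumes, shap_values):
    """Fit volume SHAP values as a function of volume (here a simple
    least-squares line; the real pipeline may use a more flexible curve)."""
    n = len(volumes)
    mx = sum(volumes) / n
    my = sum(shap_values) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(volumes, shap_values))
             / sum((x - mx) ** 2 for x in volumes))
    return lambda v: my + slope * (v - mx)

# Toy points: higher volume -> lower SHAP value (lower price impact).
curve = fit_volume_curve([1, 2, 3, 4, 5], [0.06, 0.03, 0.0, -0.04, -0.09])

# Last step: convert the Y-axis to a "% price variation" (log10 assumed).
pct_at = lambda v: 10 ** curve(v) - 1
print(f"{pct_at(1):+.0%}", f"{pct_at(5):+.0%}")
```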

Then, based on this curve, 3 steps compute the adjustment that we have to apply to each comparable.

  • We take the comparable's volume on the X-axis, and read the Y-axis value at the intersection with the curve
    • e.g. -10% for a volume of 5 (100,000)
  • We do the same with the target's volume
    • e.g. +2% for a volume of 3 (1,000)
  • We compute the difference: the target value minus the comparable value.
    • e.g. +2% - (-10%) = +12%
  • This comparable will have its price increased by 12%, to compensate for its change from an actual volume of 100,000 to a simulated volume of 1,000.
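The worked example above, in code; the % values are read from a hard-coded toy lookup standing in for the fitted curve:

```python
# Toy lookup: encoded volume -> "% price variation" read off the curve.
pct_curve = {5: -0.10, 3: 0.02}

comparable_volume, target_volume = 5, 3   # 100,000 vs 1,000 units
adjustment = pct_curve[target_volume] - pct_curve[comparable_volume]
print(f"{adjustment:+.0%}")               # +2% - (-10%) = +12%

comparable_price = 100.0                  # illustrative price
adjusted_price = comparable_price * (1 + adjustment)
print(round(adjusted_price, 2))           # price increased by 12%
```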


[To do: add the dashboard graph when the data is up to date for this explanation.]


6. Group volume adjustment

xxx

7. Price recommendation

Cap 30%


