1st Model: the weights model

The weights model learns the relative importance (price-driving impact) of each feature for a given family. These importances are then used to weight each feature in the similarity distance function.

The model used to extract the weights is primarily a LightGBM model, trained to predict the price of a CPC given its features. SHAP values computed on this model then give the weight of each feature (i.e. its price-driving relevance).

R² metric

The R² metric is used to measure model performance. Its value is generally between 0 (bad) and 1 (perfect). We consider the model good enough when R² is between 0.4 and 0.9.
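For reference, R² is computed as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses hypothetical prices and a hypothetical helper `is_good_enough` that encodes the 0.4 to 0.9 range mentioned above.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Assumes y_true is not constant (otherwise SS_tot is zero).
    """
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def is_good_enough(r2, low=0.4, high=0.9):
    """Hypothetical check for the acceptance range used in this document."""
    return low <= r2 <= high


# Hypothetical prices, for illustration only.
actual_prices = [10.0, 12.0, 15.0, 20.0]
predicted_prices = [12.0, 11.0, 17.0, 18.0]

r2 = r2_score(actual_prices, predicted_prices)
print(round(r2, 2), is_good_enough(r2))  # about 0.77, inside the range
```

A model that always predicts the mean price gets R² = 0, which is why 0 is read as "bad".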

Note: the objective is not to have a perfect model, because fitting the training data too well often leads to poor generalization.
In simpler words, a model that picks up details too specific to the training data performs better on the training set, but its predictions fail at any slight change in the new data we provide at each campaign.

If R² is too low, this could mean two things:

We are missing some pricing levers (features) needed to explain the price. The ones we provide are not enough to predict the price well.
The dispersion of prices cannot be explained by data (due to human behavior, specific negotiation with the customers, etc.)
For example, we can end up with 2 CPCs that have exactly the same values for all features but different prices.
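
The second case puts a hard ceiling on R²: any model that is a function of the features must predict the same price for two CPCs with identical features, and the best it can do is predict their average price. The sketch below computes that ceiling on a hypothetical data set (the feature values and prices are invented for illustration).

```python
from collections import defaultdict


def r2_ceiling(rows, prices):
    """Best R² reachable on this data by any model that only sees the features.

    Rows with identical features are grouped; the best possible prediction
    for a group is its mean price, so the within-group spread is unexplainable.
    """
    groups = defaultdict(list)
    for features, price in zip(rows, prices):
        groups[tuple(features)].append(price)

    overall_mean = sum(prices) / len(prices)
    ss_tot = sum((p - overall_mean) ** 2 for p in prices)

    ss_within = 0.0
    for group_prices in groups.values():
        group_mean = sum(group_prices) / len(group_prices)
        ss_within += sum((p - group_mean) ** 2 for p in group_prices)

    return 1.0 - ss_within / ss_tot


# Hypothetical CPCs: two pairs with identical features but different prices.
rows = [("EU", "drum"), ("EU", "drum"), ("US", "bulk"), ("US", "bulk")]
prices = [10.0, 14.0, 20.0, 24.0]

ceiling = r2_ceiling(rows, prices)
print(round(ceiling, 3))  # about 0.862
```

Here no model can exceed an R² of about 0.86 on this data, however well it is tuned, because the remaining dispersion is not carried by the features.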

==> In this use-case, we do not use the output of the model to predict prices directly.

Example of R² output:

Feature importance: SHAP values

What we are interested in is the importance of each of our pricing levers (features) in predicting the price. This is done by computing SHAP values when running the model.
These feature importances are then used by the second model as coefficients to find neighbors for each CPC.
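
To illustrate how such coefficients can change which CPCs count as neighbors, here is a minimal sketch. It assumes a weighted Euclidean distance over numeric, already scaled features; the exact distance function of the second model is not described here, and the feature names, weights, and CPC identifiers are hypothetical.

```python
import math


def weighted_distance(a, b, weights):
    """Weighted Euclidean distance between two CPCs given as feature dicts."""
    return math.sqrt(sum(w * (a[f] - b[f]) ** 2 for f, w in weights.items()))


def nearest_neighbors(target_id, cpcs, weights, k=2):
    """Return the ids of the k CPCs closest to target_id."""
    target = cpcs[target_id]
    scored = [
        (weighted_distance(target, features, weights), cpc_id)
        for cpc_id, features in cpcs.items()
        if cpc_id != target_id
    ]
    return [cpc_id for _, cpc_id in sorted(scored)[:k]]


# Hypothetical weights, e.g. normalized feature importances.
weights = {"volume": 0.75, "purity": 0.25}

# Hypothetical CPCs with scaled numeric features.
cpcs = {
    "A": {"volume": 0.0, "purity": 0.0},
    "B": {"volume": 1.0, "purity": 0.0},
    "C": {"volume": 0.0, "purity": 1.5},
    "D": {"volume": 2.0, "purity": 2.0},
}

neighbors = nearest_neighbors("A", cpcs, weights, k=2)
print(neighbors)  # ['C', 'B']
```

Without weights, B would be the closest CPC to A (distance 1.0 versus 1.5 for C). With the weights, a gap on the heavily weighted feature costs more, so C becomes the nearest neighbor.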

Specialty Monomers example: