The weights model learns the relative importance (price-driving impact) of each feature for a given family. These weights are then used to weight each feature in the similarity distance function.
The model used to extract the weights is primarily a LightGBM model, trained to predict the price of a CPC given its features. From this trained model, we extract the weight of each feature (i.e. its price-driving relevance) using SHAP values.
The R² metric is used to measure model performance; its value generally lies between 0 (bad) and 1 (perfect). We consider the model good enough when R² is between 0.4 and 0.9.
Note: the objective is not to have a perfect model, because fitting the training data too well often leads to poor generalization.
In simpler words, an overfitted model relies on data that is too specific and detailed: while this improves performance on the training set, its predictions will fail with any slight change in the new data we provide at each campaign.
If R² is too low, this could mean two things:
==> In this use case, we do not use the output of the model to predict prices directly.
Example of R² output:
What we are interested in is understanding the importance of each of our pricing levers (features) in predicting the price. This is done by computing SHAP values when running the model.
This feature importance is then used by the second model as coefficients to find neighbors for each CPC.
Specialty Monomers example:

The strength of SHAP values is that they isolate the impact of a specific feature, ignoring all the other pricing features input to the model.
Final feature importance (weight):
SHAP values are then used to define the weight (or importance) of every pricing lever.
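One common way to turn per-sample SHAP values into a single weight per lever is to take the mean absolute SHAP value for each feature and normalize so the weights sum to 1. The sketch below assumes a precomputed SHAP matrix (as produced by e.g. `shap.TreeExplainer`); the exact aggregation used in the similarity library may differ, and the numbers and feature names here are illustrative.

```python
import numpy as np

# Hypothetical SHAP value matrix: one row per CPC, one column per
# pricing lever (e.g. the output of shap.TreeExplainer(model).shap_values(X)).
shap_values = np.array([
    [ 0.8, -0.1,  0.3],
    [-0.6,  0.2, -0.4],
    [ 0.7,  0.0,  0.5],
])
features = ["end_use", "product_group", "country_shipto"]

# Importance of a lever = mean absolute SHAP value across samples,
# normalized so the weights sum to 1.
raw = np.abs(shap_values).mean(axis=0)
weights = raw / raw.sum()

for name, w in zip(features, weights):
    print(f"{name}: {w:.2f}")
```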

For each model run, the weights model is retrained using the LightGBM parameters specified in the select_weights_model function (inside the similarity library). LightGBM is sensitive to its hyperparameters and may overfit if care is not taken. We therefore select the hyperparameters individually for each family, finding them with a hyperparameter grid search.
This functionality is implemented in the weights model, and can be run at the bottom of the weights model recipe for a specific family (toggled off by default) - printing out the optimal parameters. These parameters are then moved to the model parameter section for this family in the config, so future runs use these optimal parameters. We typically run the hyperparameter search when we’re adding a new family, or if large changes are made to an existing one (e.g. many new features).
The grid search trains a model for each combination of parameters specified in the weights model (in the library) and picks the combination with the best cross-validated R² score. To ensure the best possible parameters, we use a rich search space, which makes the procedure time-intensive (~30 minutes). Faster runs can be achieved by reducing the number of grid-search parameters in the library implementation.
Also note that this search will try Lasso and Random Forest models as references. Generally, LightGBM should be able to achieve better results than these, so if one of these models comes out as the best, this indicates that the LightGBM parameters should be tuned manually.
To find comparables, we compute a similarity distance between every pair of CPCs. This computation is based on the numeric values of the CPC features, to which we apply the feature weights from the first model as factors.
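A minimal sketch of such a pairwise distance, assuming a weighted Euclidean form (the exact distance formula used in the similarity library may differ; the vectors and weights below are made up):

```python
import numpy as np

# Hypothetical numeric feature vectors for two CPCs, and the
# per-feature weights produced by the weights model.
cpc_a = np.array([1.0, 0.2, 3.5])
cpc_b = np.array([0.5, 0.2, 2.5])
weights = np.array([0.6, 0.1, 0.3])  # SHAP-based feature weights

# Weighted Euclidean distance: each squared feature difference is
# scaled by that feature's importance before summing.
distance = np.sqrt(np.sum(weights * (cpc_a - cpc_b) ** 2))
print(round(distance, 3))  # 0.671
```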
The hard boundaries are decided based on the intuition of the business teams, who decide whether, within a family, there are some products that we should not compare.
To define a similarity threshold, we set a minimum percentage of selected features that must be identical between comparables.
```json
"match_percentage_similarity_threshold": 0.4,
"match_percentage_cols": {
    "shared": [
        "country_shipto",
        "end_use",
        "gbu_customer_seg",
        "product_group"
    ]
}
```
==> the threshold is currently set to 0
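The match-percentage check can be sketched as follows, using the columns and the 0.4 threshold from the example config above (the function name and record layout are hypothetical, for illustration only):

```python
# A candidate qualifies as a comparable only if at least
# match_percentage_similarity_threshold of the listed columns match.
MATCH_COLS = ["country_shipto", "end_use", "gbu_customer_seg", "product_group"]
THRESHOLD = 0.4

def passes_match_threshold(target: dict, candidate: dict) -> bool:
    matches = sum(target[col] == candidate[col] for col in MATCH_COLS)
    return matches / len(MATCH_COLS) >= THRESHOLD

target = {"country_shipto": "DE", "end_use": "coatings",
          "gbu_customer_seg": "A", "product_group": "PG1"}
candidate = {"country_shipto": "DE", "end_use": "adhesives",
             "gbu_customer_seg": "B", "product_group": "PG1"}

print(passes_match_threshold(target, candidate))  # 2/4 = 0.5 >= 0.4 -> True
```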
For each target CPC, all comparable CPCs are ranked by similarity distance; for the final price calculation, we then keep the comparables that fall within the threshold, capped at the top 10.
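Under one reading of this selection rule (rank by distance, filter by threshold, cap at 10), the logic looks as follows. The distances, the cutoff value, and the CPC names are all invented for illustration:

```python
# Hypothetical distances from one target CPC to candidate CPCs.
distances = {"cpc_b": 0.2, "cpc_c": 0.9, "cpc_d": 0.1, "cpc_e": 0.5}
DISTANCE_THRESHOLD = 0.6  # illustrative similarity-distance cutoff
TOP_N = 10

# Rank candidates by distance, keep those within the threshold,
# and cap the set at TOP_N comparables.
ranked = sorted(distances, key=distances.get)
comparables = [c for c in ranked if distances[c] <= DISTANCE_THRESHOLD][:TOP_N]
print(comparables)  # ['cpc_d', 'cpc_b', 'cpc_e']
```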