You will find the term "CPC" frequently in the following documentation, as it is the main level of granularity used throughout the application. This means that most of our datasets have one record per CPC of a given GBU.
The Customer Product Combination (CPC) is the identifier representing a specific product sold to a specific customer.
The product and customer definition here varies from one GBU to another.
For example :
The target represents the data we are trying to optimize, in this case the unit price of a CPC (Customer Product Combination).
The unit price we use in our models is the result of several computations made in our data preparation steps. We could summarize it as follows :
Note : as of now, this unit price includes all costs : fixed and variable.
More details are available below on how these computations are included in our global data preparation flow.
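As a rough sketch of this computation (column names here are assumptions for illustration, not the actual Data Lake schema), the unit price boils down to dividing the aggregated sales by the aggregated volume:

```python
import pandas as pd

# Hypothetical illustration of the unit price computation: the unit price of
# a CPC is derived from its sales and volume rather than taken directly from
# the Data Lake (column names are assumptions).
df = pd.DataFrame({
    "cpc_id": ["A", "B"],
    "net_sales_eur": [1000.0, 300.0],
    "qty_vkg": [200.0, 50.0],
})
df["unit_price_eur_vkg"] = df["net_sales_eur"] / df["qty_vkg"]
print(df["unit_price_eur_vkg"].tolist())  # [5.0, 6.0]
```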
To select the final list of the most relevant price drivers, we collected, built and tested more than 50 features:

These price drivers are coming from several data sources described below.
The main data source we are currently using is the Pricing Data Lake in Big Query, especially the two following datasets :
These datasets include :
You will find below the details of the processing steps for these data.

In this first step, we perform some data filtering and cleaning and then aggregate the forecast data at a CPC level.
The data we get from the data lake is at a CPC + distribution channel + month level. This means that each distribution channel for a given CPC can have its own value for some dimensions (especially incoterms, group of activity and enterprise segment).
Since we only want one record per CPC, we keep the values associated with the highest amount of sales, following these steps :
The output of this first step therefore contains the forecasted data aggregated at a CPC level, which means we always get 1 record for each CPC :
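The aggregation above can be sketched as follows (a simplified illustration; column names and the exact tie-breaking rule are assumptions): for each CPC, the dimension values come from the distribution channel with the highest sales, while the measures are summed over all channels.

```python
import pandas as pd

# Hypothetical sketch of the forecast aggregation step (names are assumptions).
df = pd.DataFrame({
    "cpc_id":             ["A", "A", "B"],
    "channel":            ["01", "02", "01"],
    "incoterms_id":       ["FOB", "CIF", "DAP"],
    "forecast_sales_eur": [100.0, 400.0, 50.0],
    "forecast_qty_vkg":   [10.0, 40.0, 5.0],
})

# Dimensions come from the best-selling channel of each CPC.
best = df.sort_values("forecast_sales_eur").drop_duplicates("cpc_id", keep="last")
dims = best[["cpc_id", "incoterms_id"]]

# Measures are summed over all channels of the CPC.
measures = df.groupby("cpc_id", as_index=False)[
    ["forecast_sales_eur", "forecast_qty_vkg"]
].sum()

# One record per CPC, with the dimensions of the top channel.
out = dims.merge(measures, on="cpc_id")
```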
This is the equivalent of the forecast data cleaning step.
Note that :
The output of this first step contains the historical data aggregated at a CPC + incoterms + month level, which means we always get 1 record for each CPC, incoterms and month.
The data we get from the data lake is at a CPC + incoterms + month level. Each incoterms for a given CPC can have its own value for some dimensions.
Since we only want one record per CPC, we keep the values associated with the highest amount of sales, following these steps :
The output of this step contains the historical data aggregated at a CPC level, which means we always get 1 record for each CPC.
In this step, we merge the forecast and historical data together. For each dimension, we take the value from the forecast data when available, else we revert to the value from the historical data.
This allows us to fill some missing data in the forecasts and therefore still keep the CPC in the analysis.
For the following measures, we keep both the forecasted and historical values in their dedicated fields : sales, volume, last invoice price.
We therefore end up with a dataset having one record per CPC and one value for every dimension, as well as both the historical and forecasted measures.
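This per-dimension coalescing (forecast value when available, historical value otherwise) can be sketched with pandas' `combine_first` (a minimal illustration; column names are assumptions):

```python
import pandas as pd

# Hypothetical sketch of the merge step: forecast values win when present,
# historical values fill the gaps (names are assumptions).
forecast = pd.DataFrame(
    {"gbu_region": ["EU", None], "incoterms_id": [None, "CIF"]},
    index=["cpc_A", "cpc_B"],
)
historical = pd.DataFrame(
    {"gbu_region": ["NA", "APAC"], "incoterms_id": ["FOB", "DAP"]},
    index=["cpc_A", "cpc_B"],
)

# combine_first keeps forecast values and falls back to historical ones.
unified = forecast.combine_first(historical)
print(unified.loc["cpc_A", "incoterms_id"])  # FOB (from historical)
```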
What we call the unified timeline is the dataset we get by combining the historical and forecast data as described above. Once we do this, we can apply our last steps of data preparation.
Removal of the records having no product_line_04 or forecast_qty_vkg
Until now, we have not manipulated unit prices at all, even though they are the central element of our price optimization. Indeed, due to the previously mentioned aggregation steps, we cannot use the unit prices directly available in the Data Lake, but we can recompute them based on the sales and volumes we have at this stage.
For some of our features, we aggregate the sales of the CPC.
Because we only filter out CPCs having no forecasted volume, we still face (rare) situations in which a CPC has forecasted volume but no related sales. When this happens, the unit price we get comes from the historical data and the CPC is kept in our models. Therefore, we also have to re-compute the sales based on the final unit price (which should be equivalent to the historical sales).
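The two re-computations above can be sketched as follows (a simplified illustration; column names and the fallback rule are assumptions): the unit price is rebuilt from the aggregated sales and volumes, falling back to the historical price when forecast sales are missing, and the sales are then re-derived from the final unit price so both stay consistent.

```python
import numpy as np
import pandas as pd

# Hypothetical sketch of the unit price and sales re-computation.
df = pd.DataFrame({
    "cpc_id":                  ["A", "B"],
    "forecast_sales_eur":      [500.0, np.nan],  # B: volume but no sales
    "forecast_qty_vkg":        [50.0, 10.0],
    "hist_unit_price_eur_vkg": [9.5, 8.0],
})

# Recompute the unit price; fall back to the historical price when needed.
df["unit_price_eur_vkg"] = (
    df["forecast_sales_eur"] / df["forecast_qty_vkg"]
).fillna(df["hist_unit_price_eur_vkg"])

# Re-derive the sales from the final unit price.
df["sales_eur"] = df["unit_price_eur_vkg"] * df["forecast_qty_vkg"]
```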
This boolean feature is computed based on the gbu_customer_seg data : the feature value is true if it contains "Distributor", else false.
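A minimal sketch of this feature (the column name is an assumption):

```python
import pandas as pd

# Hypothetical sketch: True when gbu_customer_seg contains "Distributor",
# False otherwise (including missing values).
df = pd.DataFrame({"gbu_customer_seg": ["Distributor - Tier 1", "OEM", None]})
df["is_distributor"] = df["gbu_customer_seg"].str.contains(
    "Distributor", na=False
)
print(df["is_distributor"].tolist())  # [True, False, False]
```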
[
'incoterms_id',
'material_id',
'material_name',
'shipto_customer_id',
'shipto_customer_medium_name',
'shipto_customer_corporate_group_name',
'manufacturing_plant_id',
'manufacturing_plant_name',
'product_id',
'product_name',
'shipping_plant_id',
'shipping_plant_name',
'country_ship_to_id',
'country_ship_to_name',
'group_of_activity_id',
'soldto_customer_id',
'soldto_customer_medium_name',
'soldto_customer_corporate_group_name',
'forecast_dt',
'local_currency_id',
'uom_id',
'gbu_region',
'gbr_enduse_shipto_name',
'gbr_market_name',
'gbu_product_family_name',
'gbu_customer_segment_name',
'gbr_enterprise_segment_w_name',
'forecast_net_price_eur_vkg',
'forecast_qty_vkg',
'forecast_sales_eur',
'last_invoice_price_eur_vkg',
'distribution_channel_id',
'product_line_04_name',
'running_dt',
'gbu_id'
]
[
'incoterms_id',
'material_id',
'material_name',
'shipto_customer_id',
'shipto_customer_medium_name',
'shipto_customer_corporate_group_name',
'manufacturing_plant_id',
'manufacturing_plant_name',
'product_id',
'product_name',
'shipping_plant_id',
'shipping_plant_name',
'country_ship_to_id',
'country_ship_to_name',
'group_of_activity_id',
'soldto_customer_id',
'soldto_customer_medium_name',
'soldto_customer_corporate_group_name',
'pnl_posting_month_dt',
'gbu_region',
'gbr_enduse_shipto_name',
'gbr_market_name',
'gbu_product_family_name',
'gbu_customer_segment_name',
'gbr_enterprise_segment_w_name',
'forecast_net_price_eur_vkg',
'qty_sold_external_vkg',
'net_sales_external_eur',
'last_invoice_price_eur_vkg',
'product_line_04_name',
'last_invoice_price_pnl_posting_month_dt',
'gbu_id'
]
- Product groupings are collected from this GSheet, which is supplied by the business. Alongside the product groups, which are generally used as hard boundaries to limit the comparable sets, the file also contains a few other columns that may be used as features for some of the product families.
- The manual region mapping for each product family comes from this GSheet and is also supplied by the business.
- In the Join recipe, both datasets are added to the feature dataset obtained in the previous steps.
- Both product composition files are provided in this GSheet file for now. This source will have to be industrialized through the data lake.
- In the Python recipe, we compute the product composition features at a product level by aggregating together the different substances having the same component type.
- The Join recipe adds the material code to the dataset using the "EHS_Product" key mapping.
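The composition aggregation in the Python recipe can be sketched as follows (a simplified illustration; column names and the share measure are assumptions): substance shares are summed per product and component type, then pivoted into one feature column per component type.

```python
import pandas as pd

# Hypothetical sketch of the product composition features (names assumed).
comp = pd.DataFrame({
    "EHS_Product":    ["P1", "P1", "P1", "P2"],
    "component_type": ["solvent", "solvent", "pigment", "solvent"],
    "share_pct":      [20.0, 10.0, 5.0, 40.0],
})

# Sum the substances sharing a component type, one feature per type.
features = (
    comp.groupby(["EHS_Product", "component_type"], as_index=False)["share_pct"]
    .sum()
    .pivot(index="EHS_Product", columns="component_type", values="share_pct")
    .fillna(0.0)
)
```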

Note : Contracts data are not currently used as features because too few CPCs are under contract in the data we use.
- Both contracts files are provided in this GSheet file for now. This source will have to be industrialized through the data lake if used in the future.
- In the first prepare recipe, we only keep the contracts with a "Signed" status as the other ones are not considered active.
- We then group the dataset by CPC to only keep one contract if several are active. We currently do not have a rule to define priorities, so we keep one arbitrarily.
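The two contract preparation steps above can be sketched as follows (a minimal illustration; column names are assumptions, and "first row per CPC" stands in for the arbitrary pick):

```python
import pandas as pd

# Hypothetical sketch of the contracts preparation (names are assumptions).
contracts = pd.DataFrame({
    "cpc_id":      ["A", "A", "B", "C"],
    "status":      ["Signed", "Signed", "Draft", "Signed"],
    "contract_id": [1, 2, 3, 4],
})

# Keep only "Signed" contracts; others are not considered active.
active = contracts[contracts["status"] == "Signed"]

# No priority rule is defined yet, so keep one arbitrary contract per CPC.
one_per_cpc = active.drop_duplicates("cpc_id", keep="first")
print(one_per_cpc["contract_id"].tolist())  # [1, 4]
```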