Target:
The target represents the data we are trying to optimize, in this case the unit price of a CPC (Customer Product Combination).
As of now, this unit price includes all costs: fixed and variable.
The price target is built following these steps (presented in these recipes in Dataiku: Prepare recipe 1, Group recipe 1, Prepare recipe 3):
1. Gathering of the CPC forecast prices and last invoice price:
Corresponds to the two fields forecast_unit_price (Fcst_Unit_Price_Est_Act_E for the next 12 months) and last_invoice_price (Last_Invoice_Price_EUR_KG) in the Transparency Dashboard extracts.
2. Cleaning of both variables by replacing "-" and non-decimal values with empty cells
3. Aggregation of the forecasted months to keep a single forecast_unit_price and last_invoiced_price at CPC level.
- We keep the average of the forecast_unit_price over the next 12 months.
- We keep the first non-null last_invoiced_price over the next 12 months.
4. Computing the final unit price using the following formula:
computed_unit_price = if(forecasted_unit_price == 0 || isNull(forecasted_unit_price), last_invoiced_price, forecasted_unit_price)
=> We prioritize the forecast_unit_price from the step above when it is greater than 0; otherwise we fall back to the last invoice price from the forecast data.
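The four steps above can be sketched in pandas. This is an illustration only, not the actual Dataiku recipes; the column names are simplified from the field names quoted in the document:

```python
import numpy as np
import pandas as pd

def build_price_target(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch of the Prepare/Group steps that build computed_unit_price.

    Expects monthly rows with columns: cpc, forecast_unit_price,
    last_invoiced_price (raw strings straight from the extract).
    """
    df = df.copy()
    # Step 2: replace "-" and other non-decimal values with missing values.
    for col in ["forecast_unit_price", "last_invoiced_price"]:
        df[col] = pd.to_numeric(df[col], errors="coerce")
    # Step 3: aggregate the 12 forecasted months at CPC level.
    agg = df.groupby("cpc").agg(
        forecast_unit_price=("forecast_unit_price", "mean"),
        last_invoiced_price=("last_invoiced_price", "first"),  # first non-null
    )
    # Step 4: prefer the forecast price when it is > 0, else fall back
    # to the last invoiced price.
    agg["computed_unit_price"] = np.where(
        agg["forecast_unit_price"].fillna(0) > 0,
        agg["forecast_unit_price"],
        agg["last_invoiced_price"],
    )
    return agg.reset_index()
```

Note that pandas `GroupBy.first` already skips missing values, which matches the "first non-null" rule of step 3.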
Price drivers:
To select the final list of the most relevant price drivers, we collected, built and tested more than 50 features.
These price drivers come from several data sources described below.
Forecasts and historical data
The main data source we are currently using is the Pricing Data Lake in BigQuery, especially the two following datasets:
- V_FACT_sales_forecast_enriched_current : Forecasts data.
- V_FACT_sales_history_cpc_last12months : Historical data for the past 12 months.
These datasets include:
- Sales and volume measures that are also used to generate the unit price used as a target for our models (see the dedicated Target section above).
- Dimensions used as features / price drivers (see the list of dimensions below).
You will find below the details of the processing steps for these data.
Forecast data cleaning and aggregation (1)
In this first step, we perform a bit of data filtering and cleaning and then aggregate the forecast data at a CPC level.
Data filtering and cleaning:
- Rename columns to match the names used in our backend.
- Filter on the right GBU (through the gbu_id field).
- Filter on the last running month (running_dt field).
- Keep forecasts for the current and next 11 months (forecast_dt field); this is mandatory because we then sum the sales and volumes at a CPC level and want to stay on a 12-month basis.
- Delete records with null customer and product keys (shipto_customer_id / material_id for Novecare).
- Keep only the columns we need.
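As a pandas sketch of these filtering steps (field names come from the document; the GBU identifier and running-month values passed in are placeholders):

```python
import pandas as pd

def filter_forecasts(df: pd.DataFrame, gbu_id, running_month) -> pd.DataFrame:
    """Illustrative filtering, mirroring the cleaning steps above."""
    mask = (
        (df["gbu_id"] == gbu_id)                       # right GBU
        & (df["running_dt"] == running_month)          # last running month
        # current month + next 11 months => stay on a 12-month basis
        & (df["forecast_dt"] >= running_month)
        & (df["forecast_dt"] < running_month + pd.DateOffset(months=12))
        & df["shipto_customer_id"].notna()             # drop null customer keys
        & df["material_id"].notna()                    # drop null product keys
    )
    return df.loc[mask]
```

Column selection (the last step above) would simply be a `df[cols]` projection afterwards.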
CPC aggregation:
The data we get from the data lake is at a CPC + distribution channel + month level. This means that each distribution channel for a given CPC can have its own value for some dimensions (especially incoterms, group of activity and enterprise segment).
Since we only want one record per CPC, we keep the values associated with the highest amount of sales, following these steps:
- Compute total sales by distribution channel for every CPC (new field total_sales_by_channel).
- Sort the data by cpc, total_sales_by_channel (highest first) and month (closest first).
- Aggregate the sorted data by CPC:
- keeping the first non-null value for all the dimensions and last_invoice_price
- summing the measures (volume and sales)
The output of this first step therefore contains the forecasted data aggregated at a CPC level, which means we always get 1 record for each CPC.
- The dimensions displayed are the ones of the closest forecasted month for the distribution channel with the most sales.
- The volume and sales measures are summed over the 12 first months of forecasts.
- The unit price is not yet computed at this stage.
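The sort-then-aggregate logic could look like this in pandas (a sketch with simplified column names; "channel" and "month" stand in for the actual distribution channel and forecast month fields):

```python
import pandas as pd

def aggregate_by_cpc(df: pd.DataFrame, dim_cols, measure_cols) -> pd.DataFrame:
    """Keep dimensions from the best-selling channel / closest month,
    sum the measures over the 12 forecasted months."""
    df = df.copy()
    # Total sales per distribution channel for every CPC.
    df["total_sales_by_channel"] = (
        df.groupby(["cpc", "channel"])["sales"].transform("sum")
    )
    # Highest-selling channel first, then closest month first.
    df = df.sort_values(
        ["cpc", "total_sales_by_channel", "month"],
        ascending=[True, False, True],
    )
    agg = {c: "first" for c in dim_cols}          # first non-null value
    agg.update({c: "sum" for c in measure_cols})  # 12-month totals
    return df.groupby("cpc").agg(agg).reset_index()
```

Because `GroupBy.first` skips missing values, a dimension left empty on the best channel's closest month is backfilled from the next row in the sorted order.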
Sales and Volume measures (to derive unit price)
- In the Group recipe 1:
forecasted_sales (Fcst_Sales_Est_Act_EUR) and forecasted_volume (Fcst_Volume_Est_Act_KG) have been aggregated by CPC in order to keep the sum of forecasts over the whole 12 months for each CPC.
- CPCs with zero values in forecasted_volume are filtered out in the Prepare recipe 3.
- Historical sales, volume and unit_price have been processed with the same method as the forecasts (in the Group recipe 2 instead), and are included in the final dataset in the Python recipe, to be used later in the feature engineering stage.
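Deriving a unit price from the summed measures then amounts to the following (a sketch with hypothetical column names, applying the zero-volume filter described above):

```python
import pandas as pd

def derive_unit_price(df: pd.DataFrame) -> pd.DataFrame:
    """Drop CPCs with no forecasted volume, then compute sales / volume."""
    df = df[df["forecasted_volume"] > 0].copy()
    df["forecast_unit_price"] = df["forecasted_sales"] / df["forecasted_volume"]
    return df
```

Filtering before dividing avoids both division-by-zero artifacts and meaningless prices for CPCs without forecasted volume.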
ICM measures
- Historical_unit_icm (Fcst_Unit_ICM_Est_Act_EUR) has also been aggregated by CPC to compute the average over the last 12 months (Group recipe 2).
- Historical_icm (Fcst_ICM_Est_Act_EUR) has been included in the Python recipe, and used further in the feature engineering step to compute ICM ratios.
List of "dimensions" used as features
| incoterms | manufacturing_plant | product_name |
| material_code | grp_of_activities | product_code |
| material_name | country_shipto | product_brand_name |
| shipto_code | gbu_region | country_soldto |
| shipto_name | gbu_product_family | grp_customer_seg |
| soldto_name | entreprise_seg | shipping_plant |
| soldto_code | market | end_use |
| soldto_group | gbu_customer_seg | |
- In the forecast dataset, these features were aggregated by CPC to retain the earliest available data (Group recipe 1).
- The same features in the historical dataset were aggregated by CPC to retain the latest available data (Group recipe 2).
- In the Python recipe, the two datasets were combined to fill empty values in the forecast data with historical data and to generate the features dataset. As a result, the main data source for all these dimensions remains the forecasts; historical data is only used as a filler.
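Filling empty forecast dimensions with historical values maps naturally onto pandas `combine_first` (a sketch, assuming both frames are indexed by CPC and share dimension columns):

```python
import pandas as pd

def merge_dimensions(forecast: pd.DataFrame, history: pd.DataFrame) -> pd.DataFrame:
    """Forecast values take priority; history only fills the gaps."""
    return forecast.combine_first(history)
```

`combine_first` keeps every non-null forecast cell and falls back to the historical cell only where the forecast is missing, which matches the "historical data as filler" rule above.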
Manual inputs
- Product groupings are collected from this GSheet, which is supplied by the business. While the product groups are generally used as hard boundaries to limit the comparable sets, the file also contains a few other columns that may be used as features for some of the product families.
- The manual region mapping for each product family comes from this GSheet and is also supplied by the business.
- In the Join recipe, both datasets are added to the feature dataset obtained in the previous steps.
Product characteristics
- Both product composition files are provided in this GSheet file for now. This source will have to be industrialized through the data lake.
- The PC_composition_file_sql dataset contains the product composition at an "EHS_Product" level of granularity.
- The PC_matserial_mapping_sql dataset contains the mapping between the "EHS_Product" and the material codes we are using in the rest of the project.
- In the Python recipe, we compute the product composition features at a product level by aggregating together the different substances having the same component type.
- The Join recipe adds the material code to the dataset using the "EHS_Product" key mapping.
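A sketch of the composition aggregation and mapping join (dataset and column names are simplified stand-ins for the GSheet columns, which are not spelled out in the document):

```python
import pandas as pd

def composition_features(composition: pd.DataFrame,
                         mapping: pd.DataFrame) -> pd.DataFrame:
    """Sum substance percentages per EHS_Product and component type,
    then map each product to its material code."""
    agg = (
        composition.groupby(["EHS_Product", "component_type"])["percentage"]
        .sum()
        .unstack(fill_value=0)   # one feature column per component type
        .reset_index()
    )
    # Equivalent of the Join recipe: attach the material code via EHS_Product.
    return agg.merge(mapping, on="EHS_Product", how="inner")
```

Substances sharing a component type collapse into a single column, so the feature set stays at product level regardless of how many substances a product contains.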
Contracts
Note: Contracts data are not currently used as features because too few CPCs are under contract in the data we use.
- Both contracts files are provided in this GSheet file for now. This source will have to be industrialized through the data lake if used in the future.
- The Contract_Product_dataset_sql dataset contains the link between a contract and the products it is applied on. When a contract does not have any product linked, we consider that it applies to all of the products sold to the customer.
- The Contract_dataset_sql dataset contains the contract information at a customer level.
- In the first Prepare recipe, we only keep contracts with a "Signed" status, as the others are not considered active.
- We then group the dataset by CPC to keep only one contract when several are active. We currently do not have a rule to define priorities, so we keep one arbitrarily.
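The contract selection described above could be sketched as follows (column names are illustrative; "keep the first row" stands in for the arbitrary choice, since no priority rule exists yet):

```python
import pandas as pd

def one_contract_per_cpc(contracts: pd.DataFrame) -> pd.DataFrame:
    """Keep only 'Signed' contracts, then a single arbitrary one per CPC."""
    signed = contracts[contracts["status"] == "Signed"]
    # No business rule to prioritize contracts yet: keep the first per CPC.
    return signed.drop_duplicates(subset="cpc", keep="first")
```

If a priority rule is defined later, a `sort_values` before `drop_duplicates` would make the choice deterministic instead of arbitrary.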


