This page describes in detail the global variables used in the Dataiku workflow. The variables are grouped as follows:

Environment

"env": "master"

This variable is used to differentiate the file storage between a "dev" and a "master" location.

This variable is always set to "master" in the Global variables and overridden by "dev" in the Local variables, so that nothing has to be changed when moving between design and automation.
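The override described above can be pictured with a minimal sketch in which local values take precedence over global ones. The function name resolve_variables and the storage paths are assumptions made for this illustration; they are not part of the project.

```python
def resolve_variables(global_vars: dict, local_vars: dict) -> dict:
    """Return the effective variables: local values take precedence over global ones."""
    return {**global_vars, **local_vars}


# The global variables are the same everywhere.
global_vars = {"env": "master"}

# Local variables exist only where an override is needed (design, in this sketch).
design = resolve_variables(global_vars, {"env": "dev"})
automation = resolve_variables(global_vars, {})

# Hypothetical storage layout, used only to show the effect of "env".
design_path = f"/{design['env']}/outputs"
automation_path = f"/{automation['env']}/outputs"
```

With this approach the same global value can be deployed unchanged, and only the design side carries the "dev" override.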

Versioning

These two variables are used to define the version of the runs in Dataiku folders:

"version_name": "Q3.3_all_families_run",
"version_date": "2023-09-01",

To get the version name in the Python recipes, we use the get_config_version() function from the config_helpers.py file.
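As a rough illustration only, a helper of this kind could read the two variables as sketched below. The function read_version and the "runs/" folder layout are hypothetical; the actual get_config_version() in config_helpers.py may behave differently.

```python
from datetime import date


def read_version(variables: dict) -> tuple[str, date]:
    """Return the run version name and its date parsed from ISO format."""
    return variables["version_name"], date.fromisoformat(variables["version_date"])


variables = {"version_name": "Q3.3_all_families_run", "version_date": "2023-09-01"}
version_name, version_date = read_version(variables)

# Hypothetical folder layout: one sub-folder per run version.
run_folder = f"runs/{version_name}"
```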

Parameter gsheet

Directly used in the gsheet settings for parameters update, this variable contains the URL of the PROD gsheet in the global variables.

It is overridden in all design projects (dev and master included) by the local variable containing the DEV gsheet.

"parameter_gsheet": "1onxXZjM3MH-GISC47L41pUOvfwi0ySgqXxM4rLURupw"
"parameter_gsheet": "1AK6UgFg9KeBHR3_--zGJ9S930yBV3ckLRjmfk4-qiIk"

Taxonomy gsheet (SpP only)

Directly used in the gsheet settings for taxonomy, this variable contains the URL of the PROD gsheet in the global variables.

It is overridden in all design projects (dev and master included) by the local variable containing the DEV gsheet.

"taxonomy_gsheet": "1ASWg_uEPWZF_YOhydSD32gTxDL_cjSuJoV_FpdFjmGo"
"taxonomy_gsheet": "1pz4-k47el-DsoK1iI1E9CLxWeC-v-OkWFTxUDwU2yzo"

GBU

Specific GBU measures and identifiers, which must be updated for each new GBU.

"families_in_scope" contains the names of the product families currently integrated for the GBU.

"all_validated_families" contains all the families that have already been validated on both the technical and business sides. It must be updated each time a new product family is added.

A family can therefore have been validated but not be used in current runs if it has been deactivated by the business.

"GBU_measures": {
    "historical_revenue": "historical_sales",
    "historical_volume": "historical_volume",
    "historical_price": "historical_unit_price"
  },
  "GBU_identifiers": {
    "id_key": "cpc",
    "product_key": "material_code",
    "customer_key": "shipto_code",
    "soldto_key": "soldto_code",
    "soldto_group_key": "soldto_group",
    "shipto_key": "shipto_code",
    "family_key": "gbu_product_family",
    "sales_key": "forecasted_sales",
    "volumes_key": "forecasted_volume",
    "prices_key": "computed_unit_price"
  },
  "families_in_scope": [
    "Sulfosuccinate_Sulfosuccinamate",
    "Specialty_Monomers",
    "Phosphate_Esters",
    "Amines"
  ],
  "all_validated_families": [
    "Sulfosuccinate_Sulfosuccinamate",
    "Phosphate_Esters",
    "Alkoxylates",
    "Amines"
  ],

To get these variables, we use the get_config_gbu_ids() function from the config_helpers.py file.

The values of the "families_in_scope" variable are used together with the GBU identifier "family_key" to select the families in scope.
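The selection described above can be sketched as a filter on the column named by "family_key". The function name select_families_in_scope and the sample rows are assumptions for the example; the real recipe may work on a dataframe instead of a list of dictionaries.

```python
def select_families_in_scope(rows, gbu_identifiers, families_in_scope):
    """Keep the rows whose family column value is one of the families in scope."""
    family_column = gbu_identifiers["family_key"]
    allowed = set(families_in_scope)
    return [row for row in rows if row[family_column] in allowed]


gbu_identifiers = {"family_key": "gbu_product_family"}
families_in_scope = [
    "Sulfosuccinate_Sulfosuccinamate",
    "Specialty_Monomers",
    "Phosphate_Esters",
    "Amines",
]

# Illustrative rows; the cpc values are made up.
rows = [
    {"cpc": "A", "gbu_product_family": "Amines"},
    {"cpc": "B", "gbu_product_family": "Alkoxylates"},  # validated, but not in scope
    {"cpc": "C", "gbu_product_family": "Phosphate_Esters"},
]
in_scope = select_families_in_scope(rows, gbu_identifiers, families_in_scope)
```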

Product composition

Variables used to process product composition data, in particular to select the components to be used and to specify the identifiers of the product, component type, measure and unit.

"product_composition": {
    "component_values": [
      "COMPONENT",
      "IMPURITY",
      "SOLVENT",
      "ADDITIVE",
      "Z_CONST"
    ],
    "product_identifier": "EHS_Product",
    "component_type_identifier": "Component_Type",
    "measure_identifier": "Average",
    "unit_identifier": "Unit"
  },

These variables are passed as arguments to the compute_product_composition() function to compute the product composition features in this recipe.
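To show how the identifiers fit together, here is a minimal sketch that sums the measure per product and component type, keeping only the configured component types. It is not the project's compute_product_composition(); the function composition_totals and the sample rows are hypothetical.

```python
from collections import defaultdict


def composition_totals(rows, config):
    """Sum the measure per product and component type, for the configured types only."""
    product_col = config["product_identifier"]
    type_col = config["component_type_identifier"]
    measure_col = config["measure_identifier"]
    kept_types = set(config["component_values"])

    totals = defaultdict(lambda: defaultdict(float))
    for row in rows:
        if row[type_col] in kept_types:
            totals[row[product_col]][row[type_col]] += row[measure_col]
    return {product: dict(by_type) for product, by_type in totals.items()}


config = {
    "component_values": ["COMPONENT", "IMPURITY", "SOLVENT", "ADDITIVE", "Z_CONST"],
    "product_identifier": "EHS_Product",
    "component_type_identifier": "Component_Type",
    "measure_identifier": "Average",
    "unit_identifier": "Unit",
}
# Made-up composition rows for one product; "OTHER" is not a configured type.
rows = [
    {"EHS_Product": "P1", "Component_Type": "COMPONENT", "Average": 70.0, "Unit": "%"},
    {"EHS_Product": "P1", "Component_Type": "SOLVENT", "Average": 25.0, "Unit": "%"},
    {"EHS_Product": "P1", "Component_Type": "OTHER", "Average": 5.0, "Unit": "%"},
]
totals = composition_totals(rows, config)
```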

Pre-processing

Variables used in the various data preparation stages:

"preprocessing_filters": {
    "product_group": [
      "SSPH"
    ],
    "material_name": [
      "AEROSOL OT-100 SURF 25KG FBD WHSKIN",
      "AEROSOL OT-100 SURF 11KG W/LBL BOX"
    ],
    "end_use": [
      "Hpc-Api"
    ]
  },
"replace_with_null": {
    "end_use": [
      "Not Assigned"
    ],
    "gbu_customer_seg": [
      "Not valid",
      "Not yet assigned"
    ],
    "market_cluster": [
      "Not Identified",
      "-1"
    ]
  },
"imputers": {
    "most_frequent": [
      "manual_region_SS",
      "manual_region_SM",
      "product_group"
    ],
    "constant": {
      "n_competitors": 1,
      "historical_unit_price_coalesce_ratio_on_12": 1,
      "historical_sales_coalesce_ratio_on_12": 1,
      "historical_unit_price_ratio_3_on_12_month": 1
    },
    "mean": [
      "COMPONENT_ratio",
      "IMPURITY_ratio",
      "SOLVENT_ratio",
      "n_components"
    ]
  },
"outliers": {
    "remove_outliers": false,
    "method": "IQR",
    "vars_to_check": [
      "cpc_price_log",
      "cpc_volume_log",
      "cpc_revenue_log"
    ],
    "n_vars_out": 1
  },
"categorical_encoder": "TargetMean",
"ordinal_encoder": "Ordinal",

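As an illustration of two of the settings above, the sketch below applies "replace_with_null" and the "constant" imputers to rows held as dictionaries. The function names and the sample row are assumptions; the project recipes may implement these steps differently.

```python
def replace_with_null(rows, replace_config):
    """Set a column to None when its value is one of the configured placeholders."""
    cleaned = []
    for row in rows:
        new_row = dict(row)
        for column, placeholders in replace_config.items():
            if new_row.get(column) in placeholders:
                new_row[column] = None
        cleaned.append(new_row)
    return cleaned


def impute_constants(rows, constants):
    """Fill missing (absent or None) values with the constant configured per column."""
    return [
        {**row, **{col: val for col, val in constants.items() if row.get(col) is None}}
        for row in rows
    ]


replace_config = {
    "end_use": ["Not Assigned"],
    "market_cluster": ["Not Identified", "-1"],
}
constants = {"n_competitors": 1}

# Made-up input row.
rows = [{"end_use": "Not Assigned", "market_cluster": "EU", "n_competitors": None}]
prepared = impute_constants(replace_with_null(rows, replace_config), constants)
```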
In addition, to create the sales evolution features, a set of parameters for the get_interval_ratio() function is declared as global variables.

The function, used in the sales evolution features recipe, calculates the ratio of each column listed in "evolution_columns" over one or several months ("numerator_list") relative to another set of months ("denominator_list").

"evolution_features_params": {
    "evolution_columns": [
      "historical_sales",
      "historical_volume",
      "historical_unit_price"
    ],
    "numerator_list": [
      1,
      3,
      6
    ],
    "denominator_list": [
      12
    ]
  },
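One possible reading of these parameters is sketched below: for each numerator and denominator pair, the mean of the most recent months is divided by the mean over the longer window. This is an assumption about the calculation, not the actual get_interval_ratio() code; the function interval_ratio, the feature names and the sales values are hypothetical.

```python
def interval_ratio(monthly_values, numerator_months, denominator_months):
    """Ratio of the mean over the last `numerator_months` months to the mean
    over the last `denominator_months` months (oldest value first)."""
    numerator = monthly_values[-numerator_months:]
    denominator = monthly_values[-denominator_months:]
    denominator_mean = sum(denominator) / len(denominator)
    if denominator_mean == 0:
        return None  # the ratio is undefined
    return (sum(numerator) / len(numerator)) / denominator_mean


params = {"numerator_list": [1, 3, 6], "denominator_list": [12]}

# Twelve made-up monthly sales values, oldest first.
sales = [100.0] * 9 + [200.0] * 3

ratios = {
    f"ratio_{n}_on_{d}": interval_ratio(sales, n, d)
    for n in params["numerator_list"]
    for d in params["denominator_list"]
}
```

A ratio above 1 indicates that the recent months are higher than the 12-month average.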

Weights Model

Variables to customize the weight model:

"model": {
    "target": "cpc_price_log",
    "id_col": "cpc",
    "SHAP_VISUALS": [
      "cpc_volume_log"
    ],
    "shared_features": {
      "numerical_features": [
        "cpc_volume_log",
        "cpc_revenue_share_wrt_grp_family_revenue",
        "rev_outside_family_log",
        "n_products_per_customer",
        "n_customers_per_product",
        "historical_unit_price_coalesce_ratio_on_12",
        "historical_sales_coalesce_ratio_on_12",
        "IMPURITY_ratio",
        "COMPONENT_ratio"
      ],
      "categorical_features": [
        "incoterms",
        "manufacturing_plant",
        "product_group",
        "country_shipto",
        "end_use",
        "gbu_customer_seg"
      ],
      "ordinal_features": {
        "group_volume_but_cpc_label": [
          "0_one_cpc",
          "1_small",
          "2_medium",
          "3_big",
          "4_top"
        ]
      }
    },
"family_features": {
      "Sulfosuccinate_Sulfosuccinamate": {
        "numerical_features": [
          "n_competitors"
        ],
        "categorical_features": [],
        "ordinal_features": {}
      },
      "Specialty_Monomers": {
        "numerical_features": [
          "SOLVENT_ratio"
        ],
        "categorical_features": [],
        "ordinal_features": {}
      },
      "Alkoxylates": {
        "numerical_features": [
          "SOLVENT_ratio"
        ],
        "categorical_features": [
          "lip_2",
          "chemistry"
        ],
        "ordinal_features": {}
      },
      "Phosphate_Esters": {
        "numerical_features": [
          "SOLVENT_ratio"
        ],
        "categorical_features": [],
        "ordinal_features": {}
      },
      "Guars": {
        "numerical_features": [
          "SOLVENT_ratio"
        ],
        "categorical_features": [],
        "ordinal_features": {}
      },
      "Amines": {
        "numerical_features": [
          "SOLVENT_ratio",
          "COMPONENT_ratio"
        ],
        "categorical_features": [],
        "ordinal_features": {}
      }
    }
  },

These variables are used in the compute Weighting dataset recipe.
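The way "shared_features" and "family_features" combine for one family can be sketched as follows. The function features_for_family is an assumption for the example, and the configuration is abridged. Because a feature such as "COMPONENT_ratio" appears both in the shared list and in the Amines list, the sketch removes duplicates; whether the recipe does the same is not stated here.

```python
def merge_unique(first, second):
    """Concatenate two lists, dropping duplicates while keeping the order."""
    return list(dict.fromkeys(first + second))


def features_for_family(model_config, family):
    """Merge the shared features with the family-specific ones."""
    shared = model_config["shared_features"]
    specific = model_config["family_features"].get(family, {})
    return {
        "numerical_features": merge_unique(
            shared["numerical_features"], specific.get("numerical_features", [])
        ),
        "categorical_features": merge_unique(
            shared["categorical_features"], specific.get("categorical_features", [])
        ),
        "ordinal_features": {
            **shared["ordinal_features"],
            **specific.get("ordinal_features", {}),
        },
    }


# Abridged version of the "model" variable.
model_config = {
    "shared_features": {
        "numerical_features": ["cpc_volume_log", "COMPONENT_ratio"],
        "categorical_features": ["incoterms"],
        "ordinal_features": {"group_volume_but_cpc_label": ["0_one_cpc", "1_small"]},
    },
    "family_features": {
        "Amines": {
            "numerical_features": ["SOLVENT_ratio", "COMPONENT_ratio"],
            "categorical_features": [],
            "ordinal_features": {},
        }
    },
}
amines_features = features_for_family(model_config, "Amines")
```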

Similarity Model

Variables linked to the initialization of the similarity model:

"target": "cpc_price_log",
"id_col": "cpc",
"cohort_size_cap": 20,
"min_cohort_for_pricing": 3,
"max_cohort_used_for_pricing": 10,
"min_impact_for_pricing_euro": 15000,
"price_recommendation_cap": 0.3,
"local_price_recommendation_cap": {
    "udel": 0.2,
    "radel": 0.2,
    "veradel": 0.2,
    "amodel": 0.2,
    "ketaspire": 0.2,
    "ryton": 0.2,
    "torlon": 0.2,
    "fluids": 0.2,
    "pvdc": 0.2,
    "halar": 0.2,
    "ixef": 0.2,
    "kalix": 0.2,
    "solef": 0.2,
    "tecnoflon_fkm": 0.2,
    "tecnoflon_ffkm": 0.2
},
"features_weight_zero": [
  "cpc_volume_log",
  "group_volume_but_cpc_label"
],
"match_percentage_similarity_threshold": 0,
"sim_threshold_black_list": [],
"match_percentage_cols": {
   "shared": [
      "country_shipto",
      "end_use",
      "gbu_customer_seg",
      "product_group"
   ]
},

These variables are used in the compute similarity dataset and compute price recommendation recipes.
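As an example of how the two cap variables could interact, the sketch below clamps a recommended price using the family's local cap when one exists and the global cap otherwise. It assumes the cap is a relative gap to the current price and that the keys of "local_price_recommendation_cap" are family names; the function cap_recommendation and the prices are hypothetical.

```python
def cap_recommendation(current_price, recommended_price, config, family=None):
    """Clamp the recommended price so that its relative gap to the current
    price does not exceed the applicable cap (assumed to be relative)."""
    cap = config["local_price_recommendation_cap"].get(
        family, config["price_recommendation_cap"]
    )
    lower = current_price * (1 - cap)
    upper = current_price * (1 + cap)
    return min(max(recommended_price, lower), upper)


config = {
    "price_recommendation_cap": 0.3,
    "local_price_recommendation_cap": {"udel": 0.2},
}

# Made-up prices: a +50% recommendation is cut back to the applicable cap.
global_capped = cap_recommendation(100.0, 150.0, config)
local_capped = cap_recommendation(100.0, 150.0, config, family="udel")
within_cap = cap_recommendation(100.0, 110.0, config)
```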

Hard boundaries

Variables to apply several rules on comparables:

"hard_boundaries": {
    "Sulfosuccinate_Sulfosuccinamate": [
      "product_group",
      "manual_region_SS"
    ],
    "Specialty_Monomers": [
      "product_group",
      "manual_region_SM"
    ],
    "Alkoxylates": [
      "lip_2",
      "chemistry"
    ],
    "Phosphate_Esters": [
      "manual_region_Ph_Esters",
      "COMPONENT_nb_Ph_Esters"
    ],
    "Guars": [
      "product_group",
      "manual_region_Guars"
    ],
    "Amines": [
      "product_group"
    ]
  },
  "hard_boundaries_inverse": [
    "shipto_code",
    "soldto_code"
  ],

  "volume_hard_boundaries": {
   "Sulfosuccinate_Sulfosuccinamate": {
      "flag": 1,
      "threshold": 10
    },
    "Sulfosuccinates_Healthcare": {
      "flag": 1,
      "threshold": 10
    },
    "Specialty_Monomers": {
      "flag": 1,
      "threshold": 10
    },
    "Alkoxylates": {
      "flag": 1,
      "threshold": 10
    },
    "Phosphate_Esters": {
      "flag": 1,
      "threshold": 10
    },
    "Guars": {
      "flag": 1,
      "threshold": 10
    },
    "Amines": {
      "flag": 1,
      "threshold": 10
    },
    "Solutions_Polymers": {
      "flag": 0,
      "threshold": 10
    },
    "Esters": {
      "flag": 0,
      "threshold": 10
    }
  }

These variables are used in the compute similarity dataset recipe.
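One plausible interpretation of these rules is sketched below: a comparable must share the target's values on the family's "hard_boundaries" columns and must differ from it on every "hard_boundaries_inverse" column. Both the interpretation and the function is_valid_comparable are assumptions for illustration, and the sample values are made up.

```python
def is_valid_comparable(target, candidate, family, config):
    """Check a candidate comparable against the hard boundaries of a family."""
    same_on = config["hard_boundaries"].get(family, [])
    differ_on = config["hard_boundaries_inverse"]
    # Assumption: equal on every hard-boundary column...
    shares_boundaries = all(target[col] == candidate[col] for col in same_on)
    # ...and different on every inverse column.
    is_distinct = all(target[col] != candidate[col] for col in differ_on)
    return shares_boundaries and is_distinct


config = {
    "hard_boundaries": {"Amines": ["product_group"]},
    "hard_boundaries_inverse": ["shipto_code", "soldto_code"],
}

target = {"product_group": "PG1", "shipto_code": "S1", "soldto_code": "T1"}
other_customer = {"product_group": "PG1", "shipto_code": "S2", "soldto_code": "T2"}
other_group = {"product_group": "PG2", "shipto_code": "S2", "soldto_code": "T2"}
same_customer = {"product_group": "PG1", "shipto_code": "S1", "soldto_code": "T1"}

checks = {
    "other_customer": is_valid_comparable(target, other_customer, "Amines", config),
    "other_group": is_valid_comparable(target, other_group, "Amines", config),
    "same_customer": is_valid_comparable(target, same_customer, "Amines", config),
}
```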

Adjustment Model

Variables used for the volume and the group volume adjustment:

"adjustment_model": {
    "volume_feature": "cpc_volume_log",
    "fit_curve": true,
    "group_vol_labels": [
      "0_one_cpc",
      "1_small",
      "2_medium",
      "3_big",
      "4_top"
    ],
    "perform_group_adjustment": false,
    "small_volume_weight": 1.8,
    "big_volume_weight": 0.8,
    "small_volume_q": 0.2,
    "big_volume_q": 0.9,     
    "group_adjustments": {
      "default": {
        "0_one_cpc": 0,
        "1_small": 0,
        "2_medium": 0,
        "3_big": 0,
        "4_top": 0
      }
    }
  },

These variables are used in the compute adjust results recipe.

Cross-validation

Variables to activate cross-validation and to use the fine-tuned parameters for the LGBM model.

Default params are defined for families that have not been optimized yet.

"run_cross_val": false,
  "use_hyper_params": true,
  "families_to_optimize": [
    "Sulfosuccinate_Sulfosuccinamate",
    "Phosphate_Esters",
    "Alkoxylates",
    "Amines"
  ],
  "default_params": {
    "n_estimators": 100,
    "max_depth": 6,
    "learning_rate": 0.1,
    "min_child_samples": 10,
    "R2 score": 0
  },
  "cv_params": {
    "lgb": {
      "n_estimators": [
        25,
        50,
        100
      ],
      "max_depth": [
        3,
        5,
        9
      ],
      "learning_rate": [
        0.05,
        0.1
      ],
      "min_child_samples": [
        10,
        20,
        50
      ]
    }
  }

These variables are used in the compute optimized hyperparameters recipe.
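To give an idea of the size of the search space, the sketch below expands the "lgb" grid into individual parameter sets and falls back to "default_params" for a family without tuned values. The helper names and the tuned dictionary are assumptions, the "R2 score" entry is left out of the defaults for brevity, and the actual recipe may rely on a library search utility instead.

```python
from itertools import product


def expand_grid(param_grid):
    """Yield one parameter dict per combination of the grid values."""
    names = list(param_grid)
    for values in product(*(param_grid[name] for name in names)):
        yield dict(zip(names, values))


def params_for_family(family, tuned_params, default_params):
    """Use tuned parameters when available, otherwise fall back to the defaults."""
    return tuned_params.get(family, default_params)


cv_params = {
    "lgb": {
        "n_estimators": [25, 50, 100],
        "max_depth": [3, 5, 9],
        "learning_rate": [0.05, 0.1],
        "min_child_samples": [10, 20, 50],
    }
}
default_params = {
    "n_estimators": 100,
    "max_depth": 6,
    "learning_rate": 0.1,
    "min_child_samples": 10,
}

candidates = list(expand_grid(cv_params["lgb"]))

# Hypothetical tuning result for a single family.
tuned_params = {"Amines": candidates[0]}
guars_params = params_for_family("Guars", tuned_params, default_params)
```

The grid above yields 3 x 3 x 2 x 3 = 54 candidate parameter sets per family.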

Metrics variables

"metrics_dict" contains the global variables used for monitoring the project through metrics and checks.

"metrics_dict": {
    "project_error_level": ""
  }

"manual_files_checks" lists the families checked for each of the manual files of the project.

  "manual_files_checks": {
    "regions": [
      "amodel",
      "ryton",
      "tecnoflon_ffkm"
    ]
  },

External variables update 

Some of the variables of the project can be updated in an autonomous way by the users. This allows them to change some of the business rules without any action required from a developer or data scientist. For more information on this process, please refer to the dedicated documentation here.

To define which variables are in the scope of these updates, and to give users enough information, we use a dedicated dictionary within the project variables themselves, as in the example below:

"external_variables_update": {
    "minimal_impact_threshold": {
      "technical_path": "model.min_impact_for_pricing_euro",
      "input_type": "float",
      "description": "Impact threshold under which the price recommendation for a CPC will not considered. Applies to both positive and negative impacts. Only one numerical value should be input"
    },
    "recommendation_cap": {
      "technical_path": "model.price_recommendation_cap",
      "input_type": "float",
      "description": "Absolute gap that can not be exceeded between the original and recommended price, applicable to both negative and positive values."
    },
    "included_families": {
      "technical_path": "families_in_scope",
      "input_type": "list",
      "description": "List of the families included in the run (based on the product_family_h4 in our data) "
    },
    "all_validated_families": {
      "technical_path": "all_validated_families",
      "input_type": "list",
      "description": "List of all the families that have been validated and are available to be included in a run (based on the product_family_h4 in our data)"
    },
    "hard_boundaries": {
      "technical_path": "hard_boundaries",
      "input_type": "dict",
      "description": "Dict of hard-boundaries"
    }
  }

These variables are used in the Parameters_dataset recipe.
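The role of "technical_path" and "input_type" can be illustrated with a small sketch that converts a user input and writes it at the dotted path inside the project variables. The function apply_external_update and the CASTS mapping are assumptions made for the example, not the code of the Parameters_dataset recipe.

```python
# Assumed mapping from the declared input_type to a Python conversion.
CASTS = {"float": float, "list": list, "dict": dict}


def apply_external_update(variables, update_spec, name, raw_value):
    """Write `raw_value` at the dotted `technical_path` declared for `name`."""
    spec = update_spec[name]
    value = CASTS[spec["input_type"]](raw_value)
    *parents, leaf = spec["technical_path"].split(".")
    node = variables
    for key in parents:
        node = node[key]  # walk down to the parent dictionary
    node[leaf] = value
    return variables


# Abridged project variables and update specification.
variables = {
    "model": {"min_impact_for_pricing_euro": 15000},
    "families_in_scope": ["Amines"],
}
update_spec = {
    "minimal_impact_threshold": {
        "technical_path": "model.min_impact_for_pricing_euro",
        "input_type": "float",
    },
    "included_families": {
        "technical_path": "families_in_scope",
        "input_type": "list",
    },
}

apply_external_update(variables, update_spec, "minimal_impact_threshold", "20000")
apply_external_update(variables, update_spec, "included_families", ["Amines", "Guars"])
```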

Output filter dictionary

In this step, some CPCs can be filtered based on conditions passed by a dictionary variable, so that they do not appear in the front-end dataset displayed in the Qlik dashboard:

"output_filters_dict": {
    "Amines": {
       "grp_of_activities": {
               "operator": "==",
               "value": "CSAGR",
               "type_filter": "keep"
       }
    }
}

The filter is currently applied only to the Amines family, as shown above, and only the CPCs linked to the Agro market are retained in the output data.
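A minimal sketch of how such a dictionary could be applied is given below. The function apply_output_filters, the set of supported operators and the handling of any "type_filter" other than "keep" are assumptions; the family column name is taken from the "family_key" identifier, and the rows are made up.

```python
import operator

# Assumed set of supported comparison operators.
OPERATORS = {"==": operator.eq, "!=": operator.ne, ">": operator.gt, "<": operator.lt}


def apply_output_filters(rows, filters, family_column="gbu_product_family"):
    """Apply each family's column filters; other families pass through unchanged."""
    kept = []
    for row in rows:
        keep_row = True
        for column, rule in filters.get(row[family_column], {}).items():
            matches = OPERATORS[rule["operator"]](row[column], rule["value"])
            # "keep" retains matching rows; any other type_filter drops them (assumption).
            wanted = matches if rule["type_filter"] == "keep" else not matches
            keep_row = keep_row and wanted
        if keep_row:
            kept.append(row)
    return kept


output_filters_dict = {
    "Amines": {
        "grp_of_activities": {"operator": "==", "value": "CSAGR", "type_filter": "keep"}
    }
}

rows = [
    {"cpc": "A", "gbu_product_family": "Amines", "grp_of_activities": "CSAGR"},
    {"cpc": "B", "gbu_product_family": "Amines", "grp_of_activities": "OTHER"},
    {"cpc": "C", "gbu_product_family": "Phosphate_Esters", "grp_of_activities": "OTHER"},
]
front_end_rows = apply_output_filters(rows, output_filters_dict)
```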

ICM ratios dictionary

Variables to activate the use of ICM ratios. 

"ICM_features": {
    "use_ICM_features": false,
    "interval_sizes": [1, 3, 6, 9, 12]
}