1. Objectives of the document:
This document provides a technical overview of the solution delivered by the BigData & Analytics team, inspired by the previous analytics project developed by D3S (https://wiki.solvay.com/display/BDA/PCM+-+Predictive+Credit+Management).
It explains the building blocks of the source code and the methodology followed to obtain:
- insightful predictions for cash recovery (Data)
- strategies to prioritize document lots in the user interface
- an end-to-end application to manage daily cash collection activities
2. General presentation of the solution:
The main business stake is to increase overdue coverage with the existing task force. As of today, dunning and pre-dunning actions focus on the largest outstanding amounts, leaving aside smaller accounts (below a threshold). Pre-dunning includes some additional rules applied by the cash collection teams through a time-consuming manual process.
Cash collection is steered with End of Month KPIs. Although not necessarily representative of the cost of working capital, EOM metrics are relevant as they are fully aligned with other business steering indicators. Predictive analytics are a way forward, especially to better address the smaller accounts on which the overdue rate is higher.
Figure 1: Objectives and core principles
Functional overview
Figure 1bis: functional overview
Machine learning methodology via Dataiku
The predictive solution leverages machine learning technology. A model is first trained on payment history to learn customer behavior based on all available characteristics. For new customers, the model infers behavior from the available data (country, currency, sector, invoice characteristics, etc.).
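As an illustration only (this is not the production Dataiku flow), here is a minimal sketch of this train-then-score approach with scikit-learn; the file names and column names (payment_history.csv, country, currency, sector, amount, paid_late) are hypothetical:

    # Minimal sketch, not the production flow. File and column names are hypothetical.
    import numpy as np
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.preprocessing import OrdinalEncoder

    history = pd.read_csv("payment_history.csv")    # closed documents with known payment behavior
    new_docs = pd.read_csv("open_documents.csv")    # open documents to score

    cat_cols = ["country", "currency", "sector"]
    encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
    X_train = np.hstack([encoder.fit_transform(history[cat_cols]),
                         history[["amount"]].to_numpy()])
    y_train = history["paid_late"]                  # 1 if the document was paid after its due date

    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)

    # New customers are scored from the same generic characteristics only
    X_new = np.hstack([encoder.transform(new_docs[cat_cols]),
                       new_docs[["amount"]].to_numpy()])
    new_docs["late_payment_proba"] = model.predict_proba(X_new)[:, 1]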
Figure 2: Machine learning description
Data engineering orchestration via Talend
Several Talend pipelines cover the whole project at each iteration of the update. They interact with the various components of the project (SAP BW, SFTP server, Google BigQuery, Google Cloud Storage, Dataiku Data Science Studio, Google Cloud Functions):
- F0000_Orch_Flow_Training 0.2 : Main Flow
- F0201_ExtractDelta_From_SFTP_WBP 0.1 : Extract the data from SAP BW through the Talend server
- F0202_Prepare_Files 0.1 : Prepare metadata and file schemas
- F0203_Push_to_BQ_Step1_Trx 0.1 / F0204_Push_to_BQ_Step1_MDM 0.1 : Transfer master data and transactional data to Google BigQuery datasets
- F503_Launch Queries _from GCS 0.1 : Orchestrate the modifications and preparation of BigQuery datasets in the GCP project, using .sql files stored in a Google Cloud Storage bucket
- F600_Run_Scenario_on_DSS 0.1 : Send a daily trigger to Dataiku to run the prediction and strategy computation (see the sketch after this list)
- F700_Call_Cloud_Function_to update_CloudSQL 0.1 : Use a Cloud Function component to update the data stored in Cloud SQL through an API
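In production the daily trigger is sent by the Talend pipeline F600_Run_Scenario_on_DSS; purely as an illustration, a minimal sketch of an equivalent call with the Dataiku public Python client, where the host, API key, project key and scenario id are placeholders:

    # Minimal sketch of the daily trigger normally sent by F600_Run_Scenario_on_DSS.
    # Host, API key, project key and scenario id are placeholders, not the real values.
    import dataikuapi

    client = dataikuapi.DSSClient("https://dss.example.com", "API_KEY")
    scenario = client.get_project("PCM").get_scenario("DAILY_PREDICTION")
    scenario.run_and_wait()   # returns once the prediction and strategy scenario has finished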
Figure 3: List of all Talend pipelines used for the project
Access control and administration of the pipelines are managed by the SBS BDA Analytics team, with a limited number of Talend licences.
User interface exposure via AppEngine
A simple web app has been developed to monitor and prioritize cash collections. For UI details, see the documentation in Confluence. The source code is available in the version control tool (Bitbucket repository) and through the Google Cloud SDK, provided the user has the required credentials.
Figure 4 : Web interface with main features
3. Workflow description
Several building blocks are part of, or interact with, the solution:
- BW/SAP server : interactions in and out of the SAP system
- Google Sheets : end-user interaction with the project
- Google BigQuery : data lake storage for large volumes of data such as the long-term document history
- Dataiku : machine learning processing and priority computation
- Cloud SQL / AppEngine project : hosting of the transactional UI
These building blocks are linked through the functional steps described below.
Figure 5: Building blocks and interactions of the whole solution
4. Workflow details
Step 1. Full & Daily raw data ingestion
Figure 6 : Schematic description
Through Data Transfer Processes in SAP BW and SFTP connectors, all the raw data retrieved from SAP for the project is transferred from the SAP environment to GBQ datasets in the GCP project corresponding to each environment (a minimal load sketch is given after the list below).
Several details needed to operate and upgrade this workflow:
- DTP reference in BW :
- DBFIAR20 -> OH_PCM_01 – Delta
- DBFIAR21 -> OH_PCM_01 – FULL
- SFTP host and login and/or location :
- See in Talend Documentation
- GBQ dedicated dataset :
- Raw_data
- GCP Dedicated project :
- List of the retrieved raw data and detailed variables: More details in the following Google Spreadsheet: Data Material
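The load itself is handled by the Talend pipelines listed in section 2; as an illustration only, a minimal sketch of pushing one raw extract from GCS into the Raw_data dataset with the BigQuery Python client, where project, bucket, file and table names are placeholders:

    # Minimal sketch of loading one raw extract into the Raw_data dataset.
    # Project, bucket, file and table names are placeholders.
    from google.cloud import bigquery

    client = bigquery.Client(project="pcm-dev-project")
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,                       # the schema is normally prepared by F0202_Prepare_Files
        write_disposition="WRITE_TRUNCATE",    # full load; a delta load would use WRITE_APPEND
    )
    load_job = client.load_table_from_uri(
        "gs://pcm-raw-bucket/OH_PCM_01_full.csv",
        "pcm-dev-project.Raw_data.documents",
        job_config=job_config,
    )
    load_job.result()                          # wait for the load to finish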
Step 2. CCT raw data ingestion
Figure 7 : Schematic description
Several pieces of information are manually entered each month in several tabs of a collaborative spreadsheet. This information is transformed into KPIs for the forecast and strategy computation (a minimal computation sketch is given after the lists below):
- Number of times - Postponed Payment in the last 12 months
- Number of times - Other Reasons
- Last twelve months - recurrent
- Last six months - recurrent
- Last three months - recurrent
- Correct GBU
- And others
Several details needed to operate and upgrade this workflow:
- Google Spreadsheet embedding the raw data : Diagnosis Report - Google Spreadsheet
- GBQ dedicated dataset : extra_data and tables for each area ()
- Talend pipeline : (To be )
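As an illustration only, a minimal sketch of how one of these KPIs could be derived once a spreadsheet tab has been loaded into a DataFrame; the file name, column names and reason label are hypothetical:

    # Minimal sketch of deriving "Number of times - Postponed Payment in the last 12 months".
    # File name, column names and the reason label are hypothetical.
    import pandas as pd

    diagnosis = pd.read_csv("diagnosis_report.csv", parse_dates=["month"])
    last_12m = diagnosis[diagnosis["month"] >= diagnosis["month"].max() - pd.DateOffset(months=12)]

    postponed_12m = (
        last_12m[last_12m["reason"] == "Postponed Payment"]
        .groupby("customer")
        .size()
        .rename("nb_postponed_payment_12m")
        .reset_index()
    )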
Step 3. Iterative orchestration and data preparation to GBQ
This step handles the daily and on-demand updates of the master data and transactional data stored in GBQ. Talend communicates with Google Cloud Storage to launch GBQ saved queries that organize the extract-transform-load process.
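As an illustration of this mechanism (the production version runs through the F503 Talend pipeline), a minimal Python sketch that reads one saved .sql file from Google Cloud Storage and runs it on BigQuery; project, bucket and file names are placeholders:

    # Minimal sketch: read a saved .sql file from GCS and run it on BigQuery.
    # Project, bucket and file names are placeholders.
    from google.cloud import bigquery, storage

    sql = (
        storage.Client(project="pcm-dev-project")
        .bucket("pcm-sql-bucket")
        .blob("prepare_master_data.sql")
        .download_as_text()
    )
    bigquery.Client(project="pcm-dev-project").query(sql).result()   # wait for completion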
Figure 8 : List of sql files for Talend transformations
Step 4. Model Design
The model is built in Dataiku's Data Science Studio platform, a user-friendly interface built on Python.
This SaaS tool provides several data connectors to external databases and storage, such as Google Cloud Storage and other GCP services.
The model object (New RF on train sample) is computed on demand by a data scientist with a dedicated Python library, scikit-learn, stored as a pickle object inside the platform, and made available to other Dataiku projects depending on the access policy.
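As an illustration only, a minimal sketch of persisting such a model as a pickle object from inside a DSS Python recipe; the managed folder name is a placeholder, and the exact sharing mechanism between projects depends on the access policy:

    # Minimal sketch, to be run inside a DSS Python recipe. The managed folder
    # name "models" is a placeholder, not the actual project setup.
    import pickle
    import dataiku
    from sklearn.ensemble import RandomForestClassifier

    # "model" stands for the RandomForest trained on the payment history (see section 2);
    # only the persistence step is sketched here.
    model = RandomForestClassifier(n_estimators=100, random_state=0)

    folder = dataiku.Folder("models")
    with folder.get_writer("pcm_random_forest.pkl") as writer:
        writer.write(pickle.dumps(model))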
Figure 9 : Model details on dataiku.
Step 5. Validation & Accuracy computation
Each version of the model is evaluated automatically by the data science platform. The split between train and test samples is based on the "net due date" variable: the oldest documents (approximately M-48 to M-7) are used for training, the newest (M-6 to M-1) for testing.
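As an illustration only, a minimal sketch of this time-based split and accuracy computation; the file name and column names (net_due_date, paid_late, amount, days_granted) are hypothetical:

    # Minimal sketch of the train/test split on "net due date": oldest documents
    # for training, most recent six months for testing. Names are hypothetical.
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score

    documents = pd.read_csv("payment_history.csv", parse_dates=["net_due_date"])
    cutoff = documents["net_due_date"].max() - pd.DateOffset(months=6)

    train = documents[documents["net_due_date"] < cutoff]
    test = documents[documents["net_due_date"] >= cutoff]

    features = ["amount", "days_granted"]              # hypothetical numeric features
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(train[features], train["paid_late"])
    print("accuracy:", accuracy_score(test["paid_late"], model.predict(test[features])))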
Figure 10: Accuracy computation on dataiku.
Step 6. Daily prediction
Each working day, at approximately 10:30 for the dev environment and 11:00 for the other environment, the prediction based on document data is released.
Figure 11: Daily prediction process
Step 7. Strategies specifications
Below is the strategy design process:
Figure 13: Strategy specification
The link for the source spreadsheet:
Step 8. Strategies daily computation
Figure 14 : Strategy Daily computation
Here is the link to a useful document for the design:
Step 9. User Roles specifications for Back End :
Each user of the application has access to a specific documents portfolio; for a standard user it is split by region and related to a dedicated group.
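As an illustration only (the actual back-end schema may differ), a minimal sketch of this portfolio split by role, with hypothetical role, region and group values:

    # Minimal sketch of the portfolio split by role. Role names, regions and
    # groups are hypothetical, not the actual Cloud SQL schema.
    def documents_for(user, documents):
        """Return the subset of documents a user is allowed to work on."""
        if user["role"] == "admin":
            return documents                          # admins see every document
        # standard users only see their region and their collection group
        return [d for d in documents
                if d["region"] == user["region"] and d["group"] == user["group"]]

    docs = [{"id": 1, "region": "EMEA", "group": "CCT_01"},
            {"id": 2, "region": "NA",   "group": "CCT_02"}]
    print(documents_for({"role": "standard", "region": "EMEA", "group": "CCT_01"}, docs))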
Figure 15: User roles specification
More information in the dedicated spreadsheet on the Dev environment:
Step 10. Daily update / archive of the UI
Figure 15bis : Daily update
For more detail, please see the documentation: https://docs.google.com/document/d/1HZU0aQw4uoqxS4mkg5f6ybWZS4cYwCr2QJ_2t23fMpE/edit?usp=sharing
Step 11. Real time actions archiving
Figure 16: Actions archiving
Step 12. Real time dashboard
Link to the dashboard: https://datastudio.google.com/u/0/reporting/1l7Utyq5GIaVdRCbMpkJrKIbMjc5OFsEo/page/SPkf
Figure 17: Dashboard
Step 13. Update of BW Dashboard (pending)
Figure 18: Update of BW reporting
5. Contacts:
For maintenance:
Figure 19 : Maintenance rules
For editable documentation:
https://docs.google.com/presentation/d/1bKRxVjK6c34jx980FuduR_Hnlb7mZaCDPam5hWNLLvQ/edit?usp=sharing

















