Description
The data source is the Pure server (ftp.credit360.com), from which Talend loads the files via SFTP (for more detail on PURE, click on this link). The Talend project EHS_PURE loads these data first and stores them in the GCP project solvay-ind-conso-[env], in the dataset ehs_pure_[env]_mig
Source FTP server = "ftp.credit360.com" / User = "solvay", authenticated with a private key file that is kept on the GCP remote engine in the folder \DATA\DEV\EHS\Pure\InOut\pure_sftp_ssh_key (controlled by the context variable l_CNX_EHS_PURE_SFTP_private_key)
Talend Project = EHS_PURE
Talend jobs = F004_Connect_to_SFTP + F005_Data_Prep, which are not part of this project.
Load the following file to /DATA/DEV/EHS/Pure/Tmp
GCS folder = ehs_pure_dev_mig
Talend Plan = PL_EHS_PURE_HOURL_New, which runs every hour in PROD
Then the Operation Dashboard project makes these data available to the prj-data-industrial-dash-[dev] project by creating views
GCP dataset = solvay-ind-conso-dev.DS_prj_data_industrial_dash
V_core_hd_monthly
V_os_data
V_ps_data
Note: OS = Occupational safety incidents
PS = Process Safety
After that, Talend jobs in the project IND_DASHBOARD generate the FACT tables for TRII and PSE, with separate tables for the site and gbu perspectives.
Tools: Talend
Detail job
- J080_FACT_trii_site
- tJava: checks the input date
- tBigQueryInput1: calculates the rolling last 12 months from os and core_hd_monthly, by site and gbu
- tMap: generates the key and the meta_* data
- tBigQuerySQLRow: clears the FACT table, since it is fully reloaded from the source
- Loads the data into the FACT table
- If the load fails, an email is sent to inform the DataOps team
The same applies to:
- J081_FACT_trii_gbu, where the script in step 2 (tBigQueryInput1) groups by gbu only
- J082_FACT_pse_site and J083_FACT_pse_gbu, which are the same as the trii jobs but use table ps instead of os for PSE (the only difference is the script in step 2)
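The rolling 12-month calculation described in step 2 can be sketched in Python as follows. This is only an illustration under assumptions: the real logic is a BigQuery script inside tBigQueryInput1, and the row fields (site, gbu, month, count, hours) and the rate factor of 1,000,000 are hypothetical, not taken from the job.

```python
from collections import defaultdict
from datetime import date


def month_index(d: date) -> int:
    """Map a date to a running month number so windows are easy to compare."""
    return d.year * 12 + (d.month - 1)


def rolling_12m(incidents, hours, as_of, key="site", factor=1_000_000):
    """Sum incidents and hours over the 12 months ending at as_of, per key.

    incidents: rows like {"site": ..., "gbu": ..., "month": date, "count": int}
    hours:     rows like {"site": ..., "gbu": ..., "month": date, "hours": float}
    All field names and the rate factor are hypothetical.
    """
    end = month_index(as_of)
    start = end - 11  # 12 months, both ends included
    totals = defaultdict(lambda: {"incidents": 0, "hours": 0.0})

    for row in incidents:
        if start <= month_index(row["month"]) <= end:
            totals[row[key]]["incidents"] += row["count"]
    for row in hours:
        if start <= month_index(row["month"]) <= end:
            totals[row[key]]["hours"] += row["hours"]

    result = {}
    for k, t in totals.items():
        # No rate when no hours were recorded, to avoid dividing by zero.
        rate = t["incidents"] * factor / t["hours"] if t["hours"] else None
        result[k] = {**t, "rate": rate}
    return result
```

Calling the function with key="gbu" instead of key="site" gives the gbu perspective, which mirrors the only difference between the site and gbu detail jobs.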
Flow job
- F080_FACT_trii_site
- Set up meta_run_id and the filename of the output file
- Call the detail job and pass parameters such as filename, date
- Call the standard job to upload the files from GCS to ODS
- If everything is OK, update the log.
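The flow job steps above can be sketched as a small orchestration function. This is a hedged illustration, not the Talend implementation: the callables run_detail_job, upload_to_ods and update_log, the run id format and the filename pattern are all hypothetical stand-ins.

```python
import uuid
from datetime import date


def run_flow(job_name, run_date, run_detail_job, upload_to_ods, update_log):
    """Mirror the flow job: set up ids, run the detail job, upload, then log.

    The three callables are hypothetical stand-ins for the Talend child jobs.
    """
    # Step 1: set up meta_run_id and the output filename (formats are assumed).
    meta_run_id = uuid.uuid4().hex
    filename = f"{job_name}_{run_date:%Y%m%d}.csv"

    # Step 2: call the detail job and pass the parameters.
    run_detail_job(filename=filename, run_date=run_date, meta_run_id=meta_run_id)

    # Step 3: call the standard upload job.
    upload_to_ods(filename)

    # Step 4: only reached when the previous steps raised no exception.
    update_log(meta_run_id, status="OK")
    return meta_run_id, filename
```

Because the log update comes last, an exception in the detail job or in the upload leaves the log untouched, which matches the "if everything is OK" condition.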
Access rights
Access to solvay-ind-conso-[env] is required
Source
- ftp.credit360.com → GCP solvay-ind-conso-[env]
Format
- Table
Destination
Location
- Bucket = cs-ew1-prj-data-industrial-dash-dev-staging
- DataOcean GCP = prj-data-dm-dt-[env]
- DM
- prj-data-industrial-dash-dev.DM.FACT_trii_site
- prj-data-industrial-dash-dev.DM.FACT_trii_gbu
- prj-data-industrial-dash-dev.DM.FACT_pse_site
- prj-data-industrial-dash-dev.DM.FACT_pse_gbu
Format
- columnar format
Sizing
Site around 5000 records
GBU around 500 records
Assessment
How to validate that the generated output is valid:
select
  job.job_name,
  job.meta_start_date,
  job.meta_execution_id,
  logs.meta_run_id,
  logs.meta_source_system,
  logs.meta_step,
  logs.meta_status,
  logs.meta_num_lines,
  logs.meta_error_lines
from STG.log_tables logs
join STG.run_jobs job on logs.meta_run_id = job.meta_run_id
where logs.meta_run_id in (SELECT meta_run_id FROM STG.run_jobs order by meta_start_date desc limit 1000)
  and meta_source_system in ('V_ps_data','V_os_data')
  and meta_step = 'ODS to DM'
  and meta_start_date > DATE_SUB(CURRENT_TIMESTAMP(), INTERVAL 5 DAY)
order by job.meta_start_date desc
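The rows returned by this query could be checked programmatically along the following lines. This is a sketch under assumptions: the source does not list the possible meta_status values, so the "OK" value used here is hypothetical, as is the rule that any non-zero meta_error_lines counts as a problem.

```python
def find_problem_rows(rows, ok_status="OK"):
    """Return the log rows that look unhealthy.

    rows: dicts keyed by the column names selected in the query.
    ok_status is a hypothetical value; adjust it to the real status codes.
    """
    problems = []
    for row in rows:
        bad_status = row["meta_status"] != ok_status
        # Treat a missing (None) error count as zero.
        has_errors = (row["meta_error_lines"] or 0) > 0
        if bad_status or has_errors:
            problems.append(row)
    return problems
```

An empty result means every selected run finished with the expected status and no rejected lines. The Sizing figures (around 5000 records for site, around 500 for gbu) can serve as a further sanity check on meta_num_lines.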
Loading
1.1 Incremental Load
Not available
1.2 Full load
The plan PL_TRII_PSE runs at 9:00 AM on days 1, 5, 10, 15, 20, 25 and 30 of the month. There is no context variable for reloading.
1.3 Reloading data
Run the full load again.
1.4 Plan to schedule
Runs at 9:00 AM on days 1, 5, 10, 15, 20, 25 and 30 of the month
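As an illustration of this schedule, the sketch below lists the run times for a given month. It assumes that a run day which does not exist in the month (for example day 30 in February) is simply skipped; the source does not state how the scheduler handles that case. For reference only, the equivalent five-field cron expression would be 0 9 1,5,10,15,20,25,30 * *.

```python
import calendar
from datetime import datetime

RUN_DAYS = (1, 5, 10, 15, 20, 25, 30)  # days of the month, from the plan description
RUN_HOUR = 9  # 9:00 AM


def runs_in_month(year, month):
    """List the scheduled run datetimes for one month.

    Assumption: a run day that does not exist in the month is skipped.
    """
    last_day = calendar.monthrange(year, month)[1]
    return [datetime(year, month, d, RUN_HOUR) for d in RUN_DAYS if d <= last_day]
```

Under this assumption February has six runs instead of seven, so the gap between the run on the 25th and the next run on the 1st is the longest of the cycle.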
1.5 Timing
Expected average loading time: around 5 minutes
Criticality
High/Medium/Low
Logging
select
  job.job_name,
  job.meta_start_date,
  job.meta_execution_id,
  logs.meta_run_id,
  logs.meta_source_system,
  logs.meta_step,
  logs.meta_status,
  logs.meta_num_lines,
  logs.meta_error_lines
from STG.log_tables logs
join STG.run_jobs job on logs.meta_run_id = job.meta_run_id
where logs.meta_run_id in (SELECT meta_run_id FROM STG.run_jobs order by meta_start_date desc limit 1000)
  and meta_source_system in ('V_ps_data','V_os_data')
  and meta_step = 'ODS to DM'
  and meta_start_date > DATE_SUB(CURRENT_TIMESTAMP(), INTERVAL 5 DAY)
order by job.meta_start_date desc


