
Description

Tools: Talend

Detail job

  • J070_Kadiska_to_GCS

  1. Define the dates
    1. "Begin" to "End": the time window to select
    2. "Stop": the time at which to stop
    3. "Now": today's time
  2. Log in to the Kadiska API on URL: https://app.kadiska.com/api/v1/config/clients/client:5b77758db8/tokens (l_VAR_kadiska_url_token) and keep the token as a global variable
  3. Start looping (the loop logic is sketched after this list)
  4. Extract data from the Kadiska API on URL: https://app.kadiska.com/api/v1/query (l_VAR_kadiska_url_query), controlled by the variables described in 1.1 Incremental Load
  5. In case [no more data] (there is no data and the End time >= the Stop time), stop the loop
  6. In case [more data than the limit] (there is data and the number of lines = the limit), add the number of lines to the offset
  7. In case [no more data during the interval, go to the next hour] (there is data with fewer lines than the limit, or there is no data and the End time < the Stop time), set offset = 0 and move Begin and End forward by 1 hour (the interval)
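For readability, the branch logic of steps 5-7 can be sketched in Python as below. This is an illustration, not the Talend job itself: the payload field names, the LIMIT value, and the fetch_page helper are assumptions.

import requests

# Sketch of the J070_Kadiska_to_GCS paging loop (steps 3-7 above).
# The query payload shape and token handling are assumed for illustration.
QUERY_URL = "https://app.kadiska.com/api/v1/query"   # l_VAR_kadiska_url_query
INTERVAL_MS = 3600000                                # l_VAR_kadiska_q_interval (1 hour)
LIMIT = 5000                                         # l_VAR_kadiska_q_limit (example value)

def fetch_page(token, begin_ms, end_ms, offset):
    # Hypothetical wrapper around the query endpoint; field names are assumed.
    payload = {"begin": begin_ms, "end": end_ms, "offset": offset, "limit": LIMIT}
    resp = requests.post(QUERY_URL, json=payload,
                         headers={"Authorization": "Bearer " + token}, timeout=60)
    resp.raise_for_status()
    return resp.json().get("data", [])

def extract(token, begin_ms, stop_ms):
    end_ms, offset = begin_ms + INTERVAL_MS, 0
    while True:
        rows = fetch_page(token, begin_ms, end_ms, offset)
        # ... write rows to the staging file here ...
        if not rows and end_ms >= stop_ms:   # step 5: no more data and past Stop
            break
        elif len(rows) == LIMIT:             # step 6: page within the same window
            offset += LIMIT
        else:                                # step 7: move to the next 1-hour interval
            offset = 0
            begin_ms += INTERVAL_MS
            end_ms += INTERVAL_MS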

Flow job

  • F070_Kadiska_rum_to_ODS

  1. Define variables such as the file name
  2. Get meta_run_id
  3. Define meta_file_name for the incremental load or a reload, depending on l_VAR_kadiska_q_reload; the name is fixed to KADISKA_RUM[_RELOAD] (see the sketch after this list)
  4. Get the last successful load from the table incremental_loading
  5. Call the detail job
  6. Load to ODS
  7. Update the last successful date in the incremental_loading table
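The meta_file_name and run-window resolution in steps 3-4 can be sketched as follows; the helper is illustrative and assumes l_VAR_kadiska_q_reload is either "incremental" or an end time in epoch milliseconds.

from datetime import datetime, timezone

# Sketch of steps 3-4: pick meta_file_name and the stop time from
# l_VAR_kadiska_q_reload; last_process_ms comes from STG.incremental_loading.
def resolve_run(reload_value, last_process_ms):
    if reload_value == "incremental":
        meta_file_name = "KADISKA_RUM"
        stop_ms = int(datetime.now(timezone.utc).timestamp() * 1000)  # "Now"
    else:
        meta_file_name = "KADISKA_RUM_RELOAD"
        stop_ms = int(reload_value)  # reload end time in epoch milliseconds
    return meta_file_name, last_process_ms, stop_ms

# Example: an incremental run starting from the last successful load
print(resolve_run("incremental", 1724371200000))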

Access rights

Access is requested from the network team: bilal.seddiki@solvay.com (manager = 

Source

https://app.kadiska.com/api/v1/query

Format

JSON

Destination

Location

  • Bucket = cs-ew1-prj-data-dm-dt-[dev]-staging/KADISKA/
  • Data Ocean GCP = prj-data-dm-dt-[env]
  • STG Table names =
    • prj-data-dm-dt-prod.STG.STG_KDK_0000_0000_F001_I_H_rum
  • ODS Table names =
    • prj-data-dm-dt-prod.ODS.ODS_KDK_0000_F001_I_H_rum
  • DPL View names =  
    •  prj-data-dm-dt-[env].DPL.V_FACT_hlx_work_order

Format

columnar format

Sizing


Assessment

Loading

1.1 Incremental Load

It is controlled by the following variables (example values are sketched after this list):

  • l_VAR_kadiska_q_condition: normally blank, but it can add an additional filter such as ["=", "watcher_name", ["$","Google Calendar"]],
  • l_VAR_kadiska_q_interval: during project design we agreed on an interval of 1 hour = 3600000 milliseconds
  • l_VAR_kadiska_q_limit: maximum number of records to get from the API on each call
  • l_VAR_kadiska_q_nloop: maximum number of loops
  • l_VAR_kadiska_q_offset: start index of the records in the API query
  • l_VAR_kadiska_q_reload: "incremental" for the incremental load; in case of a reload, change this variable to the end time in epoch format (the starting time is taken from the table incremental_loading where meta_file_name = KADISKA_RUM_RELOAD)
  • l_VAR_kadiska_email_flag: "yes" when Talend should send an email alert when the incremental load is not advancing in the incremental_loading table, which can lead to loading duplicate data
  • l_VAR_kadiska_email_recipient: list of email addresses that receive the alert
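As a reading aid, plausible context values could look like the Python dictionary below; every value is an example, not the production setting.

# Example values only; the real values live in the Talend context.
context = {
    "l_VAR_kadiska_q_condition": ["=", "watcher_name", ["$", "Google Calendar"]],  # optional filter
    "l_VAR_kadiska_q_interval": 3600000,       # 1 hour in milliseconds
    "l_VAR_kadiska_q_limit": 5000,             # example: max records per API call
    "l_VAR_kadiska_q_nloop": 100,              # example: max number of loops
    "l_VAR_kadiska_q_offset": 0,               # paging start index
    "l_VAR_kadiska_q_reload": "incremental",
    "l_VAR_kadiska_email_flag": "yes",
    "l_VAR_kadiska_email_recipient": ["team@example.com"],  # placeholder address
}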

1.2 Full load

N/A

1.3. Reloading data

It is controlled by the following variables:

  • l_VAR_kadiska_q_limit: the number of records to get from the API on each call
  • l_VAR_kadiska_q_nloop: the number of loops for each load (limit x nloop is the maximum number of records in the file loaded to BQ each run)
  • l_VAR_kadiska_q_offset: the index of the first record returned by the API. It starts at 0 and is changed to another value only in specific reload cases
  • l_VAR_kadiska_q_reload: if it is NOT "incremental", the job runs a reload bounded by this parameter, such as [1724371200000]; it is the end time in epoch format (a conversion sketch follows this list). The start date is always taken from the incremental_loading table where meta_file_name = KADISKA_RUM_RELOAD
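To get the end time in epoch format, any converter works; a minimal Python sketch that reproduces the example value above:

from datetime import datetime, timezone

# 2024-08-23 00:00:00 UTC -> 1724371200000 (the example end time above)
end = datetime(2024, 8, 23, tzinfo=timezone.utc)
print(int(end.timestamp() * 1000))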

How to:

1. Stop the scheduled job

2. Change the start date from which the data must be reloaded:

UPDATE STG.incremental_loading

 SET meta_last_process_date = '2021-12-22 00:00:00'   -- the date from which to reload

WHERE meta_file_name = 'KADISKA_RUM_RELOAD' -- identifies the object to reload; the name must end with _RELOAD. Without _RELOAD, the row is used for the incremental load only.

3. Change the context variable l_VAR_kadiska_q_reload to the end date in epoch format

4. Recheck the variables: limit, nloop, offset

5. Run the job until meta_last_process_date for KADISKA_RUM_RELOAD reaches the end time entered in l_VAR_kadiska_q_reload, meaning the reload is complete (a quick check is sketched after these steps)

6. Change the context variable l_VAR_kadiska_q_reload back to "incremental"

7. Reschedule the job
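A quick way to verify step 5, assuming meta_last_process_date has already been read from STG.incremental_loading (both values below are examples):

from datetime import datetime, timezone

# Compare the last processed date with the reload end time (epoch ms).
reload_end_ms = 1724371200000
last_process = datetime(2024, 8, 22, 23, 0, tzinfo=timezone.utc)  # from incremental_loading

if last_process.timestamp() * 1000 >= reload_end_ms:
    print("reload complete: switch l_VAR_kadiska_q_reload back to 'incremental'")
else:
    print("reload still in progress")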

1.4 Plan to schedule

Every hour

1.5 Timing

5 minutes

Criticality

High/Medium/Low

Logging

1. Check the last loading

select * from STG.incremental_loading
where meta_file_name like '%KADISKA_RUM'
order by meta_file_name

2. Check the loading records / error

select job.job_name, job.meta_start_date, logs.meta_run_id, logs.meta_source_system,
  logs.meta_step, logs.meta_status, logs.meta_num_lines, logs.meta_error_lines
from STG.log_tables logs
join STG.run_jobs job on logs.meta_run_id = job.meta_run_id
where logs.meta_run_id in (select meta_run_id from STG.run_jobs order by meta_start_date desc limit 100)
and job_name like '%Kadiska%'
and meta_start_date > DATE_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
order by meta_start_date desc