Runtime object: Function App or Container Job or Container App

The first question to answer is the type of process:

  1. Batch job that runs occasionally, e.g. once a day or every hour → deploy as Azure Function App (single CPU!)
  2. Batch job that needs lots of CPU power → deploy as Azure Container App: Job
  3. Continuous running tasks → deploy as Azure Container App: App


The second question to answer is the scope of the code/repo/function/container.

On the one hand, we could create one repository that covers all 6000 SAP tables we ingest. On the other hand, we could have one repo per SAP table, meaning we would have 6000 repos just for that.

Obviously, neither extreme is desirable. As the repository is the deployment unit, it should contain enough code to deserve a repo of its own; yet because the entire repo is deployed whenever something is changed, it should not be massive either.

The hierarchy of objects is:

  • One environment (e.g. dev) can have multiple Function Apps
  • One Function App has a single repository
  • One Function App can have multiple Functions
  • One Function can be executed multiple times, even in parallel, with different parameters, e.g. we have one Function and it is called 6000 times with the table name as a parameter
  • One Function instance can process multiple tables sequentially or in parallel (see the sketch after this list)
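
As an illustration of the last two points, a minimal sketch; the function name ingest_table and the parallelism settings are placeholders, not our actual code:

    # One Function body that receives the table names as its parameter and
    # processes them either sequentially or in parallel.
    from concurrent.futures import ThreadPoolExecutor


    def ingest_table(table_name: str) -> None:
        """Read one SAP table from the source and produce it (placeholder)."""
        print(f"ingesting {table_name}")


    def run(table_names: list[str], parallel: bool = True) -> None:
        if parallel:
            # One Function instance, many tables processed concurrently.
            with ThreadPoolExecutor(max_workers=8) as pool:
                list(pool.map(ingest_table, table_names))
        else:
            # Or strictly one table after the other.
            for name in table_names:
                ingest_table(name)


    if __name__ == "__main__":
        run(["MARA", "VBAK", "VBAP"], parallel=True)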

For containers the hierarchy is similar, only the wording differs:

  • One environment has one Container App, which is technically the Kubernetes Cluster
  • One Container App has multiple Containers
  • One container has a single repository
  • One container can be executed multiple times, even in parallel
  • One Container can process multiple tables sequentially or in parallel (a sketch of such an entrypoint follows this list)
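
For a Container, the parameterization typically happens via environment variables or command-line arguments. A minimal sketch of such an entrypoint; the variable name TABLE_NAME is an assumption, not a fixed convention:

    import os
    import sys


    def main() -> None:
        # The table to process is handed to the container as an environment
        # variable or as the first command-line argument.
        table_name = os.environ.get("TABLE_NAME") or (sys.argv[1] if len(sys.argv) > 1 else None)
        if not table_name:
            raise SystemExit("No table name given - set TABLE_NAME or pass it as the first argument")
        print(f"container run started for table {table_name}")
        # ... read from the source system and produce the data here ...


    if __name__ == "__main__":
        main()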


Ask the Infrastructure team to create the needed services

We have split the work between the Infrastructure team and the developer in such a way that the Infrastructure team handles the network and the services tightly integrated with it, while the developer has enough freedom to make changes to the individual services using Bicep code.

Concretely, that means Infrastructure owns:

  • Network
  • Function App creation
  • Container App creation
  • Container Registry creation
  • Storage Account creation
  • Virtual machines with all their settings, while the developer provides the install scripts for the applications
  • Key Vault

Hence asking the Infra team to create a new Function App will be the most common request.

The developer owns:

  • The Functions within the Function App with their settings, e.g. RAM, Python version, etc.
  • The Container image and instance with all settings
  • Storage account elements like blob containers and tables
  • The secrets in the Key Vault (in dev only)

Create the repository

Step 1: Create a repository at https://github.com/SQO-SySight, named according to one of these patterns:

  • Ingest-<SourceSystem>: For code that reads data from a source system, e.g. Ingest-StarTek 
  • Project-<Name>: For code that transforms data for a specific project, e.g. Project-CSRD 
  • Transform-<Name>: For code that transforms source data for general consumption, e.g. Transform-SAP-Master-Data-1 

The default branch is master.

Step 2: Copy the basic objects from one of the template repos and customize them. The things that need to be adjusted are described in each template's README.

Step 3: Customize

Apart from writing the code, the infrastructure-related Bicep files should be reviewed to match the actual need. For example, not every Function needs the same amount of memory, managed identities or permissions.


Develop

The code should be developed and debugged on the local laptop, from within the Syensqo network. The Azure VNet and the Syensqo network are peered, and in the Azure development environment the developer has rather wide permissions: all the ones the function/container has, plus more.

The alternative is to have a Windows VM inside the Azure network, but mind the additional cost of that.

Debugging locally is far more efficient than deploying the code and watching it from the outside. It also allows checking the performance, enables profiling and prevents runaway jobs from bringing down the whole environment by consuming all its capacity.
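
For profiling a local debug run, the standard library already suffices. A minimal sketch, where run() stands in for whatever entry point the Function/container actually calls:

    import cProfile
    import pstats


    def run() -> None:
        total = sum(i * i for i in range(1_000_000))  # placeholder workload
        print(total)


    if __name__ == "__main__":
        with cProfile.Profile() as profiler:
            run()
        # Print the 20 most expensive calls by cumulative time.
        pstats.Stats(profiler).sort_stats(pstats.SortKey.CUMULATIVE).print_stats(20)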

General rules

  • Code is written for production and production needs. In production the developer only has permission to view logs; they cannot start anything and cannot modify the data.
  • All programs handle errors gracefully.
  • The program can be started as often as desired - it will not lead to duplicate data or primary key violations (see the sketch after this list).
  • Temporary errors, e.g. source not available, are recovered automatically.
  • Data errors cause the data to be loaded anyway, but flagged as WARN/FAIL. Otherwise the data, e.g. a sales order, would not be loaded and would contribute zero to today's revenue, which is wrong.
  • Programs create three documents
    • Impact/Lineage json
    • Trigger conditions json
    • Schema json
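
A minimal sketch of the re-runnability rule, shown here with sqlite3 and an UPSERT; the actual target system and SQL dialect will differ, but the idea is the same: loading the same rows twice must not create duplicates or primary key violations.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE sales_order (order_id TEXT PRIMARY KEY, amount REAL)")


    def load(rows: list[tuple[str, float]]) -> None:
        # ON CONFLICT ... DO UPDATE makes a second run overwrite instead of failing.
        con.executemany(
            "INSERT INTO sales_order (order_id, amount) VALUES (?, ?) "
            "ON CONFLICT(order_id) DO UPDATE SET amount = excluded.amount",
            rows,
        )
        con.commit()


    rows = [("0001", 100.0), ("0002", 250.0)]
    load(rows)
    load(rows)  # running it again is harmless: same end state, no PK violation
    print(con.execute("SELECT COUNT(*) FROM sales_order").fetchone()[0])  # -> 2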

Logging

If something goes wrong in production, the logs are the first place to understand what happened. The person looking at the logs first will be an IT support person who has no understanding of the code. If the logs provide enough information for this person to fix the issue on their own, the system will be available more quickly and we, the developers, won't be bothered. Hence investing some time into logging is certainly a time saver for us.

  • logger = lib_producer.utils.get_logger(<name>) is the default way of getting a Python logger instance. This logger is set to the correct log level, based on the environment and the settings, and it also sets the log format.
  • The log should never contain sensitive data, only technical data. For example logging the entire data row is a no-go.
  • The log records, as INFO, when the program started and ended.
  • All important processing steps are also logged as INFO.
  • In DEBUG, more fine-grained information can be provided, up to approx. 1 line per second. For example, a program that reads 100k records in 10 seconds on average should write the processed row count every 10k rows (a logging sketch follows this list).
  • All errors should be written to the log as ERROR. If possible, the message should include what to do, not just the problem, e.g. a connection to the source database fails - but why? Is the server down? Is the hostname invalid? Did the password change?
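
A minimal sketch of these logging rules, using the get_logger() helper described above; the logger name and the processing loop are illustrative only:

    import lib_producer.utils

    logger = lib_producer.utils.get_logger("ingest_startek")


    def process(rows: list[dict]) -> None:
        logger.info("Ingest started, %d rows to process", len(rows))
        for i, row in enumerate(rows, start=1):
            # ... transform and produce the row here ...
            if i % 10_000 == 0:
                logger.debug("Processed %d rows so far", i)
        logger.info("Ingest finished, %d rows processed", len(rows))


    try:
        process([{"id": n} for n in range(25_000)])
    except ConnectionError as exc:
        # Say what to do, not just that something failed.
        logger.error(
            "Could not reach the source database (%s). Check that the server is up, "
            "the hostname is correct and the password in the Key Vault is current.",
            exc,
        )
        raise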

Security

Because the same code is executed in all environments, it cannot contain any environment-specific information, and certainly no connection credentials. All credentials are stored in the environment's Azure Key Vault.

  • Whenever possible, the permissions of the code are part of the infrastructure deployment. If the code must access an Azure service, the function/container running the code has an RBAC assigned that allows it.
  • Stored credentials are only used for services outside of Azure, e.g. an on-prem database.
  • In the DEV environment we have permission to create the secrets; in PROD certainly not.
  • Also consider that credentials can change while the program is running. Reading them from the Key Vault once at program start and then running for a month is therefore not ideal; the Key Vault should be read e.g. once a day. One pattern is to read the Key Vault only at program start: if the credentials are no longer valid, the connection fails and thus the program fails. It will be restarted automatically and hence read the credentials again (see the sketch below).
  • Credentials are never leaked into logs, for obvious reasons.
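
A minimal sketch of reading a credential at program start, assuming the azure-identity and azure-keyvault-secrets packages and an RBAC role on the environment's Key Vault; the vault URL and secret name are placeholders, not our actual names:

    from azure.identity import DefaultAzureCredential
    from azure.keyvault.secrets import SecretClient

    VAULT_URL = "https://<environment-key-vault>.vault.azure.net"


    def get_source_db_password() -> str:
        # DefaultAzureCredential uses the managed identity when running in Azure
        # and the developer's own login when debugging locally.
        client = SecretClient(vault_url=VAULT_URL, credential=DefaultAzureCredential())
        return client.get_secret("source-db-password").value


    # Read the credential once at program start; if it becomes invalid while the
    # program is running, the connection fails, the program fails and the restart
    # picks up the new value from the Key Vault.
    password = get_source_db_password()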

Unit/Integration tests

The developed code should include test automation. Ideally there would be a mock source and a mock target, so we can test without needing any external resources. There are solutions for that, but they are all rather time intensive. Hence the compromise is that we read the data from a development source system and write into a DummyProducer. This does not guarantee that the source data covers all cases (especially delta handling will be a challenge), but at least it allows comparing what has been read with what has been produced.

If for some sources a mock source can be created in a reasonable amount of time, this is favored.

Switching between mock and real producers is done via the adapter software pattern.
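
A minimal sketch of that adapter pattern; the class names and the selection flag are illustrative, not the actual lib_producer interfaces:

    from abc import ABC, abstractmethod


    class Producer(ABC):
        @abstractmethod
        def produce(self, rows: list[dict]) -> None: ...


    class DummyProducer(Producer):
        """Test adapter: keeps the rows in memory so the test can compare them."""

        def __init__(self) -> None:
            self.rows: list[dict] = []

        def produce(self, rows: list[dict]) -> None:
            self.rows.extend(rows)


    class RealProducer(Producer):
        """Production adapter: would write the rows to the actual target (omitted here)."""

        def produce(self, rows: list[dict]) -> None:
            raise NotImplementedError("write to the real target here")


    def get_producer(use_mock: bool) -> Producer:
        return DummyProducer() if use_mock else RealProducer()


    # In a unit test the pipeline is wired with the DummyProducer:
    producer = get_producer(use_mock=True)
    producer.produce([{"order_id": "0001", "amount": 100.0}])
    assert producer.rows[0]["order_id"] == "0001"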

But integration tests should also be implemented, because this is where most oversights occur: missing or wrong credentials, incorrect roles and permissions, naming errors when reading the Key Vault, etc. A sketch of such a check follows the list below.

  • Test the source to target data movement, focus is on connectivity and performance.
  • Test the transformation outcome with local data.
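
A minimal pytest sketch for the integration side, checking exactly the kind of oversights mentioned above; the marker name, vault URL and secret name are assumptions:

    import pytest
    from azure.identity import DefaultAzureCredential
    from azure.keyvault.secrets import SecretClient

    VAULT_URL = "https://<dev-key-vault>.vault.azure.net"


    @pytest.mark.integration
    def test_source_db_secret_is_present():
        # Fails fast on wrong secret names, missing roles or permission problems.
        client = SecretClient(vault_url=VAULT_URL, credential=DefaultAzureCredential())
        secret = client.get_secret("source-db-password")
        assert secret.value  # the secret exists and is not empty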





