
ALMINA Plant Data Pipeline – Technical Documentation


Overview

This pipeline automates the extraction, transformation, validation, and loading of daily plant data from ALMINA’s Excel workbooks into a BigQuery summary table (an ETL pipeline with an explicit validation stage). It runs in both batch and event-driven (GCP Cloud Function) modes, with error handling, structured logging, and schema management.
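Assuming a GCS-triggered Cloud Function for the event-driven mode, both modes can share a single processing path. The names below (`process_file`, `run_batch`, `handle_workbook`) are illustrative, not the documented API; the event fields `bucket` and `name` are those delivered by a GCS finalize trigger:

```python
def handle_workbook(bucket: str, blob_name: str) -> str:
    """Shared ETL path for one workbook; returns the GCS URI it processed.

    In the real pipeline this is where mapping, transformation, validation,
    and the BigQuery upload would run.
    """
    uri = f"gs://{bucket}/{blob_name}"
    # ... extract / transform / validate / load ...
    return uri


def process_file(event: dict, context=None) -> str:
    """Event-driven entry point: called with a GCS finalize event payload."""
    return handle_workbook(event["bucket"], event["name"])


def run_batch(bucket: str, blob_names: list[str]) -> list[str]:
    """Batch entry point: process every workbook listed for the bucket."""
    return [handle_workbook(bucket, name) for name in blob_names]
```

Keeping both entry points thin wrappers around one function means batch and triggered runs cannot drift apart.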


Architecture & Workflow

High-level steps:

  1. Ingestion: Excel files are read from Google Cloud Storage (GCS), either in batch or via Cloud Function trigger.
  2. Mapping: The MappingTest worksheet defines how to extract and map data from daily sheets.
  3. Extraction & Transformation: Data is extracted per shift, cleaned, and converted to the correct types.
  4. Validation: Data quality is checked, and metadata is attached.
  5. Loading: The processed DataFrame is uploaded to BigQuery, with schema alignment and idempotency.
  6. Logging: All steps are logged with structured, context-rich messages.
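Steps 2 through 6 above can be sketched as a single orchestration function. This is a minimal sketch, not the actual module API: the column names `source_column` and `target_field` are assumptions about the MappingTest layout, and the BigQuery loader is injected as a plain callable so the sketch stays testable:

```python
import logging

import pandas as pd

logger = logging.getLogger("almina.pipeline")


def run_pipeline(sheets: dict, mapping_rules: pd.DataFrame, upload) -> pd.DataFrame:
    """Map, transform, validate, and load one already-ingested workbook."""
    # Mapping: rules from the MappingTest worksheet (assumed two-column layout).
    rename = dict(zip(mapping_rules["source_column"], mapping_rules["target_field"]))
    frames = []
    for name, sheet in sheets.items():
        # Extraction & transformation: keep mapped columns, coerce to numeric.
        df = sheet[list(rename)].rename(columns=rename)
        df = df.apply(pd.to_numeric, errors="coerce")
        # Validation & metadata: drop rows with no usable values, tag provenance.
        df = df.dropna(how="all").assign(source_sheet=name)
        frames.append(df)
    result = pd.concat(frames, ignore_index=True)
    upload(result)  # loading; idempotency lives in the real BigQuery service
    logger.info("Uploaded %d rows from %d sheets", len(result), len(sheets))
    return result
```

The injected `upload` callable is the seam where the real BigQuery service (with schema alignment and idempotent writes) plugs in.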

Configuration


Module Reference

Main Script (main.py)


SummaryDataProcessor


Field Mapping & Mapping Rules


Data Transformation


BigQuery Schema & Utilities


BigQuery Service


GCP Storage Service


Logging Utilities


Settings & Configuration


Sanitizers


Retry Utilities


Data Flow Summary

  1. File Ingestion: Excel workbooks are read from GCS, in batch or when a Cloud Function trigger fires.
  2. Mapping Extraction: Mapping rules are read from the MappingTest worksheet.
  3. Data Extraction: Values are pulled from each daily sheet, per shift.
  4. Transformation: Extracted values are cleaned and converted to the target types.
  5. Validation & Metadata: Data quality checks run and metadata is attached.
  6. BigQuery Upload: The processed DataFrame is loaded with schema alignment and idempotency.
  7. Logging: Every step emits structured, context-rich log messages.

Error Handling & Logging


Extensibility & Best Practices


Security & Compliance


Appendix: Example Usage

Local Batch Processing

export GCP_PROJECT_ID=your-project
export GCS_BUCKET_NAME=your-bucket
export BQ_DATASET_ID=your-dataset
python main.py

Cloud Function Deployment
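A deployment might look like the following. The function name, entry point, runtime, and region are assumptions, not values specified by this project; substitute the actual entry-point function defined in main.py:

```shell
# Illustrative only: function name, --entry-point, --runtime, and --region
# are placeholders to adapt to the actual deployment.
gcloud functions deploy almina-plant-pipeline \
  --runtime python311 \
  --trigger-bucket "$GCS_BUCKET_NAME" \
  --entry-point process_file \
  --set-env-vars GCP_PROJECT_ID=your-project,BQ_DATASET_ID=your-dataset \
  --region europe-west1
```

The `--trigger-bucket` flag wires the function to GCS object-finalize events, so each new workbook upload invokes the pipeline once.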