<h1>Almina extract data</h1><h3>Git repo: <a href="https://gitlab.syensqo.com/solvay-digital-mining/smartfloat/report-automation/etl_report_automation/-/blob/dev-almina-extract-data/almina_extract_data_gcp.py?ref_type=heads">GitLab</a></h3><h2>Overview</h2><p>This module provides <strong>backward compatibility</strong> for legacy tests and systems while internally delegating to a new modular architecture for data processing. It acts as a bridge: the old API is preserved, but the work is done by improved, maintainable, and testable components.</p><p><strong>Key Features:</strong></p><ul><li>Processes Excel files from Google Cloud Storage (GCS)</li><li>Validates and cleans tabular data</li><li>Uploads processed data to Google BigQuery</li><li>Provides structured logging and robust error handling</li><li>Supports both batch and event-driven (Cloud Function) execution</li></ul><hr /><h2>Architecture</h2><ul><li><strong>Wrappers:</strong> Functions keep the legacy API but delegate their logic to the new modular components.</li><li><strong>Modular Imports:</strong> Uses the new architecture from <code>src.config</code>, <code>src.utils</code>, <code>src.models</code>, <code>src.services</code>, and <code>src.processors</code>.</li><li><strong>Cloud Integration:</strong> Handles authentication, storage, and BigQuery operations via the Google Cloud client libraries.</li><li><strong>Logging:</strong> Structured logging for both local and cloud environments.</li><li><strong>Data Validation:</strong> Ensures data quality before upload.</li></ul><hr /><h2>Configuration</h2><p>For backward compatibility, the module reads its configuration from environment variables:</p><table class="confluenceTable"><thead><tr><th class="confluenceTh">Variable Name</th><th class="confluenceTh">Description</th><th class="confluenceTh">Default Value</th></tr></thead><tbody><tr><td class="confluenceTd"><code>GCP_PROJECT_ID</code></td><td class="confluenceTd">Google Cloud Project ID</td><td class="confluenceTd"><code>'your-project-id'</code></td></tr><tr><td class="confluenceTd"><code>GCS_BUCKET_NAME</code></td><td class="confluenceTd">GCS bucket containing the Excel files</td><td class="confluenceTd"><code>'your-bucket-name'</code></td></tr><tr><td class="confluenceTd"><code>BQ_DATASET_ID</code></td><td class="confluenceTd">BigQuery dataset ID</td><td class="confluenceTd"><code>'your-dataset-id'</code></td></tr><tr><td class="confluenceTd"><code>BQ_TABLE_ID</code></td><td class="confluenceTd">BigQuery table ID</td><td class="confluenceTd"><code>'almina_data'</code></td></tr><tr><td class="confluenceTd"><code>CUSTOMER_ID</code></td><td class="confluenceTd">Customer identifier</td><td class="confluenceTd"><code>'almina_1'</code></td></tr><tr><td class="confluenceTd"><code>GOOGLE_APPLICATION_CREDENTIALS</code></td><td class="confluenceTd">Path to the GCP credentials file</td><td class="confluenceTd"><code>None</code></td></tr></tbody></table><hr /><h2>Main Functions</h2><h3>1. <strong>setup_logging()</strong></h3><p>Configures structured logging for both local and GCP environments.</p><h3>2. <strong>retry_with_backoff(max_retries=3, backoff_factor=2.0)</strong></h3><p>Decorator that retries a function with exponential backoff on failure.</p><h3>3. <strong>sanitize_column_name(original_name)</strong></h3><p>Converts column headers to snake_case and ensures BigQuery compatibility.</p><h3>4. <strong>get_bq_schema()</strong></h3><p>Returns the explicit BigQuery schema as a list of <code>SchemaField</code> objects.</p><h3>5. 
<strong>map_columns_to_output_schema(df)</strong></h3><p>Maps incoming DataFrame columns to the canonical schema names, tolerating minor naming variations.</p><h3>6. <strong>validate_data_quality(df, file_name)</strong></h3><p>Validates the DataFrame&rsquo;s structure, required columns, data types, and value ranges, and returns a validation report.</p><h3>7. <strong>process_xlsx_file(file_content, file_name, upload_time_to_bucket=None)</strong></h3><p>Processes an Excel file:</p><ul><li>Reads and cleans data from the 'Resumo' sheet</li><li>Maps and sanitizes columns</li><li>Validates data quality</li><li>Adds metadata columns</li><li>Returns a cleaned DataFrame ready for upload</li></ul><h3>8. <strong>authenticate_gcp()</strong></h3><p>Authenticates with GCP using service-account credentials and returns Storage and BigQuery clients.</p><h3>9. <strong>ensure_table_exists(bq_client)</strong></h3><p>Ensures the BigQuery table exists with the correct schema, partitioning, and clustering.</p><h3>10. <strong>upload_to_bigquery(df_processed, bigquery_client)</strong></h3><p>Uploads the processed DataFrame to BigQuery, enforcing schema compliance.</p><h3>11. <strong>process_bucket_files()</strong></h3><p>Processes all <code>.xlsx</code> files in the configured GCS bucket and uploads them to BigQuery.</p><h3>12. <strong>process_xlsx_cloud_function(event, context)</strong></h3><p>Entry point for the GCP Cloud Function, triggered by file uploads to the bucket.</p><h3>13. <strong>main()</strong></h3><p>Main entry point for batch processing: validates the environment, processes all files, and handles errors.</p><hr /><h2>Usage</h2><ul><li><strong>Batch Mode:</strong> Run as a script to process all files in the bucket.</li><li><strong>Cloud Function:</strong> Deploy <code>process_xlsx_cloud_function</code> for real-time processing on file upload.</li></ul><hr /><h2>Error Handling &amp; Logging</h2><ul><li>All critical operations are wrapped with structured logging.</li><li>Errors are logged with context and re-raised for visibility.</li><li>Data validation errors prevent the upload to BigQuery.</li></ul><hr /><h2>Security &amp; Compliance</h2><ul><li>Uses Google Cloud IAM and service accounts for authentication.</li><li>Handles credentials securely via environment variables.</li><li><strong>Note:</strong> Always verify compliance with your organization&rsquo;s data handling and cloud security policies.</li></ul><hr /><h2>Example: Cloud Function</h2><p>Deploy <code>almina-extract-data</code> as a GCP Cloud Function with appropriate triggers and permissions.</p><hr /><h2>Encouragement for Critical Thinking</h2><p>This documentation is AI-generated and should be reviewed for accuracy and completeness. Always validate the module&rsquo;s behavior in your environment and consult your team for best practices.</p>
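<p>For illustration, the <code>retry_with_backoff</code> decorator described under Main Functions could be sketched roughly as below. This is a minimal sketch based only on the documented signature (<code>max_retries=3</code>, <code>backoff_factor=2.0</code>); the initial delay and the decision to retry on any exception are assumptions, and the actual implementation in the Git repo may differ.</p>

```python
import functools
import time


def retry_with_backoff(max_retries=3, backoff_factor=2.0):
    """Retry the wrapped function with exponential backoff on failure."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            delay = 1.0  # assumed initial delay in seconds
            for attempt in range(1, max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_retries:
                        raise  # retries exhausted: surface the original error
                    time.sleep(delay)
                    delay *= backoff_factor  # exponential backoff
        return wrapper
    return decorator
```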
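<p>Similarly, <code>sanitize_column_name</code> might look like the following sketch, assuming the BigQuery column-naming rules (letters, digits, and underscores only; must not start with a digit). The exact normalization rules used by the module are an assumption here; consult the repo for the real behavior.</p>

```python
import re


def sanitize_column_name(original_name):
    """Normalize a header to a BigQuery-safe snake_case identifier."""
    name = str(original_name).strip().lower()
    name = re.sub(r"[^0-9a-z]+", "_", name)  # collapse spaces/punctuation to "_"
    name = name.strip("_")
    if not name:
        return "unnamed_column"  # assumed fallback for empty headers
    if name[0].isdigit():  # BigQuery columns must not start with a digit
        name = "_" + name
    return name
```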
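<p>The Cloud Function example above can be made concrete with a hypothetical <code>gcloud</code> deployment command. The runtime, region, trigger bucket, and environment values below are placeholders mirroring the configuration table, not the project&rsquo;s real settings:</p>

```shell
# Hypothetical deployment sketch; adjust names, region, and runtime to your project.
gcloud functions deploy almina-extract-data \
  --runtime python311 \
  --entry-point process_xlsx_cloud_function \
  --trigger-resource your-bucket-name \
  --trigger-event google.storage.object.finalize \
  --region europe-west1 \
  --set-env-vars GCP_PROJECT_ID=your-project-id,GCS_BUCKET_NAME=your-bucket-name,BQ_DATASET_ID=your-dataset-id,BQ_TABLE_ID=almina_data
```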