<h1>Almina extract data</h1><h3>Git repo: <a href="https://gitlab.syensqo.com/solvay-digital-mining/smartfloat/report-automation/etl_report_automation/-/blob/dev-almina-extract-data/almina_extract_data_gcp.py?ref_type=heads">GitLab</a></h3><h2>Overview</h2><p>This module provides <strong>backward compatibility</strong> for legacy tests and systems while internally delegating to a new modular architecture for data processing. It acts as a bridge: the old API is preserved, but the work is done by improved, maintainable, and testable components.</p><p><strong>Key Features:</strong></p><ul><li>Processes Excel files from Google Cloud Storage (GCS)</li><li>Validates and cleans tabular data</li><li>Uploads processed data to Google BigQuery</li><li>Provides structured logging and robust error handling</li><li>Supports both batch and event-driven (Cloud Function) execution</li></ul><hr /><h2>Architecture</h2><ul><li><strong>Wrappers:</strong> Functions keep the legacy API but delegate their logic to the new modular components.</li><li><strong>Modular Imports:</strong> Uses the new architecture from <code>src.config</code>, <code>src.utils</code>, <code>src.models</code>, <code>src.services</code>, and <code>src.processors</code>.</li><li><strong>Cloud Integration:</strong> Handles authentication, storage, and BigQuery operations via the Google Cloud client libraries.</li><li><strong>Logging:</strong> Structured logging for both local and cloud environments.</li><li><strong>Data Validation:</strong> Ensures data quality before upload.</li></ul><hr /><h2>Configuration</h2><p>For backward compatibility, the module reads its configuration from environment variables:</p><table class="confluenceTable"><thead><tr><th class="confluenceTh">Variable Name</th><th class="confluenceTh">Description</th><th class="confluenceTh">Default Value</th></tr></thead><tbody><tr><td class="confluenceTd"><code>GCP_PROJECT_ID</code></td><td class="confluenceTd">Google Cloud Project ID</td><td class="confluenceTd"><code>'your-project-id'</code></td></tr><tr><td class="confluenceTd"><code>GCS_BUCKET_NAME</code></td><td class="confluenceTd">GCS bucket containing the Excel files</td><td class="confluenceTd"><code>'your-bucket-name'</code></td></tr><tr><td class="confluenceTd"><code>BQ_DATASET_ID</code></td><td class="confluenceTd">BigQuery dataset ID</td><td class="confluenceTd"><code>'your-dataset-id'</code></td></tr><tr><td class="confluenceTd"><code>BQ_TABLE_ID</code></td><td class="confluenceTd">BigQuery table ID</td><td class="confluenceTd"><code>'almina_data'</code></td></tr><tr><td class="confluenceTd"><code>CUSTOMER_ID</code></td><td class="confluenceTd">Customer identifier</td><td class="confluenceTd"><code>'almina_1'</code></td></tr><tr><td class="confluenceTd"><code>GOOGLE_APPLICATION_CREDENTIALS</code></td><td class="confluenceTd">Path to the GCP credentials file</td><td class="confluenceTd"><code>None</code></td></tr></tbody></table><hr /><h2>Main Functions</h2><h3>1. <strong>setup_logging()</strong></h3><p>Configures structured logging for both local and GCP environments.</p><h3>2. <strong>retry_with_backoff(max_retries=3, backoff_factor=2.0)</strong></h3><p>Decorator that retries a function with exponential backoff on failure.</p><h3>3. <strong>sanitize_column_name(original_name)</strong></h3><p>Converts column headers to snake_case and ensures BigQuery compatibility.</p><h3>4. <strong>get_bq_schema()</strong></h3><p>Returns the explicit BigQuery schema as a list of <code>SchemaField</code> objects.</p><h3>5. 
<strong>map_columns_to_output_schema(df)</strong></h3><p>Maps incoming DataFrame columns to the canonical schema names, tolerating minor naming variations.</p><h3>6. <strong>validate_data_quality(df, file_name)</strong></h3><p>Validates the DataFrame&rsquo;s structure, required columns, data types, and value ranges, and returns a validation report.</p><h3>7. <strong>process_xlsx_file(file_content, file_name, upload_time_to_bucket=None)</strong></h3><p>Processes an Excel file:</p><ul><li>Reads and cleans data from the 'Resumo' sheet</li><li>Maps and sanitizes columns</li><li>Validates data quality</li><li>Adds metadata columns</li><li>Returns a cleaned DataFrame ready for upload</li></ul><h3>8. <strong>authenticate_gcp()</strong></h3><p>Authenticates with GCP using service-account credentials and returns Storage and BigQuery clients.</p><h3>9. <strong>ensure_table_exists(bq_client)</strong></h3><p>Ensures the BigQuery table exists with the correct schema, partitioning, and clustering.</p><h3>10. <strong>upload_to_bigquery(df_processed, bigquery_client)</strong></h3><p>Uploads the processed DataFrame to BigQuery, enforcing schema compliance.</p><h3>11. <strong>process_bucket_files()</strong></h3><p>Processes all <code>.xlsx</code> files in the configured GCS bucket and uploads them to BigQuery.</p><h3>12. <strong>process_xlsx_cloud_function(event, context)</strong></h3><p>Entry point for the GCP Cloud Function, triggered by file uploads to the bucket.</p><h3>13. <strong>main()</strong></h3><p>Main entry point for batch processing: validates the environment, processes all files, and handles errors.</p><hr /><h2>Usage</h2><ul><li><strong>Batch Mode:</strong> Run as a script to process all files in the bucket.</li><li><strong>Cloud Function:</strong> Deploy <code>process_xlsx_cloud_function</code> for real-time processing on file upload.</li></ul><hr /><h2>Error Handling &amp; Logging</h2><ul><li>All critical operations are wrapped with structured logging.</li><li>Errors are logged with context and re-raised for visibility.</li><li>Data validation errors prevent the upload to BigQuery.</li></ul><hr /><h2>Security &amp; Compliance</h2><ul><li>Uses Google Cloud IAM and service accounts for authentication.</li><li>Handles credentials securely via environment variables.</li><li><strong>Note:</strong> Always verify compliance with your organization&rsquo;s data handling and cloud security policies.</li></ul><hr /><h2>Example: Cloud Function</h2><p>Deploy <code>almina-extract-data</code> as a GCP Cloud Function with appropriate triggers and permissions.</p><hr /><h2>Encouragement for Critical Thinking</h2><p>This documentation is AI-generated and should be reviewed for accuracy and completeness. Always validate the module&rsquo;s behavior in your environment and consult your team for best practices.</p>
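<p>For illustration, the <code>retry_with_backoff</code> decorator described under Main Functions could be sketched roughly as below. This is a minimal sketch based only on the documented signature (<code>max_retries=3</code>, <code>backoff_factor=2.0</code>); the initial delay and the decision to retry on any exception are assumptions, and the actual implementation in the Git repo may differ.</p>

```python
import functools
import time


def retry_with_backoff(max_retries=3, backoff_factor=2.0):
    """Retry the wrapped function with exponential backoff on failure."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            delay = 1.0  # assumed initial delay in seconds
            for attempt in range(1, max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_retries:
                        raise  # retries exhausted: surface the original error
                    time.sleep(delay)
                    delay *= backoff_factor  # exponential backoff
        return wrapper
    return decorator
```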
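<p>Similarly, <code>sanitize_column_name</code> might look like the following sketch, assuming the BigQuery column-naming rules (letters, digits, and underscores only; must not start with a digit). The exact normalization rules used by the module are an assumption here; consult the repo for the real behavior.</p>

```python
import re


def sanitize_column_name(original_name):
    """Normalize a header to a BigQuery-safe snake_case identifier."""
    name = str(original_name).strip().lower()
    name = re.sub(r"[^0-9a-z]+", "_", name)  # collapse spaces/punctuation to "_"
    name = name.strip("_")
    if not name:
        return "unnamed_column"  # assumed fallback for empty headers
    if name[0].isdigit():  # BigQuery columns must not start with a digit
        name = "_" + name
    return name
```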
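<p>The Cloud Function example above can be made concrete with a hypothetical <code>gcloud</code> deployment command. The runtime, region, trigger bucket, and environment values below are placeholders mirroring the configuration table, not the project&rsquo;s real settings:</p>

```shell
# Hypothetical deployment sketch; adjust names, region, and runtime to your project.
gcloud functions deploy almina-extract-data \
  --runtime python311 \
  --entry-point process_xlsx_cloud_function \
  --trigger-resource your-bucket-name \
  --trigger-event google.storage.object.finalize \
  --region europe-west1 \
  --set-env-vars GCP_PROJECT_ID=your-project-id,GCS_BUCKET_NAME=your-bucket-name,BQ_DATASET_ID=your-dataset-id,BQ_TABLE_ID=almina_data
```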