
FinanceCo

Case Study

About

This project involved the design and implementation of a modern, scalable data warehouse solution on the Azure cloud, leveraging Azure Databricks and Delta Lake. The objective was to replace a legacy system with a high-performance lakehouse architecture capable of supporting critical financial reporting, business intelligence, and advanced analytics while ensuring robust governance and security.

The Challenge

The existing data infrastructure faced several critical limitations that hindered business agility and created operational risks:

  • Inefficient Data Ingestion: The process for handling initial historical data loads and ongoing incremental data feeds was manual and inefficient, leading to data duplication and processing delays.

  • Lack of Historical Insight: The legacy system did not effectively track changes to key business entities, such as customer information, making it impossible to perform historical analysis.

  • Poor Query Performance: Analytical queries against large datasets were unacceptably slow, delaying critical financial reporting and business intelligence.

  • Fragmented Governance: There was no centralized system for managing data access, posing a significant security risk for sensitive financial and personal data (PII).

Solution

An end-to-end data pipeline was engineered on Azure Databricks, implementing a Medallion Architecture (Bronze, Silver, Gold layers) to progressively refine and structure the data.


  • Bronze Layer - Unified Data Ingestion: A single, parameterized Databricks notebook was developed to handle both initial and incremental data. Using Databricks Auto Loader, the pipeline efficiently processes only new files from incremental feeds, while a MERGE operation intelligently combines this with the one-time historical data load. This creates unified, deduplicated source tables (unified_customers, unified_transactions) in the Bronze layer.

  • Silver Layer - Dimensional Modeling & Cleansing: The core of the data warehouse was built in the Silver layer, where raw data is transformed into a clean, queryable dimensional model.

    • Dimension Table (dim_customers): (See Screenshot 2: SCD Type 2 Implementation below) A Slowly Changing Dimension (SCD) Type 2 table was implemented for customers. This sophisticated technique preserves a full history of all changes to customer records, enabling point-in-time analysis.

    • Fact Table (fct_transactions): (See Screenshot 3: Fact Table Creation below) A partitioned fact table was created to store transactional events. The table is physically partitioned by date, a key optimization that dramatically accelerates time-based queries.

  • Gold Layer - Business-Ready Aggregates: (See Screenshot 4: Gold Layer Aggregation below) For direct use by analysts and BI tools like Power BI, the Gold layer contains aggregated summary tables. A customer_monthly_summary table was created by joining the Silver layer tables to provide high-level business insights.

  • Performance Optimization: (See Screenshot 5: Performance Optimization Techniques below) Advanced performance tuning techniques were applied to ensure millisecond-level query responses:

    • OPTIMIZE and Z-ORDER were used on the large fact table to compact files and co-locate related data.

    • The smaller, frequently-joined dimension table was cached (CACHE TABLE) for fast in-memory access.

    • Broadcast join strategies were demonstrated to accelerate joins between large and small tables.

  • Centralized Governance with Unity Catalog: (See Screenshot 6: Unity Catalog for Governance below) A robust security framework was established using Databricks Unity Catalog.

    • Column-Level Security: Secure views were created to mask sensitive PII like customer email addresses for general analyst groups.

    • Row-Level Security: Views with WHERE clauses were implemented to filter data, ensuring user groups could only access records relevant to their region (e.g., USA-only data).

Impact

The modernized data platform delivered significant improvements across the board:

  • Enhanced Performance: Query execution times for analytical reports were drastically reduced, enabling faster decision-making.

  • Increased Data Trust: The automated and idempotent data processing pipeline ensures high data quality and consistency.

  • Improved Security and Compliance: Fine-grained access controls implemented via Unity Catalog protect sensitive data and provide a clear audit trail for compliance.

  • Scalability for the Future: The Delta Lake-based architecture can seamlessly scale to handle exponential growth in data volume.

  • Empowered Analytics: The well-structured Gold layer provides analysts with direct access to clean, aggregated, and reliable data, fostering a data-driven culture.


Bronze Layer Ingestion

This screenshot highlights the robust Bronze layer ingestion process. Using Databricks Auto Loader, a spark.readStream query continuously monitors for new incoming files, automatically inferring schema and handling schema evolution. The foreachBatch pattern, combined with a MERGE statement, is crucial for handling both the initial historical data and subsequent incremental updates. This ensures that new records are inserted and existing records are efficiently updated or deduplicated based on primary keys, creating clean, unified source tables ready for the next stages of the data pipeline. This setup automates data ingestion, eliminating manual intervention and ensuring data freshness.
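
The snippet below is a minimal sketch of this ingestion pattern, not the exact notebook code: the landing paths, checkpoint locations, and the CustomerID merge key are illustrative assumptions (spark and DeltaTable are the standard Databricks notebook entry points).

```python
from delta.tables import DeltaTable

def upsert_to_bronze(microbatch_df, batch_id):
    """Merge each Auto Loader micro-batch into the unified Bronze table."""
    target = DeltaTable.forName(spark, "bronze.unified_customers")
    (target.alias("t")
        .merge(microbatch_df.alias("s"), "t.CustomerID = s.CustomerID")
        .whenMatchedUpdateAll()          # refresh/deduplicate existing records
        .whenNotMatchedInsertAll()       # insert brand-new records
        .execute())

(spark.readStream
    .format("cloudFiles")                                        # Databricks Auto Loader
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/mnt/bronze/_schemas/customers")
    .load("/mnt/landing/customers/incremental/")
    .writeStream
    .foreachBatch(upsert_to_bronze)                              # MERGE per micro-batch
    .option("checkpointLocation", "/mnt/bronze/_checkpoints/customers")
    .trigger(availableNow=True)                                  # process only new files, then stop
    .start())
```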


SCD Type 2 Implementation

Here, we visualize the core of the Silver layer's dim_customers table, showcasing the implementation of Slowly Changing Dimension (SCD) Type 2. The first merge operation intelligently identifies and "expires" old versions of customer records by setting is_current to false and populating end_date when key attributes change. Subsequently, new or updated versions of these records are inserted with is_current set to true and a new start_date. This critical process ensures a complete historical lineage of customer data, enabling precise point-in-time analysis, which is vital for accurate financial reporting and customer behavior analysis.
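
A hedged sketch of this SCD Type 2 flow is shown below, assuming CustomerID as the business key and Email/Country as the tracked attributes; the actual notebook may track different columns.

```python
from pyspark.sql import functions as F
from delta.tables import DeltaTable

updates = spark.table("bronze.unified_customers")
dim = DeltaTable.forName(spark, "silver.dim_customers")
current = dim.toDF().filter("is_current = true")

# Identify records that are new or whose tracked attributes changed.
changed = (updates.alias("u")
    .join(current.alias("d"), F.col("u.CustomerID") == F.col("d.CustomerID"), "left")
    .filter("d.CustomerID IS NULL OR u.Email <> d.Email OR u.Country <> d.Country")
    .select("u.*"))

# Step 1: "expire" the currently-active row for every changed customer.
(dim.alias("d")
    .merge(changed.alias("c"), "d.CustomerID = c.CustomerID AND d.is_current = true")
    .whenMatchedUpdate(set={"is_current": F.lit(False),
                            "end_date": F.current_timestamp()})
    .execute())

# Step 2: insert the new version of each record as the current row.
(changed
    .withColumn("is_current", F.lit(True))
    .withColumn("start_date", F.current_timestamp())
    .withColumn("end_date", F.lit(None).cast("timestamp"))
    .write.format("delta").mode("append").saveAsTable("silver.dim_customers"))
```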


Parameterization

This screenshot highlights a key design principle for creating reusable and flexible data pipelines: parameterization using Databricks widgets. By defining inputs like entity (for 'customers' or 'transactions') in 01_Bronze_Layer_Unification.py or catalog_name in 00_Setup_and_Configuration.py, notebooks can be executed with different configurations without modifying the underlying code. This approach promotes modularity and maintainability, allowing the same pipeline logic to be applied across various datasets or environments, reducing development time and potential errors. It's crucial for building scalable and adaptable data solutions.
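
A short sketch of this widget pattern follows; the widget names mirror those mentioned above, while the default values and derived paths are assumptions.

```python
# Define notebook parameters (widgets) so the same logic runs for any entity/environment.
dbutils.widgets.text("catalog_name", "finance_dev", "Target Unity Catalog")
dbutils.widgets.dropdown("entity", "customers", ["customers", "transactions"], "Entity")

catalog_name = dbutils.widgets.get("catalog_name")
entity = dbutils.widgets.get("entity")

# The same pipeline logic now targets different datasets at run time.
source_path = f"/mnt/landing/{entity}/incremental/"
target_table = f"{catalog_name}.bronze.unified_{entity}"
print(f"Ingesting {source_path} -> {target_table}")
```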


Fact Table Creation

This screenshot illustrates the creation of the fct_transactions table in the Silver layer, a cornerstone for analytical queries. Raw transaction data is transformed by casting Amount to DoubleType and TransactionDate to TimestampType for numerical accuracy and temporal integrity. Crucially, a transaction_date_partition column is derived for effective data partitioning. The data is then written to a Delta table, leveraging partitionBy for physical partitioning on this date column. This design significantly boosts query performance, especially for time-series analysis, as Spark can prune irrelevant partitions, drastically reducing the data scanned.
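
A condensed sketch of this transformation is given below; the Bronze source table name and column names follow the narrative and are assumptions.

```python
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType, TimestampType

raw = spark.table("bronze.unified_transactions")

fct = (raw
    .withColumn("Amount", F.col("Amount").cast(DoubleType()))
    .withColumn("TransactionDate", F.col("TransactionDate").cast(TimestampType()))
    # Derive the partition column so time-based queries can prune whole partitions.
    .withColumn("transaction_date_partition", F.to_date("TransactionDate")))

(fct.write
    .format("delta")
    .mode("overwrite")
    .partitionBy("transaction_date_partition")   # physical partitioning by date
    .saveAsTable("silver.fct_transactions"))
```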


Gold Layer Aggregation

This section demonstrates the creation of the customer_monthly_summary table in the Gold layer, transforming detailed Silver layer data into readily consumable business insights. We first join dim_customers (filtered for current records) and fct_transactions on CustomerID. Then, data is grouped by CustomerID and transaction_month (derived using trunc). Aggregation functions like sum and count are applied to calculate total_sales_amount and number_of_transactions. This aggregated table serves as the single source of truth for business intelligence dashboards and high-level reporting, simplifying data access for analysts and improving query speed.
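
The sketch below illustrates this aggregation under the same assumed table and column names used in the Silver-layer sketches.

```python
from pyspark.sql import functions as F

dim = spark.table("silver.dim_customers").filter("is_current = true")
fct = spark.table("silver.fct_transactions")

# Join current customers to transactions, then roll up to one row per customer-month.
summary = (fct.join(dim, "CustomerID")
    .withColumn("transaction_month", F.trunc("TransactionDate", "month"))
    .groupBy("CustomerID", "transaction_month")
    .agg(F.sum("Amount").alias("total_sales_amount"),
         F.count("*").alias("number_of_transactions")))

(summary.write
    .format("delta")
    .mode("overwrite")
    .saveAsTable("gold.customer_monthly_summary"))
```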


Performance Optimization 

This screenshot highlights key performance optimization strategies applied to the data warehouse. OPTIMIZE is used to compact small files within the Delta table, improving read efficiency. More importantly, ZORDER BY (CustomerID) physically co-locates related data based on CustomerID across different files, dramatically accelerating queries that filter or join on this column. Additionally, caching the frequently accessed dim_customers table in memory (CACHE TABLE) ensures that subsequent queries involving this dimension execute with near-instantaneous lookup times. These techniques are crucial for delivering the millisecond-level query responses required by analytical applications.
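
These optimizations can be issued from a notebook roughly as sketched below (via spark.sql); table names match the earlier sketches and are assumptions.

```python
from pyspark.sql import functions as F

# Compact small files and co-locate rows that share the same CustomerID.
spark.sql("OPTIMIZE silver.fct_transactions ZORDER BY (CustomerID)")

# Pin the small, frequently-joined dimension in memory for fast lookups.
spark.sql("CACHE TABLE silver.dim_customers")

# Broadcast join: ship the small dimension to every executor and avoid a shuffle.
fct = spark.table("silver.fct_transactions")
dim = spark.table("silver.dim_customers")
joined = fct.join(F.broadcast(dim), "CustomerID")
```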


Unity Catalog for Governance

This screenshot exemplifies the robust data governance framework implemented using Databricks Unity Catalog. For column-level security, a view (dim_customers_analyst_view) is created to mask sensitive PII like email addresses, ensuring that general analyst groups only see obfuscated data. Building on this, row-level security is demonstrated by creating another view (dim_customers_usa_only_view) that filters records based on Country, granting usa_analysts access only to relevant regional data. By granting permissions exclusively on these secure views and revoking access to underlying tables, Unity Catalog enforces fine-grained access control, protecting sensitive financial and personal data.
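
A sketch of these controls is shown below; the view and group names follow the narrative, while the exact masking expression is an assumption.

```python
# Column-level security: expose Email only in an obfuscated form.
spark.sql("""
    CREATE OR REPLACE VIEW silver.dim_customers_analyst_view AS
    SELECT CustomerID, Name, Country,
           CONCAT('***@', SPLIT(Email, '@')[1]) AS Email   -- mask the local part
    FROM silver.dim_customers
""")

# Row-level security: restrict the usa_analysts group to USA records only.
spark.sql("""
    CREATE OR REPLACE VIEW silver.dim_customers_usa_only_view AS
    SELECT * FROM silver.dim_customers
    WHERE Country = 'USA'
""")

# Grant access on the secure views only and revoke it from the underlying table.
spark.sql("GRANT SELECT ON VIEW silver.dim_customers_analyst_view TO `analysts`")
spark.sql("GRANT SELECT ON VIEW silver.dim_customers_usa_only_view TO `usa_analysts`")
spark.sql("REVOKE SELECT ON TABLE silver.dim_customers FROM `analysts`")
```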
