
GDPR in Fabric
Case Study
About
ConnectSphere is a professional social network. This project was designed to handle GDPR "Right to be Forgotten" requests in a dynamic environment with daily incremental data loads. The solution combines an efficient ingestion engine with a surgical data erasure process to ensure both performance and compliance.
The Challenge
ConnectSphere faced the high-risk challenge of fulfilling GDPR erasure requests at scale. The core problem was performing these erasures not on a static database, but on a live platform with a constant stream of new user activity. This required a hybrid architecture that could efficiently process daily data loads without compromising the erasure's completeness. The logic needed to track a user's fragmented data across five sources, including profiles, articles, and direct messages. This involved precise deletions and complex anonymization to preserve other users' histories. The entire process also required a fully auditable trail to prove compliance.
Solution
I architected a two-pipeline solution in Microsoft Fabric to solve ConnectSphere's challenge, decoupling data ingestion from the erasure logic. An incremental ingestion pipeline uses a PySpark MERGE INTO statement to keep the master Delta Tables continuously updated from daily data feeds. The separate, on-demand erasure pipeline then executes surgical SQL commands against these complete tables, resulting in a scalable and compliant system that automates a critical business process.

Unified Data Foundation
This screenshot showcases the centralized data foundation of the project within the Microsoft Fabric Lakehouse. On the left, the Files section serves as the landing zone for raw, incremental data, which is logically organized into date-partitioned folders under daily_loads. This structured approach ensures a clear separation between raw and processed data. On the right, the Tables section contains the curated master Delta Tables, which represent the single source of truth for the application. This architecture is fundamental, as it leverages the power of Delta Lake to provide ACID transactions and schema enforcement directly on the data lake, creating a governed and reliable foundation upon which the entire GDPR compliance solution is built.
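
A minimal PySpark sketch of that layout is shown below. The folder and table names (daily_loads/2024-01-15, user_profiles) are illustrative assumptions, not the actual names from the screenshot:

```python
from pyspark.sql import SparkSession

# In a Fabric notebook the session already exists; getOrCreate() simply reuses it.
spark = SparkSession.builder.getOrCreate()

# Raw landing zone: date-partitioned folders under Files/daily_loads.
raw_daily = (spark.read
             .option("header", "true")
             .csv("Files/daily_loads/2024-01-15/user_profiles.csv"))

# Curated master Delta Table in the Tables section: the single source of truth.
master_profiles = spark.read.table("user_profiles")

raw_daily.printSchema()
master_profiles.printSchema()
```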

Incremental Ingestion
This image displays the automated orchestration flow of the daily incremental ingestion pipeline, built in Microsoft Fabric. The design demonstrates a modern, data-driven engineering pattern. The process begins with a Get Metadata activity that dynamically discovers new daily data folders, eliminating the need for hardcoded dates. This list of folders is then passed to a ForEach loop, which iterates through each day's data sequentially. Inside the loop, a Notebook activity is called, passing the folder name as a parameter. This executes the core PySpark MERGE logic for each day's data. This automated, looping design is highly efficient and scalable, ensuring only new data is processed while minimizing compute costs.
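
The notebook called inside the loop might look like the following sketch. The parameter name (folder_name) and the CSV format are assumptions, since only the pipeline canvas is visible in the screenshot:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Parameter cell: the Notebook activity inside the ForEach loop overrides this
# default with the folder name returned by Get Metadata.
folder_name = "2024-01-15"

# Load only that day's incremental files from the landing zone.
daily_df = (spark.read
            .option("header", "true")
            .csv(f"Files/daily_loads/{folder_name}"))

# Expose the batch to Spark SQL so the MERGE INTO step can reference it.
daily_df.createOrReplaceTempView("daily_updates")
```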

Data Integration with MERGE
This code snippet highlights the core of the incremental ingestion logic: the powerful and atomic MERGE INTO command executed with PySpark SQL. This single statement is responsible for intelligently integrating the daily delta data into the master Delta Tables. It performs a conditional "upsert" operation, which compares records from the source (the daily file) to the target (the master table) based on a primary key. It efficiently updates existing records when a match is found (WHEN MATCHED) and seamlessly inserts new records when no match exists (WHEN NOT MATCHED). This approach is fundamental to maintaining data consistency and avoiding inefficient full table reloads, demonstrating a best practice for enterprise-grade data warehousing on a lakehouse architecture.
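
As a hedged illustration of that upsert pattern (the actual table and key names are not shown in the snippet; user_profiles, daily_updates, and user_id are assumptions carried over from the sketch above):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Conditional upsert of the daily batch into the master Delta Table.
spark.sql("""
    MERGE INTO user_profiles AS target
    USING daily_updates AS source
    ON target.user_id = source.user_id
    WHEN MATCHED THEN
        UPDATE SET *        -- refresh existing records
    WHEN NOT MATCHED THEN
        INSERT *            -- add records seen for the first time
""")
```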

On-Demand GDPR Erasure
This screenshot displays the clean and maintainable design of the on-demand GDPR erasure pipeline. The architecture is intentionally simple, featuring a single, parameterized Notebook activity. This design effectively decouples the complex erasure logic, which is fully encapsulated and managed within the PySpark notebook, from the orchestration layer. By parameterizing the notebook with the user_id_to_delete, the pipeline becomes a reusable and flexible tool. It can be triggered on a schedule, manually, or by an external application (e.g., via an API call) whenever a new erasure request is submitted. This demonstrates a robust architectural pattern for executing complex, on-demand business processes in a modern data platform.
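
A condensed sketch of what the encapsulated erasure logic could look like is shown below. Only the user_id_to_delete parameter comes from the description above; the table and column names (user_profiles, direct_messages, sender_id) are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Parameter cell: populated by the pipeline's parameterized Notebook activity.
user_id_to_delete = "u-12345"

# Surgical deletion of records owned solely by the requester.
spark.sql(f"""
    DELETE FROM user_profiles
    WHERE user_id = '{user_id_to_delete}'
""")

# Anonymization of shared records, preserving the other participants' history.
spark.sql(f"""
    UPDATE direct_messages
    SET sender_id = 'ANONYMIZED'
    WHERE sender_id = '{user_id_to_delete}'
""")
```

In a production notebook the user ID would be bound as a proper parameter rather than interpolated into the SQL string, to avoid injection risks.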

Immutable Audit Trail
This image provides evidence of the final, critical output of the GDPR process: the immutable audit log. The screenshot shows a query against the audit_log Delta Table, displaying a clear and permanent record of the erasure operation. Each row captures the essential details required for compliance, including the specific user ID processed, the actions taken (DELETE, ANONYMIZE), the tables affected, the number of records impacted, and a precise timestamp. This audit trail is the definitive proof that the "Right to be Forgotten" request was fulfilled systematically and completely. It transforms the project from a simple data manipulation task into a trustworthy, enterprise-grade compliance solution ready for regulatory scrutiny.
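
For illustration, an audit record with those fields could be appended as in the sketch below; the column names and the example count are assumptions, not the table's real schema or figures:

```python
from datetime import datetime, timezone
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

# One audit entry per action, capturing the compliance evidence described above.
audit_entry = Row(
    user_id="u-12345",               # the user ID processed
    action="ANONYMIZE",              # DELETE or ANONYMIZE
    table_name="direct_messages",    # table affected
    records_affected=42,             # illustrative count, not a real figure
    processed_at=datetime.now(timezone.utc),
)

# Append-only write keeps the log as a permanent record of each erasure run.
spark.createDataFrame([audit_entry]).write.format("delta") \
    .mode("append").saveAsTable("audit_log")
```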