
Couture Collective
Case Study
About
This portfolio piece demonstrates a solution for a common retail analytics challenge: unifying disparate sales data. In a hypothetical case study, I used PySpark to process and harmonize sales data from the e-commerce and physical store channels of a fictional retailer, 'Couture Collective'. My solution shows how to build a single source of truth that enables a cohesive cross-channel strategy.
The Challenge
The case study presented a scenario where a fashion retailer's sales data was siloed between its online and physical store operations. The two teams used separate systems, resulting in fragmented data with different schemas that prevented a unified view of product performance. This created hypothetical business problems such as poor inventory allocation (for example, overstocking popular online items in physical stores) and inefficient marketing spend. The core challenge was to design a data processing solution that breaks down these silos and creates a single, reliable source of truth.
Solution
To solve the case, I developed the core data processing logic in PySpark to ingest, harmonize, and aggregate the two disparate sales datasets. My application features a key transformation that pivots the data, enabling a direct, side-by-side comparison of KPIs across both sales channels. The approach resolves the scenario by identifying "Channel Stars", the products that deserve strategic business focus.

Schema Harmonization
This code demonstrates the foundational and most critical step in data integration: schema harmonization. Before any analysis can be performed, disparate data sources must be structurally aligned. The provided PySpark script ingests two distinct sales datasets—one from in-store operations and another from online channels—each with unique column names and formats.
Using a series of targeted transformations like withColumnRenamed and cast, I meticulously standardize the schemas. This process aligns column names and ensures data types, like currency, are consistent. I also enrich the data by adding a channel column to tag the origin of each record, a crucial step for downstream segmentation and analysis. The result is two perfectly aligned, clean DataFrames, ready to be unified into a single, reliable dataset for business intelligence.
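Below is a minimal sketch of this harmonization step. The file paths, raw column names, and target schema are illustrative assumptions rather than the exact schemas from the case study.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("CoutureCollective").getOrCreate()

# Hypothetical source files; the raw column names below are assumptions.
in_store_raw = spark.read.csv("data/in_store_sales.csv", header=True)
online_raw = spark.read.csv("data/online_sales.csv", header=True)

# Align the in-store schema: rename columns, cast the currency field to a
# numeric type, and tag each record with its channel of origin.
in_store_sales = (
    in_store_raw
    .withColumnRenamed("SKU", "product_id")
    .withColumnRenamed("sale_amount", "revenue")
    .withColumn("revenue", F.col("revenue").cast("double"))
    .withColumn("units", F.col("qty").cast("int"))
    .drop("qty")
    .withColumn("channel", F.lit("in_store"))
)

# Apply the same standardization to the online data.
online_sales = (
    online_raw
    .withColumnRenamed("item_id", "product_id")
    .withColumnRenamed("order_total", "revenue")
    .withColumn("revenue", F.col("revenue").cast("double"))
    .withColumn("units", F.col("quantity").cast("int"))
    .drop("quantity")
    .withColumn("channel", F.lit("online"))
)
```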

Aggregation and Pivoting
After harmonizing the schemas, this code transforms the raw, transactional data into an analytics-ready format. It first unifies the separate sales DataFrames using unionByName and enriches the dataset with product details via a join. The script then calculates key performance indicators (KPIs) like total revenue and units sold for each product within each sales channel.
The most crucial step is the pivot operation, which reshapes the data from a "tall" format into a "wide" one. This places in-store and online metrics side-by-side in a single row, making direct performance comparison intuitive and powering the final business analysis.
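A sketch of the unification, enrichment, and pivot, continuing from the harmonized DataFrames above; the products dimension table and the exact KPI set are assumptions for illustration.

```python
# Unify the two harmonized DataFrames by column name.
unified = in_store_sales.unionByName(online_sales)

# Enrich with product details via a join (assumed dimension table).
products = spark.read.csv("data/products.csv", header=True)
enriched = unified.join(products, on="product_id", how="left")

# Compute per-product, per-channel KPIs, then pivot the channel values
# into columns so each product occupies a single wide row.
channel_kpis = (
    enriched
    .groupBy("product_id", "product_name")
    .pivot("channel", ["in_store", "online"])
    .agg(
        F.sum("revenue").alias("revenue"),
        F.sum("units").alias("units"),
    )
)
# Resulting columns: in_store_revenue, in_store_units,
#                    online_revenue, online_units
```

Listing the pivot values explicitly ("in_store", "online") spares Spark an extra pass over the data to discover the distinct channel values.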

Business Logic for Data Segmentation
This final code snippet demonstrates the crucial 'last mile' of analytics: translating a clean dataset into actionable business intelligence. The process begins by calculating the performance delta between channels to create a simple comparative metric.
The core of this script is the implementation of a specific business rule using PySpark's filter transformation to segment the data. The logic identifies "Channel Stars" by isolating products that are either exclusive to one channel or significantly outperform the other. The result is two distinct, prioritized lists that directly empower business teams to make strategic decisions about marketing, inventory, and sales focus.
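A sketch of that segmentation, continuing from the pivoted KPIs above; the 50% outperformance threshold is an assumed stand-in for the case study's specific business rule.

```python
# Treat missing channel revenue as zero so channel-exclusive products
# survive the comparison instead of producing nulls.
compared = (
    channel_kpis
    .fillna(0, subset=["in_store_revenue", "online_revenue"])
    .withColumn("revenue_delta",
                F.col("online_revenue") - F.col("in_store_revenue"))
)

# Business rule (threshold assumed): a "Channel Star" is a product that is
# exclusive to one channel or outperforms the other by more than 50%.
online_stars = compared.filter(
    (F.col("online_revenue") > 0)
    & ((F.col("in_store_revenue") == 0)
       | (F.col("online_revenue") > F.col("in_store_revenue") * 1.5))
)

in_store_stars = compared.filter(
    (F.col("in_store_revenue") > 0)
    & ((F.col("online_revenue") == 0)
       | (F.col("in_store_revenue") > F.col("online_revenue") * 1.5))
)
```

The two resulting DataFrames are the prioritized lists described above, one per channel, ready to hand to the marketing and inventory teams.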