
SoundWave

Case Study

About

This project engineers a custom data product to solve a critical content discovery problem for a digital media platform. It involves designing and implementing a dynamic "Trending Now" algorithm from first principles using PySpark. The goal was to move beyond simple metrics to create a system that accurately reflects current user engagement.

The Challenge

The platform's existing "Top Podcasts" chart created a poor experience for users and creators alike. Because it ranked shows by simple all-time listen counts, the chart was effectively static: popular shows stayed at the top indefinitely in a self-reinforcing loop, and new or emerging creators found it nearly impossible to get noticed. The result was declining user engagement and creator dissatisfaction, which called for a new engine that could capture not just popularity, but true momentum.

Solution

The solution is a scalable PySpark job that uses advanced window functions to engineer features for recent popularity and week-over-week growth, applying a weighted score. This data product delivers a dynamic "Trending Now" chart that surfaces emerging content and boosts user engagement.

1 Define Schemas

Defining Schemas

This code establishes a reliable data ingestion process by explicitly defining the structure of the source data before reading it. The purpose of this approach is to prevent common errors and performance issues associated with schema inference, where Spark has to guess the data types. The core logic uses StructType and StructField to create a blueprint for each CSV file, specifying each column's name and intended data type. This predefined schema is then passed to the spark.read.csv reader, ensuring that data is loaded correctly and efficiently into three separate, type-safe DataFrames.
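
The sketch below illustrates this pattern. The SparkSession setup, file paths, column names, and data types are illustrative assumptions, not the project's actual source layout.

# A minimal sketch of explicit schema definition for three source files.
# All names here (paths, columns, types) are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.types import (
    StructType, StructField, StringType, IntegerType, TimestampType
)

spark = SparkSession.builder.appName("trending-now").getOrCreate()

# Blueprint for the raw listening events.
listening_schema = StructType([
    StructField("user_id", StringType(), True),
    StructField("episode_id", StringType(), True),
    StructField("listen_timestamp", TimestampType(), True),
    StructField("duration_seconds", IntegerType(), True),
])

# Blueprints for the episode and podcast metadata.
episode_schema = StructType([
    StructField("episode_id", StringType(), True),
    StructField("podcast_id", StringType(), True),
    StructField("episode_title", StringType(), True),
])
podcast_schema = StructType([
    StructField("podcast_id", StringType(), True),
    StructField("title", StringType(), True),
    StructField("category", StringType(), True),
])

# Passing a predefined schema avoids a costly and error-prone inferSchema pass.
listening_df = spark.read.csv("data/listening_events.csv", header=True, schema=listening_schema)
episode_df = spark.read.csv("data/episodes.csv", header=True, schema=episode_schema)
podcast_df = spark.read.csv("data/podcasts.csv", header=True, schema=podcast_schema)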

2 Daily Listens

Daily Aggregation

This query block performs the foundational data preparation for the trending algorithm, transforming raw event logs into a clean, daily time-series. The process begins by enriching the data using a sequence of join operations, which combine the listening_df with episode and podcast metadata to create a single, comprehensive view. The core logic then pivots from individual events to daily summaries. First, the withColumn and to_date functions standardize the time dimension by creating a listen_date column. Finally, a groupBy operation on this date and podcast identifiers, paired with count(), aggregates the data to calculate the total daily_listens for each podcast.
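
A minimal sketch of this aggregation step, continuing with the hypothetical DataFrames and column names from the schema example above:

from pyspark.sql import functions as F

daily_listens_df = (
    listening_df
    # Enrich raw events with episode and podcast metadata.
    .join(episode_df, on="episode_id", how="inner")
    .join(podcast_df, on="podcast_id", how="inner")
    # Standardize the time dimension to a calendar date.
    .withColumn("listen_date", F.to_date("listen_timestamp"))
    # Collapse individual events into one row per podcast per day.
    .groupBy("podcast_id", "title", "category", "listen_date")
    .agg(F.count("*").alias("daily_listens"))
)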

3 Engineering Features

Engineering Trending Features

This code translates raw daily data into intelligent features that power the trending algorithm. It defines and calculates two key metrics that measure a podcast's performance from different angles to create a holistic view of what is truly "trending." A sketch of one possible implementation follows the list below.


  • Performance Tracking: It first defines a Window partitioned by podcast_id, creating an analytical frame to independently track the performance history of each podcast over time.

  • Popularity Score: It calculates a 7-day rolling average of listens. This metric provides a stable measure of a podcast's recent, consistent performance, smoothing out daily spikes to understand its core listenership.

  • Momentum Score: It calculates the week-over-week percentage growth. This metric is designed to identify podcasts that are accelerating in popularity, giving a higher score to those with rapid, recent growth.

  • Emerging Podcast Boost: The logic includes a crucial when clause that handles cases where a podcast had zero listens the previous week, smartly assigning a high growth score to reward new activity and help surface emerging content.
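
A sketch of one way these features could be expressed; the exact window bounds, the lag-based previous-week comparison, and the boost value of 100 are illustrative assumptions rather than the project's precise formulas.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Analytical frame: each podcast's own history, ordered by date.
podcast_window = Window.partitionBy("podcast_id").orderBy("listen_date")
# Rolling frame covering the current day and the six preceding days.
rolling_7d = podcast_window.rowsBetween(-6, 0)

features_df = (
    daily_listens_df
    # Popularity: 7-day rolling average of daily listens.
    .withColumn("popularity_score", F.avg("daily_listens").over(rolling_7d))
    # Listens exactly one week earlier, used as the previous-week baseline.
    .withColumn("listens_last_week", F.lag("daily_listens", 7).over(podcast_window))
    # Momentum: week-over-week percentage growth, with a boost when the
    # previous week had no listens (the emerging-podcast case).
    .withColumn(
        "momentum_score",
        F.when(
            F.col("listens_last_week").isNull() | (F.col("listens_last_week") == 0),
            F.lit(100.0),
        ).otherwise(
            (F.col("daily_listens") - F.col("listens_last_week"))
            / F.col("listens_last_week") * 100.0
        ),
    )
)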

4 Trending Charts

Final Scoring & Chart Generation

This final block of code takes the engineered features and translates them into the user-facing "Trending Now" charts. It applies the scoring formula and then generates both an overall and a per-category ranking based on the most current data available; a sketch of this logic follows the list below.


  • Weighted Score Calculation: It first calculates the final trending_score by applying a weighted formula that combines the popularity and momentum scores. Per the business logic, momentum is weighted more heavily (60%) than raw popularity (40%) to ensure that new and accelerating podcasts are prominently featured.

  • Current Data Filtering: To ensure the chart reflects what is "trending now," the code identifies the single most recent date in the dataset. It then filters the entire dataset to only include scores from that specific day for the final ranking process.

  • Overall "Trending Now" Chart: A straightforward orderBy() in descending order is applied to the trending_score. This ranks all podcasts against each other to produce the primary, site-wide "Top 20 Trending Now" list.

  • Per-Category Chart: To provide more granular discovery, a Window function is used to partitionBy("category"). The rank() function is then applied over this window to rank podcasts within their own genre, allowing the creation of "Top 5" lists for categories like Technology or True Crime.
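
A sketch of this final stage, reusing the hypothetical features_df from the previous step; the 60/40 weighting and the top-20 and top-5 cutoffs follow the description above, while the column names remain illustrative.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Weighted score: momentum 60%, popularity 40%.
scored_df = features_df.withColumn(
    "trending_score",
    0.6 * F.col("momentum_score") + 0.4 * F.col("popularity_score"),
)

# Keep only the most recent date so the chart reflects what is trending now.
latest_date = scored_df.agg(F.max("listen_date")).collect()[0][0]
current_df = scored_df.filter(F.col("listen_date") == latest_date)

# Overall site-wide chart: top 20 podcasts by trending score.
top_20_overall = current_df.orderBy(F.col("trending_score").desc()).limit(20)

# Per-category chart: rank podcasts within their own genre and keep the top 5.
category_window = Window.partitionBy("category").orderBy(F.col("trending_score").desc())
top_5_per_category = (
    current_df
    .withColumn("category_rank", F.rank().over(category_window))
    .filter(F.col("category_rank") <= 5)
)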
