A regression challenge predicting how long users will listen to podcast episodes based on user behavior, content features, temporal patterns, and engagement history.
This competition involved predicting podcast listening duration - a valuable metric for content recommendation systems and platform optimization. The dataset included user listening history, podcast metadata (genre, duration, release date), temporal features (time of day, day of week), and user engagement patterns. The challenge required handling time series data and understanding user behavior patterns.
Time series features proved crucial: users show strong temporal patterns in their podcast consumption. Features like "average listening time in the past hour" and "same time yesterday" significantly improved predictions. Podcast length also interacted non-linearly with listening time: short podcasts had high completion rates, while longer ones showed far more variation in how long users actually listened.
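The two temporal features named above can be sketched in pandas. The toy log, column names, and window choices below are my own illustration, not the competition's actual schema:

```python
import pandas as pd

# Hypothetical listening log (the real competition schema differed):
# one row per play event, per user.
df = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "timestamp": pd.to_datetime([
        "2024-03-01 08:00", "2024-03-01 08:40", "2024-03-02 08:05",
        "2024-03-01 21:00", "2024-03-02 21:10",
    ]),
    "listen_minutes": [30.0, 12.0, 28.0, 45.0, 40.0],
}).sort_values(["user_id", "timestamp"])

# "Average listening time in the past hour", per user. closed="left"
# excludes the current row, so the feature uses strictly past data only.
df = df.set_index("timestamp")
df["avg_last_hour"] = (
    df.groupby("user_id")["listen_minutes"]
      .transform(lambda s: s.rolling("1h", closed="left").mean())
)
df = df.reset_index()

# "Same time yesterday": shift each play 24h forward, then asof-join it
# back onto the log within a 30-minute tolerance.
prev = df[["user_id", "timestamp", "listen_minutes"]].copy()
prev["timestamp"] = prev["timestamp"] + pd.Timedelta(days=1)
prev = prev.rename(columns={"listen_minutes": "same_time_yesterday"})
df = pd.merge_asof(
    df.sort_values("timestamp"),
    prev.sort_values("timestamp"),
    on="timestamp", by="user_id",
    tolerance=pd.Timedelta("30min"), direction="nearest",
)
```

Both constructions only look backwards in time, which is what keeps such features leakage-free when the model is evaluated on future data.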
Unfortunately, a submission file formatting error cost me a top 10% finish; the model's actual performance was much stronger than the final rank suggests. It was a painful but valuable lesson in validating submissions and double-checking output formats before uploading. I now build automated validation checks into every competition submission and production pipeline.
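A minimal version of the kind of submission check I mean might look like this; the file names, column names, and target name are hypothetical stand-ins:

```python
import pandas as pd

def validate_submission(path, sample_path, target_col="listening_minutes"):
    """Check a submission file against the competition's sample submission
    before uploading. Names here are illustrative, not the real schema."""
    sub = pd.read_csv(path)
    sample = pd.read_csv(sample_path)

    errors = []
    if list(sub.columns) != list(sample.columns):
        errors.append(f"columns {list(sub.columns)} != expected {list(sample.columns)}")
    if len(sub) != len(sample):
        errors.append(f"row count {len(sub)} != expected {len(sample)}")
    id_col = sample.columns[0]
    if id_col in sub.columns and set(sub[id_col]) != set(sample[id_col]):
        errors.append("id column does not match the sample submission")
    if target_col in sub.columns:
        if sub[target_col].isna().any():
            errors.append("predictions contain NaN")
        if (sub[target_col] < 0).any():
            errors.append("negative listening times predicted")
    return errors  # empty list means the file looks safe to submit
```

Running a check like this as the last pipeline step, and failing loudly on a non-empty error list, is cheap insurance against exactly the mistake that cost me here.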
Beyond the submission error lesson, this competition deepened my understanding of time series machine learning. Unlike traditional tabular data, time series requires careful consideration of temporal dependencies, proper train/validation splits, and feature engineering that respects time ordering. These principles apply broadly to forecasting problems in production systems.
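The proper train/validation splitting mentioned above can be done with scikit-learn's `TimeSeriesSplit`, which guarantees every validation row comes after every training row. The data below is a toy stand-in for rows already sorted chronologically:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy chronologically-ordered data; sorting by time is the prerequisite
# for any time-aware split.
X = np.arange(100).reshape(-1, 1).astype(float)
y = X.ravel() * 0.5 + np.sin(X.ravel() / 7.0)

tscv = TimeSeriesSplit(n_splits=4)
for fold, (train_idx, val_idx) in enumerate(tscv.split(X)):
    # Every training index precedes every validation index:
    # no information leaks from the future into the past.
    assert train_idx.max() < val_idx.min()
    print(f"fold {fold}: train up to row {train_idx.max()}, "
          f"validate rows {val_idx.min()}-{val_idx.max()}")
```

Contrast this with ordinary k-fold cross-validation, which shuffles rows and would let the model "see the future" during training, inflating validation scores.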
The advanced stacking and meta-modeling techniques I developed here have become part of my standard toolkit. As a Data and DevOps Engineer, I now build automated ML pipelines that incorporate these techniques with proper validation and error checking at every step.
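A hand-rolled sketch of time-aware stacking, with synthetic data and scikit-learn base models standing in for the real competition setup. I loop manually because scikit-learn's built-in `StackingRegressor` expects a cv scheme in which every row lands in exactly one test fold, which `TimeSeriesSplit` does not provide:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit

# Synthetic stand-in for the competition features and target.
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)

base_models = [
    GradientBoostingRegressor(random_state=0),
    RandomForestRegressor(n_estimators=100, random_state=0),
]

# Build out-of-fold (OOF) predictions with time-ordered folds, so the
# meta-model only sees predictions each base model made on data strictly
# after its own training window.
tscv = TimeSeriesSplit(n_splits=5)
oof = np.full((len(y), len(base_models)), np.nan)
for train_idx, val_idx in tscv.split(X):
    for j, model in enumerate(base_models):
        model.fit(X[train_idx], y[train_idx])
        oof[val_idx, j] = model.predict(X[val_idx])

# The earliest training block never receives OOF predictions; drop it
# before fitting the ridge meta-model on the stacked features.
mask = ~np.isnan(oof).any(axis=1)
meta = Ridge(alpha=1.0).fit(oof[mask], y[mask])

# At inference time: refit the base models on all data, then stack
# their predictions and pass them through the meta-model.
test_features = np.column_stack([m.fit(X, y).predict(X) for m in base_models])
final_preds = meta.predict(test_features)
```

A linear meta-model like ridge is a common choice here: it blends the base learners without enough capacity to overfit their correlated errors.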