A regression challenge predicting accident risk scores based on driver behavior, vehicle characteristics, weather conditions, and road features to improve road safety analytics.
This Playground Series competition focused on predicting road accident risk, a crucial application for insurance companies, autonomous vehicle systems, and traffic safety departments. The dataset included diverse features such as driver demographics, driving patterns, environmental conditions, and historical accident data. The goal was to build a robust regression model that could accurately predict risk scores across various scenarios.
Weather conditions and time-of-day interactions proved to be highly predictive. Creating features that captured the combined effect of poor weather during rush hours significantly improved model performance. Additionally, historical accident frequency per driver was the single most important feature, emphasizing the value of longitudinal data in risk assessment.
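The two feature ideas above can be sketched in plain Python. This is a minimal illustration rather than the pipeline I actually ran: the keys (`driver_id`, `weather`, `hour`, `had_accident`), the set of poor-weather labels, and the rush-hour windows are all assumptions, and the rows are assumed to be sorted chronologically so that the per-driver rate only looks at earlier records.

```python
from collections import defaultdict

# Assumed labels and rush-hour windows; adjust to the real dataset.
POOR_WEATHER = {"rain", "snow", "fog"}
RUSH_HOURS = set(range(7, 10)) | set(range(16, 19))  # 07-09 and 16-18


def add_risk_features(rows):
    """Return copies of `rows` with two extra features.

    rows: chronologically ordered list of dicts with the hypothetical keys
    'driver_id', 'weather', 'hour' (0-23) and 'had_accident' (0 or 1).
    """
    seen_trips = defaultdict(int)      # earlier records per driver
    seen_accidents = defaultdict(int)  # earlier accidents per driver
    out = []
    for row in rows:
        driver = row["driver_id"]
        trips = seen_trips[driver]
        new_row = dict(row)  # copy so the input is not mutated

        # Interaction feature: poor weather AND rush hour at the same time.
        new_row["bad_weather_rush_hour"] = int(
            row["weather"] in POOR_WEATHER and row["hour"] in RUSH_HOURS
        )

        # Historical accident rate, computed from earlier rows only so the
        # current row's outcome does not leak into its own feature.
        new_row["prior_accident_rate"] = (
            seen_accidents[driver] / trips if trips else 0.0
        )

        out.append(new_row)
        seen_trips[driver] += 1
        seen_accidents[driver] += row["had_accident"]
    return out
```

Computing the rate from earlier records only is a deliberate choice: a plain per-driver mean over the whole dataset would include the very outcome the model is trying to predict.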
This competition reinforced my understanding of regression tasks with skewed target distributions. Handling the long tail of high-risk cases required a careful validation strategy and custom loss functions. The experience also highlighted the importance of ensemble diversity: combining models with different strengths (CatBoost's categorical handling, XGBoost's regularization, LightGBM's speed) produced more robust predictions than any single model.
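As a rough sketch of what a tail-aware loss and a simple blend can look like, the snippet below up-weights squared error on high-risk rows and averages several models' predictions. The `threshold` and `tail_weight` values and the function names are hypothetical, chosen for illustration and not the exact loss or weights used in the competition. The gradient/Hessian pair is the general form that gradient-boosting libraries accept for custom objectives, though the exact callback signature differs between libraries.

```python
def tail_weighted_mse(y_true, y_pred, threshold=0.7, tail_weight=3.0):
    """Weighted MSE: rows whose true risk exceeds `threshold` count more."""
    total, weight_sum = 0.0, 0.0
    for t, p in zip(y_true, y_pred):
        w = tail_weight if t > threshold else 1.0
        total += w * (p - t) ** 2
        weight_sum += w
    return total / weight_sum


def tail_weighted_grad_hess(y_true, y_pred, threshold=0.7, tail_weight=3.0):
    """Per-row first and second derivatives of w * (pred - true)^2
    with respect to the prediction."""
    grad, hess = [], []
    for t, p in zip(y_true, y_pred):
        w = tail_weight if t > threshold else 1.0
        grad.append(2.0 * w * (p - t))
        hess.append(2.0 * w)
    return grad, hess


def blend(predictions, weights):
    """Weighted average of several models' predictions, row by row.

    predictions: one list of predictions per model, all the same length.
    weights: one non-negative weight per model (normalized internally).
    """
    total = sum(weights)
    return [
        sum(w * p for w, p in zip(weights, row)) / total
        for row in zip(*predictions)
    ]
```

In practice the blend weights would be tuned on out-of-fold predictions and scored with the same tail-weighted metric, so that the ensemble is judged on the cases the custom loss is meant to protect.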
From a DevOps perspective, I also experimented with containerizing the training pipeline with Docker to make it reproducible and easier to deploy in production environments, a critical skill for taking ML models from notebooks to real-world systems.