Robin Kras

Rob Kras

Quietly working away...

About

Hello! I'm Rob, a Computer Scientist passionate about the intersection of mathematics, algorithms, and programming.
I specialize in Data Science and Artificial Intelligence, with a proven track record in machine learning competitions.

Education

MSc Computer Science
Leiden University (2024-2025)
Specialization: Data Science & AI
Thesis: Cross-Modal Sound Symbolism in Vision-Language Models

BSc Computer Science
Vrije Universiteit Amsterdam (2020-2023)
Minor: Data Science

Technical Stack

Languages: Python, C/C++, Scala, JavaScript
ML/AI: TensorFlow, PyTorch, Scikit-learn, Pandas, NumPy
Tools: Git, SQL, Jupyter, HuggingFace, Docker
Specializations: Machine Learning, Deep Learning, NLP, Computer Vision

Research

Multimodal Sound Symbolism

Grade: 8/10

My master's graduation research project investigating cross-modal sound symbolism in vision-language models. This thesis explored how modern AI systems understand the relationship between sounds and visual elements, examining the phenomenon where certain sounds are consistently associated with specific visual properties across cultures. The research combined computational linguistics, computer vision, and cognitive science to analyze how multimodal AI models process and represent these sound-meaning associations.

Machine Learning Portfolio

A curated collection of my competitive machine learning projects, showcasing technical growth and problem-solving evolution.

Titanic Survival Prediction

Rank 2,331 / 15,346

My first experience with Kaggle, albeit on a practice problem. I learned feature engineering fundamentals, how to handle missing data, and the power of XGBoost over basic logistic regression. This challenge sparked my interest in machine learning; having a competitive yet educational place to apply it was invaluable. In the second iteration, I applied a weight scaling technique to emphasize passengers who were likely to be saved, which improved my score significantly. Removing features that become redundant once the ship sinks, or that are highly correlated with other features, helped as well.
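
A minimal sketch of that second iteration, assuming the standard Kaggle Titanic columns; the specific weighting rule (up-weighting women and first-class passengers) is an illustrative assumption rather than the exact scheme used:

import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

train = pd.read_csv("train.csv")  # Kaggle Titanic training data

# Basic preparation: encode sex, fill missing ages with the median.
train["Sex"] = train["Sex"].map({"male": 0, "female": 1})
train["Age"] = train["Age"].fillna(train["Age"].median())

features = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare"]
X = train[features].fillna(0)
y = train["Survived"]

# Assumed weighting rule: emphasize passengers likely to be saved.
weights = np.where((train["Sex"] == 1) | (train["Pclass"] == 1), 2.0, 1.0)

X_tr, X_val, y_tr, y_val, w_tr, _ = train_test_split(
    X, y, weights, test_size=0.2, random_state=42, stratify=y)

model = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.05)
model.fit(X_tr, y_tr, sample_weight=w_tr)
print(accuracy_score(y_val, model.predict(X_val)))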

Spaceship Titanic

Rank 613 / 1,816

This was my second experience with Kaggle, building on the Titanic practice problem. In the first iteration I tried creating some new features myself, but this didn't give me any favorable results. After looking through the forums, I learned that removing related features and grid-searching hyperparameters led to noticeably higher performance. On revisiting the problem, however, it became apparent that my model was overfitting because I had added too many features. To counter this, I removed related features (or combined them into a single feature) and reran the experiment. This gave me a higher score and a valuable lesson about overfitting and feature engineering.
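
A minimal sketch of those two fixes, assuming the public Spaceship Titanic columns; the reduced feature set and the parameter grid are illustrative assumptions:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

train = pd.read_csv("train.csv")  # Kaggle Spaceship Titanic training data

# Combine the five closely related spending columns into a single feature.
spend_cols = ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]
train["TotalSpend"] = train[spend_cols].fillna(0).sum(axis=1)

X = pd.DataFrame({
    "Age": train["Age"].fillna(train["Age"].median()),
    "CryoSleep": train["CryoSleep"].fillna(False).astype(int),
    "VIP": train["VIP"].fillna(False).astype(int),
    "TotalSpend": train["TotalSpend"],
})
y = train["Transported"].astype(int)

# Grid-search hyperparameters with 5-fold cross-validation.
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [200, 500], "max_depth": [6, 10, None]},
    cv=5,
    scoring="accuracy",
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 4))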

House Prices Prediction

Rank 37 / 3,935 (Top 1%)

The final practice problem I tackled before moving on to playground competitions! Here I had a major breakthrough in regression techniques and learned about balancing competitive performance with real-world applicability. This was the first problem where I incorporated domain knowledge into my feature engineering, which paid off well. Using SHAP, I was able to identify the most important features and focus on those. In the end, I discovered that this competition suffered from data leakage, which I exploited to achieve an almost perfect score.
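
A minimal sketch of the SHAP step, assuming the Kaggle House Prices columns and, for brevity, only the numeric features:

import pandas as pd
import shap
from xgboost import XGBRegressor

train = pd.read_csv("train.csv")  # Kaggle House Prices training data
X = train.select_dtypes(include="number").drop(columns=["Id", "SalePrice"])
y = train["SalePrice"]

model = XGBRegressor(n_estimators=500, learning_rate=0.05, max_depth=4)
model.fit(X, y)

# TreeExplainer computes exact SHAP values for tree ensembles.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Rank features by mean absolute SHAP value and keep the strongest ones.
importance = pd.Series(abs(shap_values).mean(axis=0), index=X.columns)
print(importance.sort_values(ascending=False).head(15))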

Rainfall Prediction

Rank 5 / 2,529 (Top 0.2%)

In this competition I experimented with a variety of techniques, including feature engineering, K-fold cross-validation, and ensembling the best public submissions. I learned that simpler algorithms (KNN) can outperform complex ensembles when properly optimized. I also learned a painful lesson about submission file formats, and an even worse one about competition logistics: Kaggle expects you to select your best submission as the final one, rather than relying on an ensemble of your best submissions. This sadly cost me a very high final position, though my highest rank during the competition was 5th.
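
A minimal sketch of the "simple model, properly tuned" point: a scaled KNN evaluated with K-fold cross-validation. The column names ("id", "rainfall") and the candidate values of k are assumptions for illustration:

import pandas as pd
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

train = pd.read_csv("train.csv")  # Kaggle rainfall competition training data
X = train.drop(columns=["id", "rainfall"]).fillna(train.median(numeric_only=True))
y = train["rainfall"]

cv = KFold(n_splits=5, shuffle=True, random_state=42)

# Scaling matters for distance-based models like KNN, so it goes in the pipeline.
for k in [5, 15, 25, 51, 101]:
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    auc = cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()
    print(k, round(auc, 4))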

Top 3 Fertilizer Types

Rank 732 / 2,650

This competition was about predicting the top three fertilizer types for a given crop and soil condition. I tried a variety of approaches, including advanced feature engineering, ensemble methods, and stacking. Ultimately, I settled on an ensemble of CatBoost and XGBoost models whose predicted probabilities I average.
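
A minimal sketch of that averaging ensemble with top-3 predictions; the column names ("id", "Fertilizer Name") and the one-hot preprocessing are assumptions based on the public competition data:

import numpy as np
import pandas as pd
from catboost import CatBoostClassifier
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

target = "Fertilizer Name"
le = LabelEncoder()
y = le.fit_transform(train[target])

X = pd.get_dummies(train.drop(columns=["id", target]), dtype=int)
X_test = pd.get_dummies(test.drop(columns=["id"]), dtype=int).reindex(columns=X.columns, fill_value=0)

cat = CatBoostClassifier(iterations=500, verbose=0).fit(X, y)
xgb = XGBClassifier(n_estimators=500, learning_rate=0.05).fit(X, y)

# Average the predicted class probabilities, then keep the three most likely labels.
proba = (cat.predict_proba(X_test) + xgb.predict_proba(X_test)) / 2
top3 = np.argsort(-proba, axis=1)[:, :3]
predictions = [" ".join(le.inverse_transform(row)) for row in top3]
print(predictions[:5])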

Podcast Listening Behavior

Rank 536 / 3,310

In this competition I used a variety of advanced approaches, including meta-modelling, K-fold cross-validation, and finetuning ensemble and stacking methods. The task revolved around exploring user behavior through temporal features and listening patterns. I learned a lot about time series data and how to preprocess it properly for machine learning models. Unfortunately, a submission file error cost me a top 10% ranking, so the rank shown on Kaggle is not representative of my actual performance, but I still learned a lot from the experience.
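
A minimal sketch of the meta-modelling idea: out-of-fold predictions from base regressors become the features for a second-stage model. The base models and the synthetic stand-in data are illustrative assumptions, not the competition setup:

import numpy as np
from lightgbm import LGBMRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold
from xgboost import XGBRegressor

def out_of_fold(model, X, y, cv):
    """Predict each row from the fold in which it was held out."""
    oof = np.zeros(len(X))
    for train_idx, valid_idx in cv.split(X):
        model.fit(X[train_idx], y[train_idx])
        oof[valid_idx] = model.predict(X[valid_idx])
    return oof

# Synthetic stand-in for the preprocessed features and listening-time target.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=2000)

cv = KFold(n_splits=5, shuffle=True, random_state=42)
base_models = [LGBMRegressor(n_estimators=300, verbosity=-1),
               XGBRegressor(n_estimators=300, learning_rate=0.05)]

# Stack the out-of-fold predictions and fit a simple meta-model on top.
meta_features = np.column_stack([out_of_fold(m, X, y, cv) for m in base_models])
meta_model = Ridge().fit(meta_features, y)
print(meta_model.coef_)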

Personality Type Prediction

Rank 1,379 / 4,067

This was a horrid competition about predicting personality types from very limited data. Since the data was too limited to train a proper model, I resorted to oversampling techniques like SMOTE to generate synthetic data points. I also stacked multiple models to improve performance and used Bayesian optimization to finetune hyperparameters. Sadly, the competition was not very well designed, and I learned that some competitions are simply not worth the effort. As in an earlier competition, I forgot to select my best submission as the final one, which cost me a higher ranking. Nevertheless, I learned a lot about handling imbalanced datasets and advanced ensemble techniques.
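
A minimal sketch of the stacking part, with a logistic-regression meta-model sitting on out-of-fold base predictions; the base models and the synthetic stand-in data are illustrative assumptions:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

# Small, imbalanced stand-in for the competition data.
X, y = make_classification(n_samples=2000, n_features=15, weights=[0.8, 0.2],
                           random_state=42)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=300, random_state=42)),
        ("xgb", XGBClassifier(n_estimators=300, learning_rate=0.05)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,  # out-of-fold predictions feed the meta-model
)
print(cross_val_score(stack, X, y, cv=5, scoring="accuracy").mean())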

Bank Customer Analysis

Rank 576 / 3,367

This competition was incredibly fun! It involved predicting whether a client will subscribe to a bank term deposit. By creating advanced features, applying stratified K-folds, tuning with Optuna, and building an ensemble of models (LightGBM, XGBoost, CatBoost), I was able to achieve a high ranking despite only participating during the first 7 days. Pretty cool.
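
A minimal sketch of the Optuna + stratified K-fold part of that setup, tuning LightGBM on ROC AUC; the search ranges and the synthetic stand-in data are illustrative assumptions:

import optuna
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Stand-in for the preprocessed bank-marketing features and subscription target.
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.88, 0.12],
                           random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

def objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 200, 1000),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.2, log=True),
        "num_leaves": trial.suggest_int("num_leaves", 16, 128),
    }
    model = LGBMClassifier(**params, verbosity=-1)
    return cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=30)
print(study.best_params)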

Credit Card Fraud Detector

No rank, open dataset

In this instance I decided to tackle a problem with highly imbalanced data and clear real-world relevance. Because cases of fraudulent activity are rare, training a model to find fraud poses a challenge for traditional machine learning algorithms. Using advanced oversampling techniques like SMOTE, I was able to turn the imbalanced dataset into a balanced one, although I made the mistake of applying SMOTE before setting up the K-folds. I learned later on that oversampling must always be applied within the training folds to avoid data leakage. This experience taught me a valuable lesson about using SMOTE correctly and about staying on my toes!
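
A minimal sketch of the corrected workflow: SMOTE wrapped in an imbalanced-learn pipeline so it runs only on the training folds during cross-validation. The data is a synthetic stand-in with a fraud-like class imbalance:

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Heavily imbalanced stand-in data (~1% positives).
X, y = make_classification(n_samples=20000, n_features=20, weights=[0.99, 0.01],
                           random_state=42)

pipeline = Pipeline(steps=[
    ("smote", SMOTE(random_state=42)),             # applied to the training fold only
    ("clf", RandomForestClassifier(n_estimators=300, random_state=42)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring="average_precision")
print(scores.mean())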

Many more to come!