⭐ 1. Introduction & Overview¶
Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.
With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.
🔹 2. Import Libraries & Set Up¶
In [65]:
# General
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Machine Learning
import xgboost as xg
from sklearn.model_selection import train_test_split, GridSearchCV, KFold
from sklearn.metrics import accuracy_score, f1_score, recall_score, mean_absolute_error, mean_squared_error, r2_score, root_mean_squared_error, roc_auc_score
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.preprocessing import StandardScaler, LabelEncoder
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam
from imblearn.over_sampling import SMOTE
# Feature Importance & Explainability
import shap
# Settings
import warnings
warnings.filterwarnings("ignore")
# Set random seed for reproducibility
SEED = 42
np.random.seed(SEED)
print("Libraries loaded. Ready to go!")
Libraries loaded. Ready to go!
🔹 3. Load & Explore Data¶
In [66]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
In [67]:
train.head()
Out[67]:
Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |
5 rows × 81 columns
In [68]:
train.shape
Out[68]:
(1460, 81)
In [69]:
train.isnull().sum()
Out[69]:
Id 0 MSSubClass 0 MSZoning 0 LotFrontage 259 LotArea 0 ... MoSold 0 YrSold 0 SaleType 0 SaleCondition 0 SalePrice 0 Length: 81, dtype: int64
In [70]:
# Quick summary of dataset
train.describe()
train.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1460 entries, 0 to 1459 Data columns (total 81 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Id 1460 non-null int64 1 MSSubClass 1460 non-null int64 2 MSZoning 1460 non-null object 3 LotFrontage 1201 non-null float64 4 LotArea 1460 non-null int64 5 Street 1460 non-null object 6 Alley 91 non-null object 7 LotShape 1460 non-null object 8 LandContour 1460 non-null object 9 Utilities 1460 non-null object 10 LotConfig 1460 non-null object 11 LandSlope 1460 non-null object 12 Neighborhood 1460 non-null object 13 Condition1 1460 non-null object 14 Condition2 1460 non-null object 15 BldgType 1460 non-null object 16 HouseStyle 1460 non-null object 17 OverallQual 1460 non-null int64 18 OverallCond 1460 non-null int64 19 YearBuilt 1460 non-null int64 20 YearRemodAdd 1460 non-null int64 21 RoofStyle 1460 non-null object 22 RoofMatl 1460 non-null object 23 Exterior1st 1460 non-null object 24 Exterior2nd 1460 non-null object 25 MasVnrType 588 non-null object 26 MasVnrArea 1452 non-null float64 27 ExterQual 1460 non-null object 28 ExterCond 1460 non-null object 29 Foundation 1460 non-null object 30 BsmtQual 1423 non-null object 31 BsmtCond 1423 non-null object 32 BsmtExposure 1422 non-null object 33 BsmtFinType1 1423 non-null object 34 BsmtFinSF1 1460 non-null int64 35 BsmtFinType2 1422 non-null object 36 BsmtFinSF2 1460 non-null int64 37 BsmtUnfSF 1460 non-null int64 38 TotalBsmtSF 1460 non-null int64 39 Heating 1460 non-null object 40 HeatingQC 1460 non-null object 41 CentralAir 1460 non-null object 42 Electrical 1459 non-null object 43 1stFlrSF 1460 non-null int64 44 2ndFlrSF 1460 non-null int64 45 LowQualFinSF 1460 non-null int64 46 GrLivArea 1460 non-null int64 47 BsmtFullBath 1460 non-null int64 48 BsmtHalfBath 1460 non-null int64 49 FullBath 1460 non-null int64 50 HalfBath 1460 non-null int64 51 BedroomAbvGr 1460 non-null int64 52 KitchenAbvGr 1460 non-null int64 53 KitchenQual 1460 non-null object 54 TotRmsAbvGrd 1460 non-null int64 55 Functional 1460 non-null object 56 Fireplaces 1460 non-null int64 57 FireplaceQu 770 non-null object 58 GarageType 1379 non-null object 59 GarageYrBlt 1379 non-null float64 60 GarageFinish 1379 non-null object 61 GarageCars 1460 non-null int64 62 GarageArea 1460 non-null int64 63 GarageQual 1379 non-null object 64 GarageCond 1379 non-null object 65 PavedDrive 1460 non-null object 66 WoodDeckSF 1460 non-null int64 67 OpenPorchSF 1460 non-null int64 68 EnclosedPorch 1460 non-null int64 69 3SsnPorch 1460 non-null int64 70 ScreenPorch 1460 non-null int64 71 PoolArea 1460 non-null int64 72 PoolQC 7 non-null object 73 Fence 281 non-null object 74 MiscFeature 54 non-null object 75 MiscVal 1460 non-null int64 76 MoSold 1460 non-null int64 77 YrSold 1460 non-null int64 78 SaleType 1460 non-null object 79 SaleCondition 1460 non-null object 80 SalePrice 1460 non-null int64 dtypes: float64(3), int64(35), object(43) memory usage: 924.0+ KB
🔹 4. Data Visualization & EDA¶
In [71]:
float_cols = [col for col in train.columns if train[col].dtype == "float64"]
cols_per_row = 3
num_plots = len(float_cols)
rows = (num_plots // cols_per_row) + (num_plots % cols_per_row > 0)
fig, axes = plt.subplots(rows, cols_per_row, figsize=(15, 5 * rows))
axes = axes.flatten()
for idx, col in enumerate(float_cols):
sns.histplot(train[col], bins=50, kde=True, ax=axes[idx])
axes[idx].set_title(f"Distribution of {col}")
for i in range(idx + 1, len(axes)):
fig.delaxes(axes[i])
plt.tight_layout()
plt.show()
In [72]:
categorical_features = train.select_dtypes(include=['object']).columns
num_features = len(categorical_features)
cols = 3
rows = (num_features // cols) + (num_features % cols > 0)
# Create subplots
fig, axes = plt.subplots(rows, cols, figsize=(15, rows * 5))
axes = axes.flatten()
for i, feature in enumerate(categorical_features):
train[feature].value_counts().plot.pie(
autopct='%1.1f%%', ax=axes[i], startangle=90, cmap="viridis"
)
axes[i].set_title(feature)
axes[i].set_ylabel("")
# Hide any unused subplots
for j in range(i + 1, len(axes)):
fig.delaxes(axes[j])
plt.tight_layout()
plt.show()
In [73]:
heatmap_train = pd.DataFrame()
for col in train.columns:
if train[col].dtype == "float64" or train[col].dtype == "int64":
heatmap_train[col] = train[col]
plt.figure(figsize=(30,12))
sns.heatmap(heatmap_train.corr(), annot=True, cmap="coolwarm")
plt.title("Feature Correlation Matrix")
plt.show()
In [74]:
heatmap_train = train.select_dtypes(include=["float64", "int64"])
corr_matrix = heatmap_train.corr()
threshold = 0.75
high_corr_pairs = (
corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
.stack()
.reset_index()
)
high_corr_pairs.columns = ["Feature 1", "Feature 2", "Correlation"]
high_corr_pairs = high_corr_pairs[high_corr_pairs["Correlation"].abs() > threshold]
plt.figure(figsize=(30, 12))
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm")
plt.title("Feature Correlation Matrix")
plt.show()
print("Highly correlated feature pairs (above threshold):")
print(high_corr_pairs)
Highly correlated feature pairs (above threshold): Feature 1 Feature 2 Correlation 174 OverallQual SalePrice 0.790982 225 YearBuilt GarageYrBlt 0.825667 378 TotalBsmtSF 1stFlrSF 0.819530 478 GrLivArea TotRmsAbvGrd 0.825489 637 GarageCars GarageArea 0.882475
In [75]:
l1 = high_corr_pairs['Feature 1'].tolist()
l2 = high_corr_pairs['Feature 2'].tolist()
interesting_features = list(set(l1+l2))
interesting_features.remove('SalePrice')
print(interesting_features)
['YearBuilt', 'GarageArea', 'TotRmsAbvGrd', 'GarageYrBlt', 'TotalBsmtSF', '1stFlrSF', 'GarageCars', 'OverallQual', 'GrLivArea']
🔹 5. Feature Engineering¶
In [76]:
train.columns = train.columns.str.strip()
test.columns = test.columns.str.strip()
In [77]:
print(f"Train set, null count: \n{train.isnull().sum()}")
print("\n")
print(f"Test set, null count: \n{test.isnull().sum()}")
Train set, null count: Id 0 MSSubClass 0 MSZoning 0 LotFrontage 259 LotArea 0 ... MoSold 0 YrSold 0 SaleType 0 SaleCondition 0 SalePrice 0 Length: 81, dtype: int64 Test set, null count: Id 0 MSSubClass 0 MSZoning 4 LotFrontage 227 LotArea 0 ... MiscVal 0 MoSold 0 YrSold 0 SaleType 1 SaleCondition 0 Length: 80, dtype: int64
In [78]:
outliers = pd.concat([
train[(train['OverallQual'] == 4) & (train['SalePrice'] > 2e5)],
train[(train['OverallQual'] == 8) & (train['SalePrice'] > 5e5)],
train[(train['OverallQual'] == 10) & (train['SalePrice'] > 7e5)],
train[(train['GrLivArea'] > 4000)],
train[(train['OverallCond'] == 2) & (train['SalePrice'] > 3e5)],
train[(train['OverallCond'] == 5) & (train['SalePrice'] > 7e5)],
train[(train['OverallCond'] == 6) & (train['SalePrice'] > 7e5)]
]).sort_index().drop_duplicates()
In [79]:
train = train.drop(outliers.index)
In [80]:
train["LotFrontage"] = train.groupby("Neighborhood")["LotFrontage"].transform(
lambda x: x.fillna(x.median()))
test["LotFrontage"] = test.groupby("Neighborhood")["LotFrontage"].transform(
lambda x: x.fillna(x.median()))
for col in ('GarageType', 'GarageFinish', 'GarageQual', 'GarageCond'):
train[col] = train[col].fillna('None')
test[col] = test[col].fillna('None')
for col in ('GarageYrBlt', 'GarageArea', 'GarageCars'):
train[col] = train[col].fillna(0)
test[col] = test[col].fillna(0)
train['TotalSF'] = train['TotalBsmtSF'] + train['1stFlrSF'] + train['2ndFlrSF']
test['TotalSF'] = test['TotalBsmtSF'] + test['1stFlrSF'] + test['2ndFlrSF']
In [81]:
for col in train.columns:
if train[col].dtype == "object":
train[col] = train[col].fillna("None")
elif train[col].dtype in ["float64", "int64"]:
train[col] = train[col].fillna(train[col].mean())
for col in test.columns:
if test[col].dtype == "object":
test[col] = test[col].fillna("None")
elif test[col].dtype in ["float64", "int64"]:
test[col] = test[col].fillna(test[col].mean())
In [82]:
for col in train.columns:
if train[col].isnull().sum() > 0:
print(col)
for col in test.columns:
if test[col].isnull().sum() > 0:
print(col)
No more empty items left. Great!
In [83]:
import itertools
def create_combination_features(df, features):
combinations = itertools.combinations(features, 2)
for comb in combinations:
feature_name = "_".join(comb)
df[feature_name] = df[list(comb)].mean(axis=1)
return df
train = create_combination_features(train, interesting_features)
test = create_combination_features(test, interesting_features)
In [84]:
train.head()
Out[84]:
Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | TotalBsmtSF_1stFlrSF | TotalBsmtSF_GarageCars | TotalBsmtSF_OverallQual | TotalBsmtSF_GrLivArea | 1stFlrSF_GarageCars | 1stFlrSF_OverallQual | 1stFlrSF_GrLivArea | GarageCars_OverallQual | GarageCars_GrLivArea | OverallQual_GrLivArea | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 60 | RL | 65.0 | 8450 | Pave | None | Reg | Lvl | AllPub | ... | 856.0 | 429.0 | 431.5 | 1283.0 | 429.0 | 431.5 | 1283.0 | 4.5 | 856.0 | 858.5 |
1 | 2 | 20 | RL | 80.0 | 9600 | Pave | None | Reg | Lvl | AllPub | ... | 1262.0 | 632.0 | 634.0 | 1262.0 | 632.0 | 634.0 | 1262.0 | 4.0 | 632.0 | 634.0 |
2 | 3 | 60 | RL | 68.0 | 11250 | Pave | None | IR1 | Lvl | AllPub | ... | 920.0 | 461.0 | 463.5 | 1353.0 | 461.0 | 463.5 | 1353.0 | 4.5 | 894.0 | 896.5 |
3 | 4 | 70 | RL | 60.0 | 9550 | Pave | None | IR1 | Lvl | AllPub | ... | 858.5 | 379.5 | 381.5 | 1236.5 | 482.0 | 484.0 | 1339.0 | 5.0 | 860.0 | 862.0 |
4 | 5 | 60 | RL | 84.0 | 14260 | Pave | None | IR1 | Lvl | AllPub | ... | 1145.0 | 574.0 | 576.5 | 1671.5 | 574.0 | 576.5 | 1671.5 | 5.5 | 1100.5 | 1103.0 |
5 rows × 118 columns
In [85]:
le = LabelEncoder()
for col in train.columns:
if train[col].dtype == "object":
train[col] = le.fit_transform(train[col])
for col in test.columns:
if test[col].dtype == "object":
test[col] = le.fit_transform(test[col])
🔹 6. Model Selection¶
In [89]:
X = train.drop(columns=["Id", "SalePrice"])
X_test = test.drop(columns=['Id'])
y = train['SalePrice']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=SEED)
In [ ]:
param_grid = {
'n_estimators': [100, 200, 500],
'learning_rate': [0.01, 0.05, 0.1],
'max_depth': [3, 5, 7, 9],
'subsample': [0.8, 0.9, 1.0],
'colsample_bytree': [0.8, 0.9, 1.0],
'alpha': [0, 0.01, 0.1, 1],
'lambda': [0, 0.1, 0.5, 1],
'gamma': [0, 0.1, 0.2, 1],
'early_stopping_rounds': [5, 10, 20, 30]
}
grid_search = GridSearchCV(xg.XGBRegressor(tree_method="gpu_hist", random_state=SEED), param_grid, cv=5, n_jobs=-1)
grid_search.fit(X_train, y_train,
eval_set=[(X_train, y_train), (X_val, y_val)])
print("Best Parameters:", grid_search.best_params_)
best_params = grid_search.best_params_
In [ ]:
kf = KFold(n_splits=5, shuffle=True, random_state=42)
oof_predictions = np.zeros(len(train))
for train_idx, val_idx in kf.split(train):
X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]
model = xg.XGBRegressor(**best_params)
model.fit(X_train, y_train)
y_pred = model.predict(X_val)
oof_predictions[val_idx] = y_pred # Store out-of-fold predictions
print(f"Fold RMSE: {root_mean_squared_error(y_val, y_pred)}") # Change RMSE by accuracy/recall/f1-score for classification
final_rmse = root_mean_squared_error(y, oof_predictions)
print(f"Final Cross-Validation RMSE: {final_rmse}")
Fold RMSE: 23510.853515625 Fold RMSE: 24593.13671875 Fold RMSE: 27025.958984375 Fold RMSE: 24410.958984375 Fold RMSE: 21487.322265625 Final Cross-Validation RMSE: 24273.442706870697
In [91]:
predictions = model.predict(X_test)
output = pd.DataFrame({'Id': test['Id'], 'SalePrice': predictions})
output.to_csv('submission_xgb.csv', index=False)
print("Your submission was successfully saved!")
Your submission was successfully saved!
🔹 Experiment¶
In [ ]:
y = train["SalePrice"]
X = pd.get_dummies(train.drop(columns=["SalePrice"]))
X_test = pd.get_dummies(test)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=SEED)
model = xg.XGBRegressor(
**best_params,
random_state=SEED)
model.fit(X_train, y_train,
eval_set=[(X_train, y_train), (X_val, y_val)])
results = model.evals_result()
plt.figure(figsize=(10,7))
plt.plot(results["validation_0"]["rmse"], label="Training loss")
plt.plot(results["validation_1"]["rmse"], label="Validation loss")
plt.xlabel("Number of trees")
plt.ylabel("Loss")
plt.legend()
rms = min(results["validation_1"].values(), key=min)
rms = min(rms)
predictions = model.predict(X_test)
predictions_val = model.predict(X_val)
print(f"RMSE Score: {rms}")
output = pd.DataFrame({'Id': test['Id'], 'SalePrice': predictions})
output.to_csv('submission_experiment.csv', index=False)
print("Your submission was successfully saved!")
[0] validation_0-rmse:68626.07721 validation_1-rmse:72258.08294 [1] validation_0-rmse:62854.93875 validation_1-rmse:66772.04603 [2] validation_0-rmse:57619.17683 validation_1-rmse:61349.71447 [3] validation_0-rmse:52937.04308 validation_1-rmse:56694.73902 [4] validation_0-rmse:48650.87250 validation_1-rmse:52216.31762 [5] validation_0-rmse:44804.70744 validation_1-rmse:48836.47042 [6] validation_0-rmse:41266.45199 validation_1-rmse:45331.83450 [7] validation_0-rmse:38081.24497 validation_1-rmse:42577.41073 [8] validation_0-rmse:35209.05072 validation_1-rmse:40127.35297 [9] validation_0-rmse:32570.91283 validation_1-rmse:37972.22511 [10] validation_0-rmse:30196.15381 validation_1-rmse:35841.68373 [11] validation_0-rmse:28076.53755 validation_1-rmse:34175.90371 [12] validation_0-rmse:26120.43178 validation_1-rmse:32806.23747 [13] validation_0-rmse:24345.91098 validation_1-rmse:31408.97480 [14] validation_0-rmse:22774.49313 validation_1-rmse:30091.86559 [15] validation_0-rmse:21319.17285 validation_1-rmse:29003.32860 [16] validation_0-rmse:20004.62127 validation_1-rmse:28068.22057 [17] validation_0-rmse:18792.97963 validation_1-rmse:27298.90860 [18] validation_0-rmse:17701.94187 validation_1-rmse:26708.68356 [19] validation_0-rmse:16701.53995 validation_1-rmse:26114.34657 [20] validation_0-rmse:15741.54457 validation_1-rmse:25666.34121 [21] validation_0-rmse:14884.90701 validation_1-rmse:25248.77796 [22] validation_0-rmse:14104.05679 validation_1-rmse:24928.75005 [23] validation_0-rmse:13376.86793 validation_1-rmse:24627.60184 [24] validation_0-rmse:12751.09599 validation_1-rmse:24409.04957 [25] validation_0-rmse:12157.37256 validation_1-rmse:24277.67961 [26] validation_0-rmse:11615.93605 validation_1-rmse:24052.79719 [27] validation_0-rmse:11133.57998 validation_1-rmse:23903.31662 [28] validation_0-rmse:10673.42753 validation_1-rmse:23820.95122 [29] validation_0-rmse:10263.64468 validation_1-rmse:23745.82891 [30] validation_0-rmse:9892.41625 validation_1-rmse:23623.23728 [31] validation_0-rmse:9593.44133 validation_1-rmse:23543.44517 [32] validation_0-rmse:9253.13593 validation_1-rmse:23435.61173 [33] validation_0-rmse:8944.58223 validation_1-rmse:23408.26285 [34] validation_0-rmse:8655.50374 validation_1-rmse:23309.31023 [35] validation_0-rmse:8371.12527 validation_1-rmse:23208.41289 [36] validation_0-rmse:8141.50229 validation_1-rmse:23197.34590 [37] validation_0-rmse:7904.50226 validation_1-rmse:23172.74677 [38] validation_0-rmse:7675.71023 validation_1-rmse:23131.52820 [39] validation_0-rmse:7474.97098 validation_1-rmse:23096.30170 [40] validation_0-rmse:7298.82094 validation_1-rmse:23035.66872 [41] validation_0-rmse:7131.52418 validation_1-rmse:23006.28975 [42] validation_0-rmse:6987.01515 validation_1-rmse:23007.98037 [43] validation_0-rmse:6836.17819 validation_1-rmse:22996.30310 [44] validation_0-rmse:6690.92556 validation_1-rmse:22996.47894 [45] validation_0-rmse:6560.87242 validation_1-rmse:22972.89823 [46] validation_0-rmse:6472.61286 validation_1-rmse:22968.22877 [47] validation_0-rmse:6353.61344 validation_1-rmse:22954.01391 [48] validation_0-rmse:6230.62563 validation_1-rmse:22956.92058 [49] validation_0-rmse:6143.92705 validation_1-rmse:22956.43432 [50] validation_0-rmse:6043.65107 validation_1-rmse:22947.14629 [51] validation_0-rmse:5954.64712 validation_1-rmse:22959.06506 [52] validation_0-rmse:5866.78981 validation_1-rmse:22961.52964 [53] validation_0-rmse:5790.85335 validation_1-rmse:22960.31395 [54] validation_0-rmse:5748.21584 validation_1-rmse:22943.30217 [55] validation_0-rmse:5655.56027 validation_1-rmse:22936.91396 [56] validation_0-rmse:5586.87323 validation_1-rmse:22938.11315 [57] validation_0-rmse:5500.84083 validation_1-rmse:22923.90728 [58] validation_0-rmse:5450.36832 validation_1-rmse:22936.11754 [59] validation_0-rmse:5396.97668 validation_1-rmse:22940.63311 [60] validation_0-rmse:5326.09779 validation_1-rmse:22936.08439 [61] validation_0-rmse:5262.64006 validation_1-rmse:22937.15180 [62] validation_0-rmse:5218.94700 validation_1-rmse:22938.09131 [63] validation_0-rmse:5189.42555 validation_1-rmse:22926.26762 [64] validation_0-rmse:5139.59278 validation_1-rmse:22914.35895 [65] validation_0-rmse:5083.37675 validation_1-rmse:22914.35683 [66] validation_0-rmse:5054.13402 validation_1-rmse:22917.60134 [67] validation_0-rmse:5000.71276 validation_1-rmse:22905.02756 [68] validation_0-rmse:4956.30766 validation_1-rmse:22907.71261 [69] validation_0-rmse:4882.04750 validation_1-rmse:22915.68807 [70] validation_0-rmse:4835.71479 validation_1-rmse:22907.99910 [71] validation_0-rmse:4743.17501 validation_1-rmse:22879.63849 [72] validation_0-rmse:4707.86138 validation_1-rmse:22870.00178 [73] validation_0-rmse:4667.16330 validation_1-rmse:22873.78818 [74] validation_0-rmse:4647.10486 validation_1-rmse:22871.19728 [75] validation_0-rmse:4565.99281 validation_1-rmse:22880.82973 [76] validation_0-rmse:4549.16766 validation_1-rmse:22883.62098 [77] validation_0-rmse:4490.24718 validation_1-rmse:22883.02383 [78] validation_0-rmse:4452.70707 validation_1-rmse:22883.12231 [79] validation_0-rmse:4402.94711 validation_1-rmse:22891.65518 [80] validation_0-rmse:4354.86195 validation_1-rmse:22883.83470 [81] validation_0-rmse:4328.34795 validation_1-rmse:22892.13400 [82] validation_0-rmse:4267.65791 validation_1-rmse:22890.53066 [83] validation_0-rmse:4242.48310 validation_1-rmse:22892.74295 [84] validation_0-rmse:4223.03513 validation_1-rmse:22890.66551 [85] validation_0-rmse:4194.20143 validation_1-rmse:22888.66547 [86] validation_0-rmse:4127.06618 validation_1-rmse:22891.52383 [87] validation_0-rmse:4063.75713 validation_1-rmse:22893.77718 [88] validation_0-rmse:4027.44833 validation_1-rmse:22891.99440 [89] validation_0-rmse:3996.76496 validation_1-rmse:22889.39733 [90] validation_0-rmse:3962.44909 validation_1-rmse:22881.76672 [91] validation_0-rmse:3934.07244 validation_1-rmse:22880.11506 RMSE Score: 22870.001775942485 Your submission was successfully saved!
In [ ]:
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_val)
shap.summary_plot(shap_values, X_val)
shap.force_plot(explainer.expected_value, shap_values[0], X_val.iloc[0])
Out[ ]:
Visualization omitted, Javascript library not loaded!
Have you run `initjs()` in this notebook? If this notebook was from another user you must also trust this notebook (File -> Trust notebook). If you are viewing this notebook on github the Javascript has been stripped for security. If you are using JupyterLab this error is because a JupyterLab extension has not yet been written.
Have you run `initjs()` in this notebook? If this notebook was from another user you must also trust this notebook (File -> Trust notebook). If you are viewing this notebook on github the Javascript has been stripped for security. If you are using JupyterLab this error is because a JupyterLab extension has not yet been written.