Winning a Kaggle Competition in Python
Kaggle is the most famous platform for Data Science competitions. Taking part in such competitions allows you to work with real-world datasets, explore various machine learning problems, compete with other participants and, finally, get invaluable hands-on experience. In this course, you will learn how to approach and structure any Data Science competition. You will be able to select the correct local validation scheme and to avoid overfitting. Moreover, you will master advanced feature engineering together with model ensembling approaches. All these techniques will be practiced on Kaggle competitions datasets.
- Winning a Kaggle Competition in Python
- 1. Kaggle competitions process
- Explore test data
- 2. Dive into the Competition
- Define a competition metric
- K-fold cross-validation
- Overall validation score
- 3. Feature Engineering
- 4. Modeling
- Baseline model
- Replicate validation score
- Baseline based on the date
- Baseline based on the gradient boosting
- Hyperparameter tuning
- Grid search
- 2D grid search
- Model ensembling
- Model blending
- Model stacking I
- Model stacking II
- Final tips
- Testing Kaggle forum ideas
- Select final submissions
- Final thoughts
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
- Kaggle benefits
- Get practical experience on the real-world data
- Develop portfolio projects
- Meet a great Data Science community
- Try new domain or model type
- Keep up-to-date with the best performing methods
- Process
Explore train data
You will work with another Kaggle competition called "Store Item Demand Forecasting Challenge". In this competition, you are given 5 years of store-item sales data, and asked to predict 3 months of sales for 50 different items in 10 different stores.
To begin, let's explore the train data for this competition. For faster performance, you will work with a subset of the train data containing only a single month of history.
Your initial goal is to read the input data and take the first look at it.
Instructions
100 XP
- Import `pandas` as `pd`.
- Read train data using the pandas `read_csv()` method.
- Print the head of the train data (using the `head()` method) to see the data sample.
import pandas as pd
# Read train data
train = pd.read_csv('train.csv')
# Look at the shape of the data
print('Train shape:', train.shape)
# Look at the head() of the data
print(train.head())
Explore test data
Having looked at the train data, let's explore the test data in the "Store Item Demand Forecasting Challenge". Remember that the test dataset generally contains one column fewer than the train one.
This column, together with the output format, is presented in the sample submission file. Before making any progress in the competition, you should get familiar with the expected output.
So, let's look at the columns of the test dataset and compare them to the train columns. Additionally, let's explore the format of the sample submission. The `train` DataFrame is available in your workspace.
Instructions 1/2
50 XP
- Read the test dataset.
- Print the column names of the train and test datasets.
import pandas as pd
# Read the test data
test = pd.read_csv('test.csv')
# Print train and test columns
print('Train columns:', train.columns.tolist())
print('Test columns:', test.columns.tolist())
# Read the sample submission file
sample_submission = pd.read_csv('sample_submission.csv')
# Look at the head() of the sample submission
print(sample_submission.head())
The sample submission file consists of two columns: `id` of the observation and `sales` column for your predictions. Kaggle will evaluate your predictions on the true `sales` data for the corresponding `id`. So, it's important to keep track of the predictions by `id` before submitting them. Let's jump into the next lesson to see how to prepare a submission file!
Determine a problem type
You will keep working on the Store Item Demand Forecasting Challenge. Recall that you are given a history of store-item sales data, and asked to predict 3 months of the future sales.
Before building a model, you should determine the problem type you are addressing. The goal of this exercise is to look at the distribution of the target variable, and select the correct problem type you will be building a model for.
The `train` DataFrame is already available in your workspace. It has the target variable column called "sales". Also, `matplotlib.pyplot` is already imported as `plt`.
Instructions
50 XP
Possible Answers
- Classification.
- Regression.
- Clustering.
That's correct! The `sales` variable is continuous, so you're solving a regression problem.
plt.plot(train.sales)
plt.show()
Train a simple model
As you determined, you are dealing with a regression problem. So, now you're ready to build a model for a subsequent submission. But now, instead of building the simplest Linear Regression model as in the slides, let's build an out-of-box Random Forest model.
You will use the `RandomForestRegressor` class from the `scikit-learn` library.
Your objective is to train a Random Forest model with default parameters on the "store" and "item" features.
Instructions
100 XP
- Read the train data using `pandas`.
- Create a Random Forest object.
- Train the Random Forest model on the "store" and "item" features with "sales" as a target.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
# Read the train data
train = pd.read_csv('train.csv')
# Create a Random Forest object
rf = RandomForestRegressor()
# Train a model
rf.fit(X=train[['store', 'item']], y=train['sales'])
Congratulations, you've built the first simple model. Now it's time to use it for the test predictions. Go on to the next step!
Prepare a submission
You've already built a model on the training data from the Kaggle Store Item Demand Forecasting Challenge. Now, it's time to make predictions on the test data and create a submission file in the specified format.
Your goal is to read the test data, make predictions, and save these in the format specified in the "sample_submission.csv" file. The `rf` object you created in the previous exercise is available in your workspace.
Note that from now on and for the rest of the course, the `pandas` library will always be imported for you and can be accessed as `pd`.
Instructions 1/2
50 XP
- Read "test.csv" and "sample_submission.csv" files using `pandas`.
- Look at the head of the sample submission to determine the format.
test = pd.read_csv('test.csv')
sample_submission = pd.read_csv('sample_submission.csv')
# Show the head() of the sample_submission
print(sample_submission.head())
# Get predictions for the test set
test['sales'] = rf.predict(test[['store', 'item']])
# Write test predictions using the sample_submission format
test[['id', 'sales']].to_csv('kaggle_submission.csv', index=False)
Congratulations! You've prepared your first Kaggle submission. Now, you could upload it to the Kaggle platform and see your score and current position on the Leaderboard. Move forward to learn more about the Leaderboard itself!
What model is overfitting?
Let's say you've trained 4 different models and calculated a metric for both train and validation data sets. For example, the metric is Mean Squared Error (the lower its value the better). Train and validation metrics for all the models are presented in the table below.
Select the model that overfits the train data.
| Model | Train MSE | Validation MSE |
|---|---|---|
| Model 1 | 2.35 | 2.46 |
| Model 2 | 2.20 | 2.15 |
| Model 3 | 2.10 | 2.14 |
| Model 4 | 1.90 | 2.35 |
Answer the question
50XP
Possible Answers
- Model 1.
- Model 2.
- Model 3.
- Model 4.
That's right! Model 4 has a considerably lower train MSE (1.90) than the other models. However, its validation MSE (2.35) is clearly worse than that of Models 2 and 3, which is the typical sign of overfitting.
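The same reasoning can be checked numerically. This is a minimal sketch that uses the values from the table above and takes the gap between validation and train MSE as a rough overfitting indicator; the `gap` column name is only illustrative.

```python
import pandas as pd

# Train and validation MSE for the four models from the table
scores = pd.DataFrame({
    'model': ['Model 1', 'Model 2', 'Model 3', 'Model 4'],
    'train_mse': [2.35, 2.20, 2.10, 1.90],
    'valid_mse': [2.46, 2.15, 2.14, 2.35],
})

# A large positive gap means the model fits train data much better than unseen data
scores['gap'] = scores['valid_mse'] - scores['train_mse']
most_overfit = scores.loc[scores['gap'].idxmax(), 'model']

print(scores)
print('Largest train/validation gap:', most_overfit)
```

The gap alone is not a formal test, but a model whose train error keeps falling while its validation error rises is the one to be suspicious of.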
Train XGBoost models
Every Machine Learning method could potentially overfit. You will see this in an example with XGBoost. Again, you are working with the Store Item Demand Forecasting Challenge. The `train` DataFrame is available in your workspace.
Firstly, let's train multiple XGBoost models with different sets of hyperparameters using XGBoost's learning API. The single hyperparameter you will change is:
- `max_depth`: maximum depth of a tree. Increasing this value will make the model more complex and more likely to overfit.
Instructions 1/3
35 XP
- Set the maximum depth to 2. Then hit Submit Answer button to train the first model.
import xgboost as xgb

# Create DMatrix on train data
dtrain = xgb.DMatrix(data=train[['store', 'item']],
                     label=train['sales'])

# Define xgboost parameters
# ('reg:squarederror' replaces the deprecated 'reg:linear';
#  'verbosity': 0 replaces the old 'silent' flag)
params = {'objective': 'reg:squarederror',
          'max_depth': 2,
          'verbosity': 0}

# Train xgboost model
xg_depth_2 = xgb.train(params=params, dtrain=dtrain)
- Now, set the maximum depth to 8. Then hit Submit Answer button to train the second model.
import xgboost as xgb

# Create DMatrix on train data
dtrain = xgb.DMatrix(data=train[['store', 'item']],
                     label=train['sales'])

# Define xgboost parameters
# ('reg:squarederror' replaces the deprecated 'reg:linear';
#  'verbosity': 0 replaces the old 'silent' flag)
params = {'objective': 'reg:squarederror',
          'max_depth': 8,
          'verbosity': 0}

# Train xgboost model
xg_depth_8 = xgb.train(params=params, dtrain=dtrain)
- Finally, set the maximum depth to 15. Then hit Submit Answer button to train the third model.
import xgboost as xgb

# Create DMatrix on train data
dtrain = xgb.DMatrix(data=train[['store', 'item']],
                     label=train['sales'])

# Define xgboost parameters
# ('reg:squarederror' replaces the deprecated 'reg:linear';
#  'verbosity': 0 replaces the old 'silent' flag)
params = {'objective': 'reg:squarederror',
          'max_depth': 15,
          'verbosity': 0}

# Train xgboost model
xg_depth_15 = xgb.train(params=params, dtrain=dtrain)
All right, now you have 3 different XGBoost models trained. Let's explore them further!
Explore overfitting XGBoost
Having trained 3 XGBoost models with different maximum depths, you will now evaluate their quality. For this purpose, you will measure the quality of each model on both the train data and the test data. As you know by now, the train data is the data models have been trained on. The test data is the next month sales data that models have never seen before.
The goal of this exercise is to determine whether any of the models trained is overfitting. To measure the quality of the models you will use Mean Squared Error (MSE). It's available in `sklearn.metrics` as the `mean_squared_error()` function that takes two arguments: true values and predicted values.
The `train` and `test` DataFrames together with the 3 models trained (`xg_depth_2`, `xg_depth_8`, `xg_depth_15`) are available in your workspace.
Instructions
100 XP
- Make predictions for each model on both the train and test data.
- Calculate the MSE between the true values and your predictions for both the train and test data.
import xgboost as xgb
from sklearn.metrics import mean_squared_error

dtrain = xgb.DMatrix(data=train[['store', 'item']])
dtest = xgb.DMatrix(data=test[['store', 'item']])

# For each of 3 trained models
for model in [xg_depth_2, xg_depth_8, xg_depth_15]:
    # Make predictions
    train_pred = model.predict(dtrain)
    test_pred = model.predict(dtest)

    # Calculate metrics
    mse_train = mean_squared_error(train['sales'], train_pred)
    mse_test = mean_squared_error(test['sales'], test_pred)
    print('MSE Train: {:.3f}. MSE Test: {:.3f}'.format(mse_train, mse_test))
So, you see that the third model, with depth 15, is already overfitting: it has a considerably lower train error than the second model, but its test error is higher. Be aware of overfitting and move on to the next chapter to learn how to beat it!
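The effect of tree depth can be reproduced on synthetic data without the competition files. The sketch below is an illustration only: it swaps XGBoost for a single scikit-learn `DecisionTreeRegressor` and uses an invented noisy sine dataset, so the numbers will not match the exercise output.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy data (an assumption, not the competition data)
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(400, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.5, size=400)

# Simple holdout split: first 300 rows for training, last 100 for validation
X_train, y_train = X[:300], y[:300]
X_valid, y_valid = X[300:], y[300:]

results = {}
for depth in [2, 8, 15]:
    model = DecisionTreeRegressor(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    mse_train = mean_squared_error(y_train, model.predict(X_train))
    mse_valid = mean_squared_error(y_valid, model.predict(X_valid))
    results[depth] = (mse_train, mse_valid)
    print('Depth {:>2}. MSE Train: {:.3f}. MSE Valid: {:.3f}'.format(depth, mse_train, mse_valid))
```

Train error can only go down as the tree gets deeper, so it says little about generalization on its own; the validation column is the one to watch.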
Understand the problem type
As you've just seen, the first step of the solution workflow is to skim through the problem statement. Your goal now is to determine data types available as well as the problem type for the Avito Demand Prediction Challenge. The evaluation metric in this competition is the Root Mean Squared Error. The problem definition is presented below.
In this Kaggle competition, Avito is challenging you to predict demand for an online advertisement based on its full description (price, title, images, etc.), its context (geo position, similar ads already posted) and historical demand for similar ads in the past.
What problem type are you facing, and what data do you have at your disposal?
Answer the question
50XP
Possible Answers
- This is a regression problem with tabular, time series, image and text data.
- This is a regression problem with tabular, text and image data.
- This is a classification problem with tabular, time series, image and text data.
- This is a clustering problem with tabular, text and image data.
That's correct! This competition contains a mix of various structured and unstructured data.
Define a competition metric
The competition metric is used by Kaggle to evaluate your submissions. Moreover, you also need to measure the performance of different models on a local validation set.
For now, your goal is to manually develop a couple of competition metrics in case they are not available in `sklearn.metrics`.
In particular, you will define:
- Mean Squared Error (MSE) for the regression problem: $$MSE = \frac{1}{N} \sum\limits_{i=1}^{N} (y_i - \hat{y}_i)^2$$
- Logarithmic Loss (LogLoss) for the binary classification problem: $$LogLoss = -\frac{1}{N} \sum\limits_{i=1}^{N} \left( y_i \ln(p_i) + (1 - y_i) \ln(1 - p_i) \right)$$
Instructions 1/2
50 XP
- Using `numpy`, define MSE metric. As a function input, you're given true `y_true` and predicted `y_pred` arrays.
import numpy as np

# Import MSE from sklearn
from sklearn.metrics import mean_squared_error

# Define your own MSE function
def own_mse(y_true, y_pred):
    # Raise differences to the power of 2
    squares = np.power(y_true - y_pred, 2)
    # Find mean over all observations
    err = np.mean(squares)
    return err

print('Sklearn MSE: {:.5f}. '.format(mean_squared_error(y_regression_true, y_regression_pred)))
print('Your MSE: {:.5f}. '.format(own_mse(y_regression_true, y_regression_pred)))
- Using `numpy`, define LogLoss metric. As input, you're given true class `y_true` and probability predicted `prob_pred`.
import numpy as np

# Import log_loss from sklearn
from sklearn.metrics import log_loss

# Define your own LogLoss function
def own_logloss(y_true, prob_pred):
    # Find loss for each observation
    terms = y_true * np.log(prob_pred) + (1 - y_true) * np.log(1 - prob_pred)
    # Find mean over all observations
    err = np.mean(terms)
    return -err

print('Sklearn LogLoss: {:.5f}'.format(log_loss(y_classification_true, y_classification_pred)))
print('Your LogLoss: {:.5f}'.format(own_logloss(y_classification_true, y_classification_pred)))
Great! You see that your functions work the same way as the built-in `sklearn` metrics. Knowing the problem type and evaluation metric, it's time to start Data Analysis. Let's move on to the next lesson on EDA!
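One practical caveat with a hand-written LogLoss: a predicted probability of exactly 0 or 1 makes `np.log` return minus infinity. A common workaround is to clip the probabilities first. The sketch below is one possible variant; the function name `safe_logloss` and the clipping constant `eps` are assumptions, not part of the course code.

```python
import numpy as np

def safe_logloss(y_true, prob_pred, eps=1e-15):
    """LogLoss with probabilities clipped to [eps, 1 - eps] (eps is an assumed constant)."""
    y_true = np.asarray(y_true, dtype=float)
    # Clipping keeps np.log away from 0, so the result stays finite
    prob_pred = np.clip(np.asarray(prob_pred, dtype=float), eps, 1 - eps)
    terms = y_true * np.log(prob_pred) + (1 - y_true) * np.log(1 - prob_pred)
    return -np.mean(terms)

# Two perfectly confident correct predictions and one 50/50 guess
example = safe_logloss([1, 0, 1], [1.0, 0.0, 0.5])
print('Clipped LogLoss: {:.5f}'.format(example))
```

Only the 50/50 guess contributes noticeably here, so the result is close to ln(2) / 3.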
EDA statistics
As mentioned in the slides, you'll work with New York City taxi fare prediction data. You'll start with finding some basic statistics about the data. Then you'll move forward to plot some dependencies and generate hypotheses on them.
The `train` and `test` DataFrames are already available in your workspace.
Instructions 1/2
50 XP
- Find the shapes of the train and test data.
- Look at the head of the train data.
print('Train shape:', train.shape)
print('Test shape:', test.shape)
# Train head()
print(train.head())
# Describe the target variable
print(train.fare_amount.describe())
# Train distribution of passengers within rides
print(train.passenger_count.value_counts())
All right! You just obtained a couple of descriptive statistics about the data. You can look at them to understand the data structure. However, they are not informative enough to get ideas for the future solution. Let's get down to more practical EDA!
EDA plots I
After generating a couple of basic statistics, it's time to come up with and validate some ideas about the data dependencies. Again, the `train` DataFrame from the taxi competition is already available in your workspace.
To begin with, let's make a scatterplot plotting the relationship between the fare amount and the distance of the ride. Intuitively, the longer the ride, the higher its price.
To get the distance in kilometers between two geo-coordinates, you will use Haversine distance. Its calculation is available with the `haversine_distance()` function defined for you. The function expects the `train` DataFrame as input.
Instructions
100 XP
- Create a new variable "distance_km" as Haversine distance between pickup and dropoff points.
- Plot a scatterplot with "fare_amount" on the x axis and "distance_km" on the y axis. To draw a scatterplot use the matplotlib `scatter()` method.
- Set a limit on a ride distance to be between 0 and 50 kilometers to avoid plotting outliers.
train['distance_km'] = haversine_distance(train)
# Draw a scatterplot
plt.scatter(x=train['fare_amount'], y=train['distance_km'], alpha=0.5)
plt.xlabel('Fare amount')
plt.ylabel('Distance, km')
plt.title('Fare amount based on the distance')
# Limit on the distance
plt.ylim(0, 50)
plt.show()
Nice plot! It's obvious now that there is a clear dependency between ride distance and fare amount. So, ride distance is, probably, a good feature. Let's find some others!
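The `haversine_distance()` helper is provided by the course and its body is not shown here. For reference, a possible implementation might look like the sketch below; the four coordinate column names and the Earth radius of 6371 km are assumptions, so the course helper may differ in details.

```python
import numpy as np
import pandas as pd

def haversine_distance(df):
    """Great-circle distance in km between pickup and dropoff points.

    Column names below are assumptions about the taxi data layout.
    """
    lat1 = np.radians(df['pickup_latitude'])
    lon1 = np.radians(df['pickup_longitude'])
    lat2 = np.radians(df['dropoff_latitude'])
    lon2 = np.radians(df['dropoff_longitude'])

    # Haversine formula
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 6371 * 2 * np.arcsin(np.sqrt(a))

# One degree of longitude along the equator is roughly 111 km
rides = pd.DataFrame({'pickup_latitude': [0.0], 'pickup_longitude': [0.0],
                      'dropoff_latitude': [0.0], 'dropoff_longitude': [1.0]})
rides['distance_km'] = haversine_distance(rides)
print(rides['distance_km'])
```

Because the function works on whole columns, it can be applied to the full DataFrame in one call without a Python loop.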
EDA plots II
Another idea that comes to mind is that the price of a ride could change during the day.
Your goal is to plot the median fare amount for each hour of the day as a simple line plot. The hour feature is calculated for you. Don't worry if you do not know how to work with the date features. We will explore them in the chapter on Feature Engineering.
Instructions
100 XP
- Group the `train` DataFrame by `"hour"` and calculate the median for the `"fare_amount"` column.
- Using the `hour_price` DataFrame obtained, plot a line with `"hour"` on the x axis and `"fare_amount"` on the y axis.
train['pickup_datetime'] = pd.to_datetime(train.pickup_datetime)
train['hour'] = train.pickup_datetime.dt.hour
# Find median fare_amount for each hour
hour_price = train.groupby('hour', as_index=False)['fare_amount'].median()
# Plot the line plot
plt.plot(hour_price['hour'], hour_price['fare_amount'], marker='o')
plt.xlabel('Hour of the day')
plt.ylabel('Median fare amount')
plt.title('Fare amount based on day time')
plt.xticks(range(24))
plt.show()
Great! We see that prices are a bit higher during the night. It is a good indicator that we should include the `"hour"` feature in the final model, or at least add a binary feature `"is_night"`. Move on to the next lesson to learn how to check whether new features are useful for the model or not!
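A binary night indicator can be derived directly from the hour. The sketch below is illustrative: the toy timestamps are invented, and treating 22:00 to 05:59 as night is an assumed cutoff that you would tune by looking at the hourly plot.

```python
import pandas as pd

# Toy rides with invented pickup times
rides = pd.DataFrame({'pickup_datetime': ['2015-01-01 02:30:00',
                                          '2015-01-01 14:00:00',
                                          '2015-01-01 23:15:00']})
rides['pickup_datetime'] = pd.to_datetime(rides['pickup_datetime'])
rides['hour'] = rides['pickup_datetime'].dt.hour

# Assumed definition of night: from 22:00 up to (not including) 06:00
rides['is_night'] = ((rides['hour'] >= 22) | (rides['hour'] < 6)).astype(int)
print(rides[['hour', 'is_night']])
```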
K-fold cross-validation
You will start by getting hands-on experience in the most commonly used K-fold cross-validation.
The data you'll be working with is from the "Two sigma connect: rental listing inquiries" Kaggle competition. The competition problem is a multi-class classification of the rental listings into 3 classes: low interest, medium interest and high interest. For faster performance, you will work with a subsample consisting of 1,000 observations.
You need to implement a K-fold validation strategy and look at the sizes of each fold obtained. The `train` DataFrame is already available in your workspace.
Instructions
100 XP
- Create a `KFold` object with 3 folds.
- Loop over each split using the `kf` object.
- For each split select training and testing folds using `train_index` and `test_index`.
from sklearn.model_selection import KFold

# Create a KFold object
kf = KFold(n_splits=3, shuffle=True, random_state=123)

# Loop through each split
fold = 0
for train_index, test_index in kf.split(train):
    # Obtain training and testing folds
    cv_train, cv_test = train.iloc[train_index], train.iloc[test_index]
    print('Fold: {}'.format(fold))
    print('CV train shape: {}'.format(cv_train.shape))
    print('Medium interest listings in CV train: {}\n'.format(sum(cv_train.interest_level == 'medium')))
    fold += 1
So, we see that the number of observations in each fold is almost uniform, which means that we've just split the train data into 3 equal folds. However, the number of medium-interest listings varies from 162 to 175 from one fold to another. To make them uniform among the folds, let's use Stratified K-fold!
Stratified K-fold
As you've just noticed, you have a pretty different target variable distribution among the folds due to the random splits. It's not crucial for this particular competition, but could be an issue for the classification competitions with the highly imbalanced target variable.
To overcome this, let's implement the stratified K-fold strategy with the stratification on the target variable. The `train` DataFrame is already available in your workspace.
Instructions
100 XP
- Create a `StratifiedKFold` object with 3 folds and shuffling.
- Loop over each split using the `str_kf` object. Stratification is based on the "interest_level" column.
- For each split select training and testing folds using `train_index` and `test_index`.
from sklearn.model_selection import StratifiedKFold

# Create a StratifiedKFold object
str_kf = StratifiedKFold(n_splits=3, shuffle=True, random_state=123)

# Loop through each split
fold = 0
for train_index, test_index in str_kf.split(train, train['interest_level']):
    # Obtain training and testing folds
    cv_train, cv_test = train.iloc[train_index], train.iloc[test_index]
    print('Fold: {}'.format(fold))
    print('CV train shape: {}'.format(cv_train.shape))
    print('Medium interest listings in CV train: {}\n'.format(sum(cv_train.interest_level == 'medium')))
    fold += 1
Great! Now you see that both size and target distribution are the same among the folds. The general rule is to prefer Stratified K-Fold over usual K-Fold in any classification problem. Move to the next lesson to learn about other cross-validation strategies!
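To see the difference between the two strategies side by side, you can count one class per fold under each splitter. The sketch below runs on an invented 1,000-row target with assumed class proportions, not on the rental listings data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold, StratifiedKFold

# Invented imbalanced target (class proportions are an assumption)
rng = np.random.RandomState(123)
toy = pd.DataFrame({'interest_level': rng.choice(['low', 'medium', 'high'],
                                                 size=1000, p=[0.7, 0.22, 0.08])})

def medium_counts(splitter):
    """Number of 'medium' rows in the training part of each split."""
    return [int((toy['interest_level'].iloc[train_idx] == 'medium').sum())
            for train_idx, _ in splitter.split(toy, toy['interest_level'])]

plain = medium_counts(KFold(n_splits=3, shuffle=True, random_state=123))
strat = medium_counts(StratifiedKFold(n_splits=3, shuffle=True, random_state=123))

print('KFold medium counts:          ', plain)
print('StratifiedKFold medium counts:', strat)
```

With stratification the per-fold counts differ by at most one observation, while plain K-fold lets them drift with the random shuffle.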
Time K-fold
Remember the "Store Item Demand Forecasting Challenge" where you are given store-item sales data, and have to predict future sales?
It's a competition with time series data. So, time K-fold cross-validation should be applied. Your goal is to create this cross-validation strategy and make sure that it works as expected.
Note that the `train` DataFrame is already available in your workspace, and that `TimeSeriesSplit` has been imported from `sklearn.model_selection`.
Instructions
100 XP
- Create a `TimeSeriesSplit` object with 3 splits.
- Sort the train data by "date" column to apply time K-fold.
- Loop over each time split using the `time_kfold` object.
- For each split select training and testing folds using `train_index` and `test_index`.
time_kfold = TimeSeriesSplit(n_splits=3)

# Sort train data by date
train = train.sort_values('date')

# Iterate through each split
fold = 0
for train_index, test_index in time_kfold.split(train):
    cv_train, cv_test = train.iloc[train_index], train.iloc[test_index]
    print('Fold :', fold)
    print('Train date range: from {} to {}'.format(cv_train.date.min(), cv_train.date.max()))
    print('Test date range: from {} to {}\n'.format(cv_test.date.min(), cv_test.date.max()))
    fold += 1
Great! You've applied time K-fold cross-validation strategy for the demand forecasting. Look at the output. It works as expected, training only on the past data and predicting the future. Progress to the next exercise to evaluate different models!
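The behaviour of `TimeSeriesSplit` is easiest to see on a tiny array of row positions. This sketch uses 12 invented, already time-ordered rows and simply prints which positions land in each part of every split.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 observations assumed to be sorted by time already
X = np.arange(12).reshape(-1, 1)

splits = list(TimeSeriesSplit(n_splits=3).split(X))
for fold, (train_idx, test_idx) in enumerate(splits):
    print('Fold {}: train rows {}, test rows {}'.format(fold, train_idx.tolist(), test_idx.tolist()))
```

In every split the training rows come strictly before the test rows, and the training window grows from one split to the next. This is why the data must be sorted by date first: the splitter works on row positions, not on the date values themselves.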
Overall validation score
Now it's time to get the actual model performance using cross-validation! How does our store item demand prediction model perform?
Your task is to take the Mean Squared Error (MSE) for each fold separately, and then combine these results into a single number.
For simplicity, you're given the `get_fold_mse()` function, which fits a Random Forest model for each cross-validation split and returns a list of MSE scores by fold. `get_fold_mse()` accepts two arguments: `train` and a `TimeSeriesSplit` object.
Instructions 1/3
35 XP
- Create time 3-fold cross-validation.
- Print the `numpy` mean of MSE scores by folds.
from sklearn.model_selection import TimeSeriesSplit
import numpy as np
# Sort train data by date
train = train.sort_values('date')
# Initialize 3-fold time cross-validation
kf = TimeSeriesSplit(n_splits=3)
# Get MSE scores for each cross-validation split
mse_scores = get_fold_mse(train, kf)
print('Mean validation MSE: {:.5f}'.format(np.mean(mse_scores)))
print('MSE by fold: {}'.format(mse_scores))
print('Overall validation MSE: {:.5f}'.format(np.mean(mse_scores) + np.std(mse_scores)))
Congratulations, you've mastered it! Now, you know different validation strategies as well as how to use them to obtain overall model performance. It's time for the next and the most interesting part of the solution process: Feature Engineering and Modeling. See you in the next Chapters!
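The `get_fold_mse()` helper is supplied by the course, so its body is not shown. A possible implementation, under stated assumptions, is sketched below: the feature list (`store`, `item`), the target name (`sales`), the forest settings and the toy data are all hypothetical choices made for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import TimeSeriesSplit

def get_fold_mse(train, kf, features=('store', 'item'), target='sales'):
    """Fit a Random Forest on each split and return the list of fold MSEs."""
    mse_scores = []
    for train_index, test_index in kf.split(train):
        fold_train, fold_test = train.iloc[train_index], train.iloc[test_index]
        # Small forest with a fixed seed to keep the sketch fast and repeatable
        rf = RandomForestRegressor(n_estimators=10, random_state=123)
        rf.fit(fold_train[list(features)], fold_train[target])
        pred = rf.predict(fold_test[list(features)])
        mse_scores.append(mean_squared_error(fold_test[target], pred))
    return mse_scores

# Invented daily sales data, sorted by date before splitting
rng = np.random.RandomState(0)
toy = pd.DataFrame({'date': pd.date_range('2017-01-01', periods=120),
                    'store': rng.randint(1, 4, 120),
                    'item': rng.randint(1, 6, 120),
                    'sales': rng.poisson(20, 120)}).sort_values('date')

mse_scores = get_fold_mse(toy, TimeSeriesSplit(n_splits=3))
overall = np.mean(mse_scores) + np.std(mse_scores)
print('MSE by fold:', mse_scores)
print('Overall validation MSE: {:.5f}'.format(overall))
```

Adding the standard deviation to the mean penalizes models whose quality varies a lot between folds, which gives a more pessimistic single number than the mean alone.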
Arithmetical features
To practice creating new features, you will be working with a subsample from the Kaggle competition called "House Prices: Advanced Regression Techniques". The goal of this competition is to predict the price of the house based on its properties. It's a regression problem with Root Mean Squared Error as an evaluation metric.
Your goal is to create new features and determine whether they improve your validation score. To get the validation score from 5-fold cross-validation, you're given the `get_kfold_rmse()` function. Use it with the `train` DataFrame, available in your workspace, as an argument.
Instructions 1/3
50 XP
- Create a new feature representing the total area (basement, 1st and 2nd floors) of the house. The columns `"TotalBsmtSF"`, `"FirstFlrSF"` and `"SecondFlrSF"` give the areas of the basement, 1st and 2nd floors, respectively.
print('RMSE before feature engineering:', get_kfold_rmse(train))
# Find the total area of the house
train['TotalArea'] = train['TotalBsmtSF'] + train['FirstFlrSF'] + train['SecondFlrSF']
print('RMSE with total area:', get_kfold_rmse(train))
# Find the area of the garden
train['GardenArea'] = train['LotArea'] - train['FirstFlrSF']
print('RMSE with garden area:', get_kfold_rmse(train))
# Find total number of bathrooms
train['TotalBath'] = train['FullBath'] + train['HalfBath']
print('RMSE with number of bathrooms:', get_kfold_rmse(train))
Nice! You've created three new features. Here you see that house area improved the RMSE by almost $1,000. Adding garden area improved the RMSE by another $600. However, with the total number of bathrooms, the RMSE has increased. It means that you keep the new area features, but do not add "TotalBath" as a new feature. Let's now work with the datetime features!
Date features
You've built some basic features using numerical variables. Now, it's time to create features based on date and time. You will practice on a subsample from the Taxi Fare Prediction Kaggle competition data. The data represents information about the taxi rides and the goal is to predict the price for each ride.
Your objective is to generate date features from the pickup datetime. Recall that it's better to create new features for train and test data simultaneously. After the features are created, split the data back into the train and test DataFrames. Here it's done using the pandas `isin()` method.
The `train` and `test` DataFrames are already available in your workspace.
Instructions
100 XP
- Concatenate the `train` and `test` DataFrames into a single DataFrame `taxi`.
- Convert the "pickup_datetime" column to a `datetime` object.
- Create the day of week (using the `.dayofweek` attribute) and hour (using the `.hour` attribute) features from the "pickup_datetime" column.
taxi = pd.concat([train, test])
# Convert pickup date to datetime object
taxi['pickup_datetime'] = pd.to_datetime(taxi['pickup_datetime'])
# Create a day of week feature
taxi['dayofweek'] = taxi['pickup_datetime'].dt.dayofweek
# Create an hour feature
taxi['hour'] = taxi['pickup_datetime'].dt.hour
# Split back into train and test
new_train = taxi[taxi['id'].isin(train['id'])]
new_test = taxi[taxi['id'].isin(test['id'])]
Great! Now you know how to perform feature engineering for train and test DataFrames simultaneously. Having considered numerical and datetime features, move forward to master feature engineering for categorical ones!
Label encoding
Let's work on categorical variables encoding. You will again work with a subsample from the House Prices Kaggle competition.
Your objective is to encode categorical features "RoofStyle" and "CentralAir" using label encoding. The `train` and `test` DataFrames are already available in your workspace.
Instructions
100 XP
- Concatenate the `train` and `test` DataFrames into a single DataFrame `houses`.
- Create a `LabelEncoder` object without arguments and assign it to `le`.
- Create new label-encoded features for "RoofStyle" and "CentralAir" using the same `le` object.
houses = pd.concat([train, test])
# Label encoder
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
# Create new features
houses['RoofStyle_enc'] = le.fit_transform(houses['RoofStyle'])
houses['CentralAir_enc'] = le.fit_transform(houses['CentralAir'])
# Look at new features
print(houses[['RoofStyle', 'RoofStyle_enc', 'CentralAir', 'CentralAir_enc']].head())
All right! You can see that categorical variables have been label encoded. However, as you already know, label encoder is not always a good choice for categorical variables. Let's go further and apply One-Hot encoding.
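Refitting one `le` object on a second column works for producing the encoded values, but afterwards `le` only remembers the classes of the last column it was fitted on. If you need to map codes back to the original labels, one option is to keep a separate encoder per column. The sketch below uses a four-row invented DataFrame, and the `encoders` dictionary is an illustrative name.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Invented sample of the two categorical columns
houses = pd.DataFrame({'RoofStyle': ['Gable', 'Hip', 'Gable', 'Flat'],
                       'CentralAir': ['Y', 'N', 'Y', 'Y']})

# One encoder per column, so each mapping can be inverted later
encoders = {}
for col in ['RoofStyle', 'CentralAir']:
    encoders[col] = LabelEncoder()
    houses[col + '_enc'] = encoders[col].fit_transform(houses[col])

# Recover the original labels from the encoded values
roof_decoded = encoders['RoofStyle'].inverse_transform(houses['RoofStyle_enc'])
print(houses)
print(list(roof_decoded))
```

`LabelEncoder` assigns codes in sorted order of the labels, which is exactly the arbitrary ranking that makes it a questionable choice for non-ordinal categories.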
One-Hot encoding
The problem with label encoding is that it implicitly assumes that there is a ranking dependency between the categories. So, let's change the encoding method for the features "RoofStyle" and "CentralAir" to one-hot encoding. Again, the `train` and `test` DataFrames from the House Prices Kaggle competition are already available in your workspace.
Recall that if you're dealing with binary features (categorical features with only two categories) it is suggested to apply label encoder only.
Your goal is to determine which of the mentioned features is not binary, and to apply one-hot encoding only to this one.
Instructions 1/4
35 XP
- Determine the distribution of "RoofStyle" and "CentralAir" features using the pandas `value_counts()` method.
houses = pd.concat([train, test])
# Look at feature distributions
print(houses['RoofStyle'].value_counts(), '\n')
print(houses['CentralAir'].value_counts())
houses = pd.concat([train, test])
# Label encode binary 'CentralAir' feature
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
houses['CentralAir_enc'] = le.fit_transform(houses['CentralAir'])
# Create One-Hot encoded features
ohe = pd.get_dummies(houses['RoofStyle'], prefix='RoofStyle')
# Concatenate OHE features to houses
houses = pd.concat([houses, ohe], axis=1)
# Look at OHE features
print(houses[[col for col in houses.columns if 'RoofStyle' in col]].head(3))
Congratulations! Now you've mastered one-hot encoding as well! The one-hot encoded features look as expected. Remember to drop the initial string column, because models will not handle it automatically. OK, we're done with simple categorical encoders. Let's move to the target encoder!
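The reminder about dropping the initial string column can be sketched as follows. This is a small illustration on made-up data (`toy` is a hypothetical stand-in for `houses`), not part of the original exercise:

```python
import pandas as pd

# Toy stand-in for the houses DataFrame (values are made up)
toy = pd.DataFrame({'RoofStyle': ['Gable', 'Hip', 'Gable', 'Flat']})

# One-hot encode the column: one indicator column per category
ohe = pd.get_dummies(toy['RoofStyle'], prefix='RoofStyle')

# Attach the indicators and drop the original string column
toy = pd.concat([toy, ohe], axis=1).drop(columns='RoofStyle')
print(toy.columns.tolist())
```

After this step every remaining column is numeric or boolean, which is what most models expect as input.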
Mean target encoding
First of all, you will create a function that implements mean target encoding. Remember that you need to develop the two following steps:
- Calculate the mean on the train, apply to the test
- Split train into K folds. Calculate the out-of-fold mean for each fold, apply to this particular fold
Each of these steps will be implemented in a separate function: `test_mean_target_encoding()` and `train_mean_target_encoding()`, respectively.
The final function `mean_target_encoding()` takes as arguments: the train and test DataFrames, the name of the categorical column to be encoded, the name of the target column and a smoothing parameter alpha. It returns two values: a new feature for train and test DataFrames, respectively.
Instructions 1/3
35 XP
- You need to add smoothing to avoid overfitting. So, add the $\alpha$ parameter to the denominator in the `train_statistics` calculation.
- You need to treat new categories in the test data. So, pass the global mean as an argument to the `fillna()` method.
from sklearn.model_selection import KFold

def test_mean_target_encoding(train, test, target, categorical, alpha=5):
    # Calculate global mean on the train data
    global_mean = train[target].mean()
    # Group by the categorical feature and calculate its properties
    train_groups = train.groupby(categorical)
    category_sum = train_groups[target].sum()
    category_size = train_groups.size()
    # Calculate smoothed mean target statistics
    train_statistics = (category_sum + global_mean * alpha) / (category_size + alpha)
    # Apply statistics to the test data and fill new categories
    test_feature = test[categorical].map(train_statistics).fillna(global_mean)
    return test_feature.values

def train_mean_target_encoding(train, target, categorical, alpha=5):
    # Create 5-fold cross-validation
    kf = KFold(n_splits=5, random_state=123, shuffle=True)
    # Explicit float dtype avoids an object-dtype empty Series
    train_feature = pd.Series(index=train.index, dtype=float)
    # For each folds split
    for train_index, test_index in kf.split(train):
        cv_train, cv_test = train.iloc[train_index], train.iloc[test_index]
        # Calculate out-of-fold statistics and apply to cv_test
        cv_test_feature = test_mean_target_encoding(cv_train, cv_test, target, categorical, alpha)
        # Save new feature for this particular fold
        train_feature.iloc[test_index] = cv_test_feature
    return train_feature.values

def mean_target_encoding(train, test, target, categorical, alpha=5):
    # Get the train feature
    train_feature = train_mean_target_encoding(train, target, categorical, alpha)
    # Get the test feature
    test_feature = test_mean_target_encoding(train, test, target, categorical, alpha)
    # Return new features to add to the model
    return train_feature, test_feature
Great! Now you are equipped with a function that performs mean target encoding of any categorical feature. Move on to learn how to implement mean target encoding for the K-fold cross-validation using the `mean_target_encoding()` function you've just built!
K-fold cross-validation
You will work with a binary classification problem on a subsample from a Kaggle playground competition. The objective of this competition is to predict whether the famous basketball player Kobe Bryant scored a basket or missed a particular shot.
Train data is available in your workspace as the `bryant_shots` DataFrame. It contains data on 10,000 shots with their properties and a target variable "shot_made_flag": whether the shot was scored or not.
One of the features in the data is "game_id": the particular game in which the shot was made. There are 541 distinct games. So, you are dealing with a high-cardinality categorical feature. Let's encode it using a target mean!
Suppose you're using 5-fold cross-validation and want to evaluate a mean target encoded feature on the local validation.
Instructions
100 XP
- To achieve this, you need to repeat the encoding procedure for the "game_id" categorical feature inside each folds split separately. Your goal is to specify all the missing parameters for the `mean_target_encoding()` function call inside each folds split.
- Recall that the `train` and `test` parameters expect the train and test DataFrames.
- While the `target` and `categorical` parameters expect names of the target variable and categorical feature to be encoded.
from sklearn.model_selection import KFold

# Create 5-fold cross-validation
kf = KFold(n_splits=5, random_state=123, shuffle=True)

# For each folds split
for train_index, test_index in kf.split(bryant_shots):
    # Copy the slices so new columns can be added without a SettingWithCopyWarning
    cv_train = bryant_shots.iloc[train_index].copy()
    cv_test = bryant_shots.iloc[test_index].copy()
    # Create mean target encoded feature
    cv_train['game_id_enc'], cv_test['game_id_enc'] = mean_target_encoding(
        train=cv_train,
        test=cv_test,
        target='shot_made_flag',
        categorical='game_id',
        alpha=5)
    # Look at the encoding
    print(cv_train[['game_id', 'shot_made_flag', 'game_id_enc']].sample(n=1))
Nice! You can see different game encodings for each validation split in the output. The main conclusion to draw: when using local cross-validation, you need to repeat the mean target encoding procedure inside each folds split separately. Go on to try other problem types beyond binary classification!
Beyond binary classification
Of course, binary classification is just a single special case. Target encoding could be applied to any target variable type:
- For binary classification usually mean target encoding is used
- For regression mean could be changed to median, quartiles, etc.
- For multi-class classification with N classes we create N features with target mean for each category in one vs. all fashion
The `mean_target_encoding()` function you've created could be used for any target type specified above. Let's apply it for the regression problem on the example of House Prices Kaggle competition.
Your goal is to encode a categorical feature "RoofStyle" using mean target encoding. The `train` and `test` DataFrames are already available in your workspace.
Instructions
100 XP
- Specify all the missing parameters for the `mean_target_encoding()` function call. Target variable name is "SalePrice". Set the $\alpha$ hyperparameter to 10.
- Recall that the `train` and `test` parameters expect the train and test DataFrames.
- While the `target` and `categorical` parameters expect names of the target variable and feature to be encoded.
train['RoofStyle_enc'], test['RoofStyle_enc'] = mean_target_encoding(
    train=train,
    test=test,
    target='SalePrice',
    categorical='RoofStyle',
    alpha=10)
# Look at the encoding
print(test[['RoofStyle', 'RoofStyle_enc']].drop_duplicates())
So, you observe that houses with the `Hip` roof are the most expensive, while houses with the `Gambrel` roof are the cheapest. That is exactly the goal of target encoding: you've encoded the categorical feature in such a manner that there is now a correlation between category values and the target variable. We're done with categorical encoders. Now it's time to talk about missing data!
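For the multi-class case listed at the start of this exercise (N classes yield N features, one vs. all), a rough sketch could look like the code below. The data, the column names (`color`, `label`) and the value of `alpha` are made up. For brevity the statistics are computed on the full data rather than out-of-fold, which is a simplification and not how the train feature should be built in a real pipeline:

```python
import pandas as pd

# Made-up multi-class data: 'color' is the categorical feature, 'label' the target
toy = pd.DataFrame({
    'color': ['red', 'red', 'blue', 'blue', 'blue', 'green'],
    'label': ['cat', 'dog', 'cat', 'cat', 'bird', 'dog'],
})
alpha = 1  # smoothing strength, chosen arbitrarily for the illustration

encoded = pd.DataFrame(index=toy.index)
for cls in sorted(toy['label'].unique()):
    # One-vs-all indicator: 1 if the row belongs to this class, else 0
    indicator = (toy['label'] == cls).astype(int)
    global_mean = indicator.mean()
    groups = indicator.groupby(toy['color'])
    # Same smoothed statistic as in the binary case, applied to the indicator
    stats = (groups.sum() + global_mean * alpha) / (groups.size() + alpha)
    # One new feature per class
    encoded['color_enc_' + cls] = toy['color'].map(stats)

print(encoded)
```

Each class gets its own binary target, so the binary-classification machinery from above can in principle be reused once per class, including the K-fold scheme.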
Find missing data
Let's impute missing data on a real Kaggle dataset. For this purpose, you will be using a data subsample from the Kaggle "Two Sigma Connect: Rental Listing Inquiries" competition.
Before proceeding with any imputing you need to know the number of missing values for each of the features. Moreover, if the feature has missing values, you should explore the type of this feature.
Instructions 1/2
50 XP
- Read the "twosigma_train.csv" file using `pandas`.
- Find the number of missing values in each column.
twosigma = pd.read_csv('twosigma_train.csv')
# Find the number of missing values in each column
print(twosigma.isnull().sum())
# Look at the columns with the missing values
print(twosigma[['building_id', 'price']].head())
All right, you've found out that the 'building_id' and 'price' columns have missing values. Looking at the head of the DataFrame, we may conclude that 'price' is a numerical feature, while 'building_id' is a categorical feature that encodes buildings as hashes.
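The two checks described above, counting missing values and then looking at the type of each affected feature, can be combined into one summary. This is a sketch on a made-up DataFrame; the column names mirror the exercise, the values do not:

```python
import numpy as np
import pandas as pd

# Made-up listings (hypothetical values)
toy = pd.DataFrame({
    'building_id': ['a1f', None, 'c9d', 'a1f'],
    'price': [3000.0, 2500.0, np.nan, 4100.0],
    'bedrooms': [2, 1, 3, 2],
})

# Count missing values per column and keep only columns that have any
missing = toy.isnull().sum()
missing = missing[missing > 0]

# Pair each affected column with its dtype to decide how to impute it
summary = pd.DataFrame({'n_missing': missing, 'dtype': toy.dtypes[missing.index]})
print(summary)
```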
Impute missing data
You've found that "price" and "building_id" columns have missing values in the Rental Listing Inquiries dataset. So, before passing the data to the models you need to impute these values.
Numerical feature "price" will be imputed with the mean value of non-missing prices.
Imputing categorical feature "building_id" with the most frequent category is a bad idea, because it would mean that all the apartments with a missing "building_id" are located in the most popular building. The better idea is to impute it with a new category.
The DataFrame `rental_listings` with competition data is read for you.
Instructions 1/2
50 XP
- Create a SimpleImputer object with "mean" strategy.
- Impute missing prices with the mean value.
from sklearn.impute import SimpleImputer
# Create mean imputer
mean_imputer = SimpleImputer(strategy='mean')
# Price imputation
rental_listings[['price']] = mean_imputer.fit_transform(rental_listings[['price']])
from sklearn.impute import SimpleImputer
# Create constant imputer
constant_imputer = SimpleImputer(strategy='constant', fill_value='MISSING')
# building_id imputation
rental_listings[['building_id']] = constant_imputer.fit_transform(rental_listings[['building_id']])
Nice! Now our data is ready to be passed to any Machine Learning model. Move on to the next chapter to build and improve your models!
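The exercise imputes a single DataFrame. When separate train and test sets are involved, one common pattern (an assumption here, not something the exercise shows) is to fit the imputer on the train data and reuse the fitted statistic on the test data. A minimal sketch with made-up prices:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up train and test prices (hypothetical values)
toy_train = pd.DataFrame({'price': [3000.0, np.nan, 4000.0]})
toy_test = pd.DataFrame({'price': [np.nan, 5000.0]})

mean_imputer = SimpleImputer(strategy='mean')

# Learn the mean on train only, then apply the same value to test
toy_train[['price']] = mean_imputer.fit_transform(toy_train[['price']])
toy_test[['price']] = mean_imputer.transform(toy_test[['price']])

print(toy_train)
print(toy_test)
```

Here the missing test price is filled with the train mean of 3500, not with a statistic computed from the test set.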
Replicate validation score
You've seen both validation and Public Leaderboard scores in the video. However, the code examples are available only for the test data. To get the validation scores you have to repeat the same process on the holdout set.
Throughout this chapter, you will work with New York City Taxi competition data. The problem is to predict the fare amount for a taxi ride in New York City. The competition metric is the root mean squared error.
The first goal is to evaluate the Baseline model on the validation data. You will replicate the simplest Baseline based on the mean of "fare_amount". Recall that as a validation strategy we used a 30% holdout split with `validation_train` as train and `validation_test` as holdout DataFrames. Both of them are available in your workspace.
Instructions
100 XP
- Calculate the mean of "fare_amount" over the whole `validation_train` DataFrame.
- Assign this naive prediction value to all the holdout predictions. Store them in the "pred" column.
import numpy as np
from sklearn.metrics import mean_squared_error
from math import sqrt
# Calculate the mean fare_amount on the validation_train data
naive_prediction = np.mean(validation_train['fare_amount'])
# Assign naive prediction to all the holdout observations
validation_test['pred'] = naive_prediction
# Measure the local RMSE
rmse = sqrt(mean_squared_error(validation_test['fare_amount'], validation_test['pred']))
print('Validation RMSE for Baseline I model: {:.3f}'.format(rmse))
It's exactly the same number you've seen in the slides, well done! So, to avoid overfitting you should fully replicate your models using the validation data. Go forward to create a couple of other baselines!
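In the exercise, `validation_train` and `validation_test` are prepared for you. One way such a 30% holdout split could be produced is sketched below on synthetic fares. The use of `train_test_split`, the seed and the data itself are assumptions for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic fares standing in for the competition's train data
rng = np.random.default_rng(123)
toy_train = pd.DataFrame({'fare_amount': rng.gamma(shape=2.0, scale=5.0, size=1000)})

# 30% holdout split, mirroring the validation strategy described above
validation_train, validation_test = train_test_split(
    toy_train, test_size=0.3, random_state=123)
validation_test = validation_test.copy()

# Baseline: predict the train mean for every holdout row
validation_test['pred'] = validation_train['fare_amount'].mean()

rmse = np.sqrt(mean_squared_error(validation_test['fare_amount'],
                                  validation_test['pred']))
print('Validation RMSE: {:.3f}'.format(rmse))
```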
Baseline based on the date
We've already built 3 different baseline models. To get more practice, let's build a couple more. The first model is based on the grouping variables. It's clear that the ride fare could depend on the part of the day. For example, prices could be higher during the rush hours.
Your goal is to build a baseline model that will assign the average "fare_amount" for the corresponding hour. For now, you will create the model for the whole `train` data and make predictions for the `test` dataset.
The `train` and `test` DataFrames are available in your workspace. Moreover, the "pickup_datetime" column in both DataFrames is already converted to a `datetime` object for you.
Instructions
100 XP
- Get the hour from the "pickup_datetime" column for the `train` and `test` DataFrames.
- Calculate the mean "fare_amount" for each hour on the train data.
- Make `test` predictions using the pandas `map()` method and the grouping obtained.
- Write predictions to the file.
train['hour'] = train['pickup_datetime'].dt.hour
test['hour'] = test['pickup_datetime'].dt.hour
# Calculate average fare_amount grouped by pickup hour
hour_groups = train.groupby('hour')['fare_amount'].mean()
# Make predictions on the test set
test['fare_amount'] = test.hour.map(hour_groups)
# Write predictions
test[['id','fare_amount']].to_csv('hour_mean_sub.csv', index=False)
Great! This baseline achieves 1409th place on the Public Leaderboard, which is slightly better than grouping by the number of passengers. Also, remember to replicate all the results for the validation set as was done in the previous exercise.
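Replicating the hour-based baseline on a validation set could look roughly like this. The rides are synthetic and the split is recreated with `train_test_split`, both of which are assumptions made so that the sketch runs on its own:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic rides: the fare depends loosely on the pickup hour
rng = np.random.default_rng(0)
offsets = pd.to_timedelta(rng.integers(0, 24 * 30, size=2000), unit='h')
rides = pd.DataFrame({'pickup_datetime': pd.Timestamp('2015-01-01') + offsets})
rides['hour'] = rides['pickup_datetime'].dt.hour
rides['fare_amount'] = 10 + 0.2 * rides['hour'] + rng.normal(0, 2, size=2000)

# 30% holdout split
validation_train, validation_test = train_test_split(
    rides, test_size=0.3, random_state=123)
validation_test = validation_test.copy()

# Mean fare per hour, learned on the train part only
hour_groups = validation_train.groupby('hour')['fare_amount'].mean()

# Fall back to the overall mean for hours never seen in validation_train
overall_mean = validation_train['fare_amount'].mean()
validation_test['pred'] = validation_test['hour'].map(hour_groups).fillna(overall_mean)

rmse = np.sqrt(mean_squared_error(validation_test['fare_amount'],
                                  validation_test['pred']))
print('Validation RMSE for the hour baseline: {:.3f}'.format(rmse))
```

The `fillna()` fallback is a defensive choice: `map()` returns NaN for any hour that does not occur in the train part.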
Baseline based on the gradient boosting
Let's build a final baseline based on the Random Forest. You've seen a huge score improvement moving from the grouping baseline to the Gradient Boosting in the video. Now, you will use `sklearn`'s Random Forest to further improve this score.
The goal of this exercise is to take numeric features and train a Random Forest model without any tuning. After that, you could make test predictions and validate the result on the Public Leaderboard. Note that you've already got an "hour" feature which could also be used as an input to the model.
Instructions
100 XP
- Add the "hour" feature to the list of numeric features.
- Fit the `RandomForestRegressor` on the train data with numeric features and "fare_amount" as a target.
- Use the trained Random Forest model to make predictions on the test data.
from sklearn.ensemble import RandomForestRegressor
# Select only numeric features
features = ['pickup_longitude', 'pickup_latitude', 'dropoff_longitude',
'dropoff_latitude', 'passenger_count', 'hour']
# Train a Random Forest model
rf = RandomForestRegressor()
rf.fit(train[features], train.fare_amount)
# Make predictions on the test data
test['fare_amount'] = rf.predict(test[features])
# Write predictions
test[['id','fare_amount']].to_csv('rf_sub.csv', index=False)
Congratulations! This final baseline achieves the 1051st place on the Public Leaderboard which is slightly better than the Gradient Boosting from the video. So, now you know how to build fast and simple baseline models to validate your initial pipeline.
Hyperparameter tuning
- Ridge Regression
  - Least squares linear regression: $$ \text{Loss} = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 \rightarrow \text{min} $$
  - Ridge Regression: $$\text{Loss} = \sum_{i=1}^{N} (y_i - \hat{y}_{i})^2 + \alpha \sum_{j=1}^K w_j^2 \rightarrow \text{min}$$
  - $\alpha$ is a hyperparameter
- Hyperparameter optimization strategies
  - Grid Search - Choose the predefined grid of hyperparameter values
  - Random Search - Choose the search space of hyperparameter values
  - Bayesian optimization - Choose the search space of hyperparameter values
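The first two strategies can be sketched for the Ridge $\alpha$ hyperparameter with scikit-learn's built-in search classes. The synthetic data, the grid values and the log-uniform search space are illustrative assumptions. Bayesian optimization is left out because it typically relies on third-party libraries:

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic regression problem so the sketch runs on its own
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# Grid search: evaluate every value in a predefined grid
grid = GridSearchCV(Ridge(), param_grid={'alpha': [0.01, 0.1, 1.0, 10.0]},
                    scoring='neg_root_mean_squared_error', cv=3)
grid.fit(X, y)

# Random search: sample candidates from a search space instead
rand = RandomizedSearchCV(Ridge(),
                          param_distributions={'alpha': loguniform(1e-3, 1e2)},
                          n_iter=10, scoring='neg_root_mean_squared_error',
                          cv=3, random_state=0)
rand.fit(X, y)

# Scores are negated RMSE values, so flip the sign for reporting
print(grid.best_params_, -grid.best_score_)
print(rand.best_params_, -rand.best_score_)
```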
Grid search
Recall that we've created a baseline Gradient Boosting model in the previous lesson. Your goal now is to find the best `max_depth` hyperparameter value for this Gradient Boosting model. This hyperparameter limits the depth of each individual tree, and therefore the number of nodes it can contain. You will be using K-fold cross-validation to measure the local performance of the model for each hyperparameter value.
You're given a function `get_cv_score()`, which takes the train dataset and dictionary of the model parameters as arguments and returns the overall validation RMSE score over 3-fold cross-validation.
Instructions
100 XP
- Specify the grid for possible `max_depth` values with 3, 6, 9, 12 and 15.
- Pass each hyperparameter candidate in the grid to the model `params` dictionary.
max_depth_grid = [3, 6, 9, 12, 15]
results = {}
# For each value in the grid
for max_depth_candidate in max_depth_grid:
    # Specify parameters for the model
    params = {'max_depth': max_depth_candidate}
    # Calculate validation score for a particular hyperparameter
    validation_score = get_cv_score(train, params)
    # Save the results for each max depth value
    results[max_depth_candidate] = validation_score
print(results)
Nice! We have a validation score for each value in the grid. It's clear that the optimal max depth value is located somewhere between 3 and 6. The next step could be to use a smaller grid, for example [3, 4, 5, 6], and repeat the same process. Moving from larger to smaller grids allows us to home in on the best values. Keep going to try optimizing 2 hyperparameters simultaneously!
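The `get_cv_score()` helper is provided by the course and its implementation is not shown. A hypothetical stand-in, assuming it averages the RMSE of a scikit-learn `GradientBoostingRegressor` over 3 folds, might look like this. The extra `features` and `target` arguments and the synthetic data are additions for the sake of a self-contained sketch:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

def get_cv_score(train, params, features, target='fare_amount'):
    """Hypothetical stand-in: mean RMSE of Gradient Boosting over 3 folds."""
    kf = KFold(n_splits=3, shuffle=True, random_state=123)
    scores = []
    for train_index, test_index in kf.split(train):
        cv_train, cv_test = train.iloc[train_index], train.iloc[test_index]
        model = GradientBoostingRegressor(random_state=123, **params)
        model.fit(cv_train[features], cv_train[target])
        pred = model.predict(cv_test[features])
        scores.append(np.sqrt(mean_squared_error(cv_test[target], pred)))
    return float(np.mean(scores))

# Synthetic data so the sketch runs on its own
rng = np.random.default_rng(0)
toy = pd.DataFrame({'x1': rng.normal(size=300), 'x2': rng.normal(size=300)})
toy['fare_amount'] = 3 * toy['x1'] - 2 * toy['x2'] + rng.normal(scale=0.5, size=300)

# Same loop shape as in the exercise, on a shorter grid
results = {depth: get_cv_score(toy, {'max_depth': depth}, ['x1', 'x2'])
           for depth in [3, 6]}
print(results)
```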
2D grid search
The drawback of tuning each hyperparameter independently is that it ignores potential dependencies between different hyperparameters. The better approach is to try all the possible hyperparameter combinations. However, in such cases, the grid search space expands rapidly. For example, if we have 2 parameters with 10 possible values each, it will yield 100 experiment runs.
Your goal is to find the best hyperparameter couple of `max_depth` and `subsample` for the Gradient Boosting model. `subsample` is a fraction of observations to be used for fitting the individual trees.
You're given a function `get_cv_score()`, which takes the train dataset and dictionary of the model parameters as arguments and returns the overall validation RMSE score over 3-fold cross-validation.
Instructions
100 XP
- Specify the grids for possible `max_depth` and `subsample` values. For `max_depth`: 3, 5 and 7. For `subsample`: 0.8, 0.9 and 1.0.
- Apply the `product()` function from the `itertools` package to the hyperparameter grids. It returns all possible combinations for these two grids.
- Pass each hyperparameters candidate couple to the model `params` dictionary.
import itertools
# Hyperparameter grids
max_depth_grid = [3, 5, 7]
subsample_grid = [0.8, 0.9, 1.0]
results = {}
# For each couple in the grid
for max_depth_candidate, subsample_candidate in itertools.product(max_depth_grid, subsample_grid):
    params = {'max_depth': max_depth_candidate,
              'subsample': subsample_candidate}
    validation_score = get_cv_score(train, params)
    # Save the results for each couple
    results[(max_depth_candidate, subsample_candidate)] = validation_score
print(results)
Great! You can see that tuning multiple hyperparameters simultaneously achieves better results. In the previous exercise, tuning only the `max_depth` parameter gave the best RMSE of $6.50. With `max_depth` equal to 7 and `subsample` equal to 0.8, the best RMSE is now $6.16. However, do not spend too much time on the hyperparameter tuning at the beginning of the competition! Another approach that almost always improves your solution is model ensembling. Go on for it!
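Reading the winner off a `results` dictionary by eye gets tedious as the grid grows. A small sketch of picking it programmatically is shown below. The scores are made up purely to make the snippet runnable and do not reproduce the values quoted above:

```python
import itertools

# Hypothetical scores keyed by (max_depth, subsample); the numbers are made up
max_depth_grid = [3, 5, 7]
subsample_grid = [0.8, 0.9, 1.0]
results = {combo: 6.0 + 0.1 * i
           for i, combo in enumerate(itertools.product(max_depth_grid, subsample_grid))}

# RMSE is an error metric, so the best candidate has the smallest score
best_params = min(results, key=results.get)
best_score = results[best_params]

# The number of runs is the product of the grid sizes: 3 * 3 = 9
print(len(results), best_params, best_score)
```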