XGBoost and Imbalanced Classes: Predicting Hotel Cancellations

Boosting is often referred to as an ensemble method. This is a technique whereby a series of individual models (or weak learners) are combined to build a model that yields superior predictive power (strong learner).

XGBoost, short for “extreme gradient boosting”, is a popular boosting method that extends gradient boosted decision trees.

In this example, boosting techniques are used to determine whether a customer will cancel their hotel booking or not.

Specifically, XGBoost will be used to build a model to predict hotel cancellations using the hotel booking dataset by Antonio, Almeida and Nunes (2019).

Data Overview and Feature Selection

The training data is imported from an AWS S3 bucket as follows:

import boto3
import botocore
import pandas as pd
from sagemaker import get_execution_role
role = get_execution_role()
bucket = 'yourbucketname'
data_key_train = 'H1.csv'
data_location_train = 's3://{}/{}'.format(bucket, data_key_train)
train_df = pd.read_csv(data_location_train)

Hotel cancellations represent the response (or dependent) variable, where 1 = cancel, 0 = follow through with booking.

The features for analysis are as follows.


leadtime = train_df['LeadTime']
arrivaldateyear = train_df['ArrivalDateYear']
arrivaldateweekno = train_df['ArrivalDateWeekNumber']
arrivaldatedayofmonth = train_df['ArrivalDateDayOfMonth']
staysweekendnights = train_df['StaysInWeekendNights']
staysweeknights = train_df['StaysInWeekNights']
adults = train_df['Adults']
children = train_df['Children']
babies = train_df['Babies']
isrepeatedguest = train_df['IsRepeatedGuest'] 
previouscancellations = train_df['PreviousCancellations']
previousbookingsnotcanceled = train_df['PreviousBookingsNotCanceled']
bookingchanges = train_df['BookingChanges']
agent = train_df['Agent']
company = train_df['Company']
dayswaitinglist = train_df['DaysInWaitingList']
adr = train_df['ADR']
rcps = train_df['RequiredCarParkingSpaces']
totalsqr = train_df['TotalOfSpecialRequests']


arrivaldatemonth = train_df.ArrivalDateMonth.astype("category")

The features to be included in the analysis were identified using both the ExtraTreesClassifier and forward and backward feature selection methods.

Boosting Techniques

XGBoost has become renowned for its execution speed and model performance, and is increasingly relied upon as a default boosting method. It implements the gradient boosted decision tree algorithm, which works in a similar manner to adaptive boosting (AdaBoost). The difference is that instance weights are not adjusted at every iteration as they are in AdaBoost. Instead, each new predictor is fitted to the residual errors made by the previous predictor.
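The idea of fitting each new predictor to the residuals can be illustrated with a minimal sketch. This is plain gradient boosting with squared loss on synthetic data, built from scikit-learn decision stumps. It is not XGBoost's internal algorithm, and the data, learning rate and number of rounds are assumptions chosen purely for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression data (illustrative only, not the hotel dataset).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.5  # assumed value for the sketch
n_rounds = 20        # assumed value for the sketch

# Start from a constant prediction: the mean of the target.
prediction = np.full_like(y, y.mean())
mse_per_round = [np.mean((y - prediction) ** 2)]

for _ in range(n_rounds):
    residuals = y - prediction  # errors left by the ensemble so far
    stump = DecisionTreeRegressor(max_depth=1, random_state=0)
    stump.fit(X, residuals)     # the new weak learner targets the residuals
    prediction += learning_rate * stump.predict(X)
    mse_per_round.append(np.mean((y - prediction) ** 2))

print("MSE before boosting: {:.4f}".format(mse_per_round[0]))
print("MSE after {} rounds: {:.4f}".format(n_rounds, mse_per_round[-1]))
```

Each stump on its own is a weak learner, but the training error of the combined prediction shrinks as rounds are added.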

Precision vs. Recall and f1-score

When assessing model performance, several readings are provided alongside each confusion matrix: precision, recall, f1-score and overall accuracy.

However, a particularly important distinction exists between precision and recall.

The two readings are often at odds with each other, i.e. it is often not possible to increase precision without reducing recall, and vice versa.

The ideal metric depends in large part on the problem at hand. For example, in cancer screening a false negative (i.e. indicating that a patient does not have cancer when in fact they do) is highly costly. Under this scenario, recall is the ideal metric.

However, for email spam filtering one might prefer to avoid false positives, i.e. sending a legitimate, important email to the spam folder. Under this scenario, precision is the more relevant metric.

The f1-score takes both precision and recall into account by computing their harmonic mean, yielding a more general score.
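To make the three metrics concrete, the sketch below computes them from raw counts. The helper name precision_recall_f1 is hypothetical. The counts are for class 1 (cancellations) and are taken from the validation confusion matrix reported later in this article.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1) for the positive class from raw counts."""
    precision = tp / (tp + fp)  # share of predicted positives that are correct
    recall = tp / (tp + fn)     # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Validation confusion matrix [[3159 4107], [194 2555]]:
# rows are actual classes, columns are predicted classes.
tp, fp, fn = 2555, 4107, 194

precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print("precision={:.2f} recall={:.2f} f1={:.2f}".format(precision, recall, f1))
# prints: precision=0.38 recall=0.93 f1=0.54
```

These values match the class 1 row of the validation classification report.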

Which would be more important for predicting hotel cancellations?

From the point of view of a hotel, the priority is likely to be identifying as many as possible of the customers who will ultimately cancel their booking, since this allows the hotel to better allocate rooms and resources. Identifying customers who are not going to cancel may not add much value to the hotel’s analysis, as the hotel knows that a significant proportion of customers will follow through with their bookings in any case. In other words, recall on the cancellation class is the more important metric here.


The H1 dataset is first split into training and validation data, and the H2 dataset is used as the test set for comparing the XGBoost predictions with actual cancellations.
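A split of this kind could be produced along the following lines. This is a sketch on random stand-in data: the 25% validation share, the stratification and the random seeds are assumptions, not settings confirmed by the original notebook. Only the variable names x1_train, x1_val, y1_train and y1_val are taken from the code used later in this article.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in feature matrix and labels (random, for illustration only).
rng = np.random.default_rng(42)
x1 = rng.normal(size=(1000, 5))
y1 = (rng.uniform(size=1000) < 0.28).astype(int)  # roughly 28% positives

# Hold out 25% of the rows for validation (assumed share).
# stratify=y1 keeps the class ratio similar in both parts.
x1_train, x1_val, y1_train, y1_val = train_test_split(
    x1, y1, test_size=0.25, stratify=y1, random_state=0
)
print(x1_train.shape, x1_val.shape)
```

Stratifying is a sensible default with imbalanced classes, since it prevents the validation set from ending up with an unrepresentative share of cancellations.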

Here is an implementation of the XGBoost algorithm:

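import xgboost as xgb

# Shallow trees with a small learning rate; scale_pos_weight up-weights class 1 (cancellations)
xgb_model = xgb.XGBClassifier(learning_rate=0.001,
                              max_depth=1,
                              n_estimators=100,
                              scale_pos_weight=3)
xgb_model.fit(x_train, y_train)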

Note that the scale_pos_weight parameter in this instance is set to 3. The higher the weight, the greater the penalty imposed for errors on the minority class, in this case observations where the response variable is 1, i.e. hotel cancellations. This is done because there are more 0s than 1s in the dataset, i.e. more customers follow through on their bookings than cancel.

Therefore, to prevent the model from being biased towards the majority class, errors on the minority class need to be penalised more severely.
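As a rough guide to choosing the weight, the XGBoost parameter documentation suggests the ratio of negative to positive instances as a typical starting value. The sketch below applies that heuristic to the class counts of the validation set reported in this article. It is a starting point rather than the method used to arrive at the value of 3.

```python
# Class counts taken from the validation-set support figures in this article.
n_negative = 7266  # class 0: booking honoured
n_positive = 2749  # class 1: booking cancelled

# Heuristic starting value: sum(negative instances) / sum(positive instances).
suggested_weight = n_negative / n_positive
print("Suggested scale_pos_weight: {:.2f}".format(suggested_weight))
# prints: Suggested scale_pos_weight: 2.64
```

The value of 3 used here is close to this ratio, which is consistent with treating the parameter as a correction for class imbalance.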

Performance on Validation Set

Here is the accuracy on the training and validation set:

>>> import xgboost as xgb
>>> xgb_model = xgb.XGBClassifier(learning_rate=0.001, max_depth=1, n_estimators=100, scale_pos_weight=3)
>>> xgb_model.fit(x1_train, y1_train)

>>> print("Accuracy on training set: {:.3f}".format(xgb_model.score(x1_train, y1_train)))
>>> print("Accuracy on validation set: {:.3f}".format(xgb_model.score(x1_val, y1_val)))

Accuracy on training set: 0.579
Accuracy on validation set: 0.571

The predictions are generated:

>>> xgb_predict=xgb_model.predict(x1_val)
>>> xgb_predict

array([1, 1, 1, ..., 0, 1, 1])

Here is a confusion matrix comparing the predicted vs. actual cancellations on the validation set:

>>> from sklearn.metrics import classification_report,confusion_matrix
>>> print(confusion_matrix(y1_val,xgb_predict))
>>> print(classification_report(y1_val,xgb_predict))

[[3159 4107]
 [ 194 2555]]
              precision    recall  f1-score   support

           0       0.94      0.43      0.59      7266
           1       0.38      0.93      0.54      2749

    accuracy                           0.57     10015
   macro avg       0.66      0.68      0.57     10015
weighted avg       0.79      0.57      0.58     10015

Note that while the overall accuracy (57%) is modest, the recall for class 1 (cancellations) is 93%. The model generates many false positives (4,107 bookings wrongly flagged as cancellations), which reduces overall accuracy, but it correctly identifies 93% of the customers who did cancel their booking.

k-fold Cross Validation

In the above analysis, accuracy and recall were gauged using a standard train-validation split, i.e. the model was trained on the training data and its predictions were then compared to the validation data.

However, there is the risk that the model may have made strong predictions on the validation set by chance, but there is no guarantee that this would translate into making strong predictions on real-world data.

To attempt to mitigate this risk, a technique called k-fold cross validation can be used. This technique works by partitioning the data into a specified number of folds. Let us say that we wish to split the data into five separate groups, i.e. k=5.

By doing this, one part of the data is withheld as the test set, while the remaining four parts are used as the training data. By alternating the choice of test set in each instance, five separate models are trained to make predictions on five separate test sets.
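The way the folds alternate can be seen in a small sketch that uses scikit-learn's KFold on ten dummy observations. The data is illustrative only.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten dummy observations, split into five folds of two rows each.
data = np.arange(10)
kfold = KFold(n_splits=5, shuffle=False)

folds = []
for fold_number, (train_idx, test_idx) in enumerate(kfold.split(data), start=1):
    folds.append((train_idx.tolist(), test_idx.tolist()))
    print("Fold {}: train={} test={}".format(
        fold_number, train_idx.tolist(), test_idx.tolist()))
```

Every observation appears in exactly one test fold. Note that when an integer is passed as cv for a classifier, scikit-learn uses the stratified variant (StratifiedKFold), which also preserves the class ratio within each fold.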

Let us perform this technique and see what we come up with:

>>> from sklearn.model_selection import cross_validate
>>> cv_results = cross_validate(xgb_model, x1_train, y1_train, cv=5)
>>> cv_results

{'fit_time': array([0.09879589, 0.06915689, 0.06884503, 0.06944084, 0.06797481]),
 'score_time': array([0.00231791, 0.0023036 , 0.00233889, 0.00230122, 0.00229812]),
 'test_score': array([0.57547013, 0.58745216, 0.58029622, 0.57779997, 0.57580296])}

We can see that the test score is at least 0.57 across all five folds. Since cross_validate reports accuracy by default for a classifier, we can be more confident that the accuracy of 0.57 obtained originally would hold when testing on unseen data.

Given that recall is a metric of importance for this example, we can also gauge the recall score when performing k-fold cross validation:

>>> from sklearn.model_selection import cross_validate
>>> cv_results = cross_validate(xgb_model, x1_train, y1_train, scoring="recall", cv=5)
>>> cv_results

{'fit_time': array([0.09710097, 0.06710696, 0.06669283, 0.0674212 , 0.06662488]),
 'score_time': array([0.00415516, 0.00414419, 0.00426912, 0.00424576, 0.00424981]),
 'test_score': array([0.92358209, 0.93373134, 0.92298507, 0.92532855, 0.92353644])}

The test score for recall is 0.92 or higher across all five trials. As a result, we can be relatively more confident that such a recall score would hold when testing across unseen data.
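Rather than running cross validation once per metric, several metrics can be requested in a single call by passing a list to scoring. The sketch below uses synthetic data, and LogisticRegression stands in for the XGBoost model only so that it runs without xgboost installed. Both are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic imbalanced data (about 70% class 0), standing in for the hotel features.
X, y = make_classification(n_samples=1000, n_features=8,
                           weights=[0.7, 0.3], random_state=0)

# Stand-in classifier; an XGBClassifier could be passed in the same way.
model = LogisticRegression(max_iter=1000)

cv_results = cross_validate(model, X, y,
                            scoring=["accuracy", "recall", "f1"], cv=5)
for metric in ("accuracy", "recall", "f1"):
    scores = cv_results["test_" + metric]
    print("{}: mean={:.3f} std={:.3f}".format(metric, scores.mean(), scores.std()))
```

With a list of scorers, the results dictionary contains one test_<metric> array per metric, and the standard deviation across folds gives a quick sense of how stable each reading is.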

Performance on Test Set

As previously, the test set is also imported from the relevant S3 bucket:

data_key_test = 'H2.csv'
data_location_test = 's3://{}/{}'.format(bucket, data_key_test)
h2data = pd.read_csv(data_location_test)

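Here is the subsequent classification performance of the XGBoost model on H2, which is the test set in this instance. In the code below, a denotes the feature matrix built from H2 and b the actual cancellation labels.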

>>> prh4 = xgb_model.predict(a)
>>> prh4

array([0, 1, 1, ..., 1, 1, 1])

>>> from sklearn.metrics import classification_report,confusion_matrix
>>> print(confusion_matrix(b,prh4))
>>> print(classification_report(b,prh4))

[[12650 33578]
 [ 1972 31130]]
              precision    recall  f1-score   support

           0       0.87      0.27      0.42     46228
           1       0.48      0.94      0.64     33102

    accuracy                           0.55     79330
   macro avg       0.67      0.61      0.53     79330
weighted avg       0.70      0.55      0.51     79330

Overall accuracy is slightly lower than on the validation set at 55%, while recall for class 1 remains high at 94%.

Calibration: scale_pos_weight

In this instance, it is observed that using a scale_pos_weight of 3 resulted in 94% recall while yielding an overall accuracy of 55%.

However, a high recall score can be misleading when taken on its own. For instance, suppose that the scale_pos_weight was set even higher, so that almost all of the predictions indicated a response of 1, i.e. nearly all customers were predicted to cancel their booking.

This model has no inherent value if all the customers are predicted to cancel, since there is no longer any way of identifying the unique attributes of customers who are likely to cancel their booking versus those who do not.
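The arithmetic behind this degenerate case is easy to check. The sketch below assumes a hypothetical model that predicts class 1 for every booking, and uses the class support of the H2 test set reported in this article.

```python
# Class support on the H2 test set, as shown in the classification reports.
n_negative = 46228  # class 0: did not cancel
n_positive = 33102  # class 1: cancelled

# A degenerate model that predicts class 1 for every booking.
tp, fn = n_positive, 0  # every actual cancellation is flagged
fp, tn = n_negative, 0  # every honoured booking is flagged too

recall = tp / (tp + fn)
precision = tp / (tp + fp)
accuracy = (tp + tn) / (n_positive + n_negative)
print("recall={:.2f} precision={:.2f} accuracy={:.2f}".format(
    recall, precision, accuracy))
# prints: recall=1.00 precision=0.42 accuracy=0.42
```

Recall is perfect, yet precision and accuracy simply equal the share of cancellations in the data, so the predictions carry no information.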

In this regard, a more balanced solution is to have a high recall while also ensuring that the overall accuracy does not fall excessively low.

Here are the confusion matrices and classification reports on the test set when weights of 2, 4 and 5 are used.

scale_pos_weight = 2

[[36926  9302]
 [12484 20618]]
              precision    recall  f1-score   support
           0       0.75      0.80      0.77     46228
           1       0.69      0.62      0.65     33102
    accuracy                           0.73     79330
   macro avg       0.72      0.71      0.71     79330
weighted avg       0.72      0.73      0.72     79330

scale_pos_weight = 4

[[ 1926 44302]
 [    0 33102]]
              precision    recall  f1-score   support
           0       1.00      0.04      0.08     46228
           1       0.43      1.00      0.60     33102
    accuracy                           0.44     79330
   macro avg       0.71      0.52      0.34     79330
weighted avg       0.76      0.44      0.30     79330

scale_pos_weight = 5

[[ 1926 44302]
 [    0 33102]]
              precision    recall  f1-score   support
           0       1.00      0.04      0.08     46228
           1       0.43      1.00      0.60     33102
    accuracy                           0.44     79330
   macro avg       0.71      0.52      0.34     79330
weighted avg       0.76      0.44      0.30     79330

When the scale_pos_weight was set to 3, recall came in at 94% while accuracy was at 55%. When the parameter is set to 5, recall is at 100% while accuracy falls to 44%. Additionally, note that increasing the parameter from 4 to 5 does not result in any change in either recall or overall accuracy.

In this regard, using a weight of 3 allows for a high recall while keeping overall classification accuracy above 50%. This gives the hotel a baseline for differentiating between the attributes of customers who cancel their booking and those who do not.
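The selection logic described above can be written down as a simple rule: among the weights whose accuracy stays at or above a floor, pick the one with the highest recall. The sketch below applies that rule to the class 1 recall and overall accuracy figures reported in this article. The 0.50 floor mirrors the "above 50%" criterion, and the variable names are hypothetical.

```python
# Class 1 recall and overall accuracy on the test set, as reported above.
reported = {
    2: {"recall": 0.62, "accuracy": 0.73},
    3: {"recall": 0.94, "accuracy": 0.55},
    4: {"recall": 1.00, "accuracy": 0.44},
    5: {"recall": 1.00, "accuracy": 0.44},
}

min_accuracy = 0.50  # floor taken from the "above 50%" criterion

# Keep only the weights that satisfy the accuracy floor.
eligible = {w: m for w, m in reported.items() if m["accuracy"] >= min_accuracy}

# Among those, choose the weight with the highest recall.
best_weight = max(eligible, key=lambda w: eligible[w]["recall"])
print("Eligible weights:", sorted(eligible))
print("Selected scale_pos_weight:", best_weight)
# prints: Selected scale_pos_weight: 3
```

Changing the floor changes the answer, so the choice of 3 reflects a judgment about how much accuracy the hotel is willing to give up for recall.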


In this example, you have seen the use of XGBoost, a boosting method, to predict hotel cancellations. As mentioned, the model was set to impose greater penalties for errors on the minority class, which lowered the overall accuracy since more false positives were present.

However, the recall score increased vastly as a result. If it is assumed that false positives are more tolerable than false negatives in this situation, then one could argue that the model has performed quite well on this basis. For reference, an SVM model run on the same dataset demonstrated an overall accuracy of 63%, while recall on class 1 decreased to 75%.

We have also seen how k-fold cross validation can be used to determine whether the accuracy and recall readings from testing the model remain consistent when testing across several folds.

The datasets and notebooks for this example are available at the MGCodesandStats GitHub repository, along with further research on this topic.

Useful References