Question: Econ 337 Assignment 1: Hotel Booking Data Analysis Using Machine Learning (Regression, Classification, LDA, QDA, Ridge & Lasso)
14 Feb 2025,10:46 AM
Data
The hotel_booking.csv file (available for download in the "Econ 337 – Assignment 1 (Week 15)" section of the course’s Moodle page) contains hotel demand data.
The dataset includes 40,060 observations of a resort hotel and 79,330 observations of a city hotel (a total of 119,390 observations). Each observation represents a hotel booking due to arrive between the 1st of July 2015 and the 31st of August 2017, including bookings that effectively arrived and bookings that were canceled.
These are the variables that you will encounter in hotel_booking.csv:
- hotel: whether hotel is a resort hotel or a city hotel
- is_canceled: value indicating if booking was canceled (=1) or not (=0)
- lead_time: number of days between booking and arrival
- arrival_date_year: year of arrival
- arrival_date_month: month of arrival
- stays_in_weekend_nights: number of weekend nights (Saturday or Sunday) guest stayed or booked to stay
- stays_in_week_nights: number of weekday nights (Monday to Friday) guest stayed or booked to stay
- adults: number of adults
- children: number of children
- babies: number of babies
- country: country of origin
- distribution_channel: booking distribution channel ("TA" stands for "Travel Agents", "TO" means "Tour Operators", "Direct" means it was directly booked by a customer, and "Corporate" means it was booked by a corporation)
- is_repeated_guest: value indicating if the booking name was from a repeated guest (=1) or not (=0)
- previous_cancellations: number of previous bookings that were cancelled by the customer prior to the current booking
- previous_bookings_not_canceled: number of previous bookings not cancelled by the customer prior to the current booking
- booking_changes: number of changes made to booking
- deposit_type: type of deposit
- adr: average daily rate (average revenue per occupied room per night)
- required_car_parking_spaces: number of car parking spaces required by customer
- total_of_special_requests: number of special requests made by customer
To start, import pandas, numpy, and sklearn:
```python
import pandas as pd
import numpy as np
# for sklearn, check the notebooks followed in class
```
After, load hotel_booking.csv as a pandas DataFrame:
```python
hotel_bookings = pd.read_csv("hotel_booking.csv")
```
In what follows, you will develop a model to predict whether a customer booking will be cancelled or not.
Exercises
Read each question carefully and answer ALL questions:
(a) Complete the following tasks which will be important for questions (b) to (f):
- The dataset contains missing observations; remove these observations.
- The type of a number of variables is not ready to be used in a regression. You can confirm this by running hotel_bookings.dtypes. Convert these variables into a usable format.
- Randomly split your data into a training and a test set with equal sizes. Set the random seed as your student number. Example:
```python
np.random.seed(WRITE_HERE_YOUR_STUDENT_ID)
```
- Use np.random.choice or train_test_split() from sklearn (set random_state to your student ID).
🔹 NOTE: The training and test set created in (a.3) will be used in questions (b) to (f).
💡 (20 marks)
(b) This question is composed of 2 sub-questions:
- Using all or many predictors, fit the two models below on your training set, then compare their test errors and discuss the results:
  - A linear model using least squares.
  - A logistic regression for the same prediction task.
- Instead of relying only on the Bayes Classifier (which assigns observations based on a 50% probability threshold), the hotel company wants to minimize overbooking risks (accepting more bookings than there are rooms available on the assumption that some customers will cancel, only for them to show up).
  - Perform the prediction analyses from (b.1) using adjusted thresholds.
  - Compare and discuss your results, focusing on how the adjusted threshold affects the prediction outcomes and aligns with the company’s goal.
💡 (20 marks)
(c) Perform LDA and QDA
- Using all or many predictors in your data, perform Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) on the training set to predict “is_canceled”.
- Evaluate the test errors for both models.
- Compare results with those obtained in question (b).
💡 (20 marks)
(d) Fit Ridge and Lasso regression models
- Using all predictors in the data, fit:
  - Ridge regression
  - Lasso regression
- Choose λ (lambda) using 5-fold and 10-fold cross-validation.
- Report test errors for each model.
- Report number of non-zero coefficient estimates as a function of λ.
- Discuss findings.
💡 (20 marks)
(e) Feature selection and model performance comparison
- Select 5 predictors of your choice from the dataset.
- Using these 5 predictors, perform:
  - Linear and logistic regression
  - LDA and QDA
  - Ridge and Lasso regression
- Compare classification errors or mean squared errors with models that included all predictors.
  - Did performance improve?
- Now, start with predictors chosen by Lasso in (d) and apply the methods from (b) to (d) again.
- Discuss results and differences observed.
💡 (20 marks)
Expert answer
DRAFT / STUDY TIPS:
(a) Data Preprocessing
How to Approach This Question
- Step 1: Load the dataset using pandas.
- Step 2: Handle missing values by dropping the affected observations, as the assignment asks.
- Step 3: Convert categorical variables into a numeric format (for example, dummy variables), since sklearn models need numeric inputs.
- Step 4: Split the dataset into training and testing sets.
Solution
```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Step 1: Load dataset
hotel_bookings = pd.read_csv("hotel_booking.csv")

# Step 2: Remove observations with missing values
hotel_bookings = hotel_bookings.dropna()

# Step 3: Convert the text-valued variables into 0/1 dummy variables,
# because sklearn estimators need numeric inputs.
# Note: "country" has many levels, so this adds many columns.
categorical_cols = ["hotel", "arrival_date_month", "country",
                    "distribution_channel", "deposit_type"]
hotel_bookings = pd.get_dummies(hotel_bookings, columns=categorical_cols,
                                drop_first=True, dtype=int)

# Step 4: Split dataset (50-50 split)
student_id = 123456  # Replace with your student ID
np.random.seed(student_id)
train_data, test_data = train_test_split(
    hotel_bookings, test_size=0.5, random_state=student_id)
```
✅ Key Points:
- Using dropna() removes rows with missing values.
- Converting categorical variables to a numeric format ensures they can be used in models.
- train_test_split() with a fixed random_state gives a random but reproducible split.
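The assignment also allows np.random.choice for the split. A minimal sketch of that alternative is shown here, using a small made-up DataFrame (toy) in place of hotel_bookings and a placeholder seed of 123456; both are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for hotel_bookings (hypothetical values, illustration only)
toy = pd.DataFrame({
    "lead_time": [10, 45, 3, 120, 60, 7, 200, 15],
    "is_canceled": [0, 1, 0, 1, 0, 0, 1, 0],
})

np.random.seed(123456)  # placeholder: replace with your student ID

# Draw half of the row labels without replacement for the training set
train_idx = np.random.choice(toy.index, size=len(toy) // 2, replace=False)

train_toy = toy.loc[train_idx]
test_toy = toy.drop(index=train_idx)  # the remaining rows form the test set

print(len(train_toy), len(test_toy))
```

Sampling without replacement keeps the two sets disjoint, and dropping the sampled labels guarantees every row lands in exactly one of them.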
(b) Linear and Logistic Regression
How to Approach This Question
- Step 1: Fit OLS (Ordinary Least Squares) Regression for binary classification.
- Step 2: Fit Logistic Regression.
- Step 3: Compute test error and compare both models.
Solution
```python
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import mean_squared_error, accuracy_score

# Define predictors and target variable
X_train = train_data.drop(columns=["is_canceled"])
X_test = test_data.drop(columns=["is_canceled"])
y_train = train_data["is_canceled"]
y_test = test_data["is_canceled"]

# Step 1: OLS regression (a linear probability model)
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)
y_pred_ols = linear_model.predict(X_test)

# Convert continuous OLS predictions to binary (0 or 1) with a 0.5 threshold
y_pred_ols_binary = (y_pred_ols > 0.5).astype(int)
ols_error = 1 - accuracy_score(y_test, y_pred_ols_binary)

# Step 2: Logistic regression
logistic_model = LogisticRegression(max_iter=500)
logistic_model.fit(X_train, y_train)
y_pred_logistic = logistic_model.predict(X_test)
logistic_error = 1 - accuracy_score(y_test, y_pred_logistic)

# Step 3: Compare test errors on the same scale (misclassification rate)
print(f"OLS test error: {ols_error:.4f}")
print(f"Logistic regression test error: {logistic_error:.4f}")
```
✅ Key Points:
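- OLS regression is not ideal for binary classification, since its fitted values can fall outside the [0, 1] range, but it provides a useful baseline.
- Logistic regression is better suited to predicting is_canceled because its predictions are probabilities between 0 and 1.
- The accuracy_score() function returns the share of correct predictions; the test error is one minus the accuracy.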
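Part (b.2) asks for adjusted thresholds. One plausible reading is that the costly mistake for the hotel is predicting a cancellation for a guest who then shows up, so the threshold for predicting "canceled" should be raised above 0.5. The sketch below illustrates this on synthetic data from make_classification (not the hotel data); the threshold values and the names prob_cancel and results are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: y = 1 plays the role of "canceled"
X, y = make_classification(n_samples=2000, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

model = LogisticRegression(max_iter=500).fit(X_tr, y_tr)
prob_cancel = model.predict_proba(X_te)[:, 1]  # estimated P(canceled = 1)

results = {}
for threshold in [0.5, 0.6, 0.7, 0.8]:
    # Predict "canceled" only when the estimated probability exceeds the threshold
    pred = (prob_cancel > threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_te, pred, labels=[0, 1]).ravel()
    results[threshold] = {"error": (fp + fn) / len(y_te), "false_cancel": int(fp)}
    print(f"threshold={threshold}: test error={results[threshold]['error']:.3f}, "
          f"predicted cancel but showed up={fp}")
```

Raising the threshold can only shrink the set of bookings predicted as canceled, so the count of "predicted cancel but showed up" cannot increase, while the overall test error may rise. That trade-off is what the discussion in (b.2) should address.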
(c) LDA and QDA
How to Approach This Question
- Step 1: Fit Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA).
- Step 2: Compute test errors.
- Step 3: Compare with Logistic Regression.
Solution
```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis

# Step 1: Train LDA model
lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)
y_pred_lda = lda.predict(X_test)

# Step 2: Train QDA model
qda = QuadraticDiscriminantAnalysis()
qda.fit(X_train, y_train)
y_pred_qda = qda.predict(X_test)

# Step 3: Compute test accuracy (test error = 1 - accuracy)
lda_acc = accuracy_score(y_test, y_pred_lda)
qda_acc = accuracy_score(y_test, y_pred_qda)

print(f"LDA Accuracy: {lda_acc}")
print(f"QDA Accuracy: {qda_acc}")
```
✅ Key Points:
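- LDA works well when the predictors are roughly normal within each class and the classes share a common covariance matrix.
- QDA lets each class have its own covariance matrix, so it is more flexible but can overfit if there isn’t enough data.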
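To see how the covariance assumption matters, the following sketch fits both classifiers to synthetic two-class Gaussian data whose classes have different covariance matrices. The data, the means, and the covariance values are made up for illustration and are not drawn from the hotel dataset.

```python
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Synthetic two-class data whose classes differ in covariance
n = 1000
X0 = rng.multivariate_normal([0, 0], [[1.0, 0.0], [0.0, 1.0]], size=n)
X1 = rng.multivariate_normal([1, 1], [[4.0, 1.5], [1.5, 1.0]], size=n)
X = np.vstack([X0, X1])
y = np.array([0] * n + [1] * n)

# Shuffle, then use the first half for training and the second half for testing
order = rng.permutation(len(y))
X, y = X[order], y[order]
half = len(y) // 2

errors = {}
for name, clf in [("LDA", LinearDiscriminantAnalysis()),
                  ("QDA", QuadraticDiscriminantAnalysis())]:
    clf.fit(X[:half], y[:half])
    errors[name] = 1 - accuracy_score(y[half:], clf.predict(X[half:]))
    print(f"{name} test error: {errors[name]:.3f}")
```

With unequal class covariances one would expect QDA to have the edge, though this is not guaranteed on any single sample. On the hotel data, where many predictors are dummy variables, the ranking may differ.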
(d) Ridge and Lasso Regression
How to Approach This Question
- Step 1: Use cross-validation to choose the best λ (regularization parameter).
- Step 2: Fit Ridge and Lasso Regression.
- Step 3: Report test errors.
Solution
```python
from sklearn.linear_model import RidgeCV, LassoCV

# Step 1: Define range of lambda values (sklearn calls lambda "alpha")
# Note: Ridge and Lasso are sensitive to predictor scale; consider standardizing X first
lambda_values = np.logspace(-4, 4, 100)

# Step 2: Choose lambda for Ridge by 10-fold cross-validation
# (repeat with cv=5 for the 5-fold comparison the question asks for)
ridge_cv = RidgeCV(alphas=lambda_values, cv=10)
ridge_cv.fit(X_train, y_train)
ridge_best_lambda = ridge_cv.alpha_
ridge_test_error = mean_squared_error(y_test, ridge_cv.predict(X_test))

# Step 3: Choose lambda for Lasso by 10-fold cross-validation
lasso_cv = LassoCV(alphas=lambda_values, cv=10)
lasso_cv.fit(X_train, y_train)
lasso_best_lambda = lasso_cv.alpha_
lasso_test_error = mean_squared_error(y_test, lasso_cv.predict(X_test))

print(f"Best Ridge λ: {ridge_best_lambda}, Test Error: {ridge_test_error}")
print(f"Best Lasso λ: {lasso_best_lambda}, Test Error: {lasso_test_error}")
```
✅ Key Points:
- Ridge regression keeps all variables but shrinks coefficients.
- Lasso regression selects important predictors by shrinking some coefficients to zero.
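Part (d) also asks for the number of non-zero coefficient estimates as a function of λ. A minimal sketch of that count is shown here on synthetic data (not the hotel data) in which only the first 3 of 10 predictors affect the response; the grid lambda_grid and the true coefficients are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Synthetic data: only the first 3 of 10 predictors matter
X = rng.normal(size=(500, 10))
y = X[:, :3] @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=500)

lambda_grid = [0.001, 0.01, 0.1, 1.0, 10.0]
nonzero_counts = []
for lam in lambda_grid:
    # Fit one Lasso model per penalty value and count surviving coefficients
    lasso = Lasso(alpha=lam, max_iter=10_000).fit(X, y)
    nonzero_counts.append(int(np.sum(lasso.coef_ != 0)))
    print(f"lambda={lam}: {nonzero_counts[-1]} non-zero coefficients")
```

Small penalties leave most coefficients in the model, while a large enough penalty sets all of them to zero. Running the same loop with X_train and y_train, and plotting nonzero_counts against lambda_grid, would give the curve the question asks for.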
(e) Feature Selection and Performance Comparison
How to Approach This Question
- Step 1: Select 5 predictors.
- Step 2: Train models using only these predictors.
- Step 3: Compare classification errors with full models.
Solution
```python
# Step 1: Select 5 predictors
selected_features = ["lead_time", "adr", "stays_in_week_nights", "adults", "total_of_special_requests"]
X_train_reduced = train_data[selected_features]
X_test_reduced = test_data[selected_features]

# Step 2: Train models with reduced features
logistic_reduced = LogisticRegression(max_iter=500)
logistic_reduced.fit(X_train_reduced, y_train)
y_pred_reduced = logistic_reduced.predict(X_test_reduced)

# Step 3: Compare accuracy
reduced_acc = accuracy_score(y_test, y_pred_reduced)
print(f"Reduced Model Accuracy: {reduced_acc}")
```
✅ Key Points:
- Using fewer variables can reduce overfitting.
- Performance may improve if irrelevant variables are removed, but dropping informative predictors can raise the test error.
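The second half of (e) starts from the predictors chosen by Lasso. A minimal sketch of that workflow follows, assuming synthetic data in which only x0 and x1 drive the outcome and a fixed penalty of alpha=0.05; in the assignment the penalty would instead come from the cross-validation in (d), and the column names here are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso, LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Synthetic data (not the hotel data): only x0 and x1 drive the outcome
cols = [f"x{i}" for i in range(8)]
X = pd.DataFrame(rng.normal(size=(800, 8)), columns=cols)
y = (X["x0"] - X["x1"] + rng.normal(scale=0.5, size=800) > 0).astype(int)

X_tr, X_te = X.iloc[:400], X.iloc[400:]
y_tr, y_te = y.iloc[:400], y.iloc[400:]

# Keep the predictors whose Lasso coefficient is non-zero on the training set
lasso = Lasso(alpha=0.05).fit(X_tr, y_tr)  # alpha fixed here for illustration
selected = [c for c, b in zip(cols, lasso.coef_) if b != 0]

# Refit a classifier from (b) on the selected predictors only
logit = LogisticRegression(max_iter=500).fit(X_tr[selected], y_tr)
selected_error = 1 - accuracy_score(y_te, logit.predict(X_te[selected]))
print(f"Selected predictors: {selected}")
print(f"Test error with Lasso-selected predictors: {selected_error:.3f}")
```

The same selected list can then be passed to the LDA, QDA, Ridge, and Lasso fits so that every method from (b) to (d) is compared on the same reduced set of predictors.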
Final Tips for Answering This Assignment
- Follow a structured approach: Always start with data cleaning, then move to model fitting and evaluation.
- Use proper evaluation metrics:
- Accuracy for classification models.
- Mean Squared Error (MSE) for regression models.
- Compare and interpret results:
- Do models perform differently?
- Does using fewer features improve accuracy?
- Does regularization help reduce overfitting?
- Use cross-validation to optimize hyperparameters.
- Explain each step clearly in the Jupyter Notebook submission.