Question: Econ 337 Assignment 1: Hotel Booking Data Analysis Using Machine Learning (Regression, Classification, LDA, QDA, Ridge & Lasso)

14 Feb 2025,10:46 AM

Data

The data file hotel_booking.csv (available for download in the "Econ 337 – Assignment 1 (Week 15)" section of the course's Moodle page) contains hotel demand data.

The dataset includes 40,060 observations for a resort hotel and 79,330 observations for a city hotel (a total of 119,390 observations). Each observation represents a hotel booking due to arrive between 1 July 2015 and 31 August 2017, including bookings that actually arrived and bookings that were canceled.

These are the variables that you will encounter in hotel_booking.csv:

hotel: whether hotel is a resort hotel or a city hotel
is_canceled: value indicating if booking was canceled (=1) or not (=0)
lead_time: number of days between booking and arrival
arrival_date_year: year of arrival
arrival_date_month: month of arrival
stays_in_weekend_nights: number of weekend nights (Saturday or Sunday) guest stayed or booked to stay
stays_in_week_nights: number of weekday nights (Monday to Friday) guest stayed or booked to stay
adults: number of adults
children: number of children
babies: number of babies
country: country of origin
distribution_channel: booking distribution channel ("TA" stands for "Travel Agents", "TO" means "Tour Operators", "Direct" means it was directly booked by a customer, and "Corporate" means it was booked by a corporation)
is_repeated_guest: value indicating if the booking was made by a repeat guest (=1) or not (=0)
previous_cancellations: number of previous bookings that were canceled by the customer prior to the current booking
previous_bookings_not_canceled: number of previous bookings not canceled by the customer prior to the current booking
booking_changes: number of changes made to booking
deposit_type: type of deposit
adr: average daily rate (sum of lodging transactions divided by the number of nights stayed)
required_car_parking_spaces: number of car parking spaces required by customer
total_of_special_requests: number of special requests made by customer

To start, import pandas, numpy, and sklearn:

    import pandas as pd
    import numpy as np
    # for sklearn, check the notebooks followed in class

After, load hotel_booking.csv as a pandas DataFrame:

    hotel_bookings = pd.read_csv("hotel_booking.csv")

In what follows, you will develop a model to predict whether a customer booking will be cancelled or not.

Exercises

Read each question carefully and answer ALL questions:

(a) Complete the following tasks which will be important for questions (b) to (e):

The dataset contains missing observations; remove these observations.
The type of a number of variables is not ready to be used in a regression. You can confirm this by running hotel_bookings.dtypes.
- Detect all variables which are not ready to be used in a regression and change their type to “categorical” or “dummy”.
- Example code for changing the format of two variables:
```python
cols = ["hotel", "arrival_date_month"]
hotel_bookings[cols] = hotel_bookings[cols].astype("category")
```
Randomly split your data into a training and a test set with equal sizes. Set the random seed as your student number. Example:
```python
np.random.seed(WRITE_HERE_YOUR_STUDENT_ID)
```
- Use np.random.choice or train_test_split() from sklearn (set random_state to your student ID).

🔹 NOTE: The training and test set created in (a.3) will be used in questions (b) to (e).

💡 (20 marks)

(b) This question is composed of 2 sub-questions:

Using all or many predictors, fit:
- A linear model using least squares on your training set.
- A logistic regression for the same prediction task.
- Compare the test error of both models and discuss results.
Instead of relying only on the Bayes Classifier (which assigns observations based on a 50% probability threshold), the hotel company wants to minimize overbooking risk (accepting more bookings than there are rooms available on the assumption that some customers will cancel, only for those customers to show up).
- Perform the prediction analyses from (b.1) using adjusted thresholds.
- Compare and discuss your results, focusing on how the adjusted threshold affects the prediction outcomes and aligns with the company’s goal.

💡 (20 marks)

(c) Perform LDA and QDA

Using all or many predictors in your data, perform Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) on the training set to predict “is_canceled”.
Evaluate the test errors for both models.
Compare results with those obtained in question (b).

💡 (20 marks)

(d) Fit Ridge and Lasso regression models

Using all predictors in the data, fit:
- Ridge regression
- Lasso regression
Choose λ (lambda) using 5-fold and 10-fold cross-validation.
Report test errors for each model.
Report number of non-zero coefficient estimates as a function of λ.
Discuss findings.

💡 (20 marks)

(e) Feature selection and model performance comparison

Select 5 predictors of your choice from the dataset.
Using these 5 predictors, perform:
- Linear and logistic regression
- LDA and QDA
- Ridge and Lasso regression
Compare classification errors or mean squared errors with models that included all predictors.
Did performance improve?
Now, start with predictors chosen by Lasso in (d) and apply the methods from (b) to (d) again.
Discuss results and differences observed.

💡 (20 marks)

Expert answer

DRAFT / STUDY TIPS:

(a) Data Preprocessing

How to Approach This Question

Step 1: Load the dataset using pandas.
Step 2: Handle missing values by dropping or imputing them.
Step 3: Convert categorical variables into proper formats (category or dummy variables).
Step 4: Split the dataset into training and testing sets.
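Step 3 can be sketched on a toy frame. scikit-learn estimators need numeric inputs, so one possible approach, assuming one-hot (dummy) encoding is acceptable for the assignment, is to locate the non-numeric columns with select_dtypes and expand them with pd.get_dummies. The toy values below are made up for illustration and are not taken from hotel_booking.csv.

```python
import pandas as pd

# Toy stand-in for hotel_bookings (illustrative values only)
toy = pd.DataFrame({
    "hotel": ["Resort Hotel", "City Hotel", "City Hotel", "Resort Hotel"],
    "deposit_type": ["No Deposit", "Non Refund", "No Deposit", "No Deposit"],
    "lead_time": [342, 7, 13, 0],
    "is_canceled": [0, 1, 0, 0],
})

# Non-numeric columns cannot enter a regression directly
text_cols = toy.select_dtypes(exclude="number").columns.tolist()
toy[text_cols] = toy[text_cols].astype("category")

# One-hot encode; drop_first avoids a redundant (perfectly collinear) dummy
encoded = pd.get_dummies(toy, columns=text_cols, drop_first=True, dtype=int)

print(text_cols)
print(encoded.columns.tolist())
```

Setting the category dtype before splitting means both halves of the data carry the same category levels, so dummy columns created later line up between the training and test sets.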

Solution

    import pandas as pd
    import numpy as np
    from sklearn.model_selection import train_test_split

    # Step 1: Load dataset
    hotel_bookings = pd.read_csv("hotel_booking.csv")

    # Step 2: Remove rows with missing values
    hotel_bookings = hotel_bookings.dropna()

    # Step 3: Convert every non-numeric column (hotel, country, deposit_type, ...) to categorical
    categorical_cols = hotel_bookings.select_dtypes(exclude="number").columns.tolist()
    hotel_bookings[categorical_cols] = hotel_bookings[categorical_cols].astype("category")

    # Step 4: Split dataset (50-50 split)
    STUDENT_ID = 123456  # Replace with your student ID
    train_data, test_data = train_test_split(hotel_bookings, test_size=0.5, random_state=STUDENT_ID)

✅ Key Points:

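dropna() removes every row that contains a missing value.
Converting text columns to the category dtype marks them as categorical; scikit-learn estimators still need numeric inputs, so expand them into dummy variables (for example with pd.get_dummies) before fitting.
train_test_split() with a fixed random_state gives a random but reproducible split.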
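The question also allows np.random.choice for the split. A minimal sketch of that route, assuming a toy frame in place of the cleaned data and a placeholder student ID (both hypothetical):

```python
import numpy as np
import pandas as pd

STUDENT_ID = 123456  # hypothetical placeholder; use your own student number

# Toy frame standing in for the cleaned hotel_bookings DataFrame
df = pd.DataFrame({"lead_time": np.arange(10), "is_canceled": [0, 1] * 5})

np.random.seed(STUDENT_ID)
n = len(df)

# Draw half of the row positions without replacement for the training set
train_idx = np.random.choice(n, size=n // 2, replace=False)
# Every remaining row position goes to the test set
test_idx = np.setdiff1d(np.arange(n), train_idx)

train_data = df.iloc[train_idx]
test_data = df.iloc[test_idx]
print(len(train_data), len(test_data))
```

Sampling without replacement keeps the two sets disjoint, and seeding first makes the same rows come out on every run.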

(b) Linear and Logistic Regression

How to Approach This Question

Step 1: Fit OLS (Ordinary Least Squares) Regression for binary classification.
Step 2: Fit Logistic Regression.
Step 3: Compute test error and compare both models.
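For Step 3, it helps to put both models on the same error scale. Once predictions are thresholded to 0 or 1, each squared error is either 0 or 1, so the MSE of the binary predictions equals the misclassification rate. A short derivation, writing $\hat{y}_i \in \{0, 1\}$ for the thresholded prediction and $n$ for the test-set size (notation introduced here for illustration):

```latex
\mathrm{MSE}
  = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2
  = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\{y_i \neq \hat{y}_i\}
  = 1 - \text{accuracy}
```

An MSE computed on thresholded OLS predictions is therefore directly comparable with one minus the accuracy of the logistic regression.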

Solution

    from sklearn.linear_model import LinearRegression, LogisticRegression
    from sklearn.metrics import mean_squared_error, accuracy_score

    # Define target and predictors; one-hot encode so every predictor is numeric
    y_train = train_data["is_canceled"]
    y_test = test_data["is_canceled"]
    X_train = pd.get_dummies(train_data.drop(columns=["is_canceled"]), drop_first=True, dtype=float)
    X_test = pd.get_dummies(test_data.drop(columns=["is_canceled"]), drop_first=True, dtype=float)
    X_test = X_test.reindex(columns=X_train.columns, fill_value=0)  # same columns as the training set

    # Step 1: OLS regression (linear probability model)
    linear_model = LinearRegression()
    linear_model.fit(X_train, y_train)
    y_pred_ols = linear_model.predict(X_test)
    # Convert continuous OLS predictions to binary (0 or 1)
    y_pred_ols_binary = (y_pred_ols > 0.5).astype(int)
    # For 0/1 predictions the MSE equals the misclassification rate
    ols_error = mean_squared_error(y_test, y_pred_ols_binary)

    # Step 2: Logistic regression (standardize predictors or raise max_iter if it does not converge)
    logistic_model = LogisticRegression(max_iter=500)
    logistic_model.fit(X_train, y_train)
    y_pred_logistic = logistic_model.predict(X_test)
    logistic_error = 1 - accuracy_score(y_test, y_pred_logistic)

    # Step 3: Compare test errors (both are misclassification rates)
    print(f"OLS test error: {ols_error}")
    print(f"Logistic regression test error: {logistic_error}")

✅ Key Points:

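OLS on a 0/1 outcome (a linear probability model) can produce fitted values below 0 or above 1, so it is not ideal for classification, although its thresholded predictions are often close to those of logistic regression.
Logistic regression models the cancellation probability directly and keeps it between 0 and 1, which makes it the more natural choice for predicting is_canceled.
accuracy_score() returns the share of correct predictions; the test error is one minus that value.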
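Part (b.2) asks for adjusted thresholds. The costly mistake for the hotel is predicting a cancellation for a guest who then shows up (a false positive when is_canceled = 1 is the positive class), so one reasonable reading is to raise the threshold above 0.5. The sketch below uses synthetic data rather than the hotel data, and the thresholds 0.7 and 0.9 are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the hotel data): one predictor, binary outcome
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 1))
p = 1 / (1 + np.exp(-(0.5 + 1.5 * X[:, 0])))
y = rng.binomial(1, p)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
model = LogisticRegression().fit(X_tr, y_tr)
prob_cancel = model.predict_proba(X_te)[:, 1]  # estimated P(outcome = 1)

results = {}
for threshold in (0.5, 0.7, 0.9):
    # Predict "canceled" only when the estimated probability reaches the threshold
    y_hat = (prob_cancel >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_te, y_hat, labels=[0, 1]).ravel()
    results[threshold] = {
        "false_pos": int(fp),
        "false_neg": int(fn),
        "error": (fp + fn) / len(y_te),
    }
    print(threshold, results[threshold])
```

Raising the threshold can only lower (or leave unchanged) the number of false positives, at the price of more missed cancellations, so the discussion should weigh the two error types rather than the overall error rate alone.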

(c) LDA and QDA

How to Approach This Question

Step 1: Fit Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA).
Step 2: Compute test errors.
Step 3: Compare with Logistic Regression.

Solution

    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis

    # Step 1: Train LDA model
    lda = LinearDiscriminantAnalysis()
    lda.fit(X_train, y_train)
    y_pred_lda = lda.predict(X_test)

    # Step 2: Train QDA model
    qda = QuadraticDiscriminantAnalysis()
    qda.fit(X_train, y_train)
    y_pred_qda = qda.predict(X_test)

    # Step 3: Compute test errors (one minus accuracy)
    lda_acc = accuracy_score(y_test, y_pred_lda)
    qda_acc = accuracy_score(y_test, y_pred_qda)
    print(f"LDA Accuracy: {lda_acc}, Test Error: {1 - lda_acc}")
    print(f"QDA Accuracy: {qda_acc}, Test Error: {1 - qda_acc}")

✅ Key Points:

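LDA assumes the predictors are normally distributed within each class with a common covariance matrix, which gives a linear decision boundary.
QDA estimates a separate covariance matrix for each class, so it is more flexible but has more parameters and can overfit when the training set is small relative to the number of predictors.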
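The difference between the two assumptions can be seen on synthetic data where the classes share a mean but have different spreads. This is an illustrative sketch only; the data are simulated and say nothing about how LDA and QDA will rank on the hotel bookings.

```python
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)

rng = np.random.default_rng(1)
n = 2000

# Synthetic data: same class means, different class covariances
X0 = rng.normal(scale=1.0, size=(n, 2))
X1 = rng.normal(scale=3.0, size=(n, 2))
X = np.vstack([X0, X1])
y = np.repeat([0, 1], n)

# Equal-sized random train/test split
perm = rng.permutation(len(y))
train, test = perm[: len(y) // 2], perm[len(y) // 2:]

errors = {}
for name, clf in [("LDA", LinearDiscriminantAnalysis()),
                  ("QDA", QuadraticDiscriminantAnalysis())]:
    clf.fit(X[train], y[train])
    errors[name] = 1 - clf.score(X[test], y[test])  # test misclassification rate

print(errors)
```

Here a straight line cannot separate a tight cluster from a wide one centered on the same point, so the linear boundary does little better than guessing, while the quadratic boundary can enclose the tight cluster.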

(d) Ridge and Lasso Regression

How to Approach This Question

Step 1: Use cross-validation to choose the best λ (regularization parameter).
Step 2: Fit Ridge and Lasso Regression.
Step 3: Report test errors.

Solution

    from sklearn.linear_model import RidgeCV, LassoCV
    from sklearn.metrics import mean_squared_error

    # Step 1: Define range of lambda values (scikit-learn calls lambda "alpha")
    lambda_values = np.logspace(-4, 4, 100)
    # Note: both penalties depend on predictor scale, so consider standardizing X first

    # Steps 2-3: Choose lambda by 5-fold and 10-fold cross-validation, then report test MSE
    for k in (5, 10):
        ridge_cv = RidgeCV(alphas=lambda_values, cv=k).fit(X_train, y_train)
        lasso_cv = LassoCV(alphas=lambda_values, cv=k, max_iter=10000).fit(X_train, y_train)
        ridge_test_error = mean_squared_error(y_test, ridge_cv.predict(X_test))
        lasso_test_error = mean_squared_error(y_test, lasso_cv.predict(X_test))
        print(f"{k}-fold CV | Best Ridge λ: {ridge_cv.alpha_}, Test Error: {ridge_test_error}")
        print(f"{k}-fold CV | Best Lasso λ: {lasso_cv.alpha_}, Test Error: {lasso_test_error}")

✅ Key Points:

Ridge regression keeps all variables but shrinks coefficients.
Lasso regression selects important predictors by shrinking some coefficients to zero.
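Part (d) also asks for the number of non-zero coefficient estimates as a function of λ. A minimal sketch on simulated data, assuming standardized predictors and an arbitrary illustrative grid of λ values:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
n, p = 200, 20
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = [3.0, -2.0, 1.5, 1.0, -1.0]  # only 5 predictors truly matter
y = X @ beta + rng.normal(size=n)

X_std = StandardScaler().fit_transform(X)  # put predictors on a common scale

lambda_values = np.logspace(-3, 1, 9)
nonzero_counts = []
for lam in lambda_values:
    # Fit one Lasso model per lambda and count the surviving coefficients
    lasso = Lasso(alpha=lam, max_iter=10000).fit(X_std, y)
    nonzero_counts.append(int(np.sum(lasso.coef_ != 0)))

for lam, k in zip(lambda_values, nonzero_counts):
    print(f"lambda = {lam:.3g}: {k} non-zero coefficients")
```

Repeating the loop with Ridge in place of Lasso gives a useful contrast for the discussion: Ridge shrinks coefficients toward zero but does not set them exactly to zero, so its count stays at the full number of predictors.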

(e) Feature Selection and Performance Comparison

How to Approach This Question

Step 1: Select 5 predictors.
Step 2: Train models using only these predictors.
Step 3: Compare classification errors with full models.

Solution

    # Step 1: Select 5 predictors of your choice
    selected_features = ["lead_time", "adr", "stays_in_week_nights", "adults", "total_of_special_requests"]
    X_train_reduced = train_data[selected_features]
    X_test_reduced = test_data[selected_features]

    # Step 2: Train models with reduced features (logistic regression shown; repeat for the other methods)
    logistic_reduced = LogisticRegression(max_iter=500)
    logistic_reduced.fit(X_train_reduced, y_train)
    y_pred_reduced = logistic_reduced.predict(X_test_reduced)

    # Step 3: Compare accuracy and test error with the full model
    reduced_acc = accuracy_score(y_test, y_pred_reduced)
    print(f"Reduced Model Accuracy: {reduced_acc}, Test Error: {1 - reduced_acc}")

✅ Key Points:

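Using fewer variables lowers variance and can reduce overfitting, but dropping informative predictors increases bias.
Test error improves only if the removed variables were irrelevant, so compare the reduced models against the full models on the same test set.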
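The last task in (e) starts from the predictors chosen by Lasso. One way to sketch it, on simulated data and with logistic regression standing in for the full list of methods, is to keep the columns whose Lasso coefficients are non-zero and refit on that subset. The data-generating process and variable names below are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LassoCV, LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
n, p = 1000, 10
X = rng.normal(size=(n, p))
# Synthetic binary outcome driven by the first three predictors only
logit = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 1.0 * X[:, 2]
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

half = n // 2
X_train, X_test = X[:half], X[half:]
y_train, y_test = y[:half], y[half:]

# Standardize using training-set statistics only
scaler = StandardScaler().fit(X_train)
X_train_std, X_test_std = scaler.transform(X_train), scaler.transform(X_test)

# Lasso on the 0/1 outcome, lambda chosen by 5-fold cross-validation
lasso = LassoCV(cv=5, random_state=0).fit(X_train_std, y_train)
selected = np.flatnonzero(lasso.coef_)  # column positions Lasso kept

# Refit a classifier on all predictors and on the Lasso-selected subset
full = LogisticRegression().fit(X_train_std, y_train)
reduced = LogisticRegression().fit(X_train_std[:, selected], y_train)
full_error = 1 - full.score(X_test_std, y_test)
reduced_error = 1 - reduced.score(X_test_std[:, selected], y_test)

print(selected, full_error, reduced_error)
```

The same selected columns can be passed to LDA, QDA, Ridge and the linear model in turn, which keeps the comparison with the full-predictor results like for like.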

Final Tips for Answering This Assignment

Follow a structured approach: Always start with data cleaning, then move to model fitting and evaluation.
Use proper evaluation metrics:
- Accuracy for classification models.
- Mean Squared Error (MSE) for regression models.
Compare and interpret results:
- Do models perform differently?
- Does using fewer features improve accuracy?
- Does regularization help reduce overfitting?
Use cross-validation to optimize hyperparameters.
Explain each step clearly in the Jupyter Notebook submission.
