HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
SCHOOL OF INFORMATION COMMUNICATION TECHNOLOGY
PROJECT REPORT
Hotel Price Prediction
Course: Introduction to Data Science
Supervisor: Assoc. Prof. Than Quang Khoat
Group 9
Trần Văn Toàn 20214932
Nguyễn Đắc Tâm 20210763
Nguyễn Trần Nhật Quốc 20210726
Bùi Ánh Dương 20225489
Nguyễn Hoàng Sơn 20225525
Hanoi, December 2024
1 Project overview
  1.1 Problem Statement
  1.2 Complexity of the Problem
  1.3 Questions Addressed by the Project
  1.4 Application of the Project
2 Data collection
  2.1 Data crawling
  2.2 Preprocessing and data imputation
    2.2.1 Duplicate features
    2.2.2 Utilizing Regular Expressions (Regex) for Standardization
    2.2.3 Ad-Hoc Logic for Edge Cases
    2.2.4 Feature Consolidation and Reduction
    2.2.5 Benefits of Feature Consolidation
3 EDA
  3.1 Overview of Dataset
  3.2 Data Distribution
  3.3 Correlation matrix
  3.4 Feature Importance Visualization
  3.5 Key Insights
    3.5.1 Prioritize Key Features
    3.5.2 Insights on Spatial Features
    3.5.3 Temporal Features (year, month)
    3.5.4 Capacity-Related Features
    3.5.5 Experience and Capacity
4 Machine Learning models
  4.1 Supervised Models
    4.1.1 Method
    4.1.2 Evaluation metrics
    4.1.3 Hyperparameter tuning
  4.2 Time Series Models
    4.2.1 ARIMA
    4.2.2 SARIMA
    4.2.3 Prophet
5 Results
  5.1 Supervised Models
    5.1.1 Experimental Results and Analysis
  5.2 Time Series Model
    5.2.1 Experiment Results
    5.2.2 Experiment Analysis
6 Challenges
7 Future Work
8 Conclusion
Contributions

Bùi Ánh Dương
- Performed thorough data preprocessing by handling missing values, duplicates, and irrelevant records.
- Developed an interactive Streamlit web application to demonstrate the hotel price prediction model.
- Designed a user-friendly interface to input features and display predicted prices.

Nguyễn Đắc Tâm
- Performed thorough data preprocessing by handling missing values, duplicates, and irrelevant records.
- Used regular expressions (Regex) and custom logic to standardize features such as amenities and room attributes.
- Designed and implemented multiple machine learning models for hotel price prediction.

Nguyễn Trần Nhật Quốc
- Conducted comprehensive EDA to analyze the dataset and identify key trends, patterns, and relationships between features.
- Created visualizations such as histograms, scatter plots, and feature importance charts to provide insights into price distributions and feature significance.
- Designed and implemented multiple machine learning models for hotel price prediction.

Nguyễn Hoàng Sơn
- Developed and implemented the web scraping pipeline to extract data from Airbnb using Selenium.
- Managed dynamic content handling and optimized the crawling process.
- Performed thorough data preprocessing by handling missing values, duplicates, and irrelevant records.

Trần Văn Toàn
- Crawled data.
- Designed and implemented multiple machine learning models for hotel price prediction.
- Tested models and evaluated performance across different dataset splits.
1 Project overview
1.1 Problem Statement
In the hospitality industry, pricing strategies play a crucial role in maximizing
revenue and ensuring competitiveness. The ability to predict hotel room prices with
accuracy is fundamental for both hotel management and customers. The problem lies in
the inherent unpredictability of hotel prices, influenced by various dynamic factors such
as demand fluctuations, market competition, seasonal trends, local events, customer
preferences, and economic conditions. Traditionally, pricing has been a reactive process,
with limited capacity to anticipate price fluctuations, leading to revenue loss or missed
opportunities for both hoteliers and consumers.
The lack of a robust predictive framework in the hotel pricing process presents
significant challenges. Hoteliers need to balance between offering competitive prices
and maximizing profitability. Simultaneously, consumers seek more transparency and
predictability in the costs of their accommodation. Therefore, an effective hotel price
prediction system is needed to assist both sides in making informed decisions.
1.2 Complexity of the Problem
Hotel price prediction is a multifaceted problem that incorporates a range of
variables, often making it difficult to model accurately. Key challenges include:
1. Data Volatility: Hotel prices are highly volatile and can fluctuate rapidly due to
factors like last-minute bookings, demand surges (e.g., holidays or local events), and
competitor pricing adjustments.
2. External Influences: Prices can be influenced by external elements such as
geopolitical events, economic downturns, or shifts in travel behavior (e.g., following a
global pandemic).
3. Seasonality: Hotel prices are often subject to seasonal patterns, with peak and
off-peak periods affecting demand and price elasticity.
4. Multi-dimensional Data: Pricing decisions are driven by a variety of factors,
including location, room type, star rating, reviews, proximity to local attractions, and
the length of stay. Combining these different sources of information into a unified
prediction model adds to the complexity.
5. Consumer Behavior: Understanding consumer demand and behavior patterns
plays a significant role. Factors like booking lead time, customer profile, and past
booking trends must also be considered to predict accurate prices.
1.3 Questions Addressed by the Project
This project seeks to answer several critical questions that directly impact hotel
pricing strategies:
1. What are the key factors influencing hotel room prices, and how can they be
quantified?
2. Can we develop a predictive model that provides accurate hotel price forecasts
based on historical data?
3. How can a predictive model help in revenue management for hotels?
4. How can predictive pricing benefit consumers in terms of cost transparency and
booking decisions?
1.4 Application of the Project
The primary application of this project lies in the optimization of hotel revenue
management systems. By providing a more accurate and data-driven pricing model,
the system can help hotel managers adjust prices dynamically in response to changing
demand, competitor activity, and external events. This will lead to enhanced
profitability, especially in an industry that thrives on perishable inventory.
From a consumer perspective, a hotel price prediction system can empower
users with more information, helping them make timely decisions about their
accommodation choices. For example, consumers may be able to predict whether a hotel
price is likely to rise or fall based on upcoming events, enabling them to secure a better
rate. This could lead to a more satisfying customer experience, which is critical in
today’s competitive hospitality market.
Moreover, large-scale platforms such as OTAs (Online Travel Agencies) can
leverage this predictive model to enhance their pricing algorithms, offering real-time
price insights to customers and thereby increasing platform loyalty and trust. Machine
learning techniques can also be used to continually refine the model based on incoming
data, allowing the prediction system to evolve and adapt over time.
In the long term, this project can contribute to the evolution of automated
pricing tools for the hospitality industry, promoting the widespread adoption of AI-
driven solutions for more efficient and adaptive pricing strategies. Through the
integration of this predictive framework into existing hotel management software, the
industry can expect a more streamlined, data-centric approach to pricing.
By addressing both the operational and customer-facing aspects of hotel price
prediction, this project aims to create a comprehensive solution that benefits
stakeholders across the hospitality sector.
2 Data collection
2.1 Data crawling
The goal of the crawling process is to extract critical data from Airbnb to analyze
or model property trends. The key features to gather include:
• Price: The rental cost for the listed property.
• Location (Lat/Long): Geographical coordinates for mapping and spatial
analysis.
• Room Features: A one-hot encoded feature vector for amenities such as Wi-Fi,
safety options, bathroom availability, kitchen, etc.
In addition, we extract booking-related features such as the quoted price for given check-in and check-out dates during the booking process.
The data collection phase is a critical step in building an accurate hotel price
prediction model. A key aspect of this phase is gathering geographic data, such as
latitude and longitude, which we extract from the dynamic map values embedded in each Airbnb listing.
To collect data from hotel websites and Airbnb, web scraping techniques will be
used, with Selenium being the primary tool. Selenium allows for dynamic content
extraction from websites, which is crucial when dealing with sites like Airbnb, where
content is loaded asynchronously using JavaScript.
However, Selenium is known for being relatively slow compared to traditional
web scraping methods, as it operates by simulating a real user’s interaction with the web
page (clicking, scrolling, etc.). This means that extracting large datasets may take
considerable time, especially when dealing with multiple pages and dynamic elements.
To address this, we use multiprocessing to parallelize the crawling process across k
worker processes, which yields a roughly k-fold speedup. A sketch of this setup follows.
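The following is a minimal sketch of this setup, assuming Chrome, a pre-collected list of listing URLs, and an illustrative CSS selector for the price element; the project's actual selectors and worker count may differ:

```python
from multiprocessing import Pool

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

def crawl_listing(url: str) -> dict:
    """Open one Airbnb listing and extract the rendered price text."""
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Airbnb renders prices asynchronously, so wait for the element.
        WebDriverWait(driver, 15).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "span._price"))  # illustrative selector
        )
        price = driver.find_element(By.CSS_SELECTOR, "span._price").text
        return {"url": url, "price": price}
    finally:
        driver.quit()

if __name__ == "__main__":
    urls = ["https://www.airbnb.com/rooms/12345"]  # placeholder listing URLs
    with Pool(processes=4) as pool:  # k = 4 parallel browser workers
        records = pool.map(crawl_listing, urls)
```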
2.2 Preprocessing and data imputation
2.2.1 Duplicate features
In academic data science and machine learning workflows, the process of feature
consolidation and redundancy reduction is a critical aspect of preparing datasets for
analysis and model training. This process addresses the issue of duplicate
representations of the same underlying concept, which can arise from various sources
such as differences in naming conventions, languages, or data formats.
A common approach to addressing this issue is the use of functions like apply()
in conjunction with custom parsing functions (e.g., parse_room_info) to extract
individual components from composite features.
For instance, applying the following code snippet would result in the extraction
of relevant features such as the number of guests, beds, bedrooms, bathrooms, and the
type of bathroom, transforming them into discrete columns:
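The project's original snippet is not reproduced here; the following is a hedged reconstruction of such a parser, with illustrative regex patterns and an example composite string:

```python
import re

import pandas as pd

def parse_room_info(text: str) -> pd.Series:
    """Split a composite string such as '4 guests · 2 bedrooms · 2 beds ·
    1.5 shared baths' into discrete columns."""
    def count(pattern: str):
        m = re.search(pattern, text, flags=re.IGNORECASE)
        return float(m.group(1)) if m else None

    return pd.Series({
        "guests": count(r"(\d+)\s*guests?\b"),
        "bedrooms": count(r"(\d+)\s*(?:bedrooms?|bd)\b"),
        "beds": count(r"(\d+)\s*beds?\b"),
        "bathrooms": count(r"(\d+(?:\.\d+)?)\s*(?:shared\s+|private\s+)?(?:bathrooms?|baths?|bth)\b"),
        "bath_type": "shared" if "shared" in text.lower() else "private",
    })

# Apply the parser to the composite column and attach the new columns.
df = pd.DataFrame({"room_info": ["4 guests · 2 bedrooms · 2 beds · 1.5 shared baths"]})
df = df.join(df["room_info"].apply(parse_room_info))
```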
Similar automated pipelines are used to parse and consolidate other duplicated features.
2.2.2 Utilizing Regular Expressions (Regex) for Standardization
Regular expressions (regex) play a crucial role in identifying patterns within
unstructured data. In this context, regex is employed to:
• Extract Key Information: For example, extracting numerical values (e.g.,
“2” guests) or categorical indicators (e.g., “shared” bathroom type).
• Normalize Representations: Standardizing various formats for the same feature.
For instance, the text representation of numbers like “two” could be converted
into the numeric value 2, and abbreviations like “bd” for “bedroom” can be
expanded.
2.2.3 Ad-Hoc Logic for Edge Cases
While regex provides a robust mechanism for pattern matching, some edge cases
require bespoke logic (a sketch follows the list below). These might include:
• Synonym Mapping: Mapping common synonyms or abbreviations (e.g.,
mapping “bd” to “bedroom” or “bth” to “bathroom”).
• Data Conversion: Converting strings to numeric values where necessary (e.g.,
"two bedrooms" → 2).
• Handling Non-standard Formats: Dealing with unstructured or inconsistent
representations (e.g., "3 and a half bathrooms" might need special parsing to
correctly interpret the fractional bathroom count).
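A minimal sketch of this ad-hoc logic, assuming hypothetical synonym and number-word tables:

```python
import re

SYNONYMS = {"bd": "bedroom", "bth": "bathroom"}             # assumed mapping
WORD_NUMBERS = {"one": 1, "two": 2, "three": 3, "four": 4}  # assumed table

def normalize(text: str) -> str:
    """Expand abbreviations so downstream regexes match consistently."""
    for abbr, full in SYNONYMS.items():
        text = re.sub(rf"\b{abbr}\b", full, text, flags=re.IGNORECASE)
    return text

def to_count(text: str):
    """Convert '3 and a half bathrooms' -> 3.5 and 'two bedrooms' -> 2.0."""
    text = normalize(text.lower())
    m = re.search(r"(\d+)(\s+and\s+a\s+half)?", text)
    if m:
        return int(m.group(1)) + (0.5 if m.group(2) else 0.0)
    for word, value in WORD_NUMBERS.items():
        if re.search(rf"\b{word}\b", text):
            return float(value)
    return None

assert to_count("3 and a half bathrooms") == 3.5
assert to_count("two bd") == 2.0
```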
2.2.4 Feature Consolidation and Reduction
After the extraction, transformation, and standardization of features, the next step
is to consolidate these into a more coherent and concise set of variables. In this example,
the process reduces the number of original features (e.g., a broad set of room-related
attributes) from over 300 features to a refined set of just 40 useful features. This
reduction process helps eliminate redundancy and ensures that only the most relevant
attributes are retained for downstream analysis or model training.
2.2.5 Benefits of Feature Consolidation
The benefits of this feature consolidation and redundancy reduction process in an
academic setting include:
• Improved Dataset Efficiency: By reducing the number of features, the dataset
becomes easier to handle, both in terms of memory usage and processing time.
• Enhanced Model Performance: Fewer, more relevant features can reduce the
risk of overfitting, leading to more generalized models that perform better on
unseen data.
• Increased Interpretability: By consolidating features, the data becomes more
interpretable, making it easier for researchers to understand and analyze the
relationships between variables.
3 EDA
3.1 Overview of Dataset
The dataset used for this hotel price prediction project is sourced from Airbnb,
containing information on various properties listed for rental. The data includes features
that describe the characteristics, amenities, and location of each property, which are
critical for price prediction.
Raw Dataset:
• Size: 6,528 records.
• Features: 354 features, each representing a property attribute such as room type,
amenities, location, and availability.
• Data Collection Period: The data was crawled from January 2025 to
December 2026.
• Observation: While the raw dataset provides a comprehensive view of property
details, many features are irrelevant, redundant, or noisy for the price prediction
task.
Clean Dataset:
• Size: 5,000 records with 19 selected features.
This ensures a more focused analysis, improves model performance, and reduces
the computational complexity, while retaining the most impactful variables for hotel
price prediction.
3.2 Data Distribution
Key Observations:
• A significant portion of hotel prices falls within the 0.3 to 1.2 million VND range.
• Fewer properties are priced above 1.5 million, indicating a diminishing demand
for high-cost accommodations.
• Outliers exist in the higher price ranges (2.5 million and above), representing
luxury or niche offerings.
3.3 Correlation matrix
3.3.1 Strong Correlation with price_1_day
The following features have significant positive correlations with price_1_day
(target variable), suggesting they are key predictors:
• guests (0.67): The number of guests the property accommodates strongly
correlates with price. Larger accommodations tend to have higher prices.
• beds (0.60): More beds are associated with higher prices, reflecting increased
accommodation capacity.
• bedrooms (0.61): The number of bedrooms also shows a strong correlation,
indicating that properties with more rooms typically charge more.
• bathrooms (0.59): Bathrooms contribute significantly to the price, suggesting
that more amenities and comfort are valued by customers.
These features collectively highlight the importance of property size and
capacity in determining prices.
3.3.2 Moderate Correlations
• experience (0.32): Customer experience or rating moderately correlates with
price_1_day, suggesting that higher-rated properties may charge higher prices.
• year (0.16): The year has a weaker but positive correlation, implying slight price
growth over time.
• TV_features, Bếp (kitchen), and Chỗ đỗ xe miễn phí tại nơi ở (free on-site parking) show minimal correlations with price, indicating limited influence.
3.3.3 Low or Negligible Correlations
Several features have near-zero or weak correlations with price_1_day, meaning
they contribute minimally to price prediction:
• lat and lon (latitude and longitude): Geographic location has very weak
correlation, indicating that the data may not capture strong spatial trends.
• Thang máy (elevator), giặt (laundry), and điều hòa (air conditioning): Despite
being amenities, they show little influence on price.
• hồ bơi (pool): Surprisingly weak correlation (0.03), suggesting it might not
strongly influence price in this dataset.
3.3.4 Multicollinearity Between Features
Some features are highly correlated with each other, which may result in
multicollinearity:
• guests, beds, bedrooms, and bathrooms: These features are strongly correlated
with one another (e.g., beds vs guests = 0.90), indicating overlap in the
information they provide. This redundancy may need to be addressed during
feature selection or modeling.
• experience vs beds (0.32) and bathrooms (0.32): These moderate correlations
suggest a partial overlap with capacity-related features.
3.3.5 Negative Correlations
• lat and lon have a moderate negative correlation (-0.55), suggesting that the data
covers a specific region where latitude and longitude inversely relate.
Key Takeaways for Modeling:
• Top predictors: guests, beds, bedrooms, and bathrooms are highly relevant for
predicting price_1_day.
• Redundancy: Address multicollinearity among capacity-related features to
avoid overfitting. Techniques like Principal Component Analysis (PCA) or
feature elimination could be helpful.
• Low-impact features: Features like hồ bơi (pool), giặt (laundry), and Thang máy (elevator) may be optional
for inclusion, as they have weak correlations with price.
• Geographic Features: The weak influence of lat and lon suggests a need for
more granular or meaningful spatial data.
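The correlation analysis above can be reproduced in a few lines of pandas and seaborn, assuming the cleaned DataFrame df from Section 2; the column subset below is illustrative:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Correlations among the key numeric features (column names as used above).
cols = ["price_1_day", "guests", "beds", "bedrooms", "bathrooms",
        "experience", "lat", "lon", "year"]
corr = df[cols].corr()

sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation matrix of key features")
plt.tight_layout()
plt.show()
```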
3.4 Feature Importance Visualization
To identify the most desired features for hotel listings, we analyze the feature
importance scores derived from a Random Forest model. The bar chart highlights
the features that significantly influence hotel prices and customer preferences; a
sketch of how these scores are computed follows the list below.
Top Features:
• Guests: The most critical feature, accounting for ~35% importance, is the
number of guests a listing can accommodate. Customers prioritize properties
that can host their group size comfortably.
• Location (Latitude and Longitude): The latitude (~25%) and longitude (~10%)
features play a pivotal role, emphasizing the importance of location. Listings
close to city centers, tourist attractions, or transport hubs tend to be highly
desirable.
• Bathrooms and Bedrooms: The number of bathrooms (~10%) and bedrooms
are important for customers seeking comfort, especially for families or groups.
Listings with multiple amenities appeal to higher-paying customers.
• Free Parking and Kitchen: Practical amenities like free parking and kitchen
availability are highly valued, particularly for travelers who prefer extended
stays or self-sufficiency.
• Host Experience and Features: Features like TV, laundry, and air
conditioning show moderate importance but are often necessary for customer
satisfaction and comfort.
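A minimal sketch of how such importance scores are obtained; model and X_train are assumed to come from the fitted Random Forest of Section 4:

```python
import pandas as pd

# Per-feature importance from the fitted RandomForestRegressor.
importances = pd.Series(model.feature_importances_, index=X_train.columns)
importances.sort_values().plot(kind="barh", figsize=(8, 6), title="Feature importance")
```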
3.5 Key Insights
3.5.1 Prioritize Key Features
Focus on guests, beds, bathrooms, experience, and lat as these are the strongest
predictors of price_1_day.
3.5.2 Insights on Spatial Features
lat has high importance in the Random Forest model but weak correlation with price_1_day.
This suggests non-linear geographic patterns in the data.
3.5.3 Temporal Features (year, month)
Both year and month have very weak correlation with price. They also appear
insignificant in feature importance.
Actionable Focus: Explore temporal trends further or consider combining them
with other features (e.g., seasonality-based interactions).
3.5.4 Capacity-Related Features
Features such as guests, beds, bedrooms, and bathrooms are highly correlated
with one another.
• Example: guests and beds have a correlation of 0.90, indicating that properties
with higher guest capacity typically offer more beds.
• Similarly, bedrooms and bathrooms show strong relationships, reflecting the
capacity and comfort level of a property.
These features are also strongly linked to the target variable price_1_day,
making them key predictors for hotel pricing.
3.5.5 Experience and Capacity
The experience feature (customer ratings) moderately correlates with capacity-
related features like beds and bathrooms (0.32). This suggests that higher-rated
properties often provide more amenities and greater capacity.
4 Machine Learning models
4.1 Supervised Models
This section provides an overview of the supervised machine learning algorithms
that we experiment with for the regression task. The selected methods are Linear
Regression, Ridge Regression, Support Vector Machines with an RBF kernel (SVM),
Random Forest, and XGBoost.
4.1.1 Method
Linear Regression
Linear Regression is a foundational method used for predicting continuous
outcomes. It establishes a linear relationship between the input features and the target
variable by minimizing the residual sum of squares. Despite its simplicity, it remains a
powerful tool for solving problems with a linear trend in the data.
Ridge Regression
Ridge Regression extends Linear Regression by adding an L2 regularization term to the
cost function. This penalizes large coefficients and helps to reduce overfitting, especially
in datasets with multicollinearity or a large number of features.
Support Vector Machines with Kernel (SVM)
Support Vector Machines are applied to capture both linear and non-linear
relationships in the data. Using kernel functions such as the radial basis function (RBF),
SVMs project the input data into higher-dimensional spaces to identify complex
patterns. SVMs with kernel transformations are particularly effective at capturing
non-linear dependencies in regression tasks.
Random Forest
A random forest is an ensemble learning method for classification, regression and
other tasks that operates by constructing a multitude of decision trees at training time
and outputting the class that is the mode of the classes (classification) or mean prediction
(regression) of the individual trees.
Random forests are powerful machine learning algorithms that can be used
for a wide variety of tasks. They are particularly well suited to data that is noisy or
has many missing values. A random forest builds a large number of decision trees,
each on a random bootstrap sample of the data, and considers only a random subset
of features at each split. As a result, the trees in the forest are largely decorrelated
and learn different patterns from the data, which makes the ensemble especially
effective on tabular data such as ours. Some of the advantages of using random forests can be seen as:
• They are very powerful and can be used for a wide variety of tasks.
• They are very robust to noise and outliers in the data.
• They are relatively easy to train and interpret.
Some of the disadvantages of using random forests include:
• They can be difficult to tune.
• They can overfit the data if they are not trained properly.
XGBoost
Boosting is an ensemble technique where new models are added to correct the
errors made by existing models. Gradient boosting is a supervised learning approach
that combines the estimates of several weaker, simpler models in an effort to predict a
target variable with a high degree of accuracy.
Regression trees serve as the weak learners in gradient boosting regression; each
regression tree maps an input data point to a score. XGBoost minimizes a regularized
(L1 and L2) objective function that combines a convex loss (based on the difference
between the predicted and target outputs) with a penalty term for model complexity
(i.e., the decision tree parameters). Training proceeds iteratively, adding new trees
that predict the residuals of the prior trees; these new trees are then combined with
the previous ones to make the final prediction. It is called gradient boosting because
it uses a gradient descent algorithm to minimize the loss when adding new models.
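As a rough illustration of how these five models are instantiated and fit; the hyperparameters and the feature matrix X / target y are assumptions, not the project's tuned settings:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from xgboost import XGBRegressor

# X: feature matrix, y: price_1_day target (assumed prepared in Section 2).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "Linear Regression": LinearRegression(),
    "Ridge Regression": Ridge(alpha=1.0),
    "SVM (RBF kernel)": SVR(kernel="rbf", C=10.0),
    "Random Forest": RandomForestRegressor(n_estimators=300, random_state=42),
    "XGBoost": XGBRegressor(n_estimators=300, learning_rate=0.1, random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)  # each model learns to predict price_1_day
```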
4.1.2 Evaluation metrics
Root Mean Squared Error (RMSE)
The Root Mean Squared Error (RMSE) is one of the most commonly used
metrics for measuring the accuracy of a regression model. It calculates the square root
of the average of the squared differences between predicted values and actual values.
The formula for RMSE is as follows:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$
Where:
• $y_i$ is the actual value.
• $\hat{y}_i$ is the predicted value.
• $n$ is the number of data points.
Characteristics:
• RMSE is sensitive to large errors due to the squaring of differences, making it
particularly useful when large deviations are undesirable.
• It provides results in the same units as the target variable, making interpretation
straightforward.
• A lower RMSE indicates better model performance.
Mean Absolute Error (MAE)
The Mean Absolute Error (MAE) measures the average of the absolute
differences between predicted and actual values. Unlike RMSE, it does not square the
errors, thus treating all deviations equally. The formula for MAE is:
$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$$
Characteristics:
• MAE provides a simple, intuitive understanding of prediction errors.
• It is less sensitive to outliers compared to RMSE, as it considers absolute
deviations rather than squared deviations.
• A lower MAE indicates better predictive performance.
Mean Absolute Percentage Error (MAPE)
The Mean Absolute Percentage Error (MAPE) expresses prediction errors as a
percentage, making it scale-independent and allowing comparisons across datasets with
different units. The formula for MAPE is:
$$\mathrm{MAPE} = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \times 100$$
Characteristics:
• MAPE is particularly useful when understanding the relative accuracy of a model
is more important than the absolute error values.
• It is easy to interpret as it represents errors in percentage terms.
• However, MAPE is sensitive to small actual values $y_i$, which can result in very
high error percentages and may skew the evaluation.
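The three metrics can be computed together; a minimal sketch using scikit-learn and NumPy:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

def evaluate(y_true, y_pred):
    """Return (RMSE, MAE, MAPE-in-percent) following the formulas above."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mae = mean_absolute_error(y_true, y_pred)
    mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
    return rmse, mae, mape
```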
4.1.3 Hyperparameter tuning
Hyperparameter tuning involves adjusting the settings of a machine learning
model before training to enhance its efficacy. These settings, known as hyperparameters,
are predetermined and not derived from the training data. They influence the model’s
behavior, such as its learning rate, the complexity of a neural network, or the quantity
of trees in a random forest. The objective of this process is to identify an optimal set of
hyperparameters that leads to the most effective performance for a specific task. It’s a
methodical approach involving experimentation with various hyperparameter
combinations and assessing their impact on performance. The optimal combination is
chosen based on which yields the best results. In our project, we use the following
method for the tuning process.
1. Grid Search with Cross-Validation: This is an exhaustive technique that tests
every potential mix of hyperparameters within a specified grid to identify the best one,
as judged by a performance metric. It utilizes cross-validation to assess each model’s
effectiveness and to safeguard against overfitting. Cross-validation segregates the
dataset into subsets for training and validation, with the model being trained on the
former and its performance measured on the latter. This is done repeatedly (k times) to
provide a reliable estimate of the model’s efficacy. Despite its clear approach, this
method has drawbacks, such as potentially missing the optimal hyperparameter
combination if the grid does not cover it or if the best combination lies between the
defined grid points.
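A sketch of this procedure for the Random Forest model; the grid below is illustrative, not the project's actual search space:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5],
}
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=5,                                   # k-fold cross-validation, k = 5
    scoring="neg_root_mean_squared_error",  # consistent with the RMSE metric
    n_jobs=-1,                              # evaluate grid points in parallel
)
search.fit(X_train, y_train)
print(search.best_params_, -search.best_score_)
```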
4.2 Time Series Models
4.2.1 ARIMA (Autoregressive Integrated Moving Average)
ARIMA is a fundamental model in time series forecasting that is well-suited for
data exhibiting linear relationships over time. It combines three components:
Autoregression (AR): Models the dependency between an observation and
several lagged observations (past values).
Differencing (I): Removes trends and converts the time series into a stationary
one, which is a key requirement for ARIMA models.
Moving Average (MA): Captures the dependency between an observation and
the residual errors from a moving average model applied to lagged observations.
ARIMA models are defined by three parameters: $p$ (order of autoregression), $d$
(degree of differencing), and $q$ (order of the moving average). These parameters are selected
through diagnostic tools like the autocorrelation function (ACF) and partial
autocorrelation function (PACF) plots. ARIMA is primarily used for univariate time
series data where the relationship between past values and future observations is
assumed to be linear. Applications of ARIMA include stock price predictions, demand
forecasting, and economic trend analysis.
4.2.2 SARIMA (Seasonal Autoregressive Integrated Moving Average)
SARIMA is an extension of the ARIMA model that accounts for seasonality,
which is common in time series data from various domains. In addition to the three main
ARIMA components (AR, I, and MA), SARIMA incorporates a seasonal component
represented by four additional parameters: $P$ (seasonal autoregressive order), $D$
(seasonal differencing order), $Q$ (seasonal moving average order), and $s$ (length of the
seasonal period).
The SARIMA model is expressed as $\mathrm{SARIMA}(p,d,q)(P,D,Q,s)$, where the
seasonal terms operate on data separated by the defined seasonal period $s$. For instance,
in monthly sales data with yearly seasonality, $s$ would be set to 12. SARIMA is
particularly effective for datasets where patterns repeat periodically, such as retail sales,
weather patterns, or electricity demand. Its ability to model both short-term
dependencies and long-term seasonality makes it a powerful tool for complex time series
forecasting tasks.
4.2.3 Prophet
Prophet is a flexible and user-friendly time series forecasting model developed
by Facebook. Unlike traditional statistical models, Prophet is designed to handle
irregularities in the data, such as missing values and outliers, while providing reliable
forecasts. It employs an additive model with the following components:
1. Trend: A piecewise linear or logistic growth curve that captures the overall
direction of the data over time.
2. Seasonality: Periodic fluctuations (e.g., yearly, weekly, daily) modeled using
Fourier series.
3. Holiday Effects: The ability to include custom or pre-defined holidays that affect
the time series.
Prophet uses a Bayesian framework to estimate model parameters and provides
uncertainty intervals for predictions, making it particularly useful for understanding
forecast reliability. Its intuitive design allows users to easily adjust parameters,
incorporate external regressors, and visualize model components. Prophet is widely used
for business forecasting tasks, including revenue, inventory, and website traffic
predictions, where human interpretable and flexible models are preferred.
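A hedged sketch of fitting the three forecasters on a daily price series; the orders, seasonal period, horizon, and the series variable are illustrative assumptions, not the tuned project values:

```python
from prophet import Prophet
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX

# series: pd.Series of daily prices indexed by date (assumed available)
arima_fit = ARIMA(series, order=(1, 1, 1)).fit()
sarima_fit = SARIMAX(series, order=(1, 1, 1),
                     seasonal_order=(1, 1, 1, 7)).fit(disp=False)  # weekly season assumed

frame = series.reset_index()
frame.columns = ["ds", "y"]  # Prophet's required column names
prophet = Prophet().fit(frame)

horizon = 30  # forecast 30 days ahead
arima_pred = arima_fit.forecast(steps=horizon)
sarima_pred = sarima_fit.forecast(steps=horizon)
future = prophet.make_future_dataframe(periods=horizon)
prophet_pred = prophet.predict(future)[["ds", "yhat"]].tail(horizon)
```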
5 Results
5.1 Supervised Models
5.1.1 Experimental Results and Analysis
Model                  RMSE          MAE           MAPE (%)
Ridge Regression       280,265.79    212,765.09    26.80
SVM with RBF Kernel     92,835.82     49,151.55     6.58
Linear Regression      280,265.79    212,770.27    26.80
Random Forest           61,904.55     24,032.74     3.24
XGBoost                 58,098.26     27,441.71     3.55
1. Ridge Regression and Linear Regression:
a. Both models perform poorly, with very high RMSE (280,265.79) and
MAE (212,765+).
b. Their MAPE (26.80%) indicates that the predictions deviate significantly
from the actual price.
c. These results suggest that these linear models fail to capture non-linear
relationships in the data, which are likely crucial for predicting hotel
prices.
2. SVM with RBF Kernel:
a. SVM shows significant improvement over linear models, achieving an
RMSE of 92,835.82, an MAE of 49,151.55, and a MAPE of 6.58%.
b. The non-linear kernel (RBF) helps SVM better model complex
relationships in the data, leading to more accurate predictions.
3. Random Forest Regression:
a. Random Forest outperforms SVM with an RMSE of 61,904.55, an MAE
of 24,032.74, and a MAPE of 3.24%.
b. The model's ensemble nature and ability to handle non-linearities make it
highly suitable for hotel price prediction.
4. XGBoost:
a. XGBoost achieves the best RMSE of 58,098.26, closely followed by
Random Forest.
b. While XGBoost has a slightly higher MAE (27,441.71) and MAPE
(3.55%) compared to Random Forest, it demonstrates excellent predictive
performance and is highly competitive.
c. Its superior performance stems from gradient boosting, which optimizes
predictions iteratively.
Key Observations
• Linear Models (Ridge and Linear Regression):
o Perform poorly for this task because they fail to capture non-linear and
interaction effects among features.
• Tree-Based Models (Random Forest and XGBoost):
o These models are highly effective for this problem due to their ability to
model non-linearities and feature interactions.
o XGBoost slightly edges out Random Forest in RMSE but has slightly
higher MAE and MAPE.
• SVM with RBF Kernel:
o While significantly better than linear models, SVM lags behind Random
Forest and XGBoost, likely due to its sensitivity to hyperparameter tuning
and scalability issues for larger datasets.
5.2 Time Series Model
5.2.1 Experiment Results
Model      RMSE          MAE
ARIMA       28,698.99     20,947.01
SARIMA     172,632.26    156,169.38
Prophet     46,972.20     41,091.77
1. ARIMA
• ARIMA performs better than Prophet in both RMSE and MAE, indicating a lower error
rate.
• ARIMA significantly outperforms SARIMA, with SARIMA's RMSE and MAE being
several times higher.
2. SARIMA
• SARIMA has much higher errors compared to both ARIMA and Prophet, with its
RMSE and MAE over six times larger than ARIMA's.
• It also performs worse than Prophet by a significant margin in both metrics.
3. Prophet
• Prophet performs worse than ARIMA, with RMSE approximately 1.6 times and MAE
approximately 2 times higher.
• However, Prophet significantly outperforms SARIMA, with errors approximately 3–4
times lower.
5.2.2 Experiment Analysis
The ARIMA forecast line in the first plot appears almost flat over time, showing
very limited sensitivity to the actual data's fluctuations. Despite actual values
experiencing periodic drops and peaks, ARIMA maintains a constant trend, failing to
adapt to volatility in the observed data.
The widening confidence interval over the forecasting horizon indicates
increasing uncertainty as the model attempts to extrapolate future values without a
strong basis for underlying temporal patterns.
In datasets exhibiting non-stationarity, irregular trends, or seasonality,
ARIMA's simplistic assumptions make it inadequate for dynamic predictions.
The SARIMA model in the second plot demonstrates improved sensitivity to data
trends compared to ARIMA. SARIMA captures some volatility and seasonal-like
behavior with visible upward and downward adjustments in the forecasted line.
However, it appears to overestimate future values, particularly in the second half of the
forecast horizon. The confidence interval remains relatively wide, reflecting uncertainty.
Seasonality Handling: SARIMA extends ARIMA by introducing seasonal
components that allow it to model cyclic trends within the data. For example, if hotel
prices tend to increase during peak months (e.g., summer or holidays), SARIMA can
adjust the predictions to reflect this behavior. Volatility Sensitivity: SARIMA is more
reactive to short-term fluctuations in the data.
While SARIMA is well-suited for seasonal time series, it can overfit noisy data
or misidentify volatility as seasonal trends. Fine-tuning seasonal parameters is necessary
to improve prediction accuracy.
The forecast generated by Prophet in the third plot exhibits a smooth downward
trend with a relatively narrow confidence interval. Unlike ARIMA and SARIMA,
Prophet captures the general direction of the data without overreacting to local
fluctuations. This results in a more stable and robust forecast.
6 Challenges
During the development of the hotel price prediction project, we encountered
several challenges that required careful analysis and innovative solutions.
6.1 Web Scraping Limitations
The data was crawled from Airbnb using Selenium, which simulated real user
interactions. However, scraping dynamic and asynchronous content introduced delays
and required careful handling of rate limits and CAPTCHAs.
Solution: Used explicit waits and programmatic scrolling to ensure that dynamically
loaded content had rendered completely before extraction, as sketched below.
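A sketch of this wait-and-scroll logic; the container selector is an assumption:

```python
import time

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

def load_full_page(driver, timeout: int = 15):
    """Wait for the listing container, then scroll until no new content loads."""
    WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "div[itemprop='itemList']"))  # illustrative
    )
    last_height = 0
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(2)  # give lazy-loaded elements time to render
        height = driver.execute_script("return document.body.scrollHeight")
        if height == last_height:
            break
        last_height = height
```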
6.2 Inconsistent Data
Many property attributes were incomplete, containing null values or missing
records for key features like amenities and reviews.
Solution: Used statistical imputation for numerical features and mode-based
imputation for categorical features, as sketched below.
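A minimal sketch of this imputation step; the median for numerical columns is an assumed choice, while the mode for categorical columns follows the report:

```python
# Numerical columns: impute with the median (assumed strategy).
num_cols = df.select_dtypes(include="number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# Categorical columns: impute with the most frequent value (mode).
cat_cols = df.select_dtypes(exclude="number").columns
for col in cat_cols:
    if df[col].isna().any():
        df[col] = df[col].fillna(df[col].mode().iloc[0])
```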
6.3 High Dimensionality
The raw dataset contained 354 features, many of which were redundant or
irrelevant. Identifying the most impactful features for price prediction was challenging.
Solution: Gradually removed columns with more than 70% null values to
eliminate features with insufficient data, reducing noise and improving model
performance. This preprocessing step helped narrow the dataset to a more manageable
number of features, ensuring that only relevant and well-populated columns were
considered for price prediction.
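The 70%-null rule described above amounts to a single pandas call; a sketch:

```python
# Keep only columns with at least 30% non-null values, i.e. drop columns
# where more than 70% of entries are null.
df = df.dropna(axis=1, thresh=int(0.3 * len(df)))
```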
6.4 Noise and Irrelevant Features
Some features had low importance and introduced noise into the model.
Similarly, some amenities provided minimal predictive power.
Solution: Removed features with close-to-zero importance and redundant
attributes to improve model efficiency.
7 Future Work
The progression of the hotel price prediction project involves several key
initiatives aimed at refining the model and aligning it with real-world application
requirements. These steps are detailed as follows:
7.1 Expanding Data Coverage
The dataset will be enriched by incorporating data from a broader range of
accommodation types, regions, and market segments. Factors like room types, seasonal
demand, promotional pricing, and customer feedback will be included to capture the
diverse dynamics of hotel pricing.
Synthetic data generation methods will be explored to simulate pricing patterns,
enabling better handling of sparse or underrepresented scenarios in the dataset.
7.2 Enhancing Predictive Variables
Greater focus will be placed on identifying and crafting advanced predictors that
can capture key pricing drivers. Examples include proximity to events or attractions,
user preferences, and booking lead times.
Techniques such as embedding temporal trends (e.g., holiday periods or high-
demand weekends) and leveraging advanced feature transformations will be applied to
improve model insights.
7.3 Introducing Sophisticated Models
Beyond traditional regression and tree-based models, advanced predictive
frameworks such as Neural Networks and hybrid ensemble techniques will be adopted
to capture complex relationships and pricing behaviors.
Adaptive modeling approaches, including time-series forecasting methods and
multi-task learning, will be evaluated to enhance responsiveness to evolving market
trends.
7.4 Optimizing Model Performance
A systematic approach to tuning model hyperparameters will be employed to
extract maximum predictive power. Automated search strategies, such as grid search or
genetic algorithms, will be used to achieve optimal settings.
The inclusion of validation schemes such as K-fold cross-validation will help
ensure robust performance across diverse data splits.
7.5 Operationalization and User Feedback
We aim to deploy the model as a fully operational system integrated into a user-
friendly platform or web application. This platform will enable hotel managers, travel
agents, and customers to input relevant features (e.g., location, capacity, amenities) and
obtain accurate price predictions.
To maintain its performance and adaptability, the model will require regular
updates incorporating fresh data and user feedback. This iterative process will ensure
the model remains accurate, relevant, and responsive to real-world changes in hotel
pricing dynamics.
8 Conclusion
This project explored the development of a robust hotel price prediction system,
with significant emphasis on data collection, preprocessing, and machine learning model
evaluation. A comprehensive data pipeline was established, including advanced web
scraping techniques and rigorous preprocessing methods to handle missing values,
redundant features, and noisy data. These efforts ensured the dataset was well-suited for
predictive modeling, improving both accuracy and interpretability.
The study highlighted the importance of clean and reliable data in achieving high-
performance models. Tree-based models such as Random Forest and XGBoost delivered
the best results, effectively capturing non-linear relationships and complex interactions
among features compared to time-series methods like ARIMA and SARIMA. While
promising, challenges in data sparsity, feature redundancy, and noise were addressed
using innovative preprocessing techniques.
Future work will expand data coverage to include additional pricing factors,
refine predictive features, and introduce advanced modeling techniques to improve
adaptability and accuracy. By integrating these enhancements, the project aims to
provide a comprehensive and practical solution for dynamic hotel pricing strategies,
benefiting both the hospitality industry and consumers.
References
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical
Learning: Data Mining, Inference, and Prediction. Springer.
Hyndman, R. J., & Athanasopoulos, G. (2021). Forecasting: Principles and
Practice (3rd ed.). OTexts.
Chen, T., & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System.
Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, 785–794.
Box, G. E., Jenkins, G. M., & Reinsel, G. C. (2015). Time Series Analysis:
Forecasting and Control. Wiley Series in Probability and Statistics.
Taylor, S. J., & Letham, B. (2018). Forecasting at Scale. The American
Statistician, 72(1), 37–45.
Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5–32.