Restaurant Recommendation System

INTRODUCTION

Local business review websites such as Yelp and Urbanspoon are popular destinations for people deciding where to eat out. The ability to recommend local businesses to users would be a valuable addition to these sites. In this paper we aim to build a model that recommends restaurants to users. We model the task as predicting whether a user will write a positive or a negative review for a given business. We restrict ourselves to the restaurant segment of the business category, where recommendation is a natural fit. One way this model could be used in practice is to show an automatic 'Recommend: Yes/No' message when a user visits a restaurant's profile page. The problem is thus modeled as predicting yes/no for any given restaurant and user.

In this work, we primarily explore the following directions: 1) optimization algorithms to predict the desired label, and 2) features that help improve the accuracy of this model.

PROBLEM STATEMENT

Systems exist today that recommend restaurants to users, but none of them model the problem as predicting a yes/no answer given a user and a restaurant. To our knowledge, this is the first solution that attempts to recommend a Yes/No given a user and a local business. One assumption we make in this work is that the reviews data is not biased by the label, i.e. the majority of users write reviews uniformly for the restaurants they visit, and not only because of a good or bad experience.

DATA COLLECTION

The data used in this project was obtained from the Yelp Dataset Challenge. The dataset contains five different tables: User, Business, Review, Check-In and Tips. The data covers 14092 restaurants, 252395 users, 402707 tips and 1124955 reviews. The reviews span over 10 years.

We hold out 1 month of reviews as the test data, which contains 22375 reviews from 6/16/2014 to 7/16/2014. In addition, we keep N months of data from the period ending 6/15/2014 as training data, where N is a variable to our learning algorithm. The remaining, older data serves as a source for generating historical features. Here is how the data is split based on time period:

Everything else -> Derived historical features
N months -> Training set
Last 1 month -> Test set

The problem is thus modeled as learning from current data to make predictions about the future.

Yelp users give ratings on a 5-point scale, which we map to a binary label: yes (4, 5) / no (1, 2, 3). Hence, each example in our training/test data is a single review with a binary label. Roughly 65% of the labels are yes and 35% are no, which means a trivial baseline accuracy of 65% can be achieved by predicting yes for everything.
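To make the setup concrete, here is a minimal sketch of the label mapping and time-based split, assuming the reviews live in a pandas DataFrame with date and stars columns (hypothetical names; the actual schema of the Yelp dump may differ):

```python
# Minimal sketch of the label mapping and time-based split described above.
# Column names (date, stars) are assumptions, not the paper's actual schema.
import pandas as pd

def split_by_time(reviews: pd.DataFrame, n_months: int):
    """Map 5-point ratings to a binary label and split reviews by date."""
    reviews = reviews.copy()
    reviews["label"] = (reviews["stars"] >= 4).astype(int)  # yes = 4 or 5 stars

    test_start, test_end = pd.Timestamp("2014-06-16"), pd.Timestamp("2014-07-16")
    train_start = test_start - pd.DateOffset(months=n_months)

    test = reviews[(reviews["date"] >= test_start) & (reviews["date"] <= test_end)]
    train = reviews[(reviews["date"] >= train_start) & (reviews["date"] < test_start)]
    historical = reviews[reviews["date"] < train_start]  # used only to derive features
    return historical, train, test
```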
METHODOLOGY

We will first describe the features being used and the features that we developed to solve this problem. Given that our input tuple is <user, restaurant>, we have features of the following categories:
a) User-level features
b) Restaurant-level features
c) User-Restaurant features

Summary of features

We also classify the features into the following buckets:

1. Raw features
We have 407 raw features from the data itself, comprising 5 user-level features and 402 restaurant-level features. User-level features include the number of days on Yelp, the number of fans, the number of votes, etc. Restaurant-level features include binary features for attributes (parking, take-out, etc.) and categories (cuisine).

2. Derived/Computed features
As described in the previous section, we hold out the majority of the old reviews data for the computation of features. This old period precedes and does not overlap with the periods from which we sample training and test data. We compute 16 derived features from this 'historical' holdout period, described in more detail below. These consist of B) user-level and business-level aggregates, such as the average user historical rating, the average business rating and the number of reviews; C) user-category features, such as the user's average historical rating on the current restaurant's categories; and D) features from the user's social network capturing friends' preferences.

A significant amount of work went into engineering these features and trying different ways to compute them. A lot of the improvement in results came from the iterative creation of new features. We next go into the details of the features, and then summarize the results of adding them.

A. Raw features

From the raw data we had five user-level features: the number of fans, the number of days on Yelp and three different vote counts. There are 61 business attribute features, such as binary information about ambience, diet and facilities. There are 233 binary features covering cuisine and style of restaurant, and 108 binary features for the city in which the business is located.
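As an illustration of how such binary features can be derived, here is a sketch that one-hot encodes business categories with pandas (the input schema is a hypothetical simplification, not the dataset's actual layout):

```python
# Sketch: expanding categorical business data into binary features.
# `categories` is assumed to be a list of strings per business (hypothetical schema).
import pandas as pd

businesses = pd.DataFrame({
    "business_id": ["b1", "b2"],
    "categories": [["Thai", "Buffets"], ["Fast Food"]],
})

# One binary column per category, e.g. business_categories_Buffets.
category_features = (
    businesses.set_index("business_id")["categories"]
    .explode()                      # one row per (business, category) pair
    .str.get_dummies()              # one 0/1 column per distinct category
    .groupby(level=0).max()         # collapse back to one row per business
    .add_prefix("business_categories_")
)
```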

B. Historical user and business aggregated features

The first set of features we implemented was based on the intuition that a business is likely to receive ratings corresponding to its historical ratings:
User-level: average historical rating from this user, number of reviews.
Business-level: average historical rating for this business, number of reviews.

Missing features: since historical data can be missing for certain users or businesses, we circumvented this by adding variations of each feature with default values ranging from the minimum to the average to the maximum. Using a default seems to have helped across the board, as it gives the algorithm a way to treat missing values differently than just zero.
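As an illustration, such aggregates with a 'with default' variant might be computed as follows; the feature names and the use of the global average as the default follow the descriptions in this paper, but the code itself is a sketch, not the authors' implementation:

```python
# Sketch: historical aggregate with a "with default" variant for missing values.
import pandas as pd

def business_rating_features(historical: pd.DataFrame, examples: pd.DataFrame):
    """Join the average historical business rating onto training/test examples."""
    avg = historical.groupby("business_id")["stars"].mean()
    global_avg = historical["stars"].mean()  # default when a business has no history

    out = examples.copy()
    out["business_averageBusinessRating"] = out["business_id"].map(avg)
    out["business_averageBusinessRatingWithDefault"] = (
        out["business_averageBusinessRating"].fillna(global_avg)
    )
    return out
```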
C. User-business category based affinity

In order to improve accuracy further, we decided to implement features that model each user's personal preferences. These features are computed as follows:

Step 1: Compute each user's personal preference for each of the possible business categories and attributes. This is computed as the average rating the user gave to businesses of each category in the historical period. One such feature is avg_rating_for_thaicuisine_for_this_user.

Step 2: In the training and test data, compute matching features comparing the user's preferences with the business's categories. E.g. the best feature in this category was the user's average rating taken over all categories that match the given business's categories.

The intuition behind these features is that a user's personal preference for certain categories of restaurants should be a strong signal for whether the user will like a future restaurant.
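A minimal sketch of Step 2, assuming Step 1 has already produced a per-user dict mapping each category to the user's average historical rating (data structures and names are illustrative):

```python
# Sketch of the category-affinity matching feature for one <user, restaurant> pair.
from statistics import mean

def category_match_rating(user_pref: dict[str, float],
                          business_categories: list[str],
                          default: float) -> float:
    """Average the user's historical per-category ratings over the categories
    that match the given business; fall back to a default when nothing matches."""
    matched = [user_pref[c] for c in business_categories if c in user_pref]
    return mean(matched) if matched else default

# Usage: a userbus_averageRatingForCatMatchWithDefault-style value.
pref = {"Thai": 4.5, "Buffets": 2.0}
feature = category_match_rating(pref, ["Thai", "Breakfast"], default=3.7)  # -> 4.5
```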
D. Collaborative Features

The publicly available dataset also provides each user's social graph, i.e. the user's friends. Using the intuition that the likings of a user's friends are good representatives of the user's own likings, we developed the following feature: given a business and a user in the training/test set, the average rating for this business from this user's friends in the historical period. As before, we used suitable variations with default values for when the feature was missing a value.
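A minimal sketch of this friends-based feature, assuming the social graph as a dict of friend sets and historical ratings keyed by (user, business) pairs (illustrative data structures, not the paper's implementation):

```python
# Sketch: average historical rating of a business from the user's friends.
def friend_avg_rating(user: str, business: str,
                      friends: dict[str, set[str]],
                      hist_ratings: dict[tuple[str, str], float],
                      default: float) -> float:
    ratings = [hist_ratings[(f, business)]
               for f in friends.get(user, set())
               if (f, business) in hist_ratings]
    # avgfriendratingonthisbusinessD-style: default when no friend rated it.
    return sum(ratings) / len(ratings) if ratings else default
```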

UNDERSTANDING THE FEATURES

Before delving into training a model, we want to analyze the features. We used a simple measure, the F-score, to identify the top features in our data. Here are the top features by F-score on a 1-month training set:

Index | Feature | F-score
1 | business_averageBusinessRatingWithDefault | 0.1011
2 | business_averageBusinessRating | 0.0328
3 | userbus_averageRatingForAttrMatchWithDefault | 0.0123
4 | userbus_averageRatingForAttrMatchWithDefaultW | 0.0121
5 | user_averageUserRatingWithDefault | 0.0120
6 | userbus_averageRatingForCatMatchWithDefaultW | 0.0120
7 | userbus_averageRatingForCatMatchWithDefault | 0.0113
8 | avgfriendratingonthisbusinessD | 0.0063
9 | business_attributes.[Link] | 0.0039
10 | business_categories_Buffets | 0.0035
11 | business_attributes.[Link] | 0.0033
12 | business_attributes.Caters | 0.0025
13 | business_categories_Fast Food | 0.0024
14 | business_attributes.[Link] | 0.0024
… | others | …

Historical user and business aggregated features: the top feature, business_averageBusinessRating, is the average rating computed from the historical holdout period. business_averageBusinessRatingWithDefault is normalized to always have a value even when historical data is missing for the business (the default we use is the average rating across all reviews). user_averageUserRatingWithDefault is the corresponding average rating for the user, with a default.

User-business category based affinity: userbus_averageRatingForAttrMatchWithDefault measures the affinity of a user with a business based on the user's historical ratings on the business's attributes (the corresponding CatMatch features use its categories).

Collaborative features: avgfriendratingonthisbusinessD is the average rating for this restaurant from the user's friends, with a default.

The remaining features on the list are binary features on business categories and attributes, and their names are self-explanatory. The top 8 features are all derived features described in the previous section, with the top feature being the historical average rating of the given restaurant.
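The paper does not spell out the exact F-score formula; the sketch below assumes the Fisher-style F-score commonly used for SVM feature selection, applied per feature to the binary-labeled training matrix:

```python
# Sketch: Fisher-style F-score for ranking features against a binary label
# (an assumption about the measure used; the paper only names "F-Score").
# F(i) = ((mean_pos_i - mean_i)^2 + (mean_neg_i - mean_i)^2) / (var_pos_i + var_neg_i)
import numpy as np

def f_scores(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    pos, neg = X[y == 1], X[y == 0]
    num = (pos.mean(axis=0) - X.mean(axis=0)) ** 2 \
        + (neg.mean(axis=0) - X.mean(axis=0)) ** 2
    den = pos.var(axis=0, ddof=1) + neg.var(axis=0, ddof=1)
    return num / den

# Usage: rank feature indices from most to least discriminative.
# top = np.argsort(-f_scores(X, y))
```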
MODELING THE PROBLEM

In this section we describe the different algorithms we used, varying parameters such as the size of the training data and the subset of features, and evaluate their performance. We experimented with a few different algorithms, variations in training data, and variations in features, and we describe the results from each separately. Before proceeding, we define the feature sets used in the results:

Feature Set 1: only the raw features defined in section A.
Feature Set 2: the raw and derived features defined in sections A and B, i.e. including historical average ratings per user and business and simple review statistics.
Feature Set 3: all the raw and derived features defined in sections A, B, C and D, i.e. everything previously mentioned, including user-category affinity and collaborative features.

A. Learning Algorithms Used

We experimented with the following algorithms to train the rating predictor classifier:
- SVM with RBF kernel
- Linear SVM
- Logistic regression
A.1. SVM with RBF Kernel

Our first approach was to train an SVM classifier using the radial basis function kernel. Since the amount of training data needed was not clear at this point, we vary the training data size, with review periods ranging from 1 week, 1 month and 2 months to 4 months (the test data remains unchanged). We measure accuracy as the percentage of reviews for which we predicted the positive or negative label correctly. The table below shows the change in performance of the model with varying training data size, using Feature Set 3.

Table 1: SVM with RBF kernel with Feature Set 3
Training period | # examples | Training accuracy | Test accuracy
1 week | 5000 | 97.88% | 66.12%
1 month | 20000 | 95.10% | 65.50%
2 month | 40000 | 92.88% | 65.88%
4 month | 80000 | 90.18% | 66.22%

[Figure: learning curve with SVM (RBF kernel); x-axis: training set size (1 week to 4 months); y-axis: accuracy (60%-100%); series: training accuracy and testing accuracy.]

From the learning curve, it is clear that some over-fitting is happening with less training data, and that adding more training data helps. However, even with 4 months' worth of training data, the testing accuracy improves only marginally.

Reducing over-fitting: the high degree of over-fitting likely arises from the high-dimensional feature mapping of the RBF kernel. We iterated on regularization methods to achieve significantly better results, optimizing gamma and C using a parameter sweep with cross-validation (a sketch of such a sweep follows at the end of this subsection).

Table 2: Regularized SVM with RBF kernel with Feature Set 3
Training period | # examples | Training accuracy | Test accuracy
1 week | 5000 | 73.10% | 68.59%
1 month | 20000 | 70.78% | 69.41%
2 month | 40000 | 70.38% | 69.33%
4 month | 80000 | 70.16% | 69.54%

This clearly shows less over-fitting and better test accuracy.
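The sketch below shows one way such a cross-validated sweep over gamma and C might look in scikit-learn; the grid values and tooling are assumptions, not the paper's actual setup:

```python
# Sketch: cross-validated sweep over C and gamma for an RBF-kernel SVM.
# The grid values are illustrative assumptions, not the paper's actual sweep.
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {
    "C": [0.1, 1, 10, 100],
    "gamma": [1e-4, 1e-3, 1e-2, 1e-1],
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="accuracy")
# search.fit(X_train, y_train)    # X_train/y_train: feature matrix and binary labels
# best = search.best_estimator_   # regularized model behind Table 2-style results
```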
A.2. Logistic regression

We applied the same training and test data with Feature Set 3 to logistic regression; the results were as follows:

Table 3: Logistic regression with Feature Set 3
Training period | # examples | Training accuracy | Test accuracy
1 week | 5000 | 71.97% | 69.10%
1 month | 20000 | 69.93% | 69.55%
2 month | 40000 | 69.43% | 69.68%
4 month | 80000 | 69.40% | 69.28%

A.3. SVM with Linear Kernel

We observed that, given our vast feature set, a high-dimensional kernel for the SVM did not add much value; in fact, there was over-fitting in the high-dimensional space until significant regularization was added. We therefore experimented with an SVM with a linear kernel, which reduced over-fitting and produced comparable, and even slightly better, results, as shown in the table below.

Table 4: SVM with linear kernel
Training period | # examples | Training accuracy | Test accuracy
1/2 week | 3000 | 73.02% | 68.01%
1 week | 5000 | 72.31% | 68.94%
1 month | 20000 | 70.10% | 69.89%
2 month | 40000 | 69.73% | 69.77%
4 month | 80000 | 69.63% | 69.49%

[Figure: learning curve with linear SVM; x-axis: training set size (1/2 week to 4 months); y-axis: accuracy (67%-73%); series: training accuracy and testing accuracy.]

As the learning curve shows, we achieve comparable training and test accuracy with large enough training data, i.e. periods of 1 month or more. There is little to no over-fitting at this stage, and this is the maximum we can learn from the given set of features and training data.
B. Impact of derived feature sets

The following table compares testing accuracies with the different feature sets and varying training data.

Table 5: Testing accuracy with different feature sets, with linear SVM
Training period | Feature Set 1 (raw only) | Feature Set 2 | Feature Set 3 (all derived)
1/2 week | 64.90% | 67.91% | 68.03%
1 week | 65.68% | 68.81% | 68.91%
1 month | 66.97% | 69.79% | 69.89%
2 month | 66.97% | 69.77% | 69.82%
4 month | 67.09% | 69.45% | 69.50%
8 month | - | 69.30% | 69.42%

The table clearly shows the superior results obtained with the derived features. Feature Set 3, the set of all raw and derived features, consistently performs better than Feature Set 2 for all training data sizes, although only marginally.

C. Impact of varying training data size

We see interesting results as the training data size varies. Specifically, training accuracy (Table 4) goes down as training data size increases. This indicates a good reduction in variance, in that the over-fitting problem is fixed by increased training data.

To analyze testing accuracy more closely, the zoomed-in learning curve below plots only testing accuracy, for Feature Set 3 with the linear SVM.

[Figure: testing accuracy vs. training data size, Feature Set 3 with linear SVM; x-axis: 1/2 week to 8 months; y-axis: 67.9%-69.9%.]

Test accuracy increases with more training data until about 1 month of training data, but it decreases with significantly more data, such as 4 or 8 months. We explain this behavior with the following hypotheses. The way we grow the training data is not by random sampling; rather, we increase training data by going further back in time and holding out more old data for training. This also means that the historical holdout period from which the derived features are computed gets older as the training set grows. We summarize the hypotheses here:
o Recent training data is more representative of reviews in the upcoming period than older training data.
o Derived features from the recent period are stronger signals for predicting reviews in the upcoming period.

Thus, in machine learning systems that use past data to predict future results, it is important to optimize how the data is divided between the feature holdout and the training set. We would also like to experiment with weighting training data by its age; a sketch of this idea follows.
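The paper leaves age weighting as a future experiment; below is a hedged sketch of one possible realization, using per-sample weights that decay exponentially with the age of the review (the decay scheme and half-life are assumptions):

```python
# Sketch: weighting training reviews by recency (exponential decay is an assumption;
# the paper only proposes weighting by age, not a specific scheme).
import numpy as np
import pandas as pd
from sklearn.svm import LinearSVC

def age_weights(dates: pd.Series, reference: pd.Timestamp, half_life_days: float = 90.0):
    age_days = (reference - dates).dt.days.to_numpy()
    return np.power(0.5, age_days / half_life_days)  # weight halves every half-life

# model = LinearSVC()
# model.fit(X_train, y_train,
#           sample_weight=age_weights(train["date"], pd.Timestamp("2014-06-15")))
```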
SUMMARY OF RESULTS

We summarize the results from the previous sections as follows:

1. The different algorithms produced comparable results, although the linear SVM was the least susceptible to over-fitting and performed marginally better. We achieved a testing accuracy of 69.89% with the linear SVM, Feature Set 3 and 1 month of reviews as training data.

2. We see a significant improvement from derived features, specifically from:
a. historical average ratings for the business;
b. the affinity of the user to a specific business category;
c. collaborative features.

3. Increased training data reduced over-fitting, but there is value in weighting training data by the age of the label: recent data is more useful for learning than older data.

4. It was important to treat missing feature values differently than zero, by providing default-valued variations for the model to learn from.

5. In the end, we perform about 5 percentage points better in accuracy than the trivial baseline of always predicting yes.

FUTURE WORK

We acknowledge that the problem being solved is hard, specifically for the following reasons:

1. Unclear predictability of reviews: any supervised learning problem aims to learn the labels from the provided features. The underlying assumption we make here is that the features we have access to are sufficient to predict a positive or negative review. However, one can imagine that a future review depends heavily on the experience the user has at the restaurant, which is not captured anywhere in the features. This could make the correlation between the features and the label weaker than would be ideal for a supervised learning problem. At this stage this is not settled, as we have not yet explored all possible features, but it is a concern.

2. Unclear bias in the reviews used for training and evaluation: one assumption we make is that a user's decision to review a restaurant is random, and not biased by an unusually good or bad experience.

Future work would involve identifying stronger features beyond what is available in the datasets, as well as investing in an approach to gather training and evaluation data by alternate means (such as explicit human judgment systems).
