Recommender System Evaluation Plan

The document outlines a plan to measure and evaluate a new recommender system for an e-commerce site. The goals are to increase revenue during back-to-school season by better matching users and items. Key metrics include revenue increase, customer loyalty, and marketing savings. Algorithms to be tested include content-based filtering, user-user collaborative filtering, item-item collaborative filtering, and matrix factorization. Models will be evaluated offline on Amazon data using RMSE, nDCG, diversity, and popularity. The best elements of each algorithm may be combined into hybrid models.

Part I: Designing a Measurement and Evaluation Plan

1. Translation of business goals and constraints into metrics and measurable criteria

1.1. Final Objectives

With the implementation of the new user-profile-based recommender systems, we intend to
increase Nile-River.com website sales during the "Back to School" season by using different
strategies to match similar users and items, so that we can recommend better additional
products to the users of our website.
The final KPI of interest with this implementation is:

- Revenue increase: Simply put, we want to generate more revenue for the company
by increasing the number of items sold during this special period. The board of
directors expects the sales increase to approximate the increase currently observed
at our competitors.

Indirectly, we also expect this implementation to improve other metrics such as:

- Customer Loyalty: By optimising their purchase experience, we expect customers to
come back more often to our webstore and to prefer it over our competitors'.

- Savings in Marketing Campaigns: As the localised campaigns and free-delivery
campaigns did not perform as expected, and as we believe the new recommender
system could outperform and even partially replace these marketing campaigns,
we expect this implementation to generate savings in the marketing department.

1.2. Recommender System Objectives

Specifically for the recommender system, we want it to fulfill these desirable characteristics:

- The recommender system will recommend items based solely on the historical user
profile, i.e., it will look at past user rating behaviour and not at contextual signals,
such as the current browsing session or shopping cart.

- Research indicates that purchases in this period are fairly evenly distributed across
the different office products and price categories. As we were allotted only two
5-item recommendation sections, we want these lists to have high diversity across
all of our available office product categories and across price ranges.
Besides, as we are focused on a sales increase, we will not focus on
coverage, but on popularity: we want to recommend the items most frequently
bought by our customers, increasing our chance of a sale conversion.

- Lastly, in order to offer a different service to our customers, we will use the
second 5-product recommendation section to surface new products which customers
would not find in traditional stores and possibly do not know about yet. This criterion
also falls under diversity, but now we recommend items with a low score
on the availability index as part of the criterion.
1.3. Base Algorithms Evaluation

Considering the constraints above, we think it is logical to test different recommendation
algorithms in order to find the one that best fits both performance and criteria
satisfaction. As individual algorithms, we intend to use the following models:

- Content-Based Recommendation (CBR): Content-based algorithms are one of the
simplest forms of personalised recommendation we can create, and we can use them
to establish a quick baseline for future implementations. CBR has the
disadvantage of needing item descriptions in order to match users' tastes with items.
As a first approach, we can use the item category column in the dataset as the first item
descriptor, both the leaf category and, where available, the full category path.
- User-User Collaborative Filtering (UUCF): As a first version of collaborative filtering,
we do not want to rely on the quality of the item descriptions used by the CBR in order to
provide good recommendations. We know that UUCF does not scale well and that it is
difficult to keep user taste profiles up to date, but our clear objective for this
algorithm is to provide the novel products that can be used in the second
recommendation section, a trait that is more pronounced here than in the more
conservative Item-Item Collaborative Filtering.
- Item-Item Collaborative Filtering (IICF): We intend to use this algorithm and
matrix factorization as the main recommendation providers for the first recommendation
section. They will be focused on providing good, popular recommendations,
which in turn aim to increase customer purchase conversion.
Besides, we want to identify the best neighbours for each user in order to
provide better explanations of why we recommended an item to that user.
- Matrix Factorization: As the algorithm with the best predictive performance, we will
use a version of matrix factorization. Contrary to IICF, we will use this algorithm
essentially to provide recommendations without worrying too much about explaining
how we arrived at a given item for a given user.

- The evaluation for this step will be offline only, as we do not yet have a platform for
online evaluations, but we will describe how it could be done at the end of this section.

For the offline evaluation, we will also work with a static dataset, as we do not want to spend
too much time engineering data-collection pipelines in this experimentation phase. We will
use the Amazon ratings dataset, filtered to office products and to items with at
least 5 ratings. We will use an off-the-shelf recommender systems package, such as LensKit or
Python's Surprise package, and try to predict the rating a user would give to a
certain item; a minimal sketch of this setup is shown below, followed by the criteria we will
use to evaluate the models.
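The sketch below illustrates this setup under some assumptions: the file name amazon_office_products_ratings.csv and the column names are hypothetical, and the real pre-processing is done in the accompanying notebook.

```python
import pandas as pd
from surprise import Dataset, Reader, SVD

# Hypothetical file and column names; the real pre-processing lives in the notebook.
ratings = pd.read_csv("amazon_office_products_ratings.csv",
                      names=["user_id", "item_id", "rating", "timestamp"])

# Keep only items with at least 5 ratings, as described above.
item_counts = ratings["item_id"].value_counts()
ratings = ratings[ratings["item_id"].isin(item_counts[item_counts >= 5].index)]

# Build a Surprise dataset and fit a simple matrix-factorization model (SVD).
# For simplicity this sketch trains on all ratings; the temporal train/test
# split we actually intend to use is described further below.
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(ratings[["user_id", "item_id", "rating"]], reader)
trainset = data.build_full_trainset()

model = SVD()
model.fit(trainset)

# Predict the rating a given user would give to a given item.
prediction = model.predict(uid="SOME_USER_ID", iid="SOME_ITEM_ID")
print(prediction.est)
```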

- The main criterion we will use for the models' performance, i.e., rating quality, is
Root Mean Square Error (RMSE). The score is computed from the errors between the
ratings predicted by the model and the true ratings given by the users.
Because this metric needs the users' rating data, we will evaluate the models'
performance only on items the users have already rated.
- As we are returning a list for each of the two advertisement areas on the website, we will
also evaluate the order of the items in the returned list. Since we believe the order in which
items appear in the promoted space helps users engage with the website, we will use
Normalised Discounted Cumulative Gain (nDCG) to determine whether a
recommended item list is good or not. For this, we define the possible item relevance
scores as follows (a sketch of this scoring appears after this list):

    - score = 2 if the recommender returned an item the user has already rated and
    the rating is more than 0.5 points above the user's average rating. Working relative
    to each user's average avoids the differences in rating scale that different customers
    are used to.
    - score = 1 if the recommender returned an item the user has already rated and
    the rating is no more than 1 point below and at most 0.5 points above the user's
    average rating.
    - score = 0 if the recommender returned an item the user has not rated yet.
    We know this is not desirable, as we are punishing the recommender for
    providing innovative recommendations, but we intend to compensate for this
    when we move to online evaluations.
    - score = -1 if the recommender returned an item the user has already rated
    and the rating is more than 1 point below the user's average rating.

- Lastly, as part of the business constraints presented before, we will also look at
other metrics, such as diversity and popularity (also included in the sketch after this
list). In terms of price, the diversity index is the standard deviation of the prices of the
items returned by the recommender. For category, the diversity index counts the number
of distinct categories present in the returned list. For popularity, we count how many of
the returned items have more than a later-specified threshold of ratings. A good
popularity-oriented system will return items which, on average, all users buy.
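As an illustration only, the sketch below shows one way this relevance scheme and the list metrics could be computed. The inputs (per-user mean ratings, item prices, categories and rating counts) and the popularity threshold are assumptions; the exact implementation lives in the notebook.

```python
import numpy as np

def relevance(user_rating, user_mean):
    """Relevance score for one recommended item, per the scheme above."""
    if user_rating is None:            # the user has not rated this item
        return 0
    if user_rating > user_mean + 0.5:
        return 2
    if user_rating >= user_mean - 1.0:
        return 1
    return -1                          # more than 1 point below the user's average

def dcg(scores):
    """Discounted cumulative gain of a relevance list in its given order."""
    return sum(s / np.log2(i + 2) for i, s in enumerate(scores))

def ndcg(scores):
    """DCG normalised by the DCG of the ideally ordered list."""
    ideal = dcg(sorted(scores, reverse=True))
    return dcg(scores) / ideal if ideal > 0 else 0.0

def price_diversity(prices):
    """Standard deviation of the prices in the recommended list."""
    return float(np.std(prices))

def category_diversity(categories):
    """Number of distinct categories in the recommended list."""
    return len(set(categories))

def popularity(rating_counts, threshold=50):
    """Number of recommended items with more than `threshold` ratings
    (the threshold value here is a placeholder, to be defined later)."""
    return sum(1 for c in rating_counts if c > threshold)

# Example for one recommended 5-item list of a single user:
user_mean = 4.1
user_ratings = [5.0, None, 3.9, None, 2.5]   # None = not rated by this user
scores = [relevance(r, user_mean) for r in user_ratings]
print(ndcg(scores))
print(price_diversity([12.9, 45.0, 7.5, 19.9, 33.0]))
print(category_diversity(["pens", "paper", "pens", "staplers", "ink"]))
print(popularity([120, 30, 85, 12, 400]))
```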

The models are going to be trained on one segment of the original dataset, and all the
performance statistics are going to be calculated on a different segment, the test set. As we
want to preserve the time dynamics of the ratings, we will not perform cross-validation, such
as k-fold cross-validation, since this would insert data from a future time into the training data,
which we would then use to predict past data in the test set. To control for this, we will take
the earliest 70% of the data for training and leave the newest 30% for testing and reporting of
the statistics (a minimal sketch of this split follows).
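A minimal sketch of this temporal split, assuming the ratings DataFrame from the earlier sketch includes a timestamp column:

```python
# Temporal 70/30 split: earliest 70% of ratings for training, newest 30% for testing.
ratings = ratings.sort_values("timestamp")
cutoff = int(len(ratings) * 0.7)

train_df = ratings.iloc[:cutoff]
test_df = ratings.iloc[cutoff:]
```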

- Lastly, we want to evaluate opportunities for using hybrid models. Usually, a single model
does not achieve the best performance in all situations, such as across all item categories,
customer segments or geographic locations. As a first approach, we want to use the known
characteristics of each algorithm and apply them to specific product/customer
segments. For example, as we know the performance of collaborative filtering and matrix
factorization depends heavily on a high number of customer and/or product ratings, we can
switch their recommendations to a content-based approach when a certain item does not
have a sufficient number of ratings or when we are dealing with a new user in our
shop (the cold-start problem). We can also combine their predictions using a weighted
average, in which we define the importance of each algorithm on our website (see the
sketch below).
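The sketch below illustrates both ideas, switching on cold-start conditions and blending with a weighted average. The predict(user, item) interface, the thresholds and the weights are all assumptions to be tuned, not the final design.

```python
MIN_ITEM_RATINGS = 5   # placeholder cold-start thresholds, to be tuned
MIN_USER_RATINGS = 3

def hybrid_predict(user, item, n_item_ratings, n_user_ratings,
                   cbr, iicf, mf, w_iicf=0.5, w_mf=0.5):
    """Switching hybrid with a weighted-average core.

    Falls back to the content-based recommender (cbr) for cold-start
    items/users; otherwise blends IICF and matrix factorization predictions.
    """
    if n_item_ratings < MIN_ITEM_RATINGS or n_user_ratings < MIN_USER_RATINGS:
        return cbr.predict(user, item)            # cold start: content-based
    return (w_iicf * iicf.predict(user, item)     # weighted average of the
            + w_mf * mf.predict(user, item))      # collaborative models
```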
2. Measurement:

For the offline evaluation, we used the Amazon dataset, containing ratings and evaluations
for items and users. All the metric calculations and analysis were performed in a
Jupyter notebook and saved in our code repository (notebook link here).

3. Mixing the Algorithms:

The four proposed hybridization techniques are the following:

1. Linear ensemble
2. Non-linear ensemble
3. Top 1 from each recommender
4. Recommender switching

Their construction, testing and sampling are shown in section 6 of the notebook
(notebook link here).

4. Proposal and Reflection:

4.1. Part I

The recommenders proposed above had the objective of improving the sales of Nile-River.com,
mainly in the office products category, which was not achieving the expected performance
during high seasons.

As an attempt to improve sales in this period, we wanted to create two allocated areas
for recommending products to users. These recommendations would give us the opportunity
to suggest items the users had already bought, as well as products in different price
ranges or availability statuses, i.e., to bring a wider variety to these lists, from different
perspectives. By offering either known products, which users can remember and buy
again, or new products, which users can discover through their purchase history, we wanted
to explore the opportunities of bringing in new customers and keeping the old ones, with the
final objective of increasing the company's sales.

While researching the recommendation algorithms, we wanted to make sure of two things
from the beginning:

1 - The predicted recommendations were accurate enough, i.e., we could trust the
recommender's predictions.

2 - The recommender managed to recommend products the user had already bought. We
set this constraint because it was the easiest mechanism for evaluating the possible success
of the recommender, as price or availability variability did not give us more precise
expectations of the model's success in terms of bringing more revenue to the company.

In order for the models to understand how users interacted with our products, we fed them
historical purchase and rating data, containing transactions of which user bought which item,
along with the evaluation given to it. As evaluation criteria, we checked a few metrics.
To check a model's accuracy, we calculate, on average, by how much the
recommender's predicted rating for a (user, item) pair missed the actual value. For example,
a value of 0.5 means the model's predictions floated around 0.5 rating points away from the
actual values; if this were the case, it would be a reasonable result.
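As an illustration, this average miss could be computed as below; the function name and the inputs (parallel lists of predicted and actual ratings) are only assumptions, and the accuracy criterion actually reported in the notebook is the RMSE described in Part I.

```python
def mean_rating_miss(predicted, actual):
    """Average absolute difference between predicted and actual ratings."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

# e.g. mean_rating_miss([4.2, 3.1, 4.8], [4.0, 3.5, 5.0]) -> ~0.27
```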

Secondly, the "precision at 5" metric. This metric calculates, on average, how many of the top
5 items suggested by the recommender are within the list of all items a user has bought. The
metric varies from 0 to 1, where 0 means the recommender suggested no items the user has
bought and 1 means all 5 suggested items had already been bought by the user before.
Because this is a business where customers are used to buying repeat items, this metric was
defined as the most significant for us, as it gives a real idea of how often users might interact
with the suggested lists.
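A sketch of how precision at 5 could be computed, assuming we have each user's top-5 recommended items and the set of items that user has bought; the names and data structures here are illustrative.

```python
def precision_at_5(top5_items, bought_items):
    """Fraction of the 5 recommended items the user has actually bought."""
    return len(set(top5_items) & set(bought_items)) / 5.0

def mean_precision_at_5(recommendations, purchases):
    """Average precision@5 over all users.

    recommendations: dict user -> list of 5 recommended item IDs
    purchases:       dict user -> set of item IDs the user has bought
    """
    values = [precision_at_5(items, purchases.get(user, set()))
              for user, items in recommendations.items()]
    return sum(values) / len(values) if values else 0.0
```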

The other metrics can be grouped as secondary. Price variability, availability variability and
popularity measure how diverse the suggested lists are in terms of each specific aspect. If
we had recommenders tied on accuracy and precision at 5, we would look at these other
measures, as they could point to more desirable features.

The recommenders tried in this research were the following: content-based, personalised
bias, user-user collaborative filtering, item-item collaborative filtering and matrix factorization.
Many more algorithms exist in the recommender systems literature, but we decided to move
on with these, as they offer different trade-offs in terms of their advantages and
disadvantages. The list above presents the models in order from lower predictive power but
more explainability to higher predictive power but little explainability. The business constraints
define whether we want an explanation behind the recommendations or not. Besides, this
list includes algorithms that perform best when we do not have many ratings but are more
costly in terms of developing an item metadata infrastructure, such as the content-based
model, and models that use simple data, such as ratings, but only perform well when we have
a decent amount of data about an item or user, such as collaborative filtering.

Lastly, we do not need to stay with only one recommender. As we said, the recommenders
can perform better or worse in different situations. A way to circumvent this is to mix
algorithms. In the notebook we presented different configurations we could use to improve
our metrics of interest. In particular, we came to the result that the top_1_all recommender
was the best performer in our business setting. The top_1_all hybrid creates a 5-item list by
concatenating the top 1 item of each of the 5 basic recommenders presented before (a sketch
of this idea follows). It had accuracy similar to the basic algorithms, but its precision was the
best among all.
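A sketch of how such a top_1_all list could be assembled, assuming each base recommender exposes a top_n(user, n) function returning a ranked list of item IDs; the interface and the duplicate handling are assumptions, not the notebook's exact code.

```python
def top_1_all(user, recommenders):
    """Concatenate the top-1 item of each base recommender into one list.

    If two recommenders agree on the same item, fall back to the next-ranked
    item of the later recommender so the list still contains 5 distinct items.
    """
    items = []
    for rec in recommenders:
        for candidate in rec.top_n(user, n=5):
            if candidate not in items:
                items.append(candidate)
                break
    return items

# Usage with hypothetical recommender objects:
# final_list = top_1_all(user_id, [cbr, bias, uucf, iicf, mf])
```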
4.2. Part II
● Reflect on the process of translating business requirements to metrics. What was
easy or hard about that process? Do you feel you had adequate preparation in the
specialization to take on such a task?

This last part contains some final thoughts on the capstone project and the specialisation in general.

If I were to name the two most difficult parts of these kinds of projects, it would be easy:
defining the problem and translating it (in both directions) into a recommender system or
machine learning context are the two hardest steps. In this project, the former was already
defined, but even so, capturing the business requirements AND constraints in a set of
statistical metrics was a great challenge.

Another point: working with a static dataset (offline evaluation) turned out to be much quicker
than designing the setup needed to perform an A/B test. However, it also showed how difficult
it is to prove that an improved metric on the evaluation/test set means the model will bring
positive results to the company. The offline evaluation proved to be quick, but it cannot
substitute for an online evaluation when one is possible.

Overall, the specialisation provided a good foundation for us to evolve in this field. It taught
us how these algorithms and processes work in theory, and the capstone gave us a taste of
messy, real-life, hands-on work. The next step would be more hands-on experience, in order
to get a real grasp of what works and what doesn't in real-life business scenarios.

Reflect on the process of evaluating individual algorithms and creating hybrids. How easy
or hard was it to identify differences in performance of algorithms on different criteria?
How easy or hard was it to bring together elements from different algorithms through
hybrids? How confident are you that your final result is a good algorithm for the problem?

Because of the limited dataset, I had the feeling that the models performed roughly equally
overall, with some of them looking promising on certain metrics of interest. Creating the
hybrids was somewhat demanding: the range of possibilities and their underlying theory
were quite clear, but the difficulty for me was to theorise and show how these hybrids could be
better than the base recommenders given the dataset we had. The best-performing model
was a hybrid but, as stated in the previous answer, I am not 100% confident in the model's
generalisation capacity, because the offline test could not provide a reassuring answer with its
offline metrics. I'd like to check how this model would perform in a controlled A/B testing
situation, with real, existing users, and see how they would interact with the two 5-item
recommendation slots.

Reflect on the tools you used for this capstone (whether spreadsheet, LensKit, or external
ones). Do you feel you had the experience and skill with the tools you needed for the
capstone? Do you feel the tool was a good match for the problem (and if not, what would
you have preferred)? Please identify areas where the tools were particularly helpful or
particularly challenging.

I personally tried to go for the Honors section at the beginning of the specialisation. However,
as my Java background was not so strong, it took me much more time than I had in order to
understand the underlying structure of the programming assignments. Therefore, after one
frustrating week, I decided to move to the non-programming version. The spreadsheets were
very good in my opinion, as they provided the right balance between practicality and
complexity. I wouldn't want to spend hundreds of hours in a spreadsheet application, just
enough to learn the concepts. This was exactly the case, as I was able afterwards to migrate
all the spreadsheet work to Python Jupyter notebooks, my actual programming language.
There, I was able to generalise the learnt concepts and program the recommendations for all
users, instead of just the few test users as in the spreadsheet.

Finally, reflect on the capstone experience as a whole. Did it achieve its goal of giving you
one project to bring together the diverse set of materials you learned in this
specialization? Do you feel more capable of (or confident in your ability for) taking on
applications of recommender systems? Other lessons you’re willing to share?

No lessons to share, just that it was a really great adventure :)
