
Advanced AI and ML 21AI71

Module 4 - Recommender System

4.1 OVERVIEW
• Recommendation systems are a set of algorithms that recommend the most relevant items to users based on their predicted preferences.
• They act on behavioral data, such as a customer's previous purchases, ratings, or reviews, to predict the likelihood of the customer buying a new product or service.
• Amazon's "Customers who buy this item also bought" and Netflix's "shows and movies you may want to watch" are examples of recommendation systems.
• Recommender systems are widely used for recommending products such as movies, music, news, books, articles, and groceries, and act as a backbone for cross-selling across industries.

4.1.1 Datasets
To explore the algorithms and build recommendations, we will use the following two publicly available datasets.
1. groceries.csv: This dataset contains transactions of a grocery store and can be
downloaded from
http://cox.csueastbay.edu/~esuess/stat452/

2. MovieLens: This dataset contains 20,000,263 ratings and 465,564 tag applications across 27,278 movies. As per the source of the data, these data were created by 138,493 users between January 09, 1995 and March 31, 2015. The dataset was generated on October 17, 2016. Users were selected at random for inclusion, and all selected users had rated at least 20 movies. The dataset can be downloaded from
https://grouplens.org/datasets/movielens/

1 Deepak D, Asst. Prof., Dept. of AIML, Canara Engineering College, Mangaluru



4.2 ASSOCIATION RULES (ASSOCIATION RULE MINING)

• Association rule mining finds combinations of items that frequently occur together in orders or baskets (in a retail context).
• The items that frequently occur together are called itemsets. Itemsets help to discover relationships between items that people buy together, which can be used as a basis for strategies such as bundling products as combo offers or placing products next to each other on retail shelves to attract customer attention.
• An application of association rule mining is Market Basket Analysis (MBA), a technique used mostly by retailers to find associations between items purchased by customers.

To illustrate the association rule mining concept, let us consider a set of baskets and the items
in those baskets purchased by customers, as depicted in Figure 9.1.

Items purchased in different baskets are:


1. Basket 1: egg, beer, sugar, bread, diaper
2. Basket 2: egg, beer, cereal, bread, diaper
3. Basket 3: milk, beer, bread
4. Basket 4: cereal, diaper, bread

• The primary objective of a recommender system is to predict items that a customer may purchase in the future based on his/her purchases so far.
• In future, if a customer buys beer, can we predict what he/she is most likely to buy along with beer? To predict this, we need to find out which items have shown a strong association with beer in previously purchased baskets. The association rule mining technique can be used to find this out.


• Association rule mining considers all possible combinations of items in the previous baskets and computes measures such as support, confidence, and lift to identify rules with stronger associations.
• One of the challenges in association rule mining is the number of combinations of items that need to be considered; as the number of unique items sold by the seller increases, the number of associations can increase exponentially.
• One solution to this problem is to eliminate items that cannot possibly be part of any frequent itemset. One algorithm that does this for association rules is the Apriori algorithm.
• The Apriori algorithm was proposed by Agrawal and Srikant (1994).
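
The pruning idea above can be sketched in a few lines of plain Python. This is a minimal illustration on the four example baskets, not the implementation used by any library: the function names `support` and `apriori_sketch` and the 0.5 threshold are assumptions chosen for the example. Larger itemsets are built only from smaller itemsets that already meet the minimum support (the fraction of baskets containing the itemset).

```python
from itertools import combinations

# The four example baskets from the text.
baskets = [
    {"egg", "beer", "sugar", "bread", "diaper"},
    {"egg", "beer", "cereal", "bread", "diaper"},
    {"milk", "beer", "bread"},
    {"cereal", "diaper", "bread"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def apriori_sketch(baskets, min_support):
    """Level-wise search: grow itemsets only from frequent smaller ones."""
    items = sorted(set().union(*baskets))
    # Level 1: keep single items that meet the support threshold.
    current = [frozenset([i]) for i in items
               if support(frozenset([i]), baskets) >= min_support]
    frequent = {}
    k = 1
    while current:
        for itemset in current:
            frequent[itemset] = support(itemset, baskets)
        k += 1
        # Candidates of size k are unions of two frequent (k-1)-itemsets.
        candidates = {a | b for a in current for b in current
                      if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent
                             for s in combinations(c, k - 1))}
        current = [c for c in candidates
                   if support(c, baskets) >= min_support]
    return frequent


frequent_itemsets = apriori_sketch(baskets, min_support=0.5)
for itemset, supp in sorted(frequent_itemsets.items(),
                            key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), supp)
```

Because sugar and milk each appear in only one basket, they are dropped at the first level and no larger itemset containing them is ever counted.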
The rules generated are represented as

$$\{\text{diaper}\} \rightarrow \{\text{beer}\}$$

which means that customers who purchased diapers also purchased beer in the same
basket. {diaper, beer} together is called an itemset. {diaper} is called the antecedent and
{beer} is called the consequent.

Both antecedents and consequents can have multiple items. For example, a rule of the form
{egg, diaper} → {beer, bread} is also a valid rule.

4.2.1 Metrics
Concepts such as support, confidence, and lift are used to generate association rules.
1. Support
• Support is the fraction of baskets (among all baskets being considered, or in a sample) in which the items appear together.
• For example, the support for (beer, diaper) is 2/4 (based on the data shown in Figure 9.1), that is, 50%, as the two items appear together in 2 baskets out of 4.


2. Confidence
• Confidence measures the proportion of the transactions containing X that also contain Y. X is called the antecedent and Y is called the consequent.
• Confidence can be calculated using the following formula:

$$\text{Confidence}(X \rightarrow Y) = P(Y \mid X) = \frac{\text{Support}(X, Y)}{\text{Support}(X)}$$

where P(Y|X) is the conditional probability of Y given X.

3. Lift
Lift is calculated using the following formula:

$$\text{Lift}(X \rightarrow Y) = \frac{\text{Support}(X, Y)}{\text{Support}(X) \times \text{Support}(Y)}$$

• Lift can be interpreted as the degree of association between two items.
• A lift value of 1 indicates that the items are independent (no association). A lift value of less than 1 implies that the products are substitutes (purchasing one product decreases the probability of purchasing the other), and a lift value greater than 1 indicates that purchasing Product X increases the probability of purchasing Product Y.
• A lift value greater than 1 is a necessary condition for generating association rules.
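
As a worked check of the three metrics, the following sketch computes support, confidence, and lift for the four example baskets. The helper names `support`, `confidence`, and `lift` are chosen for this illustration only.

```python
# The four example baskets from the text.
baskets = [
    {"egg", "beer", "sugar", "bread", "diaper"},
    {"egg", "beer", "cereal", "bread", "diaper"},
    {"milk", "beer", "bread"},
    {"cereal", "diaper", "bread"},
]


def support(items):
    """Fraction of baskets containing all of `items`."""
    return sum(set(items) <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent):
    """Support(X, Y) / Support(X), i.e. P(Y | X)."""
    both = set(antecedent) | set(consequent)
    return support(both) / support(antecedent)


def lift(antecedent, consequent):
    """Support(X, Y) / (Support(X) * Support(Y))."""
    both = set(antecedent) | set(consequent)
    return support(both) / (support(antecedent) * support(consequent))


print("support(beer, diaper)      =", support({"beer", "diaper"}))
print("confidence(diaper -> beer) =", round(confidence({"diaper"}, {"beer"}), 3))
print("lift(diaper -> beer)       =", round(lift({"diaper"}, {"beer"}), 3))
```

In this four-basket sample the lift of {diaper} → {beer} comes out below 1 (about 0.89) even though the two items share two baskets, because each item on its own already appears in three of the four baskets.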

4.2.2 Applying Association Rules


This section applies association rules to the transaction data in groceries.csv. This
involves loading, encoding, and analysing the transaction data to uncover patterns and
associations in customer purchasing behavior.

1. Loading the Transactions

Each line of the file is read as one transaction and split on commas into a list of items.

all_txns = []
with open('groceries.csv') as f:
    content = f.readlines()
    txns = [x.strip() for x in content]  # Remove surrounding whitespace
    for each_txn in txns:
        # Each transaction becomes a list of item names
        all_txns.append(each_txn.split(','))


2. Encoding the Transactions


Convert the list of transactions into a one-hot-encoded matrix for easier rule generation.
Library: mlxtend provides OnehotTransactions for this purpose.

import pandas as pd
from mlxtend.preprocessing import OnehotTransactions
# Note: newer mlxtend releases provide TransactionEncoder in place of
# OnehotTransactions, with the same fit/transform usage.

one_hot_encoding = OnehotTransactions()
one_hot_txns = one_hot_encoding.fit(all_txns).transform(all_txns)
one_hot_txns_df = pd.DataFrame(one_hot_txns, columns=one_hot_encoding.columns_)

Matrix Structure: Rows represent transactions; columns represent items, with 1 for purchased
items and 0 otherwise.
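
To make the matrix structure concrete, here is a dependency-free sketch that builds the same kind of one-hot matrix for the four example baskets. It only illustrates the layout and is not the mlxtend encoder itself; the variable names are assumptions for this example.

```python
# The four example baskets from the text, as lists of item names.
baskets = [
    ["egg", "beer", "sugar", "bread", "diaper"],
    ["egg", "beer", "cereal", "bread", "diaper"],
    ["milk", "beer", "bread"],
    ["cereal", "diaper", "bread"],
]

# One column per distinct item, in sorted order.
items = sorted({item for basket in baskets for item in basket})

# One row per transaction: 1 if the item was purchased, 0 otherwise.
one_hot_rows = [[1 if item in basket else 0 for item in items]
                for basket in baskets]

print(items)
for row in one_hot_rows:
    print(row)
```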

3. Generating Association Rules


Use the Apriori algorithm to find frequent itemsets with a specified minimum support
threshold.
The apriori function takes the following parameters:
1. df: a pandas DataFrame in one-hot-encoded format.
2. min_support: a float between 0 and 1 giving the minimum support of the itemsets returned. Default is 0.5.
3. use_colnames: a boolean. If True, the returned DataFrame uses the DataFrame's column names instead of column indices.


We will use a minimum support of 0.02, that is, an itemset must appear in at least 2% of
all transactions.

from mlxtend.frequent_patterns import apriori

frequent_itemsets = apriori(one_hot_txns_df, min_support=0.02, use_colnames=True)

frequent_itemsets.sample(10, random_state=90)

4. Creating Association Rules


Use association_rules to generate rules from frequent itemsets, with lift as the evaluation
metric.
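
The rule-generation step can be sketched without any library as follows. This is an illustration on the four example baskets under stated assumptions: `generate_rules`, the 0.5 support threshold, and the brute-force search for frequent itemsets are all choices made for this example. It is not mlxtend's implementation.

```python
from itertools import combinations

# The four example baskets from the text.
baskets = [
    {"egg", "beer", "sugar", "bread", "diaper"},
    {"egg", "beer", "cereal", "bread", "diaper"},
    {"milk", "beer", "bread"},
    {"cereal", "diaper", "bread"},
]


def support(itemset):
    return sum(itemset <= basket for basket in baskets) / len(baskets)


# Frequent itemsets by brute force (fine for a toy example; Apriori
# exists precisely to avoid enumerating every combination).
items = sorted(set().union(*baskets))
frequent = {}
for size in range(1, len(items) + 1):
    for combo in combinations(items, size):
        s = support(frozenset(combo))
        if s >= 0.5:
            frequent[frozenset(combo)] = s


def generate_rules(frequent, min_lift=1.0):
    """Split each frequent itemset into antecedent -> consequent pairs
    and keep the rules whose lift meets the threshold."""
    rules = []
    for itemset, supp in frequent.items():
        for size in range(1, len(itemset)):
            for combo in combinations(sorted(itemset), size):
                antecedent = frozenset(combo)
                consequent = itemset - antecedent
                conf = supp / frequent[antecedent]
                lift = conf / frequent[consequent]
                if lift >= min_lift:
                    rules.append({"antecedent": antecedent,
                                  "consequent": consequent,
                                  "support": supp,
                                  "confidence": conf,
                                  "lift": lift})
    return rules


# With mlxtend, this step is typically a single call such as
#   rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
# (assumed usage; check the signature of the installed version).
rules = generate_rules(frequent, min_lift=1.0)

# Top 10 rules by confidence, highest first.
top_rules = sorted(rules, key=lambda r: r["confidence"], reverse=True)[:10]
for r in top_rules:
    print(sorted(r["antecedent"]), "->", sorted(r["consequent"]),
          round(r["confidence"], 2), round(r["lift"], 2))
```

The lookups `frequent[antecedent]` and `frequent[consequent]` are safe because every subset of a frequent itemset is itself frequent.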


[Table: the generated association rules]

Let us look at the top 10 association rules sorted by confidence. The rules stored in the
variable rules are sorted by confidence in descending order.

From Table 9.4, we can infer that the probability that a customer buys (whole milk), given
he/she has bought (yogurt, other vegetables), is 0.51.


4.3 COLLABORATIVE FILTERING


• Collaborative filtering is based on the notion of similarity (or distance).
• For example, if two users A and B have purchased the same products and have rated them similarly on a common rating scale, then A and B can be considered similar in their buying and preference behavior.
• Hence, if A buys a new product and rates it highly, that product can be recommended to B. Alternatively, the products that A has already bought and rated highly can be recommended to B, if B has not already bought them.

4.3.1 How to Find Similarity between Users


• Similarity or distance between users can be computed using the ratings the users have given to the items they have in common.
• If the users are similar, then similarity measures such as the Jaccard coefficient and cosine similarity will have a value closer to 1, and distance measures such as Euclidean distance will have a low value.
• Example: The picture in Figure 9.2 depicts three users, Rahul, Purvi, and Gaurav, and the books they have bought and rated.


The users are represented using their rating on the Euclidean space in Figure 9.3. Here the
dimensions are represented by the two books Into Thin Air and Missoula, which are the two
books commonly bought by Rahul, Purvi, and Gaurav.

Figure 9.3 shows that Rahul’s preferences are similar to Purvi’s rather than to Gaurav’s. So,
the other book, Into the Wild, which Rahul has bought and rated high, can now be
recommended to Purvi.
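
The three measures mentioned above can be sketched in plain Python. The ratings below are hypothetical values invented for this illustration; they are NOT the ratings shown in Figure 9.2, and the function names are assumptions. Euclidean distance and cosine similarity are computed over the books both users rated, and the Jaccard coefficient over the sets of books each user bought.

```python
import math

# Hypothetical ratings on a 1-5 scale (not the values from Figure 9.2).
ratings = {
    "Rahul":  {"Into Thin Air": 4, "Missoula": 3, "Into the Wild": 5},
    "Purvi":  {"Into Thin Air": 5, "Missoula": 3},
    "Gaurav": {"Into Thin Air": 1, "Missoula": 5},
}


def common_items(u, v):
    """Books rated by both users, in a fixed order."""
    return sorted(set(ratings[u]) & set(ratings[v]))


def euclidean(u, v):
    """Euclidean distance over commonly rated books (lower = more similar)."""
    return math.sqrt(sum((ratings[u][i] - ratings[v][i]) ** 2
                         for i in common_items(u, v)))


def cosine(u, v):
    """Cosine similarity over commonly rated books (closer to 1 = more similar)."""
    shared = common_items(u, v)
    dot = sum(ratings[u][i] * ratings[v][i] for i in shared)
    norm_u = math.sqrt(sum(ratings[u][i] ** 2 for i in shared))
    norm_v = math.sqrt(sum(ratings[v][i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)


def jaccard(u, v):
    """Shared books divided by all books bought by either user."""
    a, b = set(ratings[u]), set(ratings[v])
    return len(a & b) / len(a | b)


for other in ("Purvi", "Gaurav"):
    print(other,
          round(euclidean("Rahul", other), 3),
          round(cosine("Rahul", other), 3),
          round(jaccard("Rahul", other), 3))
```

With these made-up numbers Rahul is closer to Purvi than to Gaurav on both Euclidean distance and cosine similarity, matching the situation described in the text. The Jaccard coefficient ignores the rating values, so here it cannot distinguish Purvi from Gaurav.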

4.3.2 User-Based Similarity


This approach recommends items to a user based on the preferences of similar users. If two
users have rated the same items similarly, they’re considered similar. Therefore, items liked
by one user can be recommended to the other. This similarity is often computed using metrics
like cosine similarity, Pearson correlation, or Jaccard coefficient.

• We will use MovieLens dataset for finding similar users based on common movies the
users have watched and how they have rated those movies.
• The file ratings.csv in the dataset contains ratings given by users. Each line in this file
represents a rating given by a user to a movie.
• Ratings are on a 5-star scale in half-star increments (0.5 to 5.0). The dataset has the following features:
1. userId
2. movieId
3. rating
4. timestamp


Example Using the MovieLens Dataset


In this example, we use the MovieLens dataset, which provides movie ratings by users, with
each rating recorded in a CSV file. The following steps outline how to perform collaborative
filtering using user-based similarity:

1. Data Preparation:
• Load the dataset and drop unnecessary columns, such as the timestamp.
• Create a pivot table (matrix) in which rows represent users, columns represent movies, and values are the ratings the users have given to those movies.
• The pivot table is sparse: movies that a user has not watched and rated yet are represented as NaN. These NaNs are then filled with 0s to facilitate similarity calculations.
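
The pivot step can be illustrated with a small dependency-free sketch. The rating rows below are hypothetical stand-ins for ratings.csv with the timestamp already dropped, and the variable names are assumptions for this example.

```python
# Hypothetical (userId, movieId, rating) rows standing in for ratings.csv.
ratings = [
    (1, 10, 4.0),
    (1, 20, 3.5),
    (2, 10, 5.0),
    (2, 30, 2.0),
    (3, 20, 4.5),
]

user_ids = sorted({user for user, _, _ in ratings})
movie_ids = sorted({movie for _, movie, _ in ratings})
lookup = {(user, movie): rating for user, movie, rating in ratings}

# Rows are users, columns are movies; unrated movies are filled with 0.
# With pandas, the same step is commonly written as
#   ratings_df.pivot(index="userId", columns="movieId", values="rating").fillna(0)
# (assumed usage, where ratings_df is a hypothetical DataFrame of the CSV).
user_movies = {user: [lookup.get((user, movie), 0.0) for movie in movie_ids]
               for user in user_ids}

print("movies:", movie_ids)
for user, row in user_movies.items():
    print(user, row)
```

Filling with 0 treats "not rated" the same as a rating of 0, which is a simplification made so that every user has a vector of the same length.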


2. Calculating Cosine Similarity between Users


• Each row in user_movies_df represents a user. Computing the similarity between rows therefore gives the similarity between those users.
• sklearn.metrics.pairwise_distances can be used to compute the distance between all pairs of users.
• pairwise_distances() takes a metric parameter that selects the distance measure. With metric='cosine' it returns the cosine distance, so the cosine similarity is obtained as 1 minus that distance. A cosine similarity closer to 1 means the users are very similar, and a value closer to 0 means the users are very dissimilar.
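
To show what the user-to-user similarity matrix contains, here is a plain-Python sketch of cosine similarity between rating rows. The three rating vectors are hypothetical, and `cosine_similarity` is written out by hand for this illustration rather than taken from scikit-learn.

```python
import math

# Hypothetical user -> ratings rows (one value per movie, 0 = not rated).
user_movies = {
    1: [4.0, 3.5, 0.0],
    2: [5.0, 0.0, 2.0],
    3: [0.0, 4.5, 0.0],
}


def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


user_ids = sorted(user_movies)

# With scikit-learn, the equivalent matrix is commonly obtained as
#   1 - pairwise_distances(user_movies_df.values, metric="cosine")
# because metric="cosine" yields a distance, not a similarity.
user_sim = {u: {v: cosine_similarity(user_movies[u], user_movies[v])
                for v in user_ids}
            for u in user_ids}

for u in user_ids:
    print(u, [round(user_sim[u][v], 3) for v in user_ids])
```

Each user has similarity 1 with themselves, and users 2 and 3 have similarity 0 here because they have no rated movie in common.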

3. Finding Similar Users:


• For each user, the other user with the highest similarity score is identified. The user's similarity with himself/herself (always 1) is excluded first.
• For instance, if user 338 is most similar to user 2 with a cosine similarity score of 0.58, then among all users, user 338's ratings are the most closely aligned with those of user 2.
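
A minimal sketch of this lookup is shown below. The similarity scores are hypothetical values for three users, and `most_similar_user` is a helper name invented for this example.

```python
# Hypothetical user-to-user cosine similarities.
user_sim = {
    1: {1: 1.0, 2: 0.70, 3: 0.66},
    2: {1: 0.70, 2: 1.0, 3: 0.0},
    3: {1: 0.66, 2: 0.0, 3: 1.0},
}


def most_similar_user(user_id, user_sim):
    """Return (other_user, score) with the highest similarity to user_id,
    ignoring the user's similarity with themselves."""
    candidates = {other: score
                  for other, score in user_sim[user_id].items()
                  if other != user_id}
    best = max(candidates, key=candidates.get)
    return best, candidates[best]


most_similar = {user: most_similar_user(user, user_sim) for user in user_sim}
for user, (other, score) in most_similar.items():
    print(f"user {user} is most similar to user {other} (score {score})")
```

With a pandas similarity matrix, a common way to do the same thing is to set the diagonal to 0 (for example with numpy's fill_diagonal) and then take idxmax along each row.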
