
Module 3- Advanced Machine Learning (18AI72)

Module 3

Chapter 9: Recommender Systems
9.1 | Overview
9.1.1 | Datasets
9.2 | Association Rules (Association Rule Mining)
9.3 | Collaborative Filtering
9.4 | Using Surprise Library
9.5 | Matrix Factorization

Chapter 10: Text Analytics
10.1 | Overview
10.2 | Sentiment Classification
10.3 | Naïve-Bayes Model for Sentiment Classification
10.4 | Using TF-IDF Vectorizer
10.5 | Challenges of Text Analytics

Chapter 9: Recommender Systems


9.1 | OVERVIEW
Q. Define Recommendation systems
 Marketing is about connecting the best products or best services to the right customers.
 In today’s digital world, personalization is essential for meeting customers’ needs more
effectively, thereby increasing customer satisfaction and the likelihood of repeat purchases.
 Recommendation systems are a set of algorithms that recommend the most relevant items to
users based on their predicted preferences.
 Recommendation systems act on behavioral data, such as a customer’s previous purchases,
ratings, or reviews, to predict the likelihood of buying a new product or service.

Amazon’s “Customers who buy this item also bought” and Netflix’s “shows and movies you may
want to watch” are examples of recommendation systems. Recommender systems are very popular
for recommending products such as movies, music, news, books, articles, and groceries, and they act
as a backbone for cross-selling across industries.
Three algorithms are widely used for building recommendation systems:
1. Association Rules
2. Collaborative Filtering
3. Matrix Factorization

9.1.1 | Datasets
To explore the algorithms listed above, we will use the following two publicly available datasets to
build recommendations.
1. groceries.csv: This dataset contains transactions of a grocery store and can be downloaded from

http://www.sci.csueastbay.edu/~esuess/classes/Statistics_6620/Presentations/ml13/groceries.csv.
2. MovieLens: This dataset contains 20000263 ratings and 465564 tag applications across 27278
movies. As per the source of data, these data were created by 138493 users between January 09, 1995
and March 31, 2015. This dataset was generated on October 17, 2016. Users were selected and
included randomly. All selected users had rated at least 20 movies. The dataset can be downloaded
from the link https://grouplens.org/datasets/movielens/.

9.2 | ASSOCIATION RULES (ASSOCIATION RULE MINING)


Q. Explain Association Rules with metrics support, confidence, and lift
Association rule mining finds combinations of items that frequently occur together in orders or baskets
(in a retail context). The items that frequently occur together are called itemsets. Itemsets help to
discover relationships between items that people buy together, and this can be used as a basis for
strategies such as combining products into a combo offer or placing products next to each other on
retail shelves to attract customer attention.
An application of association rule mining is in Market Basket Analysis (MBA). MBA is a technique
used mostly by retailers to find associations between items purchased by customers.
To illustrate the association rule mining concept, let us consider a set of baskets and the items in those
baskets purchased by customers as depicted in Figure 9.1.
Items purchased in different baskets are:
1. Basket 1: egg, beer, sugar, bread, diaper
2. Basket 2: egg, beer, cereal, bread, diaper
3. Basket 3: milk, beer, bread
4. Basket 4: cereal, diaper, bread

The primary objective of a recommender system is to predict items that a customer may purchase in
the future based on his/her purchases so far. In future, if a customer buys beer, can we predict what

he/she is most likely to buy along with beer? To predict this, we need to find out which items have
shown a strong association with beer in previously purchased baskets. We can use association rule
mining technique to find this out.
Association rule mining considers all possible combinations of items in the previous baskets and
computes measures such as support, confidence, and lift to identify rules with stronger associations.
One of the challenges in association rule mining is the sheer number of item combinations that need
to be considered. Association rule mining uses the Apriori algorithm, which eliminates items that
cannot possibly be part of any frequent itemset. The rules generated are represented as

{diaper} → {beer}

This means that customers who purchased diapers also purchased beer in the same basket. {diaper,
beer} together is called an itemset. {diaper} is called the antecedent and {beer} is called the
consequent. Both antecedents and consequents can have multiple items, e.g. {diaper, milk} → {beer,
bread} is also a valid rule. Each rule is measured with a set of metrics.

9.2.1 | Metrics
Concepts such as support, confidence, and lift are used to generate association rules. These concepts
are explained below.
9.2.1.1 Support
Support indicates how frequently the items appear together in baskets with respect to all possible
baskets being considered (or in a sample). For example, the support for (beer, diaper) will be 2/4
(based on the data shown in Figure 9.1), that is, 50%, as the two items appear together in 2 out of 4
baskets.
Assume that X and Y are items being considered. Let
1. N be the total number of baskets.
2. NXY represent the number of baskets in which X and Y appear together.
3. NX represent the number of baskets in which X appears.
4. NY represent the number of baskets in which Y appears.
Then the support between X and Y, Support(X, Y), is given by

Support(X, Y) = NXY / N          ... (9.1)

Apriori algorithm uses minimum support criteria to reduce the number of possible itemset
combinations, which in turn reduces computational requirements.

If minimum support is set at 0.01, an association between X and Y will be considered if and only if
both X and Y have minimum support of 0.01.
Hence, apriori algorithm computes support for each item independently and eliminates items with
support less than minimum support. The support of each individual item can be calculated using Eq.
(9.1).

9.2.1.2 Confidence
Confidence measures the proportion of the transactions that contain X, which also contain Y. X is
called antecedent and Y is called consequent. Confidence can be calculated using the following
formula:

Confidence(X → Y) = Support(X, Y) / Support(X) = P(Y | X)

where P(Y | X) is the conditional probability of Y given X.

9.2.1.3 Lift
Lift is calculated using the following formula:

Lift(X → Y) = Support(X, Y) / (Support(X) × Support(Y)) = Confidence(X → Y) / Support(Y)

Lift can be interpreted as the degree of association between two items.


1. A lift value of 1 indicates that the items are independent (no association).
2. A lift value of less than 1 implies that the products are substitutes (purchasing one product
decreases the probability of purchasing the other).
3. A lift value of greater than 1 indicates that purchase of product X increases the probability of
purchase of product Y. A lift value of greater than 1 is a necessary condition for generating
association rules.

9.2.2 | Applying Association Rules


Q. Write the python implementation to generate the Association Rules on the “groceries.csv
dataset”
To create association rules, we use the transactions data available in the groceries.csv dataset. Each
line in the dataset is an order and contains a variable number of items. Each item in each order is
separated by a comma in the dataset.
9.2.2.1 Loading the Dataset
Python’s open() method can be used to open the file and readlines() to read each line. The following

code block can be used for loading and reading the data:
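A minimal sketch of this loading step is shown below; the file name groceries.csv and its location are assumptions, so adjust the path to wherever the dataset has been saved.

all_txns = []

# Read the file line by line; each line is one transaction with comma-separated items
with open('groceries.csv') as f:
    for line in f.readlines():
        all_txns.append(line.strip().split(','))

# Print the first five transactions
print(all_txns[0:5])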

In the end, the variable all_txns will contain a list of orders and list of items in each order. An order
is also called a transaction. We print the first five transactions.

9.2.2.2 Encoding the Transactions


Python library mlxtend provides methods to generate association rules from a list of transactions.
These methods require the data to be fed in a specific format: the transactions and items need to be
converted into a tabular or matrix format.
Each row represents a transaction and each column represents an item. So, the matrix size will be of
M × N, where M represents the total number of transactions and N represents all unique items
available across all transactions (or the number of items sold by the seller).
The items available in each transaction will be represented in one-hot-encoded format, that is, the
item is encoded 1 if it exists in the transaction or 0 otherwise. The mlxtend library has a feature pre-
processing implementation class called OnehotTransactions that will take all_txns as an input and
convert the transactions and items into one-hot-encoded format.
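A minimal encoding sketch follows, building on the all_txns list created earlier. In recent mlxtend versions this pre-processing step is exposed as TransactionEncoder (OnehotTransactions is the older name for the same functionality); the DataFrame name all_txns_df is an assumption.

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder   # older mlxtend releases call this OnehotTransactions

encoder = TransactionEncoder()
onehot_txns = encoder.fit_transform(all_txns)           # rows = transactions, columns = items

all_txns_df = pd.DataFrame(onehot_txns, columns=encoder.columns_)
all_txns_df.head(10)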

Output:

It can be noticed from Table 9.1 that transaction with index 5 contains an item called butter (item
purchased by the customer) and transaction with index 7 contains an item bottled beer. All other
entries in Table 9.1 are 0, which implies that these items were not purchased.
9.2.2.3 Generating Association Rules
We will use the apriori algorithm to generate itemsets. The total number of itemsets depends on the
number of unique items that exist across all transactions; there are 171 items in the dataset.
For itemsets containing 2 items each, the total number of possible itemsets will be 171C2, that is,
14535 itemsets. This is a very large number and computationally intensive to evaluate. To limit
the number of generated rules, we will apply minimum support value of 0.02. All items that do not
have the minimum support will be removed from the possible itemset combinations.
Apriori algorithm takes the following parameters:
1. df: pandas − DataFrame in a one-hot-encoded format.
2. min_support: float − A float between 0 and 1 for minimum support of the itemsets returned.
Default is 0.5.
3. use_colnames: boolean − If true, uses the DataFrames’ column names in the returned DataFrame
instead of column indices.
We will be using a minimum support of 0.02, that is, the itemset is available in at least 2% of all

transactions. The following commands can be used for setting minimum support.
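A minimal sketch of this step, assuming the one-hot-encoded DataFrame all_txns_df created above:

from mlxtend.frequent_patterns import apriori

# Keep only itemsets that appear in at least 2% of all transactions
frequent_itemsets = apriori(all_txns_df, min_support=0.02, use_colnames=True)
frequent_itemsets.sort_values('support', ascending=False).head(10)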

Output:

The apriori algorithm retains only those frequent itemsets that have a minimum support greater than
2%. From the above output we can infer that whole milk and yogurt appear together in about 5.6% of
the baskets. These itemsets can be passed to association_rules() for generating rules and
corresponding metrics. association_rules() takes the following parameters:
1. df: pandas − DataFrame of frequent itemsets with columns [‘support’, ‘itemsets’].
2. metric − Metric used to evaluate if a rule is of interest; here we use ‘confidence’ and ‘lift’.
Default is ‘confidence’.
3. min_threshold − Minimal threshold for the evaluation metric to decide whether a candidate rule
is of interest.
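A minimal sketch of the rule-generation call; the confidence threshold of 0.3 is an assumption chosen only for illustration.

from mlxtend.frequent_patterns import association_rules

rules = association_rules(frequent_itemsets, metric='confidence', min_threshold=0.3)
rules.head(10)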

Output:

9.2.2.4 Top Ten Rules


Let us look at the top 10 association rules sorted by confidence. The rules stored in the variable rules
are sorted by confidence in descending order.
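A minimal sketch, assuming the rules DataFrame generated above:

# Top 10 rules by confidence
rules.sort_values('confidence', ascending=False).head(10)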

Output:

From the above output, we can infer that the probability that a customer buys (whole milk), given that
he/she has bought (yogurt, other vegetables), is 0.51. These rules can now be used to create strategies
such as keeping the items together on store shelves or cross-selling.

9.2.2.5 Pros and Cons of Association Rule Mining


Q. Write the Pros and Cons of Association Rule Mining
The following are advantages of using association rules:
1. Transactions data, which is used for generating rules, is always available and mostly
clean.
2. The rules generated are simple and can be interpreted.
However, association rules do not take the preferences or ratings given by customers into account,
which is important information for generating recommendations. If customers have bought two items
but disliked one of them, then the association should not be considered. Collaborative filtering takes
both what customers bought and how they liked (rated) the items into consideration before
recommending. Association rule mining is used across several use cases including product
recommendations, fraud detection from transaction sequences, medical diagnosis, weather prediction,
etc.

9.3 | COLLABORATIVE FILTERING


Q. Explain collaborative filtering
Collaborative filtering is based on the concept of similarity (or distance). For example, if two users
A and B have purchased the same products and have rated them similarly on a common rating scale,
then A and B can be considered similar in their buying and preference behavior. Hence, if A buys a
new product and rates it highly, that product can be recommended to B. Alternatively, the products
that A has already bought and rated highly can be recommended to B, if B has not already bought them.
9.3.1 | How to Find Similarity between Users?
Similarity or distance between users can be computed using the ratings the users have given to the
common items they purchased. If the users are similar, then similarity measures such as the Jaccard
coefficient and cosine similarity will have values closer to 1, and distance measures such as Euclidean
distance will have low values.
The most widely used distances or similarities are Euclidean distance, Jaccard coefficient, cosine
similarity, and Pearson correlation.
Let us consider collaborative filtering technique using the example described below. The picture in
Figure 9.2 depicts three users Rahul, Purvi, and Gaurav and the books they have bought and rated.

The users are represented using their rating on the Euclidean space in Figure 9.3. Here the dimensions
are represented by the two books Into Thin Air and Missoula, which are the two books commonly
bought by Rahul, Purvi, and Gaurav.

Figure 9.3 shows that Rahul’s preferences are similar to Purvi’s rather than to Gaurav’s. So, the other
book, Into the Wild, which Rahul has bought and rated highly, can now be recommended to Purvi.
Collaborative filtering comes in two variations:
1. User-Based Similarity: Finds K similar users, based on common items they have bought.
2. Item-Based Similarity: Finds K similar items, based on common users who have bought
those items. Both algorithms are similar to K-Nearest Neighbors (KNN).

9.3.2 | User-Based Similarity


Q. With python implementation, explain User-Based Similarity on MovieLens dataset
Here MovieLens dataset (see https://grouplens.org/datasets/movielens/) is used for finding similar
users based on common movies the users have watched and how they have rated those movies. The
file ratings.csv in the dataset contains ratings given by users. Each line in this file represents a rating
given by a user to a movie. The ratings are on the scale of 1 to 5. The dataset has the following
features:
1. userId
2. movieId
3. rating
4. timestamp

9.3.2.1 Loading the Dataset


Load the file onto a DataFrame using pandas’ read_csv() method and print the first five records
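A minimal sketch, assuming ratings.csv is available in the working directory:

import pandas as pd

rating_df = pd.read_csv('ratings.csv')
rating_df.head(5)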

Output:

The timestamp column will not be used in this example, so it can be dropped from the dataframe.

The number of unique users in the dataset can be found using method unique() on userId column.

Similarly, the number of unique movies in the dataset is,
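A minimal sketch of these steps, building on the rating_df DataFrame loaded above:

# Drop the timestamp column, which is not used in this example
rating_df.drop('timestamp', axis=1, inplace=True)

# Number of unique users and movies in the dataset
print(len(rating_df.userId.unique()))    # 671 users
print(len(rating_df.movieId.unique()))   # 9066 movies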

Now we need to create a pivot table or matrix and represent users as rows and movies as columns.
The values of the matrix will be the ratings the users have given to those movies. As there are 671
users and 9066 movies, we will have a matrix of size 671 × 9066. The matrix will be very sparse as
very few cells will be filled with the ratings using only those movies that users have watched.
Those movies that the users have not watched and rated yet, will be represented as NaN. Pandas
DataFrame has pivot method which takes the following three parameters:
1. index: Column value to be used as DataFrame’s index. So, it will be userId column of
rating_df.
2. columns: Column values to be used as DataFrame’s columns. So, it will be movieId column
of rating_df.
3. values: Column to use for populating DataFrame’s values. So, it will be rating column of
rating_df.

The DataFrame contains NaN for those entries where a user has not rated (or not watched) a movie.
We can impute those NaNs with 0 using the following code. The results are shown.
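A minimal sketch of the pivot and imputation steps:

# Rows: users, columns: movies, values: ratings
user_movies_df = rating_df.pivot(index='userId', columns='movieId', values='rating')

# Movies a user has not rated become NaN; impute them with 0
user_movies_df.fillna(0, inplace=True)
user_movies_df.head()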

9.3.2.2 Calculating Cosine Similarity between Users


Each row in user_movies_df represents a user. If we compute the similarity between rows, it will
represent the similarity between those users. sklearn.metrics.pairwise_distances can be used to
compute distance between all pairs of users. pairwise_distances() takes a metric parameter for what
distance measure to use. We will be using cosine similarity for finding similarity.
Cosine similarity closer to 1 means users are very similar and closer to 0 means users are very
dissimilar. The following code can be used for calculating the similarity.
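A minimal sketch is shown below. Note that sklearn's pairwise_distances() returns cosine distance, so the similarity is taken as 1 minus the distance.

import pandas as pd
from sklearn.metrics import pairwise_distances

# Cosine similarity between every pair of users (1 - cosine distance)
user_sim = 1 - pairwise_distances(user_movies_df.values, metric='cosine')
user_sim_df = pd.DataFrame(user_sim)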

We can print the similarity between first 5 users by using the following code. The result is shown.

user_sim_df matrix shape shows that it contains the cosine similarity between all possible pairs of
users. And each cell represents the cosine similarity between two specific users. For example, the
similarity between userid 1 and userid 5 is 0.016818 (Poor similarity).
The diagonal of the matrix shows the similarity of a user with itself (i.e., 1.0), since each user is most
similar to himself or herself. But we need the algorithm to find other users who are similar to a specific
user, so we set all diagonal values to 0.0, which avoids selecting a user as his or her own most similar
user.
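A minimal sketch of this step:

import numpy as np

# Set self-similarity (the diagonal) to 0 so a user is never picked as his/her own nearest neighbour
np.fill_diagonal(user_sim_df.values, 0)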

9.3.2.3 Filtering Similar Users


To find most similar users, the maximum values of each column can be filtered. For example, the
most similar user to first 5 users with userid 1 to 5 can be obtained using the following code:
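A minimal sketch; idxmax() on each column returns the index of the user with the highest similarity:

# Most similar user for each of the first 5 users
user_sim_df.idxmax(axis=0)[0:5]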

The above result shows user 325 is most similar to user 1, user 338 is most similar to user 2, and so
on. To dive a little deeper to understand the similarity, let us print the similarity values between user
2 and users ranging from 331 to 340.

The output shows that the cosine similarity between userid 2 and userid 338 is 0.581528, which is the
highest. But why is user 338 most similar to user 2? This can be explained intuitively if we verify that
the two users have watched several movies in common and rated them very similarly. For this, we
need to read the movies dataset, which contains the movie id along with the movie name.
To find out which movies user 2 and user 338 have watched in common and how they have rated each
one of them, we will filter out movies that both have rated at least 4, to limit the number of movies to
print.
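A minimal sketch of this step; the file name movies.csv and the helper name get_common_movies are assumptions used only for illustration.

movies_df = pd.read_csv('movies.csv')   # contains movieId and title

def get_common_movies(user1, user2, min_rating=4.0):
    # Movies rated at least min_rating by both users, with both ratings side by side
    u1 = rating_df[(rating_df.userId == user1) & (rating_df.rating >= min_rating)]
    u2 = rating_df[(rating_df.userId == user2) & (rating_df.rating >= min_rating)]
    common = u1.merge(u2, on='movieId', suffixes=('_u1', '_u2'))
    return common.merge(movies_df, on='movieId')[['title', 'rating_u1', 'rating_u2']]

get_common_movies(2, 338)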

From the table we can see that users 2 and 338 have watched 7 movies in common and have rated
almost on the same scale. Their preferences seem to be very similar.
Let us check users 2 and 332, whose cosine similarity is 0.002368.

Users 2 and 332 have only one movie in common and have rated very differently. They indeed are
very dissimilar.
9.3.2.6 Challenges with User-Based Similarity
Finding user similarity does not work for new users. We need to wait until the new user buys a few
items and rates them. Only then users with similar preferences can be found and recommendations
can be made based on that. This is called the cold start problem in recommender systems. It can be
overcome by using item-based similarity, which is based on the notion that if two
items have been bought by many users and rated similarly, then there must be some inherent
relationship between these two items. In other terms, in future, if a user buys one of those two items,
he or she will most likely buy the other one.

9.3.3 | Item-Based Similarity


Q. With python implementation, explain Item-Based Similarity on MovieLens dataset
If two movies, movie A and movie B, have been watched by several users and rated very similarly,
then movie A and movie B can be considered similar. In other words, if a user watches movie A, then
he or she is very likely to watch movie B, and vice versa.
9.3.3.1 Calculating Cosine Similarity between Movies
In this approach, we need to create a pivot table, where the rows represent movies, columns represent
users, and the cells in the matrix represent ratings the users have given to the movies. So, the pivot()
method will be called with movieId as index and userId as columns as described below:

Now, the following code is used to print similarity between the first 5 movies. The results are shown.
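A minimal sketch of the item-based pivot and similarity computation, reusing rating_df from the user-based example:

from sklearn.metrics import pairwise_distances

# Rows: movies, columns: users
movies_users_df = rating_df.pivot(index='movieId', columns='userId', values='rating').fillna(0)

# Cosine similarity between every pair of movies (1 - cosine distance)
movie_sim = 1 - pairwise_distances(movies_users_df.values, metric='cosine')
movie_sim_df = pd.DataFrame(movie_sim)

# Similarity between the first 5 movies
movie_sim_df.iloc[0:5, 0:5]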

9.3.3.2 Finding Most Similar Movies


In the following code, we write a method get_similar_movies() which takes a movieid as a parameter
and returns similar movies based on cosine similarity. Note that the movieid and the index of the movie
record in movies_df are not the same. We need to find the index of the movie record from the movieid
and use that to find similarities in the movie_sim_df. It takes another parameter topN to specify how
many similar movies will be returned.
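A minimal sketch of such a method. It assumes the movies_df loaded earlier, re-aligned so that its row order matches the rows of movie_sim_df; the exact helper shown here is only one possible implementation.

# Keep only movies that actually appear in the ratings, in movieId order, so row i of
# movies_df corresponds to row i of movie_sim_df
movies_df = movies_df[movies_df.movieId.isin(rating_df.movieId)].sort_values('movieId').reset_index(drop=True)

def get_similar_movies(movieid, topN=5):
    # Positional index of the movie record inside the similarity matrix
    movie_idx = movies_df[movies_df.movieId == movieid].index[0]
    movies_df['similarity'] = movie_sim_df.iloc[movie_idx]
    return movies_df.sort_values('similarity', ascending=False)[0:topN]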

The above method get_similar_movies() takes movie id as an argument and returns other movies
which are similar to it. Let us find out how the similarities play out by finding out movies which are
similar to the movie Godfather. The movie id for the movie Godfather is 858.
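For example, assuming the sketch above:

get_similar_movies(858)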

Output shows the movies that are similar to Godfather.

It can be observed from Table 9.13 that users who watched ‘Godfather, The’, also watched
‘Godfather: Part II’ the most. This makes absolute sense! It also indicates that the users have watched
Goodfellas (1990), One Flew Over the Cuckoo’s Nest (1975), and Untouchables, The (1987).
So, in future, if any user watches ‘Godfather, The’, the other movies can be recommended to them.

9.5 | MATRIX FACTORIZATION


Q. What is matrix factorization? Explain.
Matrix factorization is a matrix decomposition technique. Matrix decomposition is an approach for
reducing a matrix into its constituent parts. Matrix factorization algorithms decompose the user-item
matrix into the product of two lower dimensional rectangular matrices.
In Figure 9.4, the original matrix contains users as rows, movies as columns, and rating as values.
The matrix can be decomposed into two lower dimensional rectangular matrices.

The Users–Movies matrix contains the ratings of 3 users (U1, U2, U3) for 5 movies (M1 through
M5). This Users–Movies matrix is factorized into a (3, 3) Users–Factors matrix and (3, 5) Factors–
Movies matrix. Multiplying the Users–Factors and Factors–Movies matrix will result in the original
Users–Movies matrix.
The idea behind matrix factorization is that there are latent factors that determine why a user rates a
movie, and the way he/she rates. The factors could be the story or actors or any other specific
attributes of the movies. But we may never know what these factors actually represent. That is why
they are called latent factors. A matrix with size (n, m), where n is the number of users and m is the
number of movies, can be factorized into (n, k) and (k, m) matrices, where k is the number of factors.
The Users–Factors matrix represents that there are three factors and how each user has preferences
towards these factors. Factors–Movies matrix represents the attributes the movies possess.
In the above example, U1 has the highest preference for factor F2, whereas U2 has the highest
preference for factor F3. Similarly, the F2 factor is high in movies M2 and M4. Probably this is the

reason why U1 has given high ratings to movies M2 (4) and M4 (5).
One of the popular techniques for matrix factorization is Singular Value Decomposition (SVD).
The Python Surprise library provides an SVD algorithm, which takes the number of factors (n_factors)
as a parameter.
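A minimal sketch of using Surprise's SVD on the ratings loaded earlier; the choice of 5 factors is an assumption made only for illustration.

from surprise import Dataset, Reader, SVD
from surprise.model_selection import cross_validate

# Build a Surprise dataset from the pandas rating_df (columns: userId, movieId, rating)
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(rating_df[['userId', 'movieId', 'rating']], reader)

# SVD with k = 5 latent factors
algo = SVD(n_factors=5)
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)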

Module 4 - Text Analytics

Text Analytics is the process of extracting meaningful information, patterns, and insights
from unstructured text data. It uses techniques from Natural Language Processing (NLP),
machine learning, and statistics to analyze textual data and derive actionable insights.

Key Components of Text Analytics

• 1. Text Preprocessing: Cleaning and preparing text data for analysis. Techniques include
removing punctuation, numbers, and special characters, converting text to lowercase,
removing stop words, tokenization, stemming, and lemmatization (a minimal preprocessing
sketch follows this list).
• 2. Feature Extraction: Converting text into numerical formats for analysis. Techniques
include Bag of Words (BoW), TF-IDF, and Word Embeddings (e.g., Word2Vec, GloVe).
• 3. Sentiment Analysis: Determines the sentiment or emotion expressed in text (e.g.,
positive, negative, neutral). Applications include product reviews and social media
monitoring.
• 4. Topic Modeling: Identifies hidden themes or topics in a collection of documents.
Common algorithms: Latent Dirichlet Allocation (LDA), Non-Negative Matrix
Factorization (NMF).
• 5. Text Classification: Categorizes text into predefined labels (e.g., spam vs. not spam).
Algorithms include Naive Bayes, Support Vector Machines (SVM), and Neural Networks.
• 6. Named Entity Recognition (NER): Identifies and classifies entities in text (e.g., names,
dates, locations).
• 7. Text Summarization: Generates concise summaries of longer documents. Types:
Extractive and Abstractive summarization.
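A minimal preprocessing sketch using NLTK; the sample sentence is made up for illustration, and the resource downloads are needed only on first run.

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download('punkt')       # tokenizer models
nltk.download('stopwords')   # stop-word lists

text = "The products were delivered quickly, and the packaging was excellent!"

# Lowercase and tokenize
tokens = word_tokenize(text.lower())

# Keep alphabetic tokens only (drops punctuation and numbers), then remove stop words
stop_words = set(stopwords.words('english'))
tokens = [t for t in tokens if t.isalpha() and t not in stop_words]

# Stemming
stemmer = PorterStemmer()
print([stemmer.stem(t) for t in tokens])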

Applications of Text Analytics


• Customer Feedback Analysis: Analyze product reviews, survey responses, and
complaints.
• Social Media Monitoring: Track brand sentiment and trending topics on platforms like
Twitter.
• Fraud Detection: Identify fraudulent activities by analyzing textual patterns in claims or
logs.

• Healthcare: Extract insights from patient records, clinical notes, and research papers.
• Market Research: Understand consumer behavior and preferences from open-ended
survey responses.
• Content Recommendation: Suggest relevant articles, videos, or products based on user
preferences.
Techniques and Tools
• Natural Language Processing (NLP) Libraries: NLTK, spaCy, TextBlob, Gensim, and
Hugging Face Transformers.
• Machine Learning: Supervised and unsupervised learning algorithms for text
classification, clustering, and sentiment analysis.
• Visualization: Word clouds, bar charts, and heatmaps to visualize text data insights.

Example: Sentiment Analysis with Python

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Sample dataset
texts = ["I love this product!", "This is the worst experience.", "Absolutely fantastic!", "Not good at all."]
labels = [1, 0, 1, 0]  # 1: Positive, 0: Negative

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.25, random_state=42)

# Create pipeline
model = make_pipeline(CountVectorizer(), MultinomialNB())

# Train model
model.fit(X_train, y_train)

# Predict
predictions = model.predict(X_test)

# Evaluate
print("Accuracy:", accuracy_score(y_test, predictions))

Conclusion
Text Analytics is a powerful tool for transforming unstructured text into valuable insights.
By leveraging advanced NLP techniques and tools, organizations can enhance decision-
making, automate processes, and improve customer experiences.
Naïve Bayes Model for Sentiment Classification
The Naïve Bayes model is a probabilistic machine learning algorithm based on Bayes'
Theorem. It is particularly effective for text classification tasks like sentiment analysis due
to its simplicity, efficiency, and robustness with high-dimensional data.

Key Concepts of Naïve Bayes


1. **Bayes' Theorem**:

P(A|B) = [P(B|A) * P(A)] / P(B)


Where:

- P(A|B): Probability of A given B.
- P(B|A): Probability of B given A.
- P(A): Prior probability of A.
- P(B): Prior probability of B.

2. **Naïve Assumption**:
- Assumes independence among features (words in text).
- Simplifies computation by treating each feature as independent.

3. **Classes**:
- Sentiment classification typically involves two classes:
- Positive Sentiment: P(positive).
- Negative Sentiment: P(negative).
4. **Feature Extraction**:
- Text is converted into numerical features using techniques like Bag of Words or TF-IDF.

Advantages of Naïve Bayes


• Fast and Efficient: Works well with large datasets.

• Robust to Irrelevant Features: Can handle noisy data.


• Effective for Text Data: Especially suitable for high-dimensional sparse datasets.

Limitations of Naïve Bayes


• Independence Assumption: Words in text are often dependent on each other, which the
model does not consider.

• Imbalance in Classes: Performs poorly when one class dominates the dataset.

Example: Naïve Bayes for Sentiment Classification

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# Sample dataset
texts = [
    "I love this product, it's amazing!",
    "This is the worst purchase I ever made.",
    "Absolutely fantastic quality and design!",
    "Not happy with the experience at all.",
    "The product is okay, but could be better.",
    "Horrible service, I will not buy again.",
    "Great value for the price, very satisfied.",
    "Terrible quality, broke within a week."
]

# Labels: 1 for positive sentiment, 0 for negative sentiment
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.25, random_state=42)

# Convert text to numerical features using Bag of Words
vectorizer = CountVectorizer()
X_train_features = vectorizer.fit_transform(X_train)
X_test_features = vectorizer.transform(X_test)

# Train Naïve Bayes model
model = MultinomialNB()
model.fit(X_train_features, y_train)

# Make predictions
predictions = model.predict(X_test_features)

# Evaluate the model
print("Accuracy:", accuracy_score(y_test, predictions))
print("Classification Report:")
print(classification_report(y_test, predictions))

Applications of Naïve Bayes in Sentiment Classification


• Product Reviews: Analyze customer feedback to identify sentiments.
• Social Media Monitoring: Classify tweets or posts as positive or negative.
• Customer Support: Automatically tag support tickets based on sentiment.
• Market Research: Gauge public opinion on brands or products.
Conclusion
The Naïve Bayes model is a straightforward yet powerful tool for sentiment classification.
While it has limitations like the independence assumption, its speed, scalability, and
effectiveness with text data make it a popular choice for sentiment analysis tasks.

Naïve Bayes for Sentiment Classification Using TF-IDF


This example demonstrates using the TF-IDF Vectorizer for feature extraction in a Naïve Bayes model
for sentiment classification. TF-IDF assigns importance to terms based on their frequency within a
document and across the corpus, making it a powerful tool for text classification tasks.

Code Example

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# Sample dataset
texts = [
    "I love this product, it's amazing!",
    "This is the worst purchase I ever made.",
    "Absolutely fantastic quality and design!",
    "Not happy with the experience at all.",
    "The product is okay, but could be better.",
    "Horrible service, I will not buy again.",
    "Great value for the price, very satisfied.",
    "Terrible quality, broke within a week."
]

# Labels: 1 for positive sentiment, 0 for negative sentiment
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.25, random_state=42)

# Convert text to numerical features using TF-IDF
vectorizer = TfidfVectorizer()
X_train_features = vectorizer.fit_transform(X_train)
X_test_features = vectorizer.transform(X_test)

# Train Naïve Bayes model
model = MultinomialNB()
model.fit(X_train_features, y_train)

# Make predictions
predictions = model.predict(X_test_features)

# Evaluate the model
print("Accuracy:", accuracy_score(y_test, predictions))
print("Classification Report:")
print(classification_report(y_test, predictions))

Explanation of TF-IDF
1. **TF-IDF**:
- **TF** (Term Frequency): Measures how frequently a word appears in a document.
- **IDF** (Inverse Document Frequency): Reduces the weight of common words and
increases the importance of rare words across the corpus.

**Formula**:

TF-IDF(t, d) = TF(t, d) × IDF(t)

Where:
- TF(t, d): Number of occurrences of term t in document d / Total terms in document d.
- IDF(t): log(Total number of documents / Number of documents containing t).

Advantages of TF-IDF
• Assigns importance to terms based on their occurrence.

• Reduces the weight of stop words (e.g., 'and,' 'the').


• Enhances the performance of text classification models by focusing on relevant terms.

Conclusion
Using TF-IDF in conjunction with the Naïve Bayes model provides a robust and efficient
method for sentiment classification. It improves the model's ability to focus on meaningful terms
while downplaying common, less informative words.

Challenges of Text Analytics


Text analytics is a powerful tool, but it comes with several challenges that arise from the
inherent complexity and variability of natural language. Below are the primary challenges
faced in text analytics:
1. Data Preprocessing
Text data is often unstructured and noisy, requiring significant preprocessing before
analysis.

• Challenges:
• Handling spelling errors, grammatical errors, and abbreviations.
• Removing irrelevant information (e.g., advertisements in social media posts).
• Dealing with different text encodings and formats.

2. Ambiguity in Language

Natural language is inherently ambiguous, and words or sentences can have multiple
meanings depending on the context.

• Examples:
• The word 'bank' could mean a financial institution or the side of a river.
• Sarcasm and irony are difficult to detect.

3. Domain-Specific Language
Text from different domains (e.g., medical, legal, technical) often requires domain-specific
knowledge for effective analysis.

• Challenges:
• Building custom dictionaries or ontologies for specialized domains.
• Adapting general models to work in niche fields.
4. Multilingual Text
Analyzing text in multiple languages increases the complexity of processing.

• Challenges:

• Translating text without losing meaning.


• Handling languages with different grammar, syntax, or writing systems (e.g., Chinese,
Arabic).

5. Sentiment Analysis Challenges


Sentiment analysis involves identifying emotions, but text often contains mixed or conflicting
sentiments.

• Examples:
• "The product is great, but the service was awful" contains both positive and negative
sentiments.
• Sarcasm and humor can mislead sentiment models.

6. Scalability
Processing large volumes of text data in real-time or at scale can be computationally
intensive.
• Challenges:
• Efficiently storing and indexing massive datasets.
• Scaling algorithms to handle big data.

7. Contextual Understanding
Capturing the meaning of a sentence requires understanding its context.

• Challenges:
• Resolving coreferences (e.g., identifying that 'he' refers to 'John').
• Understanding long-term dependencies in lengthy documents.

8. Evaluation Metrics
Measuring the performance of text analytics models is not straightforward.

• Challenges:
• Lack of standardized benchmarks for certain tasks.
• Difficulty in defining accuracy for subjective tasks like sentiment analysis.

9. Evolving Language
Language evolves over time, introducing new words, phrases, and slang.

• Challenges:
• Keeping models updated with the latest vocabulary and trends.
• Adapting to changes in user behavior and text patterns.
10. Data Privacy and Ethics
Text analytics often involves processing sensitive or personal information.

• Challenges:

• Ensuring compliance with privacy regulations (e.g., GDPR).


• Avoiding biases in text analytics models.

11. Bias in Text Data


Text data can contain biases that, if not addressed, may lead to unfair outcomes.

• Challenges:

• Identifying and mitigating biases in training data.


• Ensuring fairness and inclusivity in text analytics models.

Conclusion
While text analytics offers tremendous opportunities, addressing these challenges requires
advanced techniques, domain expertise, and continuous model improvement. By
overcoming these hurdles, organizations can unlock the full potential of their textual data.
Module 4 - Recommender System
What is a Recommender System?
A recommender system is a type of machine learning application designed to suggest
relevant items to users based on their preferences, behavior, and other contextual factors. It
is widely used in various domains, such as e-commerce, entertainment, education, and
social networks, to improve user experience and drive engagement.

Types of recommender systems include:

• Content-Based Filtering
• Collaborative Filtering
• Hybrid Systems
• Knowledge-Based Systems
• Deep Learning-Based Systems

Recommender systems analyze user behavior and preferences to improve decision-making,
engagement, and satisfaction.

Types of Recommender Systems

1. Content-Based Filtering
Content-based filtering recommends items similar to those the user has interacted with in
the past by analyzing the item's attributes. It matches item features with user preferences
and uses similarity metrics to suggest items.

Advantages:

• Personalized to the user


• Does not require data about other users

Challenges:

• Limited to item attributes


• Struggles with cold start problem

Example: Recommending movies with similar genres, actors, or directors to those already
watched.

2. Collaborative Filtering
Collaborative filtering relies on the preferences and actions of other users to make
recommendations. It includes user-based and item-based approaches.

Advantages:
• No need to understand item attributes
• Handles diverse datasets well

Challenges:

• Cold start problem for new users or items


• Sparse data can reduce effectiveness

Example: Amazon's 'Customers who bought this also bought...' feature.

3. Hybrid Systems

Hybrid systems combine multiple approaches (e.g., content-based and collaborative
filtering) to improve accuracy. They mitigate weaknesses of individual systems and are
more robust.

Challenges: Higher complexity and requires careful tuning to balance methods.

Example: Netflix combining collaborative filtering and content-based filtering.

4. Knowledge-Based Systems
Knowledge-based systems leverage specific knowledge about how certain item features
satisfy user requirements. They do not rely on user history and work well for new users and
items.

Challenges: Requires detailed item and user information and can be less adaptive.
Example: A travel booking system recommending trips based on budget, preferences, and
travel dates.

5. Deep Learning-Based Systems


Deep learning-based systems use neural networks to capture complex relationships between users
and items. They analyze user behavior over time and handle large-scale data.

Advantages: Captures subtle patterns and is suitable for large datasets.

Challenges: Requires significant computational resources and large training data.

Example: Spotify's 'Discover Weekly' playlist.


Association Rules
Association rules are a data mining technique used to find relationships or patterns among
items in large datasets. These rules are widely used in market basket analysis, where the
goal is to discover relationships between items that customers frequently buy together.

Key Concepts of Association Rules

1. Itemset
A collection of one or more items.

Example: {milk, bread, butter} is an itemset.

2. Support
Measures how frequently an itemset appears in the dataset.

**Formula**:

Support(A) = (Transactions containing A) / (Total transactions)

Example: If {milk, bread} appears in 3 out of 10 transactions, support is 0.3 (30%).

3. Confidence
Measures the likelihood that item B is purchased when item A is purchased.

**Formula**:
Confidence(A → B) = Support(A ∪ B) / Support(A)

Example: If 50% of transactions with {milk} also include {bread}, confidence is 0.5 (50%).

4. Lift

Indicates the strength of a rule by comparing the observed support to the expected support
if A and B were independent.

**Formula**:

Lift(A → B) = Confidence(A → B) / Support(B)

Example: If {milk} and {bread} are purchased together more often than expected by chance, the lift
is greater than 1.
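A small worked sketch of these three metrics on a toy set of baskets (the baskets themselves are made up for illustration):

baskets = [
    {'milk', 'bread', 'butter'},
    {'milk', 'bread'},
    {'milk', 'eggs'},
    {'bread', 'eggs'},
]
n = len(baskets)

def support(*items):
    # Fraction of baskets containing all the given items
    return sum(1 for b in baskets if set(items) <= b) / n

support_milk_bread = support('milk', 'bread')        # 2/4 = 0.5
confidence = support_milk_bread / support('milk')    # 0.5 / 0.75 ≈ 0.67
lift = confidence / support('bread')                 # 0.67 / 0.75 ≈ 0.89 (< 1, weak association)

print(support_milk_bread, confidence, lift)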

Structure of Association Rules


Rules are expressed as A → B, where:

- A: Antecedent (item or set of items).

- B: Consequent (item or set of items).

Example: {milk} → {bread} means 'If a customer buys milk, they are likely to buy bread.'
Steps to Generate Association Rules

1. Generate Frequent Itemsets


Identify all itemsets with a support value greater than a predefined threshold.

Algorithm: Apriori or FP-Growth.

2. Generate Strong Rules


From the frequent itemsets, generate rules with high confidence and lift values.

Applications of Association Rules
• Market Basket Analysis: Discover products often purchased together to improve cross-
selling and upselling. Example: If customers buy {diapers}, they often buy {beer}.
• Recommendation Systems: Suggest items based on association rules. Example: E-
commerce suggesting 'Frequently Bought Together' products.
• Fraud Detection: Identify unusual patterns in financial transactions.
• Healthcare: Find associations between symptoms and diseases or medications and side
effects.

Advantages


• Easy to understand and interpret.
• Applicable to a wide range of industries.

Limitations
• May generate too many rules, making it hard to analyze.
• Does not capture time-based patterns or sequences.
Collaborative Filtering
Collaborative filtering is a popular technique in recommendation systems that predicts a
user's interest in items by analyzing their past behavior and the behavior of other users. It is
based on the assumption that if users have agreed in the past, they are likely to agree in the
future.

Types of Collaborative Filtering

1. User-Based Collaborative Filtering


Finds users similar to the target user based on their preferences and recommends items
that those similar users liked.
Example:

Alice and Bob have similar preferences. If Bob likes a new movie, Alice might like it too.

Advantages:

• Easy to understand and implement.

Challenges:

• Not scalable for a large number of users.

2. Item-Based Collaborative Filtering



Focuses on the similarity between items rather than users. It recommends items that are
frequently liked or purchased together.

Example:

If many users who bought a smartphone also bought a case, the system recommends a case
when a smartphone is purchased.

Advantages:
• More scalable than user-based filtering.

Challenges:

• Requires item similarity computation, which can be intensive for large datasets.

3. Matrix Factorization
Decomposes the user-item interaction matrix into latent factors representing user and item
characteristics. This is often implemented using techniques like Singular Value
Decomposition (SVD).

Advantages:

• Handles sparse data effectively.


• Reduces dimensionality.

Challenges:

• Requires proper parameter tuning.

Key Components of Collaborative Filtering

1. Interaction Matrix
A matrix where rows represent users and columns represent items. Example: If a user rates a movie,
the matrix records the rating at the corresponding row and column.

2. Similarity Metrics
Measures how similar users or items are to one another. Common metrics include Cosine
similarity, Pearson correlation, and Jaccard similarity.

3. Prediction
Predicts the likelihood of a user liking an item based on the preferences of similar users or
items.

Steps to Implement Collaborative Filtering


1. Collect Data: Gather user-item interaction data, such as ratings, purchases, or clicks.
2. Compute Similarity: Calculate similarity scores between users or items using a
similarity metric.
3. Generate Recommendations: Recommend items based on the preferences of similar
users or frequently associated items.
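A minimal sketch of these steps using pandas and scikit-learn; the tiny ratings table is made up for illustration.

import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# 1. Collect data: a tiny user-item rating matrix (0 = not rated)
ratings = pd.DataFrame(
    [[5, 4, 0, 1],
     [4, 5, 4, 0],
     [1, 0, 5, 4]],
    index=['user1', 'user2', 'user3'],
    columns=['itemA', 'itemB', 'itemC', 'itemD'])

# 2. Compute similarity between users
user_sim = pd.DataFrame(cosine_similarity(ratings), index=ratings.index, columns=ratings.index)

# 3. Recommend: take the user most similar to user1 and suggest items that
#    user rated highly but user1 has not rated yet
most_similar = user_sim['user1'].drop('user1').idxmax()
recommend = ratings.columns[(ratings.loc[most_similar] >= 4) & (ratings.loc['user1'] == 0)]
print(most_similar, list(recommend))   # -> user2, ['itemC']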

Applications of Collaborative Filtering


• E-commerce: Recommending products to users based on their purchase history and the
preferences of similar customers.
• Streaming Services: Suggesting movies, TV shows, or music tracks based on user
viewing or listening patterns.
• Education: Recommending courses or learning resources based on user engagement
and preferences.
• Social Media: Suggesting friends, groups, or content based on user behavior.

Advantages
• Works well when user and item metadata are unavailable.
• Captures complex relationships among users and items.

Limitations
• Cold Start Problem: Struggles to recommend items to new users or suggest new items
with no prior interactions.
• Data Sparsity: Many users interact with only a small subset of items, leading to sparse
interaction matrices.
• Scalability: Computing similarity in large datasets can be resource-intensive.

Surprise Library
Surprise is a Python library designed specifically for building and evaluating
recommendation systems. It is widely used for collaborative filtering, particularly matrix
factorization, and other algorithms that predict user preferences based on historical data.

Key Features of Surprise


• Supports Various Collaborative Filtering Algorithms: Algorithms like Singular Value
Decomposition (SVD), k-Nearest Neighbors (k-NN), and more.

• Customizable: Allows users to define their own similarity metrics and prediction
algorithms.
• Efficient: Optimized for performance, especially for matrix factorization methods.
• Dataset Management: Provides utilities for loading, splitting, and managing datasets.
• Evaluation Tools: Built-in functions for model evaluation using metrics like RMSE and
MAE.

Installation
Install the Surprise library using pip:

pip install scikit-surprise

Key Components of Surprise


1. Dataset: Tools to load datasets from files, built-in datasets (e.g., MovieLens), or directly
from data in Python objects.
2. Algorithms: Collaborative filtering techniques such as:
- BaselineOnly: Predicts based on baseline estimates.
- SVD: Singular Value Decomposition for matrix factorization.
- KNNBasic: Basic k-Nearest Neighbors approach.
- CoClustering: Co-clustering-based collaborative filtering.

3. Similarity Metrics: Cosine similarity, Pearson correlation, and custom distance metrics.

4. Evaluation: Tools for cross-validation and computing performance metrics.

Example: Building a Recommendation System
This example demonstrates how to build a recommendation system using a built-in dataset:

from surprise import SVD
from surprise import Dataset
from surprise.model_selection import cross_validate

# Load a built-in dataset (e.g., MovieLens 100k)
data = Dataset.load_builtin('ml-100k')

# Use SVD algorithm
algo = SVD()

# Perform cross-validation
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Custom Dataset Example

For datasets in custom formats (e.g., CSV):

from surprise import Reader, Dataset
from surprise import SVD
from surprise.model_selection import train_test_split
from surprise import accuracy

# Define a Reader with rating scale
reader = Reader(rating_scale=(1, 5))

# Load data from a custom file (replace 'data.csv' with your file)
data = Dataset.load_from_file('data.csv', reader=reader)

# Split into train and test sets
trainset, testset = train_test_split(data, test_size=0.25)

# Train a model
algo = SVD()
algo.fit(trainset)

# Test the model
predictions = algo.test(testset)

# Calculate RMSE
accuracy.rmse(predictions)

Applications of Surprise
• Movie Recommendations: Recommending movies based on user ratings (e.g., Netflix).
• E-commerce: Suggesting products based on purchase behavior.
• Content Personalization: Tailoring recommendations for individual users.

Advantages



• Easy to use and well-documented.
• Extensive support for collaborative filtering techniques.
• Optimized for performance.

Limitations
• Primarily focused on collaborative filtering.
• Requires familiarity with Python for effective use.

Matrix Factorization
Matrix Factorization is a popular approach used in recommendation systems, particularly in
collaborative filtering, to predict user-item interactions. It decomposes a large, sparse interaction
matrix (e.g., a user-item rating matrix) into smaller matrices, capturing latent factors that represent
user preferences and item characteristics.

Key Concepts
• 1. Interaction Matrix: A matrix where rows represent users, columns represent items,
and values represent interactions (e.g., ratings, clicks, purchases).
• 2. Latent Factors: Hidden dimensions that describe user preferences and item attributes
(e.g., genres for movies).
• 3. Decomposition: The interaction matrix R is decomposed into two matrices: User
Matrix (P) and Item Matrix (Q). R ≈ P × Q^T, where R is reconstructed using the dot
product of P and Q.
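A tiny numpy sketch of the decomposition idea; the matrices here are made up, whereas in practice P and Q are learned by minimizing the reconstruction error on the observed ratings.

import numpy as np

# User matrix P: 3 users x 2 latent factors
P = np.array([[1.2, 0.3],
              [0.2, 1.1],
              [0.9, 0.8]])

# Item matrix Q: 4 items x 2 latent factors
Q = np.array([[1.0, 0.1],
              [0.2, 1.3],
              [0.8, 0.9],
              [0.1, 1.0]])

# Reconstructed interaction matrix R ≈ P × Q^T (3 users × 4 items)
R_hat = P @ Q.T
print(R_hat.round(2))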


Matrix Factorization Techniques
• Singular Value Decomposition (SVD): Decomposes R into three matrices (U, Σ, V^T).
Efficient for dense matrices but struggles with sparse data.
• Alternating Least Squares (ALS): Optimizes one matrix at a time (e.g., fixing P and
solving for Q, then vice versa). Commonly used in large-scale systems.
• Non-Negative Matrix Factorization (NMF): Ensures that values in P and Q are non-
negative. Useful for interactions with non-negative values like counts.

Advantages of Matrix Factorization



• Captures Latent Relationships: Reveals hidden patterns in user preferences and item
attributes.
• Dimensionality Reduction: Reduces the complexity of large interaction matrices.
• Personalized Recommendations: Predicts missing values (e.g., ratings) using user and
item latent factors.

Limitations

• Cold Start Problem: Cannot recommend for new users or items with no prior
interactions.
• Scalability: Large datasets can be computationally expensive.
• Interpretability: Latent factors are abstract and may not have clear meanings.

Applications of Matrix Factorization


• Movie Recommendations: Suggests movies based on user ratings and latent factors (e.g.,
genres, directors).
• E-commerce: Recommends products by analyzing purchase history and item features.
• Music Streaming: Predicts songs a user might like based on listening habits and song
features.
• Education: Personalizes course recommendations based on student preferences and
engagement.

Example: Matrix Factorization with Python (Using Surprise Library)

from surprise import SVD
from surprise import Dataset
from surprise.model_selection import train_test_split
from surprise import accuracy

# Load dataset
data = Dataset.load_builtin('ml-100k')

# Split into train and test sets
trainset, testset = train_test_split(data, test_size=0.2)

# Apply SVD
algo = SVD()
algo.fit(trainset)

# Test the model
predictions = algo.test(testset)

# Calculate RMSE
accuracy.rmse(predictions)

Conclusion
Matrix Factorization is a powerful technique for recommendation systems, offering
personalized and scalable solutions by leveraging latent factors. However, it requires
adequate data and computational resources to achieve effective results.