AIML Module 4
Chapter 9: Recommender Systems
9.1 | OVERVIEW
9.1.1 | Datasets
Chapter 10: Text Analytics
10.1 | OVERVIEW
10.2 | Sentiment Classification
Amazon’s “Customers who buy this item also bought” and Netflix’s “shows and movies you may
want to watch” are examples of recommender systems. Recommender systems are very popular
for recommending products such as movies, music, news, books, articles, and groceries, and they act as a
backbone for cross-selling across industries.
Three algorithms that are widely used for building recommendation systems:
1. Association Rules
2. Collaborative Filtering
3. Matrix Factorization
9.1.1 | Datasets
For exploring the above specified algorithms, we will be using the following two publicly available
datasets and build recommendations.
1. groceries.csv: This dataset contains transactions of a grocery store and can be downloaded from
http://www.sci.csueastbay.edu/~esuess/classes/Statistics_6620/Presentations/ml13/groceries.csv.
2. Movie Lens: This dataset contains 20000263 ratings and 465564 tag applications across 27278
movies. As per the source of data, these data were created by 138493 users between January 09, 1995
and March 31, 2015. This dataset was generated on October 17, 2016. Users were selected and
included randomly. All selected users had rated at least 20 movies. The dataset can be downloaded
from the link https://grouplens.org/datasets/movielens/.
The primary objective of a recommender system is to predict items that a customer may purchase in
the future based on his/her purchases so far. If a customer buys beer in the future, can we predict what
he/she is most likely to buy along with it? To predict this, we need to find out which items have
shown a strong association with beer in previously purchased baskets. We can use the association rule
mining technique to find this out.
Association rule mining considers all possible combinations of items in the previous baskets and computes
various measures such as support, confidence, and lift to identify rules with stronger associations.
One of the challenges in association rule mining is the sheer number of item combinations that need to
be considered.
Association rule mining uses the Apriori algorithm, which eliminates items that cannot possibly be part of any
frequent itemset. The rules generated are represented as
{diaper} → {beer}
This means that customers who purchased diapers also purchased beer in the same basket. {diaper,
beer} together is called an itemset. {diaper} is called the antecedent and {beer} is called the
consequent.
Both antecedents and consequents can have multiple items, e.g., {diaper, milk} → {beer, bread} is
also a valid rule. Each rule is measured with a set of metrics.
9.2.1 | Metrics
Concepts such as support, confidence, and lift are used to generate association rules. These concepts
are explained below.
9.2.1.1 Support
Support indicates the frequency of items appearing together in baskets with respect to all possible
baskets being considered (or in a sample). For example, the support for (beer, diaper) will be 2/4
(based on the data shown in Figure 9.1), that is, 50% as it appears together in 2 baskets out of 4
baskets.
Assume that X and Y are items being considered. Let
1. N be the total number of baskets.
2. NXY represent the number of baskets in which X and Y appear together.
3. NX represent the number of baskets in which X appears.
4. NY represent the number of baskets in which Y appears.
Then the support between X and Y, Support(X, Y), is given by
Support(X, Y) = N_XY / N    (9.1)
Apriori algorithm uses minimum support criteria to reduce the number of possible itemset
combinations, which in turn reduces computational requirements.
If minimum support is set at 0.01, an association between X and Y will be considered if and only if
both X and Y have minimum support of 0.01.
Hence, apriori algorithm computes support for each item independently and eliminates items with
support less than minimum support. The support of each individual item can be calculated using Eq.
(9.1).
9.2.1.2 Confidence
Confidence measures the proportion of the transactions containing X that also contain Y. X is
called the antecedent and Y is called the consequent. Confidence can be calculated using the following
formula:
Confidence(X → Y) = Support(X, Y) / Support(X) = N_XY / N_X
9.2.1.3 Lift
Lift is calculated using the following formula:
Lift(X, Y) = Support(X, Y) / (Support(X) × Support(Y))
A lift value greater than 1 indicates that X and Y appear together more often than would be expected if they were independent.
The following code block can be used for loading and reading the data:
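A minimal sketch, assuming groceries.csv is a plain-text file with one comma-separated basket per line (as in the dataset linked above):

import csv

all_txns = []
with open('groceries.csv') as f:
    for items in csv.reader(f):
        # Each row is one transaction (basket); drop empty entries
        txn = [item.strip() for item in items if item.strip()]
        if txn:
            all_txns.append(txn)

# Print the first five transactions
print(all_txns[0:5])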
In the end, the variable all_txns will contain a list of orders and list of items in each order. An order
is also called a transaction. We print the first five transactions.
Output:
Before applying the Apriori algorithm, the transactions are converted into a one-hot-encoded format in which
each column represents an item and each row a transaction (Table 9.1). It can be noticed from Table 9.1 that
the transaction with index 5 contains the item butter (an item purchased by the customer) and the transaction
with index 7 contains the item bottled beer. All other entries in Table 9.1 are 0, which implies that those
items were not purchased.
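A sketch of this encoding step, assuming the mlxtend library (its TransactionEncoder builds exactly this kind of item-per-column True/False matrix):

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder

# One-hot encode the transactions: one row per basket, one column per item
encoder = TransactionEncoder()
onehot = encoder.fit(all_txns).transform(all_txns)
onehot_df = pd.DataFrame(onehot, columns=encoder.columns_)
onehot_df.head()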
9.2.2.3 Generating Association Rules
We will use the Apriori algorithm to generate itemsets. The total number of itemsets depends on the
number of items that exist across all transactions; there are 171 items in the dataset.
For itemsets containing 2 items each, the total number of possible itemsets is C(171, 2) = 14535,
which is a very large number and computationally intensive to evaluate. To limit
the number of generated rules, we will apply a minimum support value of 0.02. All items that do not
have the minimum support will be removed from the possible itemset combinations.
Apriori algorithm takes the following parameters:
1. df: pandas − DataFrame in a one-hot-encoded format.
2. min_support: float − A float between 0 and 1 for minimum support of the itemsets returned.
Default is 0.5.
3. use_colnames: boolean − If true, uses the DataFrames’ column names in the returned DataFrame
instead of column indices.
We will be using a minimum support of 0.02, that is, the itemset is available in at least 2% of all
transactions. The following commands can be used for setting minimum support.
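A sketch using mlxtend's apriori on the one-hot-encoded DataFrame from the earlier step:

from mlxtend.frequent_patterns import apriori

# Keep only itemsets that appear in at least 2% of all transactions
frequent_itemsets = apriori(onehot_df, min_support=0.02, use_colnames=True)
frequent_itemsets.sort_values('support', ascending=False).head(10)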
Output:
The apriori algorithm retains only the frequent itemsets whose support exceeds the 2% minimum.
From the above output we can infer that whole milk and yogurt appear together in about 5.6% of the
baskets. These itemsets can be passed to association_rules for generating rules and the corresponding
metrics.
The association_rules function takes the following parameters:
1. df: pandas − DataFrame of frequent itemsets with columns [‘support’, ‘itemsets’].
2. metric − Metric used to evaluate whether a rule is of interest; here ‘confidence’ and ‘lift’ are used. Default is ‘confidence’.
3. min_threshold − Minimal threshold for the evaluation metric to decide whether a candidate rule
is of interest.
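A sketch of the rule-generation step (the metric and threshold values shown are illustrative):

from mlxtend.frequent_patterns import association_rules

# Generate rules from the frequent itemsets and keep the strongest ones
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.0)
rules.sort_values('confidence', ascending=False).head(10)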
Output:
From above output, we can infer that the probability that a customer buys (whole milk), given he/she
has bought (yogurt, other vegetables), is 0.51. Now, these rules can be used to create strategies like
keeping the items together in store shelves or cross-selling.
9.3 | Collaborative Filtering
Collaborative filtering recommends items to a user based on the preferences of similar users.
The users are represented using their ratings on the Euclidean space in Figure 9.3. Here the dimensions
are represented by the two books Into Thin Air and Missoula, which are the two books commonly
bought by Rahul, Purvi, and Gaurav.
Figure 9.3 shows that Rahul’s preferences are similar to Purvi’s rather than to Gaurav’s. So, the other
book, Into the Wild, which Rahul has bought and rated high, can now be recommended to Purvi.
Collaborative filtering comes in two variations:
1. User-Based Similarity: Finds K similar users, based on common items they have bought.
2. Item-Based Similarity: Finds K similar items, based on common users who have bought
those items. Both algorithms are similar to K-Nearest Neighbors (KNN).
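The following is a minimal sketch for loading the MovieLens ratings, assuming a ratings.csv file with columns userId, movieId, rating, and timestamp (as in the dataset described in Section 9.1.1):

import pandas as pd

# Load the MovieLens ratings file
rating_df = pd.read_csv('ratings.csv')
rating_df.head()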
Output:
The timestamp column will not be used in this example, so it can be dropped from the dataframe.
The number of unique users in the dataset can be found using method unique() on userId column.
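A sketch of these two steps (column and variable names follow the description above):

# Drop the timestamp column, which is not used in this example
rating_df.drop('timestamp', axis=1, inplace=True)

# Number of unique users and unique movies
print(len(rating_df.userId.unique()), len(rating_df.movieId.unique()))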
Now we need to create a pivot table or matrix and represent users as rows and movies as columns.
The values of the matrix will be the ratings the users have given to those movies. As there are 671
users and 9066 movies, we will have a matrix of size 671 × 9066. The matrix will be very sparse, as
very few cells will be filled: ratings exist only for those movies that the users have actually watched.
Movies that a user has not watched and rated yet will be represented as NaN. Pandas
DataFrame has a pivot method which takes the following three parameters:
1. index: Column value to be used as DataFrame’s index. So, it will be userId column of
rating_df.
2. columns: Column values to be used as DataFrame’s columns. So, it will be movieId column
of rating_df.
3. values: Column to use for populating DataFrame’s values. So, it will be rating column of
rating_df.
The DataFrame contains NaN for those entries where users have not watched or rated a movie. We can
impute those NaNs with 0 values using the following code. The results are shown.
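A sketch combining the pivot and the NaN imputation (the variable name is illustrative):

# Build the user x movie rating matrix and replace NaN (unrated movies) with 0
user_movies_df = rating_df.pivot(index='userId', columns='movieId', values='rating')
user_movies_df.fillna(0, inplace=True)
user_movies_df.head()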
We can print the similarity between first 5 users by using the following code. The result is shown.
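A sketch using scikit-learn's cosine_similarity on the imputed matrix:

from sklearn.metrics.pairwise import cosine_similarity

# Pairwise cosine similarity between all users
user_sim = cosine_similarity(user_movies_df)
user_sim_df = pd.DataFrame(user_sim,
                           index=user_movies_df.index,
                           columns=user_movies_df.index)
user_sim_df.iloc[0:5, 0:5]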
user_sim_df matrix shape shows that it contains the cosine similarity between all possible pairs of
users. And each cell represents the cosine similarity between two specific users. For example, the
similarity between userid 1 and userid 5 is 0.016818 (Poor similarity).
The diagonal of the matrix shows the similarity of a user with himself or herself (i.e., 1.0). This is true as each
user is most similar to himself or herself. But we need the algorithm to find other users who are similar
to a specific user. So, we will set the diagonal values to 0.0, which avoids selecting the user himself or
herself as the most similar user.
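A sketch of these steps (numpy's fill_diagonal modifies the similarity array in place):

import numpy as np

# A user should not be his/her own nearest neighbour: zero out the diagonal
np.fill_diagonal(user_sim, 0)
user_sim_df = pd.DataFrame(user_sim,
                           index=user_movies_df.index,
                           columns=user_movies_df.index)

# Most similar user for each user
user_sim_df.idxmax(axis=1)[0:5]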
The above result shows user 325 is most similar to user 1, user 338 is most similar to user 2, and so
on. To dive a little deeper to understand the similarity, let us print the similarity values between user
2 and users ranging from 331 to 340.
The output shows that the cosine similarity between userid 2 and userid 338 is 0.581528 and highest.
But why is user 338 most similar to user 2? This can be explained intuitively if we can verify that
the two users have watched several movies in common and rated very similarly. For this, we need to
read movies dataset, which contains the movie id along with the movie name.
To find out the movies, user 2 and user 338 have watched in common and how they have rated each
one of them, we will filter out movies that both have rated at least 4 to limit the number of movies to
print.
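A sketch of such a filter, assuming a movies_df DataFrame (movieId, title) read from the movies dataset mentioned above; the helper name is illustrative:

def get_common_movies(user_1, user_2, min_rating=4.0):
    # Movies each user has rated at least min_rating
    u1 = rating_df[(rating_df.userId == user_1) & (rating_df.rating >= min_rating)]
    u2 = rating_df[(rating_df.userId == user_2) & (rating_df.rating >= min_rating)]
    # Keep only the movies both users have rated, and attach the movie titles
    common = u1.merge(u2, on='movieId', suffixes=('_user1', '_user2'))
    return common.merge(movies_df[['movieId', 'title']], on='movieId')

get_common_movies(2, 338)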
From the table we can see that users 2 and 338 have watched 7 movies in common and have rated
almost on the same scale. Their preferences seem to be very similar.
Let us check users 2 and 332, whose cosine similarity is 0.002368.
Users 2 and 332 have only one movie in common and have rated very differently. They indeed are
very dissimilar.
9.3.2.6 Challenges with User-Based Similarity
Finding user similarity does not work for new users. We need to wait until the new user buys a few
items and rates them. Only then users with similar preferences can be found and recommendations
can be made based on that. This is called the cold start problem in recommender systems. This can be
overcome by using item-based similarity. Item-based similarity is based on the notion that if two
items have been bought by many users and rated similarly, then there must be some inherent
relationship between these two items. In other terms, in future, if a user buys one of those two items,
he or she will most likely buy the other one.
Now, the following code is used to print similarity between the first 5 movies. The results are shown.
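A sketch of the item-based similarity matrix and of the get_similar_movies() helper referenced below; the helper body is an assumption based on its description:

# Item-item similarity: transpose so that each row is a movie
movie_sim = cosine_similarity(user_movies_df.T)
movie_sim_df = pd.DataFrame(movie_sim,
                            index=user_movies_df.columns,
                            columns=user_movies_df.columns)
movie_sim_df.iloc[0:5, 0:5]

def get_similar_movies(movie_id, topN=5):
    # Most similar movies, excluding the movie itself (similarity 1.0)
    similar = movie_sim_df[movie_id].sort_values(ascending=False)[1:topN + 1]
    return movies_df[movies_df.movieId.isin(similar.index)]

get_similar_movies(858)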
The above method get_similar_movies() takes movie id as an argument and returns other movies
which are similar to it. Let us find out how the similarities play out by finding out movies which are
similar to the movie Godfather. The movie id for the movie Godfather is 858.
It can be observed from Table 9.13 that users who watched ‘Godfather, The’, also watched
‘Godfather: Part II’ the most. This makes absolute sense! It also indicates that the users have watched
Goodfellas (1990), One Flew Over the Cuckoo’s Nest (1975), and Untouchables, The (1987).
So, in future, if any user watches ‘Godfather, The’, the other movies can be recommended to them.
The Users–Movies matrix contains the ratings of 3 users (U1, U2, U3) for 5 movies (M1 through
M5). This Users–Movies matrix is factorized into a (3, 3) Users–Factors matrix and (3, 5) Factors–
Movies matrix. Multiplying the Users–Factors and Factors–Movies matrix will result in the original
Users–Movies matrix.
The idea behind matrix factorization is that there are latent factors that determine why a user rates a
movie, and the way he/she rates. The factors could be the story or actors or any other specific
attributes of the movies. But we may never know what these factors actually represent. That is why
they are called latent factors. A matrix with size (n, m), where n is the number of users and m is the
number of movies, can be factorized into (n, k) and (k, m) matrices, where k is the number of factors.
The Users–Factors matrix represents that there are three factors and how each user has preferences
towards these factors. Factors–Movies matrix represents the attributes the movies possess.
In the above example, U1 has the highest preference for factor F2, whereas U2 has the highest
preference for factor F3. Similarly, the F2 factor is high in movies M2 and M4. Probably this is the
reason why U1 has given high ratings to movies M2 (4) and M4 (5).
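A minimal numpy sketch of the shapes involved (the values are random placeholders, not the ratings discussed above):

import numpy as np

n_users, n_movies, n_factors = 3, 5, 3

# Users-Factors (n, k) and Factors-Movies (k, m) matrices
P = np.random.rand(n_users, n_factors)
Q = np.random.rand(n_factors, n_movies)

# Their product approximates the original Users-Movies matrix (n, m)
R_approx = P @ Q
print(R_approx.shape)  # (3, 5)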
One of the popular techniques for matrix factorization is Singular Value Decomposition (SVD).
Python Surprise library provides SVD algorithm, which takes the number of factors (n_factors) as a
parameter.
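A minimal sketch with the Surprise library, assuming the rating_df loaded earlier (the n_factors value is illustrative):

from surprise import SVD, Dataset, Reader, accuracy
from surprise.model_selection import train_test_split

# Load the ratings DataFrame into Surprise's format
reader = Reader(rating_scale=(0.5, 5))
data = Dataset.load_from_df(rating_df[['userId', 'movieId', 'rating']], reader)
trainset, testset = train_test_split(data, test_size=0.2)

# SVD with a chosen number of latent factors
svd = SVD(n_factors=20)
svd.fit(trainset)
accuracy.rmse(svd.test(testset))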
Text Analytics is the process of extracting meaningful information, patterns, and insights
from unstructured text data. It uses techniques from Natural Language Processing (NLP),
machine learning, and statistics to analyze textual data and derive actionable insights.
• 1. Text Preprocessing: Cleaning and preparing text data for analysis. Techniques include
removing punctuation, numbers, special characters, converting text to lowercase,
removing stop words, tokenization, stemming, and lemmatization (a short sketch follows this list).
• 2. Feature Extraction: Converting text into numerical formats for analysis. Techniques
include Bag of Words (BoW), TF-IDF, and Word Embeddings (e.g., Word2Vec, GloVe).
• 3. Sentiment Analysis: Determines the sentiment or emotion expressed in text (e.g.,
positive, negative, neutral). Applications include product reviews and social media
monitoring.
• 4. Topic Modeling: Identifies hidden themes or topics in a collection of documents.
Common algorithms: Latent Dirichlet Allocation (LDA), Non-Negative Matrix
Factorization (NMF).
5. Text Classification: Categorizes text into predefined labels (e.g., spam vs. not spam).
Algorithms include Naive Bayes, Support Vector Machines (SVM), and Neural Networks.
• 6. Named Entity Recognition (NER): Identifies and classifies entities in text (e.g., names,
dates, locations).
• 7. Text Summarization: Generates concise summaries of longer documents. Types:
Extractive and Abstractive summarization.
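A short preprocessing sketch using NLTK (one of the libraries listed under Techniques and Tools below); the sample sentence is illustrative, and the stopword corpus must be downloaded once:

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# One-time download: nltk.download('stopwords')
text = "Text Analytics extracts meaningful insights from unstructured text!"

# Lowercase and remove punctuation, numbers, and special characters
clean = re.sub(r'[^a-z\s]', ' ', text.lower())

# Tokenize, remove stop words, and stem
tokens = clean.split()
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
processed = [stemmer.stem(t) for t in tokens if t not in stop_words]
print(processed)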
Applications of Text Analytics
• Healthcare: Extract insights from patient records, clinical notes, and research papers.
• Market Research: Understand consumer behavior and preferences from open-ended
survey responses.
• Content Recommendation: Suggest relevant articles, videos, or products based on user
preferences.
Techniques and Tools
• Natural Language Processing (NLP) Libraries: NLTK, spaCy, TextBlob, Gensim, and
Hugging Face Transformers.
• Machine Learning: Supervised and unsupervised learning algorithms for text
classification, clustering, and sentiment analysis.
• Visualization: Word clouds, bar charts, and heatmaps to visualize text data insights.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Sample dataset
texts = ["I love this product!", "This is the worst experience.",
         "Absolutely fantastic!", "Not good at all."]
labels = [1, 0, 1, 0]  # 1: Positive, 0: Negative

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.25, random_state=42)

# Train model: Bag-of-Words features followed by a Naive Bayes classifier
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(X_train, y_train)

# Predict
predictions = model.predict(X_test)

# Evaluate
print("Accuracy:", accuracy_score(y_test, predictions))
Conclusion
Text Analytics is a powerful tool for transforming unstructured text into valuable insights.
By leveraging advanced NLP techniques and tools, organizations can enhance decision-
making, automate processes, and improve customer experiences.
Naïve Bayes Model for Sentiment Classification
The Naïve Bayes model is a probabilistic machine learning algorithm based on Bayes'
Theorem. It is particularly effective for text classification tasks like sentiment analysis due
to its simplicity, efficiency, and robustness with high-dimensional data.
1. **Bayes' Theorem**: P(A|B) = P(B|A) × P(A) / P(B), where:
- P(A|B): Probability of A given B.
- P(B|A): Probability of B given A.
- P(A): Prior probability of A.
- P(B): Prior probability of B.
2. **Naïve Assumption**:
- Assumes independence among features (words in text).
- Simplifies computation by treating each feature as independent.
3. **Classes**:
- Sentiment classification typically involves two classes:
- Positive Sentiment: P(positive).
- Negative Sentiment: P(negative).
4. **Feature Extraction**:
- Text is converted into numerical features using techniques like Bag of Words or TF-IDF.
Limitations
• Imbalance in Classes: Performs poorly when one class dominates the dataset.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Sample dataset (reusing the texts and labels from the previous example)
texts = ["I love this product!", "This is the worst experience.",
         "Absolutely fantastic!", "Not good at all."]
labels = [1, 0, 1, 0]  # 1: Positive, 0: Negative

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.25, random_state=42)

# Convert text to numerical features using Bag of Words
vectorizer = CountVectorizer()
X_train_features = vectorizer.fit_transform(X_train)
X_test_features = vectorizer.transform(X_test)

# Train Naïve Bayes model
model = MultinomialNB()
model.fit(X_train_features, y_train)

# Make predictions
predictions = model.predict(X_test_features)
print(classification_report(y_test, predictions))
TF-IDF (Term Frequency-Inverse Document Frequency) can also be used with the Naïve
Bayes model for sentiment classification. TF-IDF assigns importance to terms based on their
frequency within a document and across the corpus, making it a powerful tool for text
classification tasks.
Code Example
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Sample dataset
texts = [
    "I love this product, it's amazing!",
    "This is the worst purchase I ever made.",
    "Absolutely fantastic quality and design!",
    "Not happy with the experience at all.",
    "The product is okay, but could be better.",
]
labels = [1, 0, 1, 0, 0]  # illustrative labels: 1 = positive, 0 = negative

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.25, random_state=42)

# Convert text to numerical features using TF-IDF
vectorizer = TfidfVectorizer()
X_train_features = vectorizer.fit_transform(X_train)
X_test_features = vectorizer.transform(X_test)

# Train Naïve Bayes model
model = MultinomialNB()
model.fit(X_train_features, y_train)

# Make predictions
predictions = model.predict(X_test_features)

# Evaluate
print("Accuracy:", accuracy_score(y_test, predictions))
print("Classification Report:")
print(classification_report(y_test, predictions))
Explanation of TF-IDF
1. **TF-IDF**:
Formula: TF-IDF(t, d) = TF(t, d) × IDF(t)
- **TF** (Term Frequency): Measures how frequently a term appears within a document.
- **IDF** (Inverse Document Frequency): Reduces the weight of common words and
increases the importance of rare words across the corpus.
Advantages of TF-IDF
• Assigns importance to terms based on their occurrence.
Conclusion
Using TF-IDF in conjunction with the Naïve Bayes model provides a robust and efficient
method for sentiment classification. It improves the model's ability to focus on meaningful
terms rather than on words that are common across the whole corpus.
Challenges in Text Analytics
1. Noisy and Unstructured Data
Real-world text is often noisy and unstructured.
• Challenges:
• Handling spelling errors, grammatical errors, and abbreviations.
• Removing irrelevant information (e.g., advertisements in social media posts).
• Dealing with different text encodings and formats.
2. Ambiguity in Language
Natural language is inherently ambiguous, and words or sentences can have multiple
meanings depending on the context.
• Examples:
• The word 'bank' could mean a financial institution or the side of a river.
• Sarcasm and irony are difficult to detect.
3. Domain-Specific Language
Text from different domains (e.g., medical, legal, technical) often requires domain-specific
knowledge for effective analysis.
• Challenges:
• Building custom dictionaries or ontologies for specialized domains.
• Adapting general models to work in niche fields.
4. Multilingual Text
Analyzing text in multiple languages increases the complexity of processing.
• Challenges:
5. Mixed Sentiments
A single piece of text can contain mixed or conflicting sentiments.
• Examples:
• "The product is great, but the service was awful" contains both positive and negative
sentiments.
• Sarcasm and humor can mislead sentiment models.
6. Scalability
Processing large volumes of text data in real-time or at scale can be computationally
intensive.
• Challenges:
• Efficiently storing and indexing massive datasets.
• Scaling algorithms to handle big data.
7. Contextual Understanding
Capturing the meaning of a sentence requires understanding its context.
• Challenges:
• Resolving coreferences (e.g., identifying that 'he' refers to 'John').
• Understanding long-term dependencies in lengthy documents.
8. Evaluation Metrics
Measuring the performance of text analytics models is not straightforward.
• Challenges:
• Lack of standardized benchmarks for certain tasks.
• Difficulty in defining accuracy for subjective tasks like sentiment analysis.
9. Evolving Language
Language evolves over time, introducing new words, phrases, and slang.
• Challenges:
• Keeping models updated with the latest vocabulary and trends.
• Adapting to changes in user behavior and text patterns.
10. Data Privacy and Ethics
Text analytics often involves processing sensitive or personal information.
• Challenges:
Conclusion
While text analytics offers tremendous opportunities, addressing these challenges requires
advanced techniques, domain expertise, and continuous model improvement. By
overcoming these hurdles, organizations can unlock the full potential of their textual data.
Module 4 - Recommender System
What is a Recommender System?
A recommender system is a type of machine learning application designed to suggest
relevant items to users based on their preferences, behavior, and other contextual factors. It
is widely used in various domains, such as e-commerce, entertainment, education, and
social networks, to improve user experience and drive engagement.
Types of recommender systems include:
• Content-Based Filtering
• Collaborative Filtering
• Hybrid Systems
• Knowledge-Based Systems
• Deep Learning-Based Systems
Recommender systems analyze user behavior and preferences to improve decision-making,
engagement, and satisfaction.
1. Content-Based Filtering
Content-based filtering recommends items similar to those the user has interacted with in
the past by analyzing the item's attributes. It matches item features with user preferences
and uses similarity metrics to suggest items.
Advantages:
Challenges:
Example: Recommending movies with similar genres, actors, or directors to those already
watched.
2. Collaborative Filtering
Collaborative filtering relies on the preferences and actions of other users to make
recommendations. It includes user-based and item-based approaches.
Advantages:
• No need to understand item attributes
• Handles diverse datasets well
Challenges:
3. Hybrid Systems
Hybrid systems combine multiple approaches (e.g., content-based and collaborative
filtering) to improve accuracy. They mitigate weaknesses of individual systems and are
more robust.
4. Knowledge-Based Systems
Knowledge-based systems leverage specific knowledge about how certain item features
satisfy user requirements. They do not rely on user history and work well for new users and
items.
Challenges: Requires detailed item and user information and can be less adaptive.
Example: A travel booking system recommending trips based on budget, preferences, and
travel dates.
5. Deep Learning-Based Systems
Deep learning-based systems use neural networks to model complex relationships
between users and items. They analyze user behavior over time and handle large-scale data.
Association Rules: Key Concepts
1. Itemset
A collection of one or more items.
Example: {milk, bread, butter} is an itemset.
2. Support
Measures how frequently an itemset appears in the dataset.
**Formula**:
Support(A) = (Number of transactions containing A) / (Total number of transactions)
Example: If {milk, bread} appears in 3 out of 10 transactions, support is 0.3 (30%).
3. Confidence
Measures how often item B appears in transactions that contain A.
**Formula**:
Confidence(A → B) = Support(A ∪ B) / Support(A)
Example: If 50% of transactions with {milk} also include {bread}, confidence is 0.5 (50%).
4. Lift
Indicates the strength of a rule by comparing the observed support to the expected support
if A and B were independent.
**Formula**:
Lift(A → B) = Support(A ∪ B) / (Support(A) × Support(B))
Example: If {milk} and {bread} are purchased together more often than expected by chance,
the lift is greater than 1.
5. Association Rule
An implication of the form A → B, where A is the antecedent and B is the consequent.
Example: {milk} → {bread} means 'If a customer buys milk, they are likely to buy bread.'
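A small plain-Python sketch computing these metrics for the {milk} → {bread} example (the ten baskets are illustrative and chosen to match the support of 0.3):

# Ten illustrative baskets
transactions = [
    {'milk', 'bread'}, {'milk', 'bread'}, {'milk', 'bread'},
    {'milk'}, {'milk'}, {'milk'},
    {'bread'}, {'eggs'}, {'eggs'}, {'butter'},
]
n = len(transactions)

support_milk = sum('milk' in t for t in transactions) / n             # 0.6
support_bread = sum('bread' in t for t in transactions) / n           # 0.4
support_both = sum({'milk', 'bread'} <= t for t in transactions) / n  # 0.3

confidence = support_both / support_milk                              # 0.5
lift = support_both / (support_milk * support_bread)                  # 1.25
print(support_both, confidence, lift)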
Steps to Generate Association Rules
1. Find all frequent itemsets that satisfy a minimum support threshold (e.g., using the Apriori algorithm).
2. From each frequent itemset, generate rules that satisfy a minimum confidence (or lift) threshold.
Applications of Association Rules
• Market Basket Analysis: Discover products often purchased together to improve cross-
selling and upselling. Example: If customers buy {diapers}, they often buy {beer}.
• Recommendation Systems: Suggest items based on association rules. Example: E-
commerce suggesting 'Frequently Bought Together' products.
• Fraud Detection: Identify unusual patterns in financial transactions.
• Healthcare: Find associations between symptoms and diseases or medications and side
effects.
Advantages
• Easy to understand and interpret.
• Applicable to a wide range of industries.
Limitations
• May generate too many rules, making it hard to analyze.
• Does not capture time-based patterns or sequences.
Collaborative Filtering
Collaborative filtering is a popular technique in recommendation systems that predicts a
user's interest in items by analyzing their past behavior and the behavior of other users. It is
based on the assumption that if users have agreed in the past, they are likely to agree in the
future.
Types of Collaborative Filtering
1. User-Based Collaborative Filtering
Finds users similar to the target user and recommends items that those similar users liked.
Example:
Alice and Bob have similar preferences. If Bob likes a new movie, Alice might like it too.
Advantages:
Challenges:
2. Item-Based Collaborative Filtering
Focuses on the similarity between items rather than users. It recommends items that are
frequently liked or purchased together.
Example:
If many users who bought a smartphone also bought a case, the system recommends a case
when a smartphone is purchased.
Advantages:
• More scalable than user-based filtering.
Challenges:
• Requires item similarity computation, which can be intensive for large datasets.
3. Matrix Factorization
Decomposes the user-item interaction matrix into latent factors representing user and item
characteristics. This is often implemented using techniques like Singular Value
Decomposition (SVD).
Advantages:
Challenges:
Key Components of Collaborative Filtering
1. Interaction Matrix
A matrix where rows represent users and columns represent items. Example: If a user rates
a movie, the matrix records the rating at the corresponding row and column.
2. Similarity Metrics
Measures how similar users or items are to one another. Common metrics include Cosine
similarity, Pearson correlation, and Jaccard similarity.
3. Prediction
Predicts the likelihood of a user liking an item based on the preferences of similar users or
items.
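A minimal sketch of the prediction step in user-based filtering, using a similarity-weighted average of other users' ratings (the small matrix is illustrative):

import numpy as np

# Illustrative interaction matrix: rows = users, columns = items (0 = not rated)
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
])

# Cosine similarity of user 0 with every user
norms = np.linalg.norm(ratings, axis=1)
sims = ratings @ ratings[0] / (norms * norms[0])

# Predicted rating of user 0 for item 2: weighted average over users who rated it
others = [u for u in range(len(ratings)) if u != 0 and ratings[u, 2] > 0]
pred = sum(sims[u] * ratings[u, 2] for u in others) / sum(sims[u] for u in others)
print(round(pred, 2))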
Advantages
• Works well when user and item metadata are unavailable.
• Captures complex relationships among users and items.
Limitations
• Cold Start Problem: Struggles to recommend items to new users or suggest new items
with no prior interactions.
• Data Sparsity: Many users interact with only a small subset of items, leading to sparse
interaction matrices.
• Scalability: Computing similarity in large datasets can be resource-intensive.
Surprise Library
Surprise is a Python library designed specifically for building and evaluating
recommendation systems. It is widely used for collaborative filtering, particularly matrix
C
factorization, and other algorithms that predict user preferences based on historical data.
• Customizable: Allows users to define their own similarity metrics and prediction
algorithms.
• Efficient: Optimized for performance, especially for matrix factorization methods.
• Dataset Management: Provides utilities for loading, splitting, and managing datasets.
• Evaluation Tools: Built-in functions for model evaluation using metrics like RMSE and
MAE.
Installation
Install the Surprise library using pip:
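The package is published on PyPI as scikit-surprise:

pip install scikit-surprise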
3. Similarity Metrics: Cosine similarity, Pearson correlation, and custom distance metrics.
Example: Building a Recommendation System
This example demonstrates how to build a recommendation system using a built-in dataset:
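A minimal sketch using the built-in MovieLens 100k dataset (the algorithm choice here, a user-based k-NN with cosine similarity, is illustrative; the first call downloads the dataset):

from surprise import KNNBasic, Dataset, Reader, accuracy
from surprise.model_selection import train_test_split

# Load the built-in MovieLens 100k dataset
data = Dataset.load_builtin('ml-100k')
trainset, testset = train_test_split(data, test_size=0.25)

# User-based collaborative filtering with cosine similarity
algo = KNNBasic(sim_options={'name': 'cosine', 'user_based': True})
algo.fit(trainset)
predictions = algo.test(testset)
accuracy.rmse(predictions)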
# Alternatively, load data from a custom file (replace 'data.csv' with your file)
reader = Reader(line_format='user item rating', sep=',')
data = Dataset.load_from_file('data.csv', reader=reader)
# ... then split, fit, and predict as above
# Calculate RMSE
accuracy.rmse(predictions)
Applications of Surprise
• Movie Recommendations: Recommending movies based on user ratings (e.g., Netflix).
• E-commerce: Suggesting products based on purchase behavior.
• Content Personalization: Tailoring recommendations for individual users.
Advantages
• Easy to use and well-documented.
• Extensive support for collaborative filtering techniques.
• Optimized for performance.
Limitations
• Primarily focused on collaborative filtering.
• Requires familiarity with Python for effective use.
Matrix Factorization
Matrix Factorization is a popular approach used in recommendation systems, particularly in
collaborative filtering, to predict user-item interactions. It decomposes a large, sparse
interaction matrix (e.g., user-item rating matrix) into smaller matrices, capturing latent
factors that represent user preferences and item characteristics.
Key Concepts
• 1. Interaction Matrix: A matrix where rows represent users, columns represent items,
and values represent interactions (e.g., ratings, clicks, purchases).
• 2. Latent Factors: Hidden dimensions that describe user preferences and item attributes
(e.g., genres for movies).
• 3. Decomposition: The interaction matrix R is decomposed into two matrices: User
Matrix (P) and Item Matrix (Q). R ≈ P × Q^T, where R is reconstructed using the dot
product of P and Q.
Matrix Factorization Techniques
• Singular Value Decomposition (SVD): Decomposes R into three matrices (U, Σ, V^T).
Efficient for dense matrices but struggles with sparse data.
• Alternating Least Squares (ALS): Optimizes one matrix at a time (e.g., fixing P and
solving for Q, then vice versa). Commonly used in large-scale systems.
• Non-Negative Matrix Factorization (NMF): Ensures that values in P and Q are non-
negative. Useful for interactions with non-negative values like counts.
Advantages
• Captures Latent Relationships: Reveals hidden patterns in user preferences and item
attributes.
• Dimensionality Reduction: Reduces the complexity of large interaction matrices.
• Personalized Recommendations: Predicts missing values (e.g., ratings) using user and
item latent factors.
Limitations
• Cold Start Problem: Cannot recommend for new users or items with no prior
interactions.
• Scalability: Large datasets can be computationally expensive.
• Interpretability: Latent factors are abstract and may not have clear meanings.
Example using the Surprise library:
from surprise import SVD, Dataset, accuracy
from surprise.model_selection import train_test_split

# Load dataset
data = Dataset.load_builtin('ml-100k')
trainset, testset = train_test_split(data, test_size=0.25)

# Apply SVD
algo = SVD()
algo.fit(trainset)

# Predict on the test set and calculate RMSE
predictions = algo.test(testset)
accuracy.rmse(predictions)
Conclusion
Matrix Factorization is a powerful technique for recommendation systems, offering
personalized and scalable solutions by leveraging latent factors. However, it requires
adequate data and computational resources to achieve effective results.