Class Notes ML4

Unsupervised learning is a machine learning technique that operates on unlabeled data to discover hidden patterns without human supervision. Key methods include clustering, dimensionality reduction, and anomaly detection, with algorithms like K-Means and PCA being commonly used. While unsupervised learning offers insights into data structure, it faces challenges such as sensitivity to outliers and the need for predefined parameters like the number of clusters.

UNIT 4: UNSUPERVISED LEARNING
UNSUPERVISED LEARNING
Unsupervised learning is a machine learning technique in which developers do not need to supervise the model.
Instead, the model works independently, without any supervision, to discover hidden patterns and information that would otherwise go undetected.
It mainly deals with unlabeled data, whereas supervised learning, as we remember, deals with labeled data.

KEY CHARACTERISTICS
Unlabeled Data: The core characteristic is the absence of predefined labels or categories in the training data.
Pattern Discovery: Algorithms independently identify patterns, relationships, and structures within the data.
Minimal Human Intervention: Unlike supervised learning, it requires no direct instruction or supervision from humans to label the data.
Exploratory Analysis: It is often used as an initial step in data analysis to understand the data's inherent structure before further modeling.
COMMON TECHNIQUES
Clustering: Algorithms group similar data points into clusters based on their inherent similarities, e.g., K-Means.
Dimensionality Reduction: Techniques that reduce the number of features in a dataset while retaining important information, simplifying data complexity, e.g., Principal Component Analysis (PCA).
Anomaly Detection: Identifying unusual or rare data points that deviate significantly from the norm.
WORKING OF UNSUPERVISED LEARNING
1. Collect Unlabeled Data: Gather a dataset without predefined labels or categories. Example: images of various animals without any tags.
2. Select an Algorithm: Choose a suitable unsupervised algorithm, such as clustering (e.g., K-Means) or dimensionality reduction (e.g., PCA), based on the goal.
3. Train the Model on Raw Data: Feed the entire unlabeled dataset to the algorithm. The algorithm looks for similarities, relationships, or hidden structures within the data.
4. Group or Transform Data: The algorithm organizes data into groups (clusters), rules, or lower-dimensional forms without human input. Example: it may group similar animals together or extract key patterns from large datasets.
5. Interpret and Use Results: Analyze the discovered groups, rules, or features to gain insights, or use them for further tasks such as visualization, anomaly detection, or as input for other models. A minimal workflow sketch follows this list.
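As a concrete illustration of this workflow, here is a minimal sketch using scikit-learn's KMeans on synthetic unlabeled data; the feature values and the choice of three clusters are assumptions made purely for illustration.

import numpy as np
from sklearn.cluster import KMeans

# Step 1: unlabeled data (synthetic 2-D points; no class labels are given)
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(50, 2)),
])

# Steps 2-3: choose an algorithm and train it on the raw data
model = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = model.fit_predict(X)          # Step 4: each point is assigned to a cluster

# Step 5: interpret the results
print("Cluster sizes:", np.bincount(labels))
print("Cluster centers:\n", model.cluster_centers_)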
CLUSTERING
Clustering is an unsupervised machine learning technique that groups
unlabeled data into clusters based on similarity.
Its goal is to discover patterns or relationships within the data without
any prior knowledge of categories or labels.
Groups data points that share similar features or characteristics.
Helps find natural groupings in raw, unclassified data.
Commonly used for customer segmentation, anomaly detection and
data organization.
Works purely from the input data without any output labels.
Enables understanding of data structure for further analysis or
decision-making.
The k-means algorithm defines the centroid of a cluster as the mean
value of the points within the cluster.
It proceeds as follows: First, it randomly selects k of the objects in D,
each of which initially represents a cluster mean or center.
For each of the remaining objects, an object is assigned to the cluster to
which it is the most similar, based on the Euclidean distance between the
object and the cluster mean.
The k-means algorithm then iteratively improves the within-cluster
variation. For each cluster, it computes the new mean using the objects
assigned to the cluster in the previous iteration.
All the objects are then reassigned using the updated means as the new
cluster centers.
Iterations continue until the assignment is stable, that is, the clusters
formed in the current round are the same as those formed in the previous
round.
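The iterative procedure just described can be sketched from scratch in a few lines of NumPy. This is a simplified illustration rather than a production implementation; the convergence test simply checks whether the assignments stop changing, as described above.

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain Lloyd's algorithm: random initial centers, assign by Euclidean
    distance, recompute means, stop when the assignments no longer change."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # Assign each object to the cluster whose mean is nearest (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                                   # stable assignment -> converged
        labels = new_labels
        # Recompute each cluster mean from the objects assigned to it
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

# Example: four 2-D points forming two obvious groups (values chosen for illustration)
points = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 8.5]])
centers, labels = kmeans(points, k=2)
print(labels, centers)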
METHODS TO FIND THE BEST VALUE OF K: ELBOW CURVE METHOD
In the Elbow Method, we systematically experiment with different numbers of clusters (K), typically ranging from 1 to 10.
For each K value, we compute the Within-Cluster Sum of Squares (WCSS).
When we plot WCSS against K, the resulting graph resembles an elbow.
As we increase the number of clusters, the WCSS value decreases; it is at its highest when K = 1.
The "elbow" point, where the rate of decrease slows sharply, indicates a good choice for K, as sketched below.
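A minimal sketch of the elbow computation using scikit-learn, where inertia_ gives the WCSS; the data X is assumed to come from the earlier K-Means example.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

wcss = []
for k in range(1, 11):                      # try K = 1 .. 10
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)                # inertia_ = within-cluster sum of squares

plt.plot(range(1, 11), wcss, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("WCSS")
plt.title("Elbow curve")
plt.show()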
METHODS TO FIND THE BEST VALUE OF K: SILHOUETTE SCORE
The silhouette coefficient measures the quality of the clusters by checking how similar a data point is to its own cluster compared to the other clusters.
Silhouette analysis can be used to study the separation between the resulting clusters.
The measure ranges between -1 and 1:
+1 indicates that the data point is far away from the neighboring cluster and thus optimally positioned.
0 indicates that it is on or very close to the decision boundary between two neighboring clusters.
-1 indicates that the data point has likely been assigned to the wrong cluster.
To find an optimal value for the number of clusters K, we use a silhouette plot, which displays how close each point in one cluster is to points in the neighboring clusters and thus provides a visual way to assess parameters like the number of clusters.
WHAT IS THE SILHOUETTE SCORE?
Cohesion (a(i)): how close a data point is to the other points in its own cluster (the mean intra-cluster distance).
Separation (b(i)): how far a data point is from the points in the nearest neighboring cluster (the mean nearest-cluster distance).
The silhouette score of point i is s(i) = (b(i) - a(i)) / max(a(i), b(i)).
The mean silhouette score across all data points in a dataset provides an overall measure of cluster quality:
A higher mean score indicates well-defined and compact clusters.
A lower mean score suggests overlapping or poorly separated clusters.
A short sketch follows.
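A minimal sketch computing the mean silhouette score for several values of K with scikit-learn; again, X is assumed to be the unlabeled data from the earlier example.

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for k in range(2, 7):                       # silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)     # mean s(i) over all points
    print(f"K={k}: mean silhouette score = {score:.3f}")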
K-MEANS SHORTCOMINGS
The k-means algorithm is sensitive to outliers because such objects
are far away from the majority of the data, and thus, when assigned
to a cluster, they can dramatically distort the mean value of the
cluster.
This inadvertently affects the assignment of other objects to clusters.
This effect is particularly exacerbated due to the use of the
squared-error function.
K-MEANS SHORTCOMINGS
The k-means method is not guaranteed to converge to the global optimum and often terminates at a local optimum. The results may depend on the initial random selection of cluster centers. If the initial centroids are poorly positioned, for example if they are clustered too close together, the algorithm may find a suboptimal solution.
To obtain good results in practice, it is common to run the k-means algorithm multiple times with different initial cluster centers.
The k-means method can be applied only when the mean of a set of objects is defined. This may not be the case in some applications, such as when data with nominal attributes are involved.
K-MEANS SHORTCOMINGS
The necessity for users to specify k, the number of clusters, in
advance can be seen as a disadvantage.
The k-means method is not suitable for discovering clusters
with nonconvex shapes or clusters of very different size.
Moreover, it is sensitive to noise and outlier data points
because a small number of such data can substantially
influence the mean value.
K-MEANS NUMERICAL EXAMPLE
The worked example proceeds through the following steps:
1. Find the Euclidean distance of each point with respect to the initial centroids.
2. Assign each point to the cluster with the minimum distance.
3. Find new centroids by averaging the x and y coordinates of the points belonging to each cluster.
4. Find the Euclidean distance of each point with respect to the new centroids and check whether any cluster assignment changes.
5. If any assignment changed, compute new centroids and repeat the distance calculation.
6. If no cluster assignment changes, the algorithm has converged. Stop.
CURSE OF DIMENSIONALITY
There are two different ways to address the curse of dimensionality: feature selection and feature extraction.
WHY DIMENSIONALITY REDUCTION?
To prevent the curse of dimensionality.
To improve the performance of the model.
To visualize the data and thereby understand it better.
FEATURE SELECTION
FEATURE EXTRACTION
If many of the independent features are related to one another (and to the dependent feature), we take the independent features and apply some transformation to extract a smaller set of new features.
PCA
PCA results in a smaller number of uncorrelated variables. This is done by projecting (via dot products) the original data onto the reduced PCA space using the eigenvectors of the covariance/correlation matrix, also known as the principal components (PCs).
The resulting projected data are essentially linear combinations of the original data, capturing most of the variance in the data.
PCA is an orthogonal transformation of the data into a series of uncorrelated variables living in the reduced PCA space, such that the first component explains the most variance in the data, with each subsequent component explaining less.
PCA
PCA is a method of compressing a lot of data into something
that captures the essence of the original data.
PCA takes a dataset with a lot of dimensions and flattens it to 2
or 3 dimensions so we can look at it.
PCA is a variance-maximising technique that projects the
original data onto a direction that maximizes variance.
PCA performs a linear mapping of the original data to a
lower-dimensional space such that the variance of the data in
the low-dimensional representation is maximized.
WHEN/WHY TO USE PCA
The PCA technique is particularly useful in processing data where multicollinearity exists between the features/variables.
Multicollinearity occurs when two or more independent variables in a regression model are highly correlated, affecting the model's reliability.
PCA can be used when the dimensionality of the input features is high (e.g., a lot of variables).
PCA can also be used for denoising and data compression.
PRINCIPAL COMPONENT ANALYSIS (PCA)
Consider two attributes – house size (sq.ft.) and the number of rooms in that
house. These two attributes are highly correlated as bigger houses tend to
have more rooms.
We can see that there is a positive correlation between house size and the
number of rooms. Also, there seems to be more variance in house sizes when
compared to the number of rooms.
Look at the scale of x and y. We can see that these two attributes
use two different scales; hence the above picture is not really
representative.
Since PCA can be biased toward features measured on larger scales, it is important to evaluate whether the data needs to be standardized. After standardization, each attribute has a mean of zero and a standard deviation of one.
We can see that the data is now centered around the origin because both
standardized attributes have a mean of 0. At the same time, it is now
easier to visually compare the spread (variance) of the two attributes. As
previously speculated, there is indeed a wider spread in house sizes than
the number of rooms.
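A minimal sketch of this standardization step, assuming a hypothetical two-column dataset of house sizes and room counts (the numbers are invented for illustration):

import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: [house size in sq. ft., number of rooms]
X_house = np.array([[1200, 3], [1500, 3], [1800, 4], [2400, 5], [3000, 6]], dtype=float)

X_std = StandardScaler().fit_transform(X_house)   # each column now has mean 0, std 1
print(X_std.mean(axis=0).round(3))                # ~[0, 0]
print(X_std.std(axis=0).round(3))                 # ~[1, 1]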
FINDING PRINCIPAL COMPONENTS
Now that we have the data standardized and centered around the
origin, we can look for the "best-fit" line. This line must go
through the origin, and we can find it by minimizing the
distances from data points to their projections on the line. It
looks like this:
Interestingly, this "best-fit" line also happens to be the axis for Principal
Component 1 (PC1). This is because minimizing the distances from the
line also maximizes the spread of data point projections on that
same line.
In other words, we have found a new axis that captures the maximum
amount of variance of the data in that dimension.
Principal Component Analysis is not just about dimensionality reduction: we can find as many principal components as there are attributes (dimensions) in our data.
PC1 would be the dimension that captures the highest proportion of the
data variance, with PC2 being the dimension that captures the highest
proportion of the remaining variance that PC1 could not capture.
Similarly, PC3 would be the dimension capturing the highest proportion
of the remaining variance that PC1 and PC2 could not capture, etc.
COMPUTE THE COVARIANCE MATRIX TO IDENTIFY CORRELATIONS
Covariance (cov) measures how strongly two or more variables vary together.
The covariance matrix summarizes the covariances of all pairwise combinations of the initial variables in the dataset.
It is a symmetric d x d matrix, where d is the number of dimensions (variables).
The sign of each entry in the matrix tells us how the corresponding pair of variables is related:
Positive: the variables are correlated and increase or decrease together.
Negative: the variables are inversely correlated, meaning that one decreases while the other increases.
Zero: the variables are not related to each other.
COMPUTE THE EIGENVECTORS AND EIGENVALUES OF THE COVARIANCE MATRIX
Here, we calculate the eigenvectors (principal components) and eigenvalues of the covariance matrix. As eigenvectors, the principal components represent the directions of maximum variance in the data.
The eigenvalues represent the amount of variance captured by each component.
Ranking the eigenvectors by their eigenvalues gives the order of the principal components.
SELECT THE PRINCIPAL COMPONENTS
Here, we decide which components to keep and those to
discard.
Components with low eigenvalues typically will not be
as significant.
Scree plots usually plot the proportion of total variance
explained and the cumulative proportion of variance.
These metrics help one to determine the optimal number
of components to retain.
The point at which the curve of eigenvalues (or total variance explained) forms an "elbow" generally indicates how many principal components we want to include.
TRANSFORM THE DATA INTO THE NEW COORDINATE SYSTEM
Finally, the data is transformed into the new coordinate
system defined by the principal components.
That is, the feature vector created from the eigenvectors
of the covariance matrix projects the data onto the new
axes defined by the principal components.
This creates new data, capturing most of the information
but with fewer dimensions than the original dataset.
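Putting the covariance, eigendecomposition, and projection steps together, here is a minimal from-scratch sketch in NumPy. The standardized matrix X_std is assumed to come from the earlier house example, and keeping two components is an arbitrary choice for illustration.

import numpy as np

# Covariance matrix of the standardized data (d x d, symmetric)
cov = np.cov(X_std, rowvar=False)

# Eigenvectors (principal components) and eigenvalues of the covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)      # eigh: for symmetric matrices

# Rank components by eigenvalue, largest first
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Keep the top k components and project the data onto the new axes
k = 2
components = eigvecs[:, :k]
X_pca = X_std @ components                  # transformed data in the new coordinate system

print("Explained variance ratio:", (eigvals[:k] / eigvals.sum()).round(3))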
INTERPRETING PCA RESULTS
A PCA plot is a scatter plot created by using the first two principal
components as axes.
The first principal component (PC1) is the x-axis, and the second
principal component (PC2) is the y-axis.
The scatter plot shows the relationships between observations (data
points) and the new variables (the principal components).
The position of each point shows the values of PC1 and PC2 for that
observation.
The direction and length of the plot arrows indicate the loadings of
the variables, that is, how each variable contributes to the principal
components. If a variable has a high loading for a particular
component, it is strongly correlated with that component. This can
highlight which variables have a significant impact on data
variations.
INTERPRETING PCA RESULTS
The number of principal components that remain after applying
PCA can help you interpret the data output.
The first principal component explains the most data variance,
and each later component accounts for less variance.
Thus, the number of components can indicate the amount of
information retained from the original dataset.
Fewer components after applying PCA could mean that you
didn’t capture much data variation.
More components indicate more data variation, but the results
may be harder to interpret.
You can decide the optimal number of components to retain
using either a scree plot or the cumulative explained variance.
SCREE PLOT
A PCA scree plot is a line graph plotting the eigenvalues
(y-axis) against the principal component (PC) number (x-axis)
to determine how many PCs to retain in a Principal
Component Analysis (PCA).
The plot displays a downward curve, and the "elbow" where
the curve flattens out indicates the point where components
should be selected, as earlier components capture more
variance.
A common rule of thumb (the Kaiser criterion) is to retain components with eigenvalues ≥ 1.
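A minimal scree-plot sketch using scikit-learn's PCA; X_std is again assumed to be the standardized data from earlier.

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

pca = PCA().fit(X_std)                       # fit with all components
var_ratio = pca.explained_variance_ratio_

plt.plot(range(1, len(var_ratio) + 1), var_ratio, marker="o", label="per component")
plt.plot(range(1, len(var_ratio) + 1), var_ratio.cumsum(), marker="s", label="cumulative")
plt.xlabel("Principal component number")
plt.ylabel("Proportion of variance explained")
plt.title("Scree plot")
plt.legend()
plt.show()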
PCA-DISADVANTAGES
The run-time of PCA is cubic in relation to the number of dimensions of the data.
This can be computationally expensive at times for large datasets.
PCA transforms the original input variables into new principal components (or
dimensions). The new dimensions offer no interpretability.
While PCA simplifies the data and removes noise, it always leads to some loss of
information when we reduce dimensions.
PCA is a linear dimensionality reduction technique, but not all real-world datasets
may be linear.
PCA gets affected by outliers. This can distort the principal components and
affect the accuracy of the results.
PCA-ADVANTAGES
By reducing the data to two dimensions, you can easily visualize
it.
PCA removes multicollinearity.
PCA removes noise. By reducing the number of dimensions in
the data, PCA can help remove noisy and irrelevant features.
PCA reduces model parameters: PCA can help reduce the
number of parameters in machine learning models.
PCA reduces model training time. By reducing the number of
dimensions, PCA simplifies the calculations involved in a model,
leading to faster training times.
INDEPENDENT COMPONENT ANALYSIS
ICA is a powerful tool for separating mixed signals into their
independent components. This is useful in a variety of
applications, such as signal processing, image analysis, and data
compression.
ICA is a non-parametric approach, which means that it does not
require assumptions about the underlying probability
distribution of the data.
ICA is an unsupervised learning technique.
ICA can be used for feature extraction, which means that it can
identify important features in the data that can be used for other
tasks, such as classification.
INDEPENDENT COMPONENT ANALYSIS
PCA is about finding correlations between variables by maximizing variance; it reconstructs new variables from the originals.
ICA is about maximizing independence. It tries to find a linear transformation of the feature space into a new feature space such that each of the individual new features is mutually independent.
It maps X1 to Y1, X2 to Y2, ..., Xi to Yi such that each of the new features is statistically independent of the others: the mutual information I(Yi, Yj) = 0. This is achieved through a linear transformation.
At the same time, the mutual information between the set of new features Y and the original feature space X is kept as high as possible, so that we can reconstruct the data, predicting X from Y and Y from X, while ensuring that each new dimension is mutually independent of the others.
ICA
Typical natural causes have rich statistics (more complex than Gaussian). Blind source separation decomposes observed signal mixtures into independent and maximally non-Gaussian sources (causes).
Statistical dependencies may remain even after signals have been whitened (scaled to zero mean and unit variance) and decorrelated (e.g., by Principal Component Analysis).
ICA decomposes signals so as to eliminate these remaining dependencies (in higher moments), maximizing both independence and non-Gaussianity.
INDEPENDENT COMPONENT ANALYSIS
ICA was designed to solve the blind source separation problem.
SIZES OF THE DIFFERENT MATRICES IN ICA
Suppose there are 3 speakers (p = 3) recorded over 4 time steps (N = 4). The source matrix S holds one row per speaker:

S = [ S11 S12 S13 S14
      S21 S22 S23 S24
      S31 S32 S33 S34 ]          S is p x N = 3 x 4

The signals are recorded by 3 microphones (n = 3), so the mixing matrix A has one row per microphone:

A = [ a b c
      d e f
      g h i ]                    A is n x p = 3 x 3

Each microphone hears a weighted mixture of the sources: the first row of X is a*S1 + b*S2 + c*S3, the second is d*S1 + e*S2 + f*S3, and the third is g*S1 + h*S2 + i*S3.

X = A S                          X is n x N = 3 x 4
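As a concrete blind-source-separation sketch, the following uses scikit-learn's FastICA on synthetic mixed signals; the source signals, the mixing matrix, and the number of components are all invented for illustration.

import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)

# Three independent, non-Gaussian source signals (p = 3 "speakers")
S = np.c_[np.sin(2 * t),                        # sinusoid
          np.sign(np.sin(3 * t)),               # square wave
          rng.laplace(size=t.shape)]            # noisy Laplacian signal

# Mix them with an arbitrary 3x3 mixing matrix A: X = S A^T (n = 3 "microphones")
A = np.array([[1.0, 0.5, 1.5],
              [0.5, 2.0, 1.0],
              [1.5, 1.0, 2.0]])
X_mixed = S @ A.T

# Recover the independent components from the mixtures alone
ica = FastICA(n_components=3, random_state=0)
S_est = ica.fit_transform(X_mixed)              # estimated sources (up to scale and order)
print("Estimated mixing matrix:\n", ica.mixing_)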
MATRIX FACTORIZATION
Used for recommendation systems
MATRIX FACTORIZATION
A realistic ratings matrix contains dependencies: for example, rows 1 and 3 may be identical, columns 1 and 4 may be identical, row 4 may depend on rows 2 and 3 (user 4 loves both types of movies), and column 5 may be the average of columns 2 and 3.
In real life there are relations between rows and columns, and these dependencies can be used to guess missing ratings.

How to figure out several


dependencies at the same
time? Matrix factorization
MATRIX FACTORIZATION
Mathematically speaking, given a large N x M matrix, matrix factorization decomposes it into multiple simpler matrices such that their matrix product reproduces (approximately) the original N x M matrix.
This saves space: for a rank-k factorization, we store N*k + k*M numbers instead of N*M.
SPACE SAVING USING MATRIX FACTORIZATION
HOW TO FIND THE RIGHT FACTORS
Hit and trial.
Gradient descent: we start with some random values in the factor matrices. To decrease the error between the product of the factors and the known ratings, we use gradient descent: compute the derivative of the error and repeatedly apply updates to the entries. See the sketch after this list.
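A minimal sketch of factorizing a small ratings matrix by gradient descent; the matrix values, the rank k, the learning rate, and the iteration count are all assumptions chosen for illustration.

import numpy as np

# Small ratings matrix; 0 marks a missing (unrated) entry
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)
mask = R > 0                                  # only known ratings contribute to the error

N, M, k = R.shape[0], R.shape[1], 2
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(N, k))        # user factors, start with random values
V = rng.normal(scale=0.1, size=(M, k))        # item factors

lr = 0.01
for _ in range(5000):
    E = mask * (R - U @ V.T)                  # error on the observed entries only
    dU = E @ V                                # gradient direction for user factors
    dV = E.T @ U                              # gradient direction for item factors
    U += lr * dU
    V += lr * dV

print(np.round(U @ V.T, 2))                   # reconstructed matrix; missing cells are now predicted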
HOW TO PREDICT RATINGS
CONTENT BASED FILTERING
Content-based filtering is an ML technique that uses similarities in features to make decisions.
It is used in recommender systems, based on knowledge accumulated about the user.
The method compares user interests to product features; the products whose features overlap most with the user's interests are recommended.
The users are given a list of features to choose from, indicating which they find the most significant.
Alternatively, the algorithm keeps track of the products the user has chosen previously and adds those features to the user's data.
CONTENT BASED FILTERING
1. In the example, on the left-hand side we have individual preferences: 4 individuals have been asked to provide a rating on safety and mileage. If an individual likes a feature, we assign the value 1, and if they do not like that particular feature, we assign 0. We can see that Persons A and C prefer safety, Person B chooses mileage, and Person D opts for both safety and mileage.
2. Cars are rated based on the features (items) they offer. A rating of 4 implies high features, and 1 depicts fewer features.
CONTENT BASED FILTERING
• B is rated high on mileage and low on safety.
• C has an average rating and offers both safety and mileage.
• D is low on both mileage and safety.
The blue "?" mark is a sparse value: either the person does not know about the car, the car is not part of their consideration list for buying, or they have forgotten to rate it.
How was the matrix at the center arrived at? This matrix represents the overall rating of all 4 cars given by the individuals. Person A has given an overall rating of 4 to Car A, 1 to Cars B and D, and 2 to Car C.
MATRIX FACTORIZATION
These ratings were arrived at through the following calculations (see the sketch after this list):
For Car A: (1 safety preference x 4 safety features) + (0 mileage preference x 1 mileage feature) = 4
For Car B: (1 safety preference x 1 safety feature) + (0 mileage preference x 4 mileage features) = 1
For Car C: (1 safety preference x 2 safety features) + (0 mileage preference x 2 mileage features) = 2
For Car D: (1 safety preference x 1 safety feature) + (0 mileage preference x 2 mileage features) = 1
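These per-car scores are just the dot product of each person's preference vector with each car's feature vector. A minimal sketch, assuming the 0/1 preferences and the car feature ratings described in the text above:

import numpy as np

# Rows: Persons A-D; columns: [safety preference, mileage preference] (1 = likes, 0 = does not)
preferences = np.array([[1, 0],     # A prefers safety
                        [0, 1],     # B prefers mileage
                        [1, 0],     # C prefers safety
                        [1, 1]])    # D prefers both

# Rows: Cars A-D; columns: [safety feature rating, mileage feature rating]
features = np.array([[4, 1],
                     [1, 4],
                     [2, 2],
                     [1, 2]])

# Overall rating = dot product of preference vector and feature vector
ratings = preferences @ features.T
print(ratings)          # first row reproduces Person A's ratings: [4, 1, 2, 1]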
MATRIX FACTORIZATION
On the basis of the above calculations, we can predict the overall rating for each person and all the cars.
If Person C asks the search engine to recommend available car options, then, based on the content available and his preferences, the machine will compute the corresponding table:
• It will rank cars based on overall rating.
• It will recommend Car A, followed by Car C.
MATHEMATICS BEHIND THE RECOMMENDATIONS MADE USING CONTENT-BASED FILTERING
In the above example, we had two matrices, that is, individual preferences and car features, and 4 observations to enable the comparison. Now, if there are n observations in both vectors a and b, then:
Dot Product
The dot product of two length-n vectors a and b is defined as:
a . b = a1*b1 + a2*b2 + ... + an*bn
The dot product is only defined for vectors of the same length, which means there should be an equal number of n observations.
MATHEMATICS BEHIND THE RECOMMENDATIONS MADE USING CONTENT-BASED FILTERING
Cosine
There are alternative approaches or algorithms available to solve a problem. Cosine similarity is one approach used to measure similarities; it is used to determine the nearest user to provide recommendations. In our car example above, the search engine can make such a recommendation using cosine similarity. It measures the similarity between two non-zero vectors. The similarity between two vectors a and b is the ratio between their dot product and the product of their magnitudes:
cos(a, b) = (a . b) / (|a| * |b|)
By this definition, the similarity equals 1 if the two vectors are identical and 0 if the two are orthogonal. For non-negative data such as ratings, similarity is therefore a number ranging from 0 to 1 that tells us the extent of similarity between the two vectors: closer to 1 is highly similar and closer to 0 is dissimilar. A small sketch follows.
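A minimal sketch of cosine similarity between two rating vectors; the example vectors are taken from the collaborative-filtering example later in these notes, so the result can be checked against it.

import numpy as np

def cosine_similarity(a, b):
    """Ratio of the dot product to the product of the vector magnitudes."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

u1 = np.array([5, 4, 2, 1])
u2 = np.array([5, 3, 1, 1])
print(round(cosine_similarity(u1, u2), 4))   # ~0.9829 -> the two rating vectors are very similar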
EXAMPLE OF CONTENT-BASED FILTERING
Suppose you’re subscribing to Netflix, and there are only
six movies on that platform: Batman Begins, The Dark
Knight, Twilight, Toy Story, Love Actually, and Up. Out
of these six movies, you’ve seen three of them, and you
give each of them a rating score that ranges between 1 to
10, with 1 being the worst and 10 being the best.
The procedure (originally shown as a series of tables) is:
1. User matrix: the ratings you gave to the movies you have watched.
2. Item matrix: each movie is described by its features, usually binary encoded.
3. Compute the weighted item matrix based on the movie ratings in the user matrix, giving a user-item matrix.
4. Sum the weighted score of each feature, then normalize the scores to obtain the user profile.
5. Take the dot product between the normalized scores (the user profile) and each unwatched movie's feature vector.
6. Repeat this step for every movie that has not been watched yet to obtain the final recommendation scores.

Thus, the content-based filtering will then recommend The


Dark Knight, followed by Up and Love Actually.
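A minimal sketch of this weighted-profile procedure. The movie features (genres) and the ratings for the three watched movies are invented for illustration, since the original tables are not reproduced here; only the list of movie titles comes from the notes.

import numpy as np

movies = ["Batman Begins", "The Dark Knight", "Twilight", "Toy Story", "Love Actually", "Up"]
# Hypothetical binary item matrix: columns = [action, drama, romance, animation]
item = np.array([[1, 1, 0, 0],      # Batman Begins
                 [1, 1, 0, 0],      # The Dark Knight
                 [0, 1, 1, 0],      # Twilight
                 [0, 0, 0, 1],      # Toy Story
                 [0, 1, 1, 0],      # Love Actually
                 [0, 1, 0, 1]])     # Up

# Hypothetical ratings for the three watched movies (1-10 scale)
watched = {"Batman Begins": 9, "Twilight": 3, "Toy Story": 8}

# Build the user profile: weight each watched movie's features by its rating, sum, normalize
profile = np.zeros(item.shape[1], dtype=float)
for title, rating in watched.items():
    profile += rating * item[movies.index(title)]
profile /= profile.sum()

# Score every unwatched movie by the dot product of its features with the profile
for i, title in enumerate(movies):
    if title not in watched:
        print(title, round(float(item[i] @ profile), 3))
# With these assumed values the ranking is The Dark Knight > Up > Love Actually,
# matching the order stated in the notes.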
CONTENT-BASED FILTERING: ADVANTAGES AND DISADVANTAGES
Advantages
The model doesn't need any data about other users, since the recommendations
are specific to this user. This makes it easier to scale to a large number of
users.
The model can capture the specific interests of a user, and can recommend
niche items that very few other users are interested in.
Disadvantages
Since the feature representation of the items is hand-engineered to some extent (experts must manually define the characteristics, or features, of items, such as a movie's genre, director, or actors, to represent them for the system), this technique requires a lot of domain knowledge. Therefore, the model can only be as good as the hand-engineered features.
The model can only make recommendations based on the existing interests of the user. In other words, the model has limited ability to expand on the users' existing interests.
COLLABORATIVE FILTERING
Collaborative filtering recommends items based on the preferences of similar users, while content-based filtering recommends items similar to what a user has liked in the past, based on item attributes.
Collaborative filtering requires data from many users to find patterns, whereas content-based filtering only needs a user's history and the item's content.
Collaborative filtering is highly memory intensive, as it works on an N x M user-item matrix.
Only the popular items tend to be recommended again and again: if a movie is good but few people have rated it, it is less likely to be recommended.
New items might not get recommended at all.
USER-BASED COLLABORATIVE FILTERING
This technique predicts the items that a user might like on the basis of the ratings given to those items by other users who have similar taste to the target user. Many websites use collaborative filtering for building their recommendation systems.
Steps for User-Based Collaborative Filtering:
Step 1: Find the similarity of users to the target user U. The similarity of any two users a and b is calculated (as the cosine similarity over the items p rated by both) using the formula
sim(a, b) = sum_p(r_ap * r_bp) / (sqrt(sum_p(r_ap^2)) * sqrt(sum_p(r_bp^2)))
where r_up is the rating of user u for item p.
USER-BASED COLLABORATIVE FILTERING
Step 2: Predict the missing rating of an item. The target user might be very similar to some users and not very similar to others. Hence, the ratings given to a particular item by more similar users should be given more weight than those given by less similar users. This problem is solved with a weighted-average approach: multiply the rating of each user by the similarity factor calculated using the formula above. The missing rating is then calculated as
predicted rating of user U for item p = sum over users a of (sim(U, a) * r_ap) / sum over users a of sim(U, a)
EXAMPLE
Consider a matrix that shows four users Alice, U1, U2
and U3 rating on different news apps. The rating range is
from 1 to 5 on the basis of users' likability of the news
app. The '?' indicates that the user has not rated the app.
Assume we have 3 users and 8 shows. A blank means that the user has not watched that show. Out of the shows that have not been watched, which one should we recommend?
Case 1: Take the average. For example, user 1 has not watched show 4, but users 2 and 3 have, so take the average across all users. Both shows 4 and 5 have an average rating of 3.5, so we still do not know which one to recommend.
Now, take into account similarities between users. We only include pairs of ratings where both users have reviewed the item.
Co-rated items of users 1 and 2:   U1: 5 4 2 1   U2: 5 3 1 1
Users 1 and 2 rank items very similarly; this means that if user 2 likes something, user 1 might also like it.
Co-rated items of users 1 and 3:   U1: 5 1 4 2 1   U3: 1 4 2 5 4
Users 1 and 3 rank items very differently.
S12 = similarity(U1, U2) = 0.9829
S13 = similarity(U1, U3) = 0.5742
Estimated rating = (S12 * 2 + S13 * 5) / (S12 + S13)

Collaborative filtering does rely on your matrix not being too sparse. If nobody
is rating, then, it can be done. Popularity bias, scalability , cold start for new
items are some other challenges.
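A minimal sketch of the two steps above, using only the co-rated values quoted in the example (cosine similarity over co-rated items, then a similarity-weighted average; the ratings 2 and 5 in the final line are the ones given by users 2 and 3 to the item being predicted, as in the example):

import numpy as np

def sim(a, b):
    """Cosine similarity over the items rated by both users."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Co-rated ratings from the example above
s12 = sim([5, 4, 2, 1], [5, 3, 1, 1])        # ~0.983
s13 = sim([5, 1, 4, 2, 1], [1, 4, 2, 5, 4])  # ~0.574

# Weighted-average prediction: user 2 rated the item 2, user 3 rated it 5
estimated = (s12 * 2 + s13 * 5) / (s12 + s13)
print(round(s12, 3), round(s13, 3), round(estimated, 2))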