UNIT II - Recommender Systems
High-level architecture of content-based systems - Item profiles, Representing item profiles, Methods
for learning user profiles, Similarity-based retrieval, and Classification algorithms.
Content-based Information Filtering (IF) systems need proper techniques for representing the items
and producing the user profile, and some strategies for comparing the user profile with the item
representation. The high-level architecture of a content-based recommender system is depicted in
the figure. The recommendation process is performed in three steps, each of which is handled by a
separate component:
CONTENT ANALYZER – When information has no structure (e.g. text), some kind of pre-
processing step is needed to extract structured relevant information. The main responsibility of the
component is to represent the content of items (e.g. documents, Web pages, news, product
descriptions, etc.) coming from information sources in a form suitable for the next processing steps.
Data items are analyzed by feature extraction techniques in order to shift item representation from the
original information space to the target one (e.g. Web pages represented as keyword vectors). This
representation is the input to the PROFILE LEARNER and FILTERING COMPONENT;
PROFILE LEARNER – This module collects data representative of the user preferences and tries to
generalize this data, in order to construct the user profile. Usually, the generalization strategy is
realized through machine learning techniques [61], which are able to infer a model of user interests
starting from items liked or disliked in the past. For instance, the PROFILE LEARNER of a Web
page recommender can implement a relevance feedback method [75] in which the learning technique
combines vectors of positive and negative examples into a prototype vector representing the user
profile. Training examples are Web pages on which positive or negative feedback has been provided
by the user;
FILTERING COMPONENT – This module exploits the user profile to suggest relevant items by
matching the profile representation against that of items to be recommended. The result is a binary or
continuous relevance judgment (computed using some similarity metrics [42]), the latter case
resulting in a ranked list of potentially interesting items. In the above-mentioned example, the
matching is realized by computing the cosine similarity between the prototype vector and the item
vectors.
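To make this matching step concrete, the following is a minimal Python sketch (an illustration, not the implementation of any system cited above) of a Rocchio-style relevance feedback learner: keyword vectors of liked and disliked pages are combined into a prototype vector, and candidate items are ranked by their cosine similarity to it. The vocabulary, the vectors, and the weights alpha and beta are assumptions made only for this example.

import numpy as np

def build_prototype(positive, negative, alpha=1.0, beta=0.5):
    # Combine keyword vectors of liked and disliked items into a single
    # prototype vector (a Rocchio-style relevance feedback rule).
    return alpha * np.mean(positive, axis=0) - beta * np.mean(negative, axis=0)

def cosine_similarity(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b) / denom if denom else 0.0

# Illustrative keyword vectors over a tiny three-word vocabulary.
liked    = np.array([[1.0, 0.8, 0.0], [0.9, 1.0, 0.1]])
disliked = np.array([[0.0, 0.1, 1.0]])
profile  = build_prototype(liked, disliked)

candidates = {"page_a": np.array([0.8, 0.9, 0.0]),
              "page_b": np.array([0.1, 0.0, 1.0])}
ranking = sorted(candidates.items(),
                 key=lambda kv: cosine_similarity(profile, kv[1]),
                 reverse=True)
print(ranking)   # page_a ranks above page_b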
The first step of the recommendation process is the one performed by the CONTENT ANALYZER,
that usually borrows techniques from Information Retrieval systems [80, 6]. Item descriptions coming
from Information Source are processed by the CONTENT ANALYZER, that extracts features
(keywords, n-grams, concepts, . . . ) from unstructured text to produce a structured item
representation, stored in the repository Represented Items.
In order to construct and update the profile of the active user ua (the user for whom recommendations
must be provided), her reactions to items are collected in some way and recorded in the repository
Feedback. These reactions, called annotations [39] or feedback, together with the related item
descriptions, are exploited during the process of learning a model useful to predict the actual
relevance of newly presented items. Users can also explicitly define their areas of interest as an initial
profile without providing any feedback.
Typically, it is possible to distinguish between two kinds of relevance feedback: positive information
(inferring features liked by the user) and negative information (i.e., inferring features the user is not
interested in [43]).
Two different techniques can be adopted for recording user’s feedback. When a system requires the
user to explicitly evaluate items, this technique is usually referred to as “explicit feedback”; the other
technique, called “implicit feedback”, does not require any active user involvement, in the sense that
feedback is derived from monitoring and analyzing user’s activities.
Explicit evaluations indicate how relevant or interesting an item is to the user [74]. There are three
main approaches to get explicit relevance feedback:
• like/dislike – items are classified as “relevant” or “not relevant” by adopting a simple binary rating
scale, such as in [12];
• ratings – a discrete numeric scale is usually adopted to judge items, such as in [86]. Alternatively,
symbolic ratings are mapped to a numeric scale, such as in Syskill & Webert [70], where users have
the possibility of rating a Web page as hot, lukewarm, or cold;
• text comments – Comments about a single item are collected and presented to the users as a means
of facilitating the decision-making process, such as in [72]. For instance, customers' feedback at
Amazon.com or eBay.com might help users in deciding whether an item has been appreciated by the
community. Textual comments are helpful, but they can overload the active user because she must
read and interpret each comment to decide if it is positive or negative, and to what degree. The
literature proposes advanced techniques from the affective computing research area [71] to make
content-based recommenders able to automatically perform this kind of analysis.
Explicit feedback has the advantage of simplicity, although the adoption of numeric/symbolic scales
increases the cognitive load on the user and may not be adequate for capturing the user's feelings about
items. Implicit feedback methods are based on assigning a relevance score to specific user actions on
an item, such as saving, discarding, printing, bookmarking, etc. The main advantage is that they do
not require direct user involvement, even though bias is likely to occur, e.g. when the user is
interrupted by a phone call while reading.
In order to build the profile of the active user ua, the training set TRa for ua must be defined. TRa is a
set of pairs (Ik, rk), where rk is the rating provided by ua on the item representation Ik. Given a set of
item representations labeled with ratings, the PROFILE LEARNER applies supervised learning
algorithms to generate a predictive model – the user profile – which is usually stored in a profile
repository for later use by the FILTERING COMPONENT. Given a new item representation, the
FILTERING COMPONENT predicts whether it is likely to be of interest for the active user, by
comparing features in the item representation to those in the representation of user preferences (stored
in the user profile). Usually, the FILTERING COMPONENT implements some strategies to rank
potentially interesting items according to the relevance with respect to the user profile. Top-ranked
items are included in a list of recommendations La, which is presented to ua. User tastes usually change
in time, therefore up-to-date information must be maintained and provided to the PROFILE
LEARNER in order to automatically update the user profile. Further feedback is gathered on
generated recommendations by letting users state their satisfaction or dissatisfaction with items in La.
After gathering that feedback, the learning process is performed again on the new training set, and the
resulting profile is adapted to the updated user interests. The iteration of the feedback-learning cycle
over time allows the system to take
into account the dynamic nature of user preferences.
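The notation above can be illustrated with a small sketch. It assumes, purely for illustration, that the item representations Ik are keyword-count vectors, that the ratings rk are binary like/dislike labels, and that the PROFILE LEARNER uses a naive Bayes classifier from scikit-learn; any supervised learning algorithm could play the same role.

import numpy as np
from sklearn.naive_bayes import MultinomialNB

# TRa: pairs (Ik, rk) -- item representations with the ratings given by ua.
# Here the Ik are illustrative keyword-count vectors, rk are like/dislike labels.
I = np.array([[3, 0, 1, 0],
              [2, 1, 0, 0],
              [0, 0, 2, 3],
              [0, 1, 1, 2]])
r = np.array([1, 1, 0, 0])            # 1 = liked, 0 = disliked

profile = MultinomialNB().fit(I, r)   # PROFILE LEARNER induces the user profile

# FILTERING COMPONENT: score new items and rank them to build La.
new_items = np.array([[2, 0, 0, 1],
                      [0, 2, 1, 3]])
scores = profile.predict_proba(new_items)[:, 1]   # estimated P(liked | item)
La = np.argsort(-scores)              # item indices, most relevant first
print(scores, La)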
The adoption of the content-based recommendation paradigm has several advantages when compared
to the collaborative one:
USER INDEPENDENCE - Content-based recommenders exploit solely ratings provided by the
active user to build her own profile. Instead, collaborative filtering methods need ratings from other
users in order to find the “nearest neighbors” of the active user, i.e., users that have similar tastes
since they rated the same items similarly. Then, only the items that are most liked by the neighbors of
the active user will be recommended;
TRANSPARENCY - Explanations on how the recommender system works can be provided by
explicitly listing content features or descriptions that caused an item to occur in the list of
recommendations. Those features are indicators to consult in order to decide whether to trust a
recommendation. Conversely, collaborative systems are black boxes since the only explanation for an
item recommendation is that unknown users with similar tastes liked that item;
NEW ITEM - Content-based recommenders are capable of recommending items not yet rated by any
user. As a consequence, they do not suffer from the first-rater problem, which affects collaborative
recommenders that rely solely on users' ratings to make recommendations: a collaborative system
cannot recommend a new item until it has been rated by a substantial number of users.
Nonetheless, content-based systems have several shortcomings:
LIMITED CONTENT ANALYSIS - Content-based techniques have a natural limit in the number
and type of features that are associated, whether automatically or manually, with the objects they
recommend. Domain knowledge is often needed, e.g., for movie recommendations the system needs
to know the actors and directors, and sometimes, domain ontologies are also needed. No content-
based recommendation system can provide suitable suggestions if the analyzed content does not
contain enough information to discriminate items the user likes from items the user does not like.
Some representations capture only certain aspects of the content, but there are many others that would
influence a user's experience. For instance, word frequencies often do not carry enough information to
model a user's interest in jokes or poems, whereas techniques from affective computing would be more
appropriate. Likewise, for Web pages, feature extraction techniques applied to text completely ignore
aesthetic qualities and additional multimedia information. To sum up, both automatic and manual
assignment of features to items may not be sufficient to define the distinguishing aspects of items that
are necessary for the elicitation of user interests.
NEW USER - Enough ratings have to be collected before a content-based recommender system can
really understand user preferences and provide accurate recommendations. Therefore, when few
ratings are available, for a new user, the system will not be able to provide reliable recommendations.
In the following, some strategies for tackling the above-mentioned problems will be presented and
discussed. More specifically, novel techniques for enhancing the content representation using
common-sense and domain-specific knowledge will be described (Sections 3.3.1.3-3.3.1.4). This may
help to overcome the limitations of traditional content analysis methods by providing new features,
such as WordNet [60, 32] or Wikipedia concepts, which help to represent the items to be
recommended in a more accurate and transparent way. Moreover, the integration of user-defined
lexicons, such as folksonomies, in the process of generating recommendations will be presented in
Section 3.4.1, as a way for taking into account evolving vocabularies.
Possible ways to feed users with serendipitous recommendations, that is to say, interesting items with
a high degree of novelty, will be analyzed as a solution to the over-specialization problem (Section
3.4.2).
Finally, different strategies for overcoming the new user problem will be presented. Among them,
social tags provided by users in a community can be exploited as feedback on which
recommendations are produced when few or no ratings for a specific user are available to the system
(Section 3.4.1.1).
Item Profiles
In a content-based system, we must construct for each item a profile, which is a record or collection
of records representing important characteristics of that item. In simple cases, the profile consists of
some characteristics of the item that are easily discovered. For example, consider the features of a
Movie that might be relevant to a recommendation system.
1. The set of actors of the movie. Some viewers prefer movies with their favorite actors.
2. The director. Some viewers have a preference for the work of certain directors.
3. The year in which the movie was made. Some viewers prefer old movies; others watch only the
latest releases.
4. The genre or general type of movie. Some viewers like only comedies, others dramas or romances.
There are many other features of movies that could be used as well. Except for the last, genre, the
information is readily available from descriptions of movies. Genre is a vaguer concept. However,
movie reviews generally assign a genre from a set of commonly used terms. For example, the Internet
Movie Database (IMDB) assigns a genre or genres to every movie. Many other classes of items also
allow us to obtain features from available data, even if that data must at some point be entered by
hand. For instance, products often have descriptions written by the manufacturer, giving features
relevant to that class of
product (e.g., the screen size and cabinet color for a TV). Books have descriptions similar to those for
movies, so we can obtain features such as author, year of publication, and genre. Music products such
as CDs and MP3 downloads have available features such as artist, composer, and genre.
Unfortunately, other classes of items, such as collections of documents (news articles or blog posts,
for example), do not tend to have readily available information giving features. A substitute that has
been useful in practice is the identification of words that
characterize the topic of a document. How we do the identification was outlined in Section 1.3.1.
First, eliminate stop words – the several hundred most common words, which tend to say little about
the topic of a document. For the remaining words, compute the TF.IDF score for each word in the
document. The ones with the highest scores are the words that characterize the document.
We may then take as the features of a document the n words with the highest TF.IDF scores.
It is possible to pick n to be the same for all documents, or to let n be a fixed percentage of the words
in the document. We could also choose to include in the feature set all words whose TF.IDF scores are
above a given threshold.
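A minimal sketch of this feature-selection procedure, assuming scikit-learn's TfidfVectorizer with its built-in English stop-word list; the three documents and the choice n = 3 are illustrative only.

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

docs = ["the striker scored twice in the final minutes of the match",
        "the central bank raised interest rates to curb inflation",
        "the film festival opened with a documentary about glaciers"]

# Stop words are removed, then every remaining word gets a TF.IDF score per document.
vec = TfidfVectorizer(stop_words="english")
tfidf = vec.fit_transform(docs)
words = np.array(vec.get_feature_names_out())

n = 3   # keep the n highest-scoring words of each document as its features
for row in tfidf.toarray():
    print(words[np.argsort(-row)[:n]])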
Now, documents are represented by sets of words. Intuitively, we expect these words to
express the subjects or main ideas of the document. For example, in a news article, we would expect
the words with the highest TF.IDF score to include the names of people discussed in the article,
unusual properties of the event described, and the location of the event. To measure the similarity of
two documents, there are several natural distance measures we can use:
1. We could use the Jaccard distance between the sets of words (recall Section 3.5.3).
2. We could use the cosine distance (recall Section 3.5.4) between the sets, treated as vectors.
To compute the cosine distance in option (2), think of the sets of high TF.IDF words as a
vector, with one component for each possible word. The vector has 1 if the word is in the set and 0 if
not. Since between two documents there are only a finite number of words among their two sets, the
infinite dimensionality of the vectors is unimportant. Almost all components are 0 in both, and 0's do not
impact the value of the dot product. To be precise, the dot product is the size of the intersection of the
two sets of words, and the lengths of the vectors are the square roots of the numbers of words in each
set. That calculation lets us compute the cosine of the angle between the vectors as the dot product
divided by the product of the vector lengths.
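Both measures can be computed directly from the word sets. The sketch below follows the description above: the Jaccard distance uses the sizes of the intersection and union, while the cosine computation treats each set as a 0/1 vector, so the dot product equals the size of the intersection and each vector length is the square root of the set size. The two example word sets are invented.

import math

def jaccard_distance(a, b):
    # 1 minus |A intersect B| / |A union B| for two sets of high-TF.IDF words.
    return 1.0 - len(a & b) / len(a | b)

def cosine_distance(a, b):
    # Dot product of the 0/1 vectors is |A intersect B|; each length is sqrt(|set|).
    cos = len(a & b) / (math.sqrt(len(a)) * math.sqrt(len(b)))
    return 1.0 - cos

doc1 = {"striker", "goal", "match", "final"}
doc2 = {"match", "goal", "referee", "penalty"}
print(jaccard_distance(doc1, doc2), cosine_distance(doc1, doc2))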
There have been a number of attempts to obtain information about features of items by
inviting users to tag the items by entering words or phrases that describe the item. Thus, one picture
with a lot of red might be tagged “Tiananmen Square,” while another is tagged “sunset at Malibu.”
The distinction is not something that could be discovered by existing image-analysis programs.
Almost any kind of data can have its features described by tags. One of the earliest attempts
to tag massive amounts of data was the site del.icio.us, later bought by Yahoo!, which invited users to
tag Web pages. The goal of this tagging was to make a new method of search available, where users
entered a set of tags as their search query, and the system retrieved the Web pages that had been
tagged that way. However, it is also possible to use the tags as a recommendation system. If it is
observed that a user retrieves or bookmarks many pages with a certain set of tags, then we can
recommend other pages with the same tags.
The problem with tagging as an approach to feature discovery is that the process only works
if users are willing to take the trouble to create the tags, and there are enough tags that occasional
erroneous ones will not bias the system too much.
Our ultimate goal for content-based recommendation is to create both an item profile consisting of
feature-value pairs and a user profile summarizing the preferences of the user, based on their row of the
utility matrix. In Section 9.2.2 we suggested how an item profile could be constructed. We imagined a
vector of 0’s and 1’s, where a 1 represented the occurrence of a high-TF.IDF word in the document. Since
features for documents were all words, it was easy to represent profiles this way.
We shall try to generalize this vector approach to all sorts of features. It is easy to do so for
features that are sets of discrete values. For example, if one feature of movies is the set of actors, then
imagine that there is a component for each actor, with 1 if the actor is in the movie, and 0 if not.
Likewise, we can have a component for each possible director, and each possible genre. All these
features can be represented using only 0’s and 1’s.
There is another class of features that is not readily represented by Boolean vectors: those
features that are numerical. For instance, we might take the average rating for movies to be a feature,
and this average is a real number. It does not make sense to have one component for each of the
possible average ratings, and doing so would cause us to lose the structure implicit in numbers. That
is, two ratings that are close but not identical should be considered more similar than widely differing
ratings. Likewise, numerical features of products, such as screen size or disk capacity for PC’s,
should be considered similar if their values do not differ greatly.
Numerical features should be represented by single components of vectors representing items.
These components hold the exact value of that feature. There is no harm if some components of the
vectors are Boolean and others are real-valued or integer-valued. We can still compute the cosine
distance between vectors, although if we do so, we should give some thought to the appropriate
scaling of the non-Boolean components so that they neither dominate the calculation nor become
irrelevant.
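As an illustration of such mixed vectors, the sketch below builds movie vectors with Boolean components for actors and directors plus one numerical component for the average rating. The universe of actors and directors and the scaling factor applied to the rating are assumptions for the example; in practice the scaling would need to be tuned.

import numpy as np

def movie_vector(actors, directors, all_actors, all_directors,
                 avg_rating, rating_scale=0.5):
    # Boolean components for actors and directors, plus one scaled numerical
    # component for the average rating (rating_scale is an illustrative choice).
    actor_part    = [1.0 if a in actors    else 0.0 for a in all_actors]
    director_part = [1.0 if d in directors else 0.0 for d in all_directors]
    return np.array(actor_part + director_part + [rating_scale * avg_rating])

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

ALL_ACTORS    = ["actor_1", "actor_2", "actor_3"]   # hypothetical universe
ALL_DIRECTORS = ["director_1", "director_2"]

m1 = movie_vector({"actor_1", "actor_2"}, {"director_1"},
                  ALL_ACTORS, ALL_DIRECTORS, avg_rating=4.1)
m2 = movie_vector({"actor_2"}, {"director_1"},
                  ALL_ACTORS, ALL_DIRECTORS, avg_rating=3.9)
print(cosine(m1, m2))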
Collecting information
An important part of running a business or even a website is having an abundance of information
available about the people you serve. Information is a currency in itself, and many digital services
depend on user information to function. Marketing, for example, depends on lots of demographic and
market information to be successful. User profiling allows companies to identify their ideal customers
and gather data–both personal data and general market data–to improve operations.
How does user profiling work?
User profiling works by separating users or customers into groups based on specific information. For
example, you can separate all of your customers by age, then by purchasing behavior. You might find
that customers above 40 years old purchase vastly different products from your company than
customers under 30. You can also allow users to create user profiles on your company application, in
the company database or on your website. This is a quick and efficient way to track customers while
offering the benefits of faster checkout and a more personalized customer experience.
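A small sketch of this kind of grouping, assuming hypothetical customer records held in a pandas DataFrame; the column names, age boundaries, and values are invented for the example.

import pandas as pd

customers = pd.DataFrame({
    "age":           [23, 27, 45, 52, 61],
    "category":      ["games", "games", "garden", "garden", "books"],
    "monthly_spend": [40, 55, 120, 90, 30],
})

# Separate customers into age groups, then compare purchasing behaviour per group.
customers["age_group"] = pd.cut(customers["age"],
                                bins=[0, 30, 40, 120],
                                labels=["under 30", "30-40", "over 40"])
profile = customers.groupby("age_group", observed=True).agg(
    top_category=("category", lambda s: s.mode().iloc[0]),
    avg_spend=("monthly_spend", "mean"),
)
print(profile)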
When to use user profiling
You can use user profiling in a variety of situations, including:
When you're launching a new product: Launching new products often requires a strong
understanding of customer behavior and what they expect. You can profile users to determine
which may be interested in the new product and what features they're likely to expect from it.
When you're building a marketing campaign: If you're building a marketing campaign, user
profiling is a core component of the campaign. You typically build a marketing campaign
to reach a specific audience, which you can understand better if you profile your customers and
organize them into customer groups.
When you're a new business: User profiling is especially useful for new companies, because
it allows them to identify their core audience more quickly. This may help prevent errors in
the future and create a strong initial customer base to help the company grow.
When you create a loyalty or rewards program: A company loyalty or rewards program
allows you to group customers by brand loyalty and purchases. User profiling is important
during this process because it helps you identify the customers who might benefit the most
from the program and helps you create targeted ads for the program or your products.
Algorithms:
Decision Trees: A tree-like model where each node represents a decision based on a feature, leading
to a classification outcome.
Decision Trees are a popular and intuitive machine learning algorithm used for both classification and
regression tasks. They are widely used due to their simplicity, interpretability, and effectiveness in
capturing complex relationships in data. Here are key concepts related to Decision Trees:
1. Tree Structure:
A Decision Tree is a hierarchical tree-like structure consisting of nodes. Each node
represents a decision based on a specific feature.
2. Nodes:
Nodes in a Decision Tree can be categorized into two types:
Root Node: The topmost node, representing the initial decision or feature.
Internal Nodes: Intermediate nodes that represent decisions based on
specific features.
Leaf Nodes (Terminal Nodes): End nodes that represent the final output,
which can be a class label in classification or a numerical value in regression.
3. Decision Rules:
Each internal node in the tree represents a decision based on a feature. The decision
rules guide the traversal from the root to the leaf nodes.
4. Splitting:
The process of dividing the dataset into subsets based on the values of a chosen
feature. The goal is to create homogenous subsets with respect to the target variable.
5. Entropy and Information Gain (for Classification):
Decision Trees for classification often use entropy and information gain to determine
the best feature for splitting.
Entropy measures the impurity or disorder in a set of data, and information gain
quantifies the improvement in purity achieved by splitting based on a particular
feature.
6. Gini Index (for Classification):
Another criterion for evaluating impurity in classification tasks is the Gini Index. It
measures the probability of incorrectly classifying a randomly chosen element in the
dataset (a small numeric sketch of both impurity criteria is given after this list).
7. CART Algorithm:
The Classification and Regression Trees (CART) algorithm is commonly used for
constructing Decision Trees.
CART can handle both classification and regression tasks.
8. Pruning:
Decision Trees are prone to overfitting, where they capture noise or specific patterns
in the training data that do not generalize well to new data.
Pruning involves removing parts of the tree that do not provide significant predictive
power on validation data, thus preventing overfitting.
9. Regression Trees:
In regression tasks, Decision Trees predict a numerical value at each leaf node, and
the prediction is the average of the target values in the corresponding subset.
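As referenced in point 6 above, the following small sketch computes entropy, information gain, and the Gini index for a toy split; the class labels and the split itself are invented for illustration.

import math
from collections import Counter

def entropy(labels):
    # H(S) = -sum p_i * log2(p_i) over the class proportions in S.
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def gini(labels):
    # Gini(S) = 1 - sum p_i^2: probability of misclassifying a random element.
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def information_gain(parent, subsets):
    # Reduction in entropy obtained by splitting the parent set into subsets.
    total = len(parent)
    return entropy(parent) - sum(len(s) / total * entropy(s) for s in subsets)

# Toy split: 10 examples divided by some feature into two subsets.
parent = ["yes"] * 6 + ["no"] * 4
left   = ["yes"] * 5 + ["no"] * 1
right  = ["yes"] * 1 + ["no"] * 3
print(entropy(parent), gini(parent), information_gain(parent, [left, right]))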
Applications of Decision Trees:
1. Classification: Identifying categories or labels for instances in a dataset.
2. Regression: Predicting a continuous numerical value.
3. Data Exploration: Decision Trees can be used for exploratory data analysis to understand
the most important features in a dataset.
4. Rule Extraction: Decision Trees can be translated into sets of rules, providing interpretable
insights.
Pros and Cons:
Pros:
Easy to understand and interpret.
Requires minimal data preprocessing.
Handles both numerical and categorical data.
Cons:
Prone to overfitting, especially on noisy datasets.
A single tree may struggle to capture smooth or additive relationships, and small changes in the data can produce a very different tree.
In summary, Decision Trees are versatile and widely used in various machine learning tasks. Their
simplicity makes them a valuable tool, especially when interpretability is crucial. Techniques like
pruning are employed to address the overfitting tendency associated with Decision Trees.
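A minimal scikit-learn sketch of training and inspecting a decision tree on the built-in Iris dataset; the criterion, max_depth, and ccp_alpha values are illustrative choices rather than a recommended recipe.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# criterion can be "gini" or "entropy"; max_depth and ccp_alpha are simple ways
# to limit overfitting (pre-pruning and cost-complexity post-pruning respectively).
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, ccp_alpha=0.01,
                              random_state=0).fit(X_train, y_train)
print("test accuracy:", tree.score(X_test, y_test))
print(export_text(tree))   # the learned tree as a set of readable decision rules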
Support Vector Machines (SVM): Separates data points into different classes by finding the
hyperplane that maximally separates them.
Support Vector Machines (SVM) is a supervised machine learning algorithm that is used for
classification and regression tasks. It is particularly effective in tasks where the goal is to separate
data points into different classes. SVM works by finding the hyperplane that best separates the data
points of one class from another while maximizing the margin between the classes. Here are key
concepts related to Support Vector Machines:
Linear Separation:
SVM is most commonly used for binary classification, where the goal is to separate the data into two
classes.
The algorithm searches for a hyperplane that best separates the data points of one class from those of
the other class.
Margin:
The margin is the distance between the hyperplane and the nearest data point from either class.
SVM aims to maximize this margin, as it is believed to lead to better generalization performance on
unseen data.
Support Vectors:
Support vectors are the data points that lie closest to the decision boundary (hyperplane) and have the
most influence on determining the optimal hyperplane.
These are the critical instances that define the margin between the classes.
Kernel Trick:
SVM can be extended to handle non-linear decision boundaries by using the kernel trick.
Kernels allow SVM to implicitly map the input features into a higher-dimensional space, making it
possible to find non-linear decision boundaries.
C Parameter:
The C parameter in SVM represents the regularization parameter.
A smaller C value allows for a wider margin but may result in more training errors, while a larger C
value may lead to a narrower margin but fewer training errors.
Soft Margin SVM:
In cases where the data is not perfectly separable, SVM can be adapted to allow for some
misclassification. This is referred to as a soft-margin SVM.
Multi-class Classification:
SVM can be extended to handle multi-class classification problems through techniques such as one-
vs-one or one-vs-all.
Applications of SVM:
Image Classification: SVM can be used for image classification tasks, such as identifying objects in
images.
Text Classification: SVM is effective in tasks like spam detection or sentiment analysis.
Bioinformatics: It has applications in classifying proteins and genes.
Handwriting Recognition: SVM can be used for character recognition in handwritten documents.
Pros and Cons:
Pros:
Effective in high-dimensional spaces.
Versatile due to the kernel trick, allowing it to handle non-linear decision boundaries.
Memory-efficient, as it uses only a subset of training points (support vectors).
Cons:
Can be sensitive to noise in the data.
Choice of kernel and parameters can impact performance.
Support Vector Machines are a powerful tool for classification tasks, particularly in scenarios where a
clear margin between classes is desired. Proper tuning of parameters and selection of the appropriate
kernel function are crucial for obtaining good performance.
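A small scikit-learn sketch illustrating the points above: an RBF-kernel SVC (the kernel trick) trained on a non-linearly separable toy dataset, with C controlling the soft-margin trade-off. All parameter values are illustrative.

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# A toy problem that is not linearly separable; the RBF kernel handles the
# curved decision boundary.
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Smaller C -> wider, softer margin (more training errors tolerated);
# larger C -> narrower margin with fewer training errors.
model = make_pipeline(StandardScaler(),
                      SVC(kernel="rbf", C=1.0, gamma="scale"))
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
print("support vectors per class:", model.named_steps["svc"].n_support_)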
Neural Networks: Deep learning models with multiple layers that can learn complex relationships in
data for classification tasks.
Neural Networks, specifically Artificial Neural Networks (ANNs), are a class of machine learning
models inspired by the structure and functioning of the human brain. They consist of interconnected
nodes (neurons) organized into layers. Neural Networks have proven to be powerful and flexible
models capable of learning complex patterns from data. Here are key concepts related to Neural
Networks:
1. Neurons:
Neurons are the basic units in a neural network, analogous to neurons in the human brain.
Each neuron receives inputs, processes them using weights, applies an activation function,
and produces an output.
2. Layers:
Neural Networks consist of layers of neurons, typically organized into three main types: input
layer, hidden layers, and output layer.
The input layer receives the initial data, hidden layers process information, and the output
layer produces the final result.
3. Weights and Bias:
Weights represent the strength of connections between neurons. During training, these
weights are adjusted to minimize the error in predictions.
Bias terms provide flexibility and allow the model to learn the correct mapping even when all
input features are zero.
4. Activation Function:
The activation function determines the output of a neuron given its weighted inputs.
Common activation functions include sigmoid, hyperbolic tangent (tanh), and rectified linear
unit (ReLU).
5. Feedforward and Backpropagation:
Feedforward: The process of passing inputs through the network to produce predictions. The
information flows forward through the layers.
Backpropagation: The process of adjusting weights and biases during training to minimize
the difference between predicted and actual outputs.
6. Loss Function:
The loss function quantifies the difference between predicted and actual outputs. The goal
during training is to minimize this loss.
Common loss functions include mean squared error for regression tasks and cross-entropy for
classification tasks.
7. Training and Optimization:
Neural Networks are trained using optimization algorithms like stochastic gradient descent
(SGD) or variants such as Adam.
The model iteratively adjusts weights and biases to minimize the loss on the training data.
8. Deep Learning:
Deep Learning refers to the use of deep neural networks, which have multiple hidden layers.
Deep networks can learn hierarchical representations of data, capturing complex features at
different levels.
9. Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs):
CNNs: Specialized for processing grid-like data, such as images. They use convolutional
layers to detect patterns.
RNNs: Suited for sequential data, like time series or natural language. They use recurrent
connections to capture temporal dependencies.
10. Transfer Learning:
Transfer learning involves using a pre-trained neural network on a similar task as a starting
point for a new task.
This approach leverages the knowledge gained from one task to improve performance on
another.
Applications of Neural Networks:
1. Image and Speech Recognition: CNNs are widely used for image recognition, while RNNs
can be applied to speech recognition.
2. Natural Language Processing (NLP): Neural Networks are used for tasks like language
translation, sentiment analysis, and text generation.
3. Healthcare: Applied in medical image analysis, disease prediction, and drug discovery.
4. Autonomous Vehicles: Neural Networks play a crucial role in object detection and decision-
making for autonomous vehicles.
5. Financial Forecasting: Used for predicting stock prices, credit risk assessment, and fraud
detection.
Pros and Cons:
Pros:
Capable of learning complex patterns and representations.
Effective in a wide range of tasks.
Can automatically learn hierarchical features.
Cons:
Require large amounts of data for training.
Computationally intensive and may require powerful hardware.
Prone to overfitting, especially with limited data.
Neural Networks, especially deep neural networks, have become a cornerstone of modern machine
learning and artificial intelligence, driving advancements in various fields. The success of deep
learning models has contributed to their widespread adoption in real-world applications.
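A minimal feedforward-network sketch using scikit-learn's MLPClassifier on the built-in digits dataset: two hidden layers with ReLU activations, trained with the Adam optimizer by backpropagating the cross-entropy loss. The layer sizes and other settings are illustrative assumptions only.

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Two hidden layers (64 and 32 neurons) with ReLU activations; weights and
# biases are adjusted by backpropagation using the Adam optimizer.
net = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(64, 32), activation="relu",
                  solver="adam", max_iter=300, random_state=0),
)
net.fit(X_train, y_train)
print("test accuracy:", net.score(X_test, y_test))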