Similarity-based
Learning
“Anyone who stops learning is old, whether at twenty or eighty.”
— Henry Ford
Similarity-based learning is a supervised learning technique that predicts the class label of a test
instance by gauging the similarity of this test instance with the training instances. Similarity-based
learning refers to a family of instance-based learning methods used to solve both classification
and regression problems. Instance-based learning makes predictions by computing distances
or similarities between the test instance and a specific set of training instances local to the test
instance in an incremental process. In contrast to other learning mechanisms, it considers only
the nearest instance or instances to predict the class of unseen instances. This learning methodology
improves classification performance since it uses only a specific set of instances in an incremental
learning task. Similarity-based classification is useful in various fields such as
image processing, text classification, pattern recognition, bioinformatics, data mining, information
retrieval, natural language processing, etc. A practical application of this learning is
predicting daily stock index price changes. This chapter provides an insight into how different
similarity-based models predict the class of a new instance.
Learning Objectives
* Understand the fundamentals of instance-based learning
* Know about the concepts of nearest-neighbor learning using the algorithm called
k-Nearest-Neighbors (k-NN)
* Learn about the Weighted k-Nearest-Neighbor classifier that chooses the neighbors by
using the weighted distance
* Gain knowledge about the Nearest Centroid classifier, a simple alternative to k-NN
classifiers
* Understand Locally Weighted Regression (LWR), which approximates linear
functions of the k neighbors to minimize the error while predicting
4.1 INTRODUCTION TO SIMILARITY OR INSTANCE-BASED
LEARNING
Similarity-based classifiers use similarity measures to locate the nearest neighbors and classify a test
instance, which works in contrast with other learning mechanisms such as decision trees or neural
networks. Similarity-based learning is also called instance-based learning or just-in-time learning,
since it does not build an abstract model of the training instances and performs lazy learning when
classifying a new instance. This learning mechanism simply stores all data and uses it only when it
needs to classify an unseen instance. The advantage of this learning is that processing occurs
only when a request to classify a new instance is given. This methodology is particularly useful
when the whole dataset is not available in the beginning but is collected in an incremental manner.
The drawback of this learning is that it requires a large memory to store the data since a global
abstract model is not constructed initially with the training data. Classification of instances is done
based on a measure of similarity in the form of distance functions over data instances. Several
distance metrics are used to estimate the similarity or dissimilarity between instances required for
clustering, nearest-neighbor classification, anomaly detection, and so on. Popular distance metrics
used are Hamming distance, Euclidean distance, Manhattan distance, Minkowski distance, cosine
similarity, Mahalanobis distance, Pearson's correlation or correlation similarity, mean squared
difference, Jaccard coefficient, Tanimoto coefficient, etc.
Generally, similarity-based classification problems represent the features of the test instance and the
training instances in Euclidean space to learn the similarity or dissimilarity between instances.
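To make these measures concrete, the short Python sketch below (a minimal illustration using NumPy and two hypothetical feature vectors, not part of the chapter's algorithms) computes a few of the metrics listed above.

```python
import numpy as np

# Two hypothetical numeric feature vectors used only for illustration.
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

# Euclidean distance: straight-line distance in feature space.
euclidean = np.sqrt(np.sum((a - b) ** 2))

# Manhattan distance: sum of absolute coordinate differences.
manhattan = np.sum(np.abs(a - b))

# Minkowski distance of order p; p = 1 gives Manhattan and p = 2 gives Euclidean.
p = 3
minkowski = np.sum(np.abs(a - b) ** p) ** (1.0 / p)

# Cosine similarity: cosine of the angle between the two vectors.
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hamming distance for binary (categorical) attributes:
# the number of positions at which the two vectors differ.
u = np.array([1, 0, 1, 1])
v = np.array([1, 1, 1, 0])
hamming = np.sum(u != v)

print(euclidean, manhattan, minkowski, cosine, hamming)
```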
4.1.1 Differences Between Instance- and Model-based Learning
An instance is an entity or an example in the training dataset. It is described by a set of features or
attributes, one of which describes the class label or category of the instance. Instance-based methods
learn or predict the class label of a test instance only when a new instance is given for classification,
and until then they delay the processing of the training dataset.
They are also referred to as lazy learning methods since they do not generalize any model from the
training dataset but just keep the training dataset as a knowledge base until a new instance is
given. In contrast, model-based learning, generally referred to as eager learning, tries to generalize the
training data to a model before receiving test instances. Model-based machine learning describes
all assumptions about the problem domain in the form of a model. These algorithms basically learn
in two phases, called the training phase and the testing phase. In the training phase, a model is built from the
training dataset and is used to classify a test instance during the testing phase. Some examples of
models constructed are decision trees, neural networks, Support Vector Machines (SVMs), etc.
The differences between Instance-based Learning and Model-based Learning are listed in
Table 4.1.
Table 4.1: Differences between Instance-based Learning and Model-based Learning
Lazy Learners | Eager Learners
Processing of training instances is done only during the testing phase | Processing of training instances is done during the training phase
No model is built with the training instances before it receives a test instance | Generalizes a model with the training instances before it receives a test instance
Predicts the class of the test instance directly from the training data | Predicts the class of the test instance from the model built
Slow in the testing phase | Fast in the testing phase
Learns by making many local approximations | Learns by creating a global approximation
Instance-based learning also comes under the category of memory-based models which normally
compare the given test instance with the trained instances that are stored in memory. Memory-
based models classify a test instance by checking the similarity with the training instances.
Some examples of instance-based learning algorithms are:
1. k-Nearest Neighbor (k-NN)
2. Variants of Nearest Neighbor learning
3. Locally Weighted Regression
4. Learning Vector Quantization (LVQ)
5. Self-Organizing Map (SOM)
6. Radial Basis Function (RBF) networks
In this chapter, we will discuss certain instance-based learning algorithms such as
k-Nearest Neighbor (k-NN), variants of Nearest Neighbor learning, and Locally Weighted
Regression.
Self-Organizing Map (SOM) and Radial Basis Function (RBF) networks are discussed along
with the concepts of artificial neural networks in Chapter 10, since they can be understood
only after an understanding of neural networks.
These instance-based methods are sensitive to the range of values taken by the features.
Moreover, they are sensitive to irrelevant and correlated features, which can lead to misclassification of
instances.
4.2 NEAREST-NEIGHBOR LEARNING
A natural approach to similarity-based classification is
k-Nearest-Neighbors (k-NN), which is a non-parametric
method used for both classification and regression
problems. It is a simple and powerful non-parametric
algorithm that predicts the category of the test instance
according to the 'k' training samples which are closest to
the test instance and classifies it to the category which
has the largest probability. A visual representation
of this learning is shown in Figure 4.1. There are two
classes of objects, called C1 and C2, in the given figure.
When given a test instance T, the category of this test
instance is determined by looking at the class of its k = 3
nearest neighbors. Thus, the class of this test instance
T is predicted as C1.

[Figure 4.1: Visual Representation of k-Nearest Neighbor Learning]
The algorithm relies on the assumption that similar objects are close to each other in the feature
space. k-NN performs instance-based learning: it just stores the training data instances and
learns from the instances case by case. The model is also 'memory-based' as it uses the training data at the time
when predictions need to be made. It is a lazy learning algorithm since no prediction model is built
earlier with the training instances, and classification happens only after getting the test instance.
The algorithm classifies a new instance by determining the 'k' most similar instances (i.e.,
the k nearest neighbors) and summarizing the output of those 'k' instances. If the target variable is
discrete, then it is a classification problem, so it selects the most common class value among the 'k'
instances by a majority vote. However, if the target variable is continuous, then it is a regression
problem, and hence the mean output value of the 'k' instances is the output for the test instance.
The most popular distance measure, the Euclidean distance, is used to determine the 'k' instances which are
nearest to the test instance.
Inputs: Training dataset T, distance metric d, test instance t, the number of nearest neighbors k
Output: Predicted class or category
Prediction: For test instance t,
1. For each instance i in T, compute the distance between the test instance t and
instance i using a distance metric (Euclidean distance).
[Continuous attributes - Euclidean distance between two points in the plane with
coordinates (x_1, y_1) and (x_2, y_2) is given as dist((x_1, y_1), (x_2, y_2)) = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}.]
[Categorical attributes (binary) - Hamming distance: if the values of the two instances
are the same, the distance d is 0; otherwise d = 1.]
2. Sort the distances in an ascending order and select the first k nearest training data
instances to the test instance.
3. Predict the class of the test instance by majority voting (if target attribute is discrete
valued) or mean (if target attribute is continuous valued) of the k selected nearest
instances.
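A minimal Python sketch of these three steps for a discrete target is given below; the helper names euclidean and knn_predict are illustrative choices, not the textbook's notation.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two numeric feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, test_x, k=3):
    """Predict the class of test_x by a majority vote among its k nearest neighbors."""
    # Step 1: compute the distance from the test instance to every training instance.
    distances = [(euclidean(x, test_x), label) for x, label in zip(train_X, train_y)]
    # Step 2: sort the distances in ascending order and keep the first k neighbors.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 3: majority vote over the classes of the k selected nearest instances.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```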
Consider the student performance training dataset of 8 data instances shown in
Table 4.2, which describes the performance of individual students in a course along with the CGPA
obtained in the previous semesters. The independent attributes are CGPA, Assessment and
Project Submitted. The target variable is 'Result', which is a discrete valued variable that takes two values,
'Pass' or 'Fail'. Based on the performance of a student, classify whether the student will pass or fail
in that course.
Table 4.2: Training Dataset T

S.No. | CGPA | Assessment | Project Submitted | Result
1. | 9.2 | 85 | 8 | Pass
2. | 8 | 80 | 7 | Pass
3. | 8.5 | 81 | 8 | Pass
4. | 6 | 45 | 5 | Fail
5. | 6.5 | 50 | 4 | Fail
6. | 8.2 | 72 | 7 | Pass
7. | 5.8 | 38 | 5 | Fail
8. | 8.9 | 91 | 9 | Pass
Solution: Given a test instance (6.1, 40, 5) and a set of categories {Pass, Fail}, also called classes,
we need to use the training set to classify the test instance using the Euclidean distance.
The task of classification is to assign a category or class to an arbitrary instance.
Assign k = 3.
Step 1: Calculate the Euclidean distance between the test instance (6.1, 40, 5) and each of the
training instances as shown in Table 4.3.
Table 4.3: Euclidean Distance

S.No. | CGPA | Assessment | Project Submitted | Result | Euclidean Distance
1. | 9.2 | 85 | 8 | Pass | \sqrt{(9.2 - 6.1)^2 + (85 - 40)^2 + (8 - 5)^2} = 45.2063
2. | 8 | 80 | 7 | Pass | \sqrt{(8 - 6.1)^2 + (80 - 40)^2 + (7 - 5)^2} = 40.09501
3. | 8.5 | 81 | 8 | Pass | \sqrt{(8.5 - 6.1)^2 + (81 - 40)^2 + (8 - 5)^2} = 41.17961
4. | 6 | 45 | 5 | Fail | \sqrt{(6 - 6.1)^2 + (45 - 40)^2 + (5 - 5)^2} = 5.001
5. | 6.5 | 50 | 4 | Fail | \sqrt{(6.5 - 6.1)^2 + (50 - 40)^2 + (4 - 5)^2} = 10.05783
6. | 8.2 | 72 | 7 | Pass | \sqrt{(8.2 - 6.1)^2 + (72 - 40)^2 + (7 - 5)^2} = 32.13114
7. | 5.8 | 38 | 5 | Fail | \sqrt{(5.8 - 6.1)^2 + (38 - 40)^2 + (5 - 5)^2} = 2.022375
8. | 8.9 | 91 | 9 | Pass | \sqrt{(8.9 - 6.1)^2 + (91 - 40)^2 + (9 - 5)^2} = 51.23319
Step 2: Sort the distances in the ascending order and select the first 3 nearest training data instances
to the test instance. The selected nearest neighbors are shown in Table 4.4.
Table 4.4: Nearest Neighbors

Instance | Euclidean Distance | Class
4. | 5.001 | Fail
5. | 10.05783 | Fail
7. | 2.022375 | Fail
Here, we take the 3 nearest neighbors as instances 4, 5 and 7, which have the smallest distances.
Step 3: Predict the class of the test instance by majority voting.
The class for the test instance is predicted as ‘Fail’.
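As an optional check of this example (not part of the original solution), the following snippet recomputes the distances of Table 4.3 and the majority vote of Table 4.4 in plain Python; it assumes the rows as transcribed in Table 4.2 and uses math.dist, which requires Python 3.8 or later.

```python
import math

# Training instances transcribed from Table 4.2:
# (CGPA, Assessment, Project Submitted) -> Result
train = [((9.2, 85, 8), 'Pass'), ((8.0, 80, 7), 'Pass'), ((8.5, 81, 8), 'Pass'),
         ((6.0, 45, 5), 'Fail'), ((6.5, 50, 4), 'Fail'), ((8.2, 72, 7), 'Pass'),
         ((5.8, 38, 5), 'Fail'), ((8.9, 91, 9), 'Pass')]
test = (6.1, 40, 5)

# Step 1: Euclidean distance to every training instance (reproduces Table 4.3).
dists = sorted((math.dist(x, test), label) for x, label in train)
# Step 2: the three nearest neighbors (instances 7, 4 and 5, as in Table 4.4).
nearest = dists[:3]
# Step 3: majority vote among the three nearest neighbors -> 'Fail'
labels = [label for _, label in nearest]
print(nearest)
print(max(set(labels), key=labels.count))
```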
Data normalization/standardization is required when the features have different ranges or a
wide range of possible values; it transforms all features to a specific range before distances are
computed. This is done to eliminate the influence of one feature over another (i.e., to give all
features an equal chance). For example, if one feature has values in the range [0-1] and another
feature has values in the range [0-100], then the second feature will dominate the distance even if
its variation is small compared to the first feature.
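One common way to do this, assumed here for illustration, is min-max scaling, which rescales every feature to the range [0, 1]; a minimal sketch is shown below.

```python
import numpy as np

def min_max_scale(X):
    """Rescale each feature (column) of X to the range [0, 1]."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0   # avoid division by zero for constant features
    return (X - col_min) / col_range

# Example: one feature in [0, 1], another in [0, 100]; after scaling,
# both contribute comparably to a Euclidean distance.
X = [[0.2, 10], [0.9, 95], [0.5, 40]]
print(min_max_scale(X))
```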
k-NN classifier performance is strongly affected by three factors: the number of nearest neighbors k,
the distance metric used, and the decision rule.
If the k value selected is small, it may result in overfitting or a less stable model, and if it is big, it
may include many irrelevant points from other classes. The choice of the distance metric also
plays a major role and depends on the type of the independent attributes in the training dataset.
The k-NN classification algorithm best suits lower-dimensional data, as in a high-dimensional
space the nearest neighbors may not be very close at all.
4.3 WEIGHTED K-NEAREST-NEIGHBOR ALGORITHM
The Weighted k-NN is an extension of k-NN. It chooses the neighbors by using a weighted
distance. The k-Nearest Neighbor (k-NN) algorithm has some serious limitations as its performance
depends solely on choosing the k nearest neighbors, the distance metric used and the decision
rule. The principal idea of Weighted k-NN is that the closest neighbors of the test instance
are assigned a higher weight in the decision than neighbors that are farther away from
the test instance; that is, the weights are inversely proportional to the distances.
The selected k nearest neighbors can be assigned uniform weights, which means all the
instances in each neighborhood are weighted equally, or weights can be assigned by the inverse of
their distance. In the second case, closer neighbors of a query point will have a greater influence
than neighbors which are farther away.
Inputs: Training dataset T, distance metric d, weighting function w(i), test instance t, the
number of nearest neighbors k
Output: Predicted class or category
Prediction: For test instance t,
1. For each instance i in the training dataset T, compute the distance between the test instance
t and instance i using a distance metric (Euclidean distance).
[Continuous attributes - Euclidean distance between two points in the plane with
coordinates (x_1, y_1) and (x_2, y_2) is given as dist((x_1, y_1), (x_2, y_2)) = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}.]
[Categorical attributes (binary) - Hamming distance: if the values of the two instances are
the same, the distance d is 0; otherwise d = 1.]
2. Sort the distances in ascending order and select the first k nearest training data
instances to the test instance.
3. Predict the class of the test instance by the weighted voting technique (weighting function
w(i)) for the k selected nearest instances:
* Compute the inverse of each distance of the k selected nearest instances.
* Find the sum of the inverses.
* Compute the weight by dividing each inverse distance by the sum. (Each weight is a
vote for its associated class.)
* Add the weights of the same class.
* Predict the class by choosing the class with the maximum vote.
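A minimal Python sketch of this weighted-voting rule follows; the small eps added to each distance is an assumption made here to avoid division by zero when a neighbor coincides with the test instance.

```python
import math
from collections import defaultdict

def weighted_knn_predict(train_X, train_y, test_x, k=3, eps=1e-9):
    """Predict a class by inverse-distance weighted voting over the k nearest neighbors."""
    # Step 1: Euclidean distance from the test instance to every training instance.
    dists = sorted((math.dist(x, test_x), label) for x, label in zip(train_X, train_y))
    # Step 2: keep the first k nearest neighbors.
    nearest = dists[:k]
    # Step 3: inverse of each distance, normalized so that the weights sum to one;
    # eps guards against division by zero when a neighbor coincides with test_x.
    inverses = [1.0 / (d + eps) for d, _ in nearest]
    total = sum(inverses)
    votes = defaultdict(float)
    for inv, (_, label) in zip(inverses, nearest):
        votes[label] += inv / total   # each normalized weight is a vote for its class
    # The class with the maximum accumulated vote wins.
    return max(votes, key=votes.get)
```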
Consider the same training dataset given in Table 4.2. Use Weighted k-NN and
determine the class of the test instance.
Solution:
Step 1: Given a test instance (7.6, 60, 8) and a set of classes {Pass, Fail}, use the training dataset to
classify the test instance using the Euclidean distance and the weighting function.
Assign k = 3. The distance calculation is shown in Table 4.5.
Table 4.5: Euclidean Distance

S.No. | CGPA | Assessment | Project Submitted | Result | Euclidean Distance
1. | 9.2 | 85 | 8 | Pass | \sqrt{(9.2 - 7.6)^2 + (85 - 60)^2 + (8 - 8)^2} = 25.05115
2. | 8 | 80 | 7 | Pass | \sqrt{(8 - 7.6)^2 + (80 - 60)^2 + (7 - 8)^2} = 20.02898
3. | 8.5 | 81 | 8 | Pass | \sqrt{(8.5 - 7.6)^2 + (81 - 60)^2 + (8 - 8)^2} = 21.01928
4. | 6 | 45 | 5 | Fail | \sqrt{(6 - 7.6)^2 + (45 - 60)^2 + (5 - 8)^2} = 15.38051
5. | 6.5 | 50 | 4 | Fail | \sqrt{(6.5 - 7.6)^2 + (50 - 60)^2 + (4 - 8)^2} = 10.82636
6. | 8.2 | 72 | 7 | Pass | \sqrt{(8.2 - 7.6)^2 + (72 - 60)^2 + (7 - 8)^2} = 12.05653
7. | 5.8 | 38 | 5 | Fail | \sqrt{(5.8 - 7.6)^2 + (38 - 60)^2 + (5 - 8)^2} = 22.27644
8. | 8.9 | 91 | 9 | Pass | \sqrt{(8.9 - 7.6)^2 + (91 - 60)^2 + (9 - 8)^2} = 31.04336
The k = 3 nearest neighbors are then selected, and the class of the test instance is decided by the
weighted voting procedure described in the algorithm above.

4.4 NEAREST CENTROID CLASSIFIER
A simple alternative to the k-NN classifier is the Nearest Centroid Classifier. It computes the
centroid, i.e., the mean, of the training instances of each class and assigns a test instance to the
class whose centroid is nearest to it.

Consider a training dataset with two numeric features x and y and two classes 'A' and 'B', each
containing three instances. Predict the class of the test instance (6, 5) using the Nearest Centroid
Classifier.
Solution:
Step 1; Compute the mean/centroid of each class. In this example there are two classes called ‘A’
and ‘B’.124 6 Machine Learning $$
Centroid of class ‘/
(34+5+4, 142+ 3)/3 = (12, 6)/3 = (4, 2)
=(7+6+8,6+7 +5)/3 = (21, 18/3 =(7, 6)
Now given a test instance (6, 5), we can predict the class.
‘Centroid of class ‘
Step 2: Calculate the Euclidean distance between test instance (6, 5) and each of the centroid.
Euc Dist{(6, 5) ; (4, 2)]= y(6~4) +(5-2)' = v3 =36
Buc Dist{(6,5) ; (7, 6)]= \(6-7) +(5-6) = v2 =1414
The test instance has smaller distance to class B. Hence, the class of this test instance is predicted]
as ‘BY
-*
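A minimal sketch of the Nearest Centroid rule used in this example is given below; the individual training points are an illustrative choice consistent with the centroids (4, 2) and (7, 6) computed above.

```python
import numpy as np

def nearest_centroid_predict(train_X, train_y, test_x):
    """Assign test_x to the class whose centroid (mean of its training points) is closest."""
    X, y = np.asarray(train_X, dtype=float), np.asarray(train_y)
    # Centroid of each class: mean of that class's training instances.
    centroids = {c: X[y == c].mean(axis=0) for c in np.unique(y)}
    # Pick the class whose centroid has the smallest Euclidean distance to test_x.
    return min(centroids, key=lambda c: np.linalg.norm(test_x - centroids[c]))

# Illustrative points for classes 'A' and 'B' (chosen to give centroids (4, 2) and (7, 6)).
X = [(3, 1), (5, 2), (4, 3), (7, 6), (6, 7), (8, 5)]
y = ['A', 'A', 'A', 'B', 'B', 'B']
print(nearest_centroid_predict(X, y, np.array([6, 5])))   # expected: 'B'
```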
4.5 LOCALLY WEIGHTED REGRESSION (LWR)
Locally Weighted Regression (LWR) is a non-parametric supervised learning algorithm that
performs local regression by combining a regression model with a nearest-neighbor model. LWR
is also referred to as a memory-based method, as it requires the training data at prediction time but
uses only the training data instances locally around the point of interest. Using the nearest neighbors
algorithm, we find the instances that are closest to a test instance and fit a linear function to
those 'k' nearest instances in the local regression model. The key idea is that we approximate
a linear function from the 'k' neighbors of each query point so as to minimize the error, with the
result that the prediction is no longer a single straight line but rather a curve.
Ordinary linear regression finds a linear relationship between the input x and the output y.
Given a training dataset T, the hypothesis function h_\beta(x) predicts the target output as a linear
function, where \beta_0 is the intercept and \beta_1 is the coefficient of x.
It is given in Eq. (4.1) as:
h_\beta(x) = \beta_0 + \beta_1 x    (4.1)
The cost function minimizes the error difference between the predicted value
h_\beta(x) and the true value y, and it is given in Eq. (4.2) as:
J(\beta) = \frac{1}{2n} \sum_{i=1}^{n} (h_\beta(x_i) - y_i)^2    (4.2)
where 'n' is the number of instances in the training dataset.
Now the cost function is modified for locally weighted linear regression by including weights
only for the nearest neighbor points. Hence, the cost function is given as in Eq. (4.3):
J(\beta) = \frac{1}{2n} \sum_{i=1}^{n} w_i (h_\beta(x_i) - y_i)^2    (4.3)
where w_i is the weight associated with each x_i.
The weight function used is a Gaussian kernel that gives a higher value for instances that are
close to the test instance; for instances far away, it tends to zero but never becomes exactly zero.
w_i is computed in Eq. (4.4) as:
w_i = e^{-(x_i - x)^2 / (2\tau^2)}    (4.4)
where \tau is called the bandwidth parameter and controls the rate at which w_i reduces to zero with
the distance of x_i from the query point x.
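The sketch below is one way to implement locally weighted regression in the spirit of Eqs. (4.1), (4.3) and (4.4): Gaussian weights are computed around the query point and a weighted least-squares line is fitted using the closed-form normal equations (an implementation choice assumed here, not prescribed by the text).

```python
import numpy as np

def lwr_predict(x_train, y_train, x_query, tau=1.0):
    """Locally weighted regression prediction at x_query with bandwidth tau."""
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    # Gaussian kernel weights, Eq. (4.4): larger for points near the query.
    w = np.exp(-(x_train - x_query) ** 2 / (2 * tau ** 2))
    # Design matrix with an intercept column, matching h(x) = b0 + b1 * x in Eq. (4.1).
    X = np.column_stack([np.ones_like(x_train), x_train])
    W = np.diag(w)
    # Weighted least squares minimizing Eq. (4.3): beta = (X^T W X)^(-1) X^T W y.
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y_train)
    return beta[0] + beta[1] * x_query

# Example usage with hypothetical (x, y) points and a query at x = 2.
print(lwr_predict([1, 2, 3, 6], [5, 7, 8, 9], 2.0, tau=0.5))
```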
Consider a simple example with four instances shown in Table 4.10 and apply
locally weighted regression.
Table 4.10: Sample Table

S.No. | x | y
1. | - | -
2. | 1 | 5
3. | 2 | 7
4. | 3 | 8
Solution: Using a linear regression model, assume we have computed the parameters:
\beta_0 = 4.72, \beta_1 = 0.62
Given a test instance with x = 2, the predicted y is:
y = \beta_0 + \beta_1 x = 4.72 + 0.62 \times 2 = 5.96
Applying the nearest neighbor model, we choose the k = 3 closest instances.
Table 4.11 shows the Euclidean distance calculation for the training instances.
Table 4.11: Euclidean Distance Calculation

S.No. | x | y | Squared distance from x = 2
2. | 1 | 5 | (1 - 2)^2 = 1
3. | 2 | 7 | (2 - 2)^2 = 0
4. | 3 | 8 | (3 - 2)^2 = 1
Instances 2, 3 and 4 are closer, with the smallest distances.
The mean of their outputs = (5 + 7 + 8)/3 = 20/3 = 6.67.
Using Eq. (4.4), compute the weights for the closest instances using the Gaussian kernel:
w_i = e^{-(x_i - x)^2 / (2\tau^2)}
Hence, the weights of the closest instances are computed as follows.
Weight of instance 2 is:
w_2 = e^{-(1 - 2)^2 / (2\tau^2)} = e^{-1/(2\tau^2)}
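As a quick numerical illustration of Eq. (4.4) for this example, the snippet below computes the Gaussian weights of the three closest instances for an assumed bandwidth tau = 0.5 (the text does not specify a value for tau).

```python
import numpy as np

tau = 0.5                             # assumed bandwidth; not specified in the example
x_query = 2.0
x_near = np.array([1.0, 2.0, 3.0])    # x values of instances 2, 3 and 4

# Eq. (4.4): w_i = exp(-(x_i - x)^2 / (2 * tau^2))
weights = np.exp(-(x_near - x_query) ** 2 / (2 * tau ** 2))
print(weights)   # instance 3 (distance 0) gets weight 1; instances 2 and 4 get exp(-2)
```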