
12 - 23ECE216 - Nearest Neighbors


Om Namo Bhagavate Vasudevaya

Introduction to Classification: Lazy Learners - Nearest Neighbor Classifiers

And some insights into classifier terminology

Dr. Binoy B Nair

Note
◼ Only to be used as a supplementary learning material. Not for any commercial use.

◼ These slides have been substantially sourced from Han and Kamber, with some modifications made for easier understanding.

◼ All rights belong to the original authors.



Nearest Neighbor Classifiers

◼ Basic idea:
   ◼ If it walks like a duck and quacks like a duck, then it's probably a duck.

(Figure) Given a test record: compute its distance to every training record, then choose the k "nearest" records.

Nearest Neighbor Classifier
◼ This is amongst the simplest of all classification algorithms in supervised learning.

◼ It classifies a pattern based on the class label(s) of the closest training pattern(s) in the feature space.

◼ It is a non-parametric method: no model is fitted using the training patterns.

◼ The accuracy of nearest neighbor classifiers is good: asymptotically, the 1-NN error rate is guaranteed to be no worse than twice the Bayes error rate, which is the optimal error rate.
Nearest Neighbor Classifiers
◼ There is no training time required for this classifier. In other words, there is no design time for building the classifier.

◼ Every time a test pattern is to be classified, it has to be compared with all the training patterns to find the closest pattern.

◼ This classification time can be large if the training patterns are large in number or if the dimensionality of the patterns is high.

Example: 1-NN classifier

Q. Find what the fifth customer is likely to buy from the given dataset using 1-NN
classifier

Customer | Age | Income | Purchased Product
1        | 45  | 46000  | Book
2        | 39  | 100000 | Book
3        | 35  | 38000  | DVD
4        | 69  | 150000 | DVD
5        | 58  | 51000  | ???

Example: 1-NN classifier
Solving:

Determine the distance between the unknown sample (sample no. 5) and the other four known samples using Euclidean distance after normalizing each attribute by its range:

d = \sqrt{ \left(\frac{58 - Age}{69 - 35}\right)^2 + \left(\frac{51000 - Income}{150000 - 38000}\right)^2 }

Customer | Age | Income | Norm. Age | Norm. Income | Norm. Distance | Un-norm. Distance
1        | 45  | 46000  | 1.3       | 0.4          | 0.41           | 5000.02
2        | 39  | 100000 | 1.1       | 0.9          | 0.72           | 49000
3        | 35  | 38000  | 1.0       | 0.3          | 0.73           | 13000.02
4        | 69  | 150000 | 2.0       | 1.3          | 0.85           | 99000
5        | 58  | 51000  | 1.7       | 0.5          |                |
Based on the nearest neighbor rule, customer 5's features are closest to those of customer 1 (normalized distance 0.41), so customer 5 is likely to buy what customer 1 bought, i.e. a book.
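
A minimal Python sketch of this 1-NN computation, assuming the normalization-by-range shown above (the data are just the table values from this example):

```python
import numpy as np

# Training data from the slide: [age, income]; label = purchased product
X_train = np.array([[45, 46000], [39, 100000], [35, 38000], [69, 150000]], dtype=float)
y_train = ["Book", "Book", "DVD", "DVD"]
x_test = np.array([58, 51000], dtype=float)

# Normalize each attribute by its range (max - min over the known samples)
ranges = X_train.max(axis=0) - X_train.min(axis=0)        # [69-35, 150000-38000]
d = np.sqrt((((x_test - X_train) / ranges) ** 2).sum(axis=1))

nearest = np.argmin(d)                 # index of the 1-nearest neighbor
print(d.round(2))                      # ~[0.38 0.71 0.69 0.94]; the slide gets [0.41 0.72 0.73 0.85]
                                       # because it rounds the normalized attributes first
print("Predicted purchase:", y_train[nearest])   # Book
```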

Issues with 1-NN classifier

◼ Sensitive to noise: a single outlier or mislabeled training point can be mistaken for real data and determine the prediction.

K-Nearest neighbor algorithm

◼ Suppose there are n patterns X1, X2, ..., Xn in the training data X and a test pattern P.

◼ If Xk is the most similar pattern to P in X, then the class of P is taken to be the class of Xk (this is the 1-NN rule; k-NN generalizes it to the k most similar patterns).

◼ Similarity is usually measured by computing the distance from P to each of the patterns X1, X2, ..., Xn.

K-Nearest Neighbor Classification process

◼ Euclidean distance is a very common distance measure:

   d(p, q) = \sqrt{ \sum_i (p_i - q_i)^2 }

◼ Determine the class from the nearest neighbor list using either of the two techniques below:

   1. Simply take the majority vote of class labels among the k nearest neighbors.
   2. Weight each vote by the distance weight factor w = 1/d^2 and then pick the class with the highest total weight.
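
A small Python sketch of the two voting rules (the helper names are my own; the labels and distances would come from the k nearest neighbors already found):

```python
from collections import Counter, defaultdict

def majority_vote(neighbor_labels):
    """Rule 1: plain majority vote among the k nearest neighbors."""
    return Counter(neighbor_labels).most_common(1)[0][0]

def weighted_vote(neighbor_labels, neighbor_distances, eps=1e-12):
    """Rule 2: each neighbor votes with weight w = 1 / d^2."""
    weights = defaultdict(float)
    for label, d in zip(neighbor_labels, neighbor_distances):
        weights[label] += 1.0 / (d * d + eps)   # eps guards against d == 0
    return max(weights, key=weights.get)

# Example with 3 nearest neighbors
print(majority_vote(["DVD", "Book", "DVD"]))                   # DVD
print(weighted_vote(["DVD", "Book", "DVD"], [2.0, 0.1, 2.5]))  # Book (the very close neighbor dominates)
```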

Distance Measures-Minkowski distances

◼ Euclidean distance belongs to a family of popular distance measures called Minkowski distances, given by:

   d(X_i, X_j) = \left( |x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + ... + |x_{ip} - x_{jp}|^q \right)^{1/q}

   where X_i = (x_{i1}, x_{i2}, ..., x_{ip}) and X_j = (x_{j1}, x_{j2}, ..., x_{jp}) are two p-dimensional data objects, and q is a positive integer.
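
As an illustration, a minimal NumPy sketch of the Minkowski distance (the function name and sample points are my own):

```python
import numpy as np

def minkowski_distance(x_i, x_j, q):
    """Minkowski distance of order q between two p-dimensional points."""
    x_i, x_j = np.asarray(x_i, dtype=float), np.asarray(x_j, dtype=float)
    return (np.abs(x_i - x_j) ** q).sum() ** (1.0 / q)

print(minkowski_distance([1, 2, 3], [4, 0, 3], q=1))  # 5.0 (Manhattan)
print(minkowski_distance([1, 2, 3], [4, 0, 3], q=2))  # ~3.606 (Euclidean)
```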

Distance Measures-Minkowski distances

◼ If q = 1, the distance measure is the Manhattan (or city block) distance:

   d(X_i, X_j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + ... + |x_{ip} - x_{jp}|

❑ If q = 2, the distance measure is the Euclidean distance:

   d(X_i, X_j) = \sqrt{ |x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + ... + |x_{ip} - x_{jp}|^2 }

Distance Measures-Minkowski distances

◼ If q = ∞, the distance measure is the Chebyshev distance:

   d(X_i, X_j) = \max_k |x_{ik} - x_{jk}|

◼ Note: just subtract dimension-wise, take absolute values, and take the maximum of the resulting p values.
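
A short NumPy sketch checking that q = 1, q = 2 and q → ∞ give the Manhattan, Euclidean and Chebyshev distances (the array values are made up for illustration):

```python
import numpy as np

x_i = np.array([2.0, 7.0, 1.0])
x_j = np.array([5.0, 3.0, 2.0])
diff = np.abs(x_i - x_j)                 # dimension-wise absolute differences

manhattan = diff.sum()                   # q = 1
euclidean = np.sqrt((diff ** 2).sum())   # q = 2
chebyshev = diff.max()                   # q = infinity

print(manhattan, euclidean, chebyshev)   # 8.0, ~5.099, 4.0
```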

K-Nearest neighbor method

◼ Majority vote within the k nearest neighbors.

(Figure) A new point is labeled "brown" with k = 1 but "green" with k = 3: the choice of k can change the decision.
Nearest-Neighbor Classifiers
(Figure: an unknown record and its neighborhood)

Requires three things:
 – The set of stored records
 – A distance metric to compute the distance between records
 – The value of k, the number of nearest neighbors to retrieve

To classify an unknown record:
 – Compute its distance to all training records
 – Identify the k nearest neighbors
 – Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote), as in the sketch below

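
A compact Python sketch of this whole procedure, using plain NumPy, Euclidean distance and a majority vote (function and variable names are my own):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify one query point with the k-nearest-neighbor rule."""
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))  # distance to every training record
    nearest = np.argsort(distances)[:k]                          # indices of the k nearest neighbors
    labels = [y_train[i] for i in nearest]
    return Counter(labels).most_common(1)[0][0]                  # majority vote

# Toy usage
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y = ["A", "A", "B", "B"]
print(knn_predict(X, y, np.array([1.1, 1.0]), k=3))  # A
```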
Definition of Nearest Neighbor

(Figure: three panels showing a test point X with (a) its 1 nearest neighbor, (b) 2 nearest neighbors, (c) 3 nearest neighbors circled)

◼ The k-nearest neighbors of a record x are the data points that have the k smallest distances to x.
Nearest Neighbor Classification… Issues

◼ Choosing the value of k:
   ◼ If k is too small, the classifier is sensitive to noise points.
   ◼ If k is too large, the neighborhood may include points from other classes.

Nearest Neighbor Classification…Issues

◼ Scaling issues

   ◼ Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes.

   ◼ Example (see the sketch below):
      ◼ height of a person may vary from 1.5 m to 1.8 m
      ◼ weight of a person may vary from 90 lb to 300 lb
      ◼ income of a person may vary from $10K to $1M
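
A minimal sketch of min-max scaling before computing distances (the records are made up; with scikit-learn one could equivalently use sklearn.preprocessing.MinMaxScaler):

```python
import numpy as np

# Columns: height (m), weight (lb), income ($); made-up records
X = np.array([[1.6, 120,  25_000],
              [1.8, 300, 900_000],
              [1.5,  90,  10_000]])

# Min-max scaling: map each attribute to [0, 1] so no single attribute dominates the distance
X_min, X_max = X.min(axis=0), X.max(axis=0)
X_scaled = (X - X_min) / (X_max - X_min)

print(X_scaled.round(2))
```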

Discussion on the k-NN Algorithm
◼ Make the classifier robust to noisy data by voting over (or averaging) the k nearest neighbors rather than relying on a single one.

◼ The k-NN algorithm for a continuous-valued target (regression):
   ◼ Return the mean of the target values of the k nearest neighbors (see the sketch below).

◼ Curse of dimensionality: the distance between neighbors can be dominated by irrelevant attributes.
   ◼ To overcome it, stretch the axes (re-weight attributes) or eliminate the least relevant attributes.
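
A small sketch of k-NN for a continuous-valued target, i.e. averaging the k nearest neighbors' target values (the function name and data are illustrative):

```python
import numpy as np

def knn_regress(X_train, y_train, x_query, k=3):
    """Predict a continuous target as the mean of the k nearest neighbors' values."""
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]
    return float(np.mean(y_train[nearest]))

X = np.array([[1.0], [2.0], [3.0], [10.0]])
y = np.array([1.1, 1.9, 3.2, 9.8])
print(knn_regress(X, y, np.array([2.5]), k=3))  # mean of 1.9, 3.2, 1.1 -> ~2.07
```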

We still have the problem: how do we choose an optimal value of k?

◼ To better understand this, we need to understand two commonly used (and misunderstood) terms:

   ◼ Parameters, and
   ◼ Hyperparameters

Important term: Parameter
◼ Remember: use the term 'parameter' when you talk about the ML model.

◼ In ML, the parameters of the trained machine learning model that describe the model are referred to as model parameters.

◼ The aim of ALL machine learning problems is to find the optimal set of model parameters that gives the best performance (classification, clustering or forecasting performance, depending on the ML problem at hand).

   ◼ E.g. the final set of connection weights in a trained artificial neural network (ANN) model is the set of parameters for that ANN model.

Important term: Hyperparameter
◼ Remember: use the term 'hyperparameter' when you talk about the ML algorithm.

◼ Hyperparameters are the settings the ML algorithm needs before it can start running, so that it can eventually find an ML model with optimal model parameters.

◼ Typically, these are set manually or tuned via some search algorithm.

   ◼ E.g. the learning rate η and the momentum term µ that must be set before starting the backpropagation algorithm to train the ANN above are two hyperparameters.

◼ The algorithm hyperparameters significantly affect the ML model training process, hence tuning the hyperparameters is usually quite helpful.
Let’s get back to our problem

◼ The problem we had was: 'How to choose the optimal value of k for a kNN classifier?'

◼ Going by the definitions on the previous slides, k is a hyperparameter which is needed for the kNN algorithm to start working.

◼ The kNN classifier is a lazy classifier: it builds no model during training, so there is no question of model parameters either.

How to find optimal hyperparameters ?

Step 1: Split the dataset into 3 parts:

   Entire dataset → 70% Training set | 15% Validation set | 15% Test set

Step 2: Set the model hyperparameters.
Step 3: Use the training set to train the model.
Step 4: Use the validation set to measure the model's performance.
Step 5: Go back to Step 2, change the hyperparameters, and repeat Steps 2-4 until you have found the hyperparameter values that result in the ML model giving the best performance on the validation set.
Step 6: The best-performing model is your final model, which can be deployed. To estimate how well it will do, pass the test set through the best model identified in Step 5 and record the performance. This is the result you are likely to get when the model is deployed in the field as well (see the sketch below).

◼ Note: the 70-15-15 percentage split is not sacrosanct; you can use your own split ratios.
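
A scikit-learn sketch of this procedure for tuning k in a kNN classifier (the 70-15-15 split, the Iris dataset and the candidate k values are example choices, not part of the slide):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Step 1: 70% train, 15% validation, 15% test
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.30, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.50, random_state=0)

# Steps 2-5: try several values of the hyperparameter k, keep the best on the validation set
best_k, best_acc = None, -1.0
for k in [1, 3, 5, 7, 9]:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    acc = accuracy_score(y_val, model.predict(X_val))
    if acc > best_acc:
        best_k, best_acc = k, acc

# Step 6: estimate deployed performance of the best model on the untouched test set
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
print("best k:", best_k, "test accuracy:", accuracy_score(y_test, final_model.predict(X_test)))
```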
NOTES

◼ Note 1: Several ML practitioners use only the training and the test sets. In this case, the test set serves the same function as the validation set discussed on the previous slide.

◼ Note 2: A dataset might be too small for this 70-15-15 split to be feasible. In such cases, 'cross-validation' is a good alternative (a brief sketch follows).
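
A brief scikit-learn sketch of picking k by 5-fold cross-validation instead of a held-out validation set (the fold count and candidate k values are example choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Average accuracy over 5 folds for each candidate k; the whole dataset is reused across folds
for k in [1, 3, 5, 7, 9]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(f"k={k}: mean CV accuracy = {scores.mean():.3f}")
```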

Performance??
◼ On the previous slides, we talked about the '... ML model that gives the best performance ...'.

◼ What do we mean by the performance of an ML model?

◼ Typically, the performance of an ML classifier refers to its accuracy. There are other measures such as sensitivity, specificity, F1-score, etc.

◼ A confusion matrix is a good way of visualizing ML classifier performance (see the sketch below).
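
An illustrative snippet computing accuracy and a confusion matrix with scikit-learn (the label vectors are made up):

```python
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = ["Book", "Book", "DVD", "DVD", "DVD", "Book"]
y_pred = ["Book", "DVD",  "DVD", "DVD", "Book", "Book"]

print("accuracy:", accuracy_score(y_true, y_pred))             # 4 of 6 correct -> ~0.67
print(confusion_matrix(y_true, y_pred, labels=["Book", "DVD"]))
# rows = true class, columns = predicted class:
# [[2 1]
#  [1 2]]
```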

Example
◼ Aim: to identify the type of flower using the features [Petal Width, Sepal Length].

◼ Measurements of 150 flowers (50 of each type: Iris setosa, Iris versicolor, Iris virginica) are already available to us (the Fisher Iris dataset).

◼ k-NN example: find the flower type for the unknown flower with the given measurements:
   P = [5.2, 1.6]

(Figure: scatter plot of the three Iris classes in this feature space)
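
A hedged scikit-learn sketch of this example. The slide does not state units or column order explicitly, so I assume P gives sepal length = 5.2 cm and petal width = 1.6 cm, and k = 5 is an arbitrary example choice:

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X = iris.data[:, [0, 3]]        # columns: sepal length (cm), petal width (cm) -- assumed feature pair
y = iris.target

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)

P = [[5.2, 1.6]]                # unknown flower: sepal length 5.2, petal width 1.6 (assumed ordering)
print(iris.target_names[knn.predict(P)[0]])   # prints the predicted species name
```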
Major References
◼ Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, 2nd Ed. (including slides).
◼ http://www.ibm.com/developerworks/library/os-weka3/
◼ M. Narasimha Murty and V. Susheela Devi, Pattern Recognition, NPTEL Notes.
◼ https://www.cse.msu.edu/~cse847/slides/knn.pptx
◼ chem-eng.utoronto.ca/~datamining/Presentations/KNN.ppt
◼ http://www.cs.uoi.gr/~tsap/teaching/2012s-cs059/material/datamining-lect10.pptx

Questions?

