Classification Using K-Nearest Neighbor
Background
Prepared By
Anand Bhosale
Supervised
• Labeled Data

  X1   X2    Class
  10   100   Square
  2    4     Root

Unsupervised
• Unlabeled Data

  X1   X2
  10   100
  2    4
Distances
• Distances are used to measure similarity.
• There are many ways to measure the distance between two instances, for example:
  – Euclidean distance
  – Hamming distance
  – Minkowski distance
  – Mahalanobis distance
Distances
• Manhattan Distance: |X1 - X2| + |Y1 - Y2|
• Euclidean Distance: √((X1 - X2)² + (Y1 - Y2)²)
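
A minimal sketch of both measures in Python (the sample coordinates are made up for illustration):

import math

def manhattan(p, q):
    # Sum of absolute coordinate differences: |x1 - x2| + |y1 - y2| + ...
    return sum(abs(a - b) for a, b in zip(p, q))

def euclidean(p, q):
    # Square root of the sum of squared coordinate differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(manhattan((1, 2), (4, 6)))  # |1-4| + |2-6| = 7
print(euclidean((1, 2), (4, 6)))  # sqrt(9 + 16) = 5.0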
Properties of Distance
• Dist(x, y) >= 0
• Dist(x, y) = Dist(y, x) (distances are symmetric)
• Detours cannot shorten distance (the triangle inequality):
  Dist(x, z) <= Dist(x, y) + Dist(y, z)
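A quick numeric check of the three properties, using Python's built-in Euclidean distance and three arbitrary points (a sketch, not a proof):

from math import dist  # Euclidean distance, Python 3.8+

x, y, z = (0, 0), (3, 4), (6, 8)

assert dist(x, y) >= 0                          # Dist(x, y) >= 0
assert dist(x, y) == dist(y, x)                 # symmetry
assert dist(x, z) <= dist(x, y) + dist(y, z)    # triangle inequality
print("all three properties hold for these points")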
Hamming Distance
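
Hamming distance counts the positions at which two equal-length sequences differ; a minimal sketch (the example strings are arbitrary):

def hamming(s, t):
    # Number of positions where the two sequences disagree
    assert len(s) == len(t), "Hamming distance needs equal-length inputs"
    return sum(a != b for a, b in zip(s, t))

print(hamming("karolin", "kathrin"))  # 3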
Distance Measures
• Distance measure: what does it mean to be "similar"?
• Minkowski Distance
  – Norm: d(x, y) = ||x - y||_m = ( Σ_{i=1..N} |x_i - y_i|^m )^(1/m)
  – Chebyshev Distance: the m → ∞ limit, d(x, y) = max_i |x_i - y_i|
  – Mahalanobis distance: d(x, y) = √((x - y)^T S^{-1} (x - y)), where S is the covariance matrix
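
A sketch of the three measures with NumPy; the vectors and the covariance matrix S below are assumptions chosen for the demo:

import numpy as np

def minkowski(x, y, m):
    # (sum |x_i - y_i|^m)^(1/m); m = 1 is Manhattan, m = 2 is Euclidean
    return np.sum(np.abs(x - y) ** m) ** (1.0 / m)

def chebyshev(x, y):
    # Limit of Minkowski as m -> infinity: the largest coordinate difference
    return np.max(np.abs(x - y))

def mahalanobis(x, y, S):
    # sqrt((x - y)^T S^-1 (x - y)), with S the covariance matrix
    d = x - y
    return np.sqrt(d @ np.linalg.inv(S) @ d)

x, y = np.array([1.0, 2.0]), np.array([4.0, 6.0])
S = np.array([[2.0, 0.0], [0.0, 2.0]])  # assumed covariance for the demo
print(minkowski(x, y, 1))   # 7.0 (Manhattan)
print(minkowski(x, y, 2))   # 5.0 (Euclidean)
print(chebyshev(x, y))      # 4.0
print(mahalanobis(x, y, S)) # ~3.54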
Nearest Neighbor and Exemplar
Exemplar: a prototypical instance chosen to represent a set of points. Common choices:
• Arithmetic Mean
• Geometric Mean
• Medoid
• Centroid
Arithmetic Mean
The arithmetic mean of x1, ..., xN is their sum divided by N: (x1 + x2 + ... + xN) / N.
Geometric Mean
A term between two terms of a geometric sequence is the geometric mean of those two terms.
Example: in the geometric sequence 4, 20, 100, ... (with a common ratio of 5), 20 is the
geometric mean of 4 and 100, since √(4 × 100) = 20.
Nearest Neighbor Search
• Given: a set P of n points in R^d
• Goal: a data structure which, given a query point q, finds the nearest neighbor p of q in P
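For small n, a linear scan is a sufficient sketch of this (specialised structures such as k-d trees answer queries faster for large n); the points below are arbitrary:

from math import dist

def nearest_neighbor(P, q):
    # Linear scan: O(n) distance computations per query
    return min(P, key=lambda p: dist(p, q))

P = [(1, 1), (4, 5), (9, 2)]
print(nearest_neighbor(P, (3, 4)))  # (4, 5)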
K-NN
• (k, l)-NN: reduce complexity by placing a threshold l on the majority; the query is associated with a class only when at least l of its k nearest neighbors agree.

K = 5
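
A sketch of this thresholded rule, assuming the usual reading of (k, l)-NN (classify only when at least l of the k nearest neighbors agree, otherwise return no class):

from collections import Counter
from math import dist

def knn_with_threshold(train, q, k=5, l=4):
    # train: list of ((x, y), label) pairs; reject unless the top class gets l votes
    neighbors = sorted(train, key=lambda item: dist(item[0], q))[:k]
    label, votes = Counter(lbl for _, lbl in neighbors).most_common(1)[0]
    return label if votes >= l else None  # None = rejected / unclassified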
K-NN
• With K = 5, select the 5 nearest neighbors by computing their Euclidean distances to the query.
K-NN
• Assign the class held by the majority of the K nearest instances; here, K = 5.
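
Combining the two steps, a minimal sketch of the full classifier (Euclidean distance, majority vote):

from collections import Counter
from math import dist

def knn_classify(train, q, k=5):
    # 1. Rank the training points by Euclidean distance to the query
    neighbors = sorted(train, key=lambda item: dist(item[0], q))[:k]
    # 2. Return the majority class among the k nearest
    return Counter(label for _, label in neighbors).most_common(1)[0][0]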
Example

Points   X1 (Acid Durability)   X2 (Strength)   Y (Classification)
P1       7                      7               BAD
P2       7                      4               BAD
P3       3                      4               GOOD
P4       1                      4               GOOD
KNN Example

Points   X1 (Acid Durability)   X2 (Strength)   Y (Classification)
P1       7                      7               BAD
P2       7                      4               BAD
P3       3                      4               GOOD
P4       1                      4               GOOD
P5       3                      7               ?
Scatter Plot
Euclidean Distance From Each Point

Euclidean distance of P5 (3, 7) from each point:
  P1 (7, 7): √((7-3)² + (7-7)²) = 4
  P2 (7, 4): √((7-3)² + (4-7)²) = 5
  P3 (3, 4): √((3-3)² + (4-7)²) = 3
  P4 (1, 4): √((1-3)² + (4-7)²) = √13 ≈ 3.61
3 Nearest Neighbours

Point       Distance from P5 (3, 7)      Class
P1 (7, 7)   √((7-3)² + (7-7)²) = 4       BAD
P2 (7, 4)   √((7-3)² + (4-7)²) = 5       BAD
P3 (3, 4)   √((3-3)² + (4-7)²) = 3       GOOD
P4 (1, 4)   √((1-3)² + (4-7)²) ≈ 3.61    GOOD

The 3 nearest neighbours of P5 are P3 (GOOD), P4 (GOOD), and P1 (BAD), so the majority class is GOOD.
KNN Classification

Points   X1 (Acid Durability)   X2 (Strength)   Y (Classification)
P1       7                      7               BAD
P2       7                      4               BAD
P3       3                      4               GOOD
P4       1                      4               GOOD
P5       3                      7               GOOD
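
The whole worked example can be reproduced in a few lines (k = 3, Euclidean distance, majority vote):

from collections import Counter
from math import dist

train = [((7, 7), "BAD"), ((7, 4), "BAD"), ((3, 4), "GOOD"), ((1, 4), "GOOD")]
q = (3, 7)  # P5

neighbors = sorted(train, key=lambda item: dist(item[0], q))[:3]
print([(p, round(dist(p, q), 2), lbl) for (p, lbl) in neighbors])
# [((3, 4), 3.0, 'GOOD'), ((1, 4), 3.61, 'GOOD'), ((7, 7), 4.0, 'BAD')]
print(Counter(lbl for _, lbl in neighbors).most_common(1)[0][0])  # GOOD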
Variation in KNN
• Different values of K can change the predicted class for the same query point.
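
A short sketch of this on the example data (note that even values of K can produce ties; here K = 4 splits the vote 2-2, and Counter breaks the tie by neighbor order):

from collections import Counter
from math import dist

train = [((7, 7), "BAD"), ((7, 4), "BAD"), ((3, 4), "GOOD"), ((1, 4), "GOOD")]
q = (3, 7)  # P5

for k in (1, 2, 3, 4):
    neighbors = sorted(train, key=lambda item: dist(item[0], q))[:k]
    vote = Counter(lbl for _, lbl in neighbors).most_common(1)[0][0]
    print(k, vote)  # odd K avoids ties on two-class data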
Thanks