Class Notes Unit 2 ML Material

Nearest neighbor-based models, particularly the k-nearest neighbors (k-NN) algorithm, use proximity or similarity between data points to make predictions in classification and regression tasks. Key concepts include distance metrics such as Euclidean, Manhattan, and Minkowski distances, whose choice significantly affects model performance. k-NN is simple and versatile but has weaknesses such as high computation costs and sensitivity to noise; it is widely applied to pattern recognition, recommender systems, and anomaly detection.


UNIT-2

Nearest Neighbor-Based Models in Machine Learning

Nearest neighbor-based models, especially the k-nearest neighbors (k-NN) algorithm, are a
fundamental class of machine learning techniques that rely on the idea of proximity or
similarity between data points for making predictions. These models are used in both
classification and regression tasks and are based on the idea that similar points are likely to
have similar labels or values. Below is a detailed breakdown of the key concepts related to
nearest neighbor-based models.

Nearest Neighbor-Based Models are non-parametric machine learning techniques that
classify data points or predict values based on the similarity (or distance) to nearby data
points. Here is a detailed explanation of the key concepts related to these models:

1. Introduction to Nearest Neighbor Models

 Nearest Neighbor methods rely on the idea that similar instances have similar outputs.
 Predictions are based on the proximity of a query point to data points in the training set.
 Common distance metrics:
o Euclidean Distance: d(x, x') = \sqrt{\sum_{i=1}^n (x_i - x'_i)^2}
o Manhattan Distance: d(x, x') = \sum_{i=1}^n |x_i - x'_i|
o Cosine Similarity: Measures the cosine of the angle between two vectors.

2. K-Nearest Neighbors (K-NN) for Classification

 K-NN is a simple, instance-based algorithm that classifies a data point based on the majority
label of its k-nearest neighbors.
 Algorithm (a minimal code sketch is given below):
1. Choose a value for k (the number of neighbors).
2. Compute the distances between the query point and all points in the training set.
3. Identify the k closest points.
4. Assign the class label most common among these k points.
 Key Considerations:
1. A larger k smoothens the decision boundary (reduces variance but may increase
bias).
2. A small k may lead to overfitting.
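The following is a minimal Python sketch of this procedure (not part of the original notes); the training points, labels, and query point are made-up illustrative values.

import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_query, k=3):
    # Compute Euclidean distances from the query point to every training point
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote among the k nearest labels
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data (illustrative only)
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [5.0, 5.0], [6.0, 5.5]])
y_train = ["A", "A", "B", "B"]

print(knn_classify(X_train, y_train, np.array([1.2, 1.5]), k=3))  # expected "A"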

Distance Metrics

 The choice of distance metric can significantly impact model performance:


o Euclidean: Best for continuous, normalized data.
o Manhattan: Works well for high-dimensional or grid-like data.
o Hamming: Suitable for categorical data.
o Minkowski: Generalized distance metric: d(x, x') = \left( \sum_{i=1}^n |x_i - x'_i|^p \right)^{1/p}. Special cases: p = 2 (Euclidean), p = 1 (Manhattan).

5. Strengths and Weaknesses of K-NN

Strengths:

 Simple and Intuitive: No training phase (lazy learning).


 Versatile: Can handle classification and regression tasks.
 Adaptable: Works with different distance metrics.

Weaknesses:

 High Computation Cost: Requires computing distances to all training points at inference.
 Memory Intensive: Stores the entire training set.
 Curse of Dimensionality: Performance deteriorates in high-dimensional spaces.
 Sensitive to Noise: Outliers can affect predictions.

6. Applications of Nearest Neighbor Models

 Pattern Recognition: Handwriting and facial recognition.


 Recommender Systems: Collaborative filtering.
 Anomaly Detection: Identifying outliers.
 Medical Diagnosis: Classifying symptoms.

Introduction to Proximity Measures

Proximity measures refer to techniques that quantify the similarity or closeness between two
data points. The idea is that if two data points are close to each other in the feature space,
they are more likely to belong to the same class (in classification) or have similar values (in
regression). These proximity measures are essential for nearest neighbor algorithms.

Proximity measures in machine learning quantify the similarity or dissimilarity between data
points, which are often represented as vectors. These measures are fundamental in tasks like
clustering, classification, and anomaly detection, where the relationship between data points
is crucial.

Why Are Proximity Measures Important?

1. Clustering: In algorithms like k-means or hierarchical clustering, proximity determines how
data points are grouped.
2. Classification: For instance, in k-Nearest Neighbors (k-NN), the "closeness" of a test sample
to training samples decides its label.
3. Anomaly Detection: Dissimilarity from the majority helps identify outliers.
4. Recommender Systems: Proximity between user/item profiles drives recommendations.

Types of Proximity Measures

Proximity measures can be categorized based on the data types they are applied to:

1. Distance Metrics

These measure the "distance" between two points, treating closeness as the inverse of
distance.

Euclidean Distance
Euclidean Distance gives the distance between any two points in an n-dimensional space.
The Euclidean distance between two points in Euclidean space is defined as the length of the
line segment joining the two points.
Euclidean distance is like measuring the straightest and shortest path between two
points. Imagine you stretch a string tight between two points on a map; the
length of that string is the Euclidean distance.

Euclidean Distance Formula


Consider two points (x1, y1) and (x2, y2) in a 2-dimensional space; the Euclidean Distance
between them is given by using the formula:
d = √[(x2 - x1)^2 + (y2 - y1)^2]
Where,
 d is Euclidean Distance
 (x1, y1) is Coordinate of the first point
 (x2, y2) is Coordinate of the second point

Euclidean Distance in 3D
If the two points (x1, y1, z1) and (x2, y2, z2) are in a 3-dimensional space, the Euclidean
Distance between them is given by using the formula:
d = √[(x2 - x1)^2 + (y2 - y1)^2 + (z2 - z1)^2]
where,
 d is Euclidean Distance
 (x1, y1, z1) is Coordinate of the first point
 (x2, y2, z2) is Coordinate of the second point

Euclidean Distance in nD
In general, the Euclidean Distance formula between two points (x11, x12, x13, ...., x1n) and
(x21, x22, x23, ...., x2n) in an n-dimensional space is given by the formula:
d = √[∑(x2i – x1i)^2]
Where,
 i Ranges from 1 to n
 d is Euclidean distance
 (x11, x12, x13, ...., x1n) is Coordinate of First Point
 (x21, x22, x23, ...., x2n) is Coordinate of Second Point

Euclidean Distance Formula Derivation


Euclidean Distance Formula is derived by following the steps added below:
Step 1: Let us consider two points, A (x1, y1) and B (x2, y2), and let d be the distance
between the two points.
Step 2: Join the points using a straight line (AB).
Step 3: Now, construct a right-angled triangle whose hypotenuse is AB.

Step 4: Now, using the Pythagoras theorem we know that,
(Hypotenuse)^2 = (Base)^2 + (Perpendicular)^2
⇒ d^2 = (x2 – x1)^2 + (y2 – y1)^2
Now, taking the square root on both sides of the equation, we get
d = √[(x2 – x1)^2 + (y2 – y1)^2]

Example 1: A triangle has vertices at points A(2, 3), B(5, 7), and C(8, 1). Find
the length of the longest side of the triangle.
Given, the points A(2, 3), B(5, 7), and C(8, 1) are the vertices of the triangle.
Distance between points A and B:
AB = √[(5-2)^2 + (7-3)^2]
⇒ AB = √[9+16] = √25
AB = 5 units
Distance between points B and C:
BC = √[(8-5)^2 + (1-7)^2]
⇒ BC = √[9+36] = √45
BC = 6.708 units
Distance between points C and A:
CA = √[(8-2)^2 + (1-3)^2]
⇒ CA = √[36+4] = √40
CA = 6.325 units
Therefore, the length of the longest side of the triangle is 6.708 units.
Example 2: Mathematically prove that the Euclidean distance is a non-negative value.
Consider two points (x1, y1) and (x2, y2) in a 2-dimensional space; the Euclidean
Distance between them is given by the formula:
d = √[(x2 – x1)^2 + (y2 – y1)^2]
We know that squares of real numbers are always non-negative.
⇒ (x2 – x1)^2 >= 0 and (y2 – y1)^2 >= 0
As the square root of a non-negative number is also non-negative,
⇒ √[(x2 – x1)^2 + (y2 – y1)^2] >= 0
Therefore the Euclidean distance is a non-negative value; it can never be negative.

Example 3: Show that the points A(0, 0), B(4, 0), and C(2, 2√3) are the vertices of an
Equilateral Triangle.
To prove that these three points form an equilateral triangle, we need to show
that the distances between all pairs of points, i.e., AB, BC, and CA, are equal.
Distance between points A and B:
AB = √[(4-0)^2 + (0-0)^2]
⇒ AB = √16
AB = 4 units
Distance between points B and C:
BC = √[(2-4)^2 + (2√3-0)^2]
⇒ BC = √[4+12] = √16
BC = 4 units
Distance between points C and A:
CA = √[(0-2)^2 + (0-2√3)^2]
⇒ CA = √[4+12] = √16
CA = 4 units
Here, we can observe that all three distances, AB, BC, and CA, are equal.
Therefore, the given triangle is an Equilateral Triangle.

Code:
import numpy as np
# Define points
point_a = np.array([3, 4])
point_b = np.array([0, 0])
# Compute Euclidean distance
distance = np.linalg.norm(point_a - point_b)
print("Euclidean Distance:", distance)

Manhattan Distance (L1 Norm):


Manhattan distance is a distance measure that is calculated by taking the sum of the
absolute differences between the coordinates of the two points.

The Manhattan distance is also known as Manhattan length. In other words, it is


the distance between two points measured along axes at right angles.

Manhattan distance works very well for high-dimensional datasets. As it does not
take any squares, it does not amplify the differences between any of the features. It
also does not ignore any features.

# calculating manhattan distance between vectors


from scipy.spatial.distance import cityblock
# define data
row1 = [10, 20, 15, 10, 5]
row2 = [12, 24, 18, 8, 7]
# calculate distance
dist = cityblock(row1, row2)
print(dist)
Running the example prints the Manhattan distance between the two vectors:

13

This Manhattan distance metric is also known as Manhattan length,
rectilinear distance, L1 distance or L1 norm, city block distance,
Minkowski's L1 distance, or the taxi-cab metric.

Applications of Manhattan distance metric include,

1. Regression analysis: It is used in the linear regression to find a straight


line that fits a given set of points
2. Compressed sensing: In solving an underdetermined system of linear
equations, the regularisation term for the parameter vector is expressed
in terms of Manhattan distance. This approach appears in the signal
recovery framework called compressed sensing
3. Frequency distribution: It is used to assess the differences in discrete
frequency distributions.

Minkowski distance:
Minkowski distance is a distance measured between two points in N-dimensional
space. It is basically a generalization of the Euclidean distance and the Manhattan
distance. It is widely used in the field of Machine learning, especially in the
concept to find the optimal correlation or classification of data. Minkowski
distance is used in certain algorithms also like K-Nearest Neighbors, Learning
Vector Quantization (LVQ), Self-Organizing Map (SOM), and K-Means
Clustering.
Let us consider two points represented as vectors X = (X1, X2, ..., Xn) and Y = (Y1, Y2, ..., Yn);
the Minkowski distance between them is given by ( |X1 – Y1|^p + |X2 – Y2|^p + ... + |Xn – Yn|^p )^(1/p).
In R, the Minkowski distance is calculated with respect to vectors.

For example,

we are given two vectors, vect1 = (4, 2, 6, 8) and vect2 = (5, 1, 7, 9). Their
Minkowski distance for p = 2 is given by ( |4 – 5|^2 + |2 – 1|^2 + |6 – 7|^2 + |8 – 9|^2 )^(1/2),
which is equal to 2. The following shows how we can calculate the
Minkowski distance in R.
Method 1: Using a custom function
We can calculate the Minkowski distance between a pair of vectors by applying the
formula:
( Σ |vector1_i – vector2_i|^p )^(1/p)
Here,
vector1 is the first vector
vector2 is the second vector
p is an integer
code:
calculateMinkowskiDistance <- function(vect1, vect2, p) {

# Initializing answer variable as 0


answer <- as.integer(0)

# Iterating over the length of the vector


# Using for-in loop
for (index in 1 : length(vect1))
{
# temp stores the absolute difference raised to power p
temp = abs(vect1[index] - vect2[index]) ^ p

# Updating answer variable


answer = sum(temp, answer)
}

# The final answer would be answer raised to


# power 1 / p
answer = answer ^ (1 / p)

# Return the answer


return(answer)
}

# Initializing a vector
vect1 <- c(1, 3, 5, 7)

# Initializing another vector


vect2 <- c(2, 4, 6, 8)

# Set p equal to 1
p <- as.integer(1)

# Call the function to calculate MinkowskiDistance


distance = calculateMinkowskiDistance(vect1, vect2, p)

# Print the calculated distance


print(paste("The Minkowski distance between vect1\
and vect2 having the value of p =",p, "is", distance ))

# Set p equal to 2
p <- as.integer(2)

# Call the function to calculate MinkowskiDistance


distance = calculateMinkowskiDistance(vect1, vect2, p)

# Print the calculated distance


print(paste("The Minkowski distance between vect1\
and vect2 having the value of p =",p, "is", distance ))

# Set p equal to 3
p <- as.integer(3)

# Call the function to calculate MinkowskiDistance


distance = calculateMinkowskiDistance(vect1, vect2, p)
# Print the calculated distance
print(paste("The Minkowski distance between vect1\
and vect2 having the value of p =",p, "is", distance ))

# Set p equal to 4
p <- as.integer(4)

# Call the function to calculate MinkowskiDistance


distance = calculateMinkowskiDistance(vect1, vect2, p)

# Print the calculated distance


print(paste("The Minkowski distance between vect1 \
and vect2 having the value of p =",p, "is", distance ))

output: the Minkowski distances between vect1 and vect2 for p = 1, 2, 3, and 4 are 4, 2, ≈1.587, and ≈1.414, respectively.

Hamming Distance calculates the distance between two binary vectors, also referred to
as binary strings or bitstrings for short.

You are most likely going to encounter bitstrings when you one-hot encode categorical columns
of data.
For example, if a column had the categories ‘red,’ ‘green,’ and ‘blue,’ you might one hot encode
each example as a bitstring with one bit for each column.
 red = [1, 0, 0]
 green = [0, 1, 0]
 blue = [0, 0, 1]
The distance between red and green could be calculated as the sum or the average number of bit
differences between the two bitstrings. This is the Hamming distance.

For one-hot encoded strings, it often makes more sense to report the sum of the bit
differences between the strings, where each bit position contributes either a 0 or a 1.

 HammingDistance = sum for i to N abs(v1[i] – v2[i])


For bitstrings that may have many 1 bits, it is more common to calculate the average number of
bit differences to give a hamming distance score between 0 (identical) and 1 (all different).

 HammingDistance = (sum for i to N abs(v1[i] – v2[i])) / N


We can demonstrate this with an example of calculating the Hamming distance between two
bitstrings, listed below.

calculating hamming distance between bit strings

# calculate hamming distance


def hamming_distance(a, b):
return sum(abs(e1 - e2) for e1, e2 in zip(a, b)) / len(a)

# define data
row1 = [0, 0, 0, 0, 0, 1]
row2 = [0, 0, 0, 0, 1, 0]
# calculate distance
dist = hamming_distance(row1, row2)
print(dist)

Distance Measures
Distance measures are specific types of proximity measures that calculate the physical
distance between data points in a multi-dimensional feature space. The choice of distance
measure can have a significant impact on the performance of the nearest neighbor algorithm.
Common examples are the Euclidean, Manhattan, Minkowski, and Hamming distances discussed above.

Non-Metric Similarity Functions
Not all similarity measures correspond to a "distance" in the strict mathematical sense. These
are non-metric similarity functions, often used in cases where the data is not naturally metric
(e.g., categorical or nominal data).

Examples of Non-Metric Similarity Functions

Jaccard Similarity, also called the Jaccard Index or Jaccard Coefficient, is a
simple measure of the similarity between data samples. The
similarity is computed as the ratio of the size of the intersection of the
data samples to the size of the union of the data samples.
It is represented as –
J(A, B) = |A Ո B| / |A U B|
It is used to find the similarity or overlap between the two binary vectors or
numeric vectors or strings. It can be represented as J. There is also a
closely related term associated with Jaccard Similarity which is called
Jaccard Dissimilarity or Jaccard Distance. Jaccard Distance is a measure
of dissimilarity between data samples and can be represented as (1 – J)
where J is Jaccard Similarity.
Common Applications of Jaccard Similarity:
Jaccard Similarity is used in multiple data science and machine learning
applications. Some of the frequent use cases encountered in real life
include :
 Text mining: finding the similarity between two text documents based on
the number of terms used in both documents.
 E-Commerce: finding similar customers via their purchase history from a
sales database of thousands of customers and millions of items.
 Recommendation Systems: Finding similar customers based on ratings
and reviews e.g., Movie recommendation algorithms, Product
recommendation, diet recommendation, matrimony recommendations,
etc.
Jaccard Similarity Formula and Concepts:
Jaccard Similarity values range from 0 to 1. The higher the value, the
more similar the two datasets are. Although it is easy to
interpret, it is extremely sensitive to small sample sizes and can
give erroneous results, so one needs to be careful when
interpreting the results.
Jaccard Similarity for Numeric Sets:
Jaccard Similarity (J) = ( count of common elements in both sets) / ( count
of elements in first set + count of elements in second set – count of
common elements in both sets)
Where (count of elements in first set + count of elements in the second set
– count of common elements in both sets) = count of total unique
elements in both the sets.
Considering A and B as two sets, it can be represented in symbolic form
as
J(A, B) = |A Ո B| / |A U B| = |A Ո B| / (|A| + |B| - |A Ո B|)
Example:
In this example, we consider two sets A and B, where Set A = {5, 10, 15, 20, 25, 30, 35, 40, 45, 50}
and Set B = {10, 20, 30, 40, 50, 60, 70, 80, 90, 100}.
A Ո B = {10, 20, 30, 40, 50}, i.e., |A Ո B| = 5.
A U B = {5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 60, 70, 80, 90, 100}, i.e., |A U B| = 15.
Therefore, J(A, B) = |A Ո B| / |A U B| = 5 / 15 = 0.333. A short code sketch of this computation is given below.
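The following Python sketch (added for illustration, not part of the original notes) reproduces the set example above:

# Jaccard similarity between two sets: |A ∩ B| / |A ∪ B|
def jaccard_similarity(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

A = {5, 10, 15, 20, 25, 30, 35, 40, 45, 50}
B = {10, 20, 30, 40, 50, 60, 70, 80, 90, 100}
print(jaccard_similarity(A, B))  # 5 / 15 = 0.333...
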
Cosine similarity
It is a metric that helps determine how similar two data objects are,
irrespective of their size. We can measure the similarity between two
sentences in Python using cosine similarity. In cosine similarity, data
objects in a dataset are treated as vectors. The formula to find the cosine
similarity between two vectors is:
Formula:
S_C(x, y) = x . y / (||x|| × ||y||)
where,
 x . y = dot product of the vectors ‘x’ and ‘y’.
 ||x|| and ||y|| = length (magnitude) of the two vectors ‘x’ and ‘y’.
 ||x|| × ||y|| = product of the magnitudes of the two vectors ‘x’ and ‘y’.

Example
Consider an example to find the similarity between two vectors, ‘x’ and ‘y’, using Cosine Similarity.
The ‘x’ vector has values x = {3, 2, 0, 5} and the ‘y’ vector has values y = {1, 0, 0, 0}.
The formula for calculating the cosine similarity is: S_C(x, y) = x . y / (||x|| × ||y||)
x . y = 3*1 + 2*0 + 0*0 + 5*0 = 3
||x|| = √((3)^2 + (2)^2 + (0)^2 + (5)^2) = 6.16
||y|| = √((1)^2 + (0)^2 + (0)^2 + (0)^2) = 1
∴ S_C(x, y) = 3 / (6.16 * 1) = 0.49
The dissimilarity between the two vectors ‘x’ and ‘y’ is given by:
∴ D_C(x, y) = 1 - S_C(x, y) = 1 - 0.49 = 0.51

 The cosine similarity between two vectors is measured by the angle ‘θ’ between them.
 If θ = 0°, the ‘x’ and ‘y’ vectors overlap, thus proving they are similar.
 If θ = 90°, the ‘x’ and ‘y’ vectors are dissimilar.
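A minimal numpy sketch (added for illustration, not part of the original notes) that reproduces the numbers in this example:

import numpy as np

def cosine_similarity(x, y):
    # Dot product divided by the product of the vector magnitudes
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

x = np.array([3, 2, 0, 5])
y = np.array([1, 0, 0, 0])
sim = cosine_similarity(x, y)
print("Cosine similarity:", round(sim, 2))         # ≈ 0.49
print("Cosine dissimilarity:", round(1 - sim, 2))  # ≈ 0.51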

Cosine Similarity between two vectors

Advantages
 The cosine similarity is beneficial because even if two similar data
objects are far apart by the Euclidean distance because of their size,
they could still have a smaller angle between them. The smaller the angle,
the higher the similarity.
 When plotted on a multi-dimensional space, the cosine similarity
captures the orientation (the angle) of the data objects and not the
magnitude.

Edit Distance (Levenshtein Distance)

 Definition: Measures the minimum number of edit operations (insertions, deletions,


substitutions) required to transform one string into another.
 Use Case: Text similarity, spell-checking, and DNA sequence analysis.
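A short dynamic-programming sketch in Python (added for illustration; the function name levenshtein is our own):

def levenshtein(s, t):
    # prev[j] holds the edit distance between the current prefix of s and t[:j]
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3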

Kernel-Based Similarities

Kernel functions project data into a higher-dimensional space and compute similarity there,
often using non-metric formulations.

 RBF Kernel (Radial Basis Function):

K(x, y) = \exp\left(-\frac{\|x - y\|^2}{2\sigma^2}\right)

While it resembles a metric function, it measures similarity, not distance.

 String Kernels: Compare substrings within strings for text or sequence data.

Information-Theoretic Measures

 KL Divergence (Kullback-Leibler Divergence): Measures the difference between
two probability distributions:

D_{KL}(P \| Q) = \sum_{i} P(i) \log \frac{P(i)}{Q(i)}

It is non-symmetric and does not satisfy the triangle inequality. (A numeric sketch is given after this list.)

 Mutual Information: Quantifies the shared information between two variables:

I(X; Y) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}
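For illustration (not from the original notes), a minimal numpy computation of the KL divergence for two small, made-up discrete distributions P and Q:

import numpy as np

# Two discrete probability distributions over the same three outcomes
P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.4, 0.4, 0.2])

# KL divergence: sum_i P(i) * log(P(i) / Q(i))
kl_pq = np.sum(P * np.log(P / Q))
kl_qp = np.sum(Q * np.log(Q / P))

print("D_KL(P || Q):", kl_pq)
print("D_KL(Q || P):", kl_qp)  # differs from D_KL(P || Q): KL is not symmetric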

Applications of Non-Metric Similarity Functions

1. Text Processing:
o Edit distance for spell-checking and plagiarism detection.
o Jaccard similarity for duplicate document detection.

2. Recommendation Systems:
o Jaccard similarity for collaborative filtering.
o Soft matching for user profiles.
3. Bioinformatics:
o Sequence alignment using Levenshtein distance.
o Tanimoto coefficient for molecular similarity.

4. Clustering:
o Non-metric measures like mutual information for grouping categorical data.

5. Medical Imaging:
o Dice similarity coefficient for segmentation evaluation.

Advantages of Non-Metric Similarity Functions

 Tailored to specific domains or data types.


 Handle complex relationships beyond simple geometric distances.
 Offer greater flexibility for structured and contextual data.

Proximity Between Binary Patterns

For binary patterns (data with values of 0 or 1), the distance or similarity can be computed
using measures that specifically handle binary data:

 Hamming Distance: The number of differing bits between two binary patterns.
 Jaccard Similarity: Works well for comparing binary sets, focusing on the
intersection over the union of two sets.
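A small illustrative sketch (the two binary patterns are made-up values) comparing these two measures on the same pair of binary vectors:

# Two binary patterns of equal length (illustrative values)
a = [1, 0, 1, 1, 0, 1]
b = [1, 1, 1, 0, 0, 1]

# Hamming distance: number of positions where the bits differ
hamming = sum(x != y for x, y in zip(a, b))

# Jaccard similarity on binary vectors:
# |positions where both are 1| / |positions where at least one is 1|
both = sum(x == 1 and y == 1 for x, y in zip(a, b))
either = sum(x == 1 or y == 1 for x, y in zip(a, b))
jaccard = both / either

print("Hamming distance:", hamming)    # 2
print("Jaccard similarity:", jaccard)  # 3 / 5 = 0.6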

Different Classification Algorithms Based on the Distance Measures

Several classification algorithms rely on proximity measures to classify data points:

1. k-Nearest Neighbors Classifier (k-NN):


o This is the most common classification algorithm based on distance measures.
o Algorithm:
1. Choose the number of neighbors k.
2. For a new data point, calculate the distance to all training data points.
3. Identify the k-nearest neighbors.
4. Assign the class label based on the majority vote of the k neighbors.
2. Radius Nearest Neighbor (RNN):
o Similar to k-NN, but instead of selecting a fixed number of nearest neighbors, a
radius r is specified.
o Algorithm:
1. Choose a radius r.
2. For a new data point, find all training points within radius r.
3. Assign the class label based on the majority vote of all points within
the radius.
o This method is useful when data is sparse and you want to avoid having a
fixed k.
3. Weighted k-NN:
o A variation where closer neighbors have more influence on the decision.
o For example, use inverse distance weighting to weight closer neighbors
higher.

K-Nearest Neighbor Classifier

The k-Nearest Neighbors (k-NN) algorithm is one of the most straightforward and widely
used classification algorithms. It classifies new data points based on the majority class among
their k-nearest neighbors. A worked numerical example follows the algorithm steps below.

Algorithm Steps

Input:

 Training dataset: D = \{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}
 Query point: x_q
 Number of neighbors: k
 Distance metric: d(x_i, x_j) (e.g., Euclidean, Manhattan).

Procedure:

1. Compute Distances:
o Calculate the distance d(x_q, x_i) between the query point x_q
and each data point x_i in the training set.
2. Sort Neighbors:
o Sort the training points by their distance to x_q.
3. Select Nearest Neighbors:
o Select the k-nearest neighbors (smallest distances).
4. Majority Vote:
o Determine the majority class label among the k-nearest neighbors.
5. Output:
o Assign the query point x_q to the majority class.

3. Distance Metrics

1. Euclidean Distance (default for continuous features):
d(x_i, x_j) = \sqrt{\sum_{k=1}^n (x_{ik} - x_{jk})^2}
2. Manhattan Distance:
d(x_i, x_j) = \sum_{k=1}^n |x_{ik} - x_{jk}|
3. Cosine Similarity (for text or high-dimensional data):
\text{Cosine similarity} = \frac{\vec{x}_i \cdot \vec{x}_j}{\|\vec{x}_i\| \|\vec{x}_j\|}
Lower similarity corresponds to greater distance.
4. Hamming Distance (for categorical data):
d(x_i, x_j) = \text{Number of differing attributes between } x_i \text{ and } x_j

4. Choosing k

 Small k:
o Captures local patterns but may overfit (sensitive to noise).
 Large k:
o Smoothens predictions but may underfit (ignores local structures).
 Optimal k is usually determined via cross-validation, as sketched below.
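A minimal scikit-learn sketch of selecting k by cross-validation (illustrative only; it assumes scikit-learn is available and uses a small synthetic two-class dataset):

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Made-up two-class dataset for illustration
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Search over several values of k using 5-fold cross-validation
grid = GridSearchCV(KNeighborsClassifier(),
                    param_grid={"n_neighbors": [1, 3, 5, 7, 9, 11]},
                    cv=5)
grid.fit(X, y)
print("Best k:", grid.best_params_["n_neighbors"])
print("Cross-validated accuracy:", round(grid.best_score_, 3))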

Example

Dataset:

Training data:

D = {(1, A), (2, A), (4, B), (5, B), (6, A)}

Query point: x_q = 3, k = 3.

Steps:

1. Compute Distances:
o d(1, 3) = 2, d(2, 3) = 1, d(4, 3) = 1, d(5, 3) = 2, d(6, 3) = 3.
2. Sort Neighbors:
o Sorted by distance: (2, A), (4, B), (1, A), (5, B), (6, A).
3. Select the k = 3 Nearest Neighbors:
o Neighbors: (2, A), (4, B), (1, A).
4. Majority Vote:
o Class labels: A, B, A.
o Majority class: A.
5. Prediction:
o Assign x_q = 3 to class A. (A code sketch reproducing this example is given below.)
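A short Python sketch (added for illustration) that reproduces this one-dimensional example:

from collections import Counter

# Training data: (feature value, class label)
D = [(1, "A"), (2, "A"), (4, "B"), (5, "B"), (6, "A")]
x_q, k = 3, 3

# Sort training points by absolute distance to the query point and keep the k closest
neighbors = sorted(D, key=lambda pair: abs(pair[0] - x_q))[:k]

# Majority vote among the k nearest labels
prediction = Counter(label for _, label in neighbors).most_common(1)[0][0]
print(neighbors)   # [(2, 'A'), (4, 'B'), (1, 'A')]
print(prediction)  # 'A'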

Radius Distance Nearest Neighbor Algorithm

The Radius Nearest Neighbor (RNN) algorithm is an extension of the k-NN approach but
with a focus on the radius rather than the number of neighbors.

Algorithm Steps

Input:

 Training data: D = \{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}
 Query point: x_q
 Radius: r
 Distance metric: d(x_i, x_q) (e.g., Euclidean, Manhattan).

Procedure:

1. Compute Distances:
o Calculate the distance d(x_i, x_q) between the query point x_q
and each data point x_i in the training set.
2. Identify Neighbors:
o Select all points x_i such that d(x_i, x_q) ≤ r.
3. Prediction:
o Classification:
 Use a majority vote among the labels of the selected neighbors.
 If no neighbors are found within r, a default class can be assigned, or a
larger radius can be used.
o Regression:
 Compute the mean (or weighted mean) of the target values y_i of the
selected neighbors.

Output:

 Predicted class or value for x_q.

3. Pseudocode
Algorithm RDNN:
Input: Training data D = {(x1, y1), (x2, y2), ..., (xn, yn)}, query point
x_q, radius r, distance metric d.

1. Initialize an empty list of neighbors: Neighbors = []


2. For each (x_i, y_i) in D:
Compute distance: dist = d(x_i, x_q)
If dist ≤ r:
Add (x_i, y_i) to Neighbors
3. If Neighbors is empty:
Return default prediction (or use larger radius)
4. If classification:
Return majority label among Neighbors
Else if regression:
Return mean(target values of Neighbors)
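A minimal Python sketch that follows the pseudocode above (illustrative only; the training points, radius, and default label are assumptions):

from collections import Counter

def radius_nn_classify(D, x_q, r, default="unknown"):
    # Collect the labels of all training points within radius r of the query point
    neighbors = [label for x, label in D if abs(x - x_q) <= r]
    if not neighbors:
        return default  # no neighbors inside the radius: fall back to a default class
    # Majority vote among the neighbors' labels
    return Counter(neighbors).most_common(1)[0][0]

# Toy one-dimensional training data (illustrative values)
D = [(1, "A"), (2, "A"), (4, "B"), (5, "B"), (6, "A")]
print(radius_nn_classify(D, x_q=3, r=3))  # points 1, 2, 4, 5, 6 fall within radius 3 -> majority "A"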

KNN Regression

In k-Nearest Neighbors Regression (k-NN regression), the output is continuous rather than
categorical. The prediction is made based on the average (or weighted average) of the target
values of the k-nearest neighbors.

Steps of the Algorithm

Input:

1. Training dataset D = \{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}:
o x_i represents the feature vector of the i-th data point.
o y_i represents the target value of the i-th data point.
2. Query point x_q.
3. Number of neighbors k (a hyperparameter).
4. A distance metric (e.g., Euclidean distance, Manhattan distance).

Procedure:

1. Compute Distances:
o Calculate the distance d(x_q, x_i) between the query point x_q
and each training point x_i.
2. Find the k-Nearest Neighbors:
o Identify the k-nearest neighbors by selecting the k points with the smallest
distances to x_q.
3. Aggregate Neighbor Target Values:
o Simple Average:
 Predict the target value as the mean of the target values of the k-nearest
neighbors: \hat{y}_q = \frac{1}{k} \sum_{i=1}^k y_i
o Weighted Average:
 Assign weights inversely proportional to the distances:
\hat{y}_q = \frac{\sum_{i=1}^k w_i y_i}{\sum_{i=1}^k w_i}, \quad w_i = \frac{1}{d(x_q, x_i) + \epsilon}
Here, \epsilon is a small constant to prevent division by zero.
Output:

 Predicted target value \hat{y}_q for the query point x_q.

Pseudocode
Algorithm KNN-Regression:
Input: Training data D = {(x1, y1), ..., (xn, yn)}, query point x_q, number
of neighbors k, distance metric d.

1. Compute distances:
For each (x_i, y_i) in D:
dist[i] = d(x_i, x_q)

2. Sort training data by distances:


Sort D based on ascending values of dist.

3. Select k-nearest neighbors:


Neighbors = first k points from sorted D.

4. Aggregate target values:


If using simple average:
Predicted value = Mean(target values of Neighbors)
If using weighted average:
Predicted value = Weighted Mean(target values of Neighbors)

5. Return Predicted value.
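A compact numpy sketch of k-NN regression supporting both simple and distance-weighted averaging (illustrative; the toy data and the epsilon value are assumptions):

import numpy as np

def knn_regress(X_train, y_train, x_q, k=3, weighted=False, eps=1e-8):
    # Distances from the query point to every training point
    dists = np.linalg.norm(X_train - x_q, axis=1)
    idx = np.argsort(dists)[:k]           # indices of the k nearest neighbors
    if not weighted:
        return y_train[idx].mean()        # simple average of neighbor targets
    w = 1.0 / (dists[idx] + eps)          # inverse-distance weights
    return np.sum(w * y_train[idx]) / np.sum(w)

# Toy 1-D regression data: y is roughly 2x
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y_train = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

print(knn_regress(X_train, y_train, np.array([2.5]), k=3))                 # simple average
print(knn_regress(X_train, y_train, np.array([2.5]), k=3, weighted=True))  # weighted average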

Performance of Classifiers

The performance of nearest neighbor classifiers depends on several factors:

1. Choice of k: Too small a k can lead to noisy predictions (overfitting), while too
large a k can make the model too smooth (underfitting).
2. Distance Measure: The choice of distance measure can heavily affect the accuracy of
the classifier, especially when features are on different scales.
3. Dimensionality: As the number of features (dimensions) increases, the model can
struggle due to the curse of dimensionality.

Performance of Regression Algorithms

Evaluating the performance of regression algorithms is essential to understand how well the model
predicts continuous target variables. The choice of metric depends on the problem's context and the
specific goals, such as minimizing large errors, understanding variability, or achieving interpretable
results.
1. Common Regression Metrics

1.1 Mean Absolute Error (MAE)

 Measures the average magnitude of the absolute errors between predicted values (\hat{y}_i)
and actual values (y_i).
 Formula: \text{MAE} = \frac{1}{n} \sum_{i=1}^n |y_i - \hat{y}_i|
 Advantages:
o Simple to interpret.
o Less sensitive to outliers than Mean Squared Error.
 Disadvantages:
o Doesn't penalize large errors as heavily as other metrics.

1.2 Mean Squared Error (MSE)

 Measures the average of the squared differences between predicted and actual values.
 Formula: \text{MSE} = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2
 Advantages:
o Penalizes large errors more than small ones.
 Disadvantages:
o Can be heavily influenced by outliers.
o Units are squared, making interpretation less intuitive.

1.3 Root Mean Squared Error (RMSE)

 The square root of MSE; expresses error in the same units as the target variable.
 Formula: \text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2}
 Advantages:
o Penalizes large errors more than MAE.
o Easier to interpret than MSE.
 Disadvantages:
o Still sensitive to outliers.

1.4 R-Squared (R^2) or Coefficient of Determination

 Measures the proportion of variance in the target variable explained by the model.
 Formula: R^2 = 1 - \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{\sum_{i=1}^n (y_i - \bar{y})^2}, where \bar{y} is the
mean of the actual values.
 Range:
o R^2 = 1: Perfect fit.
o R^2 = 0: Model predicts as well as the mean.
o R^2 < 0: Model performs worse than the mean prediction.
 Advantages:
o Intuitive measure of goodness-of-fit. (A code sketch computing these metrics is given below.)
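A small numpy sketch (added for illustration; the example arrays are made up) computing MAE, MSE, RMSE, and R² for one set of predictions:

import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0, 4.5])
y_pred = np.array([2.8, 5.4, 2.0, 6.5, 5.0])

errors = y_true - y_pred
mae  = np.mean(np.abs(errors))                   # Mean Absolute Error
mse  = np.mean(errors ** 2)                      # Mean Squared Error
rmse = np.sqrt(mse)                              # Root Mean Squared Error
ss_res = np.sum(errors ** 2)                     # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot                         # coefficient of determination

print(f"MAE={mae:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}  R^2={r2:.3f}")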
