Class Notes Unit 2 ML Material
Nearest neighbor-based models, especially the k-nearest neighbors (k-NN) algorithm, are a
fundamental class of machine learning techniques that rely on the idea of proximity or
similarity between data points for making predictions. These models are used in both
classification and regression tasks and are based on the idea that similar points are likely to
have similar labels or values. Below is a detailed breakdown of the key concepts related to
nearest neighbor-based models.
Nearest Neighbor methods rely on the idea that similar instances have similar outputs.
Predictions are based on the proximity of a query point to data points in the training set.
Common distance metrics:
o Euclidean Distance: d(x, x') = \sqrt{\sum_{i=1}^{n} (x_i - x_i')^2}
o Manhattan Distance: d(x, x') = \sum_{i=1}^{n} |x_i - x_i'|
o Cosine Similarity: Measures the cosine of the angle between two vectors.
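As a quick illustration, here is a minimal NumPy sketch that computes these three measures for a pair of points; the vectors are invented for demonstration and are not taken from the notes.
Code:
import numpy as np

# Two illustrative feature vectors (values chosen for demonstration only)
x = np.array([2.0, 1.0, 4.0])
y = np.array([3.0, 5.0, 1.0])

# Euclidean distance: square root of the summed squared differences
euclidean = np.sqrt(np.sum((x - y) ** 2))

# Manhattan distance: sum of absolute differences
manhattan = np.sum(np.abs(x - y))

# Cosine similarity: dot product divided by the product of the norms
cosine = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

print("Euclidean:", euclidean)
print("Manhattan:", manhattan)
print("Cosine similarity:", cosine)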
K-NN is a simple, instance-based algorithm that classifies a data point based on the majority
label of its k nearest neighbors.
Algorithm:
1. Choose a value for k (the number of neighbors).
2. Compute the distances between the query point and all points in the training set.
3. Identify the k closest points.
4. Assign the class label most common among these k points.
Key Considerations:
1. A larger k smooths the decision boundary (reduces variance but may increase bias).
2. A small k may lead to overfitting.
Distance Metrics: the metric chosen (Euclidean, Manhattan, cosine) determines which points count as "near"; features should be scaled so that no single feature dominates the distance.
Strengths:
Simple to Implement: No explicit training phase; the model is just the stored data ("lazy learning").
Non-Parametric: Makes no assumption about the underlying data distribution.
Naturally Multi-Class: Handles any number of class labels without modification.
Weaknesses:
High Computation Cost: Requires computing distances to all training points at inference.
Memory Intensive: Stores the entire training set.
Curse of Dimensionality: Performance deteriorates in high-dimensional spaces.
Sensitive to Noise: Outliers can affect predictions.
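To make the workflow concrete, here is a minimal classification sketch using scikit-learn's KNeighborsClassifier; the library call and the toy data are assumptions for illustration, not part of the original notes.
Code:
from sklearn.neighbors import KNeighborsClassifier

# Toy 2-D training data: two features per point, two class labels (illustrative only)
X_train = [[1.0, 1.1], [1.2, 0.9], [3.0, 3.2], [3.1, 2.9], [0.9, 1.0]]
y_train = ["A", "A", "B", "B", "A"]

# k = 3 neighbors, Euclidean distance (the default metric)
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# Classify a new query point by majority vote among its 3 nearest neighbors
print(knn.predict([[1.1, 1.0]]))  # expected to print ['A']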
Proximity measures refer to techniques that quantify the similarity or closeness between two
data points. The idea is that if two data points are close to each other in the feature space,
they are more likely to belong to the same class (in classification) or have similar values (in
regression). These proximity measures are essential for nearest neighbor algorithms.
Proximity measures in machine learning quantify the similarity or dissimilarity between data
points, which are often represented as vectors. These measures are fundamental in tasks like
clustering, classification, and anomaly detection, where the relationship between data points
is crucial.
Proximity measures can be categorized based on the data types they are applied to:
1. Distance Metrics
These measure the "distance" between two points, treating closeness as the inverse of
distance.
Euclidean Distance
Euclidean Distance gives the distance between any two points in an n-dimensional plane.
Euclidean distance between two points in the Euclidean space is defined as the length of the
line segment joining the two points.
Euclidean distance is like measuring the straightest and shortest path between two
points. Imagine you have a string and you stretch it tight between two points on a map; the
length of that string is the Euclidean distance
Euclidean Distance in 3D
If the two points (x1, y1, z1) and (x2, y2, z2) are in a 3-dimensional space, the Euclidean
Distance between them is given by using the formula:
d = √[(x2 − x1)² + (y2 − y1)² + (z2 − z1)²]
where,
d is the Euclidean distance
(x1, y1, z1) are the coordinates of the first point
(x2, y2, z2) are the coordinates of the second point
Euclidean Distance in nD
In general, the Euclidean Distance formula between two points (x11, x12, x13, ...., x1n) and
(x21, x22, x23, ...., x2n) in an n-dimensional space is given by the formula:
d = √[∑(x2i − x1i)²]
where,
i ranges from 1 to n
d is the Euclidean distance
(x11, x12, x13, ..., x1n) are the coordinates of the first point
(x21, x22, x23, ..., x2n) are the coordinates of the second point
In two dimensions, this reduces to the familiar form:
d = √[(x2 − x1)² + (y2 − y1)²]
Example 1: A triangle has vertices at points A(2, 3), B(5, 7), and C(8, 1). Find
the length of the longest side of the triangle.
Given: the points A(2, 3), B(5, 7), and C(8, 1) are the vertices of a triangle.
Distance between points A and B:
AB = √[(5 − 2)² + (7 − 3)²]
⇒ AB = √[9 + 16] = √25
AB = 5 units
Distance between points B and C:
BC = √[(8 − 5)² + (1 − 7)²]
⇒ BC = √[9 + 36] = √45
BC ≈ 6.708 units
Distance between points C and A:
CA = √[(8 − 2)² + (1 − 3)²]
⇒ CA = √[36 + 4] = √40
CA ≈ 6.325 units
Therefore, the length of the longest side of the triangle (BC) is approximately 6.708 units.
Example 2: Mathematically prove that the Euclidean distance is a non-negative value.
Consider two points (x1, y1) and (x2, y2) in a 2-dimensional space; the Euclidean
distance between them is given by the formula:
d = √[(x2 − x1)² + (y2 − y1)²]
Each squared term (x2 − x1)² and (y2 − y1)² is non-negative, so their sum is non-negative, and the principal square root of a non-negative number is itself non-negative. Hence d ≥ 0, with d = 0 only when the two points coincide.
Example 3: Show that the points A(0, 0), B(4, 0), and C(2, 2√3) are the vertices of an
equilateral triangle.
To prove that these three points form an equilateral triangle, we need to show
that the distances between all pairs of points, i.e., AB, BC, and CA, are equal.
Distance between points A and B:
AB = √[(4 − 0)² + (0 − 0)²]
⇒ AB = √16
AB = 4 units
Distance between points B and C:
BC = √[(2 − 4)² + (2√3 − 0)²]
⇒ BC = √[4 + 12] = √16
BC = 4 units
Distance between points C and A:
CA = √[(2 − 0)² + (2√3 − 0)²]
⇒ CA = √[4 + 12] = √16
CA = 4 units
Here, we can observe that all three distances, AB, BC, and CA, are equal.
Therefore, the given triangle is an equilateral triangle.
Code:
import numpy as np
# Define points
point_a = np.array([3, 4])
point_b = np.array([0, 0])
# Compute Euclidean distance
distance = np.linalg.norm(point_a - point_b)
print("Euclidean Distance:", distance)
Manhattan distance:
Manhattan distance is the sum of the absolute differences between the coordinates of two
points: d(x, x') = ∑ |xi − xi'|. It works very well for high-dimensional datasets. Because it does
not square the differences, it does not amplify the differences between any of the features, and
it does not ignore any features.
This metric is also known as Manhattan length, rectilinear distance, L1 distance or L1 norm,
city block distance, Minkowski's L1 distance, or the taxi-cab metric.
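A minimal NumPy sketch of the Manhattan (L1) distance, reusing the same points as the Euclidean code example above:
Code:
import numpy as np

point_a = np.array([3, 4])
point_b = np.array([0, 0])

# Manhattan (L1) distance: sum of absolute coordinate differences
manhattan = np.sum(np.abs(point_a - point_b))
print("Manhattan Distance:", manhattan)  # 3 + 4 = 7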
Minkowski distance:
Minkowski distance is a distance measured between two points in N-dimensional
space. It is a generalization of the Euclidean distance and the Manhattan
distance. It is widely used in machine learning, especially when searching for the
best similarity measure for classifying data. Minkowski distance is used in
algorithms such as K-Nearest Neighbors, Learning Vector Quantization (LVQ),
Self-Organizing Map (SOM), and K-Means Clustering.
For two points P1 = (X1, Y1, ..., N1) and P2 = (X2, Y2, ..., N2) in an N-dimensional space,
the Minkowski distance is given by ( |X1 − X2|^p + |Y1 − Y2|^p + ... + |N1 − N2|^p )^(1/p).
Setting p = 2 gives the Euclidean distance and p = 1 gives the Manhattan distance.
In R, Minkowski distance is calculated with respect to vectors.
For example, we are given two vectors, vect1 as (4, 2, 6, 8) and vect2 as (5, 1, 7, 9). Their
Minkowski distance for p = 2 is ( |4 − 5|² + |2 − 1|² + |6 − 7|² + |8 − 9|² )^(1/2) = √4 = 2.
The following shows how we can calculate the Minkowski distance in R.
Method 1: Using a custom function
We can calculate the Minkowski distance between a pair of vectors by applying the formula
( Σ |vector1_i − vector2_i|^p )^(1/p)
Here,
vector1 is the first vector
vector2 is the second vector
p is an integer
code:
# Computes the Minkowski distance between two numeric vectors for a given p
calculateMinkowskiDistance <- function(vect1, vect2, p) {
  (sum(abs(vect1 - vect2) ^ p)) ^ (1 / p)
}

# Initializing the two vectors from the example above
vect1 <- c(4, 2, 6, 8)
vect2 <- c(5, 1, 7, 9)

# Set p equal to 2 (the Euclidean case)
p <- as.integer(2)
print(calculateMinkowskiDistance(vect1, vect2, p))
output:
[1] 2
Hamming Distance calculates the distance between two binary vectors, also referred to
as binary strings or bitstrings for short.
You are most likely going to encounter bitstrings when you one-hot encode categorical columns
of data.
For example, if a column had the categories ‘red,’ ‘green,’ and ‘blue,’ you might one hot encode
each example as a bitstring with one bit for each column.
red = [1, 0, 0]
green = [0, 1, 0]
blue = [0, 0, 1]
The distance between red and green could be calculated as the sum or the average number of bit
differences between the two bitstrings. This is the Hamming distance.
For a one-hot encoded string, it might make more sense to sum the bit differences between the
strings, which will be 0 when the two categories match and 2 when they differ; averaging over
the string length instead gives a score between 0 and 1.
# average number of differing bit positions between two equal-length bitstrings
def hamming_distance(a, b):
    return sum(abs(e1 - e2) for e1, e2 in zip(a, b)) / len(a)

# define data
row1 = [0, 0, 0, 0, 0, 1]
row2 = [0, 0, 0, 0, 1, 0]
# calculate distance
dist = hamming_distance(row1, row2)
print(dist)  # 2 differing bits out of 6 -> 0.333...
Distance Measures
Distance measures are specific types of proximity measures that calculate the physical
distance between data points in a multi-dimensional feature space. The choice of distance
measure can have a significant impact on the performance of the nearest neighbor algorithm;
the Euclidean, Manhattan, Minkowski, and Hamming distances described above are the ones
most commonly used with k-NN.
Non-Metric Similarity Functions
Not all similarity measures correspond to a "distance" in the strict mathematical sense. These
are non-metric similarity functions, often used in cases where the data is not naturally metric
(e.g., categorical or nominal data).
Example
Consider finding the similarity between two vectors x and y using cosine similarity.
The x vector has values x = {3, 2, 0, 5} and the y vector has values y = {1, 0, 0, 0}.
The formula for calculating the cosine similarity is: cos(x, y) = (x · y) / (||x|| × ||y||)
x · y = 3*1 + 2*0 + 0*0 + 5*0 = 3
||x|| = √(3² + 2² + 0² + 5²) = √38 ≈ 6.16, and ||y|| = √(1² + 0² + 0² + 0²) = 1
cos(x, y) = 3 / (6.16 × 1) ≈ 0.49
Advantages
Cosine similarity is beneficial because two similar data objects that are far apart by
Euclidean distance (because of their size) can still have a small angle between them.
The smaller the angle, the higher the similarity.
When plotted in a multi-dimensional space, cosine similarity captures the orientation
(the angle) of the data objects and not the magnitude.
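A short NumPy check of the worked example above, using the same x and y vectors:
Code:
import numpy as np

x = np.array([3, 2, 0, 5])
y = np.array([1, 0, 0, 0])

# Cosine similarity: dot product divided by the product of the vector norms
cos_sim = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
print("Cosine similarity:", cos_sim)  # approximately 0.49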
Kernel-Based Similarities
Kernel functions project data into a higher-dimensional space and compute similarity there,
often using non-metric formulations.
String Kernels: Compare substrings within strings for text or sequence data.
Information-Theoretic Measures
These quantify similarity through shared information, e.g., mutual information between variables. Typical applications of non-metric similarity measures include:
1. Text Processing:
o Edit distance for spell-checking and plagiarism detection.
o Jaccard similarity for duplicate document detection.
2. Recommendation Systems:
o Jaccard similarity for collaborative filtering.
o Soft matching for user profiles.
3. Bioinformatics:
o Sequence alignment using Levenshtein distance.
o Tanimoto coefficient for molecular similarity.
4. Clustering:
o Non-metric measures like mutual information for grouping categorical data.
5. Medical Imaging:
o Dice similarity coefficient for segmentation evaluation.
For binary patterns (data with values of 0 or 1), the distance or similarity can be computed
using measures that specifically handle binary data:
Hamming Distance: The number of differing bits between two binary patterns.
Jaccard Similarity: Works well for comparing binary sets, focusing on the
intersection over the union of two sets.
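As a small illustration of these binary measures, the sketch below (with illustrative bit vectors, not from the notes) computes both the Hamming distance and the Jaccard similarity for two binary patterns:
Code:
# Two illustrative binary patterns of equal length
a = [1, 0, 1, 1, 0, 1]
b = [1, 1, 0, 1, 0, 1]

# Hamming distance: count of positions where the bits differ
hamming = sum(x != y for x, y in zip(a, b))

# Jaccard similarity: |intersection| / |union| of the positions set to 1
ones_a = {i for i, x in enumerate(a) if x == 1}
ones_b = {i for i, x in enumerate(b) if x == 1}
jaccard = len(ones_a & ones_b) / len(ones_a | ones_b)

print("Hamming distance:", hamming)    # 2
print("Jaccard similarity:", jaccard)  # 3 / 5 = 0.6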
The k-Nearest Neighbors (k-NN) algorithm is one of the most straightforward and widely
used classification algorithms. It classifies a new data point based on the majority class among
its k nearest neighbors.
Algorithm Steps
Input: training data D = {(x1, y1), ..., (xn, yn)}, query point x_q, number of neighbors k, distance metric d.
Procedure:
1. Compute Distances:
o Calculate the distance d(x_q, x_i) between the query point x_q and each data point x_i in the training set.
2. Sort Neighbors:
o Sort the training points by their distance to x_q.
3. Select Nearest Neighbors:
o Select the k nearest neighbors (smallest distances).
4. Majority Vote:
o Determine the majority class label among the k nearest neighbors.
5. Output:
o Assign the query point x_q to the majority class.
3. Distance Metrics
4. Choosing k
Small k:
o Captures local patterns but may overfit (sensitive to noise).
Large k:
o Smooths predictions but may underfit (ignores local structures).
The optimal k is usually determined via cross-validation, as in the sketch below.
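A minimal sketch of choosing k by cross-validation with scikit-learn; the library, the candidate k values, and the built-in toy dataset are assumptions used purely for illustration.
Code:
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Evaluate several candidate values of k with 5-fold cross-validation
best_k, best_score = None, -1.0
for k in [1, 3, 5, 7, 9, 11]:
    score = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    if score > best_score:
        best_k, best_score = k, score

print("Best k:", best_k, "with mean CV accuracy:", round(best_score, 3))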
Example
Dataset:
Training data (one feature x and a class label): D = {(1, A), (2, A), (4, B), (5, B), (6, A)}
Query point: x_q = 3, with k = 3.
Steps:
1. Compute Distances:
o d(1, 3) = 2, d(2, 3) = 1, d(4, 3) = 1, d(5, 3) = 2, d(6, 3) = 3.
2. Sort Neighbors:
o Sorted by distance: (2, A), (4, B), (1, A), (5, B), (6, A).
3. Select k = 3 Nearest Neighbors:
o Neighbors: (2, A), (4, B), (1, A).
4. Majority Vote:
o Class labels: A, B, A.
o Majority class: A.
5. Prediction:
o Assign x_q = 3 to class A.
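The worked example can be reproduced with a short from-scratch sketch in pure Python (the helper code is an assumption added for illustration, not part of the original notes):
Code:
from collections import Counter

# Training data from the example: (feature value, class label)
D = [(1, "A"), (2, "A"), (4, "B"), (5, "B"), (6, "A")]
x_q, k = 3, 3

# 1. Compute distances, 2. sort by distance, 3. keep the k nearest
neighbors = sorted(D, key=lambda pair: abs(pair[0] - x_q))[:k]

# 4. Majority vote among the neighbor labels
prediction = Counter(label for _, label in neighbors).most_common(1)[0][0]
print("Predicted class:", prediction)  # A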
The Radius Nearest Neighbor (RNN) algorithm is an extension of the k-NN approach but
with a focus on the radius rather than the number of neighbors.
Algorithm Steps
Input: training data D = {(x1, y1), ..., (xn, yn)}, query point x_q, radius r, distance metric d.
Procedure:
1. Compute Distances:
o Calculate the distance d(x_i, x_q) between the query point x_q and each data point x_i in the training set.
2. Identify Neighbors:
o Select all points x_i such that d(x_i, x_q) ≤ r.
3. Prediction:
o Classification:
Use a majority vote among the labels of the selected neighbors.
If no neighbors are found within r, a default class can be assigned, or a larger radius can be used.
o Regression:
Compute the mean (or weighted mean) of the target values y_i of the selected neighbors.
Output: the predicted class label (classification) or target value (regression) for x_q.
3. Pseudocode
Algorithm RNN:
Input: Training data D = {(x1, y1), (x2, y2), ..., (xn, yn)}, query point x_q, radius r, distance metric d.
1. Compute distances: for each (x_i, y_i) in D, dist[i] = d(x_i, x_q).
2. Select neighbors: N = {(x_i, y_i) : dist[i] <= r}.
3. If N is empty, return a default label (or retry with a larger r);
   otherwise return the majority label of N (classification) or the mean of the y_i in N (regression).
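A hedged sketch of the same idea using scikit-learn's RadiusNeighborsClassifier; the library call, the toy data, and the chosen radius are assumptions for illustration only.
Code:
from sklearn.neighbors import RadiusNeighborsClassifier

# Toy 1-D training data (illustrative), shaped as (n_samples, n_features)
X_train = [[1.0], [2.0], [4.0], [5.0], [6.0]]
y_train = ["A", "A", "B", "B", "A"]

# Every training point within radius 1.6 of the query votes on its label;
# outlier_label is returned if no neighbor falls inside the radius.
rnn = RadiusNeighborsClassifier(radius=1.6, outlier_label="A")
rnn.fit(X_train, y_train)

print(rnn.predict([[2.5]]))  # neighbors at x = 1, 2, 4 vote A, A, B -> ['A']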
KNN Regression
In k-Nearest Neighbors Regression (k-NN regression), the output is continuous rather than
categorical. The prediction is made based on the average (or weighted average) of the target
values of the k nearest neighbors.
Input: training data D = {(x1, y1), ..., (xn, yn)}, query point x_q, number of neighbors k, distance metric d.
Procedure:
1. Compute Distances:
o Calculate the distance d(x_q, x_i) between the query point x_q and each training point x_i.
2. Find k Nearest Neighbors:
o Identify the k nearest neighbors by selecting the k points with the smallest distances to x_q.
3. Aggregate Neighbor Target Values:
o Simple Average:
Predict the target value as the mean of the target values of the k nearest neighbors:
\hat{y}_q = \frac{1}{k} \sum_{i=1}^{k} y_i
o Weighted Average:
Assign weights inversely proportional to the distances:
\hat{y}_q = \frac{\sum_{i=1}^{k} w_i y_i}{\sum_{i=1}^{k} w_i}, \quad w_i = \frac{1}{d(x_q, x_i) + \epsilon}
Here, \epsilon is a small constant to prevent division by zero.
Output: the predicted continuous value \hat{y}_q for the query point x_q.
Pseudocode
Algorithm KNN-Regression:
Input: Training data D = {(x1, y1), ..., (xn, yn)}, query point x_q, number of neighbors k, distance metric d.
1. Compute distances:
   For each (x_i, y_i) in D:
       dist[i] = d(x_i, x_q)
2. Select the k points with the smallest dist[i].
3. Return the mean (or distance-weighted mean) of the y_i of those k points.
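A minimal from-scratch sketch of k-NN regression with the distance-weighted average described above; the 1-D training pairs are invented for illustration, and epsilon plays the role of \epsilon in the formula.
Code:
# Toy training data: (feature value, target value)
D = [(1.0, 2.0), (2.0, 2.5), (4.0, 6.0), (5.0, 7.5), (6.0, 8.0)]
x_q, k, eps = 3.0, 3, 1e-8

# 1-2. Compute distances and keep the k nearest neighbors
neighbors = sorted(D, key=lambda pair: abs(pair[0] - x_q))[:k]

# 3. Distance-weighted average of the neighbor targets
weights = [1.0 / (abs(x - x_q) + eps) for x, _ in neighbors]
y_hat = sum(w * y for w, (_, y) in zip(weights, neighbors)) / sum(weights)

print("Predicted value:", round(y_hat, 3))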
Performance of Classifiers
1. Choice of k: Too small a k can lead to noisy predictions (overfitting), while too
large a k can make the model too smooth (underfitting).
2. Distance Measure: The choice of distance measure can heavily affect the accuracy of
the classifier, especially when features are on different scales.
3. Dimensionality: As the number of features (dimensions) increases, the model can
struggle due to the curse of dimensionality.
Evaluating the performance of regression algorithms is essential to understand how well the model
predicts continuous target variables. The choice of metric depends on the problem's context and the
specific goals, such as minimizing large errors, understanding variability, or achieving interpretable
results.
1. Common Regression Metrics
Mean Absolute Error (MAE)
Measures the average magnitude of absolute errors between predicted values (\hat{y}_i) and actual values (y_i).
Formula: \text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|
Advantages:
o Simple to interpret.
o Less sensitive to outliers than Mean Squared Error.
Disadvantages:
o Doesn't penalize large errors as heavily as other metrics.
Mean Squared Error (MSE)
Measures the average of the squared differences between predicted and actual values.
Formula: \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
Advantages:
o Penalizes large errors more than small ones.
Disadvantages:
o Can be heavily influenced by outliers.
o Units are squared, making interpretation less intuitive.
Root Mean Squared Error (RMSE)
The square root of MSE; expresses error in the same units as the target variable.
Formula: \text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}
Advantages:
o Penalizes large errors more than MAE.
o Easier to interpret than MSE.
Disadvantages:
o Still sensitive to outliers.
Coefficient of Determination (R²)
Measures the proportion of variance in the target variable explained by the model.
Formula: R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}, where \bar{y} is the mean of the actual values.
Range:
o R^2 = 1: Perfect fit.
o R^2 = 0: Model predicts as well as the mean.
o R^2 < 0: Model performs worse than the mean prediction.
Advantages:
o Intuitive measure of goodness-of-fit.
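A short sketch computing these four metrics with NumPy; the y_true and y_pred arrays are invented for illustration, not results from any model in the notes.
Code:
import numpy as np

# Illustrative actual and predicted values
y_true = np.array([3.0, 5.0, 2.5, 7.0, 4.5])
y_pred = np.array([2.8, 5.4, 2.0, 7.5, 4.0])

errors = y_true - y_pred
mae = np.mean(np.abs(errors))   # Mean Absolute Error
mse = np.mean(errors ** 2)      # Mean Squared Error
rmse = np.sqrt(mse)             # Root Mean Squared Error
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)  # R^2

print("MAE:", mae, "MSE:", mse, "RMSE:", rmse, "R^2:", r2)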