Support Vector Machines (SVM)
Support Vector Machine (SVM) is a supervised machine learning algorithm used for classification problems. It separates the samples into different classes with a hyperplane.
The hyperplane is a line in two-dimensional space, a plane in three-dimensional space, and a higher-dimensional analogue in general; it divides the samples into different classes. SVM algorithms aim to find the hyperplane with the largest margin, which is the distance between the hyperplane and the closest samples from each class. These closest samples are known as support vectors.
SVM can handle both linear and non-linear data distributions by transforming the data into a higher-dimensional space using a technique called the kernel trick. In the transformed space the separating hyperplane is still linear, but it corresponds to a curved decision boundary in the original space, so samples that are not linearly separable in the original space can still be separated.
Once the hyperplane is determined, it can be used to make predictions for new samples.
Samples that fall on one side of the hyperplane are classified as one class, and samples
that fall on the other side are classified as the other class.
SVM is widely used for binary classification problems, and it can also be used for multi-
class classification problems by training multiple binary classifiers and combining the
results. However, SVM can be computationally expensive for large datasets, and it can
also be sensitive to the choice of kernel function and the hyperparameters used to train
the model.
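As an illustration, a minimal classification sketch assuming scikit-learn is available; the synthetic dataset and the hyperparameter values are placeholder choices, not a recommended configuration:

# Minimal SVM classification sketch using scikit-learn (assumed available).
# The synthetic dataset and hyperparameters are illustrative placeholders.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Generate a small two-class dataset
X, y = make_classification(n_samples=200, n_features=2, n_redundant=0,
                           n_informative=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=42)

# The RBF kernel handles non-linearly separable data via the kernel trick;
# C and gamma are the hyperparameters the text warns the model is sensitive to.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X_train, y_train)

print("Support vectors per class:", clf.n_support_)
print("Test accuracy:", clf.score(X_test, y_test))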
Sigmoid function
Multiclass classification
Multi-Class Classification – classification tasks with more than two class labels are referred to as multi-class classification. Unlike binary classification, which distinguishes between exactly two outcomes (for example, normal versus abnormal), multiclass classification assigns each example to one of several pre-defined classes.
Confusion matrix for a four-class problem (rows: true class, columns: predicted class):

True \ Predicted    Class A                Class B                Class C                Class D
Class A             TP class A             FP (A samples as B)    FP (A samples as C)    FP (A samples as D)
Class B             FP (B samples as A)    TP class B             FP (B samples as C)    FP (B samples as D)
Class C             FP (C samples as A)    FP (C samples as B)    TP class C             FP (C samples as D)
Class D             FP (D samples as A)    FP (D samples as B)    FP (D samples as C)    TP class D

Each off-diagonal cell is a false positive for the predicted (column) class and, at the same time, a false negative for the true (row) class.
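As a sketch of how these per-class counts can be read off a confusion matrix (scikit-learn assumed available; the labels below are placeholder data):

# Sketch: deriving per-class TP/FP/FN counts from a multiclass confusion matrix.
# y_true and y_pred are illustrative placeholder labels.
from sklearn.metrics import confusion_matrix

y_true = ["A", "A", "B", "C", "D", "B", "C", "A", "D", "B"]
y_pred = ["A", "B", "B", "C", "D", "B", "A", "A", "D", "C"]
labels = ["A", "B", "C", "D"]

cm = confusion_matrix(y_true, y_pred, labels=labels)  # rows: true, columns: predicted

for i, cls in enumerate(labels):
    tp = cm[i, i]                # diagonal: correctly classified samples of class i
    fn = cm[i, :].sum() - tp     # rest of the row: class i predicted as something else
    fp = cm[:, i].sum() - tp     # rest of the column: other classes predicted as i
    print(f"Class {cls}: TP={tp}, FP={fp}, FN={fn}")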
Transformation to binary
One-vs-rest
One-vs-one
The one-vs-rest (OvR, also called one-vs-all, OvA, or one-against-all, OAA) strategy involves training a single classifier per class, with the samples of that class as positive samples and all other samples as negatives. This strategy requires the base classifiers to produce a real-valued confidence score for their decisions, rather than just a class label; discrete class labels alone can lead to ambiguities, where multiple classes are predicted for a single sample. The number of classifiers is equal to the number of classes.
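A minimal one-vs-rest sketch using scikit-learn's OneVsRestClassifier (assumed available); the synthetic dataset and the LinearSVC base classifier are placeholder choices:

# One-vs-rest: one binary classifier per class, each scoring "this class vs. the rest".
import numpy as np
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=300, n_features=5, n_informative=4,
                           n_redundant=0, n_classes=4, random_state=0)

ovr = OneVsRestClassifier(LinearSVC())   # trains K = 4 binary classifiers
ovr.fit(X, y)

print("Number of binary classifiers:", len(ovr.estimators_))  # equals the number of classes
scores = ovr.decision_function(X[:1])    # real-valued confidence score per class
print("Confidence scores:", scores, "-> predicted class:", np.argmax(scores))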
In the one-vs-one (OvO) reduction, one trains K(K − 1)/2 binary classifiers for a K-way multiclass problem; each receives the samples of a pair of classes from the original training set and must learn to distinguish these two classes. At prediction time, a voting scheme is applied: all K(K − 1)/2 classifiers are applied to an unseen sample, and the class that receives the highest number of votes is the prediction of the combined classifier.
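A matching one-vs-one sketch with scikit-learn's OneVsOneClassifier (same assumptions and placeholder data as above):

# One-vs-one: K(K-1)/2 pairwise classifiers, combined by voting.
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import LinearSVC

K = 4
X, y = make_classification(n_samples=300, n_features=5, n_informative=4,
                           n_redundant=0, n_classes=K, random_state=0)

ovo = OneVsOneClassifier(LinearSVC())
ovo.fit(X, y)

print("Number of pairwise classifiers:", len(ovo.estimators_))  # K*(K-1)/2 = 6
print("Predicted class for first sample:", ovo.predict(X[:1])[0])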
Multiclass Classification using Support Vector Machine
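In scikit-learn, SVC handles more than two classes by applying a one-vs-one scheme internally, so it can be used directly on multiclass data. A short sketch on the built-in Iris dataset (scikit-learn assumed available):

# Multiclass SVM: SVC builds pairwise (one-vs-one) classifiers internally
# and combines them by voting.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)        # 3 classes, 4 features
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))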
k-Nearest Neighbors (k-NN) classification algorithm
The k-Nearest Neighbors (k-NN) classification algorithm is a simple and effective
machine learning technique used for classification tasks. The basic idea behind the
algorithm is to find the k data points in the training set that are closest to the new data
point, and then use those k data points to determine the class of the new data point.
Here's how the algorithm works:
1. The algorithm starts by storing the entire training dataset in memory.
2. When a new data point is encountered, the algorithm calculates the distance
between the new data point and each of the data points in the training set.
3. The algorithm then selects the k data points from the training set that are
closest to the new data point, based on the calculated distances.
4. The class of the new data point is determined by a majority vote of the classes
of the k nearest neighbours. If k = 1, the class of the new data point is the
same as the class of the closest data point in the training set.
5. This process is repeated for each new data point in the test set.
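These steps translate almost directly into code; a minimal from-scratch sketch, assuming numeric features, Euclidean distance, and the small placeholder dataset below:

# Minimal k-NN classifier following the steps above (Euclidean distance, majority vote).
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Step 2: distance from the new point to every training point
    distances = np.sqrt(np.sum((X_train - x_new) ** 2, axis=1))
    # Step 3: indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Step 4: majority vote among the k nearest neighbours
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Placeholder training data: two features, two classes
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array([0, 0, 1, 1])

print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))  # -> 0
print(knn_predict(X_train, y_train, np.array([4.9, 5.1]), k=3))  # -> 1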
The k-NN algorithm is considered a "lazy" learning algorithm, because it doesn't explicitly learn a model from the training data. Instead, it simply memorizes the training data and uses it to make predictions on new data points. This makes the algorithm simple to implement and essentially free to "train", but prediction requires comparing the new point against the whole training set, and the algorithm is sensitive to the choice of k and the quality of the training data.
In summary, the k-NN classification algorithm is a simple and effective algorithm that
can be used for classification tasks. It is particularly useful when the relationship
between the features and the classes is not well understood, and it can be easily
implemented in a variety of contexts.
There are several ways to calculate the distance between two data points in a machine
learning problem, including:
Euclidean distance
Manhattan Distance
Minkowski Distance
Cosine Similarity
Jaccard Similarity
Mahalanobis Distance
Euclidean Distance: This is the most commonly used distance metric, and it is based on the Pythagorean theorem. The Euclidean distance between two points is defined as the square root of the sum of the squares of the differences between the corresponding coordinates.
The equation for Euclidean distance between two points, x and y, with d features is given by:
d(x, y) = sqrt((x1 - y1)^2 + (x2 - y2)^2 + ... + (xd - yd)^2)
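A quick numerical check with numpy (the two example points are arbitrary):

import numpy as np
x, y = np.array([1.0, 2.0, 3.0]), np.array([4.0, 6.0, 8.0])
d_euclidean = np.sqrt(np.sum((x - y) ** 2))   # equivalently: np.linalg.norm(x - y)
print(d_euclidean)  # sqrt(9 + 16 + 25) = sqrt(50) ~= 7.07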
Manhattan Distance: Also known as the taxicab or city block distance, this distance
metric is calculated as the sum of the absolute differences between the coordinates of
the two points. It is used when the distance between two points is defined as the
minimum distance a person would have to travel if moving only vertically or
horizontally.
The equation for Manhattan distance (also known as L1 distance) between two points,
x and y, with d features is given by:
d(x, y) = |x1 - y1| + |x2 - y2| + ... + |xd - yd|
where x1, x2, ..., xd are the features of the first point, and y1, y2, ..., yd are the features of the second point. The Manhattan distance is the sum of the absolute differences between the corresponding features of the two points. It is commonly used, for example, in image processing and robotics problems and, more generally, on grid-like data where movement is restricted to horizontal and vertical steps.
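The same two example points under the Manhattan distance:

import numpy as np
x, y = np.array([1.0, 2.0, 3.0]), np.array([4.0, 6.0, 8.0])
d_manhattan = np.sum(np.abs(x - y))   # |1-4| + |2-6| + |3-8| = 3 + 4 + 5
print(d_manhattan)  # 12.0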
Minkowski Distance: This is a generalization of the Euclidean and Manhattan
distances, and it is defined as the Lp-norm of the differences between the two points.
The parameter p determines whether the distance is equivalent to the Euclidean or
Manhattan distance.
The equation for Minkowski distance between two points, x and y, with d features is given by:
d(x, y) = (|x1 - y1|^p + |x2 - y2|^p + ... + |xd - yd|^p)^(1/p)
where x1, x2, ..., xd are the features of the first point, and y1, y2, ..., yd are the features of
the second point. The parameter p determines the degree of the Minkowski distance and
can take on any positive real value. When p = 1, the Minkowski distance becomes the
Manhattan distance. When p = 2, the Minkowski distance becomes the Euclidean
distance.
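A small sketch showing how the parameter p recovers both special cases (same example points as above):

import numpy as np

def minkowski(x, y, p):
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

x, y = np.array([1.0, 2.0, 3.0]), np.array([4.0, 6.0, 8.0])
print(minkowski(x, y, p=1))   # 12.0  -> Manhattan distance
print(minkowski(x, y, p=2))   # ~7.07 -> Euclidean distance
print(minkowski(x, y, p=3))   # 6.0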
Cosine Similarity: This distance metric is used when working with high-dimensional
data, such as text or image data. It is calculated as the dot product of two vectors
divided by the product of their magnitudes.
The equation for cosine similarity between two vectors, x and y, with d features is given by:
cosine_similarity(x, y) = (x1*y1 + x2*y2 + ... + xd*yd) / (sqrt(x1^2 + ... + xd^2) * sqrt(y1^2 + ... + yd^2))
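The same example vectors in numpy:

import numpy as np
x, y = np.array([1.0, 2.0, 3.0]), np.array([4.0, 6.0, 8.0])
cos_sim = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
print(cos_sim)  # close to 1: the vectors point in nearly the same direction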
Jaccard Similarity: This distance metric is used to compare the similarity between two
sets of data, such as in text classification or image segmentation problems. It is
calculated as the size of the intersection of the two sets divided by the size of the union
of the two sets.
The equation for Jaccard similarity between two sets, X and Y, is given by:
Jaccard_similarity(X, Y) = |X intersection Y| / |X union Y|
where |X| is the number of elements in set X, |Y| is the number of elements in set Y,
and |X intersection Y| is the number of elements that are common to both sets X and
Y. The Jaccard similarity is a measure of the similarity between two sets, and it ranges
from 0 to 1, with values close to 1 indicating high similarity and values close to 0
indicating high dissimilarity.
Jaccard similarity is widely used in information retrieval and text classification
problems, where the data is represented as sets of words or terms. It is also used in image
processing and computer vision problems to measure the similarity between two images
based on the objects or features present in the images. In addition, it is used in
bioinformatics to measure the similarity between two biological sequences, such as
DNA or protein sequences.
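A small sketch with two placeholder word sets, using Python's built-in set operations:

# Jaccard similarity on two sets of words (e.g., two short documents).
X = {"the", "cat", "sat", "on", "the", "mat"}   # duplicates collapse in a set
Y = {"the", "dog", "sat", "on", "the", "rug"}
jaccard = len(X & Y) / len(X | Y)   # |intersection| / |union|
print(jaccard)  # 3 common words ("the", "sat", "on") out of 7 distinct -> ~0.43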
Mahalanobis Distance: This is a more advanced distance metric that takes into
account the covariance of the data. It is particularly useful when working with data
that has correlations between features.
The equation for Mahalanobis distance between a vector x and the mean vector μ of a
set of n vectors, in a d-dimensional feature space, is given by:
d(x, μ) = sqrt((x - μ)^T * Σ^-1 * (x - μ))
where x is the vector representing the point, μ is the mean vector of the set of n
vectors, Σ is the covariance matrix of the set of n vectors, and (x - μ)^T * Σ^-1 * (x -
μ) is the quadratic form. The Mahalanobis distance is a measure of the distance
between a point and the mean of a set of points, taking into account the covariance
structure of the data.
Mahalanobis distance is commonly used in multivariate statistical analysis to measure
the distance between a point and a distribution. It is especially useful when the data
has correlations between the features, as it accounts for these correlations in the
calculation of the distance. It is also used in outlier detection, pattern recognition, and
classification problems, where the goal is to distinguish between normal and abnormal
data points based on their distance from the mean of the data.
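A minimal numpy sketch with a small placeholder dataset; the covariance matrix and its inverse are computed from the data:

# Mahalanobis distance of a point from the mean of a small 2-D dataset.
import numpy as np

data = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 5.0], [4.0, 6.0], [5.0, 8.0]])
mu = data.mean(axis=0)
cov = np.cov(data, rowvar=False)        # 2x2 covariance matrix of the features
cov_inv = np.linalg.inv(cov)

x = np.array([3.0, 3.0])
diff = x - mu
d_mahalanobis = np.sqrt(diff @ cov_inv @ diff)   # sqrt((x - mu)^T * Sigma^-1 * (x - mu))
print(d_mahalanobis)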
The choice of distance metric depends on the specific requirements of the problem and
the nature of the data. In some cases, the choice of distance metric can significantly
impact the performance of a machine learning algorithm, so it is important to carefully
consider the appropriate distance metric for each problem.
Identification and Verification
Identification – classification into one of many classes (a one-to-many comparison).
Verification – confirmation that a sample belongs to a claimed class (a one-to-one comparison).
Identification refers to the process of determining who a person is. This is typically done
by requiring the person to provide some form of identification, such as a username and
password, a government-issued ID, or biometric data such as a fingerprint or facial
recognition. The goal of identification is to establish the identity of an individual so that
they can be granted access to certain resources or privileges.
Verification, on the other hand, is the process of confirming that a claimed identity is
accurate. This is typically done by comparing the information provided by the person
to some other source of information, such as a database or a trusted third-party service.
The goal of verification is to ensure that the person claiming to be someone is actually
who they say they are, and not an impostor.
In summary, identification is about establishing who a person is, while verification
is about confirming that a claimed identity is accurate.
ROC (Receiver Operating Characteristic)
• The performance of a classifier is represented as a point in ROC space (false positive rate on the x-axis, true positive rate on the y-axis)
• Changing a parameter of the algorithm, the sample distribution, or the cost matrix changes the location of the point
ROC Curve
- 1-dimensional data set containing 2 classes (positive and negative)
- any point located at x > t is classified as positive
At threshold t:
TP = 0.5, FN = 0.5, FP = 0.12, TN = 0.88
Using ROC for Model Comparison
• No model consistently outperforms the other:
  • M1 is better for small FPR
  • M2 is better for large FPR
• Area Under the ROC Curve (AUC):
  • Ideal classifier: Area = 1
  • Random guess: Area = 0.5
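A short scikit-learn sketch of such a comparison; the two models standing in for M1 and M2 are arbitrary placeholder choices, and the data is synthetic:

# Comparing two classifiers by their ROC curves and AUC (scikit-learn assumed available).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

m1 = LogisticRegression(max_iter=1000).fit(X_train, y_train)
m2 = SVC(probability=True, random_state=1).fit(X_train, y_train)

for name, model in [("M1", m1), ("M2", m2)]:
    scores = model.predict_proba(X_test)[:, 1]           # probability of the positive class
    fpr, tpr, thresholds = roc_curve(y_test, scores)     # points of the ROC curve
    print(name, "AUC =", roc_auc_score(y_test, scores))  # 1.0 ideal, 0.5 random guess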