Support Vector Machines
Machine Learning
Manoj Kumar
Youtube
October 23, 2024
Outline
1 Introduction
2 Kernels
3 Convex hull
4 Lagrange formulation
Vector Dot Product
Calculations Involved in Dot Product
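The worked calculation on this slide is a figure and is not reproduced here. As a reminder (a standard fact, not taken from the slide), the dot product multiplies corresponding components and sums the results:
$$\vec{a} \cdot \vec{b} = \sum_i a_i b_i, \qquad \begin{pmatrix} 1 \\ 2 \end{pmatrix} \cdot \begin{pmatrix} 3 \\ 4 \end{pmatrix} = 1 \cdot 3 + 2 \cdot 4 = 11.$$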
Magic!!
Kernel Functions
Example of Kernel Functions
Validity of Kernel Functions
Method 1: We find a function K(x, x′) such that
$K(x, x') = \phi(x) \cdot \phi(x')$
Method 2: Mercer's theorem
1 $K(x, x') = K(x', x)$ {Symmetric}
2 For datapoints $x_1, x_2, x_3$, the kernel matrix K is formed by applying the kernel function to each pair of data points:
$$K = \begin{pmatrix} K(x_1, x_1) & K(x_1, x_2) & K(x_1, x_3) \\ K(x_2, x_1) & K(x_2, x_2) & K(x_2, x_3) \\ K(x_3, x_1) & K(x_3, x_2) & K(x_3, x_3) \end{pmatrix}$$
The kernel matrix K must be symmetric and positive semi-definite (all eigenvalues of K must be non-negative).
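These two conditions can be checked numerically on any finite set of points. A minimal sketch (not from the slides; the RBF kernel and the three sample points are illustrative assumptions):

```python
import numpy as np

# Numerical check of the two Mercer conditions on a finite sample.
def rbf_kernel(x, z, sigma=1.0):
    return np.exp(-np.linalg.norm(x - z) ** 2 / (2 * sigma ** 2))

points = [np.array([1., -1.]), np.array([2., 2.]), np.array([0., 3.])]

# Kernel (Gram) matrix K[i, j] = K(x_i, x_j).
K = np.array([[rbf_kernel(xi, xj) for xj in points] for xi in points])

# Condition 1: symmetry.
print(np.allclose(K, K.T))                        # True

# Condition 2: positive semi-definiteness -- all eigenvalues non-negative
# (up to numerical tolerance).
print(np.all(np.linalg.eigvalsh(K) >= -1e-10))    # True
```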
Types of Kernel Function
1 Polynomial Kernel Function
$K(x, x') = (x^\top x' + c)^d$
x and x′ are input vectors.
c is a constant that can be adjusted.
d is the degree of the polynomial.
2 Gaussian Kernel / Radial Basis Function (RBF) Kernel
It maps the input space into an infinite-dimensional feature space.
$K(x, x') = e^{-\|x - x'\|^2 / (2\sigma^2)} = e^{-d^2 / (2\sigma^2)}$
d → distance between x and x′
x and x′ are input vectors.
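Both kernels above are straightforward to implement directly. A minimal sketch (not from the slides; the parameter values c, d, and σ are illustrative assumptions):

```python
import numpy as np

def polynomial_kernel(x, z, c=1.0, d=2):
    """K(x, z) = (x^T z + c)^d"""
    return (np.dot(x, z) + c) ** d

def rbf_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / (2 sigma^2))"""
    return np.exp(-np.linalg.norm(x - z) ** 2 / (2 * sigma ** 2))

x = np.array([1.0, 2.0])
z = np.array([0.0, 1.0])
print(polynomial_kernel(x, z))   # (1*0 + 2*1 + 1)^2 = 9.0
print(rbf_kernel(x, z))          # exp(-2 / 2) = exp(-1) ~ 0.3679
```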
Validity of Kernel Functions
Given that $K_1 : X \times X \to \mathbb{R}$ and $K_2 : X \times X \to \mathbb{R}$ are two symmetric, positive definite kernel functions, the following combinations are or are not valid kernels:
Valid Kernel Functions:
$K(x, x') = c \cdot K_1(x, x')$, where c is a positive constant.
$K(x, x') = K_1(x, x') + K_2(x, x')$
Not a Valid Kernel Function:
$K(x, x') = K_1(x, x') + \dfrac{1}{K_2(x, x')}$
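A one-line justification for the two valid cases (standard reasoning, not spelled out on the slide): for any points $x_1, \dots, x_n$ and coefficients $a_1, \dots, a_n$,
$$\sum_{i,j} a_i a_j \, c K_1(x_i, x_j) = c \sum_{i,j} a_i a_j K_1(x_i, x_j) \geq 0, \qquad \sum_{i,j} a_i a_j \left[ K_1(x_i, x_j) + K_2(x_i, x_j) \right] = \sum_{i,j} a_i a_j K_1(x_i, x_j) + \sum_{i,j} a_i a_j K_2(x_i, x_j) \geq 0,$$
so both Gram matrices remain positive semi-definite. The term $\dfrac{1}{K_2(x, x')}$ carries no such guarantee, which is why the last combination can fail to be a valid kernel.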
Question
Consider two data points, x1 = (1, −1) and x2 = (2, 2), in a binary classification task
using an SVM with a custom kernel function K (x, y ). The kernel function is applied to
these points, resulting in the following matrix, referred to as matrix A:
$$\begin{pmatrix} K(x_1, x_1) & K(x_1, x_2) \\ K(x_2, x_1) & K(x_2, x_2) \end{pmatrix} = \begin{pmatrix} 1 & 3 \\ 3 & 6 \end{pmatrix}$$
Which of the following statements is correct regarding matrix A and the kernel function
K (x, y )?
A) K (x, y ) is a valid kernel.
B) K (x, y ) is not a valid kernel.
C) Matrix A is positive semi-definite.
D) Matrix A is not positive semi-definite.
Question
Given a kernel function K1 : Rn × Rn → R and its corresponding feature map ϕ1 : Rn →
Rd , which feature map ϕ : Rn → Rd would correctly produce the kernel cK1 (x, z),
where c is a positive constant?
a) $\phi(x) = c\,\phi_1(x)$
b) $\phi(x) = \sqrt{c}\,\phi_1(x)$
c) $\phi(x) = c^2\,\phi_1(x)$
d) No such feature map exists.
Question
Let K1 : X × X → R and K2 : X × X → R be two symmetric, positive definite kernel
functions. Which of the following cannot be a valid kernel function?
(a) K (x, x ′ ) = 5 · K1 (x, x ′ )
(b) K (x, x ′ ) = K1 (x, x ′ ) + K2 (x, x ′ )
(c) $K(x, x') = K_1(x, x') + \dfrac{1}{K_2(x, x')}$
(d) All three are valid kernels.
Question
Given a kernel function K (x, x ′ ) = f (x)g (x ′ ) + f (x ′ )g (x), where f and g are real-valued
functions (Rd → R), the kernel is not valid. What additional terms would you include
in K to make it a valid kernel?
Options:
(A) f (x) + g (x)
(B) f (x)g (x) + f (x ′ )g (x ′ )
(C) f (x)f (x ′ ) + g (x)g (x ′ )
(D) f (x ′ ) + g (x ′ )
Question
Which of the following are properties that a kernel matrix always has?
□ Invertible
□ At least one negative eigenvalue
□ All the entries are positive
□ Symmetric
Question
Suppose ϕ(x) is an arbitrary feature mapping from input x ∈ X to ϕ(x) ∈ RN and let
K (x, z) = ϕ(x)⊤ ϕ(z). Then K (x, z) will always be a valid kernel function.
Circle one: True False
Question
Suppose ϕ(x) is the feature map induced by a polynomial kernel K (x, z) of degree d,
then ϕ(x) should be a d-dimensional vector.
Circle one: True False
Question
Which of the following are valid kernel functions?
⃝ $k(x, z) = \exp\left(-\dfrac{\|x - z\|^2}{2\sigma^2}\right)$
⃝ $k(x, z) = \|x\| \, \|z\|$
⃝ $k(x, z) = x^\top \begin{pmatrix} 727 & 1 \\ 1 & 42 \end{pmatrix} z$
⃝ $k(x, z) = x^\top \begin{pmatrix} -727 & 1 \\ 1 & -42 \end{pmatrix} z$
Question
Let $x, y \in \mathbb{R}^d$ be two points. Consider the function $k(x, y) = x^\top \mathrm{rev}(y)$, where $\mathrm{rev}(y)$ reverses the order of the components of y. For example, $\mathrm{rev}\begin{pmatrix} 1 \\ 2 \\ 3 \end{pmatrix} = \begin{pmatrix} 3 \\ 2 \\ 1 \end{pmatrix}$. Show that k cannot be a valid kernel function.
Question
Which of the following statements about kernels are true?
⃝ A: The dimension of the lifted feature vectors Φ(·), whose inner products the
kernel function computes, can be infinite.
⃝ B: For any desired lifting Φ(x), we can design a kernel function k(x, z) that
will evaluate Φ(x)⊤ Φ(z) more quickly than explicitly computing Φ(x) and Φ(z).
⃝ C: The kernel trick, when it is applicable, speeds up a learning algorithm if the
number of sample points is substantially less than the dimension of the (lifted)
feature space.
⃝ D: If the raw feature vectors x, y are of dimension 2, then $k(x, y) = x_1^2 y_1^2 + x_2^2 y_2^2$ is a valid kernel.
Question
Suppose we have a feature map Φ and a kernel function k(Xi , Xj ) = Φ(Xi ) · Φ(Xj ).
Select the true statements about kernels.
⃝ A: If there are n sample points of dimension d, it takes O(nd) time to compute
the kernel matrix.
⃝ B: The kernel trick implies we do not compute Φ(Xi ) explicitly for any sample
point Xi .
⃝ C: For every possible feature map Φ : Rd → RD you could imagine, there is a
way to compute k(Xi , Xj ) in O(d) time.
⃝ D: Running times of kernel algorithms do not depend on the dimension D of
the feature space Φ(·).
Linearly separable data
Note
The maximum margin hyperplane is determined by the positions of the support
vectors. Other data points can be relocated freely (as long as they stay outside
the margin region) without affecting the decision boundary. Therefore, the decision
boundary is independent of these non-support vector data points.
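A quick way to see this empirically is with scikit-learn's SVC, using a large C to approximate the hard margin. This is a minimal sketch, and the toy dataset is an illustrative assumption, not the lecture's data:

```python
import numpy as np
from sklearn.svm import SVC

# Linearly separable toy data.
X = np.array([[0., 0.], [1., 0.], [0., 1.],    # class -1
              [3., 3.], [4., 3.], [3., 4.]])   # class +1
y = np.array([-1, -1, -1, 1, 1, 1])

# A very large C approximates a hard-margin SVM.
clf = SVC(kernel="linear", C=1e6).fit(X, y)
print("support vector indices:", clf.support_)
print("w =", clf.coef_, " b =", clf.intercept_)

# Remove one point that is NOT a support vector and retrain.
non_sv = [i for i in range(len(X)) if i not in clf.support_][0]
mask = np.arange(len(X)) != non_sv
clf2 = SVC(kernel="linear", C=1e6).fit(X[mask], y[mask])

# The decision boundary is (numerically) unchanged.
print(np.allclose(clf.coef_, clf2.coef_, atol=1e-6),
      np.allclose(clf.intercept_, clf2.intercept_, atol=1e-6))
```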
Convex Hull
Definition: The convex hull of a set of points is the smallest convex set that encloses all of them; in two dimensions it forms a convex polygon.
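A minimal sketch (not from the slides) of computing a convex hull numerically with SciPy; the point set is an illustrative assumption:

```python
import numpy as np
from scipy.spatial import ConvexHull

# Five points in the plane; (1, 1) lies inside the hull of the other four.
points = np.array([[0., 0.], [2., 0.], [2., 2.], [0., 2.], [1., 1.]])

hull = ConvexHull(points)
print(points[hull.vertices])   # corners of the hull polygon only --
                               # the interior point (1, 1) is not a vertex
```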
Question
Find the Decision Boundary and support vectors for the following data points:
Question
In the following image, identify and circle the points that, if removed from the training
set, would result in a different decision boundary upon retraining the SVM compared to
training with the full sample.
Question
You are training an SVM on a tiny dataset with 4 points. This dataset consists of two
examples with class label -1 (denoted with plus), and two examples with class label +1
(denoted with triangles).
The points are:
Class -1: (1, 4), (2, 3)
Class +1: (4, 5), (5, 6)
What is the equation corresponding to the decision boundary?
Question
Consider the training data in Figure 2, for your convenience. The "maximum margin
classifier" (also called linear "hard margin" SVM) is a classifier that leaves the largest
possible margin on either side of the decision boundary. The samples lying on the margin
are called support vectors.
Question
Draw on Figure 2 the decision boundary obtained by the linear hard margin SVM method
with a thick solid line. Draw the margins on either side with thinner dashed lines. Circle
the support vectors.
Question
What is the training error rate?
Question
The removal of which sample will change the decision boundary?
Question
What is the leave-one-out error rate?
Question
Suppose your training set for two-class classification in one dimension contains three
sample points: point x1 = 3 with label y1 = 1, point x2 = 1 with label y2 = 1, and
point x3 = −1 with label y3 = −1. What are the values of w and b given by a hard-
margin SVM?
A: w = 1, b = 1
B: w = 0, b = 1
C: w = 1, b = 0
D: w = ∞, b = 0
Question
The geometric margin in a hard margin Support Vector Machine is:
$\dfrac{\|w\|^2}{2}$
$\dfrac{1}{\|w\|^2}$
$\dfrac{2}{\|w\|}$
$\dfrac{2}{\|w\|^2}$
Question
Suppose an SVM has $w = \begin{pmatrix} 2 \\ 3 \end{pmatrix}$ and $b = -1$. What is the actual distance between the two planes defined by $w^\top x + b = -1$ and $w^\top x + b = 1$?
A: $\dfrac{1}{\sqrt{14}}$
B: $\dfrac{2}{\sqrt{14}}$
C: $\dfrac{\sqrt{11}}{2}$
D: $\sqrt{11}$
E: None of the above
Question
If a hard-margin support vector machine tries to minimize $\|w\|^2$ subject to $y_i(X_i \cdot w + \alpha) \geq 2$, what will be the width of the slab (the point-free region bracketing the decision boundary)?
$\dfrac{1}{\|w\|}$
$\dfrac{2}{\|w\|}$
$\dfrac{4}{\|w\|}$
$\dfrac{1}{2\|w\|}$
Primal Form to Dual (Lagrange) Form
$$L = \max_{\alpha} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, x^{(i)\top} x^{(j)}$$
Where:
n ⇒ no. of datapoints
$x^{(i)}$ ⇒ the ith datapoint
$\alpha_i, \alpha_j > 0$ for support vectors
$\alpha_i, \alpha_j = 0$ for non-support vectors
$$x^{(i)\top} x^{(j)} = \begin{pmatrix} x_1^{(i)} & x_2^{(i)} \end{pmatrix} \cdot \begin{pmatrix} x_1^{(j)} \\ x_2^{(j)} \end{pmatrix}$$
$$f(x_q) = \sum_{i=1}^{n} \alpha_i y_i \, x^{(i)\top} x_q + b$$
Where:
$x_q$ ⇒ represents the query point.
$x^{(i)}$ ⇒ represents the ith support vector in the training data.
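As a concrete illustration of the decision function above, here is a minimal sketch that evaluates $f(x_q)$ from a set of support vectors. The $\alpha$, y, and b values are one consistent hard-margin solution for the four-point dataset from the earlier question (class −1: (1, 4), (2, 3); class +1: (4, 5), (5, 6)); they are worked out here for illustration, not given in the slides:

```python
import numpy as np

# Dual decision function: f(x_q) = sum_i alpha_i * y_i * (x^(i) . x_q) + b.
# The sum effectively runs over the support vectors only (alpha_i = 0 otherwise).
alphas = np.array([0.25, 0.25])               # Lagrange multipliers of the SVs
y_sv   = np.array([1, -1])                    # labels of the SVs
X_sv   = np.array([[4., 5.], [2., 3.]])       # the support vectors
b      = -3.5

def decision_function(x_q):
    return np.sum(alphas * y_sv * (X_sv @ x_q)) + b

print(decision_function(np.array([5., 6.])))  #  2.0 -> class +1
print(decision_function(np.array([1., 4.])))  # -1.0 -> on the negative margin
```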
$$L = \max_{\alpha} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, \Phi(x^{(i)})^\top \Phi(x^{(j)})$$
Where:
n ⇒ no. of datapoints
$x^{(i)}$ ⇒ the ith datapoint
$\alpha_i, \alpha_j > 0$ for support vectors
$\alpha_i, \alpha_j = 0$ for non-support vectors
$$\Phi(x^{(i)})^\top \Phi(x^{(j)}) = \begin{pmatrix} x_1^3 & x_1^2 x_2 & x_1 x_2^2 & x_2^3 & x_1^2 & x_1 x_2 & x_1 & x_2 \end{pmatrix}^{(i)} \cdot \begin{pmatrix} x_1^3 \\ x_1^2 x_2 \\ x_1 x_2^2 \\ x_2^3 \\ x_1^2 \\ x_1 x_2 \\ x_1 \\ x_2 \end{pmatrix}^{(j)}$$
$$f(x_q) = \sum_{i=1}^{n} \alpha_i y_i \, \Phi(x^{(i)})^\top \Phi(x_q) + b$$
Where:
$x_q$ ⇒ represents the query point.
$x^{(i)}$ ⇒ represents the ith support vector in the training data.
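Computing $\Phi(x)^\top \Phi(x')$ explicitly becomes expensive as the feature map grows; a kernel function evaluates the same inner product directly in the input space. A minimal sketch (not from the slides), using the degree-2 polynomial kernel because its explicit feature map is short; the feature map below corresponds to $(x^\top z + 1)^2$ and is an illustrative assumption, not the 8-dimensional map shown above:

```python
import numpy as np

def phi(x):
    # Explicit feature map of the degree-2 polynomial kernel (x^T z + 1)^2 in 2-D.
    x1, x2 = x
    return np.array([x1**2, x2**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     1.0])

def poly_kernel(x, z):
    return (np.dot(x, z) + 1.0) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

# Same number either way, but the kernel evaluation never forms the lifted vectors.
print(np.dot(phi(x), phi(z)))   # 25.0
print(poly_kernel(x, z))        # (1*3 + 2*0.5 + 1)^2 = 25.0
```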
Kernel Function
Types of Kernel Function
1 Polynomial Kernel Function
$K(x, x') = (x^\top x' + c)^d$
x and x′ are input vectors.
c is a constant that can be adjusted.
d is the degree of the polynomial.
2 Gaussian Kernel / Radial Basis Function (RBF) Kernel
It maps the input space into an infinite-dimensional feature space.
$K(x, x') = e^{-\|x - x'\|^2 / (2\sigma^2)} = e^{-d^2 / (2\sigma^2)}$
d → distance between x and x′
x and x′ are input vectors.
Question
Given the scenario with four training examples in two dimensions as shown in the figure:
- Positive examples at $x_1 = [0, 0]$ and $x_2 = [2, 2]$
- Negative examples at $x_3 = [h, 1]$ and $x_4 = [0, 3]$
- We treat $0 \leq h \leq 3$ as a parameter.
How large can h be so that the training points are still linearly separable?
Does the orientation of the maximum margin decision boundary change as a function
of h when the points are separable? (Yes/No)
Question
Assume that we can only observe the second component of the input vectors. Without
the other component, the labeled training points reduce to (0, y = 1), (2, y = 1), (1, y
= -1), and (3, y = -1). What is the lowest order p of the polynomial kernel that would
allow us to correctly classify these points?
Question
Given the following kernel function and the set of points plotted on the 2D plane,
determine the equation of the maximum margin decision boundary:
$$K(\vec{u}, \vec{v}) = 2\,|\vec{u}|\,|\vec{v}|$$
The points are distributed as follows:
- Positive class points: (−1, −1), (0, 0), (1, 1), (2, 2)
- Negative class points: (4, 4), (−3, −3), (−4, −4)
Determine the equation of the decision boundary.
Question
What are support vectors?
The examples farthest from the decision boundary.
The only examples necessary to compute f (x) in an SVM.
The class centroids.
All the examples that have a non-zero weight αk in an SVM.
Question
For the following dataset, select all transformations that can make the data linearly
separable.
Question
The geometric margin in a hard margin Support Vector Machine is:
$\dfrac{\|w\|^2}{2}$, $\dfrac{1}{\|w\|^2}$, $\dfrac{2}{\|w\|}$, $\dfrac{2}{\|w\|^2}$
Question
In SVMs, the values of αi for non-support vectors are 0. True or False?
Question
Suppose we train a hard-margin linear SVM on n > 100 data points in R2 , yielding a
hyperplane with exactly 2 support vectors. If we add one more data point and retrain
the classifier, what is the maximum possible number of support vectors for the new
hyperplane?
Question
1. In kernelized SVMs, the kernel matrix K has to be positive definite. True or False?
2. The RBF kernel K (xi , xj ) = exp(−γ∥xi − xj ∥2 ) corresponds to an infinite dimensional
mapping of the feature vectors. True or False?
Question
Given the following data samples (square and triangle mean two classes), which one(s)
of the following kernels can we use in SVM to separate the two classes?
Linear kernel
Polynomial kernel
Gaussian RBF (radial basis function) kernel
None of the above
Question
Radial basis functions can be used to make the following dataset linearly separable.
Question
Consider an SVM using a Radial Basis Function (RBF) kernel:
K (xi , xj ) = exp(−γ∥xi − xj ∥2 )
If you notice that the model is underfitting the training data, which parameter adjustment might improve the model's performance?
A) Increase the gamma γ parameter.
B) Decrease the gamma γ parameter.
C) Simplify the model by reducing the training iterations.
D) Remove the kernel trick and use a linear kernel.