
Support Vector Machines

Machine Learning

Manoj Kumar

Youtube

October 23, 2024



Outline

1 Introduction

2 Kernels

3 Convex hull

4 Lagrange formulation



Vector Dot Product



Calculations Involved in Dot Product



Magic!!



Kernel Functions



Example of Kernel Functions





Validity of Kernel Functions

Method 1: We find a function K(x, x′) such that

K(x, x′) = ϕ(x) · ϕ(x′)

Method 2: Mercer’s theorem
1. K(x, x′) = K(x′, x)   (symmetric)
2. For datapoints x1, x2, x3, the kernel matrix K is formed by applying the kernel function to each pair of data points:

        [ K(x1, x1)   K(x1, x2)   K(x1, x3) ]
    K = [ K(x2, x1)   K(x2, x2)   K(x2, x3) ]
        [ K(x3, x1)   K(x3, x2)   K(x3, x3) ]

The kernel matrix K should be symmetric and positive semi-definite (all eigenvalues of K must be non-negative).

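As a quick numerical illustration of the Mercer conditions above, the following sketch (Python with NumPy assumed; the three datapoints and the choice of a linear kernel are illustrative, not taken from the slides) builds the kernel matrix pairwise and then checks symmetry and the signs of its eigenvalues.

import numpy as np

def kernel_matrix(X, k):
    """Apply the kernel function k to every pair of rows of X."""
    n = len(X)
    return np.array([[k(X[i], X[j]) for j in range(n)] for i in range(n)])

# Three illustrative datapoints and the linear kernel K(x, x') = x . x'.
X = np.array([[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]])
K = kernel_matrix(X, lambda a, b: a @ b)

print(np.allclose(K, K.T))      # condition 1: the kernel matrix is symmetric
print(np.linalg.eigvalsh(K))    # condition 2: all eigenvalues >= 0 (up to round-off)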




Types of Kernel Function

1 Polynomial Kernel Function

K(x, x′) = (x⊤x′ + c)^d

x and x′ are input vectors.
c is a constant that can be adjusted.
d is the degree of the polynomial.

2 Gaussian Kernel / Radial Basis Function (RBF) Kernel
It maps the input space into an infinite-dimensional feature space.

K(x, x′) = exp(−∥x − x′∥² / 2σ²) = exp(−d² / 2σ²)

d → distance between x and x′
x and x′ are input vectors.

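A minimal sketch of the two kernels above, written directly from the formulas (Python with NumPy assumed; the values of c, d and σ below are placeholder choices):

import numpy as np

def polynomial_kernel(x, x_prime, c=1.0, d=2):
    # K(x, x') = (x^T x' + c)^d
    return (x @ x_prime + c) ** d

def rbf_kernel(x, x_prime, sigma=1.0):
    # K(x, x') = exp(-||x - x'||^2 / (2 * sigma^2))
    return np.exp(-np.sum((x - x_prime) ** 2) / (2.0 * sigma ** 2))

x, x_prime = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(polynomial_kernel(x, x_prime), rbf_kernel(x, x_prime))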


Validity of Kernel Functions

Given that K1 : X × X → R and K2 : X × X → R are two symmetric, positive definite kernel functions, the validity of the following kernel functions is as follows:
Valid Kernel Functions:
K(x, x′) = c · K1(x, x′), where c is a positive constant.
K(x, x′) = K1(x, x′) + K2(x, x′)
Not a Valid Kernel Function:
K(x, x′) = K1(x, x′) + 1/K2(x, x′)

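These closure rules can be probed numerically. The sketch below (NumPy assumed; the six random points are made up) builds Gram matrices K1 (linear kernel) and K2 (RBF kernel) and prints the smallest eigenvalue of c·K1, of K1 + K2, and of K1 + 1/K2. The first two should be non-negative up to round-off; the last one is not guaranteed to be.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))                          # six illustrative 2-D points

K1 = X @ X.T                                         # linear-kernel Gram matrix
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K2 = np.exp(-sq_dists / 2.0)                         # RBF-kernel Gram matrix

def min_eig(K):
    return np.linalg.eigvalsh(K).min()

print(min_eig(5.0 * K1))       # c * K1      -> non-negative (up to round-off)
print(min_eig(K1 + K2))        # K1 + K2     -> non-negative (up to round-off)
print(min_eig(K1 + 1.0 / K2))  # K1 + 1/K2   -> may be negative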


Question

Consider two data points, x1 = (1, −1) and x2 = (2, 2), in a binary classification task
using an SVM with a custom kernel function K (x, y ). The kernel function is applied to
these points, resulting in the following matrix, referred to as matrix A:
   
    [ K(x1, x1)   K(x1, x2) ]     [ 1   3 ]
    [ K(x2, x1)   K(x2, x2) ]  =  [ 3   6 ]

Which of the following statements is correct regarding matrix A and the kernel function
K (x, y )?
A) K (x, y ) is a valid kernel.
B) K (x, y ) is not a valid kernel.
C) Matrix A is positive semi-definite.
D) Matrix A is not positive semi-definite.
Question

Given a kernel function K1 : R^n × R^n → R and its corresponding feature map ϕ1 : R^n → R^d, which feature map ϕ : R^n → R^d would correctly produce the kernel cK1(x, z), where c is a positive constant?
a) ϕ(x) = √c · ϕ1(x)
b) ϕ(x) = c · ϕ1(x)
c) ϕ(x) = c² · ϕ1(x)
d) No such feature map exists.
Question

Let K1 : X × X → R and K2 : X × X → R be two symmetric, positive definite kernel functions. Which of the following cannot be a valid kernel function?
(a) K(x, x′) = 5 · K1(x, x′)
(b) K(x, x′) = K1(x, x′) + K2(x, x′)
(c) K(x, x′) = K1(x, x′) + 1/K2(x, x′)
(d) All three are valid kernels.
Question

Given a kernel function K (x, x ′ ) = f (x)g (x ′ ) + f (x ′ )g (x), where f and g are real-valued
functions (Rd → R), the kernel is not valid. What additional terms would you include
in K to make it a valid kernel?
Options:
(A) f (x) + g (x)
(B) f (x)g (x) + f (x ′ )g (x ′ )
(C) f (x)f (x ′ ) + g (x)g (x ′ )
(D) f (x ′ ) + g (x ′ )
Question

Which of the following are properties that a kernel matrix always has?
□ Invertible
□ At least one negative eigenvalue
□ All the entries are positive
□ Symmetric
Question

Suppose ϕ(x) is an arbitrary feature mapping from input x ∈ X to ϕ(x) ∈ RN and let
K (x, z) = ϕ(x)⊤ ϕ(z). Then K (x, z) will always be a valid kernel function.
Circle one: True False
Question

Suppose ϕ(x) is the feature map induced by a polynomial kernel K (x, z) of degree d,
then ϕ(x) should be a d-dimensional vector.
Circle one: True False
Question

Which of the following are valid kernel functions?


⃝ k(x, z) = exp(−∥x − z∥² / 2σ²)
⃝ k(x, z) = ∥x∥∥z∥
⃝ k(x, z) = x⊤ [ 727    1 ] z
                [   1   42 ]
⃝ k(x, z) = x⊤ [ −727    1 ] z
                [    1  −42 ]
Question

= x⊤ rev(y
Let x, y ∈ Rd be two points. Consider the function k(x, y )   )where rev(y )
1 3
reverses the order of the components in y . For example, rev 2 = 2. Show that
  
3 1
k cannot be a valid kernel function.
Question

Which of the following statements about kernels are true?


⃝ A: The dimension of the lifted feature vectors Φ(·), whose inner products the
kernel function computes, can be infinite.
⃝ B: For any desired lifting Φ(x), we can design a kernel function k(x, z) that
will evaluate Φ(x)⊤ Φ(z) more quickly than explicitly computing Φ(x) and Φ(z).
⃝ C: The kernel trick, when it is applicable, speeds up a learning algorithm if the
number of sample points is substantially less than the dimension of the (lifted)
feature space.
⃝ D: If the raw feature vectors x, y are of dimension 2, then
k(x, y) = x1² y1² + x2² y2² is a valid kernel.
Question

Suppose we have a feature map Φ and a kernel function k(Xi , Xj ) = Φ(Xi ) · Φ(Xj ).
Select the true statements about kernels.
⃝ A: If there are n sample points of dimension d, it takes O(nd) time to compute
the kernel matrix.
⃝ B: The kernel trick implies we do not compute Φ(Xi ) explicitly for any sample
point Xi .
⃝ C: For every possible feature map Φ : Rd → RD you could imagine, there is a
way to compute k(Xi , Xj ) in O(d) time.
⃝ D: Running times of kernel algorithms do not depend on the dimension D of
the feature space Φ(·).
Linearly separable data



Note
The maximum margin hyperplane is determined by the positions of the support
vectors. Other data points can be relocated freely (as long as they stay outside
the margin region) without affecting the decision boundary. Therefore, the decision
boundary is independent of these non-support vector data points.
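This can be checked empirically. The sketch below (assuming scikit-learn and NumPy are available; the six toy points are made up) fits a nearly hard-margin linear SVM, removes one point that is not a support vector, refits, and compares the resulting w and b.

import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [1, 1], [0, 1], [3, 3], [4, 4], [3, 4]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)    # a very large C approximates a hard margin
non_sv = [i for i in range(len(X)) if i not in clf.support_]   # indices of non-support vectors

# Refit with one non-support vector removed; w and b should be essentially unchanged.
clf2 = SVC(kernel="linear", C=1e6).fit(np.delete(X, non_sv[0], axis=0),
                                       np.delete(y, non_sv[0]))
print(clf.coef_, clf.intercept_)
print(clf2.coef_, clf2.intercept_)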

Convex Hull

Definition: The convex hull of a set of points is the smallest convex set that contains all of them; for a finite set of points in the plane, it forms a convex polygon.

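In practice, the hull of a point set can be computed directly. Below is a minimal sketch assuming SciPy is installed (the points are illustrative). A standard way to connect this to SVMs: for linearly separable classes, the maximum margin hyperplane is the perpendicular bisector of the shortest segment joining the convex hulls of the two classes.

import numpy as np
from scipy.spatial import ConvexHull

# Made-up 2-D points; the two interior points do not appear among the hull vertices.
points = np.array([[0, 0], [2, 0], [2, 2], [0, 2], [1, 1], [0.5, 1.5]])
hull = ConvexHull(points)

print(hull.vertices)           # indices of the points on the hull, in counter-clockwise order
print(points[hull.vertices])   # coordinates of the hull's corner points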


Question

Find the Decision Boundary and support vectors for the following data points:
Question

In the following image, identify and circle the points that, if removed from the training
set, would result in a different decision boundary upon retraining the SVM compared to
training with the full sample.
Question

You are training an SVM on a tiny dataset with 4 points. This dataset consists of two
examples with class label -1 (denoted with plus), and two examples with class label +1
(denoted with triangles).
The points are:
Class -1: (1, 4), (2, 3)
Class +1: (4, 5), (5, 6)
What is the equation corresponding to the decision boundary?
Question

Consider the training data in Figure 2, repeated here for convenience. The "maximum margin classifier" (also called the linear "hard margin" SVM) is a classifier that leaves the largest possible margin on either side of the decision boundary. The samples lying on the margin are called support vectors.
Question

Draw on Figure 2 the decision boundary obtained by the linear hard margin SVM method
with a thick solid line. Draw the margins on either side with thinner dashed lines. Circle
the support vectors.
Question

What is the training error rate?


Question

The removal of which sample will change the decision boundary?


Question

What is the leave-one-out error rate?


Question

Suppose your training set for two-class classification in one dimension contains three
sample points: point x1 = 3 with label y1 = 1, point x2 = 1 with label y2 = 1, and
point x3 = −1 with label y3 = −1. What are the values of w and b given by a hard-
margin SVM?
A: w = 1, b = 1
B: w = 0, b = 1
C: w = 1, b = 0
D: w = ∞, b = 0
Question

The geometric margin in a hard margin Support Vector Machine is:


∥w∥²/2
1/∥w∥²
2/∥w∥
2/∥w∥²
Question
 
Suppose an SVM has w = [2, 3]⊤ and b = −1. What is the actual distance between the two planes defined by w⊤x + b = −1 and w⊤x + b = 1?
A: 1/√14
B: 2/√14
C: 11/2
D: 11
E: None of the above
Question

If a hard-margin support vector machine tries to minimize ∥w∥² subject to yi(Xi · w + α) ≥ 2, what will be the width of the slab (the point-free region bracketing the decision boundary)?
1/∥w∥
2/∥w∥
4/∥w∥
1/(2∥w∥)
Primal Form to Dual (Lagrange) Form
 
L = max_α [ Σ_{i=1}^{n} α_i − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j (x^(i))⊤ x^(j) ]

Where:

n ⇒ number of datapoints
x^(i) ⇒ the i-th datapoint
α_i, α_j > 0 for support vectors
α_i, α_j = 0 for non-support vectors

For two-dimensional inputs, (x^(i))⊤ x^(j) = [x1^(i), x2^(i)] · [x1^(j), x2^(j)]⊤

f(x_q) = Σ_{i=1}^{n} α_i y_i (x^(i))⊤ x_q + b

Where:

x_q ⇒ the query point
x^(i) ⇒ the i-th support vector in the training data

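A direct transcription of the decision function f(x_q) above into code (Python with NumPy assumed; the support vectors, α values, labels and b below are placeholder numbers, not the solution of any particular dual problem):

import numpy as np

def decision_function(x_q, support_vectors, alphas, labels, b):
    """f(x_q) = sum_i alpha_i * y_i * (x^(i) . x_q) + b, summed over the support vectors."""
    return sum(a * y * (x_i @ x_q) for a, y, x_i in zip(alphas, labels, support_vectors)) + b

# Hypothetical values, for illustration only.
sv     = np.array([[1.0, 1.0], [3.0, 3.0]])
alphas = np.array([0.25, 0.25])
labels = np.array([-1, 1])
b      = -1.0

print(decision_function(np.array([2.5, 2.0]), sv, alphas, labels, b))  # sign(f) gives the predicted class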


 
L = max_α [ Σ_{i=1}^{n} α_i − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j Φ(x^(i))⊤ Φ(x^(j)) ]

Where:

n ⇒ number of datapoints
x^(i) ⇒ the i-th datapoint
α_i, α_j > 0 for support vectors
α_i, α_j = 0 for non-support vectors

With the feature map Φ(x) = [x1^3, x1^2 x2, x1 x2^2, x2^3, x1^2, x1 x2, x1, x2]⊤, the product Φ(x^(i))⊤ Φ(x^(j)) is the dot product of the two 8-dimensional feature vectors Φ(x^(i)) and Φ(x^(j)).

f(x_q) = Σ_{i=1}^{n} α_i y_i Φ(x^(i))⊤ Φ(x_q) + b

Where:

x_q ⇒ the query point
x^(i) ⇒ the i-th support vector in the training data
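With the kernel trick, Φ(x^(i))⊤ Φ(x_q) in the decision function is replaced by a kernel evaluation K(x^(i), x_q), so Φ is never computed explicitly. A minimal sketch (Python with NumPy assumed; the RBF kernel, support vectors, α values and b are placeholder choices):

import numpy as np

def rbf(x, z, sigma=1.0):
    # K(x, z) = exp(-||x - z||^2 / (2 * sigma^2)); stands in for phi(x) . phi(z)
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

def decision_function(x_q, support_vectors, alphas, labels, b, kernel=rbf):
    """f(x_q) = sum_i alpha_i * y_i * K(x^(i), x_q) + b, over the support vectors."""
    return sum(a * y * kernel(x_i, x_q) for a, y, x_i in zip(alphas, labels, support_vectors)) + b

# Hypothetical values, for illustration only.
sv     = np.array([[1.0, 1.0], [3.0, 3.0]])
alphas = np.array([0.5, 0.5])
labels = np.array([-1, 1])
print(decision_function(np.array([2.0, 2.5]), sv, alphas, labels, b=0.0))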
Kernel Function





Types of Kernel Function

1 Polynomial Kernel Function

K(x, x′) = (x⊤x′ + c)^d

x and x′ are input vectors.
c is a constant that can be adjusted.
d is the degree of the polynomial.

2 Gaussian Kernel / Radial Basis Function (RBF) Kernel
It maps the input space into an infinite-dimensional feature space.

K(x, x′) = exp(−∥x − x′∥² / 2σ²) = exp(−d² / 2σ²)

d → distance between x and x′
x and x′ are input vectors.



Question

Given the scenario with four training examples in two dimensions as shown in the figure:
- Positive examples at x1 = [0, 0] and x2 = [2, 2]
- Negative examples at x3 = [h, 1] and x4 = [0, 3]
- We treat 0 ≤ h ≤ 3 as a parameter.
How large can h be so that the training points are still linearly separable?
Does the orientation of the maximum margin decision boundary change as a function
of h when the points are separable? (Yes/No)
Question

Assume that we can only observe the second component of the input vectors. Without
the other component, the labeled training points reduce to (0, y = 1), (2, y = 1), (1, y
= -1), and (3, y = -1). What is the lowest order p of the polynomial kernel that would
allow us to correctly classify these points?
Question

Given the following kernel function and the set of points plotted on the 2D plane,
determine the equation of the maximum margin decision boundary:

K(u⃗, v⃗) = 2|u⃗| |v⃗|

The points are distributed as follows:
- Positive class points: (−1, −1), (0, 0), (1, 1), (2, 2)
- Negative class points: (4, 4), (−3, −3), (−4, −4)
Determine the equation of the decision boundary.
Question

What are support vectors?


The examples farthest from the decision boundary.
The only examples necessary to compute f (x) in an SVM.
The class centroids.
All the examples that have a non-zero weight αk in an SVM.
Question

For the following dataset, select all transformations that can make the data linearly
separable.
Question

The geometric margin in a hard margin Support Vector Machine is:

∥w∥²/2,   1/∥w∥²,   2/∥w∥,   2/∥w∥²
Question

In SVMs, the values of αi for non-support vectors are 0. True or False?


Question

Suppose we train a hard-margin linear SVM on n > 100 data points in R2 , yielding a
hyperplane with exactly 2 support vectors. If we add one more data point and retrain
the classifier, what is the maximum possible number of support vectors for the new
hyperplane?
Question

1. In kernelized SVMs, the kernel matrix K has to be positive definite. True or False?
2. The RBF kernel K (xi , xj ) = exp(−γ∥xi − xj ∥2 ) corresponds to an infinite dimensional
mapping of the feature vectors. True or False?
Question

Given the following data samples (square and triangle mean two classes), which one(s)
of the following kernels can we use in SVM to separate the two classes?
Linear kernel
Polynomial kernel
Gaussian RBF (radial basis function) kernel
None of the above
Question

Radial basis functions can be used to make the following dataset linearly separable.
Question

Consider an SVM using a Radial Basis Function (RBF) kernel:

K (xi , xj ) = exp(−γ∥xi − xj ∥2 )
If you notice that the model is underfitting the training data, which parameter adjust-
ment might improve the model’s performance?
A) Increase the gamma γ parameter.
B) Decrease the gamma γ parameter.
C) Simplify the model by reducing the training iterations.
D) Remove the kernel trick and use a linear kernel.
