Kernel methods
Master 2 course unit (UE), AOS1
Fall 2020
S. Rousseau
Introductory example
Let's look at one of the simplest binary classifiers: the Euclidean classifier
• Classification rule: choose the class whose sample mean is closest to the sample
[Figure: the two class sample means, the point g, the direction w, a test sample x, and the vector x − g]
Decision rule
• The decision rule depends on the sign of an inner product, ⟨x − g, w⟩, where g is the midpoint of the two class means and w is the vector joining them
• An inner product leads to a decision boundary that is too simple (a hyperplane in n dimensions, a line in 2-d)
How can we remedy this?
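Before answering, here is a minimal numpy sketch of this Euclidean classifier and of the equivalent inner-product rule (the synthetic data, labels and test point are my own, not from the slides):

```python
import numpy as np

# Minimal numpy sketch of the Euclidean (nearest-class-mean) classifier,
# showing that the "closest mean" rule is the sign of an inner product.
def euclidean_classifier(X, y, x):
    m_pos = X[y == 1].mean(axis=0)               # class sample means
    m_neg = X[y == -1].mean(axis=0)
    g = (m_pos + m_neg) / 2                      # midpoint of the two means
    w = m_pos - m_neg                            # direction joining the means
    return 1 if np.dot(x - g, w) > 0 else -1     # same as comparing ||x - m_pos|| and ||x - m_neg||

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)
print(euclidean_classifier(X, y, np.array([1.5, 1.5])))    # 1
```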
Making my algorithm more powerful
Two solutions
1. Pragmatic solution, without guarantees: I replace the inner product by some other similarity
   • The similarity is easy to compute, but
   • I have no guarantee that it will work
2. Theoretical solution, hard in practice: I embed the samples in a possibly infinite-dimensional feature space
   • My algorithm is the same, just in a different space, but
   • computations in the feature space can be computationally intensive
The two solutions are actually the same!
(provided that the similarity is a kernel)
Kernel: formal definition
• Arbitrary input space X (ℝ^p, strings, graphs, ...)
• A kernel k defined on X is a map

    k : X × X → ℝ

  that verifies
  • Symmetry: ∀x, y ∈ X, k(x, y) = k(y, x); the matrix (K)_ij = k(x_i, x_j) is symmetric
  • Positive semidefiniteness: for all x_1, ..., x_n ∈ X, the matrix (K)_ij = k(x_i, x_j) is positive semidefinite (PSD, K ⪰ 0):

    ∀α ∈ ℝⁿ, αᵀKα = Σ_{i,j} α_i α_j k(x_i, x_j) ≥ 0
Some examples
• Linear kernel (the inner product is of course a kernel): X = ℝ^p, k(x, y) = ⟨x, y⟩
• Polynomial kernel: let X = ℝ^p, d > 0, c > 0,

    k_poly1(x, y) = (c + ⟨x, y⟩)^d

• Gaussian kernel or RBF (radial basis function): k_gauss(x, y) = exp(−‖x − y‖_2² / σ²)
• Sigmoid kernel: k(x, y) = tanh(γ⟨x, y⟩ + c)
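A small numpy sketch of these kernels; the function names, default parameters and the exp(−‖x − y‖² / σ²) bandwidth convention are my own choices:

```python
import numpy as np

# Sketches of the kernels above (one pair of vectors at a time).
def linear_kernel(x, y):
    return x @ y

def poly_kernel(x, y, c=1.0, d=2):
    return (c + x @ y) ** d

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / sigma ** 2)

def sigmoid_kernel(x, y, gamma=0.1, c=0.0):
    return np.tanh(gamma * (x @ y) + c)          # not PSD for every (gamma, c)

x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(linear_kernel(x, y), poly_kernel(x, y), gaussian_kernel(x, y))
```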
From solution #2 to solution #1
Embedding actually defines a kernel
• Let Φ : X → F be some embedding of X into a feature space F equipped with an inner product ⟨·, ·⟩_F
• Define k as follows:

    k(x, y) = ⟨Φ(x), Φ(y)⟩_F

• Defined that way, k is always a kernel!
  • Obviously symmetric
  • The inner product in F makes k PSD
From solution #2 to solution #1: the polynomial kernel case
Suppose X = ℝ² (p = 2), c = 1 and d = 2
• The polynomial kernel is then

    k_poly1(x, y) = (1 + ⟨x, y⟩)² = (1 + x_1y_1 + x_2y_2)²
                  = 1 + 2x_1y_1 + 2x_2y_2 + x_1²y_1² + x_2²y_2² + 2x_1x_2y_1y_2

• Defining Φ(x) = (1, x_1², x_2², √2 x_1, √2 x_2, √2 x_1x_2), we have

    k_poly1(x, y) = ⟨Φ(x), Φ(y)⟩

• The kernel k_poly1 can be interpreted as an inner product
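A quick numerical check of this identity (a sketch; the two sample points are arbitrary):

```python
import numpy as np

# Checking numerically that k_poly1(x, y) = <Phi(x), Phi(y)> for p = 2, c = 1, d = 2
def k_poly1(x, y):
    return (1 + x @ y) ** 2

def phi(x):
    x1, x2 = x
    return np.array([1, x1**2, x2**2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2, np.sqrt(2) * x1 * x2])

x, y = np.array([0.3, -1.2]), np.array([2.0, 0.7])
print(k_poly1(x, y), phi(x) @ phi(y))            # the two values coincide
```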
From solution #2 to solution #1: when X is finite
From the definition, K is a kernel iff K is symmetric PSD
• K is diagonalizable in an orthonormal basis: K = UDUᵀ with U orthogonal and D diagonal (with nonnegative entries since K is PSD)

    k(x_i, x_j) = K_ij = (UDUᵀ)_ij = u_iᵀ D u_j = u_iᵀ √D √D u_j = ⟨√D u_i, √D u_j⟩

• Defining the function Φ : X → ℝⁿ, x_i ↦ √D u_i, we have k(x_i, x_j) = ⟨Φ(x_i), Φ(x_j)⟩
• The kernel k can also be interpreted as an inner product
Kernel trick
• Kernel trick:

    k(x, y) = ⟨Φ(x), Φ(y)⟩_F

  No need to compute the embedding and then the inner product: evaluating the kernel does exactly this.
• Any algorithm using only inner products between training samples can be "kernelized"
• Most famous kernel method: SVM (support vector machines)
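As an illustration of the trick (my own sketch, not from the slides), the introductory Euclidean classifier can be kernelized: the squared distance to a class mean in feature space expands into kernel evaluations only.

```python
import numpy as np

# Sketch of a kernelized nearest-class-mean classifier:
# ||Phi(x) - mu_c||^2 = k(x, x) - (2/n_c) sum_i k(x, x_i) + (1/n_c^2) sum_{i,j} k(x_i, x_j)
def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / sigma ** 2)

def kernel_nearest_mean(X, y, x_new, k=gaussian_kernel):
    scores = {}
    for c in np.unique(y):
        Xc = X[y == c]
        cross = np.mean([k(x_new, xi) for xi in Xc])
        within = np.mean([[k(xi, xj) for xj in Xc] for xi in Xc])
        scores[c] = k(x_new, x_new) - 2 * cross + within
    return min(scores, key=scores.get)           # class whose feature-space mean is closest
```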
Separating hyperplane
• In ℝ^p, a hyperplane H is defined by h(x) = 0 with h(x) = w · x + b
• w is orthogonal to the hyperplane
• Distance from a point x to H: d(x, H) = |w · x + b| / ‖w‖_2
• Distance from the origin to H: |b| / ‖w‖_2
[Figure: the hyperplane w · x + b = 0 in the (x_1, x_2) plane, the normal vector w, and the half-spaces w · x + b > 0 and w · x + b < 0]
Margin
• Let H be a separating hyperplane
• Distance from a sample x_i to H:

    d(x_i, H) = |w · x_i + b| / ‖w‖_2

• The margin of a separating hyperplane is the smallest distance from H to the x_i's:

    M = min_{i=1,...,n} d(x_i, H)

[Figure: a separating hyperplane H and its margin]
Optimal separating hyperplane
• The optimal separating hyperplane maximizes the margin
• The solution is unique
• The vectors x_i supporting the margin, i.e. such that M = d(x_i, H), are called support vectors
• H* = argmax_{H ∈ ℋ} min_{i=1,...,n} d(x_i, H)
[Figure: the optimal separating hyperplane w · x + b = 0 with equal margins on both sides, separating the regions w · x + b > 0 and w · x + b < 0]
One-to-one representation
• The correspondence hyperplane ⟺ (w, b) is not one-to-one:
  if (w, b) represents H, so does (cw, cb) for any c ≠ 0
• Possible choices for c
  1. Data independent: impose ‖w‖_2 = 1 (taking c = 1/‖w‖_2)

     (w, b) with ‖w‖_2 = 1    (normalization 1)

  2. Data dependent: w · x + b equals ±1 on the margin lines

     (w, b) with ‖w‖_2 = 1/M    (normalization 2)

  When H is an optimal separating hyperplane, w · x_i + b = ±1 at the support vectors
Optimal separating hyperplane
Choosing the data-dependent normalization:
• The margin is 1/‖w‖_2
• w · x + b takes the value ±1 at the support vectors
[Figure: the hyperplane w · x + b = 0 with the margin lines w · x + b = 1 and w · x + b = −1; the distance between the two margin lines is 2/‖w‖_2]
Maximizing the margin
• Maximizing 1/‖w‖_2 ⟺ minimizing ‖w‖_2²
• Proper labelling:
  • w · x_i + b ≥ 1 when y_i = 1
  • w · x_i + b ≤ −1 when y_i = −1
  i.e. y_i(w · x_i + b) ≥ 1

    argmin_{(w,b)} (1/2)‖w‖_2²   such that   ∀i, y_i(w · x_i + b) ≥ 1        (1)

• Quadratic programming problem: quadratic objective, linear constraints (a solver sketch follows)
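A sketch of problem (1) fed to a generic convex solver, assuming cvxpy is available (the synthetic data and parameters are mine; the data must be linearly separable for the hard-margin problem to be feasible):

```python
import numpy as np
import cvxpy as cp

# Hard-margin primal problem (1): minimize 0.5 ||w||^2 s.t. y_i (w.x_i + b) >= 1
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

w = cp.Variable(2)
b = cp.Variable()
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()
print(w.value, b.value)
```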
Lagrangian formulation
• We use the Lagrangian, with n Lagrange multipliers μ_1, ..., μ_n:

    L(w, b, μ) = (1/2)‖w‖_2² − Σ_{i=1}^n μ_i (y_i(w · x_i + b) − 1)

• Karush–Kuhn–Tucker (KKT) conditions, because we have inequality constraints:

    ∂L/∂w = w − Σ_{i=1}^n μ_i y_i x_i = 0   ⟹   w = Σ_{i=1}^n μ_i y_i x_i
    ∂L/∂b = −Σ_{i=1}^n μ_i y_i = 0          ⟹   Σ_{i=1}^n μ_i y_i = 0
    μ_i ≥ 0   and   μ_i (y_i(w · x_i + b) − 1) = 0
Dual formulation
• Plugging the KKT conditions back into the objective function, we have

    L(μ, b) = (1/2)‖Σ_{j=1}^n μ_j y_j x_j‖_2² − Σ_{i=1}^n μ_i (y_i (Σ_{j=1}^n μ_j y_j x_j · x_i + b) − 1)

  and, since Σ_{i=1}^n μ_i y_i = 0, the term in b vanishes:

    L(μ) = (1/2)‖Σ_{j=1}^n μ_j y_j x_j‖_2² − Σ_{i,j=1}^n μ_i μ_j y_i y_j x_i · x_j + Σ_{i=1}^n μ_i
         = −(1/2)‖Σ_{j=1}^n μ_j y_j x_j‖_2² + Σ_{i=1}^n μ_i
Dual formulation
SVM minimization problem (dual form)

    argmin_μ (1/2) Σ_{i,j=1}^n μ_i μ_j y_i y_j x_i · x_j − Σ_{i=1}^n μ_i   such that   ∀i, μ_i ≥ 0   and   Σ_{i=1}^n μ_i y_i = 0

• The problem only depends on the inner products x_i · x_j: the SVM algorithm is kernelizable (see the sketch below)
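A sketch of the dual problem with a precomputed kernel matrix, again assuming cvxpy; replacing the Gram matrix by any kernel matrix is exactly the kernelization announced above (the RBF helper and parameter values are my own):

```python
import numpy as np
import cvxpy as cp

# Hard-margin dual with a precomputed kernel matrix K; assumes the data are
# separable in feature space (add the box constraint mu <= C for soft margin).
def rbf_kernel_matrix(X, sigma=1.0):
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / sigma ** 2)

def dual_svm(K, y):
    n = len(y)
    Q = np.outer(y, y) * K                       # Q_ij = y_i y_j k(x_i, x_j), PSD since K is
    eigval, eigvec = np.linalg.eigh(Q)
    R = np.diag(np.sqrt(np.maximum(eigval, 0))) @ eigvec.T   # R.T @ R == Q
    mu = cp.Variable(n)
    objective = cp.Minimize(0.5 * cp.sum_squares(R @ mu) - cp.sum(mu))
    constraints = [mu >= 0, mu @ y == 0]
    cp.Problem(objective, constraints).solve()
    return mu.value
```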
Support vectors
• If y_i(w · x_i + b) > 1: μ_i = 0, x_i is not used
• If y_i(w · x_i + b) = 1: μ_i > 0, x_i is on the margin, it is a support vector
[Figure: the separating hyperplane w · x + b = 0 and its margin; the support vectors lie on the margin lines]
• Only the support vectors are used to classify
Classification
• Classification of a new example x:

    y(x) = sgn(w · x + b) = sgn(Σ_{i=1}^n μ_i y_i x_i · x + b)

• Only the support vectors are used to classify
  • We only need to compute x_i · x for x_i a support vector
  • Robust to outliers in the training data
• How to determine b:
  • If x_i is a support vector, y_i(w · x_i + b) = 1, hence b = y_i − w · x_i
  • More stable numerically: average over the support vectors,

    b = (1 / #{i : μ_i > 0}) Σ_{i : μ_i > 0} (y_i − w · x_i)
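Continuing the dual sketch above (still my own illustration), b and the decision function can be recovered from kernel evaluations against the support vectors only:

```python
import numpy as np

# Recover b and classify new points using kernel values only.
def svm_predict(mu, X, y, X_new, kernel, tol=1e-6):
    sv = mu > tol                                    # support vectors have mu_i > 0
    K_sv = np.array([[kernel(xi, xj) for xj in X] for xi in X[sv]])
    b = np.mean(y[sv] - K_sv @ (mu * y))             # b averaged over the support vectors
    K_new = np.array([[kernel(x, xj) for xj in X] for x in X_new])
    return np.sign(K_new @ (mu * y) + b)             # sgn(sum_i mu_i y_i k(x_i, x) + b)
```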
Extension to the non-separable case
• Until now, the classes were assumed to be linearly separable, but:
  • this is never the case in practice
  • the minimization problem (1), and hence its dual form, does not even have a solution: the set of admissible solutions is empty
• Typical solution: relaxing the constraints
Slack variables
• Original minimization problem:

    argmin_{(w,b)} (1/2)‖w‖_2²   such that   ∀i, y_i(w · x_i + b) ≥ 1

• Some constraints might be violated; we give them some slack by introducing nonnegative slack variables ξ_1, ..., ξ_n, so that

    ∀i, y_i(w · x_i + b) ≥ 1   becomes   ∀i, y_i(w · x_i + b) ≥ 1 − ξ_i

• The ξ_i's absorb the violations
Slack variables: geometric interpretation
New constraints: y_i(w · x_i + b) ≥ 1 − ξ_i
• If ξ_i = 0, the original constraint is met
• If 0 < ξ_i < 2, the original constraint is violated and the sample x_i lies inside the margin
• If ξ_i > 2, the original constraint is violated and the sample x_i lies on the wrong side
[Figure: the hyperplane w · x + b = 0 with margin lines w · x + b = ±1 (width 2/‖w‖_2) and samples with various slack values ξ illustrating the three cases]
Soft margin minimization problem
• Still maximizing the margin, but regularizing by the total amount of slack Σ_{i=1}^n ξ_i:

    argmin_{(w,b,ξ)} (1/2)‖w‖_2² + C Σ_{i=1}^n ξ_i   such that   ∀i, y_i(w · x_i + b) ≥ 1 − ξ_i   and   ξ_i ≥ 0

• We introduce a new hyperparameter C: the cost of adding some slack
• Hard margin is recovered as C → +∞: no slack allowed
Hinge loss
• Equivalence of constraints:

    { y_i(w · x_i + b) ≥ 1 − ξ_i  and  ξ_i ≥ 0 }   ⟺   ξ_i ≥ max(0, 1 − y_i(w · x_i + b))

• The objective function is smaller when ξ_i = max(0, 1 − y_i(w · x_i + b))
• Plugging these ξ_i's into the objective function, we get

    argmin_{(w,b)} (1/2)‖w‖_2² + C Σ_{i=1}^n max(0, 1 − y_i(w · x_i + b))

• Defining the hinge loss as L_hinge(y, ŷ) = max(0, 1 − yŷ):

    argmin_{(w,b)} Σ_{i=1}^n L_hinge(y_i, w · x_i + b) + (1/(2C))‖w‖_2²
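A minimal sketch (my own, with arbitrary step size and epoch count) of training a linear soft-margin SVM by subgradient descent on this hinge-loss objective:

```python
import numpy as np

# Batch subgradient descent on: sum_i max(0, 1 - y_i (w.x_i + b)) + ||w||^2 / (2C)
def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=200):
    n, p = X.shape
    w, b = np.zeros(p), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1                         # samples with nonzero hinge loss
        grad_w = w / C - (y[active][:, None] * X[active]).sum(axis=0)
        grad_b = -y[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```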
Other losses
Losses when the true class is y_i = 1:
• 0/1 loss: penalises a wrong sign of the prediction ŷ_i
• Square loss: penalises the gap, whatever the direction
• Hinge loss: a relaxation of the 0/1 loss
[Figure: the 0/1 loss, hinge loss and square loss as functions of the prediction ŷ_i]
Illustrations: increasing the tuning parameter C
Gaussian kernel with varying C
[Figure: decision boundaries on the same 2-d dataset for C = 1, C = 10, C = 100 and C = 1000]
Illustrations: changing the kernel
[Figure: decision boundaries on the same 2-d dataset for the linear, rbf, poly and sigmoid kernels]
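Figures of this kind can be reproduced with scikit-learn; a sketch (the dataset, noise level and grid are my own assumptions, not the data used in the slides):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# One SVC per kernel on the same 2-d dataset; looping over C = 1, 10, 100, 1000
# with kernel="rbf" instead gives the previous figure.
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
xx, yy = np.meshgrid(np.linspace(-2, 3, 200), np.linspace(-2, 2, 200))

fig, axes = plt.subplots(2, 2, figsize=(8, 6))
for ax, kernel in zip(axes.ravel(), ["linear", "rbf", "poly", "sigmoid"]):
    clf = SVC(kernel=kernel, C=1.0).fit(X, y)
    zz = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    ax.contourf(xx, yy, zz, alpha=0.3)               # decision regions
    ax.scatter(X[:, 0], X[:, 1], c=y, s=10)
    ax.set_title(kernel)
plt.show()
```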
Kernel PCA
The idea is to apply principal component analysis in the feature space
• The new points are y_i = Φ(x_i)
• The design matrix changes from X, whose rows are the x_iᵀ, to X_Φ, whose rows are the y_iᵀ = Φ(x_i)ᵀ
• Suppose for now that X_Φ is centered
• The kernel matrix K gathers the inner products in feature space: K = X_Φ X_Φᵀ
Kernel PCA
Kernel PCA on the feature space
• Let V_Φ = (1/n) X_Φᵀ X_Φ be the sample variance–covariance matrix, v_1, ..., v_q its eigenvectors and λ_1 ≥ ... ≥ λ_q the corresponding eigenvalues
• We have seen that the i-th principal component is X_Φ v_i
• We don't want to compute X_Φ v_i directly
• It is easy to see that K(X_Φ v_i) = nλ_i (X_Φ v_i): the principal components are just eigenvectors (properly rescaled) of the kernel matrix K
Centering the kernel
We supposed that X_Φ was centered, but what if it is not?
• Define the centering operator

    Q_n = I_n − (1/n) 1_n

  where 1_n is the n × n all-ones matrix (Q_n has 1 − 1/n on the diagonal and −1/n everywhere else), so that X'_Φ = Q_n X_Φ is the centered version of X_Φ
• The corresponding kernel is K' = Q_n X_Φ (Q_n X_Φ)ᵀ = Q_n X_Φ X_Φᵀ Q_n = Q_n K Q_n
• No need to compute X_Φ: just center the kernel matrix column-wise and row-wise
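Putting the last two slides together, a numpy sketch of kernel PCA (the component count and the rescaling convention, squared norm nλ_i, are my own choices):

```python
import numpy as np

# Center K with Q_n, then take eigenvectors of K' as principal components.
def kernel_pca(K, n_components=2):
    n = K.shape[0]
    Q = np.eye(n) - np.ones((n, n)) / n          # centering operator Q_n
    Kc = Q @ K @ Q                               # K' = Q_n K Q_n
    eigval, eigvec = np.linalg.eigh(Kc)          # ascending eigenvalues
    eigval, eigvec = eigval[::-1], eigvec[:, ::-1]
    return eigvec[:, :n_components] * np.sqrt(np.maximum(eigval[:n_components], 0))

# with a linear kernel this reduces to classic PCA scores
X = np.random.default_rng(0).normal(size=(10, 3))
print(kernel_pca(X @ X.T).shape)                 # (10, 2)
```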
Examples
Classic PCA (linear kernel)
• The level lines of the first principal component are straight lines
• And they are orthogonal to the first principal direction
[Figure: a 2-d point cloud with the level lines of the first principal component]
Examples
Kernel PCA with an RBF kernel (Gaussian kernel)
[Figure: a 2-d point cloud and the level lines of the first principal component obtained with the RBF kernel]
Kernel ridge regression
• Matrix inversion trick: from X(I + XᵀX) = (I + XXᵀ)X we get

    (I + XXᵀ)⁻¹ X = X (I + XᵀX)⁻¹

  where XXᵀ is n × n and XᵀX is p × p
• Using this in ŷ_ridge = X(XᵀX + λI_p)⁻¹ Xᵀ y, we get

    ŷ_ridge = (λI_n + XXᵀ)⁻¹ XXᵀ y

  and then ŷ_ridge = (λI_n + K)⁻¹ K y
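A quick numerical check of this identity with a linear kernel (a sketch; the synthetic data and λ are arbitrary):

```python
import numpy as np

# Check that (lambda I_n + K)^{-1} K y equals the primal ridge fitted values
# when K = X X^T (linear kernel).
rng = np.random.default_rng(0)
n, p, lam = 50, 3, 0.5
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

primal = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
K = X @ X.T
dual = np.linalg.solve(lam * np.eye(n) + K, K @ y)
print(np.allclose(primal, dual))                 # True: same fitted values
```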
Kernel k-means / kernel k-nearest neighbors
These algorithms only depend on the distances between samples
• Pairwise distances can be expressed with inner products:

    ‖x_i − x_j‖² = ⟨x_i, x_i⟩ − 2⟨x_i, x_j⟩ + ⟨x_j, x_j⟩

• k-means is therefore kernelizable; the "kernel distance" is

    d(x_i, x_j)² = k(x_i, x_i) − 2k(x_i, x_j) + k(x_j, x_j)
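A small sketch computing these kernel distances from a kernel matrix alone (the function name and the linear-kernel check are mine):

```python
import numpy as np

# Pairwise "kernel distances" from a kernel matrix, ready for k-NN or k-means.
def kernel_distances(K):
    diag = np.diag(K)
    return np.sqrt(np.maximum(diag[:, None] - 2 * K + diag[None, :], 0))

X = np.random.default_rng(0).normal(size=(5, 3))
D = kernel_distances(X @ X.T)                    # linear kernel: usual Euclidean distances
print(np.allclose(D[0, 1], np.linalg.norm(X[0] - X[1])))   # True
```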