
Kernel methods

UE de Master 2, AOS1
Fall 2020

S. Rousseau

1
Introductory example

Let's look at one of the simplest binary classifiers: the Euclidean classifier

• Classification rule: choose the class whose sample mean is the closest

[Figure: two classes of points with their sample means, a test sample x, the midpoint g of the two means and the vector w joining them]

2
Decision rule

• The decision rule depends on the sign of an inner product

      ⟨x − g, w⟩

• The inner product leads to too simple a decision boundary
  (a hyperplane in n-d, a line in 2-d)

[Figure: the same two classes, the midpoint g of the means, the vector w and a test sample x]

How can we remedy this?
3
Making my algorithm more powerful

Two solutions
1. Pragmatic solution, without guarantees: I replace the inner product by some other
   similarity
   • The similarity is easy to compute, but
   • I have no guarantee that it will work
2. Theoretically sound but hard-in-practice solution: I embed the samples in a possibly
   infinite-dimensional feature space
   • My algorithm is the same, only in a different space, but
   • computations in the feature space can be computationally intensive

The two solutions are actually the same!
(provided that the similarity is a kernel)

4
Kernel: formal definition

• Arbitrary input space X (ℝ^p, strings, graphs, ...)

• A kernel k defined on X is a map

      k : X × X → ℝ

  that satisfies
  • Symmetry: ∀x, y ∈ X, k(x, y) = k(y, x)
    (the matrix (K)_ij = k(x_i, x_j) is symmetric)
  • Positive semidefiniteness: for all x_1, ..., x_n ∈ X, the matrix (K)_ij = k(x_i, x_j)
    is positive semidefinite (PSD, K ⪰ 0):

      ∀α ∈ ℝ^n,   αᵀKα = Σ_{i,j} α_i α_j k(x_i, x_j) ≥ 0
5
Some examples

• Linear kernel (the inner product is of course a kernel): X = ℝ^p, k(x, y) = ⟨x, y⟩

• Polynomial kernel: let X = ℝ^p, d > 0, c ≥ 0

      k_poly1(x, y) = (c + ⟨x, y⟩)^d

• Gaussian kernel, or RBF (radial basis function): k_gauss(x, y) = exp(−‖x − y‖₂² / σ²)

• Sigmoid kernel: k(x, y) = tanh(γ⟨x, y⟩ + c)

6
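As a concrete illustration (my addition, not from the slides), here is a minimal NumPy sketch of these four kernels, each returning the Gram matrix (K)_ij = k(x_i, y_j); the parameter names c, d, sigma and gamma mirror the formulas above.

```python
import numpy as np

def linear_kernel(X, Y):
    # <x, y> for every pair of rows of X and Y
    return X @ Y.T

def poly_kernel(X, Y, c=1.0, d=2):
    # (c + <x, y>)^d
    return (c + X @ Y.T) ** d

def gaussian_kernel(X, Y, sigma=1.0):
    # exp(-||x - y||^2 / sigma^2), using ||x - y||^2 = ||x||^2 - 2<x, y> + ||y||^2
    sq = (X ** 2).sum(1)[:, None] - 2 * X @ Y.T + (Y ** 2).sum(1)[None, :]
    return np.exp(-sq / sigma ** 2)

def sigmoid_kernel(X, Y, gamma=1.0, c=0.0):
    # tanh(gamma <x, y> + c); note that this one is PSD only for some (gamma, c)
    return np.tanh(gamma * X @ Y.T + c)

X = np.random.default_rng(0).normal(size=(5, 3))
K = gaussian_kernel(X, X)                        # 5 x 5 Gram matrix
print(np.all(np.linalg.eigvalsh(K) >= -1e-10))   # PSD check: eigenvalues numerically >= 0
```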
From solution #2 to solution #1

Embedding actually defines a kernel

• Let Φ : X → F be some embedding of X into a feature space F equipped with an
  inner product ⟨·,·⟩_F
• Define k as follows

      k(x, y) = ⟨Φ(x), Φ(y)⟩_F

• Defined that way, k is always a kernel!
  • It is obviously symmetric
  • The inner product in F makes k PSD

7
From solution #2 to solution #1: the polynomial kernel case

Suppose X = ℝ² (p = 2), c = 1 and d = 2

• The polynomial kernel is then

      k_poly1(x, y) = (1 + ⟨x, y⟩)²
                    = (1 + x₁y₁ + x₂y₂)²
                    = 1 + 2x₁y₁ + 2x₂y₂ + x₁²y₁² + x₂²y₂² + 2x₁x₂y₁y₂

• Defining Φ(x) = (1, x₁², x₂², √2 x₁, √2 x₂, √2 x₁x₂), we have

      k_poly1(x, y) = ⟨Φ(x), Φ(y)⟩

• The kernel k_poly1 can be interpreted as an inner product
8
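A quick numerical check of this identity (my addition): the explicit feature map Φ reproduces the kernel value exactly.

```python
import numpy as np

def k_poly1(x, y):
    # (1 + <x, y>)^2, i.e. c = 1 and d = 2
    return (1.0 + x @ y) ** 2

def phi(x):
    # explicit feature map for c = 1, d = 2 in dimension 2
    s = np.sqrt(2.0)
    return np.array([1.0, x[0]**2, x[1]**2, s * x[0], s * x[1], s * x[0] * x[1]])

x, y = np.array([0.3, -1.2]), np.array([2.0, 0.5])
print(k_poly1(x, y), phi(x) @ phi(y))   # both print the same value
```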
From solution #2 to solution #1: when X is finite

From the definition, K is a kernel iff K is symmetric PSD

• K is diagonalizable in an orthonormal basis: K = UDUᵀ with U orthogonal and D
  diagonal (PSD means the entries of D are nonnegative, so √D is well defined)

      k(x_i, x_j) = K_ij
                  = (UDUᵀ)_ij
                  = u_iᵀ D u_j              (u_i denotes the i-th row of U)
                  = u_iᵀ √D √D u_j
                  = ⟨√D u_i, √D u_j⟩

• Define the function Φ

      Φ : X → ℝ^n
          x_i ↦ √D u_i

  we have k(x_i, x_j) = ⟨Φ(x_i), Φ(x_j)⟩

• The kernel k can also be interpreted as an inner product

9
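A small NumPy sketch of this construction (my addition): build Φ from the eigendecomposition of a PSD kernel matrix and check that its inner products recover K.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
K = (1.0 + X @ X.T) ** 2                # any symmetric PSD matrix works; here a polynomial kernel

eigvals, U = np.linalg.eigh(K)          # K = U D U^T, columns of U are orthonormal eigenvectors
eigvals = np.clip(eigvals, 0.0, None)   # clip tiny negative values due to round-off

Phi = U * np.sqrt(eigvals)              # row i is sqrt(D) u_i, i.e. Phi(x_i)
print(np.allclose(Phi @ Phi.T, K))      # True: <Phi(x_i), Phi(x_j)> = K_ij
```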
Kernel trick

• Kernel trick:

      k(x, y) = ⟨Φ(x), Φ(y)⟩_F

  No need to compute the embedding and then the inner product: the kernel computes
  this directly.
• Any algorithm using only inner products between training samples can be
“kernelized”
• Most famous kernel method: SVM (support vector machines)

10
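To connect this with the introductory example, here is a hedged sketch (my addition, following the usual kernelization of the nearest-mean rule) of a kernelized Euclidean classifier: every inner product in ⟨x − g, w⟩, with w the difference of the class sample means and g their midpoint, is replaced by a kernel evaluation.

```python
import numpy as np

def rbf(X, Y, sigma=1.0):
    sq = (X ** 2).sum(1)[:, None] - 2 * X @ Y.T + (Y ** 2).sum(1)[None, :]
    return np.exp(-sq / sigma ** 2)

def kernel_mean_classifier(X_train, y_train, X_test, kernel=rbf):
    """Kernelized nearest-class-mean rule: sign of <Phi(x) - g, w> in feature space."""
    Xp, Xm = X_train[y_train == 1], X_train[y_train == -1]
    Kp = kernel(X_test, Xp).mean(axis=1)    # <Phi(x), m+> = (1/n+) sum_i k(x, x_i+)
    Km = kernel(X_test, Xm).mean(axis=1)    # <Phi(x), m->
    b = 0.5 * (kernel(Xp, Xp).mean() - kernel(Xm, Xm).mean())   # <g, w> in feature space
    return np.sign(Kp - Km - b)
```

With the linear kernel k(x, y) = ⟨x, y⟩ this reduces exactly to the Euclidean classifier of the introductory example; with a nonlinear kernel the decision boundary is no longer a hyperplane.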
Separating hyperplane

• In ℝ^p, a hyperplane H is defined by h(x) = 0 with h(x) = w·x + b

• w is orthogonal to the hyperplane

• d(x, H) = |w·x + b| / ‖w‖₂

• The distance from the origin is |b| / ‖w‖₂

[Figure: a hyperplane w·x + b = 0 in the (x₁, x₂) plane, its normal vector w, and the
two half-spaces w·x + b > 0 and w·x + b < 0]
11
Margin

• Let H be a separating hyperplane
• The distance from a sample x_i to H is

      d(x_i, H) = |w·x_i + b| / ‖w‖₂

• The margin of a separating hyperplane is the smallest distance from H to the x_i's

      M = min_{i=1,...,n} d(x_i, H)

[Figure: a separating hyperplane H and its margin, the distance to the closest samples]

12
Optimal separating hyperplane

• The optimal separating hyperplane maximizes the margin:

      H* = arg max_{H ∈ ℋ}  min_{i=1,...,n} d(x_i, H)

• The solution is unique
• The vectors x_i supporting the margin, i.e. such that M = d(x_i, H), are called
  support vectors

[Figure: the optimal separating hyperplane w·x + b = 0, its margin on both sides, and
the half-spaces w·x + b > 0 and w·x + b < 0]

13
One-to-one representation

• The correspondence hyperplane ⇔ (w, b) is not one-to-one:
  if (w, b) represents H, so does (cw, cb) for any c ≠ 0
• Possible choices for c
  1. Data independent: impose ‖w‖₂ = 1 (taking c = 1/‖w‖₂)

         (w, b) with ‖w‖₂ = 1         (normalization 1)

  2. Data dependent: w·x + b equals ±1 on the margin lines

         (w, b) with ‖w‖₂ = 1/M       (normalization 2)

     When H is the optimal separating hyperplane, w·x_i + b = ±1 at the support vectors

14
Optimal separating hyperplane

Choosing the data-dependent normalization (normalization 2)

• The margin is 1/‖w‖₂
• w·x + b takes the value ±1 at the support vectors

[Figure: the hyperplane w·x + b = 0 and the two margin lines w·x + b = 1 and
w·x + b = −1, which are 2/‖w‖₂ apart]

15
Maximizing the margin

• Maximizing 1/‖w‖₂ ⇔ minimizing ‖w‖₂²

• Proper labelling:
  • w·x_i + b ≥ 1 when y_i = 1
  • w·x_i + b ≤ −1 when y_i = −1
  i.e. y_i(w·x_i + b) ≥ 1

      arg min_{(w,b)} ½‖w‖₂²   such that   ∀i, y_i(w·x_i + b) ≥ 1          (1)

• Quadratic programming problem: quadratic objective, linear constraints

16
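As a sanity check (my addition), problem (1) can be handed directly to a generic QP solver; below is a minimal sketch using the cvxpy library (an assumption, any QP solver would do), for a linearly separable data matrix X of shape (n, p) and labels y in {−1, +1}.

```python
import numpy as np
import cvxpy as cp

def hard_margin_svm(X, y):
    """Solve problem (1): min 1/2 ||w||^2  s.t.  y_i (w . x_i + b) >= 1."""
    n, p = X.shape
    w, b = cp.Variable(p), cp.Variable()
    objective = cp.Minimize(0.5 * cp.sum_squares(w))
    constraints = [cp.multiply(y, X @ w + b) >= 1]
    cp.Problem(objective, constraints).solve()
    return w.value, b.value
```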
Lagrangian formulation

• We use the Lagrangian, with n Lagrange multipliers μ_1, ..., μ_n:

      L(w, b, μ) = ½‖w‖₂² − Σ_{i=1}^n μ_i (y_i(w·x_i + b) − 1)

• Karush–Kuhn–Tucker (KKT) conditions, because we have inequality constraints:

      ∂L/∂w = w − Σ_{i=1}^n μ_i y_i x_i = 0   ⟹   w = Σ_{i=1}^n μ_i y_i x_i

      ∂L/∂b = −Σ_{i=1}^n μ_i y_i = 0          ⟹   Σ_{i=1}^n μ_i y_i = 0

      μ_i ≥ 0   and   μ_i (y_i(w·x_i + b) − 1) = 0

17
Dual formulation

• Plugging the KKT conditions back into the objective function, we have

      L(μ, b) = ½ ‖Σ_{j=1}^n μ_j y_j x_j‖₂²
                − Σ_{i=1}^n μ_i [ y_i ( (Σ_{j=1}^n μ_j y_j x_j) · x_i + b ) − 1 ]

  Since Σ_{i=1}^n μ_i y_i = 0, the term in b vanishes:

      L(μ) = ½ ‖Σ_{j=1}^n μ_j y_j x_j‖₂² − Σ_{i,j=1}^n μ_i μ_j y_i y_j x_i·x_j + Σ_{i=1}^n μ_i

           = −½ ‖Σ_{j=1}^n μ_j y_j x_j‖₂² + Σ_{i=1}^n μ_i

18
Dual formulation

SVM minimization problem (dual form)

      arg min_μ  ½ Σ_{i,j=1}^n μ_i μ_j y_i y_j x_i·x_j − Σ_{i=1}^n μ_i
      such that  ∀i, μ_i ≥ 0   and   Σ_{i=1}^n μ_i y_i = 0

• The problem only depends on the inner products x_i·x_j: the SVM algorithm is
  kernelizable

19
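In practice one rarely solves this QP by hand. A sketch with scikit-learn (my addition; the toy data and the very large C mimicking a hard margin are arbitrary choices), whose SVC accepts a precomputed Gram matrix and therefore works with any kernel:

```python
import numpy as np
from sklearn.svm import SVC

def rbf_gram(X, Y, sigma=1.0):
    sq = (X ** 2).sum(1)[:, None] - 2 * X @ Y.T + (Y ** 2).sum(1)[None, :]
    return np.exp(-sq / sigma ** 2)

rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(-1, 1, (20, 2)), rng.normal(+1, 1, (20, 2))])
y_train = np.array([-1] * 20 + [+1] * 20)

clf = SVC(kernel="precomputed", C=1e6)          # very large C ~ hard margin
clf.fit(rbf_gram(X_train, X_train), y_train)    # Gram matrix of the training set

X_test = rng.normal(0, 1, (5, 2))
print(clf.predict(rbf_gram(X_test, X_train)))   # k(x_test, x_i) for every training sample
print(clf.support_)                             # indices of the support vectors (mu_i > 0)
```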
Support vectors

• y_i(w·x_i + b) > 1: μ_i = 0, x_i is not used

• y_i(w·x_i + b) = 1: μ_i > 0, x_i is on the margin, it is a support vector

[Figure: the optimal hyperplane w·x + b = 0 with its margin; the samples lying on the
margin lines are the support vectors]

• Only the support vectors are used to classify
20
Classification

• Classification of a new example x:

      y(x) = sgn(w·x + b)
           = sgn( Σ_{i=1}^n μ_i y_i x_i·x + b )

• Only the support vectors are used to classify
  • We only need to compute x_i·x for x_i a support vector
  • The classifier is unaffected by the training samples that are not support vectors
• How to determine b:
  • If x_i is a support vector, y_i(w·x_i + b) = 1, hence b = y_i − w·x_i
  • A numerically more stable choice is the average over the support vectors:

      b = (1 / #{i : μ_i > 0}) Σ_{i : μ_i > 0} (y_i − w·x_i)

21
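A NumPy sketch of this recipe (my addition), assuming μ is a solution of the dual problem, for instance obtained from the QP sketch above:

```python
import numpy as np

def svm_from_dual(mu, X, y, tol=1e-8):
    """Recover w, b and a decision function from the dual solution mu."""
    w = (mu * y) @ X                        # w = sum_i mu_i y_i x_i
    sv = mu > tol                           # support vectors: mu_i > 0
    b = np.mean(y[sv] - X[sv] @ w)          # average of y_i - w . x_i over the support vectors
    decide = lambda X_new: np.sign(X_new @ w + b)
    return w, b, decide
```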
Extension to the non-separable case

• Until now, the classes were assumed to be linearly separable, but:
  • this is almost never the case in practice
  • the minimization problem (1) and its dual then have no solution: the set of
    admissible solutions is empty
• Typical remedy: relax the constraints

22
Slack variables

• Original minimization problem:

      arg min_{(w,b)} ½‖w‖₂²   such that   ∀i, y_i(w·x_i + b) ≥ 1

• Some constraints might be violated; we give them some slack by introducing
  nonnegative slack variables ξ_1, ..., ξ_n, so that

      ∀i, y_i(w·x_i + b) ≥ 1    becomes    ∀i, y_i(w·x_i + b) ≥ 1 − ξ_i

• The ξ_i's absorb the violations of the constraints

23
Slack variables: geometric interpretation

New constraints: y_i(w·x_i + b) ≥ 1 − ξ_i

• If ξ_i = 0, the original constraint is met
• If 0 < ξ_i < 2, the original constraint is violated and the sample x_i lies inside
  the margin
• If ξ_i > 2, the original constraint is violated and the sample x_i lies on the wrong
  side

[Figure: the hyperplane w·x + b = 0, the margin lines w·x + b = ±1 at distance 2/‖w‖₂,
and samples annotated with their slack values ξ = 1, ξ = 2, ξ > 2]
24
Soft margin minimization problem

• We still maximize the margin, but regularize by the total amount of slack Σ_{i=1}^n ξ_i:

      arg min_{(w,b,ξ)} ½‖w‖₂² + C Σ_{i=1}^n ξ_i   such that   ∀i, y_i(w·x_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0

• We introduce a new hyperparameter C: the cost of adding some slack

• Hard margin when C → +∞: no slack allowed

25
Hinge loss

• Equivalence of constraints:

      y_i(w·x_i + b) ≥ 1 − ξ_i  and  ξ_i ≥ 0   ⇔   ξ_i ≥ max(0, 1 − y_i(w·x_i + b))

• The objective function is smaller when ξ_i = max(0, 1 − y_i(w·x_i + b))
• Plugging these ξ_i's into the objective function, we get

      arg min_{(w,b)} ½‖w‖₂² + C Σ_{i=1}^n max(0, 1 − y_i(w·x_i + b))

• Defining the hinge loss function as L_hinge(x, y) = max(0, 1 − xy):

      arg min_{(w,b)} Σ_{i=1}^n L_hinge(y_i, w·x_i + b) + (1/2C)‖w‖₂²

26
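The unconstrained hinge-loss form lends itself to direct optimization; here is a rough subgradient-descent sketch (my addition; step size and iteration count are arbitrary) minimizing Σ_i max(0, 1 − y_i(w·x_i + b)) + ‖w‖₂²/(2C).

```python
import numpy as np

def svm_hinge_subgradient(X, y, C=1.0, lr=1e-3, n_iter=5000):
    """Subgradient descent on the hinge-loss formulation of the soft-margin SVM."""
    n, p = X.shape
    w, b = np.zeros(p), 0.0
    for _ in range(n_iter):
        active = y * (X @ w + b) < 1                         # samples with nonzero hinge loss
        grad_w = -(y[active, None] * X[active]).sum(axis=0) + w / C
        grad_b = -y[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```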
Other losses

Losses when the true class is y_i = 1

• 0/1 loss: penalises a wrong sign of the prediction ŷ_i
• Square loss: penalises the gap, whatever the direction
• Hinge loss: relaxes the 0/1 loss

[Figure: the 0/1, square and hinge losses plotted as functions of the prediction ŷ_i]

27
Illustrations: increasing the tuning parameter C

Gaussian kernel with varying C

[Figure: decision boundaries of an SVM with a Gaussian kernel on the same data set,
for C = 1, C = 10, C = 100 and C = 1000]

28
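A sketch (my addition) of how such panels can be produced with scikit-learn and matplotlib on a toy data set; the data set and grid limits are placeholders. Changing kernel="rbf" to "linear", "poly" or "sigmoid" gives the kernel comparison of the next slide.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
xx, yy = np.meshgrid(np.linspace(-2, 3, 300), np.linspace(-2, 2, 300))
grid = np.c_[xx.ravel(), yy.ravel()]

fig, axes = plt.subplots(2, 2, figsize=(8, 6))
for ax, C in zip(axes.ravel(), [1, 10, 100, 1000]):
    clf = SVC(kernel="rbf", C=C).fit(X, y)
    Z = clf.decision_function(grid).reshape(xx.shape)
    ax.contourf(xx, yy, Z > 0, alpha=0.3)      # the two decision regions
    ax.scatter(X[:, 0], X[:, 1], c=y, s=10)
    ax.set_title(f"C={C}")
plt.show()
```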
Illustrations: changing the kernel

[Figure: decision boundaries of SVMs on the same data set with a linear, RBF,
polynomial and sigmoid kernel]

29
Kernel PCA

The idea is to apply principal component analysis in the feature space

• The new points are y_i = Φ(x_i)

• The design matrix changes from

      X  = [ x_1ᵀ ]              X_Φ = [ y_1ᵀ ]
           [  ⋮   ]      to            [  ⋮   ]
           [ x_nᵀ ]                    [ y_nᵀ ]

• Suppose for now that X_Φ is centered

• The kernel matrix K gathers the inner products in feature space: K = X_Φ X_Φᵀ

30
Kernel PCA

Kernel PCA on the feature space

• Let V_Φ = (1/n) X_Φᵀ X_Φ be the sample variance–covariance matrix, v_1, ..., v_q its
  eigenvectors and λ_1 ≥ ... ≥ λ_q the corresponding eigenvalues
• We have seen that the i-th principal component is X_Φ v_i

We don't want to compute X_Φ v_i explicitly

• It is easy to see that K(X_Φ v_i) = X_Φ X_Φᵀ X_Φ v_i = n X_Φ V_Φ v_i = nλ_i (X_Φ v_i):
  the principal components are just eigenvectors (properly rescaled) of the kernel
  matrix K

31
Centering the kernel

We supposed that X_Φ was centered, but what if it's not?

• Define the centering operator

      Q_n = I_n − (1/n) 1_n     (1_n is the n × n all-ones matrix),

  i.e. (Q_n)_ii = 1 − 1/n and (Q_n)_ij = −1/n for i ≠ j, so that X_Φ' = Q_n X_Φ is the
  centered version of X_Φ

• The corresponding kernel is K' = Q_n X_Φ (Q_n X_Φ)ᵀ = Q_n X_Φ X_Φᵀ Q_n = Q_n K Q_n
• No need to compute X_Φ: just center the kernel matrix column-wise and row-wise

32
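A compact NumPy sketch of the whole kernel PCA procedure described above (my addition); the RBF kernel and its bandwidth are arbitrary choices, and the rescaling of the eigenvectors is one common convention (score vectors whose squared norm equals the eigenvalue of K').

```python
import numpy as np

def rbf_gram(X, sigma=1.0):
    sq = (X ** 2).sum(1)[:, None] - 2 * X @ X.T + (X ** 2).sum(1)[None, :]
    return np.exp(-sq / sigma ** 2)

def kernel_pca(X, n_components=2, sigma=1.0):
    n = X.shape[0]
    K = rbf_gram(X, sigma)
    Q = np.eye(n) - np.ones((n, n)) / n          # centering operator Q_n
    Kc = Q @ K @ Q                               # centered kernel K' = Q_n K Q_n
    eigvals, eigvecs = np.linalg.eigh(Kc)        # eigenvalues in ascending order
    idx = np.argsort(eigvals)[::-1][:n_components]
    # principal component scores = rescaled top eigenvectors of the centered kernel matrix
    return eigvecs[:, idx] * np.sqrt(np.clip(eigvals[idx], 0.0, None))

Z = kernel_pca(np.random.default_rng(0).normal(size=(100, 5)), n_components=2)
```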
Examples

Classic PCA (linear kernel)

• The level lines of the first principal component are straight lines
• They are orthogonal to the first principal direction

[Figure: a 2-d data set with the level lines of its first principal component]

33
Examples

Kernel PCA with an RBF (Gaussian) kernel

[Figure: the same data set with the level lines of the first kernel principal
component]

34
Kernel ridge regression

• Matrix inversion trick: from X(I + XᵀX) = (I + XXᵀ)X we get

      (I + XXᵀ)⁻¹ X = X (I + XᵀX)⁻¹

  where XXᵀ is n × n and XᵀX is p × p (the same identity holds with I replaced by λI)
• Using this in ŷ_ridge = X (XᵀX + λI_p)⁻¹ Xᵀ y, we get

      ŷ_ridge = (λI_n + XXᵀ)⁻¹ XXᵀ y

  and then, with K = XXᵀ,

      ŷ_ridge = (λI_n + K)⁻¹ K y

35
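A minimal NumPy sketch of these fitted values (my addition). Predicting at a new point uses the vector of kernel evaluations k(x_new, x_i), which the slide does not spell out, so that last step is an assumption following the same representer form.

```python
import numpy as np

def rbf_gram(X, Y, sigma=1.0):
    sq = (X ** 2).sum(1)[:, None] - 2 * X @ Y.T + (Y ** 2).sum(1)[None, :]
    return np.exp(-sq / sigma ** 2)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)

lam = 0.1
K = rbf_gram(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(y)), y)   # dual coefficients (K + lam I)^{-1} y
y_fit = K @ alpha                                      # fitted values (lam I + K)^{-1} K y

X_new = np.linspace(-3, 3, 7)[:, None]
y_new = rbf_gram(X_new, X) @ alpha                     # predictions at new points
```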
Kernel k-means / kernel k-nearest neighbors

These algorithms only depend on distances between samples

• Pairwise distances can be expressed with inner products:

      ‖x_i − x_j‖² = ⟨x_i, x_i⟩ − 2⟨x_i, x_j⟩ + ⟨x_j, x_j⟩

• k-means is therefore kernelizable. The "kernel distance" is then

      d(x_i, x_j)² = k(x_i, x_i) − 2 k(x_i, x_j) + k(x_j, x_j)

36
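A one-liner in NumPy (my addition) turning a Gram matrix into the matrix of squared kernel distances, which is all that kernel k-means or kernel k-nearest neighbors needs:

```python
import numpy as np

def kernel_sq_distances(K):
    """d(x_i, x_j)^2 = k(x_i, x_i) - 2 k(x_i, x_j) + k(x_j, x_j), from the Gram matrix K."""
    diag = np.diag(K)
    return diag[:, None] - 2 * K + diag[None, :]
```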
