Kernel methods
Master 2 course unit (UE), AOS1
Fall 2020
S. Rousseau
Introductory example
Let's look at one of the simplest binary classifiers: the Euclidean classifier
• Classification rule: choose the class whose sample mean is closest to the sample
[Figure: the two class sample means, the point g, the direction w, a test sample x, and the vector x − g]
Decision rule
• The decision rule depends on the sign of an inner product, ⟨x − g, w⟩, where g is the midpoint of the two class means and w is the vector joining them
• An inner product leads to a decision boundary that is too simple (a hyperplane in n dimensions, a line in 2-d)
How can we remedy this?
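Before answering, here is a minimal numpy sketch of this Euclidean classifier and of the equivalent inner-product rule (the synthetic data, labels and test point are my own, not from the slides):

```python
import numpy as np

# Minimal numpy sketch of the Euclidean (nearest-class-mean) classifier,
# showing that the "closest mean" rule is the sign of an inner product.
def euclidean_classifier(X, y, x):
    m_pos = X[y == 1].mean(axis=0)               # class sample means
    m_neg = X[y == -1].mean(axis=0)
    g = (m_pos + m_neg) / 2                      # midpoint of the two means
    w = m_pos - m_neg                            # direction joining the means
    return 1 if np.dot(x - g, w) > 0 else -1     # same as comparing ||x - m_pos|| and ||x - m_neg||

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)
print(euclidean_classifier(X, y, np.array([1.5, 1.5])))    # 1
```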
Making my algorithm more powerful
Two solutions
1. Pragmatic solution, without guarantees: I replace the inner product by some other similarity
   • The similarity is easy to compute, but
   • I have no guarantee that it will work
2. Theoretical solution, hard in practice: I embed the samples in a possibly infinite-dimensional feature space
   • My algorithm is the same, just in a different space, but
   • computations in the feature space can be computationally intensive
The two solutions are actually the same!
(provided that the similarity is a kernel)
Kernel: formal definition
• Arbitrary input space X (ℝ^p, strings, graphs, ...)
• A kernel k defined on X is a map

    k : X × X → ℝ

  that verifies
  • Symmetry: ∀x, y ∈ X, k(x, y) = k(y, x); the matrix (K)_ij = k(x_i, x_j) is symmetric
  • Positive semidefiniteness: for all x_1, ..., x_n ∈ X, the matrix (K)_ij = k(x_i, x_j) is positive semidefinite (PSD, K ⪰ 0):

    ∀α ∈ ℝⁿ, αᵀKα = Σ_{i,j} α_i α_j k(x_i, x_j) ≥ 0
Some examples
• Linear kernel (the inner product is of course a kernel): X = ℝ^p, k(x, y) = ⟨x, y⟩
• Polynomial kernel: let X = ℝ^p, d > 0, c > 0,

    k_poly1(x, y) = (c + ⟨x, y⟩)^d

• Gaussian kernel or RBF (radial basis function): k_gauss(x, y) = exp(−‖x − y‖_2² / σ²)
• Sigmoid kernel: k(x, y) = tanh(γ⟨x, y⟩ + c)
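A small numpy sketch of these kernels; the function names, default parameters and the exp(−‖x − y‖² / σ²) bandwidth convention are my own choices:

```python
import numpy as np

# Sketches of the kernels above (one pair of vectors at a time).
def linear_kernel(x, y):
    return x @ y

def poly_kernel(x, y, c=1.0, d=2):
    return (c + x @ y) ** d

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / sigma ** 2)

def sigmoid_kernel(x, y, gamma=0.1, c=0.0):
    return np.tanh(gamma * (x @ y) + c)          # not PSD for every (gamma, c)

x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(linear_kernel(x, y), poly_kernel(x, y), gaussian_kernel(x, y))
```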
From solution #2 to solution #1
Embedding actually defines a kernel
• Let Φ : X → F be some embedding of X into a feature space F equipped with an inner product ⟨·, ·⟩_F
• Define k as follows:

    k(x, y) = ⟨Φ(x), Φ(y)⟩_F

• Defined that way, k is always a kernel!
  • Obviously symmetric
  • The inner product in F makes k PSD
From solution #2 to solution #1: the polynomial kernel case
Suppose X = ℝ² (p = 2), c = 1 and d = 2
• The polynomial kernel is then

    k_poly1(x, y) = (1 + ⟨x, y⟩)² = (1 + x_1y_1 + x_2y_2)²
                  = 1 + 2x_1y_1 + 2x_2y_2 + x_1²y_1² + x_2²y_2² + 2x_1x_2y_1y_2

• Defining Φ(x) = (1, x_1², x_2², √2 x_1, √2 x_2, √2 x_1x_2), we have

    k_poly1(x, y) = ⟨Φ(x), Φ(y)⟩

• The kernel k_poly1 can be interpreted as an inner product
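A quick numerical check of this identity (a sketch; the two sample points are arbitrary):

```python
import numpy as np

# Checking numerically that k_poly1(x, y) = <Phi(x), Phi(y)> for p = 2, c = 1, d = 2
def k_poly1(x, y):
    return (1 + x @ y) ** 2

def phi(x):
    x1, x2 = x
    return np.array([1, x1**2, x2**2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2, np.sqrt(2) * x1 * x2])

x, y = np.array([0.3, -1.2]), np.array([2.0, 0.7])
print(k_poly1(x, y), phi(x) @ phi(y))            # the two values coincide
```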
From solution #2 to solution #1: when X is finite
From the definition, K is a kernel iff K is symmetric PSD
• K is diagonalizable in an orthonormal basis: K = UDUᵀ with U orthogonal and D diagonal (with nonnegative entries since K is PSD)

    k(x_i, x_j) = K_ij = (UDUᵀ)_ij = u_iᵀ D u_j = u_iᵀ √D √D u_j = ⟨√D u_i, √D u_j⟩

• Defining the function Φ : X → ℝⁿ, x_i ↦ √D u_i, we have k(x_i, x_j) = ⟨Φ(x_i), Φ(x_j)⟩
• The kernel k can also be interpreted as an inner product
Kernel trick
• Kernel trick:

    k(x, y) = ⟨Φ(x), Φ(y)⟩_F

  No need to compute the embedding and then the inner product: evaluating the kernel does exactly this.
• Any algorithm using only inner products between training samples can be "kernelized"
• Most famous kernel method: SVM (support vector machines)
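As an illustration of the trick (my own sketch, not from the slides), the introductory Euclidean classifier can be kernelized: the squared distance to a class mean in feature space expands into kernel evaluations only.

```python
import numpy as np

# Sketch of a kernelized nearest-class-mean classifier:
# ||Phi(x) - mu_c||^2 = k(x, x) - (2/n_c) sum_i k(x, x_i) + (1/n_c^2) sum_{i,j} k(x_i, x_j)
def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / sigma ** 2)

def kernel_nearest_mean(X, y, x_new, k=gaussian_kernel):
    scores = {}
    for c in np.unique(y):
        Xc = X[y == c]
        cross = np.mean([k(x_new, xi) for xi in Xc])
        within = np.mean([[k(xi, xj) for xj in Xc] for xi in Xc])
        scores[c] = k(x_new, x_new) - 2 * cross + within
    return min(scores, key=scores.get)           # class whose feature-space mean is closest
```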
Separating hyperplane
• In ℝ^p, a hyperplane H is defined by h(x) = 0 with h(x) = w · x + b
• w is orthogonal to the hyperplane
• Distance from a point x to H: d(x, H) = |w · x + b| / ‖w‖_2
• Distance from the origin to H: |b| / ‖w‖_2
[Figure: the hyperplane w · x + b = 0 in the (x_1, x_2) plane, the normal vector w, and the half-spaces w · x + b > 0 and w · x + b < 0]
Margin
• Let H be a separating hyperplane
• Distance from a sample x_i to H:

    d(x_i, H) = |w · x_i + b| / ‖w‖_2

• The margin of a separating hyperplane is the smallest distance from H to the x_i's:

    M = min_{i=1,...,n} d(x_i, H)

[Figure: a separating hyperplane H and its margin]
Optimal separating hyperplane
• The optimal separating hyperplane maximizes the margin
• The solution is unique
• The vectors x_i supporting the margin, i.e. such that M = d(x_i, H), are called support vectors
• H* = argmax_{H ∈ ℋ} min_{i=1,...,n} d(x_i, H)
[Figure: the optimal separating hyperplane w · x + b = 0 with equal margins on both sides, separating the regions w · x + b > 0 and w · x + b < 0]
One-to-one representation
• The correspondence hyperplane ⟺ (w, b) is not one-to-one:
  if (w, b) represents H, so does (cw, cb) for any c ≠ 0
• Possible choices for c
  1. Data independent: impose ‖w‖_2 = 1 (taking c = 1/‖w‖_2)

     (w, b) with ‖w‖_2 = 1    (normalization 1)

  2. Data dependent: w · x + b equals ±1 on the margin lines

     (w, b) with ‖w‖_2 = 1/M    (normalization 2)

  When H is an optimal separating hyperplane, w · x_i + b = ±1 at the support vectors
Optimal separating hyperplane
Choosing the data-dependent normalization:
• The margin is 1/‖w‖_2
• w · x + b takes the value ±1 at the support vectors
[Figure: the hyperplane w · x + b = 0 with the margin lines w · x + b = 1 and w · x + b = −1; the distance between the two margin lines is 2/‖w‖_2]
Maximizing the margin
• Maximizing 1/‖w‖_2 ⟺ minimizing ‖w‖_2²
• Proper labelling:
  • w · x_i + b ≥ 1 when y_i = 1
  • w · x_i + b ≤ −1 when y_i = −1
  i.e. y_i(w · x_i + b) ≥ 1

    argmin_{(w,b)} (1/2)‖w‖_2²   such that   ∀i, y_i(w · x_i + b) ≥ 1        (1)

• Quadratic programming problem: quadratic objective, linear constraints (a solver sketch follows)
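A sketch of problem (1) fed to a generic convex solver, assuming cvxpy is available (the synthetic data and parameters are mine; the data must be linearly separable for the hard-margin problem to be feasible):

```python
import numpy as np
import cvxpy as cp

# Hard-margin primal problem (1): minimize 0.5 ||w||^2 s.t. y_i (w.x_i + b) >= 1
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

w = cp.Variable(2)
b = cp.Variable()
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()
print(w.value, b.value)
```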
Lagrangian formulation
• We use the Lagrangian, with n Lagrange multipliers μ_1, ..., μ_n:

    L(w, b, μ) = (1/2)‖w‖_2² − Σ_{i=1}^n μ_i (y_i(w · x_i + b) − 1)

• Karush–Kuhn–Tucker (KKT) conditions, because we have inequality constraints:

    ∂L/∂w = w − Σ_{i=1}^n μ_i y_i x_i = 0   ⟹   w = Σ_{i=1}^n μ_i y_i x_i
    ∂L/∂b = −Σ_{i=1}^n μ_i y_i = 0          ⟹   Σ_{i=1}^n μ_i y_i = 0
    μ_i ≥ 0   and   μ_i (y_i(w · x_i + b) − 1) = 0
Dual formulation
• Plugging the KKT conditions back into the objective function, we have

    L(μ, b) = (1/2)‖Σ_{j=1}^n μ_j y_j x_j‖_2² − Σ_{i=1}^n μ_i (y_i (Σ_{j=1}^n μ_j y_j x_j · x_i + b) − 1)

  and, since Σ_{i=1}^n μ_i y_i = 0, the term in b vanishes:

    L(μ) = (1/2)‖Σ_{j=1}^n μ_j y_j x_j‖_2² − Σ_{i,j=1}^n μ_i μ_j y_i y_j x_i · x_j + Σ_{i=1}^n μ_i
         = −(1/2)‖Σ_{j=1}^n μ_j y_j x_j‖_2² + Σ_{i=1}^n μ_i
Dual formulation
SVM minimization problem (dual form)

    argmin_μ (1/2) Σ_{i,j=1}^n μ_i μ_j y_i y_j x_i · x_j − Σ_{i=1}^n μ_i   such that   ∀i, μ_i ≥ 0   and   Σ_{i=1}^n μ_i y_i = 0

• The problem only depends on the inner products x_i · x_j: the SVM algorithm is kernelizable (see the sketch below)
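A sketch of the dual problem with a precomputed kernel matrix, again assuming cvxpy; replacing the Gram matrix by any kernel matrix is exactly the kernelization announced above (the RBF helper and parameter values are my own):

```python
import numpy as np
import cvxpy as cp

# Hard-margin dual with a precomputed kernel matrix K; assumes the data are
# separable in feature space (add the box constraint mu <= C for soft margin).
def rbf_kernel_matrix(X, sigma=1.0):
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / sigma ** 2)

def dual_svm(K, y):
    n = len(y)
    Q = np.outer(y, y) * K                       # Q_ij = y_i y_j k(x_i, x_j), PSD since K is
    eigval, eigvec = np.linalg.eigh(Q)
    R = np.diag(np.sqrt(np.maximum(eigval, 0))) @ eigvec.T   # R.T @ R == Q
    mu = cp.Variable(n)
    objective = cp.Minimize(0.5 * cp.sum_squares(R @ mu) - cp.sum(mu))
    constraints = [mu >= 0, mu @ y == 0]
    cp.Problem(objective, constraints).solve()
    return mu.value
```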
Support vectors
• If y_i(w · x_i + b) > 1: μ_i = 0, x_i is not used
• If y_i(w · x_i + b) = 1: μ_i > 0, x_i is on the margin, it is a support vector
[Figure: the separating hyperplane w · x + b = 0 and its margin; the support vectors lie on the margin lines]
• Only the support vectors are used to classify
Classification
• Classification of a new example x:

    y(x) = sgn(w · x + b) = sgn(Σ_{i=1}^n μ_i y_i x_i · x + b)

• Only the support vectors are used to classify
  • We only need to compute x_i · x for x_i a support vector
  • Robust to outliers in the training data
• How to determine b:
  • If x_i is a support vector, y_i(w · x_i + b) = 1, hence b = y_i − w · x_i
  • More stable numerically: average over the support vectors,

    b = (1 / #{i : μ_i > 0}) Σ_{i : μ_i > 0} (y_i − w · x_i)
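Continuing the dual sketch above (still my own illustration), b and the decision function can be recovered from kernel evaluations against the support vectors only:

```python
import numpy as np

# Recover b and classify new points using kernel values only.
def svm_predict(mu, X, y, X_new, kernel, tol=1e-6):
    sv = mu > tol                                    # support vectors have mu_i > 0
    K_sv = np.array([[kernel(xi, xj) for xj in X] for xi in X[sv]])
    b = np.mean(y[sv] - K_sv @ (mu * y))             # b averaged over the support vectors
    K_new = np.array([[kernel(x, xj) for xj in X] for x in X_new])
    return np.sign(K_new @ (mu * y) + b)             # sgn(sum_i mu_i y_i k(x_i, x) + b)
```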
Extension to the non-separable case
• Until now, the classes were assumed to be linearly separable, but:
  • this is never the case in practice
  • the minimization problem (1), and hence its dual form, does not even have a solution: the set of admissible solutions is empty
• Typical solution: relaxing the constraints
Slack variables
• Original minimization problem:

    argmin_{(w,b)} (1/2)‖w‖_2²   such that   ∀i, y_i(w · x_i + b) ≥ 1

• Some constraints might be violated; we give them some slack by introducing nonnegative slack variables ξ_1, ..., ξ_n, so that

    ∀i, y_i(w · x_i + b) ≥ 1   becomes   ∀i, y_i(w · x_i + b) ≥ 1 − ξ_i

• The ξ_i's absorb the violations
Slack variables: geometric interpretation
New constraints: y_i(w · x_i + b) ≥ 1 − ξ_i
• If ξ_i = 0, the original constraint is met
• If 0 < ξ_i < 2, the original constraint is violated and the sample x_i lies inside the margin
• If ξ_i > 2, the original constraint is violated and the sample x_i lies on the wrong side
[Figure: the hyperplane w · x + b = 0 with margin lines w · x + b = ±1 (width 2/‖w‖_2) and samples with various slack values ξ illustrating the three cases]
Soft margin minimization problem
• Still maximizing the margin, but regularizing by the total amount of slack Σ_{i=1}^n ξ_i:

    argmin_{(w,b,ξ)} (1/2)‖w‖_2² + C Σ_{i=1}^n ξ_i   such that   ∀i, y_i(w · x_i + b) ≥ 1 − ξ_i   and   ξ_i ≥ 0

• We introduce a new hyperparameter C: the cost of adding some slack
• Hard margin is recovered as C → +∞: no slack allowed
Hinge loss
• Equivalence of constraints:

    { y_i(w · x_i + b) ≥ 1 − ξ_i  and  ξ_i ≥ 0 }   ⟺   ξ_i ≥ max(0, 1 − y_i(w · x_i + b))

• The objective function is smaller when ξ_i = max(0, 1 − y_i(w · x_i + b))
• Plugging these ξ_i's into the objective function, we get

    argmin_{(w,b)} (1/2)‖w‖_2² + C Σ_{i=1}^n max(0, 1 − y_i(w · x_i + b))

• Defining the hinge loss as L_hinge(y, ŷ) = max(0, 1 − yŷ):

    argmin_{(w,b)} Σ_{i=1}^n L_hinge(y_i, w · x_i + b) + (1/(2C))‖w‖_2²
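A minimal sketch (my own, with arbitrary step size and epoch count) of training a linear soft-margin SVM by subgradient descent on this hinge-loss objective:

```python
import numpy as np

# Batch subgradient descent on: sum_i max(0, 1 - y_i (w.x_i + b)) + ||w||^2 / (2C)
def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=200):
    n, p = X.shape
    w, b = np.zeros(p), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1                         # samples with nonzero hinge loss
        grad_w = w / C - (y[active][:, None] * X[active]).sum(axis=0)
        grad_b = -y[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```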
Other losses
Losses when the true class is y_i = 1:
• 0/1 loss: penalises a wrong sign of the prediction ŷ_i
• Square loss: penalises the gap, whatever the direction
• Hinge loss: a relaxation of the 0/1 loss
[Figure: the 0/1 loss, hinge loss and square loss as functions of the prediction ŷ_i]
Illustrations: increasing the tuning parameter C
Gaussian kernel with varying C
[Figure: decision boundaries on the same 2-d dataset for C = 1, C = 10, C = 100 and C = 1000]
Illustrations: changing the kernel
[Figure: decision boundaries on the same 2-d dataset for the linear, rbf, poly and sigmoid kernels]
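Figures of this kind can be reproduced with scikit-learn; a sketch (the dataset, noise level and grid are my own assumptions, not the data used in the slides):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# One SVC per kernel on the same 2-d dataset; looping over C = 1, 10, 100, 1000
# with kernel="rbf" instead gives the previous figure.
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
xx, yy = np.meshgrid(np.linspace(-2, 3, 200), np.linspace(-2, 2, 200))

fig, axes = plt.subplots(2, 2, figsize=(8, 6))
for ax, kernel in zip(axes.ravel(), ["linear", "rbf", "poly", "sigmoid"]):
    clf = SVC(kernel=kernel, C=1.0).fit(X, y)
    zz = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    ax.contourf(xx, yy, zz, alpha=0.3)               # decision regions
    ax.scatter(X[:, 0], X[:, 1], c=y, s=10)
    ax.set_title(kernel)
plt.show()
```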
Kernel PCA
The idea is to apply principal component analysis in the feature space
• The new points are y_i = Φ(x_i)
• The design matrix changes from X, whose rows are the x_iᵀ, to X_Φ, whose rows are the y_iᵀ = Φ(x_i)ᵀ
• Suppose for now that X_Φ is centered
• The kernel matrix K gathers the inner products in feature space: K = X_Φ X_Φᵀ
Kernel PCA
Kernel PCA on the feature space
• Let V_Φ = (1/n) X_Φᵀ X_Φ be the sample variance–covariance matrix, v_1, ..., v_q its eigenvectors and λ_1 ≥ ... ≥ λ_q the corresponding eigenvalues
• We have seen that the i-th principal component is X_Φ v_i
• We don't want to compute X_Φ v_i directly
• It is easy to see that K(X_Φ v_i) = nλ_i (X_Φ v_i): the principal components are just eigenvectors (properly rescaled) of the kernel matrix K
Centering the kernel
We supposed that X_Φ was centered, but what if it is not?
• Define the centering operator

    Q_n = I_n − (1/n) 1_n

  where 1_n is the n × n all-ones matrix (Q_n has 1 − 1/n on the diagonal and −1/n everywhere else), so that X'_Φ = Q_n X_Φ is the centered version of X_Φ
• The corresponding kernel is K' = Q_n X_Φ (Q_n X_Φ)ᵀ = Q_n X_Φ X_Φᵀ Q_n = Q_n K Q_n
• No need to compute X_Φ: just center the kernel matrix column-wise and row-wise
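Putting the last two slides together, a numpy sketch of kernel PCA (the component count and the rescaling convention, squared norm nλ_i, are my own choices):

```python
import numpy as np

# Center K with Q_n, then take eigenvectors of K' as principal components.
def kernel_pca(K, n_components=2):
    n = K.shape[0]
    Q = np.eye(n) - np.ones((n, n)) / n          # centering operator Q_n
    Kc = Q @ K @ Q                               # K' = Q_n K Q_n
    eigval, eigvec = np.linalg.eigh(Kc)          # ascending eigenvalues
    eigval, eigvec = eigval[::-1], eigvec[:, ::-1]
    return eigvec[:, :n_components] * np.sqrt(np.maximum(eigval[:n_components], 0))

# with a linear kernel this reduces to classic PCA scores
X = np.random.default_rng(0).normal(size=(10, 3))
print(kernel_pca(X @ X.T).shape)                 # (10, 2)
```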
Examples
Classic PCA (linear kernel)
• The level lines of the first principal component are straight lines
• And they are orthogonal to the first principal direction
[Figure: a 2-d point cloud with the level lines of the first principal component]
Examples
Kernel PCA with an RBF kernel (Gaussian kernel)
[Figure: a 2-d point cloud and the level lines of the first principal component obtained with the RBF kernel]
Kernel ridge regression
• Matrix inversion trick: from X(I + XᵀX) = (I + XXᵀ)X we get

    (I + XXᵀ)⁻¹ X = X (I + XᵀX)⁻¹

  where XXᵀ is n × n and XᵀX is p × p
• Using this in ŷ_ridge = X(XᵀX + λI_p)⁻¹ Xᵀ y, we get

    ŷ_ridge = (λI_n + XXᵀ)⁻¹ XXᵀ y

  and then ŷ_ridge = (λI_n + K)⁻¹ K y
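A quick numerical check of this identity with a linear kernel (a sketch; the synthetic data and λ are arbitrary):

```python
import numpy as np

# Check that (lambda I_n + K)^{-1} K y equals the primal ridge fitted values
# when K = X X^T (linear kernel).
rng = np.random.default_rng(0)
n, p, lam = 50, 3, 0.5
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

primal = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
K = X @ X.T
dual = np.linalg.solve(lam * np.eye(n) + K, K @ y)
print(np.allclose(primal, dual))                 # True: same fitted values
```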
Kernel k-means / kernel k-nearest neighbors
These algorithms only depend on the distances between samples
• Pairwise distances can be expressed with inner products:

    ‖x_i − x_j‖² = ⟨x_i, x_i⟩ − 2⟨x_i, x_j⟩ + ⟨x_j, x_j⟩

• k-means is therefore kernelizable; the "kernel distance" is

    d(x_i, x_j)² = k(x_i, x_i) − 2k(x_i, x_j) + k(x_j, x_j)
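A small sketch computing these kernel distances from a kernel matrix alone (the function name and the linear-kernel check are mine):

```python
import numpy as np

# Pairwise "kernel distances" from a kernel matrix, ready for k-NN or k-means.
def kernel_distances(K):
    diag = np.diag(K)
    return np.sqrt(np.maximum(diag[:, None] - 2 * K + diag[None, :], 0))

X = np.random.default_rng(0).normal(size=(5, 3))
D = kernel_distances(X @ X.T)                    # linear kernel: usual Euclidean distances
print(np.allclose(D[0, 1], np.linalg.norm(X[0] - X[1])))   # True
```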