
Topic 6: KERNEL METHODS

STAT 37710/CAAM 37710/CMSC 35400 Machine Learning


Risi Kondor, The University of Chicago
General form of kernel methods
k : X × X → R is a PSD kernel and Hk is the associated RKHS. Kernel
methods (aka. Hilbert space learning algorithms) solve the RRM problem
 
$$\hat{f} = \underset{f \in \mathcal{H}_k}{\operatorname{argmin}} \; \Big[ \underbrace{\frac{1}{m} \sum_{i=1}^{m} \ell\big(f(x_i), y_i\big)}_{\text{training error}} + \underbrace{\Omega(\|f\|_{\mathcal{H}_k})}_{\text{regularizer}} \Big] .$$

• Ω can be any increasing function R+ → R+ .


• The final hypothesis is ĥ(x) = σ(f̂(x)). For example, in the SVM, simply ĥ(x) = sgn(f̂(x)).
• ℓ is the surrogate loss. For example, in classification, the hinge loss is
a surrogate for the zero-one loss.
Instead of actually having to search over a function space, all such problems reduce to an m-dimensional optimization problem thanks to the reproducing property f(x) = ⟨f, kx⟩ and the Representer Theorem.

The modularity of kernel methods
Regularized risk minimization in RKHSs is a powerful paradigm because it
has distinct moving parts:
• The loss
◦ Reflects the nature of the problem (classification/regression/ranking/...).
◦ Determines exactly what type of optimization problem we end up with.
• The kernel
◦ Regulates overfitting by determining the regularization term.
◦ Reflects our prior knowledge about the problem.
Can dream up virtually any kernel machine and solve it efficiently as long as
1. The loss only involves function evaluations f (x) = ⟨f, kx ⟩ at data
points;
2. The regularizer is an increasing function of ∥f ∥F .

Loss functions for regression

1. The kernel perceptron
The vanilla perceptron
w←0;
t←1;
while(true){
if w · xt ≥ 0 predict ŷt = 1 ; else predict ŷt = −1 ;
if ((ŷt = −1) and (yt = 1)) let w ← w + xt ;
if ((ŷt = 1) and (yt = −1)) let w ← w − xt ;
t←t+1;
}

At any t , the weight vector is of the form

$$w = \sum_{i=1}^{t-1} c_i \, x_i \qquad \text{where } c_i \in \{-1, 0, +1\} .$$

6
6/32
/32
The kernel perceptron

t←1;
while(1){
if Σ_{i=1}^{t−1} ci k(xi, xt) ≥ 0 predict ŷt = 1 ; else predict ŷt = −1 ;
ct ← 0 ;
if ((ŷt = −1) and (yt = 1)) let ct = 1 ;
if ((ŷt = 1) and (yt = −1)) let ct = −1 ;
t←t+1;
}
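
Below is a minimal Python sketch of the kernel perceptron pseudocode above (not from the slides). It assumes the examples arrive as sequences xs, ys with labels in {−1, +1}; the RBF kernel is just one illustrative choice of PSD kernel, and all function names are ours.

import numpy as np

def rbf(x, xp, gamma=1.0):
    # Gaussian (RBF) kernel: one illustrative PSD kernel
    return float(np.exp(-gamma * np.sum((np.asarray(x) - np.asarray(xp)) ** 2)))

def kernel_perceptron(xs, ys, k=rbf):
    # One online pass over the data; returns the coefficients c_i in {-1, 0, +1}
    c = []
    for t, (x_t, y_t) in enumerate(zip(xs, ys)):
        score = sum(c[i] * k(xs[i], x_t) for i in range(t))  # sum_{i<t} c_i k(x_i, x_t)
        y_hat = 1 if score >= 0 else -1
        c.append(y_t if y_hat != y_t else 0)                 # nonzero coefficient only on mistakes
    return c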

2. Kernel PCA
PCA in feature space

Recall that in R^D (after centering), the first principal component is given by

$$v_1 = \underset{\|v\|=1}{\operatorname{argmax}} \; \frac{1}{m} \sum_{i=1}^{m} (x_i \cdot v)^2 .$$

Clearly, v1 lies in the span of the data, i.e., v1 = Σ_{i=1}^m αi xi.

Kernel analog:

$$f_1 = \underset{f \in \mathcal{F},\, \|f\|=1}{\operatorname{argmax}} \; \sum_{i=1}^{m} \langle f, \phi(x_i) \rangle^2 .$$

Once again, f = Σ_{i=1}^m αi ϕ(xi) for some α1, . . . , αm ∈ R.

Kernel PCA
As in R^D, f will be the eigenvector with the largest eigenvalue of the sample covariance operator

$$\Sigma(f) = \frac{1}{m} \sum_{i=1}^{m} \phi(x_i) \, \langle f, \phi(x_i) \rangle .$$

Plugging in f = Σ_{ℓ=1}^m αℓ ϕ(xℓ) and taking the inner product with any ϕ(xj):

$$\frac{1}{m} \sum_{i=1}^{m} \sum_{\ell=1}^{m} \langle \phi(x_j), \phi(x_i) \rangle \, \langle \phi(x_i), \phi(x_\ell) \rangle \, \alpha_\ell \;=\; \lambda \sum_{\ell=1}^{m} \langle \phi(x_j), \phi(x_\ell) \rangle \, \alpha_\ell .$$

Using ⟨ϕ(xj), ϕ(xi)⟩ = k(xi, xj) and letting K be the Gram matrix,

$$K^2 \alpha = m\lambda K\alpha \;\Longrightarrow\; K\alpha = m\lambda \alpha ,$$

so kernel PCA reduces to just finding the first eigenvector of the Gram matrix!
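
As a concrete illustration (not part of the slides), here is a minimal Python sketch of this reduction: eigendecompose the centered Gram matrix and read off the expansion coefficients α of the first kernel principal component. The double-centering and the unit-norm rescaling are standard kernel PCA bookkeeping, under the assumption that the slide's 'after centering' is carried out on K itself.

import numpy as np

def kernel_pca_top_component(K):
    # K: m x m Gram matrix. Returns (lambda, alpha) for the first kernel
    # principal component f_1(x) = sum_i alpha_i k(x_i, x).
    m = K.shape[0]
    one = np.full((m, m), 1.0 / m)
    Kc = K - one @ K - K @ one + one @ K @ one    # center the data in feature space
    eigvals, eigvecs = np.linalg.eigh(Kc)         # eigenvalues in ascending order
    lam, alpha = eigvals[-1] / m, eigvecs[:, -1]  # Kc alpha = m * lambda * alpha
    if lam > 0:
        alpha = alpha / np.sqrt(m * lam)          # rescale so alpha^T Kc alpha = 1 (unit RKHS norm)
    return lam, alpha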

3. Ridge Regression
Ridge Regression

Using squared error loss and setting λ = m/(2C),


$$\hat{f} = \underset{f \in \mathcal{H}_k}{\operatorname{argmin}} \; \underbrace{\sum_{i=1}^{m} \big(f(x_i) - y_i\big)^2 + \lambda \, \|f\|_{\mathcal{H}_k}^2}_{R[f]} \, .$$

By the Representer Theorem, f(x) = Σ_{i=1}^m αi k(xi, x), so

$$R[f] = \sum_{i=1}^{m} \Big( \sum_{j=1}^{m} \alpha_j k(x_i, x_j) - y_i \Big)^{\!2} + \lambda \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j \, k(x_i, x_j) .$$

Ridge Regression

Letting y = (y1, . . . , ym)⊤, α = (α1, . . . , αm)⊤ and Ki,j = k(xi, xj),

$$R(\alpha) = \| K\alpha - y \|^2 + \lambda \, \alpha^\top K \alpha .$$

At the optimum,

$$\frac{\partial R(\alpha)}{\partial \alpha_i} = 2\,[K(K\alpha - y)]_i + 2\lambda \, [K\alpha]_i = 0 ,$$

so

$$K(K\alpha - y) + \lambda K\alpha = 0 \;\Longrightarrow\; \alpha = (K + \lambda I)^{-1} y .$$

Ridge Regression

Defining [kx]i = k(xi, x), the final solution is

$$\hat{f}(x) = k_x^\top (K + \lambda I)^{-1} y .$$

• In this case RRM reduced to just inverting a matrix.


• In fact, this is just ridge regression, which is a classical method in
statistics, and the simplest non-linear regression/interpolation method
possible.
• Ridge regression is the same as the MAP estimate of a Gaussian process with mean zero and covariance function k.
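
A minimal Python sketch of the closed-form solution above (not from the slides); k is any kernel function, X_train and y are the training data, and lam is the regularization constant λ. All names are ours.

import numpy as np

def fit_kernel_ridge(X_train, y, k, lam):
    # alpha = (K + lam * I)^{-1} y ; solve the linear system rather than invert
    m = len(X_train)
    K = np.array([[k(xi, xj) for xj in X_train] for xi in X_train])
    return np.linalg.solve(K + lam * np.eye(m), np.asarray(y, dtype=float))

def predict_kernel_ridge(x, X_train, alpha, k):
    # f_hat(x) = k_x^T alpha = sum_i alpha_i k(x_i, x)
    return float(sum(a * k(xi, x) for a, xi in zip(alpha, X_train)))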

4. Gaussian Processes
Bayesian nonparametric regression

The canonical regression problem: learn a function f : R^d → R from a training set D = {(x1, y1), (x2, y2), . . . , (xm, ym)}.

The Bayesian way:


1. Assume that f ∼ p0 (f ) for some appropriate prior p0
2. Assume that yi ∼ p(yi |f (xi )) for some distribution p
3. Use Bayes’ rule
$$p(f \,|\, \mathcal{D}) = \frac{p(\mathcal{D} \,|\, f) \; p_0(f)}{\int_{f'} p(\mathcal{D} \,|\, f') \; p_0(f') \, df'}$$

with p(D|f) = Π_{i=1}^m p(yi | f(xi)).

A prior over functions
The prior p0 should capture that f is expected to be smooth.

Question: But how does one define a distribution over functions?

A prior over functions

IDEA: Assuming that the training points {xi}_{i=1}^m and testing points {x′i}_{i=1}^p are known, just focus on the marginals

$$p_0\big(f(x_1), \ldots, f(x_m), f(x'_1), \ldots, f(x'_p)\big)$$

and

$$p\big(f(x_1), \ldots, f(x_m), f(x'_1), \ldots, f(x'_p) \,\big|\, \mathcal{D}\big).$$

A stochastic process is a distribution over functions, usually defined by specifying all possible finite-dimensional marginals. → Bayesian nonparametrics

Gaussian Processes

Given any (suitably smooth) µ : X → R and a p.s.d. k : X × X → R, GP(µ, k) is a distribution over functions f : X → R such that for any x1, . . . , xm ∈ X, if f ∼ GP(µ, k), then

$$(f(x_1), \ldots, f(x_m))^\top \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)$$

where µi = µ(xi) and Σi,j = k(xi, xj).

µ and k are called the mean and covariance functions of the GP since

$$\mathbb{E}[f(x)] = \mu(x)$$

$$\operatorname{Cov}\big(f(x), f(x')\big) = k(x, x') .$$

Gaussian Processes

Assume for simplicity that y ∼ N(f(x), σ²). Then, after observing {(x1, y1), . . . , (xm, ym)},

$$\mathbb{E}[f(x)] = k_x^\top (K + \sigma^2 I)^{-1} y$$

$$\operatorname{Var}(f(x)) = \kappa_x - k_x^\top (K + \sigma^2 I)^{-1} k_x$$

where y = (y1, . . . , ym)⊤, Ki,j = k(xi, xj), [kx]i = k(xi, x), and κx = k(x, x).

→ GPs are very easy to use because the marginals and conditionals of Gaussians are also Gaussian.
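
A minimal Python sketch of these posterior formulas (not from the slides), assuming k is the covariance function and sigma2 the noise variance σ²; names are ours.

import numpy as np

def gp_posterior(x, X_train, y, k, sigma2):
    # Posterior mean and variance of f(x) under GP(0, k) with Gaussian noise
    m = len(X_train)
    K = np.array([[k(xi, xj) for xj in X_train] for xi in X_train])
    k_x = np.array([k(xi, x) for xi in X_train])
    A = K + sigma2 * np.eye(m)
    mean = k_x @ np.linalg.solve(A, np.asarray(y, dtype=float))  # k_x^T (K + s^2 I)^{-1} y
    var = k(x, x) - k_x @ np.linalg.solve(A, k_x)                # kappa_x - k_x^T (K + s^2 I)^{-1} k_x
    return mean, var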

One-class SVM and Multiclass SVM
The one-class SVM (outlier detection)
RKHS primal form

$$\hat{f} = \underset{f \in \mathcal{H}_k}{\operatorname{argmin}} \; \Big[ \frac{1}{m} \sum_{i=1}^{m} \big(1 - f(x_i)\big)_{\geq 0} + \frac{1}{2C} \, \|f\|_{\mathcal{H}_k}^2 \Big] .$$

Tries to peg f(xi) ≥ 0 for all points x1, . . . , xm in the training set → outlier detector.

Dual form
$$\underset{\alpha_1, \ldots, \alpha_m}{\text{maximize}} \quad L(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j \, k(x_i, x_j)$$

$$\text{subject to} \quad 0 \leq \alpha_i \leq \frac{C}{m} \quad \forall i$$
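
One way to solve this small box-constrained QP numerically is with an off-the-shelf optimizer. The sketch below (not from the slides) minimizes the negated dual with scipy's L-BFGS-B, purely as a toy-scale illustration; dedicated SVM solvers are what one would use in practice.

import numpy as np
from scipy.optimize import minimize

def one_class_svm_dual(K, C):
    # Maximize L(alpha) subject to 0 <= alpha_i <= C/m by minimizing -L(alpha)
    m = K.shape[0]
    def neg_dual(a):
        return -(a.sum() - 0.5 * a @ K @ a)
    def neg_grad(a):
        return K @ a - np.ones(m)            # gradient of -L(alpha)
    res = minimize(neg_dual, x0=np.zeros(m), jac=neg_grad,
                   bounds=[(0.0, C / m)] * m, method="L-BFGS-B")
    return res.x                              # dual coefficients alpha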
The Multiclass SVM
• Defining fz(x) = z f(x)/2 for z = ±1,

$$\ell_{\text{hinge}}(f(x), y) = \big(1 - y f(x)\big)_{\geq 0} = \big(1 - (f_y(x) - f_{-y}(x))\big)_{\geq 0} ,$$

i.e., the correct answer is supposed to beat the incorrect answer by a margin of at least 1.

• This inspires the multiclass hinge loss

$$\ell(f_1(x), \ldots, f_k(x), y) = \sum_{y' \in \{1, 2, \ldots, k\} \setminus \{y\}} \big(1 - (f_y(x) - f_{y'}(x))\big)_{\geq 0} ,$$

which is the basis of the k-class SVM (fj(x) is a bit like a “score”). This is essentially the same notion of multiclass margin as in the k-class perceptron. Predict ŷ = argmax_{j∈Y} fj(x).
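
A minimal Python sketch of the multiclass hinge loss and the argmax prediction above (not from the slides); scores holds the values f_1(x), ..., f_k(x) and y is the 0-based index of the correct class.

import numpy as np

def multiclass_hinge(scores, y):
    # sum over wrong classes y' of (1 - (f_y(x) - f_y'(x)))_{>=0}
    margins = 1.0 - (scores[y] - np.delete(scores, y))
    return float(np.maximum(margins, 0.0).sum())

def predict(scores):
    # y_hat = argmax_j f_j(x)
    return int(np.argmax(scores))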

RKHS form of Multiclass SVM
The loss now depends not just on fy(x), but also on fy′(x) for all y′ ≠ y, so the RKHS form also needs to be generalized slightly:

$$\hat{f} = \underset{f \in \mathcal{H}_k}{\operatorname{argmin}} \; \Big[ \underbrace{\frac{1}{m} \sum_{i=1}^{m} \ell\big(f_1(x_i), f_2(x_i), \ldots, f_k(x_i), y_i\big)}_{\text{training error}} + \underbrace{\Omega(\|f\|_{\mathcal{H}})}_{\text{regularizer}} \Big] .$$

The corresponding generalized Representer Theorem will say that

$$f_j(x) = \sum_{i=1}^{m} \alpha_{i,j} \, k(x_i, x)$$

for all j ∈ {1, . . . , k} , so now we have many more coefficients to optimize.

Structured prediction
Multiclass to Structured Prediction
What if we combine f1 , . . . , fk : X → R in the k -class SVM into a single
function f : X × Y → R where Y = {1, . . . , k} ? The loss becomes
$$\ell(f, x, y) = \sum_{y' \in \mathcal{Y} \setminus \{y\}} \big(1 - (f(x, y) - f(x, y'))\big)_{\geq 0} .$$

IDEA: Use this to search for f in a joint RKHS of functions f : X × Y → R :
• Kernel becomes k((x, y), (x′ , y ′ )) .
• We can now put structure on Y as well as X .
• We are now, in effect, learning a single mapping from X to Y directly.
• At the extreme, the distinction between X and Y is blurred.
→ structured prediction

RRM form of Structured Prediction
Let k be a psd kernel k : (X × Y) × (X × Y) → R , let Hk be the
corresponding RKHS, and Ω a monotonically increasing function. Solve
 
$$\hat{f} = \underset{f \in \mathcal{H}_k}{\operatorname{argmin}} \; \Big[ \underbrace{\frac{1}{m} \sum_{i=1}^{m} \ell\big((f(x_i, y))_{y \in \mathcal{Y}}, \, y_i\big)}_{\text{training error}} + \underbrace{\Omega(\|f\|_{\mathcal{F}})}_{\text{regularizer}} \Big] ,$$

and predict ŷ = argmax_{y∈Y} f̂(x, y).

The loss (in principle) now depends on f(xi, y) for all possible y. Therefore, the Representer Theorem says that f is of the form

$$f(x, y) = \sum_{i=1}^{m} \sum_{y^* \in \mathcal{Y}} \alpha_{i, y^*} \, k\big((x_i, y^*), (x, y)\big) .$$

In practice, this is usually infeasible, so the αi,y∗ coefficients are only added to the optimization on the fly, “as needed”.

Kernels for Structured Learning

The simplest way to get a kernel k : (X × Y) × (X × Y) → R :
• Get a kernel kX that quantifies similarity between the x ’s.
• Get a kernel kY that quantifies similarity between the y ’s.
• Define
k((x, y), (x′ , y ′ )) = kX (x, x′ ) · kY (y, y ′ ).
Question: Is this a valid kernel? What is its RKHS?
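
Leaving that question to the reader, here is a minimal Python sketch of the product construction itself (not from the slides); the RBF kernel on X and the delta kernel on a finite Y are purely illustrative choices of kX and kY.

import numpy as np

def k_X(x, xp, gamma=1.0):
    # RBF kernel on X (illustrative choice)
    return float(np.exp(-gamma * np.sum((np.asarray(x) - np.asarray(xp)) ** 2)))

def k_Y(y, yp):
    # delta kernel on a finite label set Y (illustrative choice)
    return 1.0 if y == yp else 0.0

def k_joint(xy, xyp):
    # k((x, y), (x', y')) = kX(x, x') * kY(y, y')
    (x, y), (xp, yp) = xy, xyp
    return k_X(x, xp) * k_Y(y, yp)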

