Topic 6: KERNEL METHODS
STAT 37710/CAAM 37710/CMSC 35400 Machine Learning
Risi Kondor, The University of Chicago
General form of kernel methods
k : X × X → R is a PSD kernel and Hk is the associated RKHS. Kernel
methods (a.k.a. Hilbert space learning algorithms) solve the regularized risk
minimization (RRM) problem
$$\hat f \;=\; \underset{f\in\mathcal{H}_k}{\operatorname{argmin}}\;\; \underbrace{\frac{1}{m}\sum_{i=1}^{m}\ell\big(f(x_i),\,y_i\big)}_{\text{training error}} \;+\; \underbrace{\Omega\big(\|f\|_{\mathcal{H}_k}\big)}_{\text{regularizer}}\,.$$
• Ω can be any increasing function R+ → R+ .
• The final hypothesis is ĥ(x) = σ(f̂(x)). For example, in the SVM, simply ĥ(x) = sgn(f̂(x)).
• ℓ is the surrogate loss. For example, in classification, the hinge loss is
a surrogate for the zero-one loss.
Instead of actually having to search over a function space, all such problems
reduce to an m-dimensional optimization thanks to the reproducing property
f(x) = ⟨f, kx⟩ and the Representer Theorem.
The modularity of kernel methods
Regularized risk minimization in RKHSs is a powerful paradigm because it
has two distinct moving parts:
• The loss
◦ Reflects the nature of the problem (classification/regression/ranking/...).
◦ Determines exactly what type of optimization problem we end up with.
• The kernel
◦ Regulates overfitting by determining the regularization term.
◦ Reflects our prior knowledge about the problem.
We can dream up virtually any kernel machine and solve it efficiently (see the sketch below) as long as
1. the loss only involves function evaluations f(x) = ⟨f, kx⟩ at the data points;
2. the regularizer is an increasing function of ∥f∥F.
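To make this reduction concrete, here is a minimal Python sketch of the resulting m-dimensional objective. It assumes the Gram matrix K has already been computed and uses Ω(t) = λt² as an illustrative regularizer; the function names are ours, not from the lecture.

import numpy as np

def rrm_objective(alpha, K, y, loss, lam):
    """Regularized risk as a function of the m coefficients alpha.

    By the Representer Theorem f = sum_i alpha_i k(x_i, .), so the
    values at the training points are f(x_i) = [K alpha]_i and the
    RKHS norm satisfies ||f||^2 = alpha^T K alpha.
    """
    f_vals = K @ alpha                      # (f(x_1), ..., f(x_m))
    train_err = np.mean(loss(f_vals, y))    # (1/m) sum_i l(f(x_i), y_i)
    reg = lam * (alpha @ K @ alpha)         # Omega(||f||) with Omega(t) = lam * t^2
    return train_err + reg

# Example loss: squared error, which gives the kernel ridge regression objective.
squared_loss = lambda f, y: (f - y) ** 2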
Loss functions for regression
1. The kernel perceptron
The vanilla perceptron
w ← 0 ;
t ← 1 ;
while(true){
    if w · x_t ≥ 0 predict ŷ_t = 1 ; else predict ŷ_t = −1 ;
    if ((ŷ_t = −1) and (y_t = 1)) let w ← w + x_t ;
    if ((ŷ_t = 1) and (y_t = −1)) let w ← w − x_t ;
    t ← t + 1 ;
}
At any t, the weight vector is of the form
$$w = \sum_{i=1}^{t-1} c_i\, x_i \qquad \text{where } c_i \in \{-1, 0, +1\}.$$
The kernel perceptron
t ← 1 ;
while(true){
    if Σ_{i=1}^{t−1} c_i k(x_i, x_t) ≥ 0 predict ŷ_t = 1 ; else predict ŷ_t = −1 ;
    c_t ← 0 ;
    if ((ŷ_t = −1) and (y_t = 1)) let c_t ← 1 ;
    if ((ŷ_t = 1) and (y_t = −1)) let c_t ← −1 ;
    t ← t + 1 ;
}
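A minimal Python sketch of one online pass of the kernel perceptron above; the function name and the RBF kernel at the end are illustrative choices, not part of the lecture.

import numpy as np

def kernel_perceptron(X, y, k):
    """One online pass of the kernel perceptron.

    X: length-m sequence of inputs, y: labels in {-1, +1},
    k: kernel function k(x, x') -> float.
    Returns coefficients c with c_i in {-1, 0, +1}; the hypothesis
    is x -> sgn(sum_i c_i k(x_i, x)).
    """
    m = len(X)
    c = np.zeros(m)
    for t in range(m):
        score = sum(c[i] * k(X[i], X[t]) for i in range(t))  # sum_{i<t} c_i k(x_i, x_t)
        y_hat = 1 if score >= 0 else -1
        if y_hat != y[t]:
            c[t] = y[t]        # on a mistake, c_t <- y_t; otherwise c_t stays 0
    return c

# Example kernel (hypothetical choice): Gaussian RBF with unit bandwidth.
rbf = lambda a, b: np.exp(-np.sum((a - b) ** 2))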
2. Kernel PCA
PCA in feature space
Recall that in ℝ^D (after centering), the first principal component is given by
$$v_1 = \underset{\|v\|=1}{\operatorname{argmax}}\; \frac{1}{m}\sum_{i=1}^{m} (x_i \cdot v)^2.$$
Clearly, v_1 lies in the span of the data, i.e., $v_1 = \sum_{i=1}^{m} \alpha_i x_i$.
Kernel analog:
$$f_1 = \underset{f\in\mathcal{F},\;\|f\|=1}{\operatorname{argmax}}\; \sum_{i=1}^{m} \langle f, \phi(x_i)\rangle^2.$$
Once again, $f_1 = \sum_{i=1}^{m} \alpha_i\, \phi(x_i)$ for some $\alpha_1, \ldots, \alpha_m \in \mathbb{R}$.
Kernel PCA
As in ℝ^D, f_1 will be the eigenfunction with the largest eigenvalue of the sample covariance operator
$$\Sigma(f) = \frac{1}{m}\sum_{i=1}^{m} \phi(x_i)\, \langle f, \phi(x_i)\rangle.$$
Plugging $f = \sum_{\ell=1}^{m} \alpha_\ell\, \phi(x_\ell)$ into the eigenvalue equation $\Sigma(f) = \lambda f$ and taking the inner product with an arbitrary $\phi(x_j)$:
$$\frac{1}{m}\sum_{i=1}^{m}\sum_{\ell=1}^{m} \langle \phi(x_j), \phi(x_i)\rangle\, \langle \phi(x_i), \phi(x_\ell)\rangle\, \alpha_\ell \;=\; \lambda \sum_{\ell=1}^{m} \langle \phi(x_j), \phi(x_\ell)\rangle\, \alpha_\ell.$$
Using $\langle \phi(x_j), \phi(x_i)\rangle = k(x_i, x_j)$ and letting $K$ be the Gram matrix,
$$K^2\alpha = m\lambda\, K\alpha \;\Longrightarrow\; K\alpha = m\lambda\, \alpha,$$
so kernel PCA reduces to just finding the first eigenvector of the Gram matrix!
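A minimal Python sketch of this computation, assuming the Gram matrix K is supplied. Centring in feature space (which the derivation above takes for granted) is done on K, and the eigenvectors are rescaled so that each principal function satisfies ∥f∥ = 1, as in the constraint.

import numpy as np

def kernel_pca(K, n_components=1):
    """Kernel PCA directly from the Gram matrix K."""
    m = K.shape[0]
    # Centre the Gram matrix: K_c = (I - 11^T/m) K (I - 11^T/m)
    J = np.eye(m) - np.ones((m, m)) / m
    Kc = J @ K @ J
    # Eigenvectors of Kc solve Kc alpha = m*lambda*alpha (eigh returns ascending order)
    eigvals, eigvecs = np.linalg.eigh(Kc)
    idx = np.argsort(eigvals)[::-1][:n_components]
    alphas, lams = eigvecs[:, idx], eigvals[idx] / m
    # Rescale so each f = sum_l alpha_l phi(x_l) has ||f||^2 = alpha^T Kc alpha = 1
    alphas = alphas / np.sqrt(np.maximum(m * lams, 1e-12))
    # Coordinates of the training points along each principal function
    projections = Kc @ alphas
    return alphas, projections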
3. Ridge Regression
Ridge Regression
Using the squared error loss and setting λ = m/(2C),
$$\hat f = \underset{f\in\mathcal{H}_k}{\operatorname{argmin}}\; \underbrace{\sum_{i=1}^{m} \big(f(x_i) - y_i\big)^2 + \lambda\, \|f\|^2_{\mathcal{H}_k}}_{R[f]}\,.$$
By the Representer Theorem, $f(x) = \sum_{i=1}^{m} \alpha_i\, k(x_i, x)$, so
$$R[f] = \sum_{i=1}^{m}\Big(\sum_{j=1}^{m} \alpha_j\, k(x_i, x_j) - y_i\Big)^2 + \lambda \sum_{i=1}^{m}\sum_{j=1}^{m} \alpha_i\, \alpha_j\, k(x_i, x_j).$$
Ridge Regression
Letting $y = (y_1, \ldots, y_m)^\top$, $\alpha = (\alpha_1, \ldots, \alpha_m)^\top$ and $K_{i,j} = k(x_i, x_j)$,
$$R(\alpha) = \|K\alpha - y\|^2 + \lambda\, \alpha^\top K \alpha.$$
At the optimum,
$$\frac{\partial R(\alpha)}{\partial \alpha_i} = 2\,[K(K\alpha - y)]_i + 2\lambda\,[K\alpha]_i = 0,$$
so
$$K(K\alpha - y) + \lambda K\alpha = 0 \;\Longrightarrow\; \alpha = (K + \lambda I)^{-1} y.$$
Ridge Regression
Defining $[\mathbf{k}_x]_i = k(x_i, x)$, the final solution is
$$\hat f(x) = \mathbf{k}_x^\top (K + \lambda I)^{-1} y.$$
• In this case RRM reduced to just inverting a matrix.
• In fact, this is kernel ridge regression: the kernelized version of classical ridge regression from statistics, and the simplest non-linear regression/interpolation method possible.
• Kernel ridge regression gives the same predictor as the MAP (posterior mean) estimate of a Gaussian process with mean zero, covariance function k, and noise variance σ² = λ.
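A minimal Python sketch of the closed-form solution above; the kernel k and its Gram matrix K are assumed to be supplied by the user.

import numpy as np

def kernel_ridge_fit(K, y, lam):
    """Solve (K + lam*I) alpha = y for the ridge coefficients."""
    m = K.shape[0]
    return np.linalg.solve(K + lam * np.eye(m), y)

def kernel_ridge_predict(alpha, k, X_train, x):
    """Evaluate f_hat(x) = k_x^T alpha = sum_i alpha_i k(x_i, x)."""
    k_x = np.array([k(xi, x) for xi in X_train])
    return k_x @ alpha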
4. Gaussian Processes
Bayesian nonparametric regression
The canonical regression problem: learn a function f : Rd → R from a
training set D = {(x1 , y1 ) , (x2 , y2 ) , . . . , (xm , ym )} .
The Bayesian way:
1. Assume that f ∼ p0 (f ) for some appropriate prior p0
2. Assume that yi ∼ p(yi |f (xi )) for some distribution p
3. Use Bayes’ rule
$$p(f\,|\,\mathcal{D}) = \frac{p(\mathcal{D}\,|\,f)\; p_0(f)}{\int p(\mathcal{D}\,|\,f')\; p_0(f')\, df'}$$
with $p(\mathcal{D}\,|\,f) = \prod_{i=1}^{m} p(y_i\,|\,f(x_i))$.
A prior over functions
The prior p0 should capture that f is expected to be smooth.
Question: But how does one define a distribution over functions?
A prior over functions
IDEA: Assuming that the training points $\{x_i\}_{i=1}^{m}$ and test points $\{x'_i\}_{i=1}^{p}$ are known, just focus on the marginals
$$p_0\big(f(x_1), \ldots, f(x_m), f(x'_1), \ldots, f(x'_p)\big)$$
$$p\big(f(x_1), \ldots, f(x_m), f(x'_1), \ldots, f(x'_p)\,\big|\,\mathcal{D}\big).$$
A stochastic process is a distribution over functions, usually defined by
specifying all possible finite dimensional marginals. → Bayesian
nonparametrics
Gaussian Processes
Given any (suitably smooth) µ : X → R and a p.s.d. k : X × X → R,
GP (µ, k) is a distribution over functions f : X → R such that for any
x1 , . . . , xm ∈ X , if f ∼ GP (µ, k), then
(f (x1 ), . . . , f (xm ))⊤ ∼ N (µ, Σ)
where µi = µ(xi ) and Σi,j = k(xi , xj ).
µ and k are called the mean and covariance functions of the GP since
E[f (x)] = µ(x)
Cov(f (x), f (x′ )) = k(x, x′ )
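Since the GP is specified entirely through these finite dimensional marginals, drawing sample paths at a finite set of inputs takes a single multivariate Gaussian draw. A minimal Python sketch; the zero mean and squared-exponential covariance at the end are illustrative choices only.

import numpy as np

def sample_gp_prior(xs, mu, k, n_samples=3, jitter=1e-9):
    """Draw sample paths of f ~ GP(mu, k) at the inputs xs."""
    m = len(xs)
    mu_vec = np.array([mu(x) for x in xs])
    Sigma = np.array([[k(xi, xj) for xj in xs] for xi in xs])
    Sigma += jitter * np.eye(m)   # small numerical stabiliser
    # (f(x_1), ..., f(x_m)) ~ N(mu_vec, Sigma), exactly as in the definition above
    return np.random.multivariate_normal(mu_vec, Sigma, size=n_samples)

# Example: zero mean, squared-exponential covariance (hypothetical choices).
xs = np.linspace(0.0, 1.0, 50)
paths = sample_gp_prior(xs, mu=lambda x: 0.0,
                        k=lambda a, b: np.exp(-(a - b) ** 2 / 0.1))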
Gaussian Processes
Assume for simplicity that $y \sim \mathcal{N}(f(x), \sigma^2)$. Then, after observing $\{(x_1, y_1), \ldots, (x_m, y_m)\}$,
$$\mathbb{E}[f(x)\,|\,\mathcal{D}] = \mathbf{k}_x^\top (K + \sigma^2 I)^{-1} y$$
$$\mathrm{Var}[f(x)\,|\,\mathcal{D}] = \kappa_x - \mathbf{k}_x^\top (K + \sigma^2 I)^{-1} \mathbf{k}_x$$
where $y = (y_1, \ldots, y_m)^\top$, $K_{i,j} = k(x_i, x_j)$, $[\mathbf{k}_x]_i = k(x_i, x)$, and $\kappa_x = k(x, x)$.
→ GPs are very easy to use because the marginals and conditionals of Gaussians are also Gaussian.
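A minimal Python sketch of these two formulas at a single test point x; the kernel k, the noise variance σ², and the training data are assumed given.

import numpy as np

def gp_posterior(X_train, y, k, sigma2, x):
    """Posterior mean and variance of f(x) under GP(0, k) with Gaussian noise."""
    m = len(X_train)
    K = np.array([[k(xi, xj) for xj in X_train] for xi in X_train])
    k_x = np.array([k(xi, x) for xi in X_train])
    kappa_x = k(x, x)
    # Solve (K + sigma^2 I) A = [y, k_x] once and reuse both columns.
    A = np.linalg.solve(K + sigma2 * np.eye(m), np.column_stack([y, k_x]))
    mean = k_x @ A[:, 0]            # k_x^T (K + sigma^2 I)^{-1} y
    var = kappa_x - k_x @ A[:, 1]   # kappa_x - k_x^T (K + sigma^2 I)^{-1} k_x
    return mean, var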
One-class SVM and Multiclass SVM
The one-class SVM (outlier detection)
RKHS primal form:
$$\hat f = \underset{f\in\mathcal{H}_k}{\operatorname{argmin}}\; \frac{1}{m}\sum_{i=1}^{m} \big(1 - f(x_i)\big)_{\geq 0} + \frac{1}{2C}\, \|f\|^2_{\mathcal{H}_k}\,.$$
Tries to peg f(x_i) ≥ 0 for all points x_1, . . . , x_m in the training set
→ outlier detector.
Dual form
$$\underset{\alpha_1,\ldots,\alpha_m}{\text{maximize}}\quad L(\alpha) = \sum_{i} \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i\, \alpha_j\, k(x_i, x_j)$$
$$\text{subject to}\quad 0 \leq \alpha_i \leq \frac{C}{m} \quad \forall i.$$
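Because this dual has only box constraints, it can be solved with something as simple as projected gradient ascent. A minimal Python sketch under that assumption (in practice a dedicated QP solver would typically be used):

import numpy as np

def one_class_svm_dual(K, C, n_iters=1000):
    """Maximise sum_i alpha_i - 0.5 alpha^T K alpha  s.t.  0 <= alpha_i <= C/m."""
    m = K.shape[0]
    # Step size 1/L, where L is the Lipschitz constant of the gradient.
    lr = 1.0 / max(np.linalg.eigvalsh(K).max(), 1e-12)
    alpha = np.zeros(m)
    for _ in range(n_iters):
        grad = 1.0 - K @ alpha                            # gradient of L(alpha)
        alpha = np.clip(alpha + lr * grad, 0.0, C / m)    # project onto the box
    return alpha    # the learned function is f_hat(x) = sum_i alpha_i k(x_i, x)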
The Multiclass SVM
• Defining $f_z(x) = z f(x)/2$ for $z = \pm 1$,
$$\ell_{\text{hinge}}(f(x), y) = \big(1 - y f(x)\big)_{\geq 0} = \big(1 - (f_y(x) - f_{-y}(x))\big)_{\geq 0}\,,$$
i.e., the correct answer is supposed to beat the incorrect answer by at least a margin of 1.
• This inspires the multiclass hinge loss
$$\ell\big(f_1(x), \ldots, f_k(x), y\big) = \sum_{y' \in \{1,2,\ldots,k\}\setminus\{y\}} \big(1 - (f_y(x) - f_{y'}(x))\big)_{\geq 0}\,,$$
which is the basis of the k-class SVM ($f_j(x)$ is a bit like a “score”). This is essentially the same notion of multiclass margin as in the k-class perceptron. Predict $\hat y = \operatorname{argmax}_{j\in\mathcal{Y}} f_j(x)$.
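A minimal Python sketch of the multiclass hinge loss and the argmax prediction rule, taking the score vector (f_1(x), ..., f_k(x)) as given.

import numpy as np

def multiclass_hinge_loss(scores, y):
    """Sum of (1 - (f_y(x) - f_{y'}(x)))_{>=0} over all wrong classes y'."""
    margins = 1.0 - (scores[y] - scores)   # one entry per class
    margins[y] = 0.0                       # exclude y' = y from the sum
    return np.sum(np.maximum(margins, 0.0))

def predict(scores):
    """y_hat = argmax_j f_j(x)."""
    return int(np.argmax(scores))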
RKHS form of Multiclass SVM
The loss now depends on not just $f_y(x)$, but also $f_{y'}(x)$ for all $y' \neq y$,
so the RKHS form also needs to be generalized slightly:
$$\hat f = \underset{f\in\mathcal{H}_k}{\operatorname{argmin}}\; \underbrace{\frac{1}{m}\sum_{i=1}^{m} \ell\big(f_1(x_i), f_2(x_i), \ldots, f_k(x_i), y_i\big)}_{\text{training error}} + \underbrace{\Omega\big(\|f\|_{\mathcal{H}}\big)}_{\text{regularizer}}\,.$$
The corresponding generalized Representer Theorem will say that
$$f_j(x) = \sum_{i=1}^{m} \alpha_{i,j}\, k(x_i, x)$$
for all $j \in \{1, \ldots, k\}$, so now we have many more coefficients to optimize.
Structured prediction
Multiclass to Structured Prediction
What if we combine $f_1, \ldots, f_k : \mathcal{X} \to \mathbb{R}$ in the k-class SVM into a single
function $f : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$, where $\mathcal{Y} = \{1, \ldots, k\}$? The loss becomes
$$\ell(f, x, y) = \sum_{y' \in \mathcal{Y}\setminus\{y\}} \big(1 - (f(x, y) - f(x, y'))\big)_{\geq 0}\,.$$
IDEA: Use this to search for f in a joint RKHS of functions
f: X ×Y →R:
• Kernel becomes k((x, y), (x′ , y ′ )) .
• We can now put structure on Y as well as X .
• We are now, in effect, learning a single mapping X → Y directly, via ŷ(x) = argmax_{y∈Y} f(x, y).
• At the extreme, the distinction between X and Y is blurred.
→ structured prediction
RRM form of Structured Prediction
Let k be a psd kernel $k : (\mathcal{X}\times\mathcal{Y}) \times (\mathcal{X}\times\mathcal{Y}) \to \mathbb{R}$, let $\mathcal{H}_k$ be the
corresponding RKHS, and $\Omega$ a monotonically increasing function. Solve
$$\hat f = \underset{f\in\mathcal{H}_k}{\operatorname{argmin}}\; \underbrace{\frac{1}{m}\sum_{i=1}^{m} \ell\big((f(x_i, y))_{y\in\mathcal{Y}},\, y_i\big)}_{\text{training error}} + \underbrace{\Omega\big(\|f\|_{\mathcal{H}_k}\big)}_{\text{regularizer}}\,,$$
and predict $\hat y = \operatorname{argmax}_{y\in\mathcal{Y}} \hat f(x, y)$.
The loss (in principle) now depends on $f(x_i, y)$ for all possible $y$.
Therefore, the Representer Theorem says that $f$ is of the form
$$f(x, y) = \sum_{i=1}^{m}\sum_{y^* \in \mathcal{Y}} \alpha_{i,y^*}\, k\big((x_i, y^*), (x, y)\big).$$
In practice this is usually infeasible, so one only adds $\alpha_{i,y^*}$ coefficients to the
optimization on the fly, “as needed”.
Kernels for Structured Learning
The simplest way to get a kernel k : (X × Y) × (X × Y) → R :
• Get a kernel kX that quantifies similarity between the x ’s.
• Get a kernel kY that quantifies similarity between the y ’s.
• Define
k((x, y), (x′ , y ′ )) = kX (x, x′ ) · kY (y, y ′ ).
Question: Is this a valid kernel? What is its RKHS?
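As a quick numerical illustration (not an answer to the question), the sketch below builds the product kernel from an RBF kernel on X and a delta kernel on Y, both hypothetical choices, and checks that the Gram matrix on a random sample of (x, y) pairs has no significantly negative eigenvalues.

import numpy as np

def product_kernel(kX, kY):
    """k((x, y), (x', y')) = kX(x, x') * kY(y, y')."""
    return lambda a, b: kX(a[0], b[0]) * kY(a[1], b[1])

rng = np.random.default_rng(0)
kX = lambda x, xp: np.exp(-np.sum((x - xp) ** 2))   # RBF kernel on X
kY = lambda y, yp: float(y == yp)                   # delta kernel on Y
k = product_kernel(kX, kY)

# Gram matrix on 20 random (x, y) pairs; PSD-ness shows up as a non-negative spectrum.
pairs = [(rng.normal(size=3), int(rng.integers(0, 4))) for _ in range(20)]
K = np.array([[k(a, b) for b in pairs] for a in pairs])
print(np.linalg.eigvalsh(K).min() >= -1e-10)        # expect: True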