Topic 6: KERNEL METHODS
STAT 37710/CAAM 37710/CMSC 35400 Machine Learning
Risi Kondor, The University of Chicago
General form of kernel methods
k : X × X → R is a PSD kernel and Hk is the associated RKHS. Kernel
methods (a.k.a. Hilbert space learning algorithms) solve the regularized risk
minimization (RRM) problem
$$\hat f \;=\; \underset{f\in\mathcal{H}_k}{\operatorname{argmin}}\;\; \underbrace{\frac{1}{m}\sum_{i=1}^{m}\ell\big(f(x_i),\,y_i\big)}_{\text{training error}} \;+\; \underbrace{\Omega\big(\|f\|_{\mathcal{H}_k}\big)}_{\text{regularizer}}\,.$$
• Ω can be any increasing function R+ → R+ .
• The final hypothesis is ĥ(x) = σ(f̂(x)). For example, in the SVM, simply ĥ(x) = sgn(f̂(x)).
• ℓ is the surrogate loss. For example, in classification, the hinge loss is
a surrogate for the zero-one loss.
Instead of actually having to search over a function space, all such problems
reduce to an m-dimensional optimization thanks to the reproducing property
f(x) = ⟨f, kx⟩ and the Representer Theorem.
The modularity of kernel methods
Regularized risk minimization in RKHSs is a powerful paradigm because it
has two distinct moving parts:
• The loss
◦ Reflects the nature of the problem (classification/regression/ranking/...).
◦ Determines exactly what type of optimization problem we end up with.
• The kernel
◦ Regulates overfitting by determining the regularization term.
◦ Reflects our prior knowledge about the problem.
We can dream up virtually any kernel machine and solve it efficiently (see the sketch below) as long as
1. the loss only involves function evaluations f(x) = ⟨f, kx⟩ at the data points;
2. the regularizer is an increasing function of ∥f∥F.
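To make this reduction concrete, here is a minimal Python sketch of the resulting m-dimensional objective. It assumes the Gram matrix K has already been computed and uses Ω(t) = λt² as an illustrative regularizer; the function names are ours, not from the lecture.

import numpy as np

def rrm_objective(alpha, K, y, loss, lam):
    """Regularized risk as a function of the m coefficients alpha.

    By the Representer Theorem f = sum_i alpha_i k(x_i, .), so the
    values at the training points are f(x_i) = [K alpha]_i and the
    RKHS norm satisfies ||f||^2 = alpha^T K alpha.
    """
    f_vals = K @ alpha                      # (f(x_1), ..., f(x_m))
    train_err = np.mean(loss(f_vals, y))    # (1/m) sum_i l(f(x_i), y_i)
    reg = lam * (alpha @ K @ alpha)         # Omega(||f||) with Omega(t) = lam * t^2
    return train_err + reg

# Example loss: squared error, which gives the kernel ridge regression objective.
squared_loss = lambda f, y: (f - y) ** 2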
Loss functions for regression
1. The kernel perceptron
The vanilla perceptron
w ← 0 ;
t ← 1 ;
while(true){
    if w · x_t ≥ 0 predict ŷ_t = 1 ; else predict ŷ_t = −1 ;
    if ((ŷ_t = −1) and (y_t = 1)) let w ← w + x_t ;
    if ((ŷ_t = 1) and (y_t = −1)) let w ← w − x_t ;
    t ← t + 1 ;
}
At any t, the weight vector is of the form
$$w = \sum_{i=1}^{t-1} c_i\, x_i \qquad \text{where } c_i \in \{-1, 0, +1\}.$$
The kernel perceptron
t ← 1 ;
while(true){
    if Σ_{i=1}^{t−1} c_i k(x_i, x_t) ≥ 0 predict ŷ_t = 1 ; else predict ŷ_t = −1 ;
    c_t ← 0 ;
    if ((ŷ_t = −1) and (y_t = 1)) let c_t ← 1 ;
    if ((ŷ_t = 1) and (y_t = −1)) let c_t ← −1 ;
    t ← t + 1 ;
}
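A minimal Python sketch of one online pass of the kernel perceptron above; the function name and the RBF kernel at the end are illustrative choices, not part of the lecture.

import numpy as np

def kernel_perceptron(X, y, k):
    """One online pass of the kernel perceptron.

    X: length-m sequence of inputs, y: labels in {-1, +1},
    k: kernel function k(x, x') -> float.
    Returns coefficients c with c_i in {-1, 0, +1}; the hypothesis
    is x -> sgn(sum_i c_i k(x_i, x)).
    """
    m = len(X)
    c = np.zeros(m)
    for t in range(m):
        score = sum(c[i] * k(X[i], X[t]) for i in range(t))  # sum_{i<t} c_i k(x_i, x_t)
        y_hat = 1 if score >= 0 else -1
        if y_hat != y[t]:
            c[t] = y[t]        # on a mistake, c_t <- y_t; otherwise c_t stays 0
    return c

# Example kernel (hypothetical choice): Gaussian RBF with unit bandwidth.
rbf = lambda a, b: np.exp(-np.sum((a - b) ** 2))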
2. Kernel PCA
PCA in feature space
Recall that in ℝ^D (after centering), the first principal component is given by
$$v_1 = \underset{\|v\|=1}{\operatorname{argmax}}\; \frac{1}{m}\sum_{i=1}^{m} (x_i \cdot v)^2.$$
Clearly, v_1 lies in the span of the data, i.e., $v_1 = \sum_{i=1}^{m} \alpha_i x_i$.
Kernel analog:
$$f_1 = \underset{f\in\mathcal{F},\;\|f\|=1}{\operatorname{argmax}}\; \sum_{i=1}^{m} \langle f, \phi(x_i)\rangle^2.$$
Once again, $f_1 = \sum_{i=1}^{m} \alpha_i\, \phi(x_i)$ for some $\alpha_1, \ldots, \alpha_m \in \mathbb{R}$.
Kernel PCA
As in ℝ^D, f_1 will be the eigenfunction with the largest eigenvalue of the sample covariance operator
$$\Sigma(f) = \frac{1}{m}\sum_{i=1}^{m} \phi(x_i)\, \langle f, \phi(x_i)\rangle.$$
Plugging $f = \sum_{\ell=1}^{m} \alpha_\ell\, \phi(x_\ell)$ into the eigenvalue equation $\Sigma(f) = \lambda f$ and taking the inner product with an arbitrary $\phi(x_j)$:
$$\frac{1}{m}\sum_{i=1}^{m}\sum_{\ell=1}^{m} \langle \phi(x_j), \phi(x_i)\rangle\, \langle \phi(x_i), \phi(x_\ell)\rangle\, \alpha_\ell \;=\; \lambda \sum_{\ell=1}^{m} \langle \phi(x_j), \phi(x_\ell)\rangle\, \alpha_\ell.$$
Using $\langle \phi(x_j), \phi(x_i)\rangle = k(x_i, x_j)$ and letting $K$ be the Gram matrix,
$$K^2\alpha = m\lambda\, K\alpha \;\Longrightarrow\; K\alpha = m\lambda\, \alpha,$$
so kernel PCA reduces to just finding the first eigenvector of the Gram matrix!
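A minimal Python sketch of this computation, assuming the Gram matrix K is supplied. Centring in feature space (which the derivation above takes for granted) is done on K, and the eigenvectors are rescaled so that each principal function satisfies ∥f∥ = 1, as in the constraint.

import numpy as np

def kernel_pca(K, n_components=1):
    """Kernel PCA directly from the Gram matrix K."""
    m = K.shape[0]
    # Centre the Gram matrix: K_c = (I - 11^T/m) K (I - 11^T/m)
    J = np.eye(m) - np.ones((m, m)) / m
    Kc = J @ K @ J
    # Eigenvectors of Kc solve Kc alpha = m*lambda*alpha (eigh returns ascending order)
    eigvals, eigvecs = np.linalg.eigh(Kc)
    idx = np.argsort(eigvals)[::-1][:n_components]
    alphas, lams = eigvecs[:, idx], eigvals[idx] / m
    # Rescale so each f = sum_l alpha_l phi(x_l) has ||f||^2 = alpha^T Kc alpha = 1
    alphas = alphas / np.sqrt(np.maximum(m * lams, 1e-12))
    # Coordinates of the training points along each principal function
    projections = Kc @ alphas
    return alphas, projections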
3. Ridge Regression
Ridge Regression
Using the squared error loss and setting λ = m/(2C),
$$\hat f = \underset{f\in\mathcal{H}_k}{\operatorname{argmin}}\; \underbrace{\sum_{i=1}^{m} \big(f(x_i) - y_i\big)^2 + \lambda\, \|f\|^2_{\mathcal{H}_k}}_{R[f]}\,.$$
By the Representer Theorem, $f(x) = \sum_{i=1}^{m} \alpha_i\, k(x_i, x)$, so
$$R[f] = \sum_{i=1}^{m}\Big(\sum_{j=1}^{m} \alpha_j\, k(x_i, x_j) - y_i\Big)^2 + \lambda \sum_{i=1}^{m}\sum_{j=1}^{m} \alpha_i\, \alpha_j\, k(x_i, x_j).$$
Ridge Regression
Letting $y = (y_1, \ldots, y_m)^\top$, $\alpha = (\alpha_1, \ldots, \alpha_m)^\top$ and $K_{i,j} = k(x_i, x_j)$,
$$R(\alpha) = \|K\alpha - y\|^2 + \lambda\, \alpha^\top K \alpha.$$
At the optimum,
$$\frac{\partial R(\alpha)}{\partial \alpha_i} = 2\,[K(K\alpha - y)]_i + 2\lambda\,[K\alpha]_i = 0,$$
so
$$K(K\alpha - y) + \lambda K\alpha = 0 \;\Longrightarrow\; \alpha = (K + \lambda I)^{-1} y.$$
Ridge Regression
Defining $[\mathbf{k}_x]_i = k(x_i, x)$, the final solution is
$$\hat f(x) = \mathbf{k}_x^\top (K + \lambda I)^{-1} y.$$
• In this case RRM reduced to just inverting a matrix.
• In fact, this is kernel ridge regression: the kernelized version of classical ridge regression from statistics, and the simplest non-linear regression/interpolation method possible.
• Kernel ridge regression gives the same predictor as the MAP (posterior mean) estimate of a Gaussian process with mean zero, covariance function k, and noise variance σ² = λ.
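A minimal Python sketch of the closed-form solution above; the kernel k and its Gram matrix K are assumed to be supplied by the user.

import numpy as np

def kernel_ridge_fit(K, y, lam):
    """Solve (K + lam*I) alpha = y for the ridge coefficients."""
    m = K.shape[0]
    return np.linalg.solve(K + lam * np.eye(m), y)

def kernel_ridge_predict(alpha, k, X_train, x):
    """Evaluate f_hat(x) = k_x^T alpha = sum_i alpha_i k(x_i, x)."""
    k_x = np.array([k(xi, x) for xi in X_train])
    return k_x @ alpha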
4. Gaussian Processes
Bayesian nonparametric regression
The canonical regression problem: learn a function f : Rd → R from a
training set D = {(x1 , y1 ) , (x2 , y2 ) , . . . , (xm , ym )} .
The Bayesian way:
1. Assume that f ∼ p0 (f ) for some appropriate prior p0
2. Assume that yi ∼ p(yi |f (xi )) for some distribution p
3. Use Bayes’ rule
$$p(f\,|\,\mathcal{D}) = \frac{p(\mathcal{D}\,|\,f)\; p_0(f)}{\int p(\mathcal{D}\,|\,f')\; p_0(f')\, df'}$$
with $p(\mathcal{D}\,|\,f) = \prod_{i=1}^{m} p(y_i\,|\,f(x_i))$.
A prior over functions
The prior p0 should capture that f is expected to be smooth.
Question: But how does one define a distribution over functions?
A prior over functions
IDEA: Assuming that the training points $\{x_i\}_{i=1}^{m}$ and test points $\{x'_i\}_{i=1}^{p}$ are known, just focus on the marginals
$$p_0\big(f(x_1), \ldots, f(x_m), f(x'_1), \ldots, f(x'_p)\big)$$
$$p\big(f(x_1), \ldots, f(x_m), f(x'_1), \ldots, f(x'_p)\,\big|\,\mathcal{D}\big).$$
A stochastic process is a distribution over functions, usually defined by
specifying all possible finite dimensional marginals. → Bayesian
nonparametrics
Gaussian Processes
Given any (suitably smooth) µ : X → R and a p.s.d. k : X × X → R,
GP (µ, k) is a distribution over functions f : X → R such that for any
x1 , . . . , xm ∈ X , if f ∼ GP (µ, k), then
(f (x1 ), . . . , f (xm ))⊤ ∼ N (µ, Σ)
where µi = µ(xi ) and Σi,j = k(xi , xj ).
µ and k are called the mean and covariance functions of the GP since
E[f (x)] = µ(x)
Cov(f (x), f (x′ )) = k(x, x′ )
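Since the GP is specified entirely through these finite dimensional marginals, drawing sample paths at a finite set of inputs takes a single multivariate Gaussian draw. A minimal Python sketch; the zero mean and squared-exponential covariance at the end are illustrative choices only.

import numpy as np

def sample_gp_prior(xs, mu, k, n_samples=3, jitter=1e-9):
    """Draw sample paths of f ~ GP(mu, k) at the inputs xs."""
    m = len(xs)
    mu_vec = np.array([mu(x) for x in xs])
    Sigma = np.array([[k(xi, xj) for xj in xs] for xi in xs])
    Sigma += jitter * np.eye(m)   # small numerical stabiliser
    # (f(x_1), ..., f(x_m)) ~ N(mu_vec, Sigma), exactly as in the definition above
    return np.random.multivariate_normal(mu_vec, Sigma, size=n_samples)

# Example: zero mean, squared-exponential covariance (hypothetical choices).
xs = np.linspace(0.0, 1.0, 50)
paths = sample_gp_prior(xs, mu=lambda x: 0.0,
                        k=lambda a, b: np.exp(-(a - b) ** 2 / 0.1))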
Gaussian Processes
Assume for simplicity that $y \sim \mathcal{N}(f(x), \sigma^2)$. Then, after observing $\{(x_1, y_1), \ldots, (x_m, y_m)\}$,
$$\mathbb{E}[f(x)\,|\,\mathcal{D}] = \mathbf{k}_x^\top (K + \sigma^2 I)^{-1} y$$
$$\mathrm{Var}[f(x)\,|\,\mathcal{D}] = \kappa_x - \mathbf{k}_x^\top (K + \sigma^2 I)^{-1} \mathbf{k}_x$$
where $y = (y_1, \ldots, y_m)^\top$, $K_{i,j} = k(x_i, x_j)$, $[\mathbf{k}_x]_i = k(x_i, x)$, and $\kappa_x = k(x, x)$.
→ GPs are very easy to use because the marginals and conditionals of Gaussians are also Gaussian.
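A minimal Python sketch of these two formulas at a single test point x; the kernel k, the noise variance σ², and the training data are assumed given.

import numpy as np

def gp_posterior(X_train, y, k, sigma2, x):
    """Posterior mean and variance of f(x) under GP(0, k) with Gaussian noise."""
    m = len(X_train)
    K = np.array([[k(xi, xj) for xj in X_train] for xi in X_train])
    k_x = np.array([k(xi, x) for xi in X_train])
    kappa_x = k(x, x)
    # Solve (K + sigma^2 I) A = [y, k_x] once and reuse both columns.
    A = np.linalg.solve(K + sigma2 * np.eye(m), np.column_stack([y, k_x]))
    mean = k_x @ A[:, 0]            # k_x^T (K + sigma^2 I)^{-1} y
    var = kappa_x - k_x @ A[:, 1]   # kappa_x - k_x^T (K + sigma^2 I)^{-1} k_x
    return mean, var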
One-class SVM and Multiclass SVM
The one-class SVM (outlier detection)
RKHS primal form:
$$\hat f = \underset{f\in\mathcal{H}_k}{\operatorname{argmin}}\; \frac{1}{m}\sum_{i=1}^{m} \big(1 - f(x_i)\big)_{\geq 0} + \frac{1}{2C}\, \|f\|^2_{\mathcal{H}_k}\,.$$
Tries to peg f(x_i) ≥ 0 for all points x_1, . . . , x_m in the training set
→ outlier detector.
Dual form
$$\underset{\alpha_1,\ldots,\alpha_m}{\text{maximize}}\quad L(\alpha) = \sum_{i} \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i\, \alpha_j\, k(x_i, x_j)$$
$$\text{subject to}\quad 0 \leq \alpha_i \leq \frac{C}{m} \quad \forall i.$$
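Because this dual has only box constraints, it can be solved with something as simple as projected gradient ascent. A minimal Python sketch under that assumption (in practice a dedicated QP solver would typically be used):

import numpy as np

def one_class_svm_dual(K, C, n_iters=1000):
    """Maximise sum_i alpha_i - 0.5 alpha^T K alpha  s.t.  0 <= alpha_i <= C/m."""
    m = K.shape[0]
    # Step size 1/L, where L is the Lipschitz constant of the gradient.
    lr = 1.0 / max(np.linalg.eigvalsh(K).max(), 1e-12)
    alpha = np.zeros(m)
    for _ in range(n_iters):
        grad = 1.0 - K @ alpha                            # gradient of L(alpha)
        alpha = np.clip(alpha + lr * grad, 0.0, C / m)    # project onto the box
    return alpha    # the learned function is f_hat(x) = sum_i alpha_i k(x_i, x)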
The Multiclass SVM
• Defining $f_z(x) = z f(x)/2$ for $z = \pm 1$,
$$\ell_{\text{hinge}}(f(x), y) = \big(1 - y f(x)\big)_{\geq 0} = \big(1 - (f_y(x) - f_{-y}(x))\big)_{\geq 0}\,,$$
i.e., the correct answer is supposed to beat the incorrect answer by at least a margin of 1.
• This inspires the multiclass hinge loss
$$\ell\big(f_1(x), \ldots, f_k(x), y\big) = \sum_{y' \in \{1,2,\ldots,k\}\setminus\{y\}} \big(1 - (f_y(x) - f_{y'}(x))\big)_{\geq 0}\,,$$
which is the basis of the k-class SVM ($f_j(x)$ is a bit like a “score”). This is essentially the same notion of multiclass margin as in the k-class perceptron. Predict $\hat y = \operatorname{argmax}_{j\in\mathcal{Y}} f_j(x)$.
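A minimal Python sketch of the multiclass hinge loss and the argmax prediction rule, taking the score vector (f_1(x), ..., f_k(x)) as given.

import numpy as np

def multiclass_hinge_loss(scores, y):
    """Sum of (1 - (f_y(x) - f_{y'}(x)))_{>=0} over all wrong classes y'."""
    margins = 1.0 - (scores[y] - scores)   # one entry per class
    margins[y] = 0.0                       # exclude y' = y from the sum
    return np.sum(np.maximum(margins, 0.0))

def predict(scores):
    """y_hat = argmax_j f_j(x)."""
    return int(np.argmax(scores))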
RKHS form of Multiclass SVM
The loss now depends on not just $f_y(x)$, but also $f_{y'}(x)$ for all $y' \neq y$,
so the RKHS form also needs to be generalized slightly:
$$\hat f = \underset{f\in\mathcal{H}_k}{\operatorname{argmin}}\; \underbrace{\frac{1}{m}\sum_{i=1}^{m} \ell\big(f_1(x_i), f_2(x_i), \ldots, f_k(x_i), y_i\big)}_{\text{training error}} + \underbrace{\Omega\big(\|f\|_{\mathcal{H}}\big)}_{\text{regularizer}}\,.$$
The corresponding generalized Representer Theorem will say that
$$f_j(x) = \sum_{i=1}^{m} \alpha_{i,j}\, k(x_i, x)$$
for all $j \in \{1, \ldots, k\}$, so now we have many more coefficients to optimize.
Structured prediction
Multiclass to Structured Prediction
What if we combine $f_1, \ldots, f_k : \mathcal{X} \to \mathbb{R}$ in the k-class SVM into a single
function $f : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$, where $\mathcal{Y} = \{1, \ldots, k\}$? The loss becomes
$$\ell(f, x, y) = \sum_{y' \in \mathcal{Y}\setminus\{y\}} \big(1 - (f(x, y) - f(x, y'))\big)_{\geq 0}\,.$$
IDEA: Use this to search for f in a joint RKHS of functions
f: X ×Y →R:
• Kernel becomes k((x, y), (x′ , y ′ )) .
• We can now put structure on Y as well as X .
• We are now, in effect, learning a single mapping X → Y directly, via ŷ(x) = argmax_{y∈Y} f(x, y).
• At the extreme, the distinction between X and Y is blurred.
→ structured prediction
RRM form of Structured Prediction
Let k be a psd kernel $k : (\mathcal{X}\times\mathcal{Y}) \times (\mathcal{X}\times\mathcal{Y}) \to \mathbb{R}$, let $\mathcal{H}_k$ be the
corresponding RKHS, and $\Omega$ a monotonically increasing function. Solve
$$\hat f = \underset{f\in\mathcal{H}_k}{\operatorname{argmin}}\; \underbrace{\frac{1}{m}\sum_{i=1}^{m} \ell\big((f(x_i, y))_{y\in\mathcal{Y}},\, y_i\big)}_{\text{training error}} + \underbrace{\Omega\big(\|f\|_{\mathcal{H}_k}\big)}_{\text{regularizer}}\,,$$
and predict $\hat y = \operatorname{argmax}_{y\in\mathcal{Y}} \hat f(x, y)$.
The loss (in principle) now depends on $f(x_i, y)$ for all possible $y$.
Therefore, the Representer Theorem says that $f$ is of the form
$$f(x, y) = \sum_{i=1}^{m}\sum_{y^* \in \mathcal{Y}} \alpha_{i,y^*}\, k\big((x_i, y^*), (x, y)\big).$$
In practice this is usually infeasible, so one only adds $\alpha_{i,y^*}$ coefficients to the
optimization on the fly, “as needed”.
Kernels for Structured Learning
The simplest way to get a kernel k : (X × Y) × (X × Y) → R :
• Get a kernel kX that quantifies similarity between the x ’s.
• Get a kernel kY that quantifies similarity between the y ’s.
• Define
k((x, y), (x′ , y ′ )) = kX (x, x′ ) · kY (y, y ′ ).
Question: Is this a valid kernel? What is its RKHS?
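As a quick numerical illustration (not an answer to the question), the sketch below builds the product kernel from an RBF kernel on X and a delta kernel on Y, both hypothetical choices, and checks that the Gram matrix on a random sample of (x, y) pairs has no significantly negative eigenvalues.

import numpy as np

def product_kernel(kX, kY):
    """k((x, y), (x', y')) = kX(x, x') * kY(y, y')."""
    return lambda a, b: kX(a[0], b[0]) * kY(a[1], b[1])

rng = np.random.default_rng(0)
kX = lambda x, xp: np.exp(-np.sum((x - xp) ** 2))   # RBF kernel on X
kY = lambda y, yp: float(y == yp)                   # delta kernel on Y
k = product_kernel(kX, kY)

# Gram matrix on 20 random (x, y) pairs; PSD-ness shows up as a non-negative spectrum.
pairs = [(rng.normal(size=3), int(rng.integers(0, 4))) for _ in range(20)]
K = np.array([[k(a, b) for b in pairs] for a in pairs])
print(np.linalg.eigvalsh(K).min() >= -1e-10)        # expect: True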