Support Vector Machines
More Generally: Kernel Methods
Two Important Concepts
Seek a linear decision boundary using two new concepts:
Optimization based on maximizing margins (not gradient descent)
Kernel mapping for better linear separation ("massaging the data")
Two-Class Problem: Linearly Separable Case
Many decision boundaries can separate these two classes.
Which one should we choose?
[Figure: two linearly separable classes (Class 1 and Class 2) with several candidate decision boundaries]
Example of Bad Decision Boundaries
[Figure: two examples of poorly placed decision boundaries between Class 1 and Class 2]
Linear Classifiers: Which Hyperplane?
Lots of possible solutions for a, b, c.
Some methods find a separating hyperplane, but not the optimal one (according to some criterion of expected goodness), e.g., the perceptron or gradient descent.
The Support Vector Machine (SVM) finds an optimal solution: it maximizes the distance between the hyperplane and the "difficult" points close to the decision boundary.
One intuition: if there are no points near the decision surface, then there are no very uncertain classification decisions.
[Figure: a line representing the decision boundary ax + by - c = 0]
Another Intuition
If you have to place a fat separator between the classes, you have fewer choices, and so the capacity of the model has been decreased.
Support Vector Machine (SVM)
SVMs maximize the margin around the separating hyperplane (a.k.a. large-margin classifiers).
The decision function is fully specified by a subset of the training samples, the support vectors.
Training is a quadratic programming problem.
Seen by many as the most successful current text classification method.
[Figure: maximum-margin separating hyperplane; the support vectors are the points lying on the margin boundaries]
SVM with Kernel Mapping
The original feature space might not be well conditioned for linear separation.
Map X -> \phi(X) (in a higher-dimensional space).
Non-linear SVMs
Datasets that are linearly separable (with some noise) work out great.
But what are we going to do if the dataset is just too hard?
How about mapping the data to a higher-dimensional space?
[Figure: 1-D data on the x axis that is not linearly separable becomes separable after mapping x to (x, x^2)]
Non-linear SVMs: Feature Spaces
General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is likely separable:
\Phi: x \to \phi(x)
Transformation to Feature Space
High computational burden due to the high dimensionality, and hard to get a good estimate.
The "kernel trick" allows efficient computation: only the inner products of feature vectors are used; the mapping is never computed explicitly (efficiency in both time and space).
[Figure: points in the input space mapped by \phi(.) to a feature space where they become linearly separable]
Kernel-Induced Feature Spaces
Recall from the perceptron learning rule that the weight vector is a combination of feature vectors (those that are difficult to handle, i.e., classified wrongly):

w(k+1) = w(k) + c y x,   if y (w(k) \cdot x) \le 0 for some x
w(k+1) = w(k),           otherwise
Kernel-Induced Feature Spaces
So the decision rule involves only inner products of features, not the features themselves:

h(x) = \sum_i \alpha_i y_i (x_i \cdot x),   where  w = \sum_i \alpha_i y_i x_i

The same holds true in the mapped space:
The dimension is not important (we may not even know the mapping).
The inner products can be evaluated in the low-dimensional space.

h(x) = \sum_i \alpha_i y_i (\phi(x_i) \cdot \phi(x)),   where  w = \sum_i \alpha_i y_i \phi(x_i)
Kernels
A kernel is a function that returns the dot product of the mappings of its two arguments:

k(x_1, x_2) = \phi(x_1) \cdot \phi(x_2) = \langle \phi(x_1), \phi(x_2) \rangle

The same perceptron-style learning can be applied (just replace all dot products with kernels); however, SVM uses more sophisticated learning rules.
Example
\phi: x = (x_1, x_2) \mapsto \phi(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2) \in F = R^3

k(x, y) = (x_1 y_1 + x_2 y_2)^2
        = x_1^2 y_1^2 + 2 x_1 y_1 x_2 y_2 + x_2^2 y_2^2
        = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2) \cdot (y_1^2, \sqrt{2}\, y_1 y_2, y_2^2)
        = \phi(x) \cdot \phi(y)
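As a quick sanity check (my own toy numbers, not from the slides), the kernel value computed in the input space matches the dot product after the explicit mapping:

```python
import numpy as np

# phi maps R^2 into R^3 exactly as on the slide: (x1^2, sqrt(2) x1 x2, x2^2)
def phi(v):
    x1, x2 = v
    return np.array([x1**2, np.sqrt(2.0) * x1 * x2, x2**2])

x = np.array([1.0, 3.0])
y = np.array([2.0, -1.0])

k_input_space = np.dot(x, y) ** 2         # k(x, y) = (x . y)^2, no mapping needed
k_feature_space = np.dot(phi(x), phi(y))  # same value via the explicit mapping
print(k_input_space, k_feature_space)     # both are 1.0 for these points
```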
Another Example
\phi: x = (x_1, x_2) \mapsto \phi(x) = (x_1^2, x_1 x_2, x_1 x_2, x_2^2) \in F = R^4

k(x, y) = (x_1 y_1 + x_2 y_2)^2
        = x_1^2 y_1^2 + 2 x_1 y_1 x_2 y_2 + x_2^2 y_2^2
        = (x_1^2, x_1 x_2, x_1 x_2, x_2^2) \cdot (y_1^2, y_1 y_2, y_1 y_2, y_2^2)
        = \phi(x) \cdot \phi(y)
Yet Another Example
\phi: x = (x_1, x_2) \mapsto \phi(x) = ( \tfrac{x_1^2 - x_2^2}{\sqrt{2}}, \sqrt{2}\, x_1 x_2, \tfrac{x_1^2 + x_2^2}{\sqrt{2}} ) \in F = R^3

k(x, y) = (x_1 y_1 + x_2 y_2)^2
        = x_1^2 y_1^2 + 2 x_1 y_1 x_2 y_2 + x_2^2 y_2^2
        = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2) \cdot (y_1^2, \sqrt{2}\, y_1 y_2, y_2^2)
        = ( \tfrac{x_1^2 - x_2^2}{\sqrt{2}}, \sqrt{2}\, x_1 x_2, \tfrac{x_1^2 + x_2^2}{\sqrt{2}} ) \cdot ( \tfrac{y_1^2 - y_2^2}{\sqrt{2}}, \sqrt{2}\, y_1 y_2, \tfrac{y_1^2 + y_2^2}{\sqrt{2}} )
        = \phi(x) \cdot \phi(y)

The interpretation of the mapping is not unique, even for a single kernel function k.
Gaussian Kernel
Identical to the Radial Basis Function (RBF) kernel:

k(x, z) = \langle \phi(x), \phi(z) \rangle = \exp( -\frac{\|x - z\|^2}{2\sigma^2} )

Recall that e^x = \sum_{i=0}^{\infty} \frac{x^i}{i!}, so the feature dimension is infinitely high in this case.
The embedding is not explicitly computed (impossible); only the inner product is needed in the SVM.
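A minimal numerical sketch (my own values) of evaluating the Gaussian kernel directly from the formula; the infinite-dimensional embedding phi is never formed:

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    # k(x, z) = exp(-||x - z||^2 / (2 sigma^2)); equivalently gamma = 1 / (2 sigma^2)
    diff = x - z
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

x = np.array([1.0, 2.0])
z = np.array([2.0, 0.0])
print(gaussian_kernel(x, z))   # exp(-5/2), roughly 0.082
```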
SVM Cost Function
Similar to the logistic loss, but piecewise linear: there is no penalty for z > 1 when y = 1, and no penalty for z < -1 when y = 0.
Margin
For w^T x to be > 1 (for y = 1) and < -1 (for y = -1), w must be larger in the left figure than in the right. So a large margin implies a small |w|^2.
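A short standard derivation (not spelled out on the slide) of why a large margin means a small |w|^2, using the canonical scaling w^T x + b = ±1 on the margin boundaries:

```latex
\begin{aligned}
\text{distance of a point } x \text{ to the hyperplane } w^T x + b = 0
  &= \frac{|w^T x + b|}{\|w\|},\\
\text{margin} &= \frac{1}{\|w\|} + \frac{1}{\|w\|} = \frac{2}{\|w\|},\\
\max_{w,b}\; \frac{2}{\|w\|} &\iff \min_{w,b}\; \tfrac{1}{2}\|w\|^2 .
\end{aligned}
```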
Numerical Methods
Maximizing the margin (i.e., minimizing |w|^2) cannot be done by gradient descent, but by quadratic programming (DON'T implement it yourself; use a readily available implementation such as SVMlight or LIBSVM).

h(x) = \sum_i \alpha_i y_i (x_i \cdot x) + b        or, with a kernel,        h(x) = \sum_i \alpha_i y_i K(x_i, x) + b

The sum runs over the support vectors; x_i is the i-th support vector, \alpha_i is its weight, and b is a bias.
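A minimal sketch (assuming scikit-learn, which wraps LIBSVM; the data are my own toy blobs) showing that the learned classifier really has the form h(x) = sum_i alpha_i y_i K(x_i, x) + b over the support vectors only:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=60, centers=2, random_state=0)
clf = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)

x_new = X[:1]                                                    # one query point
K = np.exp(-0.5 * np.sum((clf.support_vectors_ - x_new) ** 2, axis=1))
# dual_coef_ stores alpha_i * y_i for each support vector
h_manual = np.dot(clf.dual_coef_[0], K) + clf.intercept_[0]
print(h_manual, clf.decision_function(x_new)[0])                 # the two values agree
```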
The Optimization Problem
Let {x_1, ..., x_n} be our data set and let y_i \in \{1, -1\} be the class label of x_i.
The decision boundary should classify all points correctly (hard margin):
    y_i (w^T x_i + b) \ge 1  for all i
This gives a constrained optimization problem:
    minimize \tfrac{1}{2} \|w\|^2   subject to   y_i (w^T x_i + b) \ge 1  for all i
Note that ||w||^2 = w^T w.
Optimization of f(X), No Constraints
Necessary condition: \nabla f = 0
[Figure: contours of f = x^2 + y^2 in the (x, y) plane; the unconstrained minimum is where the gradient vanishes]
Optimization of f(X), Equality Constraints
Constraint g(X) = 0:
Minimize f + \lambda g, i.e., solve \nabla f + \lambda \nabla g = 0
[Figure: contours of f = x^2 + y^2 and the constraint line g: x + y = 1; at the optimum \nabla f and \nabla g are parallel]
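A worked instance of the picture above, using the slide's f = x^2 + y^2 and g: x + y = 1:

```latex
\begin{aligned}
\nabla f + \lambda \nabla g = 0 &\;\Rightarrow\; (2x, 2y) + \lambda (1, 1) = (0, 0)
  \;\Rightarrow\; x = y = -\tfrac{\lambda}{2},\\
g(x, y) = 0 &\;\Rightarrow\; x + y = 1
  \;\Rightarrow\; x = y = \tfrac{1}{2},\quad \lambda = -1,\quad f = \tfrac{1}{2}.
\end{aligned}
```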
Optimization of f(X), Inequality Constraints
Constraint h(X) <= 0: minimize f + \lambda h, i.e., solve \nabla f + \lambda \nabla h = 0
Where can the optimum (\nabla f = 0 if unconstrained) lie?
Outside the feasible region (h(X) > 0): not a solution.
Inside the feasible region (h(X) < 0): the constraint does not matter, so \lambda = 0.
On the boundary of the feasible region (h(X) = 0): h behaves just like g (an equality constraint).
For a minimization problem:
\nabla f must point inside the feasible region
\nabla h must point outside
\nabla f = -\lambda \nabla h, with \lambda > 0
KKT Conditions
h(x) \le 0
\lambda \ge 0
\lambda h(x) = 0   (complementary slackness: \lambda = 0 whenever the constraint is inactive)
\nabla f + \lambda \nabla h = 0
Lagrangian of the Original Problem
The Lagrangian is

L = \tfrac{1}{2} w^T w + \sum_i \alpha_i (1 - y_i (w^T x_i + b)),   with Lagrangian multipliers \alpha_i \ge 0

Note that ||w||^2 = w^T w.
Setting the gradient of L w.r.t. w and b to zero:
\partial L / \partial w = 0  =>  w = \sum_i \alpha_i y_i x_i
\partial L / \partial b = 0  =>  \sum_i \alpha_i y_i = 0
L = \tfrac{1}{2} w^T w + \sum_i \alpha_i (1 - y_i (w^T x_i + b))
  = \tfrac{1}{2} w^T w + \sum_i \alpha_i - w^T \sum_i \alpha_i y_i x_i - b \sum_i \alpha_i y_i
  = \tfrac{1}{2} w^T w + \sum_i \alpha_i - w^T w        (using w = \sum_i \alpha_i y_i x_i and \sum_i \alpha_i y_i = 0)
  = -\tfrac{1}{2} w^T w + \sum_i \alpha_i
  = -\tfrac{1}{2} ( \sum_i \alpha_i y_i x_i )^T ( \sum_j \alpha_j y_j x_j ) + \sum_i \alpha_i
  = \sum_i \alpha_i - \tfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)
The Dual Optimization Problem
We can transform the problem to its dual, in the new variables \alpha_i (the Lagrangian multipliers); the data enter only through dot products x_i \cdot x_j:

maximize    \sum_i \alpha_i - \tfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)
subject to  \alpha_i \ge 0,   \sum_i \alpha_i y_i = 0

This is a convex quadratic programming (QP) problem.
The global maximum over \alpha_i can always be found, and well-established tools exist for solving this optimization problem (e.g., CPLEX).
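A minimal sketch of handing the dual above to an off-the-shelf QP solver (here the cvxopt package; any QP tool works). This is the hard-margin version, so it assumes separable data:

```python
import numpy as np
from cvxopt import matrix, solvers

def svm_dual(X, y):
    """Maximize sum(a) - 1/2 sum a_i a_j y_i y_j x_i.x_j  s.t.  a >= 0, sum a_i y_i = 0.
    X: float array of shape (n, d); y: array of +1/-1 labels."""
    n = X.shape[0]
    K = X @ X.T                                   # Gram matrix of dot products
    P = matrix(np.outer(y, y) * K)                # quadratic term
    q = matrix(-np.ones(n))                       # QP solvers minimize, so negate
    G = matrix(-np.eye(n))                        # -a_i <= 0, i.e. a_i >= 0
    h = matrix(np.zeros(n))
    A = matrix(y.reshape(1, -1).astype(float))    # equality constraint sum a_i y_i = 0
    b = matrix(0.0)
    sol = solvers.qp(P, q, G, h, A, b)
    return np.ravel(sol["x"])                     # the multipliers alpha_i

# Only the alpha_i that come back non-zero correspond to support vectors.
```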
Quadratic Programming
The cost is quadratic.
Without constraints there is a unique minimum, and any descent search technique (e.g., gradient descent) would eventually find it.
The added twist: the search is subject to constraints.
A Geometrical Interpretation
The support vectors are the points whose \alpha's are different from zero; they "hold up" the separating plane.
[Figure: Class 1 and Class 2 separated by a plane; the support vectors carry non-zero multipliers (\alpha_1 = 0.8, \alpha_6 = 1.4, \alpha_8 = 0.6) while all other points have \alpha_i = 0]
Classification with SVMs
Given a new point x = (x_1, x_2), we can score its projection onto the hyperplane normal:
In 2 dimensions: score = w_1 x_1 + w_2 x_2 + b
I.e., compute score = w \cdot x + b = \sum_i \alpha_i y_i (x_i^T x) + b
Set a confidence threshold t:
score > t: yes
score < -t: no
else: don't know
Soft Margin Classification
If the training set is not linearly separable, slack variables \xi_i can be added to allow misclassification of difficult or noisy examples.
Allow some errors: let some points be moved to where they belong, at a cost.
Still, try to minimize the training-set errors and to place the hyperplane "far" from each class (large margin).
[Figure: a soft-margin hyperplane with two points violating their margins by \xi_i and \xi_j]
Soft Margin
We allow an "error" \xi_i in classification; it is based on the output of the discriminant function w^T x + b.
\sum_i \xi_i approximates the number of misclassified samples.
New objective function: C is a tradeoff parameter between error and margin, chosen by the user; a large C means a higher penalty on errors.
[Figure: Class 1 and Class 2 with a soft-margin separating hyperplane]
Linear Non-Separable Case
When the classes are not separable, introduce slack variables and relax the constraints:

y_i (w \cdot x_i + b) \ge 1 - \xi_i,   \xi_i \ge 0

Minimize the objective function

L = \tfrac{1}{2} \|w\|^2 + C \sum_i \xi_i,   C \ge 0 (a user-chosen weight)

This allows some samples into the buffer zone, but minimizes their number and the amount of protrusion.
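A minimal sketch (my own overlapping toy data, assuming scikit-learn) of the role of C: a small C tolerates more slack, a large C penalizes protrusions heavily:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping blobs, so a hard margin is impossible and slack is needed.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=1)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:>6}: {len(clf.support_)} support vectors")
# The number of support vectors (points inside or violating the margin)
# typically shrinks as C grows and the margin tightens.
```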
Wolfe Dual
Primal:
    L = \tfrac{1}{2} w \cdot w + C \sum_i \xi_i,   with \xi_i \ge 0 and C \ge 0

Dual:
    maximize    \sum_i \alpha_i - \tfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)
    subject to  \sum_i \alpha_i y_i = 0,   0 \le \alpha_i \le C
General Case
Primal:
    L = \tfrac{1}{2} w \cdot w + C \sum_i \xi_i,   with \xi_i \ge 0 and C \ge 0

Dual (with a kernel):
    maximize    \sum_i \alpha_i - \tfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j k(x_i, x_j)
    subject to  \sum_i \alpha_i y_i = 0,   0 \le \alpha_i \le C
Example
k(x_1, x_2) = (1 + x_1 \cdot x_2)^2 = (1 + x_{11} x_{21} + x_{12} x_{22})^2
            = 1 + x_{11}^2 x_{21}^2 + 2 x_{11} x_{12} x_{21} x_{22} + x_{12}^2 x_{22}^2 + 2 x_{11} x_{21} + 2 x_{12} x_{22}

\phi(x_i) = [1, x_{i1}^2, \sqrt{2}\, x_{i1} x_{i2}, x_{i2}^2, \sqrt{2}\, x_{i1}, \sqrt{2}\, x_{i2}]^T

XOR data and kernel matrix:

X: (1,1), (1,-1), (-1,1), (-1,-1)        y: 1, -1, -1, 1

K = [ 9 1 1 1
      1 9 1 1
      1 1 9 1
      1 1 1 9 ]

L = \sum_i \alpha_i - \tfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j k(x_i, x_j)
  = (\alpha_1 + \alpha_2 + \alpha_3 + \alpha_4)
    - \tfrac{1}{2} (9\alpha_1^2 + 9\alpha_2^2 + 9\alpha_3^2 + 9\alpha_4^2
                    - 2\alpha_1\alpha_2 - 2\alpha_1\alpha_3 + 2\alpha_1\alpha_4
                    + 2\alpha_2\alpha_3 - 2\alpha_2\alpha_4 - 2\alpha_3\alpha_4)
Example (cont.)
Setting \partial L / \partial \alpha_i = 0 gives the linear system

 9\alpha_1 - \alpha_2 - \alpha_3 + \alpha_4 = 1
-\alpha_1 + 9\alpha_2 + \alpha_3 - \alpha_4 = 1
-\alpha_1 + \alpha_2 + 9\alpha_3 - \alpha_4 = 1
 \alpha_1 - \alpha_2 - \alpha_3 + 9\alpha_4 = 1

whose solution is \alpha_1 = \alpha_2 = \alpha_3 = \alpha_4 = 1/8 \ge 0, so the constraints are satisfied.

w = \sum_i \alpha_i y_i \phi(X_i) = \tfrac{1}{8} [\phi(X_1) - \phi(X_2) - \phi(X_3) + \phi(X_4)]
  = \tfrac{1}{8} [0, 0, 4\sqrt{2}, 0, 0, 0]^T = [0, 0, \tfrac{\sqrt{2}}{2}, 0, 0, 0]^T

y = \langle w, \phi(x) \rangle = \tfrac{\sqrt{2}}{2} \cdot \sqrt{2}\, x_1 x_2 = x_1 x_2

SVM can solve XOR!
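A small numerical check (not on the slides) of the XOR example: build the Gram matrix for k(x, z) = (1 + x·z)^2, solve dL/d\alpha = 0, and confirm the decision function reduces to x1·x2:

```python
import numpy as np

X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
y = np.array([1, -1, -1, 1], dtype=float)

K = (1.0 + X @ X.T) ** 2                  # 9 on the diagonal, 1 elsewhere
H = np.outer(y, y) * K                    # H_ij = y_i y_j k(x_i, x_j)
alpha = np.linalg.solve(H, np.ones(4))    # dL/dalpha = 0  ->  H alpha = 1
print(alpha)                              # [0.125 0.125 0.125 0.125]

def phi(v):                               # the explicit degree-2 feature map
    x1, x2 = v
    s = np.sqrt(2.0)
    return np.array([1.0, x1**2, s * x1 * x2, x2**2, s * x1, s * x2])

w = sum(a * yi * phi(x) for a, yi, x in zip(alpha, y, X))
print(np.round(w, 3))                     # only the sqrt(2)*x1*x2 entry is non-zero
print([round(w @ phi(x), 3) for x in X])  # recovers x1*x2: [1, -1, -1, 1]
```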
Example: 5 1-D Data Points
[Figure: value of the discriminant function over the points x = 1, 2 (class 1), 4, 5 (class 2), 6 (class 1)]
Example
Five 1-D data points: x_1 = 1, x_2 = 2, x_3 = 4, x_4 = 5, x_5 = 6, with 1, 2, 6 as class 1 and 4, 5 as class 2, so y_1 = 1, y_2 = 1, y_3 = -1, y_4 = -1, y_5 = 1.
We use the polynomial kernel of degree 2: K(x, y) = (xy + 1)^2
C is set to 100.
We first find \alpha_i (i = 1, ..., 5) by maximizing
    \sum_i \alpha_i - \tfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j K(x_i, x_j)
subject to 0 \le \alpha_i \le 100 and \sum_i \alpha_i y_i = 0.
Example (cont.)
By using a QP solver, we get
    \alpha_1 = 0, \alpha_2 = 2.5, \alpha_3 = 0, \alpha_4 = 7.333, \alpha_5 = 4.833
Verify (at home) that the constraints are indeed satisfied.
The support vectors are {x_2 = 2, x_4 = 5, x_5 = 6}.
The discriminant function is
    f(z) = 2.5 (2z + 1)^2 - 7.333 (5z + 1)^2 + 4.833 (6z + 1)^2 + b = 0.6667 z^2 - 5.333 z + b
b is recovered by solving f(2) = 1, f(5) = -1, or f(6) = 1 (x_2, x_4, x_5 all lie on the margin), and all give b = 9.
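A minimal sketch (assuming scikit-learn) of reproducing this example; SVC's polynomial kernel is (gamma*<x, y> + coef0)^degree, so gamma = 1, coef0 = 1, degree = 2 matches K(x, y) = (xy + 1)^2:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0], [2.0], [4.0], [5.0], [6.0]])
y = np.array([1, 1, -1, -1, 1])

clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=100).fit(X, y)
print(clf.support_vectors_.ravel())   # expect the support vectors 2, 5, 6
print(clf.dual_coef_)                 # alpha_i * y_i; compare with the values above
print(clf.intercept_)                 # expect a bias close to 9
```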
Software
A list of SVM implementations can be found at [Link]
[Link]/[Link]
Some implementations (such as LIBSVM) can handle multi-class classification.
SVMlight is among the earliest implementations of SVM.
Several MATLAB toolboxes for SVM are also available.
Evaluation: Classic Reuters Data Set
Most (over)used data set: 21,578 documents
9,603 training and 3,299 test articles
118 categories; an article can be in more than one category
Learn 118 binary category distinctions
Average document: about 90 types, 200 tokens
Average number of classes assigned: 1.24 for docs with at least one category
Only about 10 out of the 118 categories are large. Common categories (#train, #test):
    Earn (2877, 1087)         Trade (369, 119)
    Acquisitions (1650, 179)  Interest (347, 131)
    Money-fx (538, 179)       Ship (197, 89)
    Grain (433, 149)          Wheat (212, 71)
    Crude (389, 189)          Corn (182, 56)
Reuters Text Categorization Data Set (Reuters-21578): a sample document
<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET"
OLDID="12981" NEWID="798">
<DATE> 2-MAR-1987 [Link].42</DATE>
<TOPICS><D>livestock</D><D>hog</D></TOPICS>
<TITLE>AMERICAN PORK CONGRESS KICKS OFF TOMORROW</TITLE>
<DATELINE> CHICAGO, March 2 - </DATELINE><BODY>The American Pork Congress
kicks off tomorrow, March 3, in Indianapolis with 160 of the nations pork producers from 44
member states determining industry positions on a number of issues, according to the National Pork
Producers Council, NPPC.
Delegates to the three day Congress will be considering 26 resolutions concerning various issues,
including the future direction of farm policy and the tax law as it applies to the agriculture sector.
The delegates will also debate whether to endorse concepts of a national PRV (pseudorabies virus)
control and eradication program, the NPPC said.
A large trade show, in conjunction with the congress, will feature the latest in technology in all
areas of the industry, the NPPC added. Reuter
</BODY></TEXT></REUTERS>
New Reuters: RCV1 (810,000 docs)
[Figure: top topics in Reuters RCV1]
Good Practice Department: Confusion Matrix
Entry (i, j) counts the docs actually in class i that the classifier put in class j; e.g., the highlighted entry means 53 of the docs actually in class i were assigned class j.
Rows: actual class. Columns: class assigned by the classifier.
In a perfect classification, only the diagonal has non-zero entries.
Per-Class Evaluation Measures
Recall: fraction of docs in class i classified correctly:  c_{ii} / \sum_j c_{ij}
Precision: fraction of docs assigned class i that are actually about class i:  c_{ii} / \sum_j c_{ji}
"Correct rate" (1 - error rate): fraction of docs classified correctly:  \sum_i c_{ii} / \sum_i \sum_j c_{ij}
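A minimal sketch (my own toy confusion matrix) of the three measures above:

```python
import numpy as np

C = np.array([[53,  2,  5],    # rows: actual class, columns: assigned class
              [ 3, 40,  7],
              [ 4,  1, 60]])

recall    = np.diag(C) / C.sum(axis=1)   # c_ii / sum_j c_ij
precision = np.diag(C) / C.sum(axis=0)   # c_ii / sum_j c_ji
accuracy  = np.trace(C) / C.sum()        # sum_i c_ii / sum_ij c_ij
print(recall, precision, accuracy)
```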
Dumais et al. 1998:
Reuters - Accuracy
Rocchio NBayes Trees LinearSVM
earn 92.9% 95.9% 97.8% 98.2%
acq 64.7% 87.8% 89.7% 92.8%
money-fx 46.7% 56.6% 66.2% 74.0%
grain 67.5% 78.8% 85.0% 92.4%
crude 70.1% 79.5% 85.0% 88.3%
trade 65.1% 63.9% 72.5% 73.5%
interest 63.4% 64.9% 67.1% 76.3%
ship 49.2% 85.4% 74.2% 78.0%
wheat 68.9% 69.7% 92.5% 89.7%
corn 48.2% 65.3% 91.8% 91.1%
Avg Top 10 64.6% 81.5% 88.4% 91.4%
Avg All Cat 61.7% 75.2% na 86.4%
Recall: % labeled in category among those stories that are really in category
Precision: % really in category among those stories labeled in category
Break Even: (Recall + Precision) / 2
Reuters ROC - Category Grain
[Figure: recall vs. precision curves for category Grain, comparing LSVM, Decision Tree, Naïve Bayes, and Find Similar]
Recall: % labeled in category among those stories that are really in category
Precision: % really in category among those stories labeled in category
ROC for Category - Crude
[Figure: recall vs. precision curves for category Crude, comparing LSVM, Decision Tree, Naïve Bayes, and Find Similar]
Linear Programming
Standard max:  maximize c^T x  subject to  A x \le b,  x \ge 0
Standard min:  minimize b^T y  subject to  y^T A \ge c^T,  y \ge 0
Standard Max Example
(x1, x2): (#pants, #shirts)
(b1, b2): (#buttons, #zippers)
(c1, c2): (profit/pant, profit/shirt)
(y1, y2): ($/button, $/zipper) <- shadow price
A_ij: amount of resource i (buttons, zippers) used per unit of product j (pants, shirts); write bn_j for buttons and zp_j for zippers used per unit of product j

Primal (max):
    max  [c1, c2] [x1; x2]
    subject to  [bn1, bn2; zp1, zp2] [x1; x2] <= [b1; b2],   x1 >= 0, x2 >= 0

Dual (min):
    min  [y1, y2] [b1; b2]
    subject to  [y1, y2] [bn1, bn2; zp1, zp2] >= [c1, c2],   y1 >= 0, y2 >= 0
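A minimal sketch (hypothetical numbers for profits and resource usage, assuming SciPy) of solving the pants/shirts standard-max problem; linprog minimizes, so the profit vector is negated:

```python
from scipy.optimize import linprog

c = [-10.0, -4.0]          # hypothetical profit per pant and per shirt (negated)
A = [[4.0, 2.0],           # hypothetical buttons used per pant, per shirt
     [1.0, 0.0]]           # hypothetical zippers used per pant, per shirt
b = [400.0, 60.0]          # hypothetical buttons and zippers available

res = linprog(c, A_ub=A, b_ub=b, bounds=[(0, None), (0, None)])
print(res.x, -res.fun)     # optimal plan (60 pants, 80 shirts) and profit 920
```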
Standard Max
Max is primal: make the pants and shirts yourself; maximize your profit; stay within the available buttons and zippers.
Min is dual: sell your material to a buyer; the buyer minimizes cost, but must offer more than what you could earn by making the items yourself.
Standard Min Example
(y1, y2): (pounds of meat, pounds of veggies)
(b1, b2): ($/pound of meat, $/pound of veggies)
(c1, c2): (required calories, required protein)
A_ij: amount of nutrient j (calories, protein) provided per pound of food i (meat, veggies)
(x1, x2): ($/calorie supplement, $/protein supplement) <- shadow price

Primal (min):
    min  [y1, y2] [b1; b2]
    subject to  [y1, y2] [c/pound_meat, p/pound_meat; c/pound_veggi, p/pound_veggi] >= [c1, c2],   y1 >= 0, y2 >= 0

Dual (max):
    max  [c1, c2] [x1; x2]
    subject to  [c/pound_meat, p/pound_meat; c/pound_veggi, p/pound_veggi] [x1; x2] <= [b1; b2],   x1 >= 0, x2 >= 0
Standard Min
Min is primal: buy meat and veggies to supply enough calories and protein; minimize your cost.
Max is dual: aliens sell you magic calorie and protein pills; they maximize their profit, but must charge less than what you would pay to buy the food yourself.
Some Facts about LP
All linear programming problems can be transformed into either the standard max or the standard min form.
If the max is feasible then the min is feasible, and vice versa.
The optimal max value equals the optimal min value (no duality gap).
Conversion from Primal to Dual
Primal (min):
    min  [y1, y2] [b1; b2]
    subject to  [y1, y2] [a11, a12; a21, a22] >= [c1, c2],   y1 >= 0, y2 >= 0

Write the constraints as c1 - a11*y1 - a21*y2 <= 0 and c2 - a12*y1 - a22*y2 <= 0 and form the Lagrangian:

    max_{x1,x2} min_{y1,y2}  b1*y1 + b2*y2 + x1*(c1 - a11*y1 - a21*y2) + x2*(c2 - a12*y1 - a22*y2)
  = max_{x1,x2} min_{y1,y2}  c1*x1 + c2*x2 + y1*(b1 - a11*x1 - a12*x2) + y2*(b2 - a21*x1 - a22*x2)

Reading off the inner minimization over y gives the dual (max):
    max  [c1, c2] [x1; x2]
    subject to  [a11, a12; a21, a22] [x1; x2] <= [b1; b2],   x1 >= 0, x2 >= 0
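A small numerical check (hypothetical coefficients, assuming SciPy) that the primal min and the dual max above share the same optimal value (no gap):

```python
import numpy as np
from scipy.optimize import linprog

A = np.array([[2.0, 1.0],    # hypothetical a11 a12
              [1.0, 3.0]])   # hypothetical a21 a22
b = np.array([5.0, 4.0])     # b1, b2
c = np.array([3.0, 2.0])     # c1, c2

# Primal: min b.y  subject to  y^T A >= c, y >= 0  (rewritten as -A^T y <= -c)
primal = linprog(b, A_ub=-A.T, b_ub=-c, bounds=[(0, None)] * 2)
# Dual:   max c.x  subject to  A x <= b, x >= 0    (negate c because linprog minimizes)
dual = linprog(-c, A_ub=A, b_ub=b, bounds=[(0, None)] * 2)

print(primal.fun, -dual.fun)   # both equal 7.8: no duality gap
```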