Linear Classification:
The Perceptron
These slides were assembled by Eric Eaton, with grateful acknowledgement of the many others who made
their course materials freely available online. Feel free to reuse or adapt these slides for your own academic
purposes, provided that you include proper attribution. Please send comments and corrections to Eric.
Linear Classifiers
• A hyperplane partitions $\mathbb{R}^d$ into two half-spaces
– Defined by the normal vector $\theta \in \mathbb{R}^d$
• $\theta$ is orthogonal to any vector lying on the hyperplane
– Assumed to pass through the origin
• This is because we incorporate the bias term $\theta_0$ into $\theta$ by setting $x_0 = 1$
• Consider classification with +1, -1 labels ...
Based on slide by Piyush Rai
Linear Classifiers
• Linear classifiers: represent decision boundary by hyperplane
$$\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_d \end{bmatrix} \qquad x^\top = \begin{bmatrix} 1 & x_1 & \cdots & x_d \end{bmatrix}$$

$$h(x) = \operatorname{sign}(\theta^\top x) \quad \text{where} \quad \operatorname{sign}(z) = \begin{cases} +1 & \text{if } z \ge 0 \\ -1 & \text{if } z < 0 \end{cases}$$

– Note that: $\theta^\top x > 0 \implies y = +1$ and $\theta^\top x < 0 \implies y = -1$
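To make this concrete, here is a minimal NumPy sketch of such a classifier (illustrative code, not part of the original slides); the constant feature $x_0 = 1$ is prepended so that $\theta_0$ acts as the bias:

```python
import numpy as np

def predict(theta, x):
    """Classify x as +1 or -1 using the hyperplane with normal vector theta."""
    # x is assumed to carry the constant feature x_0 = 1, so theta[0] is the bias
    return 1 if theta @ x >= 0 else -1

# Hypothetical example: theta encodes the decision boundary x_1 + x_2 - 1 = 0
theta = np.array([-1.0, 1.0, 1.0])                # [theta_0, theta_1, theta_2]
print(predict(theta, np.array([1.0, 2.0, 2.0])))  # +1: theta^T x = 3 >= 0
print(predict(theta, np.array([1.0, 0.0, 0.0])))  # -1: theta^T x = -1 < 0
```

Folding the bias into $\theta$ this way is what lets the hyperplane pass through the origin in the augmented $(d+1)$-dimensional space.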
The Perceptron
$$h(x) = \operatorname{sign}(\theta^\top x) \quad \text{where} \quad \operatorname{sign}(z) = \begin{cases} +1 & \text{if } z \ge 0 \\ -1 & \text{if } z < 0 \end{cases}$$

• The perceptron uses the following update rule each time it receives a new training instance $(x^{(i)}, y^{(i)})$:

$$\theta_j \leftarrow \theta_j - \frac{\alpha}{2} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)}$$

where the term $h_\theta\!\left(x^{(i)}\right) - y^{(i)}$ is either 2 or -2
– If the prediction matches the label, make no change
– Otherwise, adjust θ
The Perceptron
• The perceptron uses the following update rule each time it receives a new training instance $(x^{(i)}, y^{(i)})$:

$$\theta_j \leftarrow \theta_j - \frac{\alpha}{2} \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)}$$

where the term $h_\theta\!\left(x^{(i)}\right) - y^{(i)}$ is either 2 or -2

• Re-write as $\theta_j \leftarrow \theta_j + \alpha y^{(i)} x_j^{(i)}$ (only upon misclassification)
– Can eliminate α in this case, since its only effect is to scale θ by a constant, which doesn't affect performance

Perceptron Rule: If $x^{(i)}$ is misclassified, do $\theta \leftarrow \theta + y^{(i)} x^{(i)}$
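A minimal sketch of this per-example rule in NumPy (illustrative, not from the slides; the bias is assumed folded into x as $x_0 = 1$):

```python
import numpy as np

def perceptron_update(theta, x, y):
    """One perceptron step for a single example (x, y), with y in {+1, -1}."""
    h = 1 if theta @ x >= 0 else -1    # current prediction, sign(theta^T x)
    if h != y:                         # misclassified: theta <- theta + y * x
        theta = theta + y * x
    return theta

theta = np.zeros(3)
theta = perceptron_update(theta, np.array([1.0, 2.0, -1.0]), -1)
print(theta)  # [-1. -2.  1.] -- the mistake moved theta by y * x
```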
Why the Perceptron Update Works
[Figure: a misclassified positive example x; the update rotates $\theta_{\text{old}}$ toward x, giving $\theta_{\text{new}}$]
Based on slide by Piyush Rai
Why the Perceptron Update Works
• Consider the misclassified example (y = +1)
– Perceptron wrongly thinks that $\theta_{\text{old}}^\top x < 0$
• Update: $\theta_{\text{new}} = \theta_{\text{old}} + yx = \theta_{\text{old}} + x$ (since $y = +1$)
• Note that

$$\theta_{\text{new}}^\top x = (\theta_{\text{old}} + x)^\top x = \underbrace{\theta_{\text{old}}^\top x}_{<\,0} + \underbrace{x^\top x}_{\|x\|_2^2 \,>\, 0}$$

• Therefore, $\theta_{\text{new}}^\top x$ is less negative than $\theta_{\text{old}}^\top x$
– So, we are making ourselves more correct on this example!
Based on slide by Piyush Rai
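A quick numeric check of this argument (made-up numbers, for illustration only):

```python
import numpy as np

x = np.array([1.0, 2.0, 0.5])            # a positive example (y = +1)
theta_old = np.array([-1.0, -0.5, 0.0])  # currently classifies x as negative

theta_new = theta_old + x                # the perceptron update for y = +1
print(theta_old @ x)  # -2.0  (wrong side of the hyperplane)
print(theta_new @ x)  #  3.25 (= -2.0 + ||x||^2 = -2.0 + 5.25)
```

Here a single update happens to fix the example outright; in general the score is only guaranteed to become less negative, by exactly $\|x\|_2^2$.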
The Perceptron Cost Function
• Prediction is correct if $y^{(i)} \theta^\top x^{(i)} > 0$
• Could have used 0/1 loss
$$J_{0/1}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell\left(\operatorname{sign}\left(\theta^\top x^{(i)}\right),\, y^{(i)}\right)$$

where $\ell(\cdot)$ is 0 if the prediction is correct, 1 otherwise
– Doesn't produce a useful gradient
Based on slide by Alan Fern
The Perceptron Cost Function
• The perceptron uses the following cost function
$$J_p(\theta) = \frac{1}{n} \sum_{i=1}^{n} \max\left(0,\, -y^{(i)}\, \theta^\top x^{(i)}\right)$$

– $\max(0, -y^{(i)} \theta^\top x^{(i)})$ is 0 if the prediction is correct
– Otherwise, it is the confidence in the misprediction
– Nice gradient
Based on slide by Alan Fern
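For concreteness, an illustrative NumPy sketch (not from the slides) of $J_p$ and one valid subgradient; the per-example update rule from earlier is stochastic gradient descent on this cost:

```python
import numpy as np

def perceptron_loss(theta, X, y):
    """J_p(theta) = (1/n) * sum_i max(0, -y_i * theta^T x_i)."""
    margins = y * (X @ theta)              # y_i * theta^T x_i, one entry per example
    return np.mean(np.maximum(0.0, -margins))

def perceptron_subgrad(theta, X, y):
    """One valid subgradient of J_p: each misclassified example contributes -y_i x_i / n."""
    margins = y * (X @ theta)
    mistakes = margins <= 0                # treat the boundary as a mistake, like the update rule
    return -(y[mistakes, None] * X[mistakes]).sum(axis=0) / len(y)
```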
Online Perceptron Algorithm
1.) Let $\theta \leftarrow [0, 0, \ldots, 0]$
2.) Repeat:
3.)   Receive training example $(x^{(i)}, y^{(i)})$
4.)   if $y^{(i)}\, \theta^\top x^{(i)} \le 0$   // prediction is incorrect
5.)     $\theta \leftarrow \theta + y^{(i)} x^{(i)}$

Online learning – the learning mode where the model update is performed each time a single observation is received
Batch learning – the learning mode where the model update is performed after observing the entire training set
Based on slide by Alan Fern
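A runnable Python version of this loop (a sketch under the slides' conventions; the `stream` iterable is a hypothetical stand-in for arriving observations, and x is assumed to include the constant feature $x_0 = 1$):

```python
import numpy as np

def online_perceptron(stream, d):
    """Online perceptron: update theta each time one observation (x, y) arrives."""
    theta = np.zeros(d)                  # 1.) theta <- [0, 0, ..., 0]
    for x, y in stream:                  # 2.-3.) receive training example (x, y)
        if y * (theta @ x) <= 0:         # 4.) prediction is incorrect
            theta = theta + y * x        # 5.) theta <- theta + y * x
    return theta

# Toy usage: x_0 = 1 is prepended; labels are +1 when x_1 > x_2
data = [(np.array([1.0, 2.0, 1.0]), +1),
        (np.array([1.0, 0.0, 1.0]), -1),
        (np.array([1.0, 3.0, 0.0]), +1)]
theta = online_perceptron(data, d=3)
```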
Online Perceptron Algorithm
[Figure: perceptron decision boundary evolving over updates; red points are labeled +, blue points are labeled -]
See the perceptron in action: [Link]/watch?v=vGwemZhPlsA
Based on slide by Alan Fern
Batch Perceptron
1.) Given training data $\{(x^{(i)}, y^{(i)})\}_{i=1}^{n}$
2.) Let $\theta \leftarrow [0, 0, \ldots, 0]$
3.) Repeat:
4.)   Let $\Delta \leftarrow [0, 0, \ldots, 0]$
5.)   for $i = 1 \ldots n$, do
6.)     if $y^{(i)}\, \theta^\top x^{(i)} \le 0$   // prediction for ith instance is incorrect
7.)       $\Delta \leftarrow \Delta + y^{(i)} x^{(i)}$
8.)   $\Delta \leftarrow \Delta / n$   // compute average update
9.)   $\theta \leftarrow \theta + \alpha \Delta$
10.) Until $\|\Delta\|_2 < \epsilon$

• Simplest case: α = 1 and don't normalize (skip step 8); this yields the fixed increment perceptron
• Guaranteed to find a separating hyperplane if one exists
Based on slide by Alan Fern
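A NumPy sketch of this batch procedure (illustrative; `max_iters` is a safeguard I added, not part of the slide's pseudocode):

```python
import numpy as np

def batch_perceptron(X, y, alpha=1.0, eps=1e-6, max_iters=1000):
    """Batch perceptron: average the updates over all mistakes each pass.

    X is (n, d) with the constant feature x_0 = 1 in column 0; y is in {+1, -1}^n.
    """
    n, d = X.shape
    theta = np.zeros(d)                          # 2.) theta <- [0, ..., 0]
    for _ in range(max_iters):
        delta = np.zeros(d)                      # 4.) Delta <- [0, ..., 0]
        for i in range(n):                       # 5.) for i = 1 ... n
            if y[i] * (theta @ X[i]) <= 0:       # 6.) ith prediction is incorrect
                delta += y[i] * X[i]             # 7.) Delta <- Delta + y^(i) x^(i)
        delta /= n                               # 8.) compute average update
        theta += alpha * delta                   # 9.) theta <- theta + alpha * Delta
        if np.linalg.norm(delta) < eps:          # 10.) until ||Delta||_2 < eps
            break
    return theta
```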
Improving the Perceptron
• The Perceptron produces many θ's during training
• The standard Perceptron simply uses the final θ at test time
– This may sometimes not be a good idea!
– Some other θ may be correct on 1,000 consecutive examples, but one mistake ruins it!
• Idea: Use a combination of multiple perceptrons
– (i.e., neural networks!)
• Idea: Use the intermediate θ's
– Voted Perceptron: vote on predictions of the intermediate θ's
– Averaged Perceptron: average the intermediate θ's
Based on slide by Piyush Rai
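A minimal sketch of the Averaged Perceptron idea (illustrative; the function name and epoch count are assumptions, not from the slides):

```python
import numpy as np

def averaged_perceptron(X, y, epochs=10):
    """Averaged perceptron: predict with the mean of all intermediate theta vectors."""
    n, d = X.shape
    theta = np.zeros(d)
    theta_sum = np.zeros(d)
    for _ in range(epochs):
        for i in range(n):
            if y[i] * (theta @ X[i]) <= 0:       # misclassified: standard update
                theta = theta + y[i] * X[i]
            theta_sum += theta                   # accumulate every intermediate theta
    return theta_sum / (epochs * n)              # average instead of the final theta
```

The Voted Perceptron instead records each intermediate θ along with how many examples it survived, and takes a weighted vote of their predictions at test time; averaging is the cheaper approximation of that idea.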