Practical Machine Learning
Verena Kaynig-Fitkau (vkaynig@[Link])
"We are drowning in information and starving for knowledge."
- John Naisbitt
Machine Learning
- Analyze training data
- Make predictions for new unseen data: supervised learning
- Find patterns: unsupervised learning
Machine Learning
- Supervised learning: SVM, decision tree, boosting, random forest
- Unsupervised learning: k-means, mean shift
Supervised Learning
[Figure: labeled data points with features x1, x2, separated by a hyperplane]
Features are important
[Figure: classification example using the features roundness and weight]
Features are important
[Figure: classification example using the features shape and color]
Google's Self-Driving Car
Car Features
- Laser scan
- Intensity model
- Elevation model
- Lane model
- Camera vision
- 2D stationary map
So just measure everything?
More features = better classification?
- Practical issues: data volume, computation overhead
- Theoretical issues: generalization performance, curse of dimensionality
Supervised Learning
[Figure: labeled data points with features x1, x2, separated by a hyperplane]
Perceptron
- x: data point
- y: label
- w: weight vector
- b: bias
Prediction: y = sign(w·x + b)
[Figure: perceptron with inputs x1, x2, x3, weights w1, w2, w3, bias b, and output in {-1, +1}]
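A minimal sketch of perceptron training (the toy data and function names are illustrative, not from the slides): the weights and bias are updated only when a point is misclassified.

```python
import numpy as np

def perceptron_train(X, y, epochs=100, lr=1.0):
    """Learn w, b so that sign(w.x + b) matches the labels y in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (np.dot(w, xi) + b) <= 0:  # misclassified -> update
                w += lr * yi * xi
                b += lr * yi
    return w, b

# Toy linearly separable data.
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, 1.0]])
y = np.array([1, 1, -1, -1])
w, b = perceptron_train(X, y)
print(np.sign(X @ w + b))  # reproduces y
```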
The XOR Problem
[Figure: XOR-labeled data points are not linearly separable]
Support Vector Machine
- Widely used for all sorts of classification problems
- Some people say it is the best off-the-shelf classifier out there
[Link]/isabelle/Projects/SVM/[Link]
Maximum Margin Classification
[Figure: two separating lines in the (x1, x2) plane; the maximum margin separator is preferred]
What about outliers?
- ξ: slack variables
[Figure: data in the (x1, x2) plane where slack variables allow margin violations]
XOR problem revised
Did we add information to make the problem separable?
[Figure: XOR data becomes separable after adding a feature; the separating plane sits at x = 0]
Polynomial Kernel in 3D
[Figure: quadratic kernel mapping of the data into 3D]
Kernel Functions
- Polynomial: K(x, x') = (x·x' + c)^d
- Radial basis function (RBF): K(x, x') = exp(-γ ||x - x'||²)
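A small sketch computing both kernels directly from the standard definitions above (the parameter values c, d, and gamma are illustrative):

```python
import numpy as np

def polynomial_kernel(x, z, c=1.0, d=2):
    # K(x, z) = (x.z + c)^d
    return (np.dot(x, z) + c) ** d

def rbf_kernel(x, z, gamma=1.0):
    # K(x, z) = exp(-gamma * ||x - z||^2)
    return np.exp(-gamma * np.sum((x - z) ** 2))

x = np.array([1.0, 2.0])
z = np.array([2.0, 0.5])
print(polynomial_kernel(x, z))  # (3 + 1)^2 = 16.0
print(rbf_kernel(x, z))         # exp(-3.25) ≈ 0.0388
```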
Kernel Trick for SVMs
- Arbitrarily many dimensions
- Little computational cost
- The maximal margin helps with the curse of dimensionality
SVM Applet
http://[Link]/education/lectures_and_seminars/annex_estat/Classifier/[Link]
Tips and Tricks
- SVMs are not scale invariant
- Check if your library normalizes by default
- Normalize your data (see the sketch below):
  - mean: 0, std: 1
  - or map to [0,1] or [-1,1]
- Normalize the test set in the same way!
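A minimal scikit-learn sketch of this advice (the arrays are placeholders): fit the scaler on the training data only, then reuse the same statistics on the test set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
X_test = np.array([[1.5, 250.0]])

scaler = StandardScaler()                  # rescales each feature to mean 0, std 1
X_train_s = scaler.fit_transform(X_train)  # statistics estimated on training data only
X_test_s = scaler.transform(X_test)        # test set normalized the same way, never refit
```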
Tips and Tricks
- The RBF kernel is a good default
- For parameters, try exponential sequences
- Read: Chih-Wei Hsu et al., A Practical Guide to Support Vector Classification, Bioinformatics (2010)
Parameter Tuning
Given a classification task:
- Which kernel?
- Which kernel parameter values?
- Which value for C?
Try different combinations and take the best.
Grid Search
[Figure: classification performance over a grid of parameter combinations]
Zang et al., Identification of heparin samples that contain impurities or contaminants by chemometric pattern recognition analysis of proton NMR spectral data, Anal Bioanal Chem (2011)
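A grid search sketch with scikit-learn, trying exponential sequences for C and gamma as recommended above (the data set and parameter ranges are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)

# Exponential parameter sequences, cross-validated over the grid.
param_grid = {"C": 2.0 ** np.arange(-5, 16, 2),
              "gamma": 2.0 ** np.arange(-15, 4, 2)}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```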
Multi-Class: One vs. All
- Train n classifiers for n classes
- Take the classification with the greatest positive margin
- Slow training
Multi-Class: One vs. One
- Train n(n-1)/2 classifiers
- Take a majority vote
- Fast training
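A short scikit-learn sketch contrasting the two schemes on the iris data (the data set is just an example):

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

ovr = OneVsRestClassifier(SVC(kernel="rbf")).fit(X, y)  # n classifiers
ovo = OneVsOneClassifier(SVC(kernel="rbf")).fit(X, y)   # n(n-1)/2 classifiers, majority vote
print(ovr.predict(X[:3]), ovo.predict(X[:3]))
```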
Decision Tree
[Figure: example tree - "after 10 pm?", "got electricity?", and "got new dvd?" lead to "call friend", "read book", "watch tv", or "play computer"]
Decision Trees
- Fast training
- Fast prediction
- Easy to understand
- Easy to interpret
Decision Tree - Idea
[Figure: recursive axis-aligned partitioning of the feature space into regions A-E]
Bishop, Pattern Recognition and Machine Learning, Springer, 2006
Decision Tree - Prediction
[Figure: a query point is routed through the splits to one of the regions A-E]
Decision Tree - Training
Learn the tree structure:
- which feature to query
- which threshold to choose
[Figure: tree with leaves A-E]
Node Purity
[Figure: class counts at each node of the example tree]
When to Stop
- the node contains only one class
- the node contains fewer than x data points
- the max depth is reached
- the node purity is sufficient
- you start to overfit => cross-validation
Decision Trees - Disadvantages
- Sensitive to small changes in the data
- Overfitting
- Only axis-aligned splits
Decision Trees vs SVM
[Figure: decision boundaries of a decision tree and an SVM compared]
Hastie et al., The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer (2009)
Wisdom of Crowds
"The collective knowledge of a diverse and independent body of people typically exceeds the knowledge of any single individual, and can be harnessed by voting."
- James Surowiecki
http://[Link]/
Ensemble Methods
- A single decision tree does not perform well
- But it is super fast
- What if we learn multiple trees?
- For multiple trees we need even more data!
Bootstrap
- Resampling method from statistics
- Useful to get error bars on estimates
- Take N data points
- Draw N times with replacement
- Get an estimate from each bootstrapped sample
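A minimal NumPy sketch of the bootstrap (the normally distributed toy data is illustrative): draw N points with replacement, recompute the estimate on each sample, and read the error bar off the spread.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=100)  # N data points

# Draw N times with replacement; compute the estimate on each bootstrap sample.
boot_means = [rng.choice(data, size=len(data), replace=True).mean()
              for _ in range(1000)]
print(np.mean(boot_means), np.std(boot_means))   # estimate and its error bar
```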
Bagging
- Bootstrap aggregating
- Sample with replacement from your data set
- Learn a classifier for each bootstrap sample
- Average the results
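A bagging sketch with scikit-learn (the synthetic data and tree count are illustrative): 50 trees, each fit on its own bootstrap sample, combined by voting.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

# Each tree sees a bootstrap sample; predictions are aggregated by voting.
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                        bootstrap=True, random_state=0).fit(X, y)
print(bag.score(X, y))
```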
Bagging Example
[Figure: decision boundaries of bagged classifiers in the (x1, x2) plane]
Bagging
- Reduces overfitting (variance)
- Normally uses one type of classifier
- Decision trees are popular
- Easy to parallelize
Boosting
- Also an ensemble method, like bagging
- But: weak learners evolve over time, and votes are weighted
- Better than bagging for many applications
- Very popular method
Boosting
"Boosting is one of the most powerful learning ideas introduced in the last twenty years."
Hastie et al., The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer (2009)
AdaBoost
[Figure: AdaBoost decision boundary in the (x1, x2) plane]
AdaBoost
Initialize the weights for the data points.
For each iteration:
- Fit a classifier to the training data
- Compute the weighted classification error
- Compute the weight for the classifier from the error
- Update the weights for the data points
The final classifier is a weighted sum of all single classifiers (see the sketch below).
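A minimal AdaBoost sketch following the steps above, assuming labels in {-1, +1} and decision stumps as weak learners (the epsilon guard against division by zero is an implementation detail, not from the slides):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, n_rounds=20):
    n = len(y)
    w = np.full(n, 1.0 / n)                      # initialize data point weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.sum(w * (pred != y)) / np.sum(w)        # weighted error
        alpha = 0.5 * np.log((1 - err) / (err + 1e-10))  # classifier weight
        w *= np.exp(-alpha * y * pred)                   # reweight data points
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    # Final classifier: sign of the weighted sum of all single classifiers.
    return np.sign(sum(a * s.predict(X) for s, a in zip(stumps, alphas)))
```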
AdaBoost
[Figure from Hastie et al., The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer (2009)]
AdaBoost
- Introduced by Freund and Schapire in 1995
- Worked great, but nobody understood why!
- Five years later (Friedman et al. 2000): AdaBoost minimizes an exponential loss function
- There still are open questions
Random Forest
- Builds upon the idea of bagging
- Each tree is built from a bootstrap sample
- Node splits are calculated from random feature subsets
http://[Link]%[Link]/articles/about/fun
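A random forest sketch with scikit-learn (the data and parameter values are illustrative); max_features controls the size of the random feature subset considered at each split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Bootstrap samples per tree + random feature subsets per split.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            random_state=0).fit(X, y)
print(rf.score(X, y))
```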
Random Forest
- All trees are fully grown, no pruning
- Two parameters: number of trees, number of features
Random Forest Error Rate
Error depends on:
- Correlation between trees (higher is worse)
- Strength of single trees (higher is better)
Increasing the number of features for each split:
- Increases correlation
- Increases the strength of single trees
Out of Bag Error
- Each tree is trained on a bootstrapped sample
- About 1/3 of the data points are not used for training
- Predict these unseen points with each tree
- Measure the error
Out of Bag Error
[Figure: data points -> bootstrap sample used for training; the unused data points serve as the test set]
Out of Bag Error
- Very similar to cross-validation
- Measured during training
- Can be too optimistic
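A sketch of the out-of-bag error in scikit-learn (the data is illustrative): with oob_score=True each point is evaluated only by the trees that never saw it.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, random_state=0)

rf = RandomForestClassifier(n_estimators=100, oob_score=True,
                            random_state=0).fit(X, y)
print(rf.oob_score_)  # out-of-bag accuracy, measured during training
```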
Variable Importance
- Again use the out-of-bag samples
- Predict the class for these samples
- Randomly permute the values of one feature
- Predict the classes again
- Measure the decrease in accuracy
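A permutation importance sketch; note that scikit-learn's permutation_importance uses a held-out set rather than the out-of-bag samples, but the idea (permute one feature, measure the accuracy drop) is the same.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Permute one feature at a time and measure the decrease in accuracy.
result = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
print(result.importances_mean)
```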
Tempting Scenario
- Run a random forest with all features
- Reduce the number of features based on the importance weights
- Run again with the reduced feature set and report the out-of-bag error
This does not measure test performance!
Unbalanced Classes
- The problem: one class has many more training examples than the other
- Oversample the smaller class
- Subsample the larger class
- Subsample for each tree!
Random Forest Subsampling
[Figure: each tree is trained on its own subsample of the data]
Random Forest
- Similar to bagging
- Easy to parallelize
- Packaged with some neat functions:
  - Out-of-bag error
  - Feature importance measure
  - Proximity estimation
Cascade Classifier
- Ensemble methods are strong
- But prediction is slow
- Solution: make prediction faster
- Idea: build a cascade

Cascade Classifier
http://[Link]/wiki/Viola%E2%80%93Jones_object_detection_framework
Viola Jones Face Detection
http://[Link]/
Viola Jones Face Detection
- Takes long to train
- Prediction in real time!
- Widely used today
Summary
- SVMs
- Decision Trees
- Bootstrap, Bagging, Boosting
- Random Forest
- Cascade Classifier

Further Reading
Error Measures
- True positive (tp), true negative (tn), false positive (fp), false negative (fn)

            predicted
             1    -1
true    1 | tp    fn
       -1 | fp    tn
TPR and FPR
- True Positive Rate: TPR = tp / (tp + fn)
- False Positive Rate: FPR = fp / (fp + tn)
Precision Recall
- Recall: tp / (tp + fn)
- Precision: tp / (tp + fp)
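A small sketch computing these quantities from the confusion matrix (the toy labels are illustrative):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = np.array([1, 1, 1, -1, -1, -1, -1, 1])
y_pred = np.array([1, -1, 1, -1, 1, -1, -1, 1])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp / (tp + fn))  # TPR = recall
print(fp / (fp + tn))  # FPR
print(tp / (tp + fp))  # precision
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred))
```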
Precision Recall Curve
[Figure: precision vs. recall, both axes ranging from 0 to 1]
Comparison
J. Davis & M. Goadrich, The Relationship Between Precision-Recall and ROC Curves, ICML (2006)
F-measure
- Weighted harmonic mean of precision and recall:
  F_β = (1 + β²) · precision · recall / (β² · precision + recall)
- Usual case: β = 1
- Increasing β allocates more weight to recall
Clustering Evaluation Criteria
- Based on expert knowledge
- Debatable for real data
- Hidden unknown structures could be present
- Do we even want to just reproduce known structure?
Rand Index
- Percentage of correct classifications
- Compare pairs of elements:
  RI = (tp + tn) / (tp + tn + fp + fn)
- fp and fn are equally weighted
Stability
- What is the right number of clusters?
- What makes a good clustering solution?
- Clustering should generalize!

Stability
[Figure: comparing clusterings across resampled data sets]
Gini Impurity
Example: 4 red, 3 green, 3 blue data points.
- Probability of drawing each class in a random sample: red 4/10, green 3/10, blue 3/10
- Misclassification probability for red: 4/10 * (3/10 + 3/10), i.e. the probability of a true red point times the probability of a wrong class prediction
Gini Impurity
- Number of classes: K
- Number of data points: N
- Number of data points of class i: N_i
With p_i = N_i / N, the expected rate of wrong class predictions is
  Gini = Σ_i p_i (1 - p_i) = 1 - Σ_i p_i²
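A direct implementation of this formula (the function name is illustrative), checked against the 4-red/3-green/3-blue example above:

```python
import numpy as np

def gini_impurity(counts):
    """Gini impurity 1 - sum_i p_i^2 from the per-class counts at a node."""
    p = np.asarray(counts, dtype=float)
    p /= p.sum()
    return 1.0 - np.sum(p ** 2)

# Slide example: 4 red, 3 green, 3 blue.
print(gini_impurity([4, 3, 3]))  # 1 - (0.16 + 0.09 + 0.09) = 0.66
```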
Gini Impurity
[Figure from Hastie et al., The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer (2009)]
Node Purity Gain
Compare:
- the Gini impurity of the parent node
- the Gini impurity of the child nodes
[Figure: a split of parent node A into child nodes B and C]
Pseudocode
1. Check for the base cases
2. For each attribute a, calculate the gain from splitting on a
3. Let a_best be the attribute with the highest gain
4. Create a decision node that splits on a_best
5. Repeat on the sub-nodes
http://[Link]/wiki/C4.5_algorithm
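A runnable Python sketch of this pseudocode for numeric features: a simplified ID3/CART-style learner using Gini gain, not the full C4.5 algorithm (all names are illustrative).

```python
import numpy as np

def gini(y):
    # Gini impurity 1 - sum_i p_i^2 of the labels at a node.
    _, counts = np.unique(y, return_counts=True)
    p = counts / len(y)
    return 1.0 - np.sum(p ** 2)

def majority(y):
    values, counts = np.unique(y, return_counts=True)
    return values[np.argmax(counts)]

def build_tree(X, y, depth=0, max_depth=3):
    # Base cases: pure node or maximum depth reached -> leaf with majority class.
    if len(np.unique(y)) == 1 or depth == max_depth:
        return majority(y)
    best = None
    for a in range(X.shape[1]):            # for each attribute a
        for t in np.unique(X[:, a])[:-1]:  # candidate thresholds
            left = X[:, a] <= t
            # Gain: parent impurity minus weighted child impurities.
            gain = gini(y) - (left.mean() * gini(y[left])
                              + (~left).mean() * gini(y[~left]))
            if best is None or gain > best[0]:
                best = (gain, a, t)
    if best is None:                       # no split possible
        return majority(y)
    _, a, t = best                         # split on a_best
    left = X[:, a] <= t
    return (a, t, build_tree(X[left], y[left], depth + 1, max_depth),
                  build_tree(X[~left], y[~left], depth + 1, max_depth))

def predict_one(node, x):
    # Route a point through the splits until a leaf (class label) is reached.
    while isinstance(node, tuple):
        a, t, lo, hi = node
        node = lo if x[a] <= t else hi
    return node
```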