Introduction to Machine Learning & Deep Learning (Fall 2023)
Lecture 6: Supervised Learning – Classification - SVM
Prof. Damian Borth
Last Lecture
• Supervised Learning
– Classification
• Classifier
– Naïve Bayes
– k-Nearest Neighbour
– Logistic Regression
– Support Vector Machine
• Classifier Fusion
This Lecture
• Supervised Learning
– Classification
• Classifier
– Naïve Bayes
– k-Nearest Neighbour
– Logistic Regression
– Support Vector Machine
• Classifier Fusion
Supervised Learning
Types of Machine Learning
Machine Learning
Supervised Learning
• Objective: learn the relationship between data and a desired output
• Data x contains labels c whose relationship is to be learned → labels are known
• “Learning known patterns”
• e.g. decision trees, neural nets, support vector machines etc.
• Tasks: Classification, Regression
Semi-supervised Learning
• Objective: learn structure using only a few labels
• Data: → few labels are known, → many labels are unknown
Self-supervised Learning
• Objective: learn a representation of the data via controlled pseudo-labels for a downstream task
• Data: → labels are unknown, → representation & linear classifier
Unsupervised Learning
• Objective: identification of unknown distributions, patterns and dependencies
• Data x contains dependencies or patterns to be observed → labels are unknown
• “Learning unknown patterns”
• e.g. clustering algorithms, principal component analysis, self-organizing maps etc.
• Tasks: Clustering, Dimensionality Reduction
Overview
Agenda
1. What is supervised learning (Classification)?
2. How to build an “optimal” classifier ?
3. What kind of classifiers are there ?
a) “Naive” Bayes
b) Nearest Neighbors
c) Logistic Regression
d) Support Vector Machine (SVM)
4. How to combine (“fuse”) distinct classifiers?
5. Summary and conclusion
Parametric vs Non-parametric
• So far, we assumed P(x|c) to be Gaussian.
• What about these distributions?
• Often, we don't know the parametric form of P(x|c)
• Possible approaches:
• mixtures of Gaussians
• non-parametric methods (no parameters, no training)
k-Nearest Neighbor - Idea
Intuitive Understanding
Idea
“Assign each unknown example x to the majority class y of its k closest neighbors, where k is a parameter.”
[Figure: an unknown example x among training points of class y=0 and class y=1, with neighborhoods drawn for k=1, k=3, and k=5]
k-Nearest Neighbor - Approach
Given
• A set of labeled training samples {xi, yi}
• xi - feature representation of examples
• yi - class labels (e.g. document type, rating on YouTube etc.)
• An unknown sample x whose target label we aim to predict
Classification Algorithm
• Compute the distance D(x, xi) of x to every training sample xi
• Select the k closest instances xi1 … xik and their class labels yi1 … yik
• Classify x according to the majority class of its k neighbors
• Calculating the majority class:
$P(y \mid x) = \frac{1}{k}\sum_{j=1}^{k}\delta(y_{i_j}, y), \qquad \delta(y_{i_j}, y)=\begin{cases}1 & \text{if } y_{i_j}=y\\ 0 & \text{if } y_{i_j}\neq y\end{cases}$
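A minimal sketch of the classification algorithm above in Python/NumPy; the toy data and the choice of Euclidean distance are illustrative, not from the slides.

```python
import numpy as np
from collections import Counter

def knn_predict(x, X_train, y_train, k=3):
    """Assign x to the majority class of its k closest training samples."""
    # compute the distance D(x, x_i) to every training sample x_i
    dists = np.linalg.norm(X_train - x, axis=1)
    # select the k closest instances and their class labels
    nearest = np.argsort(dists)[:k]
    # majority vote over the k neighbor labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# toy 2D example with two classes
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(np.array([0.8, 0.9]), X_train, y_train, k=3))  # -> 1
```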
k-Nearest Neighbor – Distance Measures
“Euclidean” Distance (”L2-norm”)
• Used in the context of continuous variables
• Not very robust, single solution
“Manhattan” Distance (“L1-norm”)
• Used in the context of binary or encoded variables
• Robust; possibly multiple solutions
“Hamming” Distance
• Used in the context of categorical variables
• E.g. distance between names, document types
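The three distance measures as small Python functions (a sketch; function names and the example inputs are my own).

```python
import numpy as np

def euclidean(a, b):
    """L2 norm - continuous variables; not very robust to outliers."""
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    """L1 norm - binary / encoded variables; more robust."""
    return np.sum(np.abs(a - b))

def hamming(a, b):
    """Categorical variables - number of positions that differ."""
    return int(np.sum(np.asarray(a) != np.asarray(b)))

a, b = np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 3.0])
print(euclidean(a, b), manhattan(a, b))               # 2.236..., 3.0
print(hamming(["pdf", "invoice"], ["pdf", "email"]))  # 1
```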
k-Nearest Neighbor – Different “k” Example
[Figure: k-NN decision boundaries for k = 1, 3, 10, 50, and 200]
k-Nearest Neighbor
Summary and Discussion
Pros
• “Non-parametric” approach: “no” assumptions about the data distribution
• Simple to implement
• Flexible in the choice of features / distance measures
Cons
• Computationally expensive: time (computes all distances) and space (stores all examples)
• Sensitive to outliers / irrelevant features
Use Cases
• Spam filtering
• Recommender systems
• Text classification
• Document similarity
$P(y \mid x) = \frac{1}{k}\sum_{j=1}^{k}\delta(y_{i_j}, y), \qquad \delta(y_{i_j}, y)=\begin{cases}1 & \text{if } y_{i_j}=y\\ 0 & \text{if } y_{i_j}\neq y\end{cases}$
k-Nearest Neighbor
• When is nearest neighbor (NN) successful?
• we need many samples in small regions!
• Is nearest neighbor better than Gaussians?
• not necessarily – if the underlying class-conditional densities
are truly Gaussian and we can determine parameters reliably,
Gaussians are the optimal model!
• Are there really no parameters?
• there‘s K as a hyper-parameter to choose
• low K = high variance
• high K = oversmoothing
• good compromise in practice: K=√n
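A small sketch of choosing K in practice, comparing the K = √n heuristic above with a few other values by cross-validated accuracy; the use of scikit-learn and the synthetic dataset are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
k_sqrt = int(np.sqrt(len(X)))  # heuristic starting point: K = sqrt(n) = 20

# low K -> high variance, high K -> oversmoothing; check a few values
for k in [1, 5, k_sqrt, 100]:
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    print(f"K={k:3d}  cross-validated accuracy={acc:.3f}")
```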
Overview
Agenda
1. What is supervised learning (Classification)?
2. How to build an “optimal” classifier ?
3. What kind of classifiers are there ?
a) “Naive” Bayes
b) Nearest Neighbors
c) Logistic Regression
d) Support Vector Machine (SVM)
4. How to combine (“fuse”) distinct classifiers?
5. Summary and conclusion
Discriminative Models
• We saw a generative model: „Gaussians“
• we know P(x|c) and P(c), i.e. (via Bayes’ rule) we also know P(c|x)
• we can „generate“ samples from P(c|x)
• draw c' from P(c)
• draw x' from P(x|c)
• Alternative:
• omit P(x|c) and P(c), and directly estimate P(c|x) !
→ discriminative models: P(c|x) = fΘ(x)
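To make the “generate samples” point concrete: a minimal sketch that draws c' from P(c) and then x' from a Gaussian P(x|c). The class names, priors, and Gaussian parameters are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# assumed priors P(c) and Gaussian class-conditionals P(x|c): (mean, std) of a 1D feature
priors = {"salmon": 0.6, "sea_bass": 0.4}
gauss = {"salmon": (2.0, 0.5), "sea_bass": (4.0, 0.7)}

def generate_sample():
    c = rng.choice(list(priors), p=list(priors.values()))  # draw c' from P(c)
    mu, sigma = gauss[c]
    x = rng.normal(mu, sigma)                              # draw x' from P(x|c)
    return round(float(x), 2), c

print([generate_sample() for _ in range(3)])
```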
Logistic Regression - Introduction
Intuitive Understanding
Linear Regression
• Hypothesis: P(c|x) = fΘ(x1) = m·x1 + b, with Θ = (m, b)
[Figure: “Sea Bass?” (Yes = 1, No = 0) plotted over Lightness x1, with a fitted regression line and the 0.5 threshold]
Classification Hypothesis
Threshold the classifier output fΘ(x1) at 0.5:
• If fΘ(x1) ≥ 0.5, predict c = 1 “Sea Bass”
• If fΘ(x1) < 0.5, predict c = 0 “Salmon”
Challenge
“How to handle anomalies or different modalities in the data?”
Logistic Regression - Introduction
Intuitive Understanding
Linear Regression - Challenge: Outlier
• Is P(c|x) = fΘ(x1) = m·x1 + b, with Θ = (m, b), still a good hypothesis?
[Figure: the same Sea Bass / Salmon data with an additional outlier at high lightness, shifting the fitted line]
Classification Hypothesis
Threshold the classifier output fΘ(x1) at 0.5:
• If fΘ(x1) ≥ 0.5, predict c = 1 “Sea Bass”
• If fΘ(x1) < 0.5, predict c = 0 “Salmon”
Idea
Improve “Linear Regression” by:
(1) a non-linear hypothesis fΘ
(2) learnable parameters Θ
→ “Logistic Regression”
Logistic Regression - Idea (one dimension)
• remember: in the Gaussian case, P(c|x) was a sigmoid function
• fΘ(x) = σ(θ1·x1 + θ0), i.e. a sigmoid applied to a linear function
• where σ(z) = 1 / (1 + e^(-z))
Logistic Regression - Idea (more dimensions)
• In more dimensions, we have a weight vector w: fw(x) = σ(<w,x> + b)
• The decision boundary becomes a (linear) hyperplane
• We can omit b by using augmented vectors x → [x,1], w → [w,b]: fw(x) = σ(<w,x>)
Logistic Regression - Approach
Given
• A set of labeled training samples {xi, ci}
• xi - feature representation of examples
• ci - class labels (e.g. document type, rating on YouTube etc.)
• For each weight configuration w we can compute the classification loss 𝓛 “Error”
Training Algorithm (see Bishop p. 205f.)
• Initialize the weight configuration w0 and set k = 0 (“Gradient Descent Learning”)
• Until convergence of the loss 𝓛, do:
• Update the weight configuration wk according to gradient descent learning
• Increase k = k + 1
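A compact sketch of the training loop above: gradient descent on the cross-entropy loss of a logistic regression model with augmented vectors. NumPy only; the learning rate, step count, and toy “lightness” data are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, c, lr=0.1, n_steps=2000):
    Xa = np.hstack([X, np.ones((len(X), 1))])  # augmented vectors [x, 1] absorb the bias
    w = np.zeros(Xa.shape[1])                  # initialize w_0
    for _ in range(n_steps):                   # "until convergence" (fixed steps here)
        p = sigmoid(Xa @ w)                    # P(c = 1 | x) under the current w
        grad = Xa.T @ (p - c) / len(c)         # gradient of the cross-entropy loss
        w -= lr * grad                         # w_{k+1} = w_k - lr * gradient
    return w

# toy 1D data: lightness values, c = 1 "Sea Bass", c = 0 "Salmon"
X = np.array([[1.0], [1.5], [2.0], [3.5], [4.0], [4.5]])
c = np.array([0, 0, 0, 1, 1, 1])
w = train_logistic_regression(X, c)
print(sigmoid(np.array([2.5, 1.0]) @ w))  # P(sea bass | lightness = 2.5)
```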
Logistic Regression
Summary and Discussion
Pros
• “Discriminative” approach: learn only what is needed, P(c|x)
• Results are easy to interpret
• Can be trained fast
Cons
• Non-deterministic results
• May end up in a local minimum
• Learns only linear decision boundaries
• Vulnerable to overfitting
Use Cases
• Predictive maintenance
• Medical treatment response
• Customer churn prediction
• Loan default prediction
Overview
Agenda
1. What is supervised learning (Classification)?
2. How to build an “optimal” classifier ?
3. What kind of classifiers are there ?
a) “Naive” Bayes
b) Nearest Neighbors
c) Logistic Regression
d) Support Vector Machine (SVM)
4. How to combine (“fuse”) distinct classifiers?
5. Summary and conclusion
Support Vector Machines (SVMs)
• Support Vector Machines long led the state of the art in many machine learning tasks (including image recognition)
• A classifier benchmarking experiment:
– More than 100 datasets from the public UCI machine learning repository
– 7 classifiers, with parameters (for example, k in k-NN) optimized by a cross-validation grid search
– this illustration counts the datasets on which each classifier works best
Support Vector Machines (SVMs)
• SVMs were particularly successful in image recognition
• visual words + SVMs = “standard pipeline”: Visual Word Feature Extraction → visual-word histogram, e.g. [ 2, 0, 2, 0 ] → SVM Classification
Approach
• maximum margin classification
• non-linearity by kernel functions
[Figure: a dataset that is not separable in (x, y) coordinates becomes separable after mapping to (angle, distance from origin)]
SVM: Notation
• Given:
– training samples x1, .., xn ∈ R^d with labels y1, .., yn ∈ {-1, 1}
• Geometric approach:
– find a hyperplane w that separates the classes
– f(x) = <w,x> + b
– use “augmented vectors” (x → [x,1]), so that f(x) = <w,x>
– classification: class presence ↔ f(x) > 0
SVM: The maximum-margin Principle
• Which hyperplane is the best?
– multiple separating hyperplanes are possible
• Guiding Principles / Approaches
– generative models
(e.g., Gaussians with identical covariances)
– logistic regression
(likelihood maximization)
– perceptron
(error minimization)
– maximum-margin principle
SVM: Margin Maximization
• To find the hyperplane w that maximizes the margin, let us first require that for all samples xi the following holds: yi·<w,xi> ≥ 1
SVM: Margin Maximization
We have two kinds of samples:
– “safe” samples xi, which are “far away” from the decision boundary: yi·<w,xi> > 1
– “support vectors”: samples xi that lie exactly on the margin: yi·<w,xi> = 1
Relationship between the margin γ and w:
• the size of the margin is γ = 1/‖w‖₂
• maximizing the margin is therefore equivalent to minimizing ‖w‖₂ (or, equivalently, ‖w‖²)
SVM: Margin Maximization
• Altogether, a decision boundary w* that maximizes the margin can be computed by solving the following optimization problem (see below):
• This is a “simple” optimization problem
– the objective function is quadratic, i.e., differentiable and convex
→ quadratic programming
– the constraints are all linear
– a globally optimal solution can be computed in O(n³)
– in practice, the computational effort of SVM training is ≈ O(c·n^1.8)
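For reference, the standard form of the hard-margin problem sketched above (the slide’s own formula is an image, so this is the textbook formulation consistent with the constraint yi·<w,xi> ≥ 1):

$$
w^{*} \;=\; \arg\min_{w}\ \tfrac{1}{2}\lVert w\rVert^{2}
\qquad \text{s.t.} \qquad
y_i\,\langle w, x_i\rangle \;\ge\; 1 \quad \text{for } i = 1,\dots,n
$$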
Classification Problem: Non-Separability
• Problem: in practice, datasets are often not linearly separable!
• We can solve this problem with two extensions:
– Slack Variables: allow for errors during training in favor of a max-margin hyperplane w
– Kernel Functions: map samples to a (proper) higher-dimensional vector space and solve the problem there in a linear way
SVM: Slack Variables
• What is the better hyperplane for this dataset?
→ allow some training error: introduce slack variables
[Figure: left: no training errors but a small margin (= likely test errors); right: one training error but a larger margin (= likely fewer test errors)]
SVM: Max Margin & Slack Variables
• Solution: introduce slack variables ξ1, .., ξn (see the objective below)
• We can satisfy all constraints by making the ξi large enough
• The hyper-parameter C realizes the balancing:
→ C = ∞, i.e. “hard” margin: all ξi are 0, no training error allowed
→ the smaller C, the larger the margin (at the cost of incorrectly classified training samples)
• The target function is still convex (“simple” optimization)
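The corresponding soft-margin objective in standard notation (again reconstructed, since the slide’s equation is an image): the slack ξi relaxes each margin constraint, and C weighs the total slack against the margin term.

$$
w^{*} \;=\; \arg\min_{w,\,\xi}\ \tfrac{1}{2}\lVert w\rVert^{2} \;+\; C\sum_{i=1}^{n}\xi_i
\qquad \text{s.t.} \qquad
y_i\,\langle w, x_i\rangle \;\ge\; 1 - \xi_i,\quad \xi_i \ge 0
$$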
SVM: Non-(Linear)-Separability
• Slack variables are not enough!
– What is the best decision boundary on this dataset?
• We need non-linear decision boundaries
• Solutions:
– higher-order decision functions
– classifier stacking
– neural networks (will be covered later)
– data transformation (kernel functions)
SVM: Data Transformation φ
• In the example, we can find a transformation φ for the samples xi such that they become linearly separable
→ transform each xi to polar coordinates: φ(x, y) = (angle, distance from origin)
[Figure: the dataset in (x, y) coordinates and, after applying φ, in (angle, distance from origin) coordinates]
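A small Python sketch of such a transformation φ: mapping 2D points to (angle, distance from origin) makes two concentric rings linearly separable. The ring data is synthetic.

```python
import numpy as np

def to_polar(X):
    """phi: (x, y) -> (angle, distance from origin)."""
    return np.column_stack([np.arctan2(X[:, 1], X[:, 0]),
                            np.linalg.norm(X, axis=1)])

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 200)
radius = np.concatenate([rng.normal(1.0, 0.1, 100),   # inner ring, class 0
                         rng.normal(3.0, 0.1, 100)])  # outer ring, class 1
X = np.column_stack([radius * np.cos(theta), radius * np.sin(theta)])
y = np.array([0] * 100 + [1] * 100)

Z = to_polar(X)
# in polar coordinates, a horizontal line at distance = 2 separates the classes
print(np.mean((Z[:, 1] > 2.0) == (y == 1)))  # -> 1.0
```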
SVM: Data Transformation φ
• Linear Classification with Data Transformation
→ define a feature transformation φ: R^d → R^m
→ perform classification on φ(xi) instead of xi
SVM: Kernel Trick & Representer Theorem
• Finding „good“ data transformations for the classification problem can be difficult
• Instead, we will omit the transformation φ(x) and use a similarity function k(xi,xj) that compares two samples xi, xj → this approach is called the kernel trick
• The similarity functions k(xi,xj) are called kernel functions
• The Representer Theorem is the basis of the kernel trick
• It tells us that the maximum-margin
solution lies in the subspace spanned
by the training samples, i.e. we can re-
write the maximum-margin solution w as shown below.
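In standard notation, the rewrite that the Representer Theorem allows (the slide’s formula is an image; φ denotes the feature transformation and αi the coefficients estimated during training):

$$
w^{*} \;=\; \sum_{i=1}^{n} \alpha_i\, y_i\, \phi(x_i)
\qquad\Longrightarrow\qquad
f(x) \;=\; \langle w^{*}, \phi(x)\rangle
\;=\; \sum_{i=1}^{n} \alpha_i\, y_i\, \langle \phi(x_i), \phi(x)\rangle
\;=\; \sum_{i=1}^{n} \alpha_i\, y_i\, k(x_i, x)
$$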
SVM: Kernel Trick & Representer Theorem
Using the Representer Theorem, we can rewrite the maximum-margin optimization problem in terms of the training samples only: substituting w = Σi αi yi φ(xi), every inner product ⟨φ(xi), φ(xj)⟩ can be replaced by the kernel function k(xi,xj), which yields the SVM equation in the coefficients αi.
Kernel Trick & Representer Theorem - Consequence
• Kernel Trick
– We can omit the computation of φ, and simply compute the kernel function k(·,·)
• Kernel Function k(.,.)
– The kernel function k(xi,xj) defines a
similarity measure between xi and xj
– there are several kernel functions to choose from
• We do not even have to know φ
– this is actually pretty awesome!
SVM: Training
• Given
– a training set with samples x1, .., xn and their labels y1, .., yn
• Algorithm
1. choose a kernel function k(·,·)
2. estimate α1, .., αn by optimizing the SVM equation (αi ≠ 0 → xi is a “support vector”)
3. these values α1, .., αn define a maximum-margin decision boundary in a high-dimensional space defined by the kernel function
SVM: Classification
• Given
– test samples x1, .., xn (each denoted x below)
• Unknown
– their labels y1, .., yn ∈ {-1, 1}
• Classification
1. compute k(x,xi) for all support vectors xi
2. compute the classification score f(x) = Σi αi yi k(x,xi) + b
3. the class decision is sign(f(x))
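A hedged end-to-end sketch with scikit-learn: fit an RBF-kernel SVM, then reproduce the classification score Σi αi yi k(x,xi) + b from the learned dual coefficients (scikit-learn’s dual_coef_ stores αi·yi). The synthetic blob data is illustrative.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
clf = SVC(kernel="rbf", C=1.0, gamma=0.5).fit(X, y)

sv = clf.support_vectors_          # the x_i with alpha_i != 0
alpha_y = clf.dual_coef_.ravel()   # alpha_i * y_i for each support vector

x_test = X[:5]
K = rbf_kernel(x_test, sv, gamma=0.5)      # k(x, x_i) for all support vectors
scores = K @ alpha_y + clf.intercept_[0]   # f(x) = sum_i alpha_i y_i k(x, x_i) + b

print(np.allclose(scores, clf.decision_function(x_test)))  # True
print(np.sign(scores))                                      # class decisions sign(f(x))
```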
SVM: Kernel Best Practice
• How do we choose kernels k(·,·) in practice?
• They can be constructed from distance functions:
– if d(·,·) is a distance function, then e^{-d(·,·)}, i.e. exp{-d(·,·)}, can be used as a kernel function
• Some practical kernel functions:
– Linear
– Polynomial
– Gaussian (RBF)
– Histogram intersection
– Chi-square
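Sketches of the listed kernels as plain Python functions; the exact parameterizations (degree, offset c, β) are common conventions rather than the slide’s definitions, so treat them as illustrative.

```python
import numpy as np

def linear_kernel(x, z):
    return float(np.dot(x, z))

def polynomial_kernel(x, z, degree=3, c=1.0):
    return float((np.dot(x, z) + c) ** degree)

def gaussian_rbf_kernel(x, z, beta=1.0):
    # exp{-d(.,.)} with the (scaled) squared Euclidean distance
    return float(np.exp(-beta * np.sum((x - z) ** 2)))

def histogram_intersection_kernel(x, z):
    return float(np.sum(np.minimum(x, z)))

def chi_square_kernel(x, z, beta=1.0, eps=1e-10):
    return float(np.exp(-beta * np.sum((x - z) ** 2 / (x + z + eps))))

h1 = np.array([2.0, 0.0, 2.0, 0.0])  # e.g. the visual-word histogram from the pipeline example
h2 = np.array([1.0, 1.0, 2.0, 0.0])
print(histogram_intersection_kernel(h1, h2), gaussian_rbf_kernel(h1, h2, beta=0.5))
```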
SVM: Kernel Best Practice
• Kernels should show a class-wise block structure
• Example: β in the Gaussian kernel:
[Figure: kernel matrices for very large vs. very small β in the Gaussian kernel (picture: Christoph Lampert)]
SVM: Hyper-Parameter Optimization
• Parameter Optimization in SVMs:
– cost C of misclassified training samples
– kernel parameter β
[Figure: classification accuracy over a grid of (C, β) values, with a region of good parameter choices]
• Frequently used approach: Grid Search
– test different values of C and β
on a regular grid (alt. log grid)
– for each pair, measure
classification accuracy
on a held-out validation set
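A sketch of the grid search above with scikit-learn, where the RBF parameter gamma plays the role of β; the logarithmic grid values and synthetic data are illustrative, and cross-validation stands in for the held-out validation set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# logarithmic grid over the misclassification cost C and the kernel parameter gamma (~ beta)
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```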
SVMs – Summary
• Support Vector Machines were state-of-the-art classifiers, particularly successful in image recognition
• Advantages
– the maximum-margin problem can be solved globally optimally!
– the number of parameters is „independent“ of the feature dimensionality.
This makes SVMs very suitable classifiers for small, high-dimensional training sets!
– flexibility: we can incorporate application-specific kernels
– very good empirical results
• Disadvantages
– often: ad hoc choice of kernel functions
– scalability problems to large training sets
– limited learning capacity with a large number of positive samples
Overview
Agenda
1. What is supervised learning (Classification)?
2. How to build an “optimal” classifier ?
3. What kind of classifiers are there ?
a) “Naive” Bayes
b) Nearest Neighbors
c) Logistic Regression
d) Support Vector Machine (SVM)
4. How to combine (“fuse”) distinct classifiers?
5. Summary and conclusion
Early vs. Late Fusion
• Multiple classifiers can combine
different pieces of evidence
• multiple features
• multiple modalities
• multiple classifiers
• multiple training sets
Early vs. Late Fusion
• Different combination strategies
• early fusion = concatenate features
• late fusion = combine classification results
[Diagram: early fusion: the features x1, x2, .., xM are concatenated into [x1, x2, .., xM] and fed to a single classifier, which makes the decision; late fusion: each feature xm is fed to its own classifier producing P(c|xm), and the outputs P(c|x1), .., P(c|xM) are combined into the final decision.]
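A minimal sketch of the two strategies: early fusion concatenates the feature vectors and trains one classifier; late fusion trains one classifier per feature and averages the class probabilities P(c|xm). The split into two “modalities”, the choice of classifiers, and the averaging rule are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=12, random_state=0)
X1, X2 = X[:, :6], X[:, 6:]   # pretend these are two different features / modalities

# early fusion: concatenate [x1, x2] and train a single classifier
early = LogisticRegression(max_iter=1000).fit(np.hstack([X1, X2]), y)

# late fusion: one classifier per feature, then combine the P(c|x_m) outputs
clf1 = LogisticRegression(max_iter=1000).fit(X1, y)
clf2 = SVC(probability=True).fit(X2, y)
late_proba = (clf1.predict_proba(X1) + clf2.predict_proba(X2)) / 2
late_decision = late_proba.argmax(axis=1)

print(early.predict(np.hstack([X1, X2]))[:5], late_decision[:5])
```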
Overview
Agenda
1. What is supervised learning (Classification)?
2. How to build an “optimal” classifier ?
3. What kind of classifiers are there ?
a) “Naive” Bayes
b) Nearest Neighbors
c) Logistic Regression
d) Support Vector Machine (SVM)
4. How to combine (“fuse”) distinct classifiers?
5. Summary and conclusion
Discussion
• This lecture – four sample classifiers
• Naive Bayes (with Gaussian class-conditional densities)
• K-nearest neighbor
• Logistic regression
• Support Vector Machine (SVM)
• The Big Answer to “Which one is the best?”
• the right classifier depends on the distribution of the target data...
• … on the preprocessing ...
• … on the features...
• … on the amount of training data
→ no-free-lunch theorem
Questions?