Computer Vision
Chapter 7 (part 1): Object detection
Course Content
• Chapter 1. Introduction
• Chapter 2. Image formation, acquisition and digitization
• Chapter 3. Image Processing
• Chapter 4. Feature detection and matching
• Chapter 5. Segmentation
• Chapter 6. Moving object detection and tracking
• Chapter 7. Object recognition and deep learning
‒ Object Detection
‒ Object Recognition
‒ Deep Learning
Contents
• Window-based generic object detection: basic
pipeline
• Boosting classifiers
• Face detection as case study
• SVM + HOG for human detection as case study
• Object proposals
• [DPM]
• Evaluation
3
Object Detection
• Problem: Detecting and localizing generic objects
from various categories, such as cars, people, etc.
• Challenges:
‒ illumination
‒ viewpoint
‒ deformations
‒ intra-class variability
4
Window-based generic
object detection
Basic pipeline
5
Generic category recognition:
basic framework
• Build/train object model
‒ Choose a representation
‒ Learn or fit parameters of model / classifier
• Generate candidates in new image
• Score the candidates
6
Window-based models
Building an object model
Given the representation, train a binary classifier
Car/non-car
Classifier
“Yes, a car.” / “No, not a car.”
Slide: Kristen Grauman
7
Window-based models
Generating and scoring candidates
Car/non-car
Classifier
Slide: Kristen Grauman
8
Window-based models
Generating and scoring candidates
• Slide a window through the image and check
whether there is an object at every location
YES!! Person match found
9
Window-based models
Generating and scoring candidates
• But what if we were looking for buses?
‒ With a person-sized window: no bus found!
• We will never find the object if we don’t
choose our window size wisely!
‒ With a suitably sized window: bus found
10
Multi-scale sliding window
• Work with windows of multiple sizes
• Create a feature pyramid
11
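A minimal sketch of this search, not from the slides: keep the window size fixed and repeatedly shrink the image, so the same window covers ever larger objects. `score_window` is a hypothetical stand-in for whatever classifier gets trained below.

```python
import cv2

def pyramid_windows(image, win_w=64, win_h=128, step=16, scale=1.25):
    """Yield (x, y, factor, patch) over a multi-scale sliding window."""
    factor = 1.0
    while image.shape[0] >= win_h and image.shape[1] >= win_w:
        for y in range(0, image.shape[0] - win_h + 1, step):
            for x in range(0, image.shape[1] - win_w + 1, step):
                yield x, y, factor, image[y:y + win_h, x:x + win_w]
        # shrink the image: the fixed window now covers larger objects
        image = cv2.resize(image, (int(image.shape[1] / scale),
                                   int(image.shape[0] / scale)))
        factor *= scale

# usage (score_window is a hypothetical trained classifier):
# for x, y, f, patch in pyramid_windows(img):
#     if score_window(patch) > threshold:
#         # map back: box at (x * f, y * f), size (64 * f, 128 * f)
#         ...
```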
Window-based object detection: recap
Training:
1. Obtain training data
2. Define features
3. Define classifier
Given new image:
1. Slide window
2. Score by classifier
Car/non-car
Classifier
Feature
extraction
Slide: Kristen Grauman
12
13
Features
• HOG
• Bags of visual words
• Haar features, …
Discriminative classifier construction:
• Nearest neighbor (e.g., with 10⁶ examples)
• Neural networks
• Support Vector Machines
• Boosting
• Conditional Random Fields
14
Boosting classifiers
15
Boosting intuition
Weak
Classifier 1
Slide credit: Paul Viola
16
Boosting illustration
Weights
Increased
17
Boosting illustration
Weak
Classifier 2
18
Boosting illustration
Weights
Increased
19
Boosting illustration
Weak
Classifier 3
20
Boosting illustration
Final classifier is
a combination of weak
classifiers
21
Boosting: training
• Initially, weight each training example equally
• In each boosting round:
‒ Find the weak learner that achieves the lowest weighted training error
‒ Raise weights of training examples misclassified by current weak
learner
• Compute final classifier as linear combination of all weak
learners
‒ (the weight of each learner increases with its accuracy)
• Exact formulas for re-weighting and combining weak
learners depend on the particular boosting scheme
(e.g., AdaBoost)
Slide credit: Lana Lazebnik
22
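A compact sketch of this recipe (discrete AdaBoost with labels in {−1, +1}; not from the slides). `best_stump` is any weak-learner fitter, e.g. the feature/threshold search sketched after the Viola-Jones AdaBoost slide below.

```python
import numpy as np

def adaboost_train(X, y, T, best_stump):
    """X: (n, d) features, y: (n,) labels in {-1, +1}.
    best_stump(X, y, w) -> (predict_fn, weighted_error) is an assumed
    weak-learner fitter returning the lowest weighted-error stump."""
    n = len(y)
    w = np.full(n, 1.0 / n)              # start with uniform weights
    learners, alphas = [], []
    for _ in range(T):
        h, err = best_stump(X, y, w)     # lowest weighted training error
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)  # learner weight grows
        pred = h(X)                            # with its accuracy
        w *= np.exp(-alpha * y * pred)   # raise weights of mistakes
        w /= w.sum()
        learners.append(h)
        alphas.append(alpha)
    # final classifier: sign of the weighted vote of the weak learners
    return lambda Xnew: np.sign(
        sum(a * h(Xnew) for a, h in zip(alphas, learners)))
```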
Face detection
as case study
23
Viola-Jones face detector
24
Viola-Jones face detector
Main idea:
‒ Represent local texture with efficiently
computable “rectangular” features within window
of interest
‒ Select discriminative features to be weak classifiers
‒ Use boosted combination of them as final classifier
‒ Form a cascade of such classifiers, rejecting clear
negatives quickly
25
Viola-Jones detector: features
• “Rectangular” filters: feature output is the
difference between adjacent regions
• Efficiently computable with the integral image:
any sum can be computed in constant time
• Integral image: the value at (x,y) is the sum of
pixels above and to the left of (x,y)
Slide: Kristen Grauman 26
Computing the integral image
Lana Lazebnik
27
Computing the integral image
• Cumulative row sum: s(x, y) = s(x−1, y) + i(x, y)
• Integral image: ii(x, y) = ii(x, y−1) + s(x, y)
Lana Lazebnik
28
Computing sum within a rectangle
• Let A, B, C, D be the values of the integral
image at the corners of a rectangle
(D: top-left, B: top-right, C: bottom-left,
A: bottom-right)
• Then the sum of original image values within
the rectangle can be computed as:
sum = A − B − C + D
• Only 3 additions are required
for any size of rectangle!
Lana Lazebnik
29
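In code, both recurrences collapse into two cumulative sums, and any rectangle sum then costs four lookups. A sketch with numpy; OpenCV provides the same via cv2.integral.

```python
import numpy as np

def integral_image(img):
    """ii[y, x] = sum of img[:y, :x]; zero-padded for simple indexing."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_sum(ii, y0, x0, y1, x1):
    """Sum of img[y0:y1, x0:x1] in constant time: A - B - C + D."""
    return ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]

img = np.arange(16).reshape(4, 4)
ii = integral_image(img)
assert rect_sum(ii, 1, 1, 3, 3) == img[1:3, 1:3].sum()
```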
Viola-Jones detector: features
• “Rectangular” filters: feature output is the
difference between adjacent regions
• Efficiently computable with the integral image
(value at (x,y) = sum of pixels above and to the
left of (x,y)): any sum can be computed in
constant time
• Avoid scaling images → scale the features
directly, for the same cost
30
Viola-Jones detector: features
Considering all possible filter parameters
(position, scale, and type): 180,000+ possible
features associated with each 24 x 24 window
Which subset of these features should we use
to determine if a window has a face?
Use AdaBoost both to select the informative features
and to form the classifier
31
Viola-Jones detector: AdaBoost
• Want to select the single rectangle feature and threshold
that best separates positive (faces) and negative (non-
faces) training examples, in terms of weighted error.
Resulting weak classifier: threshold the feature
output, h(x) = +1 if f(x) > θ, −1 otherwise.
(Figure: outputs of a possible rectangle feature
on faces and non-faces.)
For the next round, reweight the examples
according to errors, then choose another
filter/threshold combo.
Slide: Kristen Grauman
32
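A sketch of that threshold search for a single feature, assuming the example weights w sum to 1. This is the `best_stump` building block assumed in the AdaBoost sketch above: run it once per candidate feature and keep only the best one.

```python
import numpy as np

def best_threshold(f, y, w):
    """f: one feature's outputs (n,), y: labels in {-1, +1}, w: weights
    summing to 1. Returns (theta, polarity, weighted_error) for the
    weak classifier h(x) = polarity * sign(f(x) - theta)."""
    order = np.argsort(f)
    f, y, w = f[order], y[order], w[order]
    # error with polarity +1 if we cut after sorted index i:
    # positives at or below the cut + negatives above it
    pos_below = np.cumsum(w * (y == 1))
    neg_above = (w * (y == -1)).sum() - np.cumsum(w * (y == -1))
    err_plus = pos_below + neg_above
    err = np.minimum(err_plus, 1.0 - err_plus)   # flipped polarity
    i = int(np.argmin(err))
    polarity = 1 if err_plus[i] <= 1.0 - err_plus[i] else -1
    return f[i], polarity, err[i]
```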
AdaBoost Algorithm
• Start with uniform weights on training
examples {x1,…,xn}
• For T rounds:
‒ Evaluate the weighted error for each feature,
pick the best.
‒ Re-weight the examples:
incorrectly classified → more weight,
correctly classified → less weight
• Final classifier is a combination of the weak
ones, weighted according to the error they had.
33
Viola-Jones Face Detector: Results
First two features
selected
34
• Even if the filters are fast to compute, each
new image has a lot of possible windows to
search.
• How to make the detection more efficient?
35
Cascading classifiers for detection
• Form a cascade with low false negative rates early on
• Apply less accurate but faster classifiers first to immediately
discard windows that clearly appear to be negative
Slide: Kristen Grauman
36
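The cascade logic itself is tiny (a sketch, not the original code): each stage is a boosted classifier with its own threshold, and a window is reported only if every stage accepts it, so most windows exit after the first cheap stages.

```python
def cascade_classify(window, stages):
    """stages: list of (boosted_score_fn, threshold), cheapest first.
    Reject as soon as any stage's score falls below its threshold."""
    for score_fn, threshold in stages:
        if score_fn(window) < threshold:
            return False          # clear negative: stop paying for it
    return True                   # survived every stage: report a face
```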
Training the cascade
• Set target detection and false positive rates for each
stage
• Keep adding features to the current stage until its target
rates have been met
‒ Need to lower AdaBoost threshold to maximize detection (as
opposed to minimizing total classification error)
‒ Test on a validation set
• If the overall false positive rate is not low enough, then
add another stage
• Use false positives from current stage as the
negative training examples for the next stage
37
Viola-Jones detector: summary
Train a cascade of classifiers with AdaBoost:
faces and non-faces go in; selected features,
thresholds, and weights come out. The cascade
is then applied to every window of a new image.
• Train with 5K positives, 350M negatives
• Real-time detector using a 38-layer cascade
• 6061 features in all layers
[Implementation available in OpenCV] Slide: Kristen Grauman
38
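The OpenCV implementation mentioned on the slide is typically invoked like this (with the opencv-python package; the image file names are placeholders):

```python
import cv2

# Load the pretrained Viola-Jones cascade that ships with OpenCV
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

img = cv2.imread("group_photo.jpg")          # any test image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# scaleFactor: pyramid step; minNeighbors: overlapping detections
# required before a window is reported as a face
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite("faces.jpg", img)
```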
Viola-Jones detector: summary
• A seminal approach to real-time object detection
‒ 26,949 citations
• Training is slow, but detection is very fast
• Key ideas
‒ Integral images for fast feature evaluation
‒ Boosting for feature selection
‒ Attentional cascade of classifiers for fast rejection of non-face
windows
P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features.
CVPR 2001.
P. Viola and M. Jones. Robust real-time face detection. IJCV 57(2), 2004.
39
Viola-Jones Face Detector: Results
40
Viola-Jones Face Detector: Results
41
Viola-Jones Face Detector: Results
42
Detecting profile faces?
Can we use the same detector?
43
Viola-Jones Face Detector: Results
Paul Viola, ICCV tutorial 44
Example using Viola-Jones detector
Frontal faces detected and then tracked, character names
inferred with alignment of script and subtitles.
Everingham, M., Sivic, J. and Zisserman, A.
"Hello! My name is... Buffy" - Automatic naming of characters in TV video,
BMVC 2006. http://www.robots.ox.ac.uk/~vgg/research/nface/index.html
45
46
Slide: Kristen Grauman
47
Consumer application: iPhoto
http://www.apple.com/ilife/iphoto/
Slide credit: Lana Lazebnik
48
Consumer application: iPhoto
Things iPhoto thinks are faces
Slide credit: Lana Lazebnik
49
Consumer application: iPhoto
• Can be trained to recognize pets!
http://www.maclife.com/article/news/iphotos_faces_recognizes_cats
Slide credit: Lana Lazebnik
50
Privacy Gift Shop – CV Dazzle
• http://www.wired.com/2015/06/facebook-can-recognize-even-dont-show-face/
• Wired, June 15, 2015
Slide: Kristen Grauman
51
Boosting: pros and cons
• Advantages of boosting
‒ Integrates classification with feature selection
‒ Complexity of training is linear in the number of training examples
‒ Flexibility in the choice of weak learners, boosting scheme
‒ Testing is fast
‒ Easy to implement
• Disadvantages
‒ Needs many training examples
‒ Other discriminative models may outperform in practice (SVMs,
CNNs,…)
• especially for many-class problems
Slide credit: Lana Lazebnik
52
Window-based models:
Two case studies
• Boosting + face detection (Viola & Jones)
• SVM + person detection (e.g., Dalal & Triggs)
53
SVM + HOG for human detection
as case study
54
Linear classifiers
55
Linear classifiers
• Find linear function to separate positive and negative
examples
xᵢ positive: xᵢ · w + b ≥ 0
xᵢ negative: xᵢ · w + b < 0
Which line
is best?
56
Support Vector Machines (SVMs)
• Discriminative
classifier based on
optimal separating
line (for 2d case)
• Maximize the margin
between the positive
and negative training
examples
57
Support vector machines
• Want line that maximizes the margin
xᵢ positive (yᵢ = 1): xᵢ · w + b ≥ 1
xᵢ negative (yᵢ = −1): xᵢ · w + b ≤ −1
For support vectors, xᵢ · w + b = ±1
(Figure: support vectors and the margin)
C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and
Knowledge Discovery, 1998
58
Support vector machines
• Want line that maximizes the margin
xᵢ positive (yᵢ = 1): xᵢ · w + b ≥ 1
xᵢ negative (yᵢ = −1): xᵢ · w + b ≤ −1
For support vectors, xᵢ · w + b = ±1
Distance between point and line: |xᵢ · w + b| / ||w||
For support vectors: (wᵀx + b) / ||w|| = ±1 / ||w||,
so the margin is M = 1/||w|| − (−1/||w||) = 2/||w||
59
Support vector machines
• Want line that maximizes the margin
xᵢ positive (yᵢ = 1): xᵢ · w + b ≥ 1
xᵢ negative (yᵢ = −1): xᵢ · w + b ≤ −1
For support vectors, xᵢ · w + b = ±1
Distance between point and line: |xᵢ · w + b| / ||w||
Therefore, the margin is M = 2 / ||w||
60
Finding the maximum margin line
1. Maximize margin 2/||w||
2. Correctly classify all training data points:
xᵢ positive (yᵢ = 1): xᵢ · w + b ≥ 1
xᵢ negative (yᵢ = −1): xᵢ · w + b ≤ −1
Quadratic optimization problem:
Minimize (1/2) wᵀw
subject to yᵢ(w · xᵢ + b) ≥ 1
C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and
Knowledge Discovery, 1998
61
Finding the maximum margin line
• Solution: w = Σᵢ αᵢ yᵢ xᵢ
(αᵢ: learned weights; xᵢ: support vectors)
C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and
Knowledge Discovery, 1998
62
Finding the maximum margin line
• Solution: w = Σᵢ αᵢ yᵢ xᵢ
b = yᵢ − w · xᵢ (for any support vector)
w · x + b = Σᵢ αᵢ yᵢ (xᵢ · x) + b
• Classification function:
f(x) = sign(w · x + b)
= sign(Σᵢ αᵢ yᵢ (xᵢ · x) + b)
If f(x) < 0, classify as negative,
if f(x) > 0, classify as positive
C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and
Knowledge Discovery, 1998
63
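None of this code appears in the slides, but scikit-learn makes the quantities above tangible: after fitting a (near) hard-margin linear SVM on toy data, w, b, the support vectors, and the margin 2/||w|| can all be read off the model.

```python
import numpy as np
from sklearn import svm

# Toy 2-D data: two linearly separable classes, labels +1 / -1
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 1.5],
              [0.0, 0.0], [-1.0, 0.5], [0.5, -1.0]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = svm.SVC(kernel="linear", C=1e6)   # very large C ~ hard margin
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("support vectors:", clf.support_vectors_)  # points with x.w+b = +/-1
print("margin 2/||w|| =", 2 / np.linalg.norm(w))
print("f([2, 0]) =", np.sign(np.array([2.0, 0.0]) @ w + b))
```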
Person detection
with HoGs & linear SVMs
• Histogram of oriented gradients (HoG):
‒ Map each grid cell in the input window to a histogram
counting the gradients per orientation.
• Train a linear SVM
‒ using training set of pedestrian vs. non-pedestrian
windows.
Dalal & Triggs, CVPR 2005
64
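OpenCV ships exactly this detector pair: a 64×128 HOG descriptor plus a pretrained linear SVM for pedestrians, searched over a multi-scale sliding window (image file names are placeholders):

```python
import cv2

hog = cv2.HOGDescriptor()   # default 64x128 window, 9 orientation bins
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

img = cv2.imread("street.jpg")               # any test image
# multi-scale sliding-window search, the pipeline described above
rects, weights = hog.detectMultiScale(img, winStride=(8, 8), scale=1.05)
for (x, y, w, h) in rects:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 0, 255), 2)
cv2.imwrite("people.jpg", img)
```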
Person detection
with HoGs & linear SVMs
• For more detail about HoG:
‒ Histograms of Oriented Gradients for Human Detection, Navneet Dalal,
Bill Triggs, International Conference on Computer Vision & Pattern
Recognition - June 2005
‒ http://lear.inrialpes.fr/pubs/2005/DT05/
65
Window-based detection: strengths
• Sliding window detection and global appearance
descriptors:
‒ Simple detection protocol to implement
‒ Good feature choices critical
‒ Past successes for certain classes
Slide: Kristen Grauman
66
Window-based detection: Limitations
• High computational complexity
‒ For example: 250,000 locations x 30 orientations x 4
scales = 30,000,000 evaluations!
‒ If binary detectors are trained independently, the
cost increases linearly with the number of classes
• With so many windows, false positive rate better
be low
Slide: Kristen Grauman
67
Limitations (continued)
• Not all objects are “box” shaped
Slide: Kristen Grauman
68
Limitations (continued)
• Non-rigid, deformable objects not captured well with
representations assuming a fixed 2d structure; or must assume
fixed viewpoint
• Objects with less-regular textures not captured well with holistic
appearance-based descriptions
Slide: Kristen Grauman
69
Limitations (continued)
(Figure panels: sliding window vs. detector’s view)
If considering windows in isolation,
context is lost
Figure credit: Derek Hoiem
Slide: Kristen Grauman
70
Limitations (continued)
• In practice, often entails large, cropped training set
(expensive)
• Requiring good match to a global appearance
description can lead to sensitivity to partial occlusions
Slide: Kristen Grauman
71
Object proposals
72
Object proposals
Main idea:
• Learn to generate category-independent regions/boxes
that have object-like properties.
• Let object detector search over “proposals”, not
exhaustive sliding windows
Alexe et al. Measuring the objectness of image windows, PAMI 2012
73
Object proposals
Multi-scale
saliency
Color
contrast
Alexe et al. Measuring the objectness of image windows, PAMI 2012
74
Object proposals
Edge density
Superpixel straddling
Alexe et al. Measuring the objectness of image windows, PAMI 2012
75
Object proposals
Yellow box: object detected; cyan box: ground truth
More proposals
Alexe et al. Measuring the objectness of image windows, PAMI 2012
76
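For a feel of the interface, here is Selective Search from opencv-contrib, a different but widely used proposal generator (not the objectness measure of Alexe et al.; the file name is a placeholder):

```python
import cv2

# Requires the opencv-contrib-python package
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
img = cv2.imread("scene.jpg")                 # any test image
ss.setBaseImage(img)
ss.switchToSelectiveSearchFast()
boxes = ss.process()                          # array of (x, y, w, h)
print(len(boxes), "proposals")                # typically thousands
# a detector then scores only these boxes instead of every window
```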
Deformable Part Model (DPM)
• Represents an object as a
collection of parts arranged in a
deformable configuration
• Each part represents local
appearances
• Spring-like connections between
certain pairs of parts
Fischler and Elschlager, Pictorial Structures,
1973
Felzenszwalb et al., PAMI 2010
78
Deformable Part Model (DPM)
79
Deformable Part Model (DPM)
• References
‒ Pedro F. Felzenszwalb & Daniel P. Huttenlocher, Pictorial Structures for Object
Recognition, IJCV 2005
• https://www.cs.cornell.edu/~dph/papers/pict-struct-ijcv.pdf
‒ P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object
detection with discriminatively trained part based models. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 32(9):1627–1645, 2010
80
Object detection: Evaluation
81
Object Detection Benchmarks
• PASCAL VOC Challenge
• ImageNet Large Scale Visual Recognition Challenge
(ILSVRC)
‒ 200 Categories for detection
• Common Objects in Context (COCO)
‒ 80 Object categories
82
How do we evaluate object detection?
predictions
ground truth
True positive:
- The overlap (IoU) of the prediction
with the ground truth is MORE
than a threshold value (e.g., 0.5)
83
How do we evaluate object detection?
predictions
ground truth
True positive:
False positive:
- The overlap (IoU) of the prediction
with the ground truth is LESS
than a threshold value (e.g., 0.5)
84
How do we evaluate object detection?
predictions
ground truth
True positive:
False positive:
False negative:
- The objects that our model
doesn’t find
85
How do we evaluate object detection?
predictions
ground truth
True positive:
False positive:
False negative:
- The objects that our model
doesn’t find
What is a True Negative?
86
precision = TP / (TP + FP)
recall = TP / (TP + FN)
87
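The overlap test behind the TP/FP decisions above is Intersection over Union; a minimal sketch with boxes given as (x0, y0, x1, y1) corner pairs:

```python
def iou(a, b):
    """Intersection over union of two boxes (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# a prediction is a true positive if it overlaps some ground-truth
# box with IoU above the threshold (e.g., 0.5); then
# precision = TP / (TP + FP) and recall = TP / (TP + FN)
```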
How do we evaluate object detection?
predictions
ground truth
True positive: 1
False positive: 2
False negative: 1
So what is the
- precision?
- recall?
88
Precision versus recall
• Precision: how many of the object detections
are correct?
precision = TP / (TP + FP)
• Recall: how many of the ground truth objects
can the model detect?
recall = TP / (TP + FN)
‒ recall is also called the True Positive Rate (TPR)
89
• In reality, our model makes a lot of predictions with varying scores
between 0 and 1
predictions
ground truth
Here are all the boxes that
are predicted with score > 0.
This means that our
- Recall is perfect!
- But our precision is BAD!
90
How do we evaluate object detection?
predictions
ground truth
Here are all the boxes that
are predicted with score > 0.5
We are setting a threshold of
0.5
91
Precision – recall curve (PR curve)
92
Which model is the best?
93
Which model is the best?
• Area under curve (AUC), average precision (AP)
• F1-score (highest value at the optimal confidence threshold)
94
Which model is the best?
AP: the average precision for each class
individually, computed across all of the IoU
thresholds
mAP: the average of AP over all classes
95
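A sketch of VOC-style AP as the area under the PR curve; the inputs come from sweeping the detector's confidence threshold as on the previous slides (COCO additionally averages this over IoU thresholds 0.5:0.05:0.95, and mAP averages over classes):

```python
import numpy as np

def average_precision(recall, precision):
    """Area under the PR curve (all-points interpolation).
    recall must be sorted ascending, one point per threshold."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    # make precision monotonically non-increasing (interpolation step)
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    idx = np.where(r[1:] != r[:-1])[0]     # points where recall changes
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))
```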
Summary
• Object recognition as classification task
‒ Boosting (face detection ex)
‒ Support vector machines and HOG (human detection
ex)
‒ Sliding window search paradigm
• Pros and cons
• Speed up with attentional cascade
• Object proposals, proposal regions as alternative
96
References
Most of these slides were adapted from:
1. Kristen Grauman (CS 376: Computer Vision, Spring 2018, The
University of Texas at Austin)
97
Thank
you!
98