Lecture 4: Detection
AI6126 Advanced Computer Vision

Last update: 4 February 2022 3:35pm

Object Detection
Chen-Change Loy
吕健勤

http://www.ntu.edu.sg/home/ccloy/
https://twitter.com/ccloy
Assignment 1
• 20 marks (will be normalized)
• Deadline: 14 Feb 11:59 PM
Slide credits
• Justin Johnson EECS 498-007 / 598-005
So far …
Image classification
Computer vision tasks
Outline

• Task definition
• Region-based CNN and basic concepts
• Intersection over Union (IoU)
• Non-Max Suppression (NMS)
• Mean Average Precision (mAP)
• Fast R-CNN
• Faster R-CNN

• Tutorial 2: Faster-RCNN
Task Definition

Task definition
• Things
• Stuff

Challenges
There are many “close” false positives, corresponding to “close but not correct” bounding boxes. The detector must find the true positives while suppressing these close false positives.
The pipeline
Detecting a single object

• Treat localization as a regression problem!
• Multitask loss (classification + box regression)
• The CNN is often pretrained on ImageNet (transfer learning)

The pipeline
Detecting multiple objects – we need a different number of outputs per image
Sliding window
Detecting multiple objects

Question: How many possible boxes are there in an image of size H x W?

Consider a box of size h x w:
• Possible x positions: W – w + 1
• Possible y positions: H – h + 1
• Possible positions: (W – w + 1) * (H – h + 1)

It is even worse if we consider boxes of different sizes and ratios: an 800 x 600 image has ~58M boxes! There is no way we can evaluate them all.
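To make the counting concrete, here is a minimal Python sketch (an illustration added here, not from the slides) that applies the position-count formula for a single box size; summing over all sizes and aspect ratios makes the count explode further.

```python
def num_positions(W, H, w, h):
    """Count the axis-aligned placements of a w x h box inside a W x H image."""
    return (W - w + 1) * (H - h + 1)

# A 100 x 50 box in an 800 x 600 image already has ~386k placements,
# and that is just one of many possible sizes and aspect ratios.
print(num_positions(800, 600, 100, 50))  # 386251
```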
Outline

• Task definition
• Region-based CNN and basic concepts
• Intersection over Union (IoU)
• Non-Max Suppression (NMS)
• Mean Average Precision (mAP)
• Fast R-CNN
• Faster R-CNN

• Tutorial 2: Faster-RCNN
Region-based CNN and
basic concepts
Region proposals
• Find a small set of boxes that are likely to cover all objects
• Often based on heuristics: e.g. look for “blob-like” image regions
• Relatively fast to run; e.g. Selective Search gives 2000 region proposals in a few seconds on
CPU

Alexe et al, “Measuring the objectness of image windows”, TPAMI 2012
Uijlings et al, “Selective Search for Object Recognition”, IJCV 2013
Cheng et al, “BING: Binarized normed gradients for objectness estimation at 300fps”, CVPR 2014
Zitnick and Dollar, “Edge boxes: Locating object proposals from edges”, ECCV 2014
Selective Search

Step 1: Generate initial sub-segmentation

Goal: Generate many regions, each of which belongs to at most one object.

P. F. Felzenszwalb and D. P. Huttenlocher. “Efficient Graph-Based Image Segmentation.” IJCV, 59:167–181, 2004.
Selective Search
Step 2: Recursively combine similar regions into larger ones.

Greedy algorithm:
1. From the set of regions, choose the two that are most similar.
2. Combine them into a single, larger region.
3. Repeat until only one region remains.

Similarity measures: color similarity (HSV histograms) and texture similarity (HOG-like features).

This yields a hierarchy of successively larger regions, just like we want.
Selective Search
Step 3: Use the generated regions to produce candidate object locations.
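To make the greedy merging concrete, here is a hedged Python sketch (not from the slides); the `similarity` and `merge` functions are assumed inputs standing in for the color/texture/size/fill measures of Uijlings et al.

```python
def greedy_merge(regions, similarity, merge):
    """Step 2 of Selective Search: repeatedly merge the most similar pair of regions."""
    hierarchy = list(regions)   # every region ever created becomes a candidate location
    while len(regions) > 1:
        # 1. choose the two most similar regions
        i, j = max(((a, b) for a in range(len(regions)) for b in range(a + 1, len(regions))),
                   key=lambda ab: similarity(regions[ab[0]], regions[ab[1]]))
        # 2. combine them into a single, larger region
        merged = merge(regions[i], regions[j])
        regions = [r for k, r in enumerate(regions) if k not in (i, j)] + [merged]
        hierarchy.append(merged)
        # 3. repeat until only one region remains
    return hierarchy            # proposals are the bounding boxes around these regions
```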
Region-based CNN

Classify each region

Bounding box regression:
Predict a “transform” to correct the RoI: 4 numbers (t_x, t_y, t_w, t_h)

What are the target values (t_x, t_y, t_w, t_h)?

Region proposal: (p_x, p_y, p_w, p_h)
Transform: (t_x, t_y, t_w, t_h)
Output box: (b_x, b_y, b_w, b_h)

Translate relative to box size:
b_x = p_x + p_w * t_x
b_y = p_y + p_h * t_y

Log-space scale transform:
b_w = p_w * exp(t_w)
b_h = p_h * exp(t_h)

The regression targets are computed from the proposal (p_x, p_y, p_w, p_h) and the coordinates of its corresponding ground-truth box (g_x, g_y, g_w, g_h):
t_x = (g_x – p_x) / p_w
t_y = (g_y – p_y) / p_h
t_w = log(g_w / p_w)
t_h = log(g_h / p_h)

A standard regression model can solve the problem by minimizing the sum of squared loss with regularization. The output box (b_x, b_y, b_w, b_h) comes from the bounding box regression function.

Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014. Figure copyright Ross Girshick, 2015
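A minimal Python sketch of this parameterization (added for illustration; boxes are given as center x, center y, width, height):

```python
import math

def apply_transform(p, t):
    """Apply a predicted transform t = (tx, ty, tw, th) to a proposal p = (px, py, pw, ph)."""
    px, py, pw, ph = p
    tx, ty, tw, th = t
    bx = px + pw * tx          # translate relative to box size
    by = py + ph * ty
    bw = pw * math.exp(tw)     # log-space scale transform
    bh = ph * math.exp(th)
    return bx, by, bw, bh

def regression_targets(p, g):
    """Target transform that maps proposal p onto its assigned ground-truth box g."""
    px, py, pw, ph = p
    gx, gy, gw, gh = g
    return (gx - px) / pw, (gy - py) / ph, math.log(gw / pw), math.log(gh / ph)

# Round trip: applying the targets to the proposal recovers the ground-truth box.
p, g = (50.0, 60.0, 100.0, 80.0), (55.0, 58.0, 120.0, 70.0)
b = apply_transform(p, regression_targets(p, g))
assert all(abs(bi - gi) < 1e-6 for bi, gi in zip(b, g))
```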
Region-based CNN
Test time
Input: Single RGB image

1. Run a region proposal method to compute ~2000 region proposals
2. Resize each region to 224x224 and run it independently through the CNN to predict class scores and a bbox transform
3. Use the scores to select a subset of region proposals to output
   (Many choices here: threshold on background, or per-category? Or take the top K proposals per image?)
4. Compare with ground-truth boxes

Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014. Figure copyright Ross Girshick, 2015
Region-based CNN
Which training pairs (P, G) to use?
• Intuitively, if P is far from all ground-truth boxes, then the task of transforming P to a ground-truth box G does not make sense.
• Therefore, we only learn from a proposal P if it is near at least one ground-truth box. We implement “nearness” by assigning P to the ground-truth box G with which it has maximum IoU overlap (in case it overlaps more than one), if and only if the overlap is greater than a threshold (which we set to 0.6 using a validation set).
• All unassigned proposals are discarded.
• We do this once for each object class in order to learn a set of class-specific bounding-box regressors.
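A hedged sketch of this assignment rule (assuming an iou(box_a, box_b) helper like the one sketched in the IoU section below):

```python
def assign_proposals(proposals, gt_boxes, iou, thresh=0.6):
    """Keep only proposals near some ground-truth box; pair each with its max-IoU GT."""
    pairs = []
    for p in proposals:
        overlaps = [iou(p, g) for g in gt_boxes]
        if overlaps and max(overlaps) > thresh:
            g = gt_boxes[overlaps.index(max(overlaps))]
            pairs.append((p, g))          # training pair (P, G) for the box regressor
        # proposals with no overlap above the threshold are discarded
    return pairs
```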
Comparing Boxes: Intersection over Union (IoU)

How can we compare our prediction to the ground-truth box?

Intersection over Union (IoU), also called “Jaccard similarity” or “Jaccard index”: the area of the intersection of the two boxes divided by the area of their union.

IoU > 0.5 is “decent”, IoU > 0.7 is “pretty good”, IoU > 0.9 is “almost perfect”
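A minimal Python sketch of IoU for axis-aligned boxes given as (x1, y1, x2, y2) corners (added for illustration):

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # intersection rectangle (empty if the boxes do not overlap)
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.14
```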
Overlapping boxes
Problem: Object detectors often output many overlapping detections

Non-Max Suppression (NMS)
Solution: Post-process raw detections using Non-Max Suppression (NMS)
1. Select the next highest-scoring box
2. Eliminate lower-scoring boxes with IoU > threshold (e.g. 0.7)
3. If any boxes remain, GOTO 1

Problem: NMS may eliminate “good” boxes when objects are highly overlapping... no good solution
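A minimal sketch of greedy NMS, reusing the iou() helper above (added for illustration):

```python
def nms(boxes, scores, iou_thresh=0.7):
    """Greedy Non-Max Suppression; returns indices of the boxes to keep."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)                     # 1. select the next highest-scoring box
        keep.append(best)
        order = [i for i in order               # 2. drop boxes that overlap it too much
                 if iou(boxes[best], boxes[i]) <= iou_thresh]
    return keep                                 # 3. repeat until no boxes remain
```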
Mean Average Precision (mAP)
Evaluating Object Detectors

1. Run the object detector on all test images (with NMS)
2. For each category, compute Average Precision (AP) = area under the Precision vs Recall curve
   a. For each detection (from highest score to lowest score):
      i. If it matches some GT box with IoU > 0.5, mark it as positive and eliminate that GT
      ii. Otherwise mark it as negative
      iii. Plot a point on the PR curve
   b. Average Precision (AP) = area under the PR curve
3. Mean Average Precision (mAP) = average of the AP over all categories
4. For “COCO mAP”: compute mAP@thresh for each IoU threshold (0.5, 0.55, 0.6, ..., 0.95) and take the average

Precision/recall intuition: if you reject everything you make no mistakes (precision 1, recall 0); if you accept everything you miss nothing (recall 1, but low precision). The ideal detector sits at the top-right corner; the curve is summarized by the area under it (average precision).
Source: https://en.wikipedia.org/wiki/Precision_and_recall

How to get AP = 1.0? Hit all GT boxes with IoU > 0.5, and have no “false positive” detections ranked above any “true positives”.

https://cocodataset.org/
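The per-category AP computation can be sketched as follows (an illustration of the steps above, reusing the iou() helper from earlier; real VOC/COCO evaluation adds interpolation and per-image bookkeeping not shown here):

```python
def average_precision(detections, gt_boxes, iou, iou_thresh=0.5):
    """detections: list of (score, box) for one category; gt_boxes: list of GT boxes."""
    detections = sorted(detections, key=lambda d: d[0], reverse=True)
    matched = [False] * len(gt_boxes)
    tp = fp = 0
    points = []                                   # one (recall, precision) point per detection
    for score, box in detections:
        ious = [0.0 if matched[i] else iou(box, g) for i, g in enumerate(gt_boxes)]
        if ious and max(ious) > iou_thresh:       # matches some not-yet-matched GT box
            matched[ious.index(max(ious))] = True
            tp += 1
        else:
            fp += 1
        points.append((tp / max(len(gt_boxes), 1), tp / (tp + fp)))
    # area under the PR curve via simple rectangles over recall increments
    ap, prev_recall = 0.0, 0.0
    for recall, precision in points:
        ap += precision * (recall - prev_recall)
        prev_recall = recall
    return ap
```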
Region-based CNN

Problem: Very slow! Need to do ~2k forward passes for each image!

Solution: Run CNN *before* warping!

Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014. Figure copyright Ross Girshick, 2015
Outline

• Task definition
• Region-based CNN and basic concepts
• Intersection over Union (IoU)
• Non-Max Suppression (NMS)
• Mean Average Precision (mAP)
• Fast R-CNN
• Faster R-CNN

• Tutorial 2: Faster-RCNN
Fast R-CNN

Fast R-CNN
Idea: run the backbone CNN once over the whole image, then crop and process features for each region proposal from the shared feature map.

How to crop features?

Girshick, “Fast R-CNN”, ICCV 2015. Figure copyright Ross Girshick, 2015
Cropping features – RoI Pool

“Snap” the box to grid cells, divide it into a 2x2 grid of (roughly) equal subregions, and max-pool within each subregion.

Problem: Slight misalignment due to snapping; different-sized subregions are weird.

RoI pooling quantizes a floating-number RoI to the discrete granularity of the feature map. These quantizations introduce misalignments between the RoI and the extracted features. While this may not impact classification (which is robust to small translations), it has a large negative effect on predicting pixel-accurate masks.

Girshick, “Fast R-CNN”, ICCV 2015. Figure copyright Ross Girshick, 2015
Cropping features – RoI Align

Divide into equal-sized subregions with no “snapping” (the subregions may not be aligned to the grid!).

Sample features at regularly-spaced points in each subregion using bilinear interpolation. The feature f_xy for a point (x, y) is a linear combination of the features at its four neighboring grid cells.

The results are not sensitive to the exact sampling locations, or to how many points are sampled, as long as no quantization is performed.

This is differentiable! The upstream gradient for a sampled feature flows backward into each of the four nearest-neighbor gridpoints.

After sampling, max-pool in each subregion.

Girshick, “Fast R-CNN”, ICCV 2015. Figure copyright Ross Girshick, 2015
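The bilinear sampling step can be sketched as follows (an illustration added here; edge handling and the choice of sampling points are omitted):

```python
def bilinear_sample(feat, x, y):
    """Value at real-valued point (x, y): weighted sum of the four neighboring cells.

    feat is a 2D array-like indexed as feat[row][col], (x, y) in feature-map coordinates.
    """
    x0, y0 = int(x), int(y)
    x1, y1 = x0 + 1, y0 + 1
    wx, wy = x - x0, y - y0
    return (feat[y0][x0] * (1 - wx) * (1 - wy) +
            feat[y0][x1] * wx * (1 - wy) +
            feat[y1][x0] * (1 - wx) * wy +
            feat[y1][x1] * wx * wy)

feat = [[0.0, 1.0], [2.0, 3.0]]
print(bilinear_sample(feat, 0.5, 0.5))  # 1.5, the average of the four neighbors
```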
Fast R-CNN vs “Slow” R-CNN

Fast R-CNN: Apply differentiable cropping to shared image features.
“Slow” R-CNN: Run the CNN independently for each region.

Problem: Runtime is dominated by region proposals!

Recall: Region proposals are computed by the heuristic “Selective Search” algorithm on the CPU -- let’s learn them with a CNN instead!

Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014.
He et al, “Spatial pyramid pooling in deep convolutional networks for visual recognition”, ECCV 2014.
Girshick, “Fast R-CNN”, ICCV 2015.
Outline

• Task definition
• Region-based CNN and basic concepts
• Intersection over Union (IoU)
• Non-Max Suppression (NMS)
• Mean Average Precision (mAP)
• Fast R-CNN
• Faster R-CNN

• Tutorial 2: Faster-RCNN
Faster R-CNN
Faster R-CNN: Learnable Region Proposals

Ren et al, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”, NIPS 2015 Figure copyright 2015, Ross Girshick
Region Proposal Network (RPN)

Imagine an anchor box of fixed size at each point in the feature map.

At each point, predict whether the corresponding anchor contains an object (per-cell logistic regression; predict scores with a conv layer).

For positive boxes, also predict a box transform to regress from the anchor box to the object box.

Problem: An anchor box may have the wrong size / shape.
Solution: Use K different anchor boxes at each point!

At test time: sort all K*20*15 boxes by their score, and take the top ~300 as our region proposals.
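A hedged sketch of laying out K anchors over a 20 x 15 feature map (the stride, sizes, and ratios below are illustrative choices, not the exact values from the paper):

```python
def make_anchors(feat_w=20, feat_h=15, stride=32,
                 sizes=(64, 128, 256), ratios=(0.5, 1.0, 2.0)):
    """Place K = len(sizes) * len(ratios) anchors at every feature-map cell."""
    anchors = []  # each anchor: (center_x, center_y, width, height) in image pixels
    for j in range(feat_h):
        for i in range(feat_w):
            cx, cy = (i + 0.5) * stride, (j + 0.5) * stride
            for s in sizes:
                for r in ratios:
                    w, h = s * (r ** 0.5), s / (r ** 0.5)   # same area, varying aspect
                    anchors.append((cx, cy, w, h))
    return anchors

anchors = make_anchors()
print(len(anchors))  # 20 * 15 * 9 = 2700 anchors; the RPN scores each one and regresses a transform
```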
Faster R-CNN: Learnable Region Proposals

Ren et al, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”, NIPS 2015. Figure copyright 2015, Ross Girshick

Question: Do we really need the second stage?
Single-Stage Object Detection

RPN: Classify each anchor as object / not object.
Single-Stage Detector: Classify each anchor as one of C categories (or background).

Anchor category scores: (C+1) x K x 20 x 15
Box transforms: 4K x 20 x 15

Remember: K anchors at each position in the image feature map.

Sometimes category-specific regression is used: predict different box transforms for each category, so the box-transform output becomes C x 4K x 20 x 15.
YOLO
(If we have time)

Credit
• The original slides of YOLO: https://docs.google.com/presentation/d/1kAa7NOamBt4calBU9iHgT8a86RRHz9Yz2oh4-GTdX6M/edit#slide=id.p

YOLO: You Only Look Once
Extremely fast (45 frames per second, compared to 7 fps for Faster R-CNN in the same setting).
We split the image into an S x S grid.

Each cell predicts B boxes (x, y, w, h) and a confidence for each box: P(Object).

Each grid cell predicts a fixed number of boundary boxes. In this example, the grid cell makes two boundary box predictions (black boxes) to locate where the car is.

Do you see the weakness of YOLO?

Each cell also predicts a class probability, conditioned on there being an object: e.g. P(Car | Object). (Example classes: Bicycle, Car, Dog, Dining Table.)

Then we combine the box and class predictions:
P(class | Object) * P(Object) = P(class)
(conditional class probability * box confidence score = class confidence score)

Finally we do NMS and threshold detections.
Each cell predicts:
- For each bounding box:
  - 4 coordinates (x, y, w, h)
  - 1 confidence value
- Some number of class probabilities

This parameterization fixes the output size.

For Pascal VOC:
- 7x7 grid
- 2 bounding boxes / cell
- 20 classes
7 x 7 x (2 x 5 + 20) = 7 x 7 x 30 tensor = 1470 outputs

Thus we can train one neural network to be a whole detection pipeline.
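A one-line sanity check of the fixed output size for the Pascal VOC setting (added for illustration):

```python
def yolo_output_size(S=7, B=2, C=20):
    """Per cell: B boxes x (4 coords + 1 confidence) + C class probabilities."""
    return S * S * (B * 5 + C)

print(yolo_output_size())  # 7 * 7 * 30 = 1470 outputs
```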
YOLO Series
• YOLO (Joseph Redmon et al. 2016)
• YOLOv2 (Joseph Redmon and Ali Farhadi 2017)
• YOLOv3 (Joseph Redmon and Ali Farhadi 2018)
• YOLOv4 (Alexey Bochkovskiy 2020)
• YOLOv5 (Ultralytics)
• YOLOX (Zheng Ge et al 2021)
Summary
Object detection: lots of variables!

Takeaways:
- Two-stage methods (Faster R-CNN) get the best accuracy, but are slower
- Single-stage methods (SSD) are much faster, but don’t perform as well
- Bigger backbones improve performance, but are slower
- Diminishing returns for slower methods

Huang et al, “Speed/accuracy trade-offs for modern convolutional object detectors”, CVPR 2017
Object detection: lots of variables!

These results are a few years old ... since then GPUs have gotten faster, and we’ve improved performance with many tricks:
- Train longer!
- Multiscale backbone: Feature Pyramid Networks
- Better backbone: ResNeXt
- Single-stage methods have improved
- Very big models work better
- Test-time augmentation pushes numbers up
- Big ensembles, more data, etc.

Huang et al, “Speed/accuracy trade-offs for modern convolutional object detectors”, CVPR 2017
New developments
• Anchor-free detectors
  • FCOS
• End-to-end (NMS-free) detectors
  • CenterNet
  • DETR
Object detection: open source codes
Object detection is hard! Don’t implement it yourself

TensorFlow Detection API:
https://github.com/tensorflow/models/tree/master/research/object_detection
Faster R-CNN, SSD, RFCN, Mask R-CNN

Detectron2 (PyTorch):
https://github.com/facebookresearch/detectron2
Fast / Faster / Mask R-CNN, RetinaNet

OpenMMLab MMDetection:
https://github.com/open-mmlab/mmdetection

Tutorial 2: Faster R-CNN using MMDetection
See tutorial_02_object_detection.ipynb
