Lecture 4 Detection
Object Detection
Chen-Change Loy
吕健勤
http://www.ntu.edu.sg/home/ccloy/
https://twitter.com/ccloy
Assignment 1
• 20 marks (will be normalized)
• Deadline: 14 Feb 11:59 PM
Slide credits
• Justin Johnson EECS 498-007 / 598-005
So far …
Image classification
Computer vision tasks
Outline
• Task definition
• Region-based CNN and basic concepts
• Intersection over Union (IoU)
• Non-Max Suppression (NMS)
• Mean Average Precision (mAP)
• Fast R-CNN
• Faster R-CNN
• Tutorial 2: Faster-RCNN
Task Definition
• Things
• Stuff
Challenges
Detecting a single object
The pipeline
Often pretrained on ImageNet (transfer learning)
Treat localization as a regression problem!
Multitask loss
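A minimal sketch of what a multitask loss for single-object detection could look like. The slide does not name the individual terms, so the choices here are assumptions: a softmax cross-entropy term for the class scores, a squared L2 term for the four box coordinates, and a hypothetical weight `box_weight` that balances the two.

```python
import numpy as np

def multitask_loss(class_scores, true_class, pred_box, true_box, box_weight=1.0):
    """Weighted sum of a classification loss and a box regression loss."""
    class_scores = np.asarray(class_scores, dtype=float)
    # Softmax cross-entropy over the class scores (numerically stable form).
    shifted = class_scores - np.max(class_scores)
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    cls_loss = -log_probs[true_class]
    # Squared L2 distance between predicted and ground-truth box coordinates.
    diff = np.asarray(pred_box, dtype=float) - np.asarray(true_box, dtype=float)
    box_loss = np.sum(diff ** 2)
    # box_weight is an assumed hyperparameter, not taken from the slides.
    return cls_loss + box_weight * box_loss

loss = multitask_loss([2.0, 0.5, -1.0], 0, [0.1, 0.2, 0.5, 0.5], [0.1, 0.2, 0.6, 0.5])
```

Both terms are differentiable in the network outputs, so one backward pass can train the classification and localization heads together.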
Detecting multiple objects – need different numbers of outputs per image
Detecting multiple objects
Sliding window
Possible positions for a box of size h x w in an H x W image:
(W - w + 1) * (H - h + 1)
Summed over all box sizes, an 800 x 600 image has ~58 billion boxes!
No way we can evaluate them all
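As a worked check of the count: summing the number of positions over every box width w = 1..W and height h = 1..H factorises into W(W+1)/2 * H(H+1)/2. The function names below are illustrative only. For an 800 x 600 image the total comes to about 5.8 x 10^10.

```python
def positions_for_size(W, H, w, h):
    """Number of placements of a w x h box inside a W x H image."""
    return (W - w + 1) * (H - h + 1)

def total_boxes(W, H):
    """Sum of positions_for_size over every box size from 1 x 1 up to W x H."""
    # Closed form of the double sum: widths and heights factorise.
    return (W * (W + 1) // 2) * (H * (H + 1) // 2)

# Brute-force check of the closed form on a tiny 4 x 3 image.
brute = sum(positions_for_size(4, 3, w, h)
            for w in range(1, 5) for h in range(1, 4))
assert brute == total_boxes(4, 3)

print(total_boxes(800, 600))  # 57768120000
```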
Region-based CNN and basic concepts
Region proposals
• Find a small set of boxes that are likely to cover all objects
• Often based on heuristics: e.g. look for “blob-like” image regions
• Relatively fast to run; e.g. Selective Search gives 2000 region proposals in a few seconds on CPU
Selective Search
Goal: Generate many regions, each of which belongs to at most one object.
P. F. Felzenszwalb and D. P. Huttenlocher. “Efficient Graph-Based Image Segmentation.” IJCV, 59:167–181, 2004.
Step 2: Recursively combine similar regions into larger ones.
Greedy algorithm:
1. From set of regions, choose two that are most similar.
2. Combine them into a single, larger region.
3. Repeat until only one region remains.
Similarity measures:
• Color similarity: HSV (hue, saturation, value)
• Texture similarity: HOG-like features
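The greedy merging loop can be sketched as below. This is a toy version under stated assumptions, not the actual Selective Search code. Each region is summarised by one normalised colour histogram, similarity is histogram intersection, and any two regions may merge. The real algorithm only merges neighbouring regions and combines several similarity measures. All names are hypothetical.

```python
import numpy as np

def hist_intersection(a, b):
    """Similarity of two normalised histograms (1.0 means identical)."""
    return float(np.minimum(a, b).sum())

def greedy_merge(histograms, sizes):
    """Merge the two most similar regions until one remains.

    Returns every region ever created as (member_ids, size) tuples;
    these are the candidate proposals at all scales.
    """
    regions = [({i}, np.asarray(h, dtype=float), s)
               for i, (h, s) in enumerate(zip(histograms, sizes))]
    proposals = [(set(ids), size) for ids, _, size in regions]
    while len(regions) > 1:
        # 1. Choose the two most similar regions.
        pairs = [(i, j) for i in range(len(regions))
                 for j in range(i + 1, len(regions))]
        i, j = max(pairs, key=lambda p: hist_intersection(regions[p[0]][1],
                                                          regions[p[1]][1]))
        ids_i, h_i, s_i = regions[i]
        ids_j, h_j, s_j = regions[j]
        # 2. Combine them; the merged histogram is the size-weighted average.
        merged = (ids_i | ids_j, (s_i * h_i + s_j * h_j) / (s_i + s_j), s_i + s_j)
        # 3. Replace the pair with the merged region and repeat.
        regions = [r for k, r in enumerate(regions) if k not in (i, j)] + [merged]
        proposals.append((set(merged[0]), merged[2]))
    return proposals

# Three toy regions: 0 and 1 have similar colour histograms, 2 differs.
hists = [[0.8, 0.2, 0.0], [0.7, 0.3, 0.0], [0.0, 0.1, 0.9]]
props = greedy_merge(hists, sizes=[10, 20, 30])
```

Keeping every intermediate region, not just the final one, is what yields proposals at multiple scales.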
Region-based CNN
Classify each region
Bounding box regression: Predict “transform” to correct the RoI: 4 numbers (t_x, t_y, t_h, t_w)
Training pairs: a proposal and its corresponding ground truth box coordinates
Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014. Figure copyright Ross Girshick, 2015
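A sketch of how the four transform numbers can be applied to a proposal, following the parameterization in the R-CNN paper: the centre is shifted by a fraction of the proposal size, and width and height are scaled in log-space. Boxes are assumed to be given as (centre x, centre y, width, height), and the function names are illustrative.

```python
import numpy as np

def apply_transform(proposal, t):
    """Apply (t_x, t_y, t_h, t_w) to a proposal given as (cx, cy, w, h)."""
    px, py, pw, ph = proposal
    tx, ty, th, tw = t
    # Shift the centre relative to the proposal size; scale size in log-space.
    return np.array([px + pw * tx, py + ph * ty,
                     pw * np.exp(tw), ph * np.exp(th)])

def compute_target(proposal, gt):
    """Inverse of apply_transform: the regression target for (proposal, gt)."""
    px, py, pw, ph = proposal
    gx, gy, gw, gh = gt
    return np.array([(gx - px) / pw, (gy - py) / ph,
                     np.log(gh / ph), np.log(gw / pw)])

proposal = (50.0, 50.0, 20.0, 40.0)
ground_truth = (55.0, 60.0, 30.0, 20.0)
target = compute_target(proposal, ground_truth)
recovered = apply_transform(proposal, target)
```

A transform of all zeros leaves the proposal unchanged, and the targets do not depend on the absolute scale of the box.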
Region-based CNN
Which training pairs (P, G) to use?
• Intuitively, if P is far from all ground-truth boxes, then the task of transforming P to a ground-truth box G does not make sense.
• Therefore, we only learn from a proposal P if it is near at least one ground-truth box. We implement “nearness” by assigning P to the ground-truth box G with which it has maximum IoU overlap (in case it overlaps more than one) if and only if the overlap is greater than a threshold (which we set to 0.6 using a validation set).
• All unassigned proposals are discarded.
• We do this once for each object class in order to learn a set of class-specific bounding-box regressors.
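The selection rule above can be sketched as follows for a single object class. Boxes are assumed to be (x1, y1, x2, y2) corner coordinates, and the helper names are illustrative.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def select_training_pairs(proposals, gt_boxes, threshold=0.6):
    """Pair each proposal with its max-IoU ground-truth box if IoU > threshold."""
    pairs = []
    for p in proposals:
        overlaps = [iou(p, g) for g in gt_boxes]
        best = max(range(len(gt_boxes)), key=lambda k: overlaps[k])
        if overlaps[best] > threshold:
            pairs.append((p, gt_boxes[best]))
        # Otherwise the proposal stays unassigned and is discarded.
    return pairs

gt = [(0, 0, 10, 10), (20, 20, 30, 30)]
proposals = [(1, 1, 10, 10), (5, 5, 15, 15), (21, 20, 30, 30)]
pairs = select_training_pairs(proposals, gt)
```

Here the first and third proposals are kept (IoU 0.81 and 0.9), while the second overlaps no ground-truth box enough and is dropped.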
Comparing Boxes: Intersection over Union (IoU)
[Figure: precision-recall curve, recall axis from 0 to 1; at recall 1: accept everything, miss nothing]
Source: https://en.wikipedia.org/wiki/Precision_and_recall
Mean Average Precision (mAP)
Evaluating Object Detectors
https://cocodataset.org/
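A sketch of one common way to compute Average Precision for a single class, assuming each detection has already been marked as a true or false positive (for example by matching it to an unused ground-truth box with IoU above a chosen threshold). The precision is made monotonically non-increasing before the area is summed, which is one of several conventions in use. mAP is then the mean of the per-class values.

```python
import numpy as np

def average_precision(scores, is_true_positive, num_gt):
    """Area under the precision-recall curve for one class."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    tp = np.asarray(is_true_positive, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(1.0 - tp)
    recall = cum_tp / num_gt
    precision = cum_tp / (cum_tp + cum_fp)
    # Make precision monotonically non-increasing (interpolated precision).
    for i in range(len(precision) - 2, -1, -1):
        precision[i] = max(precision[i], precision[i + 1])
    # Sum rectangle areas: precision times each increase in recall.
    prev_recall = np.concatenate(([0.0], recall[:-1]))
    return float(np.sum((recall - prev_recall) * precision))

# Three detections sorted by score: hit, miss, hit; two ground-truth objects.
ap_dog = average_precision([0.9, 0.8, 0.7], [1, 0, 1], num_gt=2)
ap_cat = average_precision([0.95, 0.6], [1, 1], num_gt=2)
mean_ap = float(np.mean([ap_dog, ap_cat]))
```

To get an AP of 1.0, every true positive must be ranked above every false positive and no ground-truth object may be missed.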
Fast R-CNN
Girshick, “Fast R-CNN”, ICCV 2015. Figure copyright Ross Girshick, 2015
Cropping features – RoI Pool
“Snap” to grid cells
Divide into 2x2 grid of (roughly) equal subregions
Problem: Slight misalignment due to snapping; different-sized subregions are weird
Girshick, “Fast R-CNN”, ICCV 2015. Figure copyright Ross Girshick, 2015
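A small NumPy sketch of RoI Pool, assuming a (C, H, W) feature map, an RoI given in image pixels as (x1, y1, x2, y2), and a known downsampling factor `stride`. Each subregion is max-pooled, as in Fast R-CNN. The bin boundaries use floor and ceil, so neighbouring bins may overlap by one cell. The function name and argument layout are illustrative.

```python
import numpy as np

def roi_pool(features, roi, stride, out_size=2):
    """Max-pool one RoI from a (C, H, W) map into (C, out_size, out_size)."""
    C, H, W = features.shape
    # "Snap" the projected RoI to integer grid cells (this loses alignment).
    x1, y1, x2, y2 = [int(round(v / stride)) for v in roi]
    x1, y1 = min(max(x1, 0), W - 1), min(max(y1, 0), H - 1)
    x2, y2 = min(max(x2, x1 + 1), W), min(max(y2, y1 + 1), H)
    # Split into out_size x out_size roughly equal subregions.
    xs = np.linspace(x1, x2, out_size + 1)
    ys = np.linspace(y1, y2, out_size + 1)
    out = np.zeros((C, out_size, out_size), dtype=features.dtype)
    for i in range(out_size):
        for j in range(out_size):
            r0, r1 = int(np.floor(ys[i])), int(np.ceil(ys[i + 1]))
            c0, c1 = int(np.floor(xs[j])), int(np.ceil(xs[j + 1]))
            # Max over all cells of the subregion, per channel.
            out[:, i, j] = features[:, r0:r1, c0:c1].max(axis=(1, 2))
    return out

features = np.arange(16).reshape(1, 4, 4)   # toy 1-channel 4 x 4 feature map
pooled = roi_pool(features, roi=(0, 0, 23, 30), stride=8)
```

In this example the RoI projects to (0, 0, 2.875, 3.75) and is snapped to (0, 0, 3, 4), so the pooled region no longer matches the original box exactly.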
Cropping features – RoI Align
Girshick, “Fast R-CNN”, ICCV 2015. Figure copyright Ross Girshick, 2015
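RoI Align avoids snapping by sampling the feature map at real-valued locations with bilinear interpolation. The sketch below makes simplifying assumptions: feature values sit at integer coordinates, and each bin takes a single sample at its centre, whereas implementations typically combine several samples per bin. Function names are illustrative.

```python
import numpy as np

def bilinear_sample(features, x, y):
    """Sample a (C, H, W) feature map at real-valued location (x, y)."""
    C, H, W = features.shape
    x = min(max(x, 0.0), W - 1.0)
    y = min(max(y, 0.0), H - 1.0)
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
    wx, wy = x - x0, y - y0
    # Interpolate along x on the two neighbouring rows, then along y.
    top = (1 - wx) * features[:, y0, x0] + wx * features[:, y0, x1]
    bottom = (1 - wx) * features[:, y1, x0] + wx * features[:, y1, x1]
    return (1 - wy) * top + wy * bottom

def roi_align(features, roi, stride, out_size=2):
    """One bilinear sample at each bin centre; no snapping to grid cells."""
    x1, y1, x2, y2 = [v / stride for v in roi]
    bin_w, bin_h = (x2 - x1) / out_size, (y2 - y1) / out_size
    out = np.zeros((features.shape[0], out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            out[:, i, j] = bilinear_sample(features,
                                           x1 + (j + 0.5) * bin_w,
                                           y1 + (i + 0.5) * bin_h)
    return out

features = np.arange(16, dtype=float).reshape(1, 4, 4)  # value = 4*y + x
aligned = roi_align(features, roi=(0, 0, 16, 24), stride=8)
```

Because the toy feature map is linear in x and y, each output equals 4*y + x at its sampling point, which makes the result easy to check by hand.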
Fast R-CNN vs “Slow” R-CNN
Fast R-CNN: Apply differentiable cropping to shared image features
“Slow” R-CNN: Run CNN independently for each region
Problem: Runtime dominated by region proposals!
Recall: Region proposals computed by heuristic “Selective Search” algorithm on CPU. Let’s learn them with a CNN instead!
Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014. He et al, “Spatial pyramid pooling in deep convolutional networks for visual recognition”, ECCV 2014
Girshick, “Fast R-CNN”, ICCV 2015
Faster R-CNN
Faster R-CNN: Learnable Region Proposals
Ren et al, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”, NIPS 2015 Figure copyright 2015, Ross Girshick
Region Proposal Network (RPN)
Anchor category: (C+1) x K x 20 x 15
Box transforms: 4K x 20 x 15 (or C x 4K x 20 x 15)
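To make the output shapes concrete, the sketch below places K anchors at every position of a 20 x 15 feature map and checks that the box-transform output has one 4-vector per anchor. The stride of 32, the anchor sizes, the class count, and the reading of 20 x 15 as (rows, columns) are all assumptions for illustration.

```python
import numpy as np

def make_anchors(feat_h, feat_w, stride, sizes):
    """Return anchors of shape (K, feat_h, feat_w, 4) as (cx, cy, w, h)."""
    ys, xs = np.meshgrid(np.arange(feat_h), np.arange(feat_w), indexing="ij")
    # Centre of each feature-map cell, in image pixels.
    cx = (xs + 0.5) * stride
    cy = (ys + 0.5) * stride
    return np.stack([
        np.stack([cx, cy, np.full_like(cx, w), np.full_like(cx, h)], axis=-1)
        for (w, h) in sizes
    ])

C, K = 20, 3                                   # hypothetical values
sizes = [(64, 64), (128, 64), (64, 128)]       # K assumed (width, height) pairs
anchors = make_anchors(20, 15, stride=32, sizes=sizes)

category_shape = (C + 1, K, 20, 15)            # one score per class or background
transform_shape = (4 * K, 20, 15)              # one 4-number transform per anchor
```

With class-specific regression the transform output grows to C x 4K x 20 x 15, since each class gets its own four numbers per anchor.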
[Figure: image regions labelled Bicycle, Car, Dog, Dining Table]
Conditioned on object: P(Car | Object)
Then we combine the box and class predictions.
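One plausible way to combine the two predictions, as done in YOLO-style detectors, is to multiply each box's objectness P(Object) by the conditional class probabilities P(Class | Object). The slide does not spell out the rule, so treat this as an assumption. The function name is illustrative.

```python
import numpy as np

def class_scores_per_box(p_object, p_class_given_object):
    """Per-box class scores: P(class and object) = P(class | object) * P(object)."""
    p_object = np.asarray(p_object, dtype=float)              # (num_boxes,)
    p_class = np.asarray(p_class_given_object, dtype=float)   # (num_boxes, num_classes)
    return p_object[:, None] * p_class

# Two boxes, three classes (e.g. Bicycle, Car, Dog).
scores = class_scores_per_box([0.9, 0.2],
                              [[0.1, 0.8, 0.1],
                               [0.5, 0.3, 0.2]])
best_class = scores.argmax(axis=1)
```

A box with low objectness gets low scores for every class, even if its conditional class distribution is confident.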
Object detection: lots of variables!
Takeaways:
- Two-stage methods (Faster R-CNN) get the best accuracy, but are slower
- Single-stage methods (SSD) are much faster, but don’t perform as well
- Bigger backbones improve performance, but are slower
- Diminishing returns for slower methods
These results are a few years old ... since then GPUs have gotten faster, and we’ve improved performance with many tricks:
- Train longer!
- Multiscale backbone: Feature Pyramid Networks
- Better backbone: ResNeXt
- Single-stage methods have improved
- Very big models work better
- Test-time augmentation pushes numbers up
- Big ensembles, more data, etc
Huang et al, “Speed/accuracy trade-offs for modern convolutional object detectors”, CVPR 2017
New developments
• Anchor-free detectors
• FCOS
Object detection: open-source code
Object detection is hard! Don’t implement it yourself
Detectron2 (PyTorch):
https://github.com/facebookresearch/detectron2
Fast / Faster / Mask R-CNN, RetinaNet
OpenMMLab MMDetection
https://github.com/open-mmlab/mmdetection