Lecture 4: Detection
AI6126 Advanced Computer Vision

Last update: 4 February 2022 3:35pm

Object Detection
Chen-Change Loy
吕健勤

http://www.ntu.edu.sg/home/ccloy/
https://twitter.com/ccloy
Assignment 1
• 20 marks (will be normalized)
• Deadline: 14 Feb 11:59 PM
Slide credits
• Justin Johnson EECS 498-007 / 598-005
So far …
Image classification
Computer vision tasks
Outline

• Task definition
• Region-based CNN and basic concepts
• Intersection over Union (IoU)
• Non-Max Suppression (NMS)
• Mean Average Precision (mAP)
• Fast R-CNN
• Faster R-CNN

• Tutorial 2: Faster-RCNN
Task Definition

Task definition
• Things
• Stuff

Challenges
There are many “close” false positives, corresponding to “close but not correct” bounding boxes. The detector must find the true positives while suppressing these close false positives.
The pipeline
Detecting a single object

• Treat localization as a regression problem!
• Multitask loss (classification + box regression)
• The CNN is often pretrained on ImageNet (transfer learning)

The pipeline
Detecting multiple objects – we need a different number of outputs per image
Sliding window
Detecting multiple objects

Question: How many possible boxes are there in an image of size H x W?

Consider a box of size h x w:
• Possible x positions: W – w + 1
• Possible y positions: H – h + 1
• Possible positions: (W – w + 1) * (H – h + 1)

It is even worse if we consider boxes of different sizes and ratios: an 800 x 600 image has ~58M boxes! There is no way we can evaluate them all.
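To make the counting concrete, here is a minimal Python sketch (an illustration added here, not from the slides) that applies the position-count formula for a single box size; summing over all sizes and aspect ratios makes the count explode further.

```python
def num_positions(W, H, w, h):
    """Count the axis-aligned placements of a w x h box inside a W x H image."""
    return (W - w + 1) * (H - h + 1)

# A 100 x 50 box in an 800 x 600 image already has ~386k placements,
# and that is just one of many possible sizes and aspect ratios.
print(num_positions(800, 600, 100, 50))  # 386251
```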
Outline

• Task definition
• Region-based CNN and basic concepts
• Intersection over Union (IoU)
• Non-Max Suppression (NMS)
• Mean Average Precision (mAP)
• Fast R-CNN
• Faster R-CNN

• Tutorial 2: Faster-RCNN
Region-based CNN and
basic concepts
Region proposals
• Find a small set of boxes that are likely to cover all objects
• Often based on heuristics: e.g. look for “blob-like” image regions
• Relatively fast to run; e.g. Selective Search gives 2000 region proposals in a few seconds on
CPU

Alexe et al, “Measuring the objectness of image windows”, TPAMI 2012
Uijlings et al, “Selective Search for Object Recognition”, IJCV 2013
Cheng et al, “BING: Binarized normed gradients for objectness estimation at 300fps”, CVPR 2014
Zitnick and Dollar, “Edge boxes: Locating object proposals from edges”, ECCV 2014
Selective Search

Step 1: Generate initial sub-segmentation

Goal: Generate many regions, each of which belongs to at most one object.

P. F. Felzenszwalb and D. P. Huttenlocher. “Efficient Graph-Based Image Segmentation.” IJCV, 59:167–181, 2004.
Selective Search
Step 2: Recursively combine similar regions into larger ones.

Greedy algorithm:
1. From the set of regions, choose the two that are most similar.
2. Combine them into a single, larger region.
3. Repeat until only one region remains.

Similarity measures: color similarity (HSV histograms) and texture similarity (HOG-like features).

This yields a hierarchy of successively larger regions, just like we want.
Selective Search
Step 3: Use the generated regions to produce candidate object locations.
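To make the greedy merging concrete, here is a hedged Python sketch (not from the slides); the `similarity` and `merge` functions are assumed inputs standing in for the color/texture/size/fill measures of Uijlings et al.

```python
def greedy_merge(regions, similarity, merge):
    """Step 2 of Selective Search: repeatedly merge the most similar pair of regions."""
    hierarchy = list(regions)   # every region ever created becomes a candidate location
    while len(regions) > 1:
        # 1. choose the two most similar regions
        i, j = max(((a, b) for a in range(len(regions)) for b in range(a + 1, len(regions))),
                   key=lambda ab: similarity(regions[ab[0]], regions[ab[1]]))
        # 2. combine them into a single, larger region
        merged = merge(regions[i], regions[j])
        regions = [r for k, r in enumerate(regions) if k not in (i, j)] + [merged]
        hierarchy.append(merged)
        # 3. repeat until only one region remains
    return hierarchy            # proposals are the bounding boxes around these regions
```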
Region-based CNN

Classify each region

Bounding box regression:
Predict a “transform” to correct the RoI: 4 numbers (t_x, t_y, t_w, t_h)

What are the target values (t_x, t_y, t_w, t_h)?

Region proposal: (p_x, p_y, p_w, p_h)
Transform: (t_x, t_y, t_w, t_h)
Output box: (b_x, b_y, b_w, b_h)

Translate relative to box size:
b_x = p_x + p_w * t_x
b_y = p_y + p_h * t_y

Log-space scale transform:
b_w = p_w * exp(t_w)
b_h = p_h * exp(t_h)

The regression targets are computed from the proposal (p_x, p_y, p_w, p_h) and the coordinates of its corresponding ground-truth box (g_x, g_y, g_w, g_h):
t_x = (g_x – p_x) / p_w
t_y = (g_y – p_y) / p_h
t_w = log(g_w / p_w)
t_h = log(g_h / p_h)

A standard regression model can solve the problem by minimizing the sum of squared loss with regularization. The output box (b_x, b_y, b_w, b_h) comes from the bounding box regression function.

Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014. Figure copyright Ross Girshick, 2015
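A minimal Python sketch of this parameterization (added for illustration; boxes are given as center x, center y, width, height):

```python
import math

def apply_transform(p, t):
    """Apply a predicted transform t = (tx, ty, tw, th) to a proposal p = (px, py, pw, ph)."""
    px, py, pw, ph = p
    tx, ty, tw, th = t
    bx = px + pw * tx          # translate relative to box size
    by = py + ph * ty
    bw = pw * math.exp(tw)     # log-space scale transform
    bh = ph * math.exp(th)
    return bx, by, bw, bh

def regression_targets(p, g):
    """Target transform that maps proposal p onto its assigned ground-truth box g."""
    px, py, pw, ph = p
    gx, gy, gw, gh = g
    return (gx - px) / pw, (gy - py) / ph, math.log(gw / pw), math.log(gh / ph)

# Round trip: applying the targets to the proposal recovers the ground-truth box.
p, g = (50.0, 60.0, 100.0, 80.0), (55.0, 58.0, 120.0, 70.0)
b = apply_transform(p, regression_targets(p, g))
assert all(abs(bi - gi) < 1e-6 for bi, gi in zip(b, g))
```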
Region-based CNN
Test time
Input: Single RGB image

1. Run a region proposal method to compute ~2000 region proposals
2. Resize each region to 224x224 and run it independently through the CNN to predict class scores and a bbox transform
3. Use the scores to select a subset of region proposals to output
   (Many choices here: threshold on background, or per-category? Or take the top K proposals per image?)
4. Compare with ground-truth boxes

Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014. Figure copyright Ross Girshick, 2015
Region-based CNN
Which training pairs (P, G) to use?
• Intuitively, if P is far from all ground-truth boxes, then the task of transforming P to a ground-truth box G does not make sense.
• Therefore, we only learn from a proposal P if it is near at least one ground-truth box. We implement “nearness” by assigning P to the ground-truth box G with which it has maximum IoU overlap (in case it overlaps more than one), if and only if the overlap is greater than a threshold (which we set to 0.6 using a validation set).
• All unassigned proposals are discarded.
• We do this once for each object class in order to learn a set of class-specific bounding-box regressors.
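A hedged sketch of this assignment rule (assuming an iou(box_a, box_b) helper like the one sketched in the IoU section below):

```python
def assign_proposals(proposals, gt_boxes, iou, thresh=0.6):
    """Keep only proposals near some ground-truth box; pair each with its max-IoU GT."""
    pairs = []
    for p in proposals:
        overlaps = [iou(p, g) for g in gt_boxes]
        if overlaps and max(overlaps) > thresh:
            g = gt_boxes[overlaps.index(max(overlaps))]
            pairs.append((p, g))          # training pair (P, G) for the box regressor
        # proposals with no overlap above the threshold are discarded
    return pairs
```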
Comparing Boxes: Intersection over Union (IoU)

How can we compare our prediction to the ground-truth box?

Intersection over Union (IoU), also called “Jaccard similarity” or “Jaccard index”: the area of the intersection of the two boxes divided by the area of their union.

IoU > 0.5 is “decent”, IoU > 0.7 is “pretty good”, IoU > 0.9 is “almost perfect”
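A minimal Python sketch of IoU for axis-aligned boxes given as (x1, y1, x2, y2) corners (added for illustration):

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # intersection rectangle (empty if the boxes do not overlap)
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.14
```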
Overlapping boxes
Problem: Object detectors often output many overlapping detections

Non-Max Suppression (NMS)
Solution: Post-process raw detections using Non-Max Suppression (NMS)
1. Select the next highest-scoring box
2. Eliminate lower-scoring boxes with IoU > threshold (e.g. 0.7)
3. If any boxes remain, GOTO 1

Problem: NMS may eliminate “good” boxes when objects are highly overlapping... no good solution
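A minimal sketch of greedy NMS, reusing the iou() helper above (added for illustration):

```python
def nms(boxes, scores, iou_thresh=0.7):
    """Greedy Non-Max Suppression; returns indices of the boxes to keep."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)                     # 1. select the next highest-scoring box
        keep.append(best)
        order = [i for i in order               # 2. drop boxes that overlap it too much
                 if iou(boxes[best], boxes[i]) <= iou_thresh]
    return keep                                 # 3. repeat until no boxes remain
```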
Mean Average Precision (mAP)
Evaluating Object Detectors

1. Run the object detector on all test images (with NMS)
2. For each category, compute Average Precision (AP) = area under the Precision vs Recall curve
   a. For each detection (from highest score to lowest score):
      i. If it matches some GT box with IoU > 0.5, mark it as positive and eliminate that GT
      ii. Otherwise mark it as negative
      iii. Plot a point on the PR curve
   b. Average Precision (AP) = area under the PR curve
3. Mean Average Precision (mAP) = average of the AP over all categories
4. For “COCO mAP”: compute mAP@thresh for each IoU threshold (0.5, 0.55, 0.6, ..., 0.95) and take the average

Precision/recall intuition: if you reject everything you make no mistakes (precision 1, recall 0); if you accept everything you miss nothing (recall 1, but low precision). The ideal detector sits at the top-right corner; the curve is summarized by the area under it (average precision).
Source: https://en.wikipedia.org/wiki/Precision_and_recall

How to get AP = 1.0? Hit all GT boxes with IoU > 0.5, and have no “false positive” detections ranked above any “true positives”.

https://cocodataset.org/
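The per-category AP computation can be sketched as follows (an illustration of the steps above, reusing the iou() helper from earlier; real VOC/COCO evaluation adds interpolation and per-image bookkeeping not shown here):

```python
def average_precision(detections, gt_boxes, iou, iou_thresh=0.5):
    """detections: list of (score, box) for one category; gt_boxes: list of GT boxes."""
    detections = sorted(detections, key=lambda d: d[0], reverse=True)
    matched = [False] * len(gt_boxes)
    tp = fp = 0
    points = []                                   # one (recall, precision) point per detection
    for score, box in detections:
        ious = [0.0 if matched[i] else iou(box, g) for i, g in enumerate(gt_boxes)]
        if ious and max(ious) > iou_thresh:       # matches some not-yet-matched GT box
            matched[ious.index(max(ious))] = True
            tp += 1
        else:
            fp += 1
        points.append((tp / max(len(gt_boxes), 1), tp / (tp + fp)))
    # area under the PR curve via simple rectangles over recall increments
    ap, prev_recall = 0.0, 0.0
    for recall, precision in points:
        ap += precision * (recall - prev_recall)
        prev_recall = recall
    return ap
```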
Region-based CNN

Problem: Very slow! Need to do ~2k forward passes for each image!

Solution: Run CNN *before* warping!

Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014. Figure copyright Ross Girshick, 2015
Outline

• Task definition
• Region-based CNN and basic concepts
• Intersection over Union (IoU)
• Non-Max Suppression (NMS)
• Mean Average Precision (mAP)
• Fast R-CNN
• Faster R-CNN

• Tutorial 2: Faster-RCNN
Fast R-CNN

Fast R-CNN
Idea: run the backbone CNN once over the whole image, then crop and process features for each region proposal from the shared feature map.

How to crop features?

Girshick, “Fast R-CNN”, ICCV 2015. Figure copyright Ross Girshick, 2015
Cropping features – RoI Pool

“Snap” the box to grid cells, divide it into a 2x2 grid of (roughly) equal subregions, and max-pool within each subregion.

Problem: Slight misalignment due to snapping; different-sized subregions are weird.

RoI pooling quantizes a floating-number RoI to the discrete granularity of the feature map. These quantizations introduce misalignments between the RoI and the extracted features. While this may not impact classification (which is robust to small translations), it has a large negative effect on predicting pixel-accurate masks.

Girshick, “Fast R-CNN”, ICCV 2015. Figure copyright Ross Girshick, 2015
Cropping features – RoI Align

Divide into equal-sized subregions with no “snapping” (the subregions may not be aligned to the grid!).

Sample features at regularly-spaced points in each subregion using bilinear interpolation. The feature f_xy for a point (x, y) is a linear combination of the features at its four neighboring grid cells.

The results are not sensitive to the exact sampling locations, or to how many points are sampled, as long as no quantization is performed.

This is differentiable! The upstream gradient for a sampled feature flows backward into each of the four nearest-neighbor gridpoints.

After sampling, max-pool in each subregion.

Girshick, “Fast R-CNN”, ICCV 2015. Figure copyright Ross Girshick, 2015
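The bilinear sampling step can be sketched as follows (an illustration added here; edge handling and the choice of sampling points are omitted):

```python
def bilinear_sample(feat, x, y):
    """Value at real-valued point (x, y): weighted sum of the four neighboring cells.

    feat is a 2D array-like indexed as feat[row][col], (x, y) in feature-map coordinates.
    """
    x0, y0 = int(x), int(y)
    x1, y1 = x0 + 1, y0 + 1
    wx, wy = x - x0, y - y0
    return (feat[y0][x0] * (1 - wx) * (1 - wy) +
            feat[y0][x1] * wx * (1 - wy) +
            feat[y1][x0] * (1 - wx) * wy +
            feat[y1][x1] * wx * wy)

feat = [[0.0, 1.0], [2.0, 3.0]]
print(bilinear_sample(feat, 0.5, 0.5))  # 1.5, the average of the four neighbors
```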
Fast R-CNN vs “Slow” R-CNN

Fast R-CNN: Apply differentiable cropping to shared image features.
“Slow” R-CNN: Run the CNN independently for each region.

Problem: Runtime is dominated by region proposals!

Recall: Region proposals are computed by the heuristic “Selective Search” algorithm on the CPU -- let’s learn them with a CNN instead!

Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014.
He et al, “Spatial pyramid pooling in deep convolutional networks for visual recognition”, ECCV 2014.
Girshick, “Fast R-CNN”, ICCV 2015.
Outline

• Task definition
• Region-based CNN and basic concepts
• Intersection over Union (IoU)
• Non-Max Suppression (NMS)
• Mean Average Precision (mAP)
• Fast R-CNN
• Faster R-CNN

• Tutorial 2: Faster-RCNN
Faster R-CNN
Faster R-CNN: Learnable Region Proposals

Ren et al, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”, NIPS 2015 Figure copyright 2015, Ross Girshick
Region Proposal Network (RPN)

Imagine an anchor box of fixed size at each point in the feature map.

At each point, predict whether the corresponding anchor contains an object (per-cell logistic regression; predict scores with a conv layer).

For positive boxes, also predict a box transform to regress from the anchor box to the object box.

Problem: An anchor box may have the wrong size / shape.
Solution: Use K different anchor boxes at each point!

At test time: sort all K*20*15 boxes by their score, and take the top ~300 as our region proposals.
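A hedged sketch of laying out K anchors over a 20 x 15 feature map (the stride, sizes, and ratios below are illustrative choices, not the exact values from the paper):

```python
def make_anchors(feat_w=20, feat_h=15, stride=32,
                 sizes=(64, 128, 256), ratios=(0.5, 1.0, 2.0)):
    """Place K = len(sizes) * len(ratios) anchors at every feature-map cell."""
    anchors = []  # each anchor: (center_x, center_y, width, height) in image pixels
    for j in range(feat_h):
        for i in range(feat_w):
            cx, cy = (i + 0.5) * stride, (j + 0.5) * stride
            for s in sizes:
                for r in ratios:
                    w, h = s * (r ** 0.5), s / (r ** 0.5)   # same area, varying aspect
                    anchors.append((cx, cy, w, h))
    return anchors

anchors = make_anchors()
print(len(anchors))  # 20 * 15 * 9 = 2700 anchors; the RPN scores each one and regresses a transform
```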
Faster R-CNN: Learnable Region Proposals

Ren et al, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”, NIPS 2015. Figure copyright 2015, Ross Girshick

Question: Do we really need the second stage?
Single-Stage Object Detection

RPN: Classify each anchor as object / not object.
Single-Stage Detector: Classify each anchor as one of C categories (or background).

Anchor category scores: (C+1) x K x 20 x 15
Box transforms: 4K x 20 x 15

Remember: K anchors at each position in the image feature map.

Sometimes category-specific regression is used: predict different box transforms for each category, so the box-transform output becomes C x 4K x 20 x 15.
YOLO
(If we have time)

Credit
• The original slides of YOLO: https://docs.google.com/presentation/d/1kAa7NOamBt4calBU9iHgT8a86RRHz9Yz2oh4-GTdX6M/edit#slide=id.p

YOLO: You Only Look Once
Extremely fast (45 frames per second, compared to 7 fps for Faster R-CNN in the same setting).
We split the image into an S x S grid.

Each cell predicts B boxes (x, y, w, h) and a confidence for each box: P(Object).

Each grid cell predicts a fixed number of boundary boxes. In this example, the grid cell makes two boundary box predictions (black boxes) to locate where the car is.

Do you see the weakness of YOLO?

Each cell also predicts a class probability, conditioned on there being an object: e.g. P(Car | Object). (Example classes: Bicycle, Car, Dog, Dining Table.)

Then we combine the box and class predictions:
P(class | Object) * P(Object) = P(class)
(conditional class probability * box confidence score = class confidence score)

Finally we do NMS and threshold detections.
Each cell predicts:
- For each bounding box:
  - 4 coordinates (x, y, w, h)
  - 1 confidence value
- Some number of class probabilities

This parameterization fixes the output size.

For Pascal VOC:
- 7x7 grid
- 2 bounding boxes / cell
- 20 classes
7 x 7 x (2 x 5 + 20) = 7 x 7 x 30 tensor = 1470 outputs

Thus we can train one neural network to be a whole detection pipeline.
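A one-line sanity check of the fixed output size for the Pascal VOC setting (added for illustration):

```python
def yolo_output_size(S=7, B=2, C=20):
    """Per cell: B boxes x (4 coords + 1 confidence) + C class probabilities."""
    return S * S * (B * 5 + C)

print(yolo_output_size())  # 7 * 7 * 30 = 1470 outputs
```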
YOLO Series
• YOLO (Joseph Redmon et al. 2016)
• YOLOv2 (Joseph Redmon and Ali Farhadi 2017)
• YOLOv3 (Joseph Redmon and Ali Farhadi 2018)
• YOLOv4 (Alexey Bochkovskiy 2020)
• YOLOv5 (Ultralytics)
• YOLOX (Zheng Ge et al 2021)
Summary
Object detection: lots of variables!

Takeaways:
- Two-stage methods (Faster R-CNN) get the best accuracy, but are slower
- Single-stage methods (SSD) are much faster, but don’t perform as well
- Bigger backbones improve performance, but are slower
- Diminishing returns for slower methods

Huang et al, “Speed/accuracy trade-offs for modern convolutional object detectors”, CVPR 2017
Object detection: lots of variables!

These results are a few years old ... since then GPUs have gotten faster, and we’ve improved performance with many tricks:
- Train longer!
- Multiscale backbone: Feature Pyramid Networks
- Better backbone: ResNeXt
- Single-stage methods have improved
- Very big models work better
- Test-time augmentation pushes numbers up
- Big ensembles, more data, etc.

Huang et al, “Speed/accuracy trade-offs for modern convolutional object detectors”, CVPR 2017
New developments
• Anchor-free detectors
  • FCOS
• End-to-end (NMS-free) detectors
  • CenterNet
  • DETR
Object detection: open source codes
Object detection is hard! Don’t implement it yourself

TensorFlow Detection API:
https://github.com/tensorflow/models/tree/master/research/object_detection
Faster R-CNN, SSD, RFCN, Mask R-CNN

Detectron2 (PyTorch):
https://github.com/facebookresearch/detectron2
Fast / Faster / Mask R-CNN, RetinaNet

OpenMMLab MMDetection:
https://github.com/open-mmlab/mmdetection

Tutorial 2: Faster R-CNN using MMDetection
See tutorial_02_object_detection.ipynb
