DEEP LEARNING
Object Classification, Detection,
Localization and Segmentation
Adrian Horzyk
AGH University of Science and Technology, Krakow, Poland
[email protected]
Object Classification, Localization and Detection
Tasks that can be performed on images:
• Classification
• Classification with localization
• Detection
• Instance Segmentation
• Semantic Segmentation
Classification is to determine to which class the main object (or sometimes every
object) in the image belongs.
Classification with localization not only classifies the main object in the image but also
localizes it, determining its bounding box (position and size) or localization anchors.
Detection tries to find all objects of the previously trained (known) classes in the image
and localize them.
Instance Segmentation is to outline each individual object instance with a pixel-level mask,
keeping different instances of the same class apart.
Semantic Segmentation is to distinguish between classes at the pixel level, assigning a class
label to every pixel without separating individual instances.
Classification with Localization
Classification using DL is to determine the class of the main object
(which is usually in the centre of the image):
• The number of classes is usually limited, and everything else is classified as
background or nothing:
car
pedestrian
…
background
• When localizing the object, the network has extra outputs
defining the bounding box (bx, by, bh, bw) of the object:
• bx – x-axis coordinate of the center of the object
• by – y-axis coordinate of the center of the object
• bh – height of the object (its bounding box)
• bw – width of the object (its bounding box)
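A minimal sketch of such an output head, assuming a backbone that produces a 256-dimensional feature vector (the sizes, class count and names here are illustrative assumptions, not the lecture's exact network):

```python
# Classification + localization head: K class scores plus 4 bounding-box values (bx, by, bh, bw).
import torch
import torch.nn as nn

K = 4            # e.g. car, pedestrian, ..., background (assumption)
FEATURES = 256   # size of the backbone feature vector (assumption)

class ClsLocHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.class_scores = nn.Linear(FEATURES, K)  # which class
        self.bbox = nn.Linear(FEATURES, 4)          # bx, by, bh, bw

    def forward(self, x):
        # box values squashed to [0, 1] with a sigmoid
        return self.class_scores(x), torch.sigmoid(self.bbox(x))

head = ClsLocHead()
scores, box = head(torch.randn(1, FEATURES))
print(scores.shape, box.shape)  # torch.Size([1, 4]) torch.Size([1, 4])
```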
Defining Target Labels for Training
The target vector for training has the form:

y = [pc, bx, by, bh, bw, c1, c2, …, cK]ᵀ

where
pc – probability of the detection of an object of one of the specified classes in the image,
which is equal to 1 when an object is present and 0 otherwise during training
bx – x-coordinate of the bounding box of the object
by – y-coordinate of the bounding box of the object
bh – height of the bounding box of the object
bw – width of the bounding box of the object
c1, c2, …, cK – the possible trained classes of the input image, where
only one ck is equal to 1 and the others are equal to 0
? – values not taken into account in the loss function, because we do not
care about them when no object is detected

Example 1: If there is an object of class c2:
y = [1, bx, by, bh, bw, 0, 1, 0, …, 0]ᵀ

Example 2: If there is no object of any of the defined classes:
y = [0, ?, ?, ?, ?, ?, ?, …, ?]ᵀ
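A minimal sketch of building such a target vector (make_target is a hypothetical helper; NaN stands in for the "?" entries ignored by the loss):

```python
import numpy as np

K = 3  # number of trained classes (assumption)

def make_target(obj_class=None, box=None):
    """obj_class: 1..K or None (no object); box: (bx, by, bh, bw)."""
    y = np.full(1 + 4 + K, np.nan)      # NaN plays the role of "?" (ignored by the loss)
    if obj_class is None:
        y[0] = 0.0                      # pc = 0, everything else is "don't care"
    else:
        y[0] = 1.0                      # pc = 1
        y[1:5] = box                    # bx, by, bh, bw
        y[5:] = 0.0
        y[4 + obj_class] = 1.0          # one-hot class indicator
    return y

print(make_target(obj_class=2, box=(0.5, 0.5, 0.3, 0.4)))  # Example 1 (object of class c2)
print(make_target())                                        # Example 2 (no object)
```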
Landmark Detection
In a similar way, we can detect various landmarks in images and
use them to recognize facial gestures and emotional expressions, or to model the face:
Object Detection and Cropping Out
Object detection can be performed in a few ways:
• using a sliding window of the same size or of various sizes with different strides
(high computational cost because of the many window positions) – sliding window detection,
• using a grid (mesh) of fixed windows (YOLO – you only look once),
and putting the cropped image (window) on the input of the ConvNet:
Convolutional Implementation of Sliding Windows
Many computations for neighbouring sliding windows repeat, as shown by the blue sliding
window and the red one (their shared area) after a two-pixel stride.
Therefore, we implement the sliding windows in parallel and share the computations
that are the same for different sliding windows, which speeds up processing.
Convolutional Implementation of Sliding Windows
We can see how the convolutional implementation of the sliding window works
on the image. The drawback is that the position of the bounding box designated by the
sliding window might not be very accurate. Moreover, if we want to fit each object
better, we have to use many such parallel convolutional networks for various sliding-window
sizes. Even then, we may not be able to use appropriately adjusted window sizes and may
obtain poor bounding boxes for the classified objects.
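A minimal sketch of the convolutional trick, assuming a toy classifier trained on 14x14 windows (the layer sizes and class count are illustrative assumptions): replacing the fully connected layers with 1x1 convolutions lets a single forward pass over a larger image evaluate all window positions at once.

```python
import torch
import torch.nn as nn

K = 4  # number of classes (assumption)

classifier = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=5), nn.ReLU(),   # 14x14 -> 10x10
    nn.MaxPool2d(2),                              # 10x10 -> 5x5
    nn.Conv2d(16, 32, kernel_size=5), nn.ReLU(),  # 5x5  -> 1x1  ("FC" layer written as a conv)
    nn.Conv2d(32, K, kernel_size=1),              # 1x1 conv instead of the final FC layer
)

single_window = torch.randn(1, 3, 14, 14)
print(classifier(single_window).shape)   # torch.Size([1, 4, 1, 1]) - one window, one prediction

larger_image = torch.randn(1, 3, 16, 16)
print(classifier(larger_image).shape)    # torch.Size([1, 4, 2, 2]) - 2x2 grid of window predictions
```

The effective stride between neighbouring window predictions is two pixels here, coming from the 2x2 pooling layer, which matches the two-pixel stride mentioned above.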
YOLO – You Only Look Once
In YOLO, we put a grid of fixed size on the image:
• Each object is assigned to only a single grid cell – the one that contains the midpoint of
this object, taking into account its ground-truth frame defined in the training dataset:
• In all other cells, this object is not represented, even if they contain fragments of
this object or its bounding box (frame).
• For each grid cell, we create
a (K+5)-dimensional vector storing
the bounding box and class parameters:
• The target (trained) output is a 3D matrix
of S x S x (K+5) dimensions, where
S is the number of grid cells in each row
and column.
• This approach works as long as there is only one
object in each grid cell. In practice, the grid is
usually bigger than in this example, e.g. 19x19,
so there is less chance of having more than one object midpoint inside a single
grid cell.
YOLO’s bounding boxes
The YOLO bounding boxes (bx, by, bw, bh) are computed using the following formulas:

bx = σ(tx) + cx
by = σ(ty) + cy
bw = pw · e^tw
bh = ph · e^th

where
tx, ty, tw, th is what the YOLO network outputs,
cx and cy are the top-left coordinates of the grid cell, and
pw and ph are the anchor dimensions for the grid cell (box).
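A minimal sketch of these decoding formulas applied to one box (the numeric inputs are made up for illustration):

```python
import math

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    sigmoid = lambda t: 1.0 / (1.0 + math.exp(-t))
    bx = sigmoid(tx) + cx          # center x: offset inside the cell plus the cell position
    by = sigmoid(ty) + cy          # center y
    bw = pw * math.exp(tw)         # width scales the anchor width
    bh = ph * math.exp(th)         # height scales the anchor height
    return bx, by, bw, bh

print(decode_box(0.2, -0.1, 0.5, 0.3, cx=3, cy=5, pw=1.5, ph=2.0))
```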
Specifying the Bounding Boxes in YOLO
We specify the bounding boxes in YOLO in the following way:
• The upper-left corner of each grid cell has coordinates (0,0).
• The bottom-right corner of each grid cell has coordinates (1,1).
• We measure the midpoint of the object
in these coordinates, here (0.4,0.3).
• The width (height) of the object is measured
as the fraction of the overall width (height) of
this grid cell box (frame).
For this example, the target vector is:

y = [pc, bx, by, bh, bw, c1, c2, …, cK]ᵀ = [1, 0.4, 0.3, 0.9, 0.8, 1, 0, …, 0]ᵀ
• The midpoints are always between 0 and 1, while widths and heights can be greater than 1.
• If we want to use a sigmoid function (not ReLU) in the output layer and need all
widths and heights to be between 0 and 1, we can divide the widths by the number of grid cells
in a row (bw/S) and the heights by the number of grid cells in a column (bh/S).
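A minimal sketch of encoding a ground-truth box into these grid-cell-relative values (encode_box is a hypothetical helper; the box is assumed to be given in image coordinates normalized to [0, 1]):

```python
def encode_box(x_center, y_center, width, height, S):
    col = int(x_center * S)           # which grid cell contains the midpoint
    row = int(y_center * S)
    bx = x_center * S - col           # midpoint inside the cell, in [0, 1)
    by = y_center * S - row
    bw = width * S                    # width/height as fractions of the cell size
    bh = height * S                   # (may be greater than 1 for large objects)
    return row, col, bx, by, bh, bw

print(encode_box(0.48, 0.36, 0.27, 0.30, S=3))  # e.g. with a 3x3 grid
```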
Intersection Over Union
Intersection Over Union (IOU):
• Is used to measure the quality of the estimated bounding box against the ground-truth
bounding box defined in the training dataset.
• A detection is treated as correct if IOU ≥ 0.5, or more, depending on the application.
• Is a measure of the overlap between two bounding boxes.
• Is computed as the ratio of the size of the intersection of the two bounding boxes
to the size of their union:

IOU = size of intersection / size of union
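A minimal sketch of computing IOU for two boxes given as (x1, y1, x2, y2) corner coordinates:

```python
def iou(box_a, box_b):
    # intersection rectangle
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1 / 7 ≈ 0.143
```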
Non-Max Suppression of YOLO
Non-max suppression avoids multiple bounding boxes for a detected object, leaving only
the one with the highest probability pc.
• When using bigger grids, many grid cells might think that they represent the midpoint
of the detected object.
• As a result, every such cell will produce a bounding box, so we get multiple bounding
boxes for the same object.
• YOLO chooses the one with the highest probability pc computed for each grid cell.
Non-Max Suppression of YOLO
Non-Max Suppression works as follows:
1. Discard all bounding boxes estimated by the convolutional network whose
probability pc ≤ 0.6.
2. While there are any remaining bounding boxes:
1. Pick the one with the largest pc and output it as a prediction of the detected object.
(selection step)
2. Discard any remaining bounding box with IOU ≥ 0.5 with the box output in the
previous step. (pruning/suppression step)
For multiple object detection of different classes, we perform non-max suppression
for each of these classes independently.
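A minimal sketch of this procedure for a single class; each prediction is a (pc, box) pair and iou() is the helper from the earlier sketch. The 0.6 and 0.5 thresholds follow the slide.

```python
def non_max_suppression(predictions, score_thr=0.6, iou_thr=0.5):
    # 1. discard low-confidence boxes (pc <= 0.6)
    boxes = [p for p in predictions if p[0] > score_thr]
    boxes.sort(key=lambda p: p[0], reverse=True)      # highest pc first
    kept = []
    while boxes:
        best = boxes.pop(0)                           # selection step
        kept.append(best)
        # suppression step: drop boxes overlapping the selected one too much
        boxes = [p for p in boxes if iou(best[1], p[1]) < iou_thr]
    return kept
```

For several classes, this function would simply be called once per class on that class's predictions.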
Anchor Boxes for Multiple Object Detection
When two or more objects are in almost the same place in the image
and the midpoints of their ground-truth bounding boxes fall into
the same grid cell, we cannot use the previous algorithm. Instead, we define
a few anchor boxes with predefined shapes, associated with the
different classes of objects that can occur in the same grid cell:
Example:
Anchor box 1 (A1):
Anchor box 2 (A2):
The YOLO algorithm with anchor boxes assigns each object in a
training image to the grid cell that contains the object’s midpoint
and to the anchor box of that grid cell with the highest IOU.
Anchor Boxes and Target Setup
For two anchor boxes in a grid cell, we consider four cases (the corresponding target
vectors are reconstructed below):
1. There are no midpoints of objects in the cell.
2. There is one midpoint of an object matching anchor 1 and of class c1 in the cell.
3. There is one midpoint of an object matching anchor 2 and of class c2 in the cell.
4. There are two midpoints of two objects, one matching anchor 1 and one matching anchor 2,
of classes c1 and c2, in the cell.
The target vector stacks the (5+K) values for anchor 1 (A1) on top of the (5+K) values for anchor 2 (A2):

y = [pc^A1, bx^A1, by^A1, bh^A1, bw^A1, c1^A1, c2^A1, …, cK^A1, pc^A2, bx^A2, by^A2, bh^A2, bw^A2, c1^A2, c2^A2, …, cK^A2]ᵀ

(1) No objects in the cell: y = [0, ?, ?, ?, ?, ?, ?, …, ?, 0, ?, ?, ?, ?, ?, ?, …, ?]ᵀ
(2) One object matching anchor 1, class c1: y = [1, bx^A1, by^A1, bh^A1, bw^A1, 1, 0, …, 0, 0, ?, ?, ?, ?, ?, ?, …, ?]ᵀ
(3) One object matching anchor 2, class c2: y = [0, ?, ?, ?, ?, ?, ?, …, ?, 1, bx^A2, by^A2, bh^A2, bw^A2, 0, 1, …, 0]ᵀ
(4) Two objects, one per anchor: both halves of the vector are filled with the respective box coordinates and one-hot class indicators.

As before, the values marked ? are not taken into account in the loss function.
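A minimal sketch of choosing the anchor for a ground-truth box by IOU of shapes only, i.e. comparing (width, height) pairs as if the boxes shared the same center; the two anchor shapes are illustrative assumptions:

```python
def shape_iou(wh_a, wh_b):
    # IOU of two boxes aligned at the same center, so only widths/heights matter
    inter = min(wh_a[0], wh_b[0]) * min(wh_a[1], wh_b[1])
    union = wh_a[0] * wh_a[1] + wh_b[0] * wh_b[1] - inter
    return inter / union

def best_anchor(gt_wh, anchors):
    return max(range(len(anchors)), key=lambda i: shape_iou(gt_wh, anchors[i]))

anchors = [(1.0, 2.5), (2.5, 1.0)]        # A1: tall (e.g. pedestrian), A2: wide (e.g. car) - assumption
print(best_anchor((0.8, 2.0), anchors))   # 0 -> anchor 1
print(best_anchor((2.2, 0.9), anchors))   # 1 -> anchor 2
```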
YOLO Detection Model
Classic YOLO Network Architecture
The YOLO network architecture is convolutional, with the output defined
as a 3D matrix of size S x S x (A x 8), where 8 corresponds to the 5 box/confidence
values plus K = 3 classes per anchor:
• S – the number of cells in each row and column
• A – the number of anchors
However, we can modify the original YOLO model in such a way that the numbers of
cells in rows and columns differ.
YOLOv3 Network Architecture
The YOLOv3 network uses extra operations (concatenation and addition)
as well as residual blocks, detection layers and upsampling layers.
Precision and Recall
Confusion Matrix
• Specifies how many examples were correctly classified as positive
(TP), negative (TN) and how many were misclassified as positive
(FP) or negative (FN).
Precision
• measures how accurate the predictions are, i.e. the percentage of the
predictions that are correct:

Precision = TP / (TP + FP)
Recall
• measures how well all the positives are found. For example, we can
find 80% of the possible positive cases among our top K predictions:

Recall = TP / (TP + FN)
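A minimal sketch of both measures computed from confusion-matrix counts (the counts are made up for illustration):

```python
def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0

# e.g. 80 correct detections, 20 false alarms, 20 missed objects (made-up numbers)
print(precision(80, 20))  # 0.8
print(recall(80, 20))     # 0.8
```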
Mean Average Precision
Average Precision (AP):
• Is a popular metric for measuring the accuracy of object detectors
like Faster R-CNN, SSD, YOLO, etc. Average precision computes
the average precision value over recall values from 0 to 1:

AP = ∫₀¹ p(r) dr

• where p(r) is the precision-recall curve.
Mean Average Precision (mAP):
• Averages the AP over all object classes (and, in some benchmarks, additionally
over several IOU thresholds), giving a single number that summarizes the accuracy
of object detectors like Faster R-CNN, SSD, YOLO, etc.
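A minimal sketch of approximating AP as the area under a (monotonized) precision-recall curve by numerical integration; the sample points are made up for illustration:

```python
import numpy as np

recalls    = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
precisions = np.array([1.0, 0.9, 0.8, 0.7, 0.5, 0.3])

# make precision non-increasing (the usual interpolation step), then integrate p(r) dr
precisions = np.maximum.accumulate(precisions[::-1])[::-1]
ap = np.trapz(precisions, recalls)
print(ap)  # ≈ 0.71 for these made-up points

# mAP would then be the mean of AP over all object classes
# (and, in COCO-style evaluation, also over several IOU thresholds).
```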
R-CNN, Fast R-CNN, and Faster R-CNN
R-CNN stands for Regions with ConvNet detection:
• Uses a segmentation algorithm to propose candidate regions.
• The ConvNet is then run on a large number of these proposed blocks (regions) to classify them.
• R-CNN classifies one proposed region at a time.
• For each region, we get an output label + a bounding box.
Fast R-CNN:
• A convolutional implementation
of sliding windows to classify
all the proposed regions.
Faster R-CNN:
• Uses a convolutional network to propose regions.
Semantic Segmentation Using Deep Learning
Semantic Segmentation:
• Assigns a class label to every pixel of the image, so that the scene is partitioned into
class regions without distinguishing individual object instances.
• https://www.cs.toronto.edu/~tingwuwang/semantic_segmentation.pdf
• https://www.mathworks.com/help/vision/ug/getting-started-with-semantic-segmentation-using-deep-learning.html
• https://medium.com/nanonets/how-to-do-image-segmentation-using-deep-learning-c673cc5862ef
RetinaNet
RetinaNet:
• evaluates on the order of ~100k candidate boxes per image and resolves the resulting class
imbalance problem using the focal loss.
• Many one-stage detectors do not achieve good enough accuracy, so new one-stage detectors
like RetinaNet were built to close the gap to two-stage detectors:
RetinaNet
RetinaNet:
• In RetinaNet, a one-stage detector, the focal loss makes “easy” negative samples contribute
less loss, so the loss focuses on “hard” samples, which improves the prediction accuracy.
With ResNet+FPN as the backbone for feature extraction, plus two task-specific subnetworks
for classification and bounding-box regression, RetinaNet achieves state-of-the-art
performance and outperforms Faster R-CNN, the well-known two-stage detector. It is a 2017
ICCV Best Student Paper Award paper with more than 500 citations. (The first author, Tsung-Yi Lin,
had become a Research Scientist at Google Brain by the time he presented RetinaNet at ICCV 2017.)
(Sik-Ho Tsang @ Medium). A sketch of the focal loss is given below.
• https://www.youtube.com/watch?v=44tlnmmt3h0
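A minimal sketch of the focal loss FL(pt) = −α(1−pt)^γ·log(pt) from the RetinaNet paper (arXiv:1708.02002, item 12 in the bibliography), for a single prediction where p_true is the predicted probability of the true class; γ = 2 and α = 0.25 are the values used in the paper:

```python
import math

def focal_loss(p_true, gamma=2.0, alpha=0.25):
    # standard cross-entropy would be -log(p_true); the (1 - p_true)^gamma factor
    # down-weights easy, well-classified examples so training focuses on hard ones
    return -alpha * (1.0 - p_true) ** gamma * math.log(p_true)

print(focal_loss(0.95))  # easy example: loss is strongly down-weighted
print(focal_loss(0.30))  # hard example: keeps most of its cross-entropy loss
```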
Let’s start with powerful computations!
✓ Questions?
✓ Remarks?
✓ Suggestions?
✓ Wishes?
Bibliography and Literature
1. https://www.cs.toronto.edu/~tingwuwang/semantic_segmentation.pdf
2. https://www.mathworks.com/help/vision/ug/getting-started-with-semantic-segmentation-using-deep-learning.html
3. https://medium.com/nanonets/how-to-do-image-segmentation-using-deep-learning-c673cc5862ef
4. https://medium.com/@jonathan_hui/map-mean-average-precision-for-object-detection-45c121a31173
5. https://pjreddie.com/darknet/yolo/
6. https://blog.paperspace.com/how-to-implement-a-yolo-object-detector-in-pytorch/
7. https://blog.paperspace.com/how-to-implement-a-yolo-v3-object-detector-from-scratch-in-pytorch-part-2/
8. https://blog.paperspace.com/how-to-implement-a-yolo-v3-object-detector-from-scratch-in-pytorch-part-3/
9. https://blog.paperspace.com/how-to-implement-a-yolo-v3-object-detector-from-scratch-in-pytorch-part-4/
10. https://blog.paperspace.com/how-to-implement-a-yolo-v3-object-detector-from-scratch-in-pytorch-part-5/
11. https://github.com/pjreddie/darknet/blob/master/cfg/yolov3.cfg
12. https://arxiv.org/pdf/1708.02002.pdf
13. https://www.youtube.com/watch?v=44tlnmmt3h0
14. https://towardsdatascience.com/review-retinanet-focal-loss-object-detection-38fba6afabe4