Unit 3
Figure 5-1 Object detection means identifying and localizing an object. In the first image, we classify the object as a vacuum cleaner, while in the second image, we also draw a box around it, which is the localization of the object
To scale the solution, we can have multiple objects in the same image, and even multiple objects of different categories in the same image, and we have to identify all of them and draw bounding boxes around them. An example is a solution trained to detect cars: on a busy road, there will be many cars, and hence the solution should be able to detect each of them and draw a bounding box around each one.
Object detection is surely a fantastic solution. We will now discuss the major object
detection use cases in the next section.
1. Facial recognition
2. Cancer detection
3. Other vehicles
4. Pedestrians
5. Cyclists
6. Traffic signals
7. Lane markings
8. Construction
The following Deep Learning architectures are commonly used for Object Detection:
1. R-CNN: Regions with CNN features. It combines Regional Proposals with CNN.
2. Fast R-CNN: A Fast Region–based Convolutional Neural Network.
3. Faster R-CNN: Object detection networks on Region Proposal algorithms to
hypothesize object locations.
4. Mask R-CNN: This network extends Faster R-CNN by adding the prediction of
segmentation masks on each region of interest.
5. YOLO: You Only Look Once architecture. It proposes a single Neural Network to
predict bounding boxes and class probabilities from an image in a single evaluation.
6. SSD: Single Shot MultiBox Detector. It presents a model to predict objects in images
using a single deep Neural Network.
When we want to detect objects, a very simple approach can be the following: why not divide the image into regions or specific areas and then classify each one of them? This approach to object detection is called the sliding window approach. As the name suggests, a rectangular box of fixed length and width slides through the entire image with a given stride.
Look at the image of the vacuum cleaner in Figure 5-2. We are using a sliding window
at each part of the image. The red box is sliding over the entire image of the vacuum
cleaner. From left to right and then vertically, we can observe that different parts of the
image are becoming the point of observation. Since the window is sliding, it is referred to
as the sliding window approach.
Figure 5-2 The sliding window approach to detect an object and identify it. Notice how the sliding box moves across the entire image; the process is able to detect the object, but it is really time-consuming and computationally expensive too
Then, for each of the regions cropped, we can classify whether the region contains an object that interests us or not, and then we increase the size of the sliding window and continue the process. Sliding window has proven to work, but it is a computationally very expensive technique and is slow, as we are classifying all the regions in an image. Also, to localize the objects precisely, we need a small window size and a small stride. Still, it is a simple approach to understand.
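To make the idea concrete, the following minimal Python sketch generates the crops that a sliding window would produce; the window size, the stride, and the classify_region function in the commented usage are hypothetical placeholders, not part of any specific library.

import numpy as np

def sliding_window(image, window_size=(128, 128), stride=32):
    """Yield (x, y, crop) for every window position over a NumPy image array."""
    win_h, win_w = window_size
    img_h, img_w = image.shape[:2]
    for y in range(0, img_h - win_h + 1, stride):
        for x in range(0, img_w - win_w + 1, stride):
            yield x, y, image[y:y + win_h, x:x + win_w]

# Hypothetical usage: classify every crop and keep the positive ones.
# detections = []
# for x, y, crop in sliding_window(image):
#     if classify_region(crop):          # classify_region is a placeholder classifier
#         detections.append((x, y, 128, 128))

Because every crop has to go through the classifier, the cost grows quickly as the stride shrinks, which is exactly the weakness described above.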
The bounding box prediction generates the x coordinate, y coordinate, height, and width of the bounding box, along with the class probability score.
If an object lies over multiple grid cells, then the grid cell that contains the midpoint of that object is responsible for detecting that object.
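As a small illustration of this midpoint rule, the sketch below returns the grid cell responsible for an object; the 7x7 grid size and the normalized center coordinates are assumptions made only for the example.

def responsible_cell(x_center, y_center, grid_size=7):
    """Return (row, col) of the grid cell containing the object's midpoint.
    x_center and y_center are normalized to [0, 1] relative to the image."""
    col = min(int(x_center * grid_size), grid_size - 1)
    row = min(int(y_center * grid_size), grid_size - 1)
    return row, col

# An object centered at (0.62, 0.40) on a 7x7 grid falls in cell (2, 4).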
Intersection over Union (IoU):
Intersection over Union is a test to ascertain how close our prediction is to the actual ground truth.
It is represented by Equation 5-1 and is shown in Figure 5-4.
IoU = Area of Overlap / Area of Union (Equation 5-1)
Figure 5-4 Intersection over Union is used to measure the performance of detection. The numerator is the
common area, while the denominator is the complete union of the two areas. The higher the value of IoU, the better
it is
Figure 5-5 IoU values for different positions of the overlapping boxes. A value closer to 1.0 means that the detection is more accurate than one with a value of 0.15
As we can see in Figure 5-5, for an IoU of 0.15, there is much less overlap between the two boxes compared to 0.85 or 0.90. It means that the solution with an IoU of 0.85 is better than the one with an IoU of 0.15. Detection solutions can hence be compared directly.
Intersection over Union allows us to measure and compare the performance of various
solutions. It also makes it easier for us to distinguish between useful bounding boxes and
not-so-important ones. Intersection over Union is an important concept with wide usages.
Using it, we can compare and contrast the acceptability of all the possible solutions and
choose the best one from them.
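Since IoU is simply the ratio of the overlapping area to the combined area of the two boxes, it can be computed in a few lines of Python; the corner-format (x_min, y_min, x_max, y_max) boxes below are an assumption made for illustration.

def iou(box_a, box_b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    # Coordinates of the intersection rectangle.
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # The intersection area is zero if the boxes do not overlap at all.
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Example: iou((10, 10, 60, 60), (30, 30, 80, 80)) is roughly 0.22.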
Figure 5-8 The process in R-CNN. Here, we extract region proposals from the input image, compute the CNN
features, and then classify the regions. Image source: https://arxiv.org/pdf/1311.2524.pdf and published here with the
permission of the researchers
With reference to Figure 5-8 where we have shown the process, let us understand the
entire process in detail now:
1. The first step is to input an image, represented by step 1 in Figure 5-8.
2. Then we get the regions we are interested in, which is shown in step 2 in Figure 5-8. These are the 2000 proposed regions (a code sketch of this step appears after this discussion). They are detected using the following steps:
a) We create the initial segmentation for the image.
b) Then we generate the various candidate regions for the image.
c) We combine similar regions into larger ones iteratively. A greedy search approach is used for it.
d) Finally, we use the generated regions to output the final region proposals.
3. Then, in the next step, we reshape all the 2000 regions as required by the CNN implementation.
4. We then pass each region through the CNN to get features for each region.
5. The extracted features are now passed through a support vector machine to classify the presence of objects in the proposed region.
6. Finally, we predict the bounding boxes for the objects using bounding box regression. This means that we are making the final prediction about the image. As shown in the last step, we are predicting whether the image contains an airplane, a person, or a TV monitor.
The preceding process is used by R-CNN to detect the objects in an image. It is surely an innovative architecture, and it introduces region proposals as an impactful concept to detect objects.
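To make the region proposal step concrete, here is a small sketch that generates candidate regions with selective search using OpenCV; it assumes the opencv-contrib-python package is installed and is only an illustration of step 2, not the authors' exact implementation.

import cv2

def propose_regions(image_path, max_proposals=2000):
    """Generate candidate regions with selective search (requires opencv-contrib-python)."""
    image = cv2.imread(image_path)
    ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
    ss.setBaseImage(image)
    ss.switchToSelectiveSearchFast()   # the faster, coarser search strategy
    rects = ss.process()               # each proposal is (x, y, w, h)
    return rects[:max_proposals]       # keep roughly 2000 proposals, as in R-CNN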
But there are a few challenges with R-CNN, which are:
1. R-CNN implements three algorithms (CNN for extracting the features, SVM for the classification of objects, and bounding box regression for getting the bounding boxes). This makes R-CNN solutions quite slow to train.
2. It extracts features using the CNN for each image region, and the number of regions is 2000. It means that if we have 1000 images, the number of regions from which features have to be extracted is 1000 times 2000, which again makes it slower.
3. Because of these reasons, it takes 40–50 seconds to make a prediction for an image, and hence it becomes a problem for huge datasets.
4. Also, the selective search algorithm is fixed, and not many improvements can be made to it.
As R-CNN is not very fast and is quite difficult to implement for huge datasets, the same authors proposed Fast R-CNN to overcome these issues.
Faster R-CNN
To overcome the slowness of R-CNN and Fast R-CNN, Shaoqing Ren et al. proposed Faster R-CNN. The intuition behind Faster R-CNN is to replace selective search, which is slow and time-consuming. Faster R-CNN instead uses the Region Proposal Network or RPN.
The architecture of Faster R-CNN is shown in Figure 5-10.
Figure 5-10 Faster R-CNN is an improvement over the previous versions. It consists of two modules – one
is a deep convolutional network, and the other is the Fast R-CNN detector
Faster R-CNN is composed of two modules. The first module is a deep fully convolutional network that proposes regions, and the second module is the Fast R-CNN detector that uses the proposed regions. The entire system is a single, unified network for object detection. In other words, Faster R-CNN combines the intelligence of the deep fully convolutional region proposal network with the Fast R-CNN detector that works on the proposed regions, making the entire solution a single and unified solution for object detection.
Though Faster R-CNN is surely an improvement in terms of performance over R-CNN and Fast R-CNN, the algorithm still does not analyze all the parts of the image simultaneously. Instead, each part of the image is analyzed in sequence. Hence, it requires a large number of passes over a single image to recognize all the objects. Moreover, since a lot of systems are working in a sequence, the performance of each one depends on the performance of the preceding steps.
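For readers who want to try Faster R-CNN directly, torchvision ships a pre-trained model; the sketch below is a minimal inference example, and the image file name and the 0.8 score threshold are arbitrary choices made for illustration.

import torch
import torchvision
from torchvision.transforms import functional as F
from PIL import Image

# Faster R-CNN with a ResNet-50 FPN backbone, pre-trained on COCO.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

image = Image.open("street.jpg").convert("RGB")   # hypothetical input image
with torch.no_grad():
    outputs = model([F.to_tensor(image)])

keep = outputs[0]["scores"] > 0.8                 # keep confident detections only
boxes = outputs[0]["boxes"][keep]                 # (x_min, y_min, x_max, y_max)
labels = outputs[0]["labels"][keep]               # COCO class indices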
You Only Look Once (YOLO)
You Only Look Once or YOLO is targeted for real-time object detection. The previous
algorithms we discussed use regions to localize the objects in the image. Those algorithms
look at a part of the image and not the complete image, whereas in YOLO a single CNN
predicts both the bounding boxes and the respective class probabilities. YOLO was
proposed in 2016 by Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi.
The actual paper can be accessed at https://arxiv.org/pdf/1506.02640v5.pdf.
To quote from the actual paper, “We reframe object detection as a single regression
problem, straight from image pixels to bounding box coordinates and class probabilities.”
As shown in Figure 5-12, YOLO divides an image into a grid of cells (the grid size is represented by S). Each of the cells predicts bounding boxes (the number of boxes per cell is represented by B). Then YOLO works on each bounding box and generates a confidence score about the goodness of the shape of the box. The class probability for the object is also predicted. Finally, the bounding boxes having class probability scores above a threshold are selected and used to locate the object within the image.
Figure 5-12 The YOLO process is simple; the image has been taken from the original paper
https://arxiv.org/pdf/1506.02640v5.pdf
1. YOLO divides the input image into an SxS grid. To be noted is that each grid cell is responsible for predicting only one object. If the center of an object falls in a grid cell, that grid cell is responsible for detecting that object.
2. For each of the grid cells, it predicts B boundary boxes. Each of the boundary boxes has five attributes – the x coordinate, y coordinate, width, height, and a confidence score. In other words, it has (x, y, w, h) and a score. This confidence score is the confidence of having an object inside the box. It also reflects the accuracy of the boundary box.
3. The width w and height h are normalized to the image's width and height. The x and y coordinates represent the center of the box relative to the bounds of the grid cell.
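To see what this normalization means in practice, the following sketch converts one such (x, y, w, h) prediction back into pixel coordinates; the function name and arguments are illustrative and are not part of the original implementation.

def decode_box(x, y, w, h, row, col, grid_size, img_w, img_h):
    """Convert a YOLO-style box into pixel corner coordinates.
    x, y are the center offsets within grid cell (row, col);
    w, h are relative to the full image width and height."""
    center_x = (col + x) / grid_size * img_w
    center_y = (row + y) / grid_size * img_h
    box_w, box_h = w * img_w, h * img_h
    x_min, y_min = center_x - box_w / 2, center_y - box_h / 2
    return x_min, y_min, x_min + box_w, y_min + box_h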
We will now examine how we calculate the loss function in YOLO. It is important to understand how the loss is calculated before we can study the entire architecture in detail.
$$
\begin{aligned}
\mathcal{L} =\; & \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
& + \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
& + \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left(C_i - \hat{C}_i\right)^2 + \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj} \left(C_i - \hat{C}_i\right)^2 \\
& + \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in \text{classes}} \left(p_i(c) - \hat{p}_i(c)\right)^2
\end{aligned}
$$
(Equation 5-3)
In Equation 5-3, we have the localization loss, the confidence loss, and the classification loss, where 1_i^obj denotes whether an object appears in cell i and 1_ij^obj denotes that the jth bounding box predictor in cell i is “responsible” for that prediction.
Let’s describe the terms in the preceding equation. Here, we have
A. Localization loss measures the errors for the predicted boundary boxes, that is, their location and size errors. In the preceding equation, the first two terms represent the localization loss. 1_ij^obj is 1 if the jth boundary box in cell i is responsible for detecting the object, else its value is 0. λ_coord increases the weight of the loss in the coordinates of the boundary boxes; the default value of λ_coord is 5.
B. Confidence loss is the loss if an object is detected in the box. It is the squared error between the predicted box confidence and the ground truth for the boxes that are responsible for an object.
C. The next term is the confidence loss if an object is not detected. It is weighted down by λ_noobj (default value 0.5), because most boxes contain no object and we do not want this term to overpower the others.
D. The final term is the classification loss. If an object is indeed detected, then for each cell it is the squared error of the class probabilities for each class.
The final loss is the sum total of all these components. As with any Deep Learning solution, the objective is to minimize this loss value.
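The following NumPy sketch shows how these components could be combined. It flattens the grid into a list of boxes and uses an obj_mask array to stand in for 1_ij^obj, so it is a simplified illustration of Equation 5-3 rather than the exact training code.

import numpy as np

def yolo_loss(pred, truth, obj_mask, lambda_coord=5.0, lambda_noobj=0.5):
    """Simplified YOLO loss for arrays of shape (num_boxes, 5 + num_classes).
    Each row is (x, y, w, h, confidence, class probabilities...); obj_mask is 1
    where a box is responsible for an object, else 0. Widths and heights are
    assumed to be non-negative so the square roots are valid."""
    noobj_mask = 1.0 - obj_mask
    # Localization loss: center coordinates plus square-rooted width/height.
    loc = np.sum(obj_mask * ((pred[:, 0] - truth[:, 0]) ** 2 +
                             (pred[:, 1] - truth[:, 1]) ** 2 +
                             (np.sqrt(pred[:, 2]) - np.sqrt(truth[:, 2])) ** 2 +
                             (np.sqrt(pred[:, 3]) - np.sqrt(truth[:, 3])) ** 2))
    # Confidence loss, split into object and no-object terms.
    conf_err = (pred[:, 4] - truth[:, 4]) ** 2
    conf = np.sum(obj_mask * conf_err) + lambda_noobj * np.sum(noobj_mask * conf_err)
    # Classification loss over the class probabilities of responsible boxes.
    cls = np.sum(obj_mask * np.sum((pred[:, 5:] - truth[:, 5:]) ** 2, axis=1))
    return lambda_coord * loc + conf + cls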
In the paper, the authors mention that the network was inspired by GoogLeNet. The network has 24 convolutional layers followed by 2 fully connected layers. Instead of the Inception modules used by GoogLeNet, YOLO uses 1x1 reduction layers followed by 3x3 convolutional layers. YOLO might detect duplicates of the same object. To handle this, non-maximal suppression has been implemented, which removes the duplicate detections with lower confidence scores.
In Figure 5-14, we have an image divided into 13x13 grid cells. In total, there are 169 cells, and each cell predicts 5 bounding boxes. Hence, there are a total of 169*5 = 845 bounding boxes. When we apply a threshold of 30% or more on the confidence scores, we get 3 bounding boxes, as shown in Figure 5-14.
Figure 5-14 The YOLO process divides the image into SxS grid cells. Each cell predicts five bounding boxes, and based on the threshold setting, which is 30% here, we get the final three bounding boxes; the image has been taken from the original paper
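A quick way to reproduce the box count and the filtering step from Figure 5-14 is sketched below; the random scores are only placeholders for the network's real confidence outputs.

import numpy as np

scores = np.random.rand(13 * 13 * 5)      # 845 placeholder confidence scores
kept = np.flatnonzero(scores >= 0.30)     # apply the 30% threshold
print(scores.size, "candidate boxes,", kept.size, "survive the threshold")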
So, YOLO looks at the image only once but in a clever manner. It is a very fast algorithm for
real-time processing. To quote from the original paper:
1. YOLO is refreshingly simple.
2. YOLO is extremely fast. Since we frame detection as a regression problem we don’t
need a complex pipeline. We simply run our Neural Network on a new image at test
time to predict detections. Our base network runs at 45 frames per second with no
batch processing on a Titan X GPU and a fast version runs at more than 150 fps. This
means we can process streaming video in real-time with less than 25 milliseconds of
latency. Furthermore, YOLO achieves more than twice the mean average precision of
other real-time systems.
3. YOLO reasons globally about the image when making predictions. Unlike sliding
window and region proposal-based techniques, YOLO sees the entire image during
training and test time so it implicitly encodes contextual information about classes as
well as their appearance.
4. YOLO learns generalizable representations of objects. When trained on natural
images and tested on artwork, YOLO outperforms top detection methods like DPM
and R-CNN by a wide margin. Since YOLO is highly generalizable it is less likely to
break down when applied to new domains or unexpected inputs.
There are a few challenges with YOLO too. It suffers from high localization error. Moreover, since each of the grid cells predicts only two boxes and can have only one class as the output, YOLO can predict only a limited number of nearby objects. It suffers from a problem of low recall too. Hence, in the next versions, YOLOv2 and YOLOv3, these issues were addressed.
YOLO is one of the most widely used object detection solutions. Its uniqueness lies in
its simplicity and speed.
Questions:
Part-A
1. What is the concept of anchor boxes and non-max suppression?
To generate the final object detections, tiled anchor boxes that belong to the background class are removed, and the remaining ones are filtered by their confidence scores. Anchor boxes with the greatest confidence scores are selected using non-maximum suppression (NMS).
2. What is a bounding box?
In the context of digital image processing, the bounding box denotes the border's coordinates on the X and Y axes that enclose the object in an image. Bounding boxes are used to identify a target, serve as a reference for object detection, and generate a collision box for the object.
3. How are R-CNN, Fast R-CNN, and Faster R-CNN different, and what are the improvements?
                                  R-CNN    Fast R-CNN                       Faster R-CNN
mAP on the Pascal VOC 2012        53.3     65.7 (when trained with          67.0 (when trained with VOC 2012 only)
test dataset (%)                           VOC 2012 only)                   70.4 (when trained with VOC 2007
                                           68.4 (when trained with          and 2012 both)
                                           VOC 2007 and 2012 both)          75.9 (when trained with VOC 2007
                                                                            and 2012 and COCO)
4. What is IoU?
IoU calculates the intersection over the union of two bounding boxes: the ground truth bounding box and the predicted bounding box.
5. What is mAP?
mAP (mean Average Precision) is a popular metric for measuring the accuracy of object detectors. Average precision computes the average of the precision values over recall values from 0 to 1.
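As a sketch of how the average precision can be computed, here is the 11-point interpolation variant used by the Pascal VOC benchmark; the recall and precision arrays are assumed to come from an evaluation pass that is not shown here.

import numpy as np

def average_precision(recalls, precisions):
    """11-point interpolated AP; recalls and precisions are NumPy arrays
    sorted by decreasing detection confidence."""
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):
        # Highest precision at any recall greater than or equal to r.
        mask = recalls >= r
        p = np.max(precisions[mask]) if np.any(mask) else 0.0
        ap += p / 11.0
    return ap

# mAP is simply the mean of the AP values computed for every object class.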
6. What is NMS?
a) Sort all the predictions by their confidence scores.
b) Start from the top scores, and ignore any current prediction if we find any previous prediction that has the same class and an IoU > threshold (generally we use 0.5) with the current prediction.
c) Repeat the above step until all predictions are checked.
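A greedy implementation of these steps might look like the sketch below; it reuses the iou helper sketched in the IoU section and, for simplicity, assumes all boxes belong to the same class.

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: boxes is a list of (x_min, y_min, x_max, y_max) tuples and
    scores the matching confidence scores. Returns the indices of kept boxes."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Drop every remaining box that overlaps the best box too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep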
YOLO uses a sum of squared errors between the predictions and the ground truth to calculate the loss. The loss function is composed of:
• The classification loss.
• The localization loss (errors between the predicted boundary box and the ground truth).
• The confidence loss (the objectness of the box).
9. What is FPN?