Li 2021 J. Phys.: Conf. Ser. 1827 012085
Wenze Li 1,*
1 Faculty of Engineering and Information Technology, University of Technology
Sydney, Sydney, Australia
*[email protected]
Abstract. Region-based convolutional neural network (R-CNN) models have been widely
used in the field of object detection. Faster R-CNN significantly improves overall
performance, especially detection speed, by adding a Region Proposal Network (RPN).
However, the choice of pre-training model makes a great difference to the performance
of Faster R-CNN. This paper analyzed the performance of Faster R-CNN models based on
different pre-training models and conducted a comprehensive evaluation of the
performance of Faster R-CNN. The experimental results showed the accuracy and detection
speed of R-CNN, Fast R-CNN and Faster R-CNN on three different data sets. They allow an
objective and comprehensive evaluation of the performance of R-CNN, Fast R-CNN, and
Faster R-CNN.
1. Introduction
Object detection is a hot topic in computer vision. Its main purpose is to find objects of interest in
images or videos and to determine their position and size simultaneously. Object detection can be
regarded as image segmentation based on the geometric and statistical characteristics of the object,
combining the segmentation and recognition of the object. Accuracy and real-time performance are
important capabilities of an object detection system. Especially in complex scenes, automatic object
extraction and recognition are particularly important when multiple objects need to be processed in
real time. In recent years, object detection has been widely used in artificial intelligence, face
recognition, unmanned driving, and other fields. Existing object detection algorithms include
traditional detection algorithms and detection algorithms based on deep learning. Traditional object
detection algorithms are mainly based on sliding windows or on feature point matching. Although these
methods have achieved good results, region selection with sliding windows is untargeted, which leads
to high time complexity and window redundancy. In addition, methods based on manually selected
features are often not very robust. With the development of deep learning, object detection
algorithms have shifted from traditional methods based on manually selected features to detection
methods based on deep neural networks. These methods can be mainly divided into two categories:
two-stage object detection algorithms that combine region proposals with convolutional neural
networks (CNN), such as R-CNN, and one-stage algorithms that convert object detection into a
regression problem, such as YOLO.
The R-CNN algorithm was first proposed by Girshick et al. in 2014, who applied a high-capacity CNN to
bottom-up region proposals [1]. R-CNN influenced all subsequent two-stage algorithms and made
CNN-based object detection algorithms gradually become the mainstream. The application of deep
learning has improved detection accuracy and detection speed. In 2015, Girshick proposed Fast R-CNN
Content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution
of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.
Published under licence by IOP Publishing Ltd
ICETIS 2021 IOP Publishing
Journal of Physics: Conference Series 1827 (2021) 012085 doi:10.1088/1742-6596/1827/1/012085
based on R-CNN, which simplified the spatial pyramid pooling (SPP) layer of SPP-Net into the region of
interest (RoI) pooling layer, compressed the fully connected layers with truncated singular value
decomposition (SVD), and produced two output vectors: the softmax classification scores and the
bounding box regression offsets. This improvement combines the classification problem with the
bounding box regression problem. Fast R-CNN uses softmax instead of support vector machines (SVM) and
keeps all the features in video memory instead of caching them on disk, which reduces storage use and
greatly accelerates detection [2]. However, neither R-CNN nor Fast R-CNN solves the following
problem: using selective search and similar methods to generate region proposals produces a large
number of invalid regions, which leads to inefficiency and wasted computing power.
Ren et al. proposed to use Region Proposal Networks (RPN) to replace the Selective Search algorithm
on the basis of Fast R-CNN. RPN can use neural networks to learn its own strategy for generating region
proposals and make full use of feature maps. RPN replaces time-consuming selective search and other
similar algorithms to make detection faster [3].
The main contributions of this work can be summarized as follows:
1. The Faster R-CNN algorithm is implemented in PyTorch, and VGG16 and ResNet101 are used as
pre-training models to train on the Pascal VOC and COCO data sets, with the training time recorded
for each.
2. The trained model is tested to obtain the accuracy (mAP) and detection time.
3. The performance of Faster R-CNN is analyzed under different pre-training models and data sets.
The rest of this paper is organized as follows. Section 2 introduces the network structure of the
Faster R-CNN algorithm in detail. Section 3 presents the test results of Faster R-CNN based on
different pre-training models and data sets and analyzes its performance. Finally, Section 4
summarizes the performance analysis of Faster R-CNN under different data sets and pre-training
models.
2. Related works
According to the description of Ren et al. in the original paper, the entire process of Faster R-CNN
can be divided into four parts: Conv layers, Region Proposal Network (RPN), RoI Pooling and
Classification [3]. First, a P × Q image of any size is scaled to a fixed size M × N and sent to the
Conv layers to extract the feature map. This feature map is shared by the subsequent RPN layer and
fully connected layers. The feature map is then input to the RPN to generate regions of interest
(RoIs). The RPN uses softmax to determine whether each anchor is positive or negative, and then
refines the anchors through bounding box regression to obtain accurate proposals. The RoI Pooling
layer takes the feature map and the proposals as input, combines them to extract the proposal feature
maps, and sends these to the fully connected layers to determine the object category. Finally, the
Classification part computes the category of each proposal from its proposal feature map and, at the
same time, obtains the final precise position of the detection box through bounding box regression.
Figure 1 shows the flow chart of Faster R-CNN.
2.2.1 Anchors
At each sliding window position, multiple candidate boxes are predicted, where the maximum number of
candidate boxes at each position is denoted by k. Therefore, the regression layer has 4k outputs
encoding the coordinates of the k candidate boxes, and the classification layer has 2k output scores
that estimate the probability that each candidate box is an object or background. The k candidate
boxes are parameterized relative to k reference boxes, which are called anchors. An anchor is
centered on the sliding window and is associated with a scale and an aspect ratio. By default, 3
scales and 3 aspect ratios are used to generate k = 9 anchors at each position. For a convolutional
feature map of size W × H, there are a total of W × H × k anchors.
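The anchor layout described above can be illustrated with a minimal Python sketch. The base size of 16 pixels, the stride of 16, the scale multipliers (8, 16, 32), the aspect ratios (0.5, 1, 2) and the function names are assumptions chosen for illustration; they are not taken from the implementation used in this paper.

```python
import math


def generate_anchors(base_size=16, scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    """Return k = len(scales) * len(ratios) anchor shapes as (width, height) pairs.

    All default values are illustrative assumptions.
    """
    anchors = []
    for scale in scales:
        area = (base_size * scale) ** 2      # anchor area in pixels^2
        for ratio in ratios:                 # ratio is height / width
            w = math.sqrt(area / ratio)
            h = w * ratio
            anchors.append((w, h))
    return anchors


def anchors_on_feature_map(W, H, stride=16, **kwargs):
    """Place every anchor shape at every position of a W x H feature map.

    Each box is (cx, cy, w, h) in image coordinates.
    """
    shapes = generate_anchors(**kwargs)
    boxes = []
    for j in range(H):
        for i in range(W):
            cx = (i + 0.5) * stride          # centre of the sliding window
            cy = (j + 0.5) * stride
            boxes.extend((cx, cy, w, h) for (w, h) in shapes)
    return boxes


k = len(generate_anchors())
boxes = anchors_on_feature_map(W=4, H=3)
print(k, len(boxes))  # k anchors per position, W * H * k anchors in total
```

With the default 3 scales and 3 aspect ratios, each position carries 9 anchors, so the 4 × 3 toy feature map above yields 4 × 3 × 9 = 108 anchors.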
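The input of this regression function is the feature map and the offset between the ground truth and
the anchor, and the output is $d_*(A)$, where $* \in \{x, y, w, h\}$. Assuming that the input feature
map is $\phi(A)$, then:

$$ d_*(A) = W_*^{T} \phi(A) \tag{5} $$

The loss function is:

$$ \mathrm{Loss} = \sum_{i}^{N} \left| t_*^{i} - W_*^{T} \phi(A^{i}) \right| \tag{6} $$

where $t_*$ is the offset between the ground truth $G$ and the anchor $A$:

$$ t_x = \frac{G_x - A_x}{A_w}, \qquad t_y = \frac{G_y - A_y}{A_h} \tag{7} $$

$$ t_w = \log\left(\frac{G_w}{A_w}\right), \qquad t_h = \log\left(\frac{G_h}{A_h}\right) \tag{8} $$

The function optimization goal is:

$$ \hat{W}_* = \arg\min_{W_*} \sum_{i}^{N} \left| t_*^{i} - W_*^{T} \phi(A^{i}) \right| + \lambda \left\| W_* \right\| \tag{9} $$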
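The offsets in equations (7) and (8) can be sketched in a few lines of Python. The function names and the example boxes are hypothetical, and the `decode` function is simply the algebraic inverse of (7) and (8); it is included here only to show how a predicted offset would be turned back into a box.

```python
import math


def encode(anchor, gt):
    """Offsets t = (tx, ty, tw, th) between a ground-truth box and an anchor.

    Boxes are (centre x, centre y, width, height), as in equations (7) and (8).
    """
    ax, ay, aw, ah = anchor
    gx, gy, gw, gh = gt
    return ((gx - ax) / aw,
            (gy - ay) / ah,
            math.log(gw / aw),
            math.log(gh / ah))


def decode(anchor, t):
    """Inverse of encode: apply offsets to an anchor to recover a box."""
    ax, ay, aw, ah = anchor
    tx, ty, tw, th = t
    return (ax + tx * aw,
            ay + ty * ah,
            aw * math.exp(tw),
            ah * math.exp(th))


anchor = (100.0, 100.0, 50.0, 80.0)       # hypothetical anchor A
ground_truth = (110.0, 90.0, 100.0, 40.0)  # hypothetical ground-truth box G
t = encode(anchor, ground_truth)
recovered = decode(anchor, t)
print(t)
print(recovered)
```

Dividing the centre offsets by the anchor size and taking the logarithm of the size ratios keeps the regression targets on a similar scale for anchors of different sizes.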
2.4 Classification
The Classification part calculates the category that each proposal belongs to (such as desk, car,
person, etc.) by using the proposal feature maps, fully connected layers and softmax, and outputs the
cls_prob probability vector. In addition, Faster R-CNN applies bounding box regression again at the
same time to obtain the position offset bbox_pred of each proposal, which is used to regress a more
accurate detection box.
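As a small illustration of how a probability vector such as cls_prob is obtained, the following sketch applies softmax to the raw scores of one proposal. The class list and the score values are hypothetical stand-ins for the output of the fully connected layers.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    m = max(scores)                              # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


classes = ["background", "desk", "car", "person"]  # hypothetical class list
scores = [0.5, 1.0, 3.0, 0.2]                      # hypothetical fully connected outputs
cls_prob = softmax(scores)
best = classes[cls_prob.index(max(cls_prob))]
print(best, cls_prob)
```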
3. Experiment
3.1 Datasets
The data sets used in this experiment are the three most common public data sets in the field of object
detection, which are PASCAL VOC2007, COCO and ILSVRC.
3.1.2 MS COCO
The full name of MS COCO is Microsoft Common Objects in Context; it is a data set that Microsoft
funded and annotated in 2014. The COCO data set is a large and rich object detection, segmentation
and captioning data set with a size of more than 25. The main purpose of the COCO data set is scene
understanding, and its pictures are mainly taken from complex everyday scenes. The objects in each
image are located through precise segmentation. The data set includes 91 object categories, 328,000
images and 2.5 million labels. The number of individual object instances in the entire data set
exceeds 1.5 million, and it is the largest semantic segmentation data set so far.
3.1.3 ILSVRC
The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is a data set often used in deep
learning and computer vision. Much research work, such as image classification, localization and
detection, is based on this data set. The size of the ILSVRC data set is about 148 GB, and it has
about 15 million pictures. It covers more than 20,000 categories, and more than one million of the
pictures have clear category annotations and annotations of the location of objects in the image. The
ILSVRC data set is well documented and maintained by a dedicated team, so it is very convenient to
use. It is widely used in research papers in the field of computer vision and has almost become the
current "standard" data set for testing algorithm performance in deep learning on images.
3.2.2 mAP
Mean Average Precision (mAP) is the mean of the Average Precision (AP) values over all object
classes, and it is the main evaluation index of object detection algorithms. The mAP was first used
in the PASCAL Visual Object Classes (VOC) challenge. Speed and accuracy (mAP) are usually used to
describe the performance of object detection models. The higher the mAP value, the better the
detection performance of the model on a given data set.

$$ \mathrm{mAP} = \frac{\sum_{i=1}^{K} AP_i}{K} \tag{12} $$

where $K$ is the number of object classes and $AP_i$ is the AP of class $i$.
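Equation (12) amounts to a plain average over classes, as the following sketch shows. The per-class AP values are made-up numbers used only to demonstrate the calculation; they are not results from this paper.

```python
def mean_average_precision(ap_per_class):
    """Average the per-class AP values, as in equation (12)."""
    if not ap_per_class:
        raise ValueError("at least one class is required")
    return sum(ap_per_class.values()) / len(ap_per_class)


# Hypothetical AP values for K = 3 classes.
ap = {"desk": 0.5, "car": 0.75, "person": 0.25}
m_ap = mean_average_precision(ap)
print(m_ap)
```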
3.2.3 Speed
Many practical applications of object detection have high requirements for both accuracy and speed.
If speed is ignored and only gains in accuracy are pursued, the cost is higher computational
complexity and greater memory requirements. Generally speaking, the speed evaluation indicators in
object detection are:
(1) FPS, the number of pictures that the detector can process per second
(2) The time required for the detector to process each picture
Speed must be evaluated on the same hardware, so that the peak number of floating-point operations
per second (FLOPS), which represents the hardware performance, is the same. Different networks
require different numbers of floating-point operations (FLOPs) to process each picture, so the fewer
FLOPs a network needs to process a picture on the same hardware, the more pictures it can process in
the same amount of time.
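The two speed indicators above are reciprocals of each other and can be measured with a simple timing loop. In this sketch, `dummy_detector` is a hypothetical stand-in for a real detection model, and the image list is synthetic; a real measurement would also need warm-up runs and a fixed hardware setup.

```python
import time


def measure_speed(detector, images):
    """Return (FPS, seconds per image) for running detector over images."""
    start = time.perf_counter()
    for image in images:
        detector(image)
    elapsed = time.perf_counter() - start
    seconds_per_image = elapsed / len(images)
    fps = len(images) / elapsed
    return fps, seconds_per_image


def dummy_detector(image):
    """Hypothetical placeholder for a detection model."""
    return sum(image)


images = [list(range(1000)) for _ in range(50)]   # synthetic "images"
fps, seconds_per_image = measure_speed(dummy_detector, images)
print(f"{fps:.1f} FPS, {seconds_per_image * 1000:.3f} ms per image")
```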
[Figure 4: plot of Precision (vertical axis) against Recall (horizontal axis) for R-CNN, Fast R-CNN and Faster R-CNN]
Figure 4. The Precision-Recall curve of R-CNN, Fast R-CNN and Faster R-CNN
Table 2. The mAP (%) of R-CNN, Fast R-CNN and Faster R-CNN on three datasets
Method         PASCAL VOC 2007   COCO   ILSVRC
R-CNN          54.2              24.6   31.4
Fast R-CNN     66.9              35.9   24.9
Faster R-CNN   75.1              42.5   46.9
Table 3. The test time per image of R-CNN, Fast R-CNN and Faster R-CNN
                      R-CNN        Fast R-CNN   Faster R-CNN
Test time per image   47 seconds   2 seconds    0.2 seconds
Speed-up              1×           23.5×        235×
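The speed-up row of Table 3 follows directly from the test times: each value is the R-CNN time divided by the time of the method in question. The short sketch below reproduces that arithmetic and also converts each time into frames per second; the variable names are illustrative only.

```python
# Test time per image in seconds, taken from Table 3.
test_time = {"R-CNN": 47.0, "Fast R-CNN": 2.0, "Faster R-CNN": 0.2}

baseline = test_time["R-CNN"]
speed_up = {name: round(baseline / t, 1) for name, t in test_time.items()}
fps = {name: round(1.0 / t, 3) for name, t in test_time.items()}

for name in test_time:
    print(f"{name}: {speed_up[name]}x speed-up, {fps[name]} FPS")
```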
4. Conclusion
This article reviews the development of object detection algorithms and the R-CNN series of
algorithms, and analyzes the structure of Faster R-CNN in detail. Based on the experimental results,
the performance of Faster R-CNN is analyzed, including its accuracy and speed on object detection
problems. The experimental results demonstrate that the accuracy and detection speed of Faster R-CNN
are greatly improved compared to R-CNN and Fast R-CNN because it uses RPN to replace the
time-consuming Selective Search.
In the future, more in-depth testing and analysis of the robustness of Faster R-CNN can be conducted
based on the current work. For example, the defence capabilities of Faster R-CNN against adversarial
samples can be tested through experiments. In addition, testing on small data sets is also an
important direction for future work. At present, Faster R-CNN relies on transfer learning: a model
pre-trained on an existing data set is fine-tuned. If the object is not in the data set used for
pre-training, Faster R-CNN may not achieve good results.
References
[1] R. Girshick, J. Donahue, T. Darrell and J. Malik, "Rich Feature Hierarchies for Accurate Object
Detection and Semantic Segmentation," 2014 IEEE Conference on Computer Vision and
Pattern Recognition, Columbus, OH, pp. 580-587, 2014.
[2] R. Girshick, "Fast R-CNN," 2015 IEEE International Conference on Computer Vision (ICCV),
Santiago, pp. 1440-1448, 2015.
[3] S. Ren, K. He, R. Girshick and J. Sun, "Faster R-CNN: Towards Real-Time Object Detection with
Region Proposal Networks," IEEE Transactions on Pattern Analysis and Machine Intelligence,
vol. 39, no. 6, pp. 1137-1149, 2017.