Comparison On Object Detection Algorithms A Taxonomy
Comparison On Object Detection Algorithms A Taxonomy
2022 3rdInternational
International Conference
Conferenceon
onElectronic
ElectronicCommunication
Communication and
andArtificial
ArtificialIntelligence
Intelligence(IWECAI)
(IWECAI)
&RPSDULVRQRQ2EMHFW'HWHFWLRQ$OJRULWKPV$
Comparison on Obj ect Detection Algorithms : A
7D[RQRP\
2022 3rd International Conference on Electronic Communication and Artificial Intelligence (IWECAI) | 978-1-6654-7997-4/22/$31.00 ©2022 IEEE | DOI: 10.1109/IWECAI55315.2022.00047
Taxonomy
<L+DQ
Yi Han
Data
DataScience
Science and
andTechnology
Technology
North
NorthUniversity
Universityof
ofChina
China
7DL\XDQ&KLQD
Taiyuan, China
#TTFRP
77262 1 686@[Link]
Abstract²
Abstract- 9LVXDO
Visual REMHFW
object GHWHFWLRQ
detection LV
is D
a SRSXODU
popular WDVN
task, ZKLFK
which LQSXW
input WR
to ILQHWXQLQJ
fine-tuning WKH
the &11
CNN PRGHO
model. :KDW
What WKH\
they XVH
use WR
to
FDWHJRUL]HV
categorizes DOO
all WKH
the GHILQHG
defined REMHFWV
objects LQ
in WKHZKROHLPDJHV
the whole images. :LWKWKH
With the SUHWUDLQHG
pretrained LVis XVH
use Da ODUJHVFDOH
large-scale GDWDVHW
dataset ZKLFK
which LV
is OLNH
like WKH
the
HPHUJLQJ
emerging RIof QXPHURXV
numerous REMHFW
object GHWHFWLRQ
detection IUDPHZRUNV
frameworks, PDQ\
many ,PDJH1HW$WWKLVVWDJHDOOUHJLRQSURSRVDOVDUHGHWHUPLQHGE\
ImageNet. At this stage, all region proposals are determined by
GHWHFWLRQPHWKRGV
detection methods KDYHEHHQSURSRVHG,QWKLVSDSHUZHDLPWR
have been proposed. In this paper, we aim to WKHLU,R8EHWZHHQWKHJURXQGWUXWK
their loU between the ground truth.
VXUYH\VHYHUDOGLVWLQJXLVKHGGHWHFWLRQPHWKRGVZKLFKFRQVLVWVRI
survey several distinguished detection methods, which consists of
WZRVWDJHEDVHGPHWKRGVDQGRQHVWDJHEDVHGPHWKRGV,QVSHFLILF
two-stage based methods and one-stage based methods. In specific, &ODVVLILFDWLRQVXSSRUWYHFWRUPDFKLQHFODVVLILHUWUDLQLQJD
3 . Classification support vector machine classifier training a
ZH
we VXPPDUL]H
summarize 9 GLIIHUHQW
different REMHFW
object GHWHFWLRQ
detection SDSHUV
papers, LQFOXGLQJ
including JURXSRIFODVVLILFDWLRQOLQHDUVXSSRUWYHFWRUPDFKLQHFODVVLILHUV
group of classification linear support vector machine classifiers
5&11
RCNN EDVHG
based PHWKRGV
methods, <2/2
YOLO, 0'HW
M2Det DQGand &RUQHU1HW
CornerNet, HWF
etc. $OO
All DUHWUDLQHGZLWKIL[HGOHQJWKIHDWXUHVH[WUDFWHGE\&11LQVWHDG
are trained with fixed length features extracted by CNN instead
WKHVH
these SDSHUV
papers SOD\
play LPSRUWDQW
important UROHV
roles DQG
and REWDLQ
obtain VWDWHRIWKHDUW
state-of-the-art RIVRIWPD[LPXPFODVVLILHUVOHDUQHGE\ILQHWXQLQJ,WGHILQHVD
of soft maximum classifiers learned by fine tuning. It defines a
SHUIRUPDQFHV:HVXPPDUL]HDOOWKHIUDPHZRUNVPRWLYDWLRQVDQG
performances. We summarize all the frameworks, motivations and ORW
lot RI
of SRVLWLYH
positive H[DPSOHV
examples WRto EH
be JURXQG
ground WUXWK
truth ER[HV
boxes LQ
in RUGHU
order WRto
WUDLQLQJVWHSVIRUDOOPHWKRGV:HKRSHWKDWRXUUHYLHZFDQJLYH
training steps for all methods. We hope that our review can give WUDLQLQJ
training 690
SVM FODVVLILHUV
classifiers, ZKLFK
which DUH
are XVHG
used IRU
for HDFK
each FODVV
class. ,W
It LV
is
DQLQGHSWKJXLGDQFHIRUWKHEHJLQQHUVIRUUHFRJQL]LQJWKLVVHDUFK
an in-depth guidance for the beginners for recognizing this search QHJDWLYH
negative IRU
for WKH the FODVV
class LI
if WKH
the ,28
IOU RYHUODS
overlap RIof WKH
the UHJLRQ
region
ILHOGV
fields. UHFRPPHQGDWLRQZLWKDOOJURXQGWUXWKLQVWDQFHVRIWKHFODVVLV
recommendation with all ground truth instances of the class is
OHVV
less WKDQ
than
0 . 3 . %XW
But WR
to SD\
pay DWWHQWLRQ
attention, QRWLFH
notice WKDW
that WKH
the H[DPSOHV
examples
Keywords—Computer
Keywords-Computer vision,
vision, Visual
Visual object
object detection,
detection, RCNN,
RCNN, ZKDWHYHU
whatever WKH\
they DUHare SRVLWLYH
positive RU
or QHJDWLYH
negative, ZKLFK
which DUH
are GHILQHG
defined LQin
YOLO,
YOLO, CornerNet.
CornerNet.
RUGHUWRWUDLQLQJWKH690FODVVLILHUVWKH\KDYHDGLIIHUHQWIURP
order to training the SVM classifiers, they have a different from
,
I. ,INTRODUCTION
1752'8&7,21 WKRVHZKLFKGHILQHGIRUILQHWXQLQJWKH&11
those which defined for fine-tuning the CNN.
2EMHFW
Object GHWHFWLRQ
detection LVis WR
to ORFDWH
locate WKH
the REMHFW
object LQVWDQFHV
instances IURP
from
4. )RU
For HDFK
each REMHFW
object FODVV
class ZLWK
with &11
CNN FKDUDFWHULVWLFV
characteristics, OHDUQ
learn
SUHGHILQHGFDWHJRULHVZKLFKLVDIXQGDPHQWDODQGFKDOOHQJLQJ
predefined categories, which is a fundamental and challenging FODVVVSHFLILF
class-specific ERXQGLQJ
bounding ER[
box UHJUHVVLRQ
regression RU
or WUDLQ
train ERXQGLQJ
bounding ER[
box
WDVNV
tasks LQ
in FRPSXWHU
computer YLVLRQ
vision. 7DUJHW
Target GHWHFWLRQ
detection LV
is DOVR
also D
a SRSOXDU
popluar UHJUHVVLRQ
regress10n.
UHVHDUFK
research ILHOG
field. ,W
It ODLG
laid WKH
the IRXQGDWLRQ
foundation IRU
for PRGHUQ
modern LPDJH
image <2/2>@ˈLWLVDGHWHFWRUFDVWLQJREMHFWGHWHFWLRQZKLFK
YOLO [ 5] , it is a detector casting object detection which
XQGHUVWDQGLQJ
understanding DQG and FRPSXWHU
computer YLVLRQ
vision. 0RUHRYHU
Moreover, FRPSOH[
complex RU or
LVXQLILHG,WIRUPVDERXQGDU\ER[VHSDUDWLQJLPDJHSL[HOVDQG
is unified. It forms a boundary box separating image pixels and
DGYDQFHGYLVXDOWDVNVFDQQRWEHVROYHGZLWKRXWWDUJHWGHWHFWLRQ
advanced visual tasks cannot be solved without target detection.
VSDFHDQGPDNHVVRPHLQVWUXFWLRQV)RUH[DPSOHDGHVFULSWLRQ
space and makes some instructions. For example, a description
)RU
For H[DPSOH
example, WKH the VHJPHQWDWLRQ
segmentation, REMHFW
object WUDFNLQJ
tracking, LPDJH
image
RIUHODWHGFODVVHV,W VDOVRDUHJUHVVLRQSUREOHP
of related classes. It's also a regression problem
FDSWLRQLQJ
captioning, HYHQW
event GHWHFWLRQ
detection DQG
and DFWLYLW\
activity UHFRJQLWLRQ:KDW
[Link] LV is
PRUHREMHFWGHWHFWLRQLVDOVRZLGHO\XVHG,WFDQEHDSSOLHGLQ
more, object detection is also widely used. It can be applied in 5LJKWQRZ
Right now, <2/2XVHV
YOLO uses RQO\D
only a VPDOOSDUWRIWKH
small part of the GHWHFWLRQ
detection
PDQ\
many ILHOGV
fields OLNH
like URERW
robot YLVLRQ
vision, FRQVXPHU
consumer HOHFWURQLFV
electronics, VHFXULW\
security, DUHDWKDQNVWRWKHEHQHILWVWKDWWKHUHJLRQSURSRVDOJHQHUDWLRQ
area, thanks to the benefits that the region proposal generation
DXWRQRPRXVGULYLQJDQGDXJPHQWHGUHDOLW\
autonomous driving, and augmented reality. VWDJHZKLFKLVWRWDOO\GURSSHG<2/2XVHVWKHIHDWXUHVZKLFK
stage which is totally dropped. YOLO uses the features which
DUHIURPDWRWDOLPDJHJOREDOO\UDWKHUWKDQXVHVUHJLRQEDVHG
are from a total image globally, rather than uses region-based
,Q
In REMHFW
object GHWHFWLRQ
detection DUHD
area, QXPHURXV
numerous PHWKRGV
methods KDYH
have EHHQ
been
DSSURDFKHVVXFKDV)DVWHU5&11ZKRVHSUHGLFWGHWHFWLRQVDUH
approaches, such as Faster RCNN, whose predict detections are
SURSRVHG7KH&11VLQVSLUHGWKH5&11>@$VZHDOONQRZQ
proposed. The CNNs inspired the RCNN [ 1 ] . As we all known,
DOO
all EDVHG
based RQ
on IHDWXUHV
features WKDW
that IURP
from Da ORFDO
local UHJLRQ
region. 6SHFLDOO\
Specially, WKH
the
WKH
the &11V
CNNs KDYH
have Da ELJ
big EUHDNWKURXJK
breakthrough LQ
in LPDJH
image FODVVLILFDWLRQ
classification
LPDJH
image LVis GLYLGHG
divided LQWR
into DQ
an 6 x 6
S î S JULG
grid E\
by <2/2
YOLO, FODVV
class &
C
UHVXOWV&11VKDYHDJUHDWVXFFHVVLQWKHVHOHFWLYHVHDUFKZKLFK
results. CNNs have a great success in the selective search which
SUREDELOLWLHV
probabilities ZKLFK
which DUH
are SUHGLFWHG
predicted E\
by HDFK
each VTXDUH
square. ,Q
In DGGLWLRQ
addition,
IRUKDQGFUDIWHGIHDWXUHV>@LQUHJLRQSURSRVDO7RH[SORUH
for hand-crafted features [ 1 -3] in region proposal. To explore
ER[ORFDWLRQVDUHERXQGHGE\%ZKLFKFRQILGHQFHWKHVFRUHV
box locations are bounded by B, which confidence the scores.
&11VLQJHQHULFREMHFWGHWHFWLRQDQGGHYHORSHG5&11KDQG
CNNs in generic object detection and developed RCNN, hand
6LQFH
Since WKH
the UHJLRQ
region SURSRVDO
proposal JHQHUDWLRQ
generation VWHS
step KDV
has EHHQ
been JRW
got ULG
rid RI
of
FUDIWHG
crafted IHDWXUHV
features >@
[ 1 -3] ZHUH
were DOPRVW
almost WKH
the ILUVW
first DQG
and WKH\
they XVHG
used Da
WRWDOO\<2/2KDVEHHQIDVWVLQFHLWZDVSURGXFHG1RUPDOO\LW
totally, YOLO has been fast since it was produced. Normally, it
UHJLRQSURSRVDOVHOHFWLYHVHDUFK>@WRLQWHJUDWHV$OH[1HW>@
region proposal selective search [2] to integrates AlexNet [4] .
LV
is UXQQLQJ
running XVXDOO\
usually DW
at DERXW
about
45 )36
FPS LQin UHDO
real WLPH
time EXW
but WKH
the )DVW
Fast
7UDLQLQJ
Training DQ
an 5&11
RCNN IUDPHZRUN
framework UHTXLUHV
requires PDQ\
many PXOWLVWDJH
multistage
YHUVLRQ
version RI
of<2/2
YOLO ZKLFK
which FDQ
can UHDFK
reach DW
at DERXW
about
155 )36
FPS. <2/2
YOLO
SLSHOLQHV
pipelines:
XVXDOO\
usually SXWV
puts WKH
the REMHFW
object FODVVHV
classes ZLWK
with LPSOLFLWO\
implicitly HQFRGHV
encodes
7KHUH
l .There LV
is D
a PRGHODJQRVWLF
model-agnostic UHJLRQ
region SURSRVDOV
proposals, ZKLFK
which LV
is LQIRUPDWLRQEHFDXVHLWFDQVHHWKHZKROHLPDJHZKHQLWLVGRLQJ
information because it can see the whole image when it is doing
EHORQJHG
belonged WRto UHJLRQ
region SURSRVDO
proposal FRPSXWDWLRQ
computation FODVV
class. 7KH\
They DUH
are SUHGLFWLRQV:KDWLVPRUHLQWKHEDFNJURXQGWKHIDOVHSRVLWLYHV
predictions. What is more, in the background, the false positives
FDQGLGDWHUHJLRQVDQGDORWRIREMHFWVPLJKWEHFRQWDLQHGLQLW
candidate regions, and a lot of objects might be contained in it. ZKLFKDUHWKH<2/2OHVVZDQWHGWRSUHGLFW
which are the YOLO less wanted to predict.
8VXDOO\WKH\DUHREWDLQHGWKURXJKDVHOHFWLYHVHDUFK>@
Usually, they are obtained through a selective search [2] .
0RUHUHFHQWO\>@UDLVHGDTXHVWLRQZKLFKLVDERXWZKHWKHU
More recently, [ 6] raised a question which is about whether
2. $QRWKHU
Another RQH
one LV
is FDOOHG
called &11
CNN PRGHO
model ILQHWXQLQJ
finetuning 5HJLRQ
Region WKHDQFKRUER[HVKDYHEHHQYDOLGO\XVHGLQ6R$REMHFWGHWHFWLRQ
the anchor boxes have been validly used in SoA object detection
SURSRVDOV)LUVWO\WKH\ZLOOEHFURSSHGIURPWKHLPDJH7KHQ
proposals. Firstly, they will be cropped from the image. Then IUDPHZRUNV>@RUQRW,WDOVRTXHVWLRQVWKDWZKHWKHU
frameworks [5, 7-9] or not. It also questions that whether WKH
the
WKH\
they ZLOO
will EH
be ZDUSHG
warped LQWR
into WKH
the VDPH
same VL]H
size. )LQDOO\
Finally, WKH\
they DUH
are WKH
the ER[HVKDYHSOD\HGWKHGRPLQDQWUROHLQWKHIUDPHZRUNV>@DOVR
boxes have played the dominant role in the frameworks. [6] also
978-1-6654-7997-4/22/$3 1 .00©2022
978-1-6654-7997-4/22/$31.00 ©2022IEEE
IEEE 204
204
DOl10.1109/IWECAI55315.2022.00047
DOI 1 0 . 1 1 0911WECAI553 15.2022.00047
Authorized licensed use limited to: Infineon Technologies AG. Downloaded on March 28,2025 at [Link] UTC from IEEE Xplore. Restrictions apply.
DUJXHVWKDWWKHUHKDVDORWRIGUDZEDFNV>@ZKLFKDUHXVHG
argues that there has a lot of drawbacks [6, 1 1] , which are used 7KHUHDUHWZRSRLQWVKHUHERWKRIZKLFKDUHFULWLFDO7KHILUVW
There are two points here, both of which are critical. The first
IRU
for DQFKRU
anchor ER[HV
boxes, DQG
and WKLV
this LV
is HVSHFLDOO\
especially WUXH
true ZLWK
with RQHVWDJH
one-stage LGHDLVWKDWLQRUGHUWRIDFLOLWDWHORFDOL]DWLRQDQGVHJPHQWDWLRQ
idea is that, in order to facilitate localization and segmentation
GHWHFWRUV>@)RUH[DPSOHHQRUPRXVLPEDODQFHKDVEHHQ
detectors [5,9- 1 1 ] . For example, enormous imbalance has been RI
of REMHFWV
objects, FRQYROXWLRQDO
convolutional QHXUDO
neural QHWZRUNV
networks ZKLFK
which ZLWK
with KLJK
high
FDXVHGEHWZHHQWKHSRVLWLYHDQGQHJDWLYHH[DPSOHV,QDGGLWLRQ
caused between the positive and negative examples. In addition, FDSDFLW\FDQEHDSSOLHGWRUHJLRQVXJJHVWLRQVZKLFKLVERWWRP
capacity can be applied to region suggestions which is bottom
LQ
in RUGHU
order WR
to LQWURGXFLQJ
introducing H[WUD
extra K\SHUSDUDPHWHUV
hyperparameters, LWit ZRXOG
would VORZ
slow XS7KHVHFRQGLGHDLVWRGRSUHWUDLQLQJSDUWLFXODUO\VXSHUYLVHG
up. The second idea is to do pre-training, particularly supervised
GRZQWKHWUDLQLQJ7KHUHDVRQZK\>@FDQSURSRVH&RUQHUQHWLV
down the training. The reason why [ 6] can propose Cornernet is SUHWUDLQLQJ
pre-training, IRU
for DQFLOODU\
ancillary WDVNV
tasks ZKHUH
where WUDLQLQJ
training GDWD
data LV
is VFDUFH
scarce.
WKDWLWUHIHUVWRDUHVHDUFKUHVXOWDERXWFRUUHODWLRQHPEHGGLQJLQ
that it refers to a research result about correlation embedding in 7KHQPDNHVXEWOHDGMXVWPHQWVIRUWKHDUHDVZKHUHLVVSHFLILF
Then, make subtle adjustments for the areas where is specific.
PXOWLSHUVRQDWWLWXGHHVWLPDWLRQ>@,QDGGLWLRQLWGHILQHGWKH
multi-person attitude estimation [ 12] . In addition, it defined the %\GRLQJVRSHUIRUPDQFHFDQEHVLJQLILFDQWO\LPSURYHG
By doing so, performance can be significantly improved.
ERXQGLQJ
bounding ER[box REMHFW
object GHWHFWLRQ,W
detection. It GHILQHV
defines WKLV
this GHWHFWLRQ
detection DV
as
GHWHFWLQJ 7KH
The REMHFW
object GHWHFWLRQ
detection V\VWHP
system ZKLFK
which LV
is FRQVLVWHG
consisted RI
of WKUHH
three
detecting NH\
key SRLQWV
points, ZKLFK
which DUH
are SDLUHG
paired ZLWK
with WKH
the WRSOHIW
top-left DQG
and
ORZHUERWWRPFRUQHUV PRGXOHV
modules. 7KH
The ILUVW
first JHQHUDWHV
generates FDWHJRU\LQGHSHQGHQW
category-independent UHJLRQ
region
lower bottom-corners.
SURSRVDOV7KHVHFRQG
proposals. The second PRGXOHLVDODUJHFRQYROXWLRQDOQHXUDO
module is a large convolutional neural
,QRUGHUWREHWWHUORFDWHWKHFRUQHUVWKHUHLVDPHWKRGZKLFK
In order to better locate the corners, there is a method which QHWZRUN7KHWKLUGPRGXOHLVDVHWRIFODVVVSHFLILFOLQHDU690V
network. The third module is a set of class specific linear SVMs.
LVFDOOHGWKHVLPSOHFRUQHUSRROLQJ
is called the simple corner pooling PHWKRG,QWKH&RUQHU1HW
method. In the CornerNet,
WZRVWDFNHG+RXUJODVVQHWZRUNV>@IRUPDEDFNERQHQHWZRUN
two stacked Hourglass networks [ 1 3 ] form a backbone network. R-CNN: Regions with CNNfeatures
7KH&RUQHU1HWLVVXSHULRUWRDOOSUHYLRXVVLQJOHVWDJHGHWHFWRUV
The CornerNet is superior to all previous single-stage detectors
as it achieves 42. 1 percent AP in MS COCO. However, it was
DVLWDFKLHYHVSHUFHQW$3LQ06&2&2+RZHYHULWZDV
FOHDU
clear WKDW
that 66'
SSD >@
[9] DQG
and <2/2
YOLO >@[5] DYHUDJHG
averaged IDVWHU
faster LQIHUHQFH
inference
WLPHVWKDQ7LWDQDQG7LWDQ;*38VKDGDWLPHRIDURXQG)36
times than Titan, and Titan X GPUs had a time of around 4FPS.
5HFHQWREMHFWGHWHFWLRQPHWKRGVFDQEHFDWHJRUL]HGDVWZR 1. Input [Link] region 3 . Compute 4. Classify
image proposals (�2k) CNN features regions
Recent object detection methods can be categorized as two
W\SHVLQFOXGLQJWZRVWDJHGHWHFWRUDQGRQHVWDJHGHWHFWRU
types, including two-stage detector and one-stage detector.
)LJXUH7KHIUDPHZRUNRI5&11
Figure . l The framework ofRCNN.
7ZRVWDJHWDUJHWGHWHFWLRQFRQVLVWVRIWZRVWDJHV7KHUHIRUH
Two-stage target detection consists of two stages. Therefore,
LWFDQDOVREHVHHQDVDFDVFDGH7KHWDVNRIWKHILUVWVWDJHZDV
it can also be seen as a cascade. The task of the first stage was 7KH
The VHFRQG
second DUWLFOH
article LV
is 6331HW
SPPNet >@
[ 1 6] . 6331HW
SPPNet PD\
may H[WUDFW
extract
WR
to UHPRYH
remove D a ODUJH
large QXPEHU
number RI
of EDFNJURXQG
background GHWHFWRUV
detectors DQG
and WKH
the IHDWXUH
feature LPDJHV
images DW
at RQH
one VFDOH
scale RU
or PXOWLSOH
multiple VFDOHV
scales, EXW
but WKH\
they RQO\
only
VHFRQG
second VWDJH
stage ZDV
was WR
to FODVVLI\
classify WKH
the UHPDLQLQJ
remaining DUHDV
areas. 6LQFH
Since WKH
the H[WUDFW
extract IHDWXUH
feature LPDJHV
images RQFH
once IURP
from WKH
the HQWLUH
entire LPDJH
image. $IWHU
After
DGYHQWRI5&11>@WKHGRPLQDQWVWUDWHJ\KDVEHHQWKHUHJLRQ
advent ofRCNN [ 1 ] , the dominant strategy has been the region H[WUDFWLRQ
extraction, IHDWXUH
feature PDSSLQJ
mapping LVis SHUIRUPHG
performed IRU for HDFK
each FDQGLGDWH
candidate
EDVHGSLSHOLQHVWUDWHJ\)RUH[DPSOHWKHUHDUHQRZPDQ\PDMRU
based pipeline strategy. For example, there are now many major ZLQGRZDQGWKHQVSDWLDOS\UDPLGLVDSSOLHGWRHDFKZLQGRZ
window, and then spatial pyramid is applied to each window.
UHVXOWVZKLFKDUHRQSRSXODUEHQFKPDUNGDWDVHWVDUHEDVHGRQ
results, which are on popular benchmark datasets, are based on $QGLVUHSUHVHQWHGE\WKHIL[HGOHQJWKRIWKHSRROHGZLQGRZ
And is represented by the fixed length of the pooled window.
)DVWHU5&11>@+RZHYHUEHFDXVHVRPHGHYLFHVKDYHYHU\
Faster RCNN [ 14] . However, because some devices have very 8VLQJWKLVPHWKRGWKHVSHHGFDQEHJUHDWO\LPSURYHGDWPRVW
Using this method, the speed can be greatly improved, at most
OLPLWHG
limited VWRUDJH
storage FDSDFLW\
capacity DQG
and FRPSXWLQJ
computing SRZHU
power, WKH\
they DUH
are YHU\
very E\
by VHYHUDO
several RUGHUV
orders RI
of PDJQLWXGH
magnitude, EHFDXVH
because RQO\only RQH
one WLPH
time
H[SHQVLYH
expensive LQin WKH
the UHJLRQEDVHG
region-based FRPSXWLQJ
computing PHWKRG
method. (VSHFLDOO\
Especially consuming convolution is applied.
FRQVXPLQJFRQYROXWLRQLVDSSOLHG
IRUFXUUHQWPRELOHRUZHDUDEOHGHYLFHV
for current mobile or wearable devices.
7KH
The LQSXW
input RIof )DVW
Fast 5&11>@QHWZRUN
R-CNN[ 1 7]network LV is GLYLGHG
divided LQWR
into WZR
two
On WKH
2Q the RWKHU
other KDQG
hand, LQin WKH
the RYHUDOO
overall VHWWLQJ
setting, Da VLQJOH
single SDUWV2QHSDUWLVWKHHQWLUHLPDJHDQGWKHRWKHUSDUWLVDVHWRI
parts. One part is the entire image, and the other part is a set of
IHHGIRUZDUG&11FDQEHXVHGWRSUHGLFWWKHFODVVSUREDELOLWLHV
feedforward CNN can be used to predict the class probabilities REMHFWSURSRVDOV7KHQHWZRUNILUVWSURFHVVHVWKHHQWLUHLPDJH
object proposals. The network first processes the entire image,
RIWKHZKROHLPDJHDQGWKHVWUXFWXUHRIWKHERXQGLQJER[RIIVHWV
of the whole image and the structure ofthe bounding box offsets, XVLQJFRQYROXWLRQDQG
using convolution and PD[LPXPSRROLQJOD\HUVDQGWKHQWKH
maximum pooling layers, and then the
ZKLFKFDQEHGLUHFWO\SUHGLFWHG7KHQDOOWKHFDOFXODWLRQVDUH
which can be directly predicted. Then, all the calculations are &219
CONV IHDWXUH
feature PDS
map LV
is JHQHUDWHG
generated. 1H[W
Next, Da IHDWXUH
feature YHFWRU
vector ZLWK
with
HQFDSVXODWHGLQDQHWZRUN7KLVLVKRZWKHSULPDU\GHWHFWRURI
encapsulated in a network. This is how the primary detector of IL[HGOHQJWKLVH[WUDFWHGIURPWKHIHDWXUHJUDSK)LQDOO\IHDWXUH
fixed length is extracted from the feature graph. Finally, feature
WKH
the XQLIRUP
uniform SLSHOLQH
pipeline LV
is XVHG
used, DQG
and WKH
the RYHUDOO
overall VHWXS
setup RI
of SRVW
post YHFWRUVDUHHQWHUHGRQHE\RQHLQDIXOO\FRQQHFWHGVHTXHQFHRI
vectors are entered one by one in a fully connected sequence of
FODVVLILFDWLRQ
classification RU
or IHDWXUH
feature UHVDPSOLQJ
resampling LV is QRW
not LQYROYHG
involved LQin WKH
the OD\HUV7KH\DUHWKHQGLYLGHGLQWRWZRRXWSXWOD\HUVZKLFKDUH
layers. They are then divided into two output layers which are
JHQHUDWLRQ
generation RIof WKH
the SURSRVHG
proposed UHJLRQ
region. ,Q
In DGGLWLRQ
addition, HQGWRHQG
end-to-end HTXLYDOHQW,Q)DVW5&11WUDLQLQJVWRFKDVWLFJUDGLHQWGHVFHQW
equivalent. In Fast RCNN training, stochastic gradient descent
RSWLPL]DWLRQFDQEHFDUULHGRXWIRUGHWHFWLRQSHUIRUPDQFHDQG
optimization can be carried out for detection performance and 6*' PLQLEDWFKHVDUHVDPSOHGKLHUDUFKLFDOO\+HUHLVDYHU\
(SGD) minibatches are sampled hierarchically. Here is a very
GHWHFWLRQ
detection VSHHG
speed, VLQFH
since LW
it LV
is D
a VLQJOH
single QHWZRUN
network IRU
for WKH
the HQWLUH
entire VWUHDPOLQHG
streamlined WUDLQLQJ
training SURFHVV
process, XVHG
used E\
by )DVW
Fast 5&11
R-CNN, ZKLFK
which
SLSHOLQH,QWKLVSDSHUZHDLPWRVXPPDU\WKHSRSXODUSDSHUV
pipeline. In this paper, we aim to summary the 9 popular papers LQFOXGHV
includes D a ILQHWXQLQJ
fine-tuning VWDJH
stage, LQ
in ZKLFK
which WKH
the WZR
two UHJUHVVRUV
regressors DUH
are
LQREMHFWGHWHFWLRQWDVNV:HFDWHJRUL]HWKRVHSDSHUVLQWRWZR
in object detection tasks. We categorize those papers into two MRLQWO\
jointly RSWLPL]HG
optimized. 6RIWPD[
Softmax FODVVLILHU
classifier DQG
and ERXQGDU\
boundary ER[
box
W\SHVFRQVLVWLQJRIWZRVWDJHGHWHFWRUDQGRQHVWDJHGHWHFWRU
types, consisting of two-stage detector and one-stage detector. UHJUHVVRUUHVSHFWLYHO\
regressor respectively.
,,
II. 2%-(&7'
OBJECT (7(&7,210
DETECTION (7+2'6
METHODS 7KH5HJLRQSURSRVDOQHWZRUNLVRQHRIWKHNH\FRPSRQHQWV
The Region proposal network is one of the key components
LQ)DVWHU5&11>@,WVLQSXWLVDQLPDJHWKDWFDQEHRIDQ\
in Faster RCNN [ 1 8] . Its input is an image that can be of any
A. Two-stage
Two-stage detector
detector VL]HDQGLWVRXWSXWLVDVHWRIUHFWDQJXODUREMHFWSURSRVDOVDQG
size and its output is a set of rectangular object proposals, and
:LWKWKHXVHRI6,)7DQG+2*SURJUHVVHVRQYDULRXVYLVXDO
With the use of SIFT and HOG, progresses on various visual DQREMHFWGHJUHHVFRUHZLOOH[LVWLQHDFKSURSRVDO
an object degree score will exist in each proposal.
UHFRJQLWLRQWDVNVKDVDFKLHYHGJUHDWVXFFHVVLQWKHODVWGHFDGH
recognition tasks has achieved great success in the last decade.
2EMHFWGHWHFWLRQSHUIRUPDQFHKDVVWDELOL]HGRYHUWKHSDVWIHZ
Object detection performance, has stabilized over the past few 7R
To GR
do WKLV
this, \RX
you VKRXOG
should PDNH
make PRGHOV
models E\
by XVLQJ
using D
a IXOO\
fully
\HDUV
years LQ
in PHDVXULQJ
measuring, GHWHFWLQJ
detecting REMHFWV
objects RQ
on VWDQGDUG
standard 3$6&$/
PASCAL FRQYROXWLRQDOQHWZRUN>@)LUVWJHQHUDWLQJWKHWUDQVIRUPDWLRQ
convolutional network[ 1 9] . First, generating the transformation
92&GDWDVHWV,QWHJUDWHGV\VWHPVZKLFKDUHFRPSOH[DUHXVHG
VOC datasets. Integrated systems which are complex, are used OD\HUDQGILQGWKHODVWVKDUHGWUDQVIRUPDWLRQOD\HU7KLVLVWKHQ
layer and find the last shared transformation layer. This is then
WR
to FRPELQLQJ
combining KLJKOHYHO
high-level FRQWH[W
context ZLWK
with PDQ\
many ORZOHYHO
low-level LPDJH
image XVHGWRFRQYHUWWKHRXWSXWLQWRDWUDQVIRUPDWLRQIHDWXUHJUDSK
used to convert the output into a transformation feature graph,
IHDWXUHV
features. ,Q
In DGGLWLRQ
addition, WKLVthis LV
is SUREDEO\
probably WKH
the EHVW
best PHWKRG
method IRU
for DQG
and Da VPDOO
small QHWZRUN
network VOLGHV
slides RYHU
over LW
it. +RZHYHU
However, EHFDXVH
because PLQL
mini
SHUIRUPDQFH
performance. &11
CNN FODVVLILFDWLRQ
classification UHVXOWV
results ZKLFK
which DUH
are RQ
on WKH
the QHWZRUNV
networks RSHUDWH
operate LQ
in D
a VOLGLQJ
sliding ZLQGRZ
window, IXOO\
fully FRQQHFWHG
connected OD\HUV
layers
,PDJH1HWFDQEHJHQHUDOL]HGWRDODUJHH[WHQWWRGHWHFWUHVXOWV
ImageNet can be generalized to a large extent to detect results DUHVKDUHGDWDOPRVWDOOVSDWLDOORFDWLRQV0RUHRYHUWKHV\VWHP
are shared at almost all spatial locations. Moreover, the system
EDVHG
based RQ
on REMHFWV
objects LQ
in 3$6&$/
PASCAL 92& VOC FKDOOHQJHV
challenges. )LUVW
First, ZH
we LVHDV\WRLPSOHPHQW:HFDQXVHDQQ
is easy to implement. We can use an n *QFRQYOD\HUIROORZHG
n conv layer followed
LQWURGXFH
introduce 5&11
RCNN >@
[ 1 5 ] , ZKLFK
which LV
is WKH
the ILUVW
first WZRVWDJH
two-stage GHWHFWRU
detector.
205
205
Authorized licensed use limited to: Infineon Technologies AG. Downloaded on March 28,2025 at [Link] UTC from IEEE Xplore. Restrictions apply.
E\WZRVLEOLQJ
by two sibling 1 *FRQYOD\HUV IRUUHJDQGFOVUHVSHFWLYHO\
1 conv layers (for reg and cls, respectively) WKHPRGHOSUHGLFWVWKHFRUUHVSRQGLQJHPEHGGHGYHFWRUIRUHDFK
the model predicts the corresponding embedded vector for each
WRLPSOHPHQWLW)RUH[DPSOH5H/8V>@LVXVHGLQKHUH
to implement [Link] example,ReLUs [20] is used in here. FRUQHU7KHGLVWDQFHEHWZHHQWKHYHUWH[SDLUVRIWKHVDPHWDUJHW
comer. The distance between the vertex pairs of the same target
FDQ
can EH
be VKRUWHVW
shortest WKURXJK
through WKH
the HPEHGGHG
embedded YHFWRU
vector. 7KHUHIRUH
Therefore, WKH
the
7KHUH
There DUH
are PDQ\
many LPSOHPHQWDWLRQV
implementations RIof IXOO
full FRQYROXWLRQDO
convolutional PRGHO FDQ JURXS HDFK YHUWH[ WKURXJK WKH
model can group each vertex through the
QHWZRUNV>@RQHRIZKLFKLV531,ILWLVWUDLQHGHQGWRHQG
networks [ 1 9] , one of which is RPN. If it is trained end-to-end, HPEHGGHGYHFWRU)LQDOO\
[Link], DIWHU
after 06
MS &2&2
COCO WHVWLQJ
testing, WKH
the
LWFDQXVHEDFNSURSDJDWLRQDQGVWRFKDVWLFJUDGLHQWGHVFHQW>@
it can use back propagation and stochastic gradient descent [2 1 ] . &RUQHU1HW
ComerNet RXWSHUIRUPV
outperforms DOO
all WKH
the RQHVWDJH
one-stage WDUJHW
target GHWHFWLRQ
detection
$QGQRZDQHZWUDLQLQJDOJRULWKPKDVEHHQGHYHORSHGWKHUH
And now, a new training algorithm has been developed, there PHWKRGV
methods.
DUH
are IRXU
four VWHSV
steps. 7KLV
This DOJRULWKP
algorithm OHDUQV
learns VKDUHG
shared IHDWXUHV
features WKURXJK
through
DOWHUQDWHRSWLPL]DWLRQ
alternate optimization. Heatmaps Embr>ddings
0DVN5&11>@DOVRDGRSWVWZRVWDJHZKLFKLVGLYLGHG
Mask R-CNN [22] also adopts two-stage, which is divided
LQWR
into WZR
two VWDJHV
stages. 7KH
The ILUVW
first VWDJH
stage LV
is WKH
the VDPH
same DV
as 531
RPN, EXW
but WKH
the
VHFRQGVWDJHLVGLIIHUHQW,QWKHVHFRQGSKDVHLWRXWSXWVDELQDU\
second stage is different. In the second phase, it outputs a binary
PDVNIRUHDFK5RO$QGDWWKLVVWDJHLWLVSDUDOOHOLQSUHGLFWLQJ
mask for each Rol. And, at this stage, it is parallel in predicting
FODVVDQGER[RIIVHWV0RUHRYHULWDOVRGHILQHVWKHPXOWLWDVN
class and box offsets. Moreover, it also defines the multi-task
loss. They will make L =/Lc1s
ORVV7KH\ZLOOPDNH/ FOV/ ER[/
+ Loox PDVNWREHGHILQHGDV
+ Lmask to be defined as
ORVVHVRQ5ROSHUVDPSOH
losses on Rol per sample.
,Q
In DGGLWLRQ
addition, D
a QHZ
new OD\HU
layer ZDV
was DOVR
also SURSRVHG
proposed E\
by WKHP
them. 7KH
The
QDPHRIWKLVOD\HULV5RO$OLJQ7KHSXUSRVHRIWKLVOD\HULVWR
name of this layer is RolAlign. The purpose of this layer is to
VWULS
strip DZD\
away VRPH
some RIof WKH
the KDUVK
harsh TXDQWL]DWLRQ
quantization DQG
and DOLJQ
align WKH
the
H[WUDFWHG
extracted IHDWXUHV
features FRUUHFWO\
correctly ZLWK
with WKH
the LQSXW
input. 0RUHRYHU
Moreover, WKH\
they )LJXUHˊ2YHUDOOIUDPHZRUNRI&RUQHU1HW
Figure 2. Overall framework ofComerNet.
LQVWDQWLDWHG
instantiated 0DVN
Mask 5&11
R-CNN DQG
and DSSOLHG
applied PXOWLSOH
multiple DUFKLWHFWXUHV
architectures.
7KH\KDYHDOVRLPSURYHGWKHQHWZRUNKHDGHU2QWKHEDVLVRI
They have also improved the network header. On the basis of 0'HW>@DOVRILUVWO\H[WUDFWVWKHIHDWXUHVIURPWKHLQSXW
M2Det [27] also firstly extracts the features from the input
WKH
the SUHYLRXVO\
previously SURSRVHG
proposed DUFKLWHFWXUH
architecture, WKH\
they DGG
add Da EUDQFK
branch IRU
for LPDJH
image. 0RUHRYHU
Moreover, EDFNERQH
backbone QHWZRUN
network DQG
and PXOWLOHYHO
multilevel IHDWXUH
feature
SUHGLFWLRQRIIXOO\FRQYROXWLRQPDVNV
prediction of fully convolution masks. S\UDPLGQHWZRUNFDQEHXVHG7KHQOHDUQIURPWKHP7KURXJK
pyramid network can be used. Then, learn from them. Through
OHDUQLQJ
learning, ERXQGDU\
boundary ER[HV
boxes ZKLFK
which LV
is GHQVH
dense DQG
and FDWHJRU\
category VFRUHV
scores
B.
B. One-stage
One-stagedetector
detector ZLOOEHJHQHUDWHG)LQDOO\UHVXOWVDUHSURGXFHGE\SHUIRUPLQJD
will be generated. Finally, results are produced by performing a
7KHUH
There are many JULGV
DUH PDQ\ grids LQ
in <2/2
YOLO >@
[23 ] . $OVR
Also, WKHUH
there ZLOO
will EH
be QRQPD[LPXPVXSSUHVVLRQPHWKRG
non-maximum suppression method.
REMHFW
object FHQWHUVIDOOLQJ
centers falling LQ
in WKH
the JULGFHOO7KH
grid cell. The JULGFHOOLV
grid cell is XVHGWR
used to
GHWHFWWKLVREMHFW3UHGLFWLRQVDUHPDGHIRUHDFKJULGFHOO)RU
detect this object. Predictions are made for each grid cell. For $QHZ
A new PRGXOH
module ZDV
was FDPH
came XS7KLV
up. This LVD
is a PRGXOHZKLFKFDQ
module which can
H[DPSOH%ERXQGLQJER[HVDQGFRQILGHQFHVFRUHVDERXWWKHVH
example, B bounding boxes and confidence scores about these DXWRPDWLFDOO\IXVH,WLVFDOOHGWKH$XWRIXVLRQ>@$QGLWKDV
automatically fuse. It is called the Auto-fusion [28] .And it has
ER[HV
boxes DUH
are SUHGLFWHG,Q
[Link] DGGLWLRQHDFK
addition,each JULG
grid FHOO
cell DOVR
also SUHGLFWV
predicts &C Da IXOO\
fully FRQQHFWHG
connected VHDUFK
search VSDFH
space. 2Q
On WKH
the RWKHU
other KDQG
hand, GLIIHUHQW
different
FRQGLWLRQDOFODVVSUREDELOLWLHV3U &ODVVL_2EMHFW
conditional class probabilities,Pr(ClassdObject). H[WHQGHG
extended FRQYROXWLRQ
convolution ZLOO
will EH
be DFFRPPRGDWHG
accommodated DW at HDFK
each IHDWXUH
feature
OHYHOVXFK
level,such DVas 7ULGHQW1HW
TridentNet >@
[29] . 7KLV
This PDNHV
makes LWit HDVLHU
easier WR
to DVVLJQ
assign
:KHQ
When WHVWHG
tested, HDFK
each ER[
box ZDV
was JLYHQ
given D
a VSHFLILF
specific FDWHJRU\
category IRU
for WKHPDSSURSULDWHUHFHSWLRQWKUHVKROGV6RWKHUHDUHDOONLQGVRI
them appropriate reception thresholds. So, there are all kinds of
UHOLDELOLW\VFRUHV7KHZD\WRGRWKLVLVVLQFHHDFKER[KDVD
reliability scores. The way to do this is, since each box has a H[WHQGHG
extended FRQYROXWLRQ
convolution RSHUDWLRQV
operations, DQG
and WKH\ UH DOO
they're all LQFOXGHG
included LQ
in
UHOLDEOHSUHGLFWLRQZHPXOWLSO\WKHFRQGLWLRQDOSUREDELOLW\E\LW
reliable prediction,we multiply the conditional probability by it. DXWRPDWLFIXVLRQ
automatic fusion.
7KH\
They LPSOHPHQW
implement WKLV
this PRGHO
model DV
as Da FRQYROXWLRQDO
convolutional QHXUDO
neural 7KH\DOVRKDYHQHZLQVLJKWVLQWRIDVWWUDLQLQJ7KH\XVHD
They also have new insights into fast training. They use a
QHWZRUNDQGHYDOXDWHLWRQWKH3$6&$/92&GHWHFWLRQGDWDVHW
network and evaluate it on the PASCAL VOC detection dataset ILUVWRUGHU
first-order DSSUR[LPDWLRQ
approximation, DV as LQ
in >@
[30] . 7KH
The WUDLQLQJ
training GDWD
data ZHUH
were
>@7KH\DOVRWUDLQDIDVWYHUVLRQRI<2/2GHVLJQHGWRSXVK
[24] .They also train a fast version of YOLO designed to push UDQGRPO\GLYLGHGE\ILUVWRUGHUDSSUR[LPDWLRQ7KH\DUHVSOLW
randomly divided by first-order approximation. They are split
WKHERXQGDULHVRIIDVWREMHFWGHWHFWLRQ7KHILQDORXWSXWRIWKH
the boundaries of fast object detection. The final output of the LQWR
into WZR
two VHWV
sets WKDW
that DUH
are WKH
the VDPH
same VL]H
size DQG
and GR
do QRW
not LQWHUVHFW,Q
[Link]
QHWZRUNLVWKH WHQVRURISUHGLFWLRQV
network is the 7*7*30 tensor of predictions. addition,three indices are also considered for C (a,ߚሻ)LQDOO\
DGGLWLRQWKUHHLQGLFHVDUHDOVRFRQVLGHUHGIRUܥሺߙǡ (J). Finally,
)LUVWWKH\SUHWUDLQWKHFRQYROXWLRQOD\HU7KHPHWKRGLVWR
First, they pre-train the convolution layer. The method is to WKHUHDUHPDQ\UHJXODUL]DWLRQPHWKRGV)RUWKHWDUJHWIXQFWLRQ
there are many regularization methods. For the target function,
XVH
use Da FRQWHVW
contest GDWD
data VHW>@
set[25], ZKLFK
whichKDV
has
1 000 OHYHOV
levels. 7KHQ
Then, WKH\
they ZHFDQDGGUHVRXUFHFRQVWUDLQWV7KHUHIRUHWKLVPHWKRGLVXVHG
we can add resource constraints. Therefore, this method is used
WHVWHG
tested LW
it. 7KH
The PHWKRG
method RI
of GHWHFWLRQ
detection LV
is WR
to XVH
use Da WUDQVIRUPDWLRQ
transformation DQGKDVDYHU\JRRGHIIHFW0DQ\FXUUHQWRSWLPL]DWLRQVDSSO\
and has a very good effect. Many current optimizations apply
PRGHO
model. /LQHDU
Linear DFWLYDWLRQ
activation IXQFWLRQV
functions DUH
are XVHGIRU
used for WKH
the ODVW
last OD\HU
layer, UHVRXUFHFRQVWUDLQWVDQGWKLVDSSURDFKLVFRQYHQLHQW
resource constraints, and this approach is convenient.
DQG
and WKH
the RWKHU
other OD\HUV
layers DUH
are DFWLYDWHG
activated XVLQJ
using OHDNFRUUHFWHG
leak-corrected OLQHV
lines. ,,, ;3(5,0(17$/$1$/<6,6
III. (EXPERIMENTAL ANALYSIS
'HWDLOVDUHDVIROORZV7KH\RSWLPL]HIRUVXPVTXDUHGHUURULQ
Details are as follows :They optimize for sum-squared error in
WKHRXWSXWRIWKHLUPRGHO7RUHPHG\WKLVWKH\LQFUHDVHWKHORVV
the output of their model. To remedy this, they increase the loss ,QDGGLWLRQZHDOVRVXPPDUL]HWKHH[SHULPHQWDOUHVXOWVRI
In addition, we also summarize the experimental results of
IURPERXQGLQJER[FRRUGLQDWHSUHGLFWLRQVDQGGHFUHDVHWKHORVV
from bounding box coordinate predictions and decrease the loss WKHDERYHPHWKRGVZKLFKLVVKRZQLQWDEOH:KDWLVPRUHˈ
the above methods, which is shown in table 1 . What is more ,
IURPFRQILGHQFHSUHGLFWLRQVIRUER[HVWKDWGRQ¶WFRQWDLQREMHFWV
from confidence predictions for boxes that don't contain objects. ,I VRUWHG
sorted RXW
out WKH
the DFFXUDF\
accuracy DQG
and UXQQLQJ
running VSHHG
speed RI
ofHDFK
each PHWKRG
method
EDVHGRQGLIIHUHQWGDWDVHWV6HHWKHIROORZLQJWDEOH
based on different data sets. See the following table.
7KH\SUHGLFWWKHVTXDUHURRWRIWKHERXQGLQJER[ZLGWKDQG
They predict the square root of the bounding box width and
KHLJKW
height LQVWHDG
instead RI
of WKH
the ZLGWK
width DQG
and KHLJKW
height GLUHFWO\7R
directly .To DYRLG
avoid 7TABLE
$%/(1&COMPARJSON OF DIFFERENT OBJECT DETECTION METHODS
203$5,6212)',))(5(172%-(&7'(7(&7,210(7+2'6
RYHUILWWLQJWKH\XVHGURSRXWDQGH[WHQVLYHGDWDDXJPHQWDWLRQ
overfitting they use dropout and extensive data augmentation.
0HWKRG
Method 92&
VOC2 92&
VOC2 92&
VOC2 ,/695
ILSVR 06&2
MSCO 6SHHG
Speed
&RUQHU1HW>@DQHZRQHVWDJHDSSURDFKWRREMHFWGHWHFWLRQ
ComerNet [26], a new one stage approach to object detection
007
010
012 &
C20 1 3 &2
C0201
WKDW
that GRHV
does DZD\
away ZLWK
with DQFKRU
anchor ER[HV
boxes. 7KH
The &RUQHU1HW
ComerNet PRGHO
model 5
DUFKLWHFWXUHFRQVLVWVRIWKUHHSDUWVWKHWRUXVQHWZRUNWKHORZHU
architecture consists of three parts: the toms network, the lower R- CNN
5&11
58.5%
53.7%
53.3%
3 1 .4%
ULJKWDQGXSSHUOHIWKHDWPDSVDQGWKHSUHGLFWLRQPRGXOH)LUVW
right and upper-left heat maps, and the prediction module. First, 633QHW
SPPnet
54.2%
3 1 .84%
WKHDQJOHVDUHGHWHFWHGDQGWKHQJURXSHG,QWKHWUDLQLQJVWDJH
the angles are detected and then grouped. In the training stage,
206
206
Authorized licensed use limited to: Infineon Technologies AG. Downloaded on March 28,2025 at [Link] UTC from IEEE Xplore. Restrictions apply.
)DVW
Fast-
70.0%
68.8%
68.4%
19.7° o
9 GLVWLQJXLVKHG
distinguished SDSHUV
papers, LQFOXGLQJ
including 5&11
RCNN, )DVW
Fast 5&11
RCNN, )DVWHU
Faster
5&11
RCNN 5&110DVN5&116331HW&RUQHU1HW0'HW$XWR)XVLRQ
RCNN, Mask RCNN, SPPNet, CornerNet, M2Det, AutoFusion,
)DVWHU
Faster-
78.8%
75.9%
21.9% ISV
5fps <2/2:HFDWHJRUL]HDOOWKHSDSHUVLQWRWZRGLIIHUHQWW\SHV
YOLO. We categorize all the papers into two different types,
5&11
RCNN ZKLFKFRQVLVWVRIWZRVWDJHGHWHFWRUVDQGRQHVWDJHGHWHFWRUV
which consists of two-stage detectors and one-stage detectors.
MR-
05
78.2%
73.9°··0
$OO
&11
CNN
All WKH
the PHWKRGV
methods VKRZ
show VWURQJ
strong SHUIRUPDQFHV
performances DQG
and VDWLVI\LQJ
satisfying
<2/2
YOLO
63.4°··0
57.9% ISV
45fps
UHVXOWVRQVHYHUDOGDWDVHWV
results on several datasets.
&RUQHU1
ComerN
56.6% :HVWDWHWKDWWKLVSDSHUFDQSURYLGHDQLQGHSWKDQDO\VLVRQ
We state that this paper can provide an in-depth analysis on
HW WKLV
et this DUHD
area DQG
and JLYH
give D
a EULHI
brief JXLGDQFH
guidance IRU
for WKH
the EHJLQQHUV
beginners WR
to VWDUW
start
0'HW
M2Det
64.6%
OHDUQLQJREMHFWGHWHFWLRQDOJRULWKPV
learning object detection algorithms
$VFDQEHVHHQIURPWKHDERYHWDEOHWKHDFFXUDF\UDWHRI5
As can be seen from the above table, the accuracy rate of R 5()(5(1&(6
REFERENCES
&11PHWKRGLVLQ92&LQ92&
CNN method is 58.5% in VOC2007, 5 3 . 7% in VOC20 10, 53.3% >@
[1] *LUVKLFN
Girshick:_ 5
R.. 'RQDKXH
Donahue. -
J.. 'DUUHOO
DarrelL 7
T.. & 0DOLN
Malik - 5LFK
J. (2014). Rich IHDWXUH
feature
LQ
in 92&
VOC20 12 DQG
and
3 1 .4% LQ
in ,/695&
ILSVRC20 1 3 . 7KH
The DFFXUDF\
accuracy RIof KLHUDUFKLHV
hierarchies IRUfor DFFXUDWH
accurate REMHFW
object GHWHFWLRQ
detection DQG
and VHPDQWLF
semantic VHJPHQWDWLRQ
segmentation. ,Q In
&935
CVPR (pp. SS±
580-587).
633BQHW
SPP net PHWKRG
method LVis
54.2% LQ
in 92&
VOC2007 DQG
and
3 1 . 84% LQ
in
,/695& >@8LMOLQJV-YDQGH6DQGH.*HYHUV7
[2] Uijlings. J.. van de Sande. h... Gevers. T.. &6PHXOGHUV$
Smeulders. A. (20 1 3 ).6HOHFWLYH
Selective
IL SVRC20 1 3 . 7KH
The DFFXUDF\
accuracy RIof )DVW5&11
Fast-RCNN RQon 92&
VOC2007, VHDUFKIRUREMHFWUHFRJQLWLRQ,-&9 ±
search for object recognition. IJCV. 1 04(2). 1 54-1 71 .
92&
VOC20 10 DQGand 92&
VOC20 12 LV is
70.0%,
68.8% DQG
and
68.4%
>@
[3] *LUVKLFN
Girshick 5 R .. 'RQDKXH
Donahue. -J.. 'DUUHOO
DarrelL 7 0DOLN -
T..&Malik:_
J. (20 16). 5HJLRQEDVHG
Region-based
UHVSHFWLYHO\,QDGGLWLRQWKHYDOXHRQ06&2&2LV7KH
respectively. In addition, the value on MSCOCO is 1 9 . 7%. The FRQYROXWLRQDO
convolutional QHWZRUNV
networks IRUfor DFFXUDWH
accurate REMHFW
object GHWHFWLRQ
detection DQG
and VHJPHQWDWLRQ
segmentation.
YDOXHRI
value of )DVWHU5&11
Faster-RCNN RQ on 92&DQG
VOC2007 and 92&LV
VOC20 12 is
78.8% ,(((73$0,
IEEE TPAMI, 3 8(1 ).±
1 42-1 58.
DQGUHVSHFWLYHO\7KHDFFXUDF\RQ06&2&2LV
and 75 . 9% respectively. The accuracy on MSCOCO is 2 1 . 9%. >@.UL]KHYVN\$6XWVNHYHU,
[4] [Link]. A.. Sutskever. I.. &+LQWRQ* D ,PDJH1HWFODVVLILFDWLRQ
Hinton. G. (2012a). ImageNet classification
$OVRLWUXQVDWWKHVSHHGZKLFKLVISV05&11LVRQ
Also, it runs at the speed which is 5fps. MR-CNN is 78.2% on ZLWKGHHSFRQYROXWLRQDOQHXUDOQHWZRUNV,Q1,36
with deep convolutional neural networks. In NIPS (pp. SS±
1 097-1 105).
92&DQGRQ92&<2/2LVDFFXUDWH
VOC2007 and 73.9% on VOC2012. YOLO is 63.4% accurate >@5HGPRQ-'LYYDOD6*LUVKLFN5
[5] Redmon. J.. Divvala. S .. Girshick:_ R .. &)DUKDGL$ <RXRQO\ORRN
Farhadi. A. (2016). You only look
RQ92&DQGRQ92&$QGLWUXQVDWISV7KH
on VOC2007 and 57.9% on VOC2012. And it runs at [Link] RQFH8QLILHGUHDOWLPHREMHFWGHWHFWLRQ,Q&935
once: Unified. real time object detection. In CVPR (pp. SS±
779-788).
&RUQHU1HW
CornerNet DQG
and 0'HW
M2Det KDYH
have
56.6% DQG
and
64.6% DFFXUDF\
accuracy RQ on >@
[6] /DZ
Law. +H.. & 'HQJ
Deng. - &RUQHU1HW
J. (2018). ComerNet: 'HWHFWLQJ
Detecting REMHFWV
objects DV as SDLUHG
paired
06&2&2UHVSHFWLYHO\7RVXPXSZHFDQILQGWKDWFRPSDUHG
MSCOCO, respectively. To sum up, we can find that compared NH\SRLQWV,Q(&&9
keypoints. In ECCV.
ZLWK
with WZRVWDJH
two-stage PHWKRGV
methods, RQHVWDJH
one-stage PHWKRGV
methods KDYH
have KLJKHU
higher >@*LUVKLFN5,DQGROD)'DUUHOO7
[7] Girshick. R.. Iandola. F .. DarrelL T.. &0DOLN- Malik J. (2015 ).'HIRUPDEOHSDUW
Deformable part
DFFXUDF\DQGIDVWHUVSHHG PRGHOVDUHFRQYROXWLRQDOQHXUDOQHWZRUNV,Q&935
models are convolutional neural networks. In CVPR (pp. SS±
437-446).
accuracy and faster speed.
>@+H.*NLR[DUL*'ROOiU3
[8] He. h. .. Gkioxari. G.. Dollar. P .. &*LUVKLFN5 0DVN5&11,Q
Girshick. R. (2017). Mask RCNN. In
,&&9
ICCV.
·�
·
�. >@/LX:$QJXHORY'(UKDQ'6]HJHG\&5HHG6)X&
·�·
[9] Liu. W.. Anguelov. D .. Erhan. D .. Szegedy. C .. Reed. S .. Fu. C., &%HUJ$ Berg. A.
(20 16).66'6LQJOHVKRWPXOWLER[GHWHFWRU,Q(&&9
SSD: Single shot multibox detector. In ECCV(pp. SS±
21-37).
�
� ��: >@)X&</LX:5DQJD$7\DJL$
[10] Fu. C.-Y .. Liu. W .. Ranga. A.. Tyagi. A.. &%HUJ$& Berg. A. C. (2017). '66'DSSD:
'HFRQYROXWLRQDOVLQJOHVKRWGHWHFWRUDU;LY
Deconvolutional single shot detector. arXiv : l 70 1 .06659.
>@/LQ7*R\DO3*LUVKLFN5+H.
[1 1] Lin. T.. GoyaL P.. Girshick R.. He. h... &'ROOiU3 E )RFDOORVV
Dollar. P. (2017b). Focal loss
IRUGHQVHREMHFWGHWHFWLRQ,Q,&&9
for dense object detection. In ICCV.
SSD FPN >@1HZHOO$+XDQJ=
[12] NewelL A.. Huang. Z .. &'HQJ- $VVRFLDWLYHHPEHGGLQJ(QGWR
Deng. J. (2017). Associative embedding: End to
-�� �1 HQGOHDUQLQJIRUMRLQWGHWHFWLRQDQGJURXSLQJ,Q1,36
end learning for joint detection and grouping. In NIPS (pp. SS±
2277-2287).
l�i�J [13] NewelL A.. Yang. K. &'HQJ-
>@1HZHOO$<DQJ.
KXPDQSRVHHVWLPDWLRQ,Q(&&9
6WDFNHGKRXUJODVVQHWZRUNVIRU
Deng. J. (2016). Stacked hourglass networks for
SS±
human pose estimation. In ECCV (pp. 483-499).
>@5HQ6+H.*LUVKLFN5
[14] Ren. S .. He. h. .. Girshick, R.. &6XQ- )DVWHU5&117RZDUGV
Sun. J. (2015). Faster R-CNN: Towards
UHDOWLPHREMHFWGHWHFWLRQZLWKUHJLRQSURSRVDOQHWZRUNV,Q1,36
real time object detection with region proposal networks. In NIPS (pp. SS±
91-
99).
>@5RVV*LUVKLFN-HII'RQDKXH7UHYRU'DUUHOO-LWHQGUD0DOLN5LFKIHDWXUH
[15] Ross Girshick Jeff Donahue Trevor Darrell Jitendra Malik. Rich feature
Auto-fusion PANet KLHUDUFKLHV
hierarchies IRUfor DFFXUDWH
accurate REMHFW
object GHWHFWLRQ
detection DQG
and VHPDQWLF
semantic VHJPHQWDWLRQ
segmentation. ,Q In
&935
CVPR 2014.
)LJXUHˊ2YHUDOOIUDPHZRUNRI$XWR)XVLRQ
Figure 3 . Overall framework of AutoFusion.
>@.DLPLQJ+H;LDQJ\X=KDQJ6KDRTLQJ5HQDQG-LDQ6XQ6SDWLDO3\UDPLG
[16] [Link] He. Xiangyu Zhang. Shaoqing Ren. and Jian Sun. Spatial Pyramid
3RROLQJ
Pooling LQ in 'HHS
Deep &RQYROXWLRQDO
Convolutional 1HWZRUNV
Networks IRUfor 9LVXDO
Visual 5HFRJQLWLRQ
Recognition. ,Q In
,9
IV. ) 8785(3
FUTURE 5263(&7
PROSPECT 73$0,
TPAML 20 1 5 .
)URP
From WKH
the DERYHPHQWLRQHG
abovementioned SDSHU
paper, ZH
we VXPPDUL]H
summarize VHYHUDO
several >@5RVV*LUVKLFN)DVW5&11,Q,&&9
[17] Ross Girshick. Fast R-CNN. I n ICCV 20 1 5 .
DVSHFWV
aspects IRU
for WKH
the IXWXUH
future SURVSHFW
prospect. )LUVW
First, REMHFW
object GHWHFWLRQ
detection >@
[ 1 8] 6KDRTLQJ
Shaoqing 5HQ Ren .DLPLQJ
[Link] +HH e 5RVV
Ross *LUVKLFN
Girshick -LDQ
Jian 6XQ
Sun. )DVWHU
Faster 5&11
R-CNN:
DOJRULWKPV
algorithms DUH
are WRZDUGV
towards WLPH
time HIILFLHQW
efficient. &XUUHQW
Current RQHVWDJH
one-stage 7RZDUGV5HDO7LPH2EMHFW'HWHFWLRQZLWK5HJLRQ3URSRVDO1HWZRUNV,Q
Towards Real-Time Object Detection with Region Proposal Networks. In
1,36
NIPS 20 1 5 .
GHWHFWRUVZLWKIDVWVSHHGDQGKLJKSHUIRUPDQFHGRPDLQDWHWKLV
detectors with fast speed and high performance domainate this
DUHD6HFRQGREMHFWGHWHFWLRQKDVEHHQDSSOLHGWRPRUHUHDOLVWLF >@-/RQJ(6KHOKDPHUDQG7'DUUHOO)XOO\FRQYROXWLRQDOQHWZRUNVIRU
[ 1 9 ] J . Long. E. Shelhamer. and T. Darrell. Fully convolutional networks for
area. Second, object detection has been applied to more realistic VHPDQWLFVHJPHQWDWLRQ,Q&935
semantic segmentation. In CVPR. 20 1 5 .
DSSOLFDWLRQVVXFKDV3&%GHIHFWGHWHFWLRQVSRUWVGHWHFWLRQDQG
applications, such as PCB defect detection, sports detection, and >@
[20] 9V. 1DLU
Nair DQG
and *
G . (
E. +LQWRQ
Hinton. 5HFWLILHG
Rectified OLQHDU
linear XQLWV
units LPSURYH
improve UHVWULFWHG
restricted
VPDOO
small REMHFW
object GHWHFWLRQ
detection VXEWDVNV
subtasks. )LQDOO\
Finally, REMHFW
object GHWHFWLRQ
detection EROW]PDQQPDFKLQHV,Q,&0/
boltzmann machines. In ICML. 20 10.
PHWKRGVDUHWUDQVIRUPHGWRPRELOHGHYLFHVVXFKDVGHYHORSHU
methods are transformed to mobile devices, such as developer >@
[21] <Y. /H&XQ
LeCun. % B. %RVHU
Boser. -
J. 6
S. 'HQNHU
Denker. 'D. +HQGHUVRQ
Henderson. 5 R. ( E. +RZDUG:
Howard.W.
ERDUGDQGUDVSEHUU\SLHWRDFKLHYHZLGHUDSSOLFDWLRQV
board and raspberry pie to achieve wider applications. +XEEDUGDQG/'-DFNHO%DFNSURSDJDWLRQDSSOLHGWRKDQGZULWWHQ]LS
Hubbard. and L. D. Jackel. Backpropagation applied to handwritten zip
FRGHUHFRJQLWLRQ1HXUDOFRPSXWDWLRQ
code recognition. Neural computation. 1989.
9
V. &21&/86,21
CONCLUSION >@.DLPLQJ+H*HRUJLD*NLR[DUL3LRWU'ROODғ
[22] [Link] He Georgia Gkioxari Piotr Dolla·Ur5RVV*LUVKLFN0DVN5&11
Ross Girshick. Mask R-CNN.
2EMHFWGHWHFWLRQLVDSRSXODUWDVNLQFRPSXWHUYLVLRQZKLFK
Object detection is a popular task in computer vision, which ,&&9
ICCV 20 16.
KDV
has GHYHORSHG
developed IRU
for D
a ORQJ
long WLPH
time. ,Q
In WKH
the ODVW
last GHFDGH
decade, QXPHURXV
numerous >@-RVHSK5HGPRQ6DQWRVK'LYYDOD5RVV*LUVKLFN$OL)DUKDGL<RX2QO\
[23] Joseph Redmon. Santosh Divvala. Ross Girshick Ali Farhadi. You Only.
GHWHFWLRQPHWKRGVKDYHEHHQSURSRVHG,QWKLVSDSHUZHJLYHD /RRN2QFH8QLILHG5HDO7LPH2EMHFW'HWHFWLRQ,Q&935
Look Once: Unified. Real-Time Object Detection. In CVPR 20 16.
detection methods have been proposed. In this paper, we give a
EULHILQWURGXFWLRQIRUREMHFWGHWHFWLRQPHWKRGV:HVXPPDUL]H
brief introduction for object detection methods. We summarize
207
207
Authorized licensed use limited to: Infineon Technologies AG. Downloaded on March 28,2025 at [Link] UTC from IEEE Xplore. Restrictions apply.
[Link], S. M. A. Eslami, L. Van Goo!, C. K. L Williams, l Winn,
>@0(YHULQJKDP60$(VODPL/9DQ*RRO&.,:LOOLDPV-:LQQ [27] Qijie Zhao, Tao Wang, Zhi Tang. Ying Chen, Ling Cai and
>@4LMLH=KDR7DR6KHQJ<RQJWDR:DQJ=KL7DQJ<LQJ&KHQ/LQJ&DLDQG
The SDVFDO
DQG $ =LVVHUPDQ 7KH pascal YLVXDO
visual REMHFW
obj ect FODVVHV
classes FKDOOHQJH
challenge: $
A Haibin Ling. 1n
+DLELQ/LQJ,Q$$$,
International Journal of Computer Vision, 1 1 1 (I
UHWURVSHFWLYH,QWHUQDWLRQDO-RXUQDORI&RPSXWHU9LVLRQ ):98-136,
± [28] Hang Xu Lewei Yaoh Wei Zhang! Xiaodan Liang2t
>@+DQJ;X/HZHL<DRכ:HL=KDQJ;LDRGDQ/LDQJ=KHQJXR/L$XWR Li Auto-
-DQ FPN: Automatic Network Architecture Adaptation for
)31$XWRPDWLF1HWZRUN$UFKLWHFWXUH$GDSWDWLRQIRU2EMHFW'HWHFWLRQ
[25] 0. Russakovsky, l Deng, H Su, l Krause, S. Satheesh, S. Ma, Z. Huang,
>@25XVVDNRYVN\-'HQJ+6X-.UDXVH66DWKHHVK60D=+XDQJ Beyond Classification
%H\RQG&ODVVLILFDWLRQ
A .DUSDWK\
$ Karpathy, $A .KRVOD
Khosla, 0
M. %HUQVWHLQ
Bernstein, $
A. &
C. %HUJ
Berg, DQG
and /
L. )HL)HL
Fei-Fei. [29] Yanghao Yuntao Chen, Naiyan Wang, and Zhaoxiang Zhang. Scale-
>@<DQJKDR/L<XQWDR&KHQ1DL\DQ:DQJDQG=KDR[LDQJ=KDQJ6FDOH
ImageNet /DUJH
,PDJH1HW Large 6FDOH
Scale 9LVXDO
Visual 5HFRJQLWLRQ
Recognition &KDOOHQJH
Challenge. ,QWHUQDWLRQDO
International aware WULGHQW QHWZRUNV
DZDUH networks IRU
for REMHFW
object GHWHFWLRQ
detection. DU;LY
arXiv SUHSULQW
preprint
Journal of Computer Vision (IJCV).
-RXUQDORI&RPSXWHU9LVLRQ 201 5 .
,-&9 arXiv: l901 .01 892, 2019.
DU;LY
·
[26] Hei Law Jia Deng. Corner:'Jet: Detecting Objects a s Paired Keypoints. In
>@+HL/DZā-LD'HQJ&RUQHU1HW'HWHFWLQJ2EMHFWVDV3DLUHG.H\SRLQWV,Q [30] Hanxiao /LX.DUHQ6LPRQ\DQDQG<LPLQJ<DQJ'DUWV
>@+DQ[LDR Liu, Karen Simonyan, and Yiming Yang. Darts : 'LIIHUHQWLDEOH
Differentiable
ECCV 2018.
(&&9 architecture search. arXiv preprint arXiv:1 806.09055, 201 8
DUFKLWHFWXUHVHDUFKDU;LYSUHSULQWDU;LY
208
208
Authorized licensed use limited to: Infineon Technologies AG. Downloaded on March 28,2025 at [Link] UTC from IEEE Xplore. Restrictions apply.