Vehicle Detection From UAV Imagery With Deep Learning: A Review
Abstract— Vehicle detection from unmanned aerial vehicle (UAV) imagery is one of the most important tasks in a large number of computer vision-based applications. This crucial task needs to be done with high accuracy and speed. However, it is very challenging due to many characteristics related to aerial images and the used hardware, such as different vehicle sizes, orientations, types, density, limited datasets, and inference speed. In recent years, many classical and deep-learning-based methods have been proposed in the literature to address these problems. Hand-engineered and shallow-learning-based techniques suffer from poor accuracy and poor generalization to other complex cases. Deep-learning-based vehicle detection algorithms achieve better results due to their powerful learning ability. In this article, we provide a review of vehicle detection from UAV imagery using deep learning techniques. We start by presenting the different types of deep learning architectures, such as convolutional neural networks, recurrent neural networks, autoencoders, and generative adversarial networks, and their contribution to improving the vehicle detection task. Then, we focus on investigating the different vehicle detection methods and datasets, and the encountered challenges all along with the suggested solutions. Finally, we summarize and compare the techniques used to improve vehicle detection from UAV-based images, which could be a useful aid to researchers and developers in selecting the most adequate method for their needs.

Index Terms— Autoencoders, computer vision, convolutional neural networks (CNNs), deep learning, generative adversarial networks (GANs), recurrent neural networks (RNNs), unmanned aerial vehicle (UAV), vehicle detection.

I. INTRODUCTION

Artificial intelligence techniques, especially deep-learning-based computer vision, make UAVs smarter and can achieve tasks that were impossible before their arrival. Computer vision is one of the most important topics in the field of autonomous robots, including UAVs. It is the science that gives UAVs the ability to analyze, process, and understand the content of digital images/videos to make the right decision at the right time. Detecting vehicles in UAV images and videos is one of the tasks that has attracted great attention from researchers. It plays an important role in a wide range of fields, such as transportation, military, search and rescue, and surveillance.

Traditional computer vision techniques are basically based on hand-engineered feature extractors or shallow-learning algorithms followed by classifiers. Recently, these techniques were surpassed by deep learning architectures. Many factors led to such fast evolution, including deep neural networks, a large amount of data, and GPUs for processing power. Among the most known deep neural network architectures, we find the convolutional neural network (CNN), the generative adversarial network (GAN), autoencoders, and the recurrent neural network (RNN), which are used in many computer vision tasks, including object detection. These architectures have contributed to improving computer vision abilities and performance. Today, most of the state-of-the-art object detection algorithms are based on deep learning techniques as their backbone. Vehicle detection from UAV imagery represents a big challenge due to many properties concerning aerial images, such as small size,
relatively old and did not present the detection task in detail. Another review is presented in [52]. The authors targeted only one dataset, which is VisDrone. It is a very large-scale dataset that consists of different objects, including vehicles. However, the authors presented the performance of different algorithms only on the VisDrone dataset and reported the results from the summary papers of the ECCV and ICCV challenge competitions. They presented six datasets for object detection from UAV-based images/videos, where only three of them targeted vehicle detection. Only CNN-based algorithms have been discussed in their paper. Ayalew [79] has focused on some R-CNN and YOLO variants with very few details. Moreover, Ayalew [79] did not provide explanations about lightweight detectors nor benchmark datasets.

To the best of our knowledge, this is the first review article that focuses on the vehicle detection task with most of the necessary materials, with the main objective to provide an overview of the different deep learning techniques used to improve vehicle detection from aerial images, especially from UAV systems, while determining where each technique can be used. Many techniques are covered, including multiscale bounding boxes, hard negative mining, and fusing feature maps from different backbone layers. Also, we take into account the lightweight detectors that are very convenient for small devices with limited computation power. Moreover, the most popular benchmark datasets for vehicle detection tasks are presented all along with different evaluation metrics. This study could not only help researchers and developers to understand the concepts of vehicle detection from UAV imagery but also assist them in choosing the appropriate architecture and dataset for their applications. The main contributions of this article are given as follows.

1) We introduce the most known and powerful deep learning architectures that contribute to improving vehicle detection performance in UAV-based images and videos, where different techniques are introduced to enhance small-size, dense, and oriented vehicle detection.
2) We describe the most used aerial image/video benchmark datasets and their properties while indicating which applications each dataset is best suited for.
3) We present the best available vehicle detection frameworks of the last years all along with their accuracy and speed evaluation. Also, lightweight detectors for small devices are presented.
4) We discuss various UAV-based vehicle detection challenges all along with the impact of different parameters on the detection accuracy and speed. The parameters include the image resolution, altitude, and view angle.

The remainder of this article is organized as follows. In Section II, different deep learning architectures are presented. Section III is devoted to the vehicle detection algorithms and the encountered challenges. Different techniques used to improve vehicle detection in UAV-based images/videos are presented in Section IV. Section V is dedicated to presenting some of the important benchmarks and evaluation metrics. Some discussions are presented in Section VI. We conclude this article in Section VII.

II. OVERVIEW OF DEEP LEARNING TECHNIQUES

In this section, we present an overview of the most relevant deep learning architectures that are developed to solve different problems in a large number of domains, such as computer vision, speech recognition, recommender systems, and natural language processing (NLP). Among these architectures, we find CNNs, autoencoders, GANs, RNNs, and long short-term memory (LSTM). All of these techniques are employed not only for object detection but also to enhance the detection performance.

A. Convolutional Neural Networks

CNN is one of the most popular and successful feedforward neural network algorithms for deep learning, which achieves state-of-the-art results in many problem areas, such as NLP, speech recognition, and recommender systems. However, it is mostly used to handle computer vision and image analysis problems, including image classification [7]–[10], object detection [11]–[14], and image segmentation [15]–[17]. Fig. 1 shows the typical CNN architecture.

Fig. 1. Typical CNN architecture.

Fig. 2. Feature visualization in different layers [18].

As shown in Fig. 2, CNNs learn low-level vehicle features in the initial layers (edges and corners), followed by more complex intermediate and high-level feature representations in deeper layers (wheels, windscreen, car doors, and cars), which are used for a classification task. In the last years, CNN architectures have achieved incredible performance on some computer vision tasks, including vehicle detection and classification [82], [83]. This is happening thanks to the combination of recent powerful computing systems (GPUs), software
TABLE I
PERFORMANCE COMPARISON OF DIFFERENT CNN ARCHITECTURES (CITATION TILL APRIL 20, 2020)
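To make the typical architecture of Fig. 1 and the feature hierarchy of Fig. 2 concrete, the following minimal PyTorch sketch (our illustration, not one of the architectures compared in Table I) stacks convolution, ReLU, and pooling blocks followed by a small classification head; the layer widths and the three-class vehicle head are illustrative assumptions.

```python
# Minimal sketch of a typical CNN backbone (Fig. 1): stacked convolution,
# nonlinearity, and pooling blocks followed by a classification head.
# Layer sizes and the 3-class vehicle head are illustrative assumptions.
import torch
import torch.nn as nn

class TinyVehicleCNN(nn.Module):
    def __init__(self, num_classes: int = 3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # early layers: edges, corners
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # mid layers: wheels, windscreens
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),  # deep layers: whole vehicles
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.features(x)                  # (N, 64, 1, 1) feature map
        return self.classifier(f.flatten(1))

if __name__ == "__main__":
    logits = TinyVehicleCNN()(torch.randn(1, 3, 224, 224))
    print(logits.shape)  # torch.Size([1, 3])
```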
C. Autoencoders

Data compression is a big topic in the computer vision field. The main role of data compression is to convert the input data into a smaller representation, which is then used to reconstruct the original input data. The autoencoder [46] is another type of artificial neural network that is based on unsupervised algorithms to perform this compression and reconstruct the input data after applying a series of operations. Basically, it is designed to reproduce its input at the output. Autoencoders are mainly composed of three components: an encoder, a bottleneck, and a decoder (see Fig. 5). The output of an autoencoder represents the reconstructed input with the minimum possible error. The encoder part can also serve as an interesting feature extractor, as shown in [47], where stacked denoising autoencoders (SDAEs) are used. There are various autoencoder types, such as the convolutional autoencoder (CAE) [48], the sparse autoencoder [49], the deep autoencoder [50], and the variational autoencoder [51]. The upsampling operation in the decoder part of an autoencoder can provide better accuracy in detecting small objects, as shown in [127], in which the proposed detector is based on an autoencoder-like architecture.

Autoencoders are only able to compress data similar to what they have been trained on. For example, an autoencoder that has been trained on human faces will not perform well on images of vehicles. Moreover, they are worse than classical compression techniques, such as JPEG and MPEG. However, they work really well on other tasks, such as image denoising [53]–[55], dimensionality reduction [56]–[58], reconstructing missing parts of an image [59], [60], and also colorizing images [61].
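As a concrete illustration of the encoder, bottleneck, and decoder components described above, the following is a minimal convolutional autoencoder sketch in PyTorch trained with a reconstruction loss; the layer sizes are illustrative assumptions, and this is not the SDAE [47] or any of the other cited variants.

```python
# Minimal encoder-bottleneck-decoder sketch of a convolutional autoencoder
# trained to reproduce its input; architecture sizes are illustrative assumptions.
import torch
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(          # compress the image
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 8, 3, stride=2, padding=1), nn.ReLU(inplace=True),  # bottleneck
        )
        self.decoder = nn.Sequential(          # upsample back to the input size
            nn.ConvTranspose2d(8, 16, 3, stride=2, padding=1, output_padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(16, 3, 3, stride=2, padding=1, output_padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = ConvAutoencoder()
x = torch.rand(4, 3, 128, 128)
loss = nn.functional.mse_loss(model(x), x)  # reconstruction objective
```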
D. Generative Adversarial Networks

Goodfellow et al. [62] proposed a new unsupervised deep neural network, called the GAN, in 2014. It is described as the most interesting idea of the last decade by Emmert-Streib et al. [63]. GANs have reached great success in generating new data instances that are similar to the ones in a given dataset. GANs consist of two networks, a generator and a discriminator, competing against each other (see Fig. 6). The generator network tries to generate new realistic samples from random noise (Gaussian distribution). The discriminator network is a regular neural-network-based classifier, which takes both the real data and the generated data as inputs and then decides whether the generated samples are considered real or fake samples. The output of the discriminator is the probability that the input images belong to the real dataset or not, where it assigns a probability near "1" to real images and a probability near "0" to fake images. Thus, the generator must produce samples that are as similar as possible to real ones to fool the discriminator.

Fig. 6. GAN architecture.

GAN architectures can enhance the poor representations of tiny objects to super-resolved ones in the generator part to fool the discriminator part, which could improve small vehicle detection performance from high-altitude UAV imagery [45]. However, the GAN architecture is usually used to generate new data samples, making it one of the best solutions for data augmentation.
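The adversarial game described above can be sketched compactly. The following PyTorch example is our own minimal illustration, not a published architecture; fully connected networks, a 100-dimensional noise vector, and a 64 × 64 image size are assumptions made for brevity.

```python
# Minimal GAN sketch: the generator maps Gaussian noise to fake images, the
# discriminator outputs a probability near 1 for real images and near 0 for fakes.
import torch
import torch.nn as nn

latent_dim = 100

generator = nn.Sequential(
    nn.Linear(latent_dim, 256), nn.ReLU(inplace=True),
    nn.Linear(256, 64 * 64 * 3), nn.Tanh(),           # 64x64 RGB sample, flattened
)

discriminator = nn.Sequential(
    nn.Linear(64 * 64 * 3, 256), nn.LeakyReLU(0.2, inplace=True),
    nn.Linear(256, 1), nn.Sigmoid(),                   # probability "real"
)

bce = nn.BCELoss()
real = torch.rand(8, 64 * 64 * 3) * 2 - 1              # stand-in for real images in [-1, 1]
noise = torch.randn(8, latent_dim)
fake = generator(noise)

# Discriminator tries to separate real from fake ...
d_loss = bce(discriminator(real), torch.ones(8, 1)) + \
         bce(discriminator(fake.detach()), torch.zeros(8, 1))
# ... while the generator tries to fool it.
g_loss = bce(discriminator(fake), torch.ones(8, 1))
```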
III. UAV-BASED VEHICLE DETECTION ALGORITHMS AND CHALLENGES

The continuously increasing number of vehicles has forced transport management agencies and researchers to develop new accurate methods for traffic monitoring, parking securing, accident detection in difficult areas, and parking lot occupation. One available solution is to use on-ground fixed sensors, such as radars and fixed cameras [64]–[66], which provide only a partial overview and miss a lot of information. However, aerial image sensors seem to be a better solution because they provide a larger overall overview of the areas of interest.

UAVs have emerged as a new image acquisition platform, which presents several advantages over satellites and airborne platforms for solving vehicle detection problems. This is due to their low cost, high flying flexibility, small size, and easier and faster data acquisition. Moreover, and most important of all, UAVs provide extremely high-resolution images encompassing abundant spatial and contextual information, which makes them the center of many new UAV image analysis applications. As UAV applications become widespread, implementing sensors, powerful processors, artificial intelligence, and computer vision techniques is becoming indispensable. The combination of deep learning and computer vision techniques allows UAVs to analyze, process, and understand the content of these images and videos, which enables them to navigate autonomously, detect and track objects, and even provide analytical feedback in real time. These technologies can provide a higher level of autonomy and make the UAVs smarter and more efficient in many tasks, including navigation in GPS-denied areas.

As a hotspot of deep learning and computer vision studies, vehicle detection from UAV-based images has attracted increasing attention in many studies in the last years. It plays a vital role in a large number of applications, such as road traffic information [67]–[70], tracking specific types of vehicles [71], [72], transportation monitoring [73], [74], intelligent parking [75], speed control [76], [77], and social distancing detection in the context of the COVID-19 pandemic.

A. Deep-Learning-Based Detectors

The development of different deep learning techniques, the availability of very large datasets, and the continuous improvement of processing power have led to impressive progress in the object detection field, especially for vehicle detection from UAV-based aerial images.

Recently, there has been significant progress in vehicle detection due to deep learning techniques, particularly the emergence of the CNN, which is an effective method that can improve detection performance. Most of the deep-learning-based vehicle detection research studies are based on CNN architectures as a backbone. Unlike traditional feature extractor techniques, the CNN has a strong ability to extract relevant features automatically. Moreover, many CNN-based algorithms for vehicle detection in UAV imagery have been proposed as end-to-end frameworks. These algorithms can be categorized into two main groups: two-stage object detectors and one-stage object detectors. Two-stage object detectors, also called region-based object detectors, are basically R-CNN [89] and its variants, including SPPnet [90], Fast R-CNN [91], Faster R-CNN [92], R-FCN [93], and Mask R-CNN [94]. They consist of two main stages: the first stage is responsible for generating candidate regions, while the second stage predicts the class that each region belongs to. Over the years, many region proposal techniques were proposed [95], [96], and it seems that the selective search algorithm [97] provides better results than the other nondeep-learning techniques. For that reason, the selective search algorithm was adopted in R-CNN and Fast R-CNN to extract particular region proposals, which may contain an object, with different scales and positions. However, the selective search algorithm is not a learnable algorithm, which can lead to bad region selection, making it very slow and less accurate. Although R-CNN and Fast R-CNN achieved impressive results over classical methods, they are still very slow due to the limits of the selective search algorithm. These two architectures are not often used anymore because of their lack of speed and accuracy.

For the above reasons, most of the studies on vehicle detection in aerial images using a region-based detector have relied on another improved version of R-CNN called Faster R-CNN. Instead of using the selective search algorithm for region proposal generation, the Faster R-CNN network applies a separate deep CNN to generate a predefined number of high-quality region proposals using the region proposal network (RPN). These proposals are then used by Fast R-CNN for the detection, which achieves remarkable detection performance. In order to predict regions of different aspect ratios and scales within an image, Faster R-CNN introduces the concept of anchor boxes to detect overlapping objects. These anchors can be denoted by (cx, cy, w, h), where "cx" and "cy" are the coordinates of the center point of the default box, and "w" and "h" are the width and height of the box.

Even with the impressive accuracy achieved by the region proposal framework, its architecture is complex and still noticeably very slow due to the candidate region generation module. Thus, it is not suitable for real-time object detection, particularly on mobile devices. For UAVs, it is indispensable to detect all objects in the visual field, such as vehicles, with high speed, so that the UAV can accomplish the required task in real time. One-stage detectors were proposed to improve vehicle detection speed. They are different from two-stage detectors: instead of dividing the detection framework into two parts, single-shot detectors fuse the recognition and detection steps into one deep neural network, which avoids spending too much time on candidate region proposal generation. Many one-stage detectors have been proposed over the last years to handle real-time object detection and can be implemented even on small devices. One-stage object detection algorithms mainly include You Only Look Once (YOLO), the Single-Shot Multibox Detector (SSD), and RetinaNet. In 2016, Redmon et al. [98] developed the first version of YOLO for real-time object detection. It consists of dividing the input image into a fixed number of grid cells. Each cell is considered as a proposal that may contain an object. However, it has two main drawbacks over the region-based methods: high localization error and low recall rate. Liu et al. [99] proposed another one-stage detector called SSD to overcome some limitations of YOLOv1, achieving better accuracy for real-time processing. It has the same principle as YOLO, dividing the input image into grid cells. Moreover, it generates multiple anchor boxes of different scales and aspect ratios. Afterward, Redmon and Farhadi [100], [101] proposed YOLOv2 and YOLOv3 in 2017 and 2018, respectively. These methods improve the detection accuracy further while maintaining detection speed. They adopted a more powerful CNN framework, which is Darknet. In addition, they also adopted a better anchor strategy, using k-means clustering on the training data instead of setting the anchors manually, to enhance detection accuracy. Lin et al. [102] developed RetinaNet, which adopts the FPN concept and the focal loss function to improve the detection accuracy of small-sized and dense objects. In 2020, Bochkovskiy et al. [103] proposed YOLOv4, enriched with many new features to provide the best tradeoff between speed and accuracy. These frameworks provide a remarkable tradeoff between detection accuracy and speed. However, all of them are developed for natural benchmark datasets that contain only one or few objects.
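The anchor strategy of YOLOv2/YOLOv3 mentioned above can be illustrated with a short sketch: ground-truth box shapes are clustered with k-means under an IoU-based distance so that the anchors match the vehicle sizes actually present in the training set. This is our own illustration; the synthetic box data and k = 5 are assumptions.

```python
# Sketch of the anchor-selection idea used by YOLOv2/YOLOv3-style detectors:
# cluster the (width, height) of the training boxes with k-means using an
# IoU-based assignment instead of setting anchors by hand.
import numpy as np

def iou_wh(box, anchors):
    """IoU between one (w, h) box and k anchors, both centered at the origin."""
    inter = np.minimum(box[0], anchors[:, 0]) * np.minimum(box[1], anchors[:, 1])
    union = box[0] * box[1] + anchors[:, 0] * anchors[:, 1] - inter
    return inter / union

def kmeans_anchors(boxes_wh, k=5, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    anchors = boxes_wh[rng.choice(len(boxes_wh), k, replace=False)]
    for _ in range(iters):
        # assign each ground-truth box to the anchor with the highest IoU
        assign = np.array([np.argmax(iou_wh(b, anchors)) for b in boxes_wh])
        for j in range(k):
            members = boxes_wh[assign == j]
            if len(members):
                anchors[j] = members.mean(axis=0)   # update cluster centers
    return anchors

# boxes_wh: (N, 2) array of ground-truth box widths/heights from a UAV dataset
boxes_wh = np.abs(np.random.randn(1000, 2)) * 30 + 10   # synthetic stand-in data
print(kmeans_anchors(boxes_wh))
```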
More deep-learning-based object detection algorithms are presented in [52]. The authors mentioned different algorithms for object detection on the VisDrone2018 and VisDrone2019 datasets, where they received 34 and 47 object detection algorithms in the VisDrone2018 and VisDrone2019 challenges, respectively. Most of the proposed algorithms are based on the aforementioned methods, such as YOLOv3, FPN, Faster R-CNN, and RetinaNet. Also, other approaches are proposed, such as HRDet+ [84], TridentNet [85], DBCL [86], DetNet [87], and CornerNet [88]. However, HAL-Retina-Net and DPNet [52] are the best methods on the VisDrone-DET 2018 dataset, achieving more than 30% AP.

Many other deep learning architectures, such as the GAN, the autoencoder, and the LSTM, are combined with CNN-based detectors to improve the detection performance. In order to handle the lack of UAV-based datasets for our targeted object (the vehicle), GANs are mostly used to increase the sizes of the datasets. Also, autoencoders can be used to generate new unseen images to increase the dataset size. Moreover, researchers can use autoencoder-like architectures to improve small vehicle detection, as presented in [127]. RNNs and LSTMs are mostly used to improve the detection performance in video sequences. Kompella and Kulkarni [144] proposed a 1-D RNN architecture called RVS to improve salient object detection in videos. Most of the time, RNNs/LSTMs are combined with different CNN architectures to achieve better performance on vehicle detection in video sequences, as shown in [42] and [150]. More details are presented in Section IV.

B. Vehicle Detection From UAV Imagery Challenges

UAVs can fly at different altitudes with different view angles, resulting in many challenges compared with vehicle detection in ordinary-view images and videos. Automated and accurate vehicle detection in UAV-based images remains a challenging problem due to many issues, such as, but not limited to, scale variations, different vehicle orientations and types, illumination variations, high density, appearance similarity of the vehicles with other object types, partial occlusions of vehicles, complex backgrounds, and image quality. Moreover, limited annotated training datasets and real-time detection are other problems that must be handled.

1) Small Object: High-resolution aerial images contain smaller objects, and a larger number of objects besides vehicles, than ordinary scene images, which causes a lack of information in feature extraction. This consequently increases the difficulty of localization and the number of missed detections.

2) Scale Diversity: The same object can appear at different sizes, which can be ambiguous and confusing for a UAV when it is captured at different altitudes. This makes the detection more challenging due to the different scales of the vehicles throughout the captured images/videos.

3) Vehicle Orientations and Types: The mobility of UAVs leads to many view angles of the vehicles: top view, side view, and front view. The captured images/videos consist of vehicles of various types in different orientations and shapes, which makes the detection more and more challenging.

4) Illumination Condition Variations: Light is an important factor in any classification/detection task, and it is one of the most difficult problems to solve. A reliable and accurate system cannot be achieved without taking this factor into account, hence the need for some image preprocessing to minimize the effects of lighting and illumination.

5) Background Complexity and Appearance Similarity: The large size of aerial images and the wide view angle produce a very complex and cluttered background with several objects in the same image, thus leading to confusion of the vehicle's appearance with other similar object types, such as trash bins and air conditioner units. This increases the difficulty of vehicle detection and classification.

6) Limited Datasets: The limited annotated datasets are another challenge facing vehicle detection from aerial images. The available annotated vehicle datasets in UAV imagery are very few, which makes the detectors less accurate, especially deep-learning-based ones, where we need a very large number of annotated images/videos to train them. Most of the available datasets are based on satellite images. For this reason, we find that most of the studies are based on their own collected UAV-based datasets.

7) Real-Time Issue: In order to manage road traffic or parking lots, UAVs must detect and classify vehicles in a cluttered environment in real time. However, the UAV platform has limited onboard hardware resources to execute such complex tasks and handle heavy data, which makes real-time detection and classification one of the biggest challenges. Also, aerial images are larger than ordinary scene images, which decreases the detection speed. Therefore, accurate and fast vehicle detection algorithms are indispensable for low-power and low-processing UAV platforms to accomplish their mission at the right time.

In order to overcome these challenges, there is a critical need to develop robust algorithms that are able to enhance the detection performance from aerial images/videos.

IV. TECHNIQUES TO IMPROVE VEHICLE DETECTION FROM UAV-BASED IMAGES AND VIDEOS

Deep-learning-based vehicle detection is not directly transferable to high-altitude aerial imagery, which contains a large number of smaller and randomly located objects. To handle these challenges, many approaches based on improved versions of region-based and one-stage detector algorithms have been developed. In the last years, many techniques have been proposed to improve vehicle detection from aerial views. Among the adopted techniques to extend CNN-based detectors to UAV imagery, we find redesigning the anchor box properties and the feature extractor structure, adopting data augmentation techniques, and combining different deep learning approaches. In the following, we describe the most effective techniques used to improve the performance of vehicle detection in UAV-based images.

A. Training Strategy and Parameter Adaptation

To detect vehicles from UAV-based images, researchers [78], [81], [96], [104]–[108] adopted the original detector frameworks. In order to show how accurate these architectures are, they just trained them on low-altitude and/or top-view UAV imagery while adapting some parameters to handle multiscale vehicle detection, including the number of layers, the anchor box scales, the loss function, and the input image size. Training the original detectors on low-altitude images
achieved remarkable results. However, a big drawback of such architectures is their lack of small vehicle detection from very high altitudes.

Many deep-learning-based detector studies show the impact of the choice of framework settings on vehicle detection from UAV images. The anchor box is one of the most important parameters; it is set with different scales and aspect ratios in a way that fits various sizes and shapes of vehicles. Larger anchors are applied to detect large vehicles, while smaller ones detect tiny vehicles. Thus, changing the anchor sizes to fit the used dataset is a good idea to improve multisize vehicle detection in UAV images. Combining the anchor box concept with the right framework parameters and a suitable dataset can achieve remarkable results.

Hsieh et al. [109] showed the effect of anchor box sizes and some other parameters on the performance of vehicle detection in UAV images. Tang et al. [110] and Shen et al. [111] showed the impact of anchor scales and aspect ratios on small vehicle detection accuracy. Xu et al. [81] demonstrated that Faster R-CNN provides good results on vehicle detection from low-altitude UAV-based images. However, the authors did not extend the detection to higher altitudes and multiple types of vehicles, where the feature extraction and detection become more difficult. In the same way, Ammar et al. [104] showed the impact of the chosen dataset on Faster R-CNN and YOLOv3 performance while using different hyperparameters, including the input size, feature extractor, and score threshold. Sommer et al. [96] investigated the impact of the adapted RPN and Fast R-CNN on small object detection. They showed the effect of the chosen CNN architecture and the network parameters (number of layers, number and size of filters, and dropout) on small object detection accuracy. Radovic et al. [105] showed that a trained YOLO, with a suitable dataset and parameter settings, is able to detect vehicles and aircraft from low-altitude and top-view UAV images. They trained YOLO on a dataset that consists of satellite and UAV images. Tang et al. [107] trained the original SSD, YOLOv1, and YOLOv2 structures on UAV imagery to detect vehicles. YOLOv2 had the best results in detection accuracy and speed, achieving 77.12% and 0.048 s, respectively, against (67.99% and 0.056 s) for YOLOv1 and (72.95% and 0.055 s) for the SSD detector.

The ordinary bounding box contains some background features in the case of vehicles in multiple orientations. Moreover, it has difficulties separating vehicle targets effectively. In order to overcome these issues, oriented bounding boxes can improve vehicle detection and counting in dense and sparse UAV scenes. For this reason, Tang et al. [112], Guo et al. [113], and Li et al. [114] adopted rotated bounding boxes. The loss function is another important parameter to improve vehicle detection in UAV images. Many studies [115]–[118] have shown the impact of the chosen loss function on the detector performance. Some researchers used the focal loss function instead of the cross-entropy function in the training process to handle small-size vehicle detection, because it concentrates on hard features, which makes it a good solution that can improve the performance of small vehicle detection [115], [118]. A more advanced training strategy is adopted in [119] and [120], which trains the model alternately with the cross-entropy loss (which focuses much on easy examples during training) and the focal loss (which focuses on hard examples during training) for more discriminative feature extraction; thus, more improvements are provided.

Most of these techniques have remarkable results on top-view and low-altitude UAV images. However, they have poor performance when the UAV provides images/videos from different altitudes, where the vehicle size varies from a few pixels (e.g., 10 × 20) to medium and larger sizes. This is because the UAV-based image has different characteristics than natural on-ground, top-view, and low-altitude scenes.
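The focal loss mentioned above can be written in a few lines. The following sketch is our own illustration of the standard binary focal loss; the values gamma = 2 and alpha = 0.25 are commonly used defaults, not values taken from the cited studies.

```python
# Minimal sketch of the focal loss idea: down-weight easy examples by a
# (1 - p_t)^gamma factor so training focuses on hard ones.
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Binary focal loss. logits, targets: tensors of the same shape."""
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)            # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

logits = torch.randn(16, 1)
targets = torch.randint(0, 2, (16, 1)).float()
print(focal_loss(logits, targets))
```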
B. Using Feature Maps From Different Layers

Most of the previous CNN-based object detection techniques use a single feature map, which is the last convolutional layer output, to detect vehicles in an image. A deep CNN that has a large number of convolutional and pooling layers produces feature maps, from its deeper layers, with a receptive field larger than that of shallower ones. Consequently, deep layers produce low spatial resolution with strong semantics, which makes them more suitable for large vehicle detection, while shallower layers are semantically weaker but provide high feature map resolution, which is more suitable for small-size vehicle detection. Therefore, deep layers in CNN architectures provide a high recall rate with poor localization accuracy, while shallower layers do the opposite [121].

Many researchers have proposed to use shallower CNN architectures or feature maps from earlier layers to enhance small-size vehicle detection. Wang et al. [122] showed the impact of the chosen backbone architecture on Faster R-CNN accuracy. Sommer et al. [123] replaced VGG-16, which is the main CNN backbone of Faster R-CNN, with another CNN architecture with fewer layers that is more suitable to handle small-size vehicle detection. Sommer et al. [96] proposed optimized versions of the Fast and Faster R-CNN detectors. High-resolution feature maps from earlier layers of the original VGG architecture were used, resulting in a significant improvement of Faster R-CNN performance on small vehicle detection in UAV images. Also, they replaced the original VGG network with another network inspired by the network in [124] to show the effect of shallow networks on the detector performance. However, only one vehicle category was considered in their work. The same authors extended the adapted Faster R-CNN for multicategory vehicle detection [123], but, even in this work, the authors were limited to only two vehicle categories due to the limitation of annotated object classes. Moreover, they showed the impact of data augmentation on improving detection accuracy. However, even with these improvements for small-size vehicle detection in UAV images, using feature maps provided by shallower layers and shallow CNN architectures results in a high number of false detections due to the apparent similarity of some objects with vehicles.

Most of the recent studies on object detection use only one scale of features, usually the last-layer feature map, to enhance the tradeoff between detection speed and accuracy.
However, the techniques used in these studies lack the ability to detect tiny objects. Quite recently, considerable attention has been paid to fusing feature maps from different layers. Many techniques have been gaining importance in recent years, proposing to combine hierarchical feature maps from shallow and deeper layers. Many researchers demonstrate that combining feature maps from different layers provides more suitable results for multiscale vehicle detection, especially for small-size vehicle detection. Sommer et al. [125] extended Faster R-CNN to detect small vehicles, where they applied a deconvolutional layer on a deeper layer to upsample low-dimensional feature maps. Then, the upsampled feature maps are combined with feature maps from shallower layers to detect small-size vehicles from aerial images. Acatay et al. [126] presented an extended YOLOv2, called DYOLO, which achieved remarkable accuracy by combining the upsampled feature maps of deep layers (obtained by a deconvolutional module) with those of a shallow layer. Tayara et al. [127] proposed the fully convolutional regression network (FCRN), which is a detector based on an autoencoder-like architecture to detect small vehicles in UAV images. Tang et al. [110] proposed a modified Faster R-CNN version to overcome the limits of the RPN performance on small object localization and the differentiation between vehicles and complex background features in the classifier. They replaced the original RPN with a Hyper-RPN (HRPN), which is based on the ZFNet architecture. It combines feature maps from the last three convolutional layers while using a cascade of boosted classifiers to improve the classification accuracy, reducing false detections by mining hard negative examples. Also, Tang et al. [128] and Deng et al. [129] improved Faster R-CNN in the same way as [110], combining the feature maps from the last three convolutional layers of the ZFNet architecture to generate a hyper feature map while adopting different anchor scales and aspect ratios. Xie et al. [120] proposed a detector based on the RefineDet [130] architecture adapted to detect small vehicles in aerial images by combining weak and strong semantic features.

The feature pyramid network (FPN) is one of the most recommended techniques to enhance multiscale vehicle detection; it was first introduced by Lin et al. [131]. It aims to combine the low-level features of shallow layers and the high-level features of deep layers to produce a set of new feature maps with different spatial resolutions, which gives us the possibility to detect multiscale vehicles, especially in UAV scenes. Several publications have appeared in recent years documenting the improvements brought by adopting the FPN architecture [115], [122], [132], [133]. Li et al. [114] showed the impact of FPN by comparing different backbone architectures, where they achieved better results adopting ResNet-101 with FPN. Wang et al. [122] studied vehicle detection from UAV images and showed that FPN improves detection in very challenging scenes, including vehicles in shadow and/or occluded by trees and buildings.
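The fusion of deep and shallow feature maps discussed above (deconvolution/upsampling-based merging and FPN) can be sketched as follows. This is our own minimal two-level illustration with assumed channel sizes, not the exact FPN of [131] or any of the cited detectors.

```python
# Minimal sketch of the top-down fusion used by FPN-like approaches: a deep,
# semantically strong but low-resolution map is upsampled and merged with a
# shallower, higher-resolution map so that small vehicles remain visible.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoLevelFPN(nn.Module):
    def __init__(self, c_shallow=256, c_deep=512, c_out=128):
        super().__init__()
        self.lat_shallow = nn.Conv2d(c_shallow, c_out, 1)   # 1x1 lateral connections
        self.lat_deep = nn.Conv2d(c_deep, c_out, 1)
        self.smooth = nn.Conv2d(c_out, c_out, 3, padding=1)

    def forward(self, feat_shallow, feat_deep):
        top = self.lat_deep(feat_deep)
        up = F.interpolate(top, size=feat_shallow.shape[-2:], mode="nearest")
        fused = self.lat_shallow(feat_shallow) + up          # element-wise merge
        return self.smooth(fused)                            # high-res, strong semantics

fpn = TwoLevelFPN()
p_small = fpn(torch.randn(1, 256, 64, 64), torch.randn(1, 512, 32, 32))
print(p_small.shape)  # torch.Size([1, 128, 64, 64])
```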
C. Training Data and Data Augmentation

The training process does not work the same way as it does with humans, where a three- or four-year-old child learns from very little data. Deep-learning-based methods always require large amounts of training data to perform very well. Thus, the performance of the model depends on the dataset's size, type, and quality, as shown in [78], [106], [108], and [134], where the authors showed the impact of the chosen dataset on the performance of the algorithm. Generally, more data improve the model accuracy better than complex algorithms.

Most of the well-annotated datasets are those of natural scene images, such as ImageNet [135], PASCAL VOC [136], and MS COCO [137]. Unfortunately, the available datasets that aim at detecting vehicles from aerial images are very limited, which makes this one of the major problems for vehicle detection from UAV imagery. Moreover, most of the available datasets are based on satellites, airborne platforms, and fixed cameras. In order to provide the necessary datasets to train a model on UAV imagery, we are forced to use non-UAV-based images all along with other methods. One of the most used methods is to collect our own datasets by capturing thousands of images through UAV platforms [75], [78], [80], [81], [138], [139], where the annotation is done manually. However, this method is very exhausting and time-consuming. Another solution is to use an available dataset [96], [106], [140], but most of these datasets are specialized for only one task. For these reasons, we find ourselves forced to combine collected and available datasets, as in [107] and [129], or to use some data augmentation techniques to increase the number of training images to make the detector more efficient.

There are two approaches for data augmentation: classical techniques and deep-learning-based techniques. Using these data augmentation techniques in the right way, we can improve the performance of vehicle detection algorithms. For the classical data augmentation methods, we just apply some basic image manipulations, such as flipping, rotations, cropping, scaling, color space transformation, adding some noise, and more [117], [123], [132], [141], [142]. However, deep-learning-based data augmentation is a more advanced technique, which is based generally on neural style transfer, autoencoders, and GANs to produce new unseen samples [143]–[145]. Many data augmentation techniques are presented very well in [146].

GAN is the most used deep-learning-based architecture for data augmentation due to its high ability to generate fake images that are very similar to real ones. Various GAN-based architectures have been proposed over the past few years. Some researchers benefit from the GAN architecture to improve the performance of deep learning object detectors. Shen et al. [143] showed how a multicondition constrained GAN (MCGAN) improved vehicle detection from aerial images. Chen et al. [147] proposed a data augmentation framework based on classification-oriented super-resolution GANs (CSRGANs) combined with a flexible object proposal generation to detect small-scale vehicles in UAV-based images. In order to generate new realistic labeled remote sensing vehicle images, Zheng et al. [148] proposed another data augmentation approach called vehicle synthesis GANs (VS-GANs).

Before doing data augmentation, we must consider the application that we want to develop. For example, we cannot
do flipping for digit recognition because it would be confusing to distinguish between 9 and 6. Moreover, we cannot generate new unreal images in the healthcare field because this gives false and undesirable results.
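The classical augmentation operations listed above can be composed in a few lines, for example with torchvision; the following sketch is our own illustration with assumed parameter values, and flips or rotations should only be used when they are safe for the target application, as noted above.

```python
# Minimal sketch of classical data augmentation with torchvision: flips,
# rotations, cropping/scaling, color-space changes, and mild noise.
# Note: for detection, the bounding boxes must be transformed consistently
# (e.g., with a box-aware augmentation pipeline), not only the pixels.
import torch
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=15),
    transforms.RandomResizedCrop(size=512, scale=(0.8, 1.0)),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.ToTensor(),
    transforms.Lambda(lambda t: (t + 0.01 * torch.randn_like(t)).clamp(0, 1)),  # mild noise
])

# usage: augmented = augment(pil_image)  # applied on the fly during training
```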
D. Other Deep-Learning-Based Techniques to Improve Vehicle Detection Performance From UAV Videos

In video sequences, CNN-based detectors have some drawbacks, such as classifying the same detected object into different classes between consecutive frames. Many studies have shown that recurrent networks achieve better results in the field of object detection from videos. In addition, other studies have tried to combine CNN and RNN characteristics to improve detection accuracy. The RNN and its variants, including the LSTM, show good performance on object tracking tasks in video sequences. Thus, their characteristics help to improve the detector performance by predicting the location of the vehicle in the next frames, resulting in correct classifications between successive frames. Lakhal et al. [149] proposed the CNR framework that combines CNN and LSTM architectures to improve vehicle classification performance from remote sensing images. Ning et al. [150] developed the ROLO architecture, stacking LSTM layers on top of the YOLOv1 detector to improve object detection accuracy, including for vehicles. Lu et al. [42] proposed a new approach combining a CNN-based detector (SSD) with an associated LSTM network to improve vehicle detection accuracy. Zhang et al. [151] combine fully convolutional networks (FCNs) and LSTM to estimate the number of vehicles in dense scenes from low-resolution videos captured through city cameras.
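The CNN + LSTM combination described above can be sketched as follows: per-frame CNN features are fed to an LSTM so that class predictions remain consistent across consecutive frames. This is our own minimal illustration, not the ROLO [150] or SSD + LSTM [42] architectures; feature and hidden sizes are assumptions.

```python
# Minimal sketch of combining a per-frame CNN feature extractor with an LSTM
# temporal model to stabilize per-frame predictions across a video clip.
import torch
import torch.nn as nn

class CNNLSTMClassifier(nn.Module):
    def __init__(self, num_classes=3, feat_dim=64, hidden=128):
        super().__init__()
        self.cnn = nn.Sequential(                       # per-frame feature extractor
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, feat_dim, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)  # temporal model
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, clip):                            # clip: (N, T, 3, H, W)
        n, t = clip.shape[:2]
        feats = self.cnn(clip.flatten(0, 1)).flatten(1) # (N*T, feat_dim)
        seq, _ = self.lstm(feats.view(n, t, -1))        # (N, T, hidden)
        return self.head(seq)                           # per-frame class scores

scores = CNNLSTMClassifier()(torch.randn(2, 8, 3, 128, 128))
print(scores.shape)  # torch.Size([2, 8, 3])
```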
E. Lightweight Detectors for Small Devices

The visual data produced by UAVs need a strong combination of reliable software and powerful hardware to realize complex applications. Most of today's deep-learning-based vehicle detection algorithms are executed on sophisticated GPUs, which are much faster than GPPs and CPUs. The heavy operations of detection algorithms also lead to huge energy consumption and large storage. Real-time vehicle detection from UAVs has very strict requirements, such as huge computing requirements, low latency, power efficiency, and security.

UAV data processing can be divided into two main groups: off-board processing and onboard processing. For the first one, the UAVs collect data via their sensors and send them to powerful on-ground workstations or to the cloud. This gives the edge device, the UAV in our case, the ability to save energy by offloading intensive compute operations. Even when we use workstations equipped with powerful GPUs, it is still very hard to process some tasks in real time, while time is very crucial for any autonomous machine, including UAVs, cars, and robots. Maybe this issue will be solved with the advance of the fourth industrial revolution, 5G, and even quantum computing. Other problems with off-board data processing are the security issue and denied-signal areas, where the transmitted data could be hacked or cannot be transmitted to the processing unit. Moreover, the wireless transmission takes additional time for data sending, which adds to the latency of the system.

In the second group, all the calculations are done inside the drone itself, which is also called edge computing. However, heavy detection algorithms lack computing power and energy management on small and low-power devices, while they are crucial for deep-learning-based vehicle detection. Much research has been done to solve these problems, and several publications have appeared in recent years documenting lightweight versions of deep-learning-based detectors that improve the detection speed while keeping a competitive accuracy.

Ringwald et al. [152] proposed a powerful lightweight vehicle detector (UAV-Net), which is implemented on an Nvidia Jetson TX2, achieving an average precision (AP) of 97.2% and an inference speed of 15.9 frames per second (fps). UAV-Net is based on the SSD architecture, replacing VGG-16 with ZynqNet [153], and is adapted to the unique characteristics of UAV imagery. Shen et al. [143] proposed a modified version of the Faster R-CNN architecture, which is based on a lightweight deep CNN feature extractor (LD-CNN) and an MCGAN to improve vehicle detection from aerial images. The LD-CNN achieved better performance than other deep CNN and lightweight CNN architectures (VGG-16, ResNet-50, DenseNet-121, MobileNet, and ShuffleNet-v2) while reducing the model size and the computation cost. Kyrkou et al. [154] developed four single-shot detectors, including SmallYoloV3, TinyYoloVoc, TinyYoloNet, and DroNet. They are based on Tiny-YOLO architectures while applying different filter sizes and numbers, numbers of layers, and input image sizes. They proved that these architectures work efficiently on lightweight embedded systems, achieving good results even on a Raspberry Pi 3. Azimi [155] proposed a modified version of the SSD detector, called ShuffleDet, which uses the ShuffleNet architecture as a backbone network. The author showed that ShuffleDet can be executed in real time, achieving 14 fps on an Nvidia Jetson TX2 while keeping a competitive accuracy.
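A building block behind several of the lightweight backbones mentioned above (MobileNet- and ShuffleNet-style networks) is the depthwise-separable convolution, sketched below; this is our own illustration with assumed channel sizes, not the exact blocks used by the cited detectors.

```python
# Minimal sketch of a depthwise-separable convolution block: a per-channel
# (depthwise) convolution followed by a 1x1 (pointwise) convolution, which
# greatly reduces parameters and FLOPs compared with a standard convolution.
import torch
import torch.nn as nn

def depthwise_separable(c_in, c_out, stride=1):
    return nn.Sequential(
        nn.Conv2d(c_in, c_in, 3, stride=stride, padding=1, groups=c_in, bias=False),  # depthwise
        nn.BatchNorm2d(c_in), nn.ReLU(inplace=True),
        nn.Conv2d(c_in, c_out, 1, bias=False),                                        # pointwise
        nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
    )

block = depthwise_separable(32, 64, stride=2)
standard = nn.Conv2d(32, 64, 3, stride=2, padding=1)
print(sum(p.numel() for p in block.parameters()),      # far fewer weights ...
      sum(p.numel() for p in standard.parameters()))   # ... than a standard 3x3 conv
```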
TABLE II
AERIAL IMAGES/VIDEOS DATASETS FOR VEHICLE DETECTION TASK (CITATION TILL APRIL 20, 2020)
V. BENCHMARK DATASETS AND EVALUATION METRICS

The dataset size, type, and quality play a vital role in the development of deep-learning-based vehicle detection algorithms for aerial images. There is a lack of well-annotated aerial images, and most of them are based on satellite and/or airborne imagery. The available UAV-based datasets are very limited, although large-scale and challenging aerial vehicle detection datasets are crucial to develop this field. In this section, we present some common benchmark datasets and the various evaluation metrics used to evaluate deep-learning-based vehicle detection algorithms.

A. Datasets

1) DLR 3K Munich Vehicle Aerial Image Dataset: The DLR 3K Munich Vehicle Aerial Image Dataset [5] is one of the most used and challenging datasets for small-size vehicle detection in aerial images, annotated with oriented bounding boxes. The DLR 3K dataset consists of 20 top-view aerial images with a resolution of (5616 × 3744) pixels and a ground sampling distance (GSD) of around 13 cm/pixel. It was acquired at an altitude of around 1000 m through an airplane over Munich (Germany) urban and residential areas using a Canon EOS 1Ds Mark III camera. It contains seven vehicle categories, but only two classes are generally used: cars with 9300 instances and trucks with 160 instances. The other categories are neglected because of the very small number of annotated instances. Due to the limited number and the very large size of the dataset images, all the researchers divide these images into groups of smaller subimages. Moreover, data augmentation techniques are used to increase the small number of truck instances [123]. The DLR 3K dataset is considered one of the most important datasets for developing and evaluating different object detection algorithms, as shown in Tables II and III.

2) Vehicle Detection in Aerial Imagery (VEDAI) Dataset: Razakarivony and Jurie [156] created another well-annotated dataset, called the VEDAI dataset, by cropping the very large satellite images provided by the Utah AGRC database [157] into more than 1200 smaller images with two different resolutions, VEDAI-512 and VEDAI-1024, and a GSD of around 25 and 12.5 cm/pixel, respectively. VEDAI-512 is just the downscaled version of VEDAI-1024, which makes the target vehicles smaller and more challenging. This dataset consists of nine different classes of sparse small vehicles all along with various backgrounds and confusing objects, captured from the same distance to the ground and with no oblique views. VEDAI is basically used to train and test small-size vehicle detection algorithms. Many state-of-the-art studies are based on the VEDAI dataset as a baseline for object detection algorithm development (see Tables II and III).

3) DOTA Dataset: The Object deTection in Aerial images (DOTA) dataset was created recently by Xia et al. [158]; it contains 15 object categories, including vehicles of different scales and orientations. Moreover, each object is annotated manually by experts with an oriented bounding box. It consists of 2806 aerial images captured through different platforms in multiple cities. Thus, the size of each image varies between (800 × 800) and (4000 × 4000) pixels with multiple GSDs. The DOTA dataset provides a good balance between small- and middle-size objects, which makes it very similar to real-world scenes.

4) COWC Dataset: Mundhenk et al. [159] created the Cars Overhead with Context (COWC) dataset, which consists of 53 TIFF images, from six various locations, with different resolutions varying from (2000 × 2000) to (19 000 × 19 000) pixels, and a GSD of around 15 cm/pixel. Instead of labeling the images with bounding boxes, the authors followed another
style of annotation, where they used the center pixel point of the vehicles. The COWC dataset can be used for vehicle detection and counting tasks. However, it is a very challenging dataset due to the small size of cars, which varies between 24 and 48 pixels.

5) Stanford Drone Dataset: The SDD [3] is one of the most popular UAV-based datasets; it consists of around 60 annotated UAV video sequences with a resolution of (1400 × 1904) pixels, from eight unique scenes over the Stanford University campus. Each of these scenes is captured through a 3DR Solo UAV equipped with a 4k camera at an altitude of 80 m. These videos contain different object categories, including 11 200 pedestrians, 6400 bicyclists, 1300 cars, 300 skateboarders, 200 golf carts, and 100 buses.

6) CARPK and PUCPR+ Datasets: CARPK and PUCPR+ are publicly available datasets for vehicle detection tasks in different parking lots, which are published in the same paper [109]. CARPK is a large-scale UAV-based car parking dataset, which consists of around 90 000 cars in high-resolution images. It is captured through a UAV platform from four different parking lots at an altitude of around 40 m. However, the PUCPR+ dataset is captured through cameras installed at around 30 m on the top of a store building.

7) VisDrone 2018 Dataset: VisDrone 2018 [4] is a large-scale dataset that consists of 263 videos of around 179 000 frames and 10 209 static images with resolutions of (3840 × 2160) and (2000 × 1500) pixels, respectively. It contains ten object categories, including vehicles, pedestrians, and bicycles. All the videos and images are captured through different UAV platforms across many Chinese cities under different lighting and weather conditions. The VisDrone dataset targets four tasks, including object detection in images and videos, and single- and multiple-object tracking.

8) CyCAR Dataset: Kouris et al. [142] created the CyCAR dataset, which consists of high-resolution images captured through a UAV platform over Nicosia city (Cyprus). It consists of 450 images that contain more than 5000 annotated vehicles collected from low (20 m) to high (500 m) altitudes.

9) UAV123 Dataset: Mueller et al. [2] proposed, in their study, one of the largest low-altitude multiclass UAV-based datasets, called UAV123. It contains 123 video sequences with 110 000 fully annotated frames, including vehicles. The videos are captured through different UAV platforms with different resolutions, varying between (1280 × 720) and (3840 × 2160) pixels, from various altitudes varying from 5 to 25 m. It was basically created for the single-object-tracking task.

10) UAVDT Dataset: Du et al. [1] created a new dataset, called UAVDT, which consists of 100 videos with 80k UAV-based images, with a resolution of (1080 × 540) pixels. These images are captured through UAVs in complex scenarios from different altitudes varying from 10 to more than 70 m. All the images are annotated carefully with axis-aligned bounding boxes and some helpful attributes, including flying altitude, occlusion, vehicle category, weather condition, and camera view. This information makes the UAVDT dataset useful for three main computer vision tasks: vehicle detection, single-vehicle tracking, and multiple-vehicle tracking.

The choice of the appropriate dataset depends on the problem that we are trying to solve. Other datasets are summarized in Table II, and Fig. 7 highlights the differences between some benchmark datasets.

B. Evaluation Metrics

After developing vehicle detection algorithms, their performance should be evaluated. Nowadays, there are various metrics to evaluate the quality of object detection algorithms, including mean AP (mAP), Intersection over Union (IoU), recall rate, precision rate, F1-score, and inference speed. In this section, we present the most important evaluation metrics.
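Before the individual metrics are defined, the following small self-contained sketch (ours, not from the reviewed works) shows how the IoU of (1) and the precision, recall, and F1-score of (2)-(4) can be computed from boxes and from TP/FP/FN counts.

```python
# Small sketch of the evaluation metrics defined below: IoU between two
# axis-aligned boxes, then precision, recall, and F1-score from TP/FP/FN counts.
def iou(box_a, box_b):
    """Boxes as (x1, y1, x2, y2); returns area of overlap / area of union."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# A prediction counts as a true positive when its IoU with a ground-truth
# vehicle exceeds a threshold (0.5 is a common choice).
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))       # 25 / 175 ≈ 0.143
print(precision_recall_f1(tp=80, fp=20, fn=10))  # (0.8, 0.888..., 0.842...)
```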
while orange represents the union of the two bounding boxes 8) Frame Rate and Time/Image: One of the biggest chal-
lenges of object detection algorithms is to make them work
Area of overlap
IoU = . (1) in real time. In order to express how fast the detector is,
Area of union we use one of the most known evaluation metrics, called
2) True Positive, False Positive, True Negative, and False Negative: The true positive (TP) value represents the number of vehicles correctly identified by the detector. A bounding box is considered a positive detection if its IoU with a ground-truth box is greater than a certain threshold; if there are several bounding boxes for the same detected object, only the one with the highest overlap ratio is counted as a positive detection. The false positive (FP) value is the number of vehicles incorrectly identified by the detector, that is, the number of bounding boxes whose IoU is less than the chosen threshold. The false negative (FN) value counts the undetected vehicles, and the true negative (TN) value represents correct rejections of the background.

3) Recall Rate (Sensitivity): The recall rate is an important metric to evaluate the detector performance. It represents how many of the ground-truth vehicles are actually detected, as follows:

\text{Recall} = \frac{TP}{TP + FN} = \frac{TP}{\text{all ground truths}}.    (2)
4) Precision Rate: The precision rate is another important evaluation metric. It determines how many of the predicted vehicles are truly vehicles over all detections, as follows:

\text{Precision} = \frac{TP}{TP + FP} = \frac{TP}{\text{all detections}}.    (3)

5) F1-Score: The F1-score combines the precision and recall rates into a single metric, defined as their harmonic mean:

\text{F1-score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}.    (4)
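The sketch below makes the link between the TP/FP/FN counts and (2)–(4) concrete: it greedily matches detections to ground-truth boxes at an IoU threshold of 0.5 and reuses the iou() helper defined above. The greedy, confidence-ordered matching is an illustrative assumption, not the exact protocol of any particular benchmark.

def detection_scores(detections, ground_truths, iou_thr=0.5):
    """Precision, recall, and F1 for one image.

    Both arguments are lists of (x1, y1, x2, y2) boxes; detections are
    assumed to be sorted by decreasing confidence.
    """
    matched_gt = set()
    tp = 0
    for det in detections:
        # Find the best still-unmatched ground-truth box for this detection.
        best_iou, best_idx = 0.0, None
        for idx, gt in enumerate(ground_truths):
            if idx in matched_gt:
                continue
            overlap = iou(det, gt)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        if best_iou >= iou_thr:
            tp += 1                    # valid detection
            matched_gt.add(best_idx)   # each ground truth counts only once
    fp = len(detections) - tp
    fn = len(ground_truths) - tp

    precision = tp / (tp + fp) if detections else 0.0
    recall = tp / (tp + fn) if ground_truths else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1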
6) Accuracy: The accuracy metric represents the fraction of correct decisions over the total number of cases [see (5)]. It does not work well in the case of imbalanced data; in such a case, the F1-score is more reliable than the accuracy metric:

\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} = \frac{\text{correct detections}}{\text{all cases}}.    (5)
7) Average Precision: The AP metric represents the area under the precision–recall curve; the higher the AP, the better the performance, and vice versa. The mAP is obtained by calculating the AP for each class and then averaging over all the classes. It was first adopted by the PASCAL VOC 2007 benchmark and is used to evaluate the accuracy of object detection algorithms. Most of the time, the mAP is reported as AP@0.5 or AP50. In other cases, such as the COCO competition, AP@[.50:.95] (also written AP@[.50:0.05:.95]) denotes the same quantity averaged over multiple IoU thresholds, from 0.5 to 0.95 with a step size of 0.05.
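One common way to turn a precision–recall curve into an AP value is all-point interpolation, sketched below together with a COCO-style average over IoU thresholds. Here, ap_at_iou stands for any user-supplied function that evaluates AP at a given threshold; benchmarks differ in interpolation details, so the snippet is indicative only.

import numpy as np

def average_precision(recalls, precisions):
    """Area under the precision-recall curve (all-point interpolation)."""
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([0.0], precisions, [0.0]))
    # Make precision monotonically non-increasing from right to left.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum rectangle areas wherever recall increases.
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

def coco_style_ap(ap_at_iou):
    """Average AP over the IoU thresholds 0.50, 0.55, ..., 0.95."""
    thresholds = np.linspace(0.50, 0.95, 10)
    return float(np.mean([ap_at_iou(t) for t in thresholds]))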
8) Frame Rate and Time per Image: One of the biggest challenges of object detection algorithms is to make them work in real time. In order to express how fast a detector is, we use one of the best known evaluation metrics, the frame rate. The frame rate is the number of frames processed in 1 s, measured in frames per second (FPS). The speed of a detection algorithm depends on the hardware used (GPU and CPU), the video/image resolution, and the detector architecture; the higher the frame rate, the better the detector. Another frequently used evaluation metric is the time per image, which expresses how fast a detector can process an input image of a certain resolution.
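A rough way to measure the time per image and FPS of a detector is shown below; detect() stands for any detector callable, and real GPU benchmarking additionally requires warm-up runs and device synchronization, which are omitted here.

import time

def measure_speed(detect, images):
    """Return (seconds per image, frames per second) for a detector."""
    start = time.perf_counter()
    n = 0
    for img in images:
        detect(img)          # run inference on one already-loaded frame
        n += 1
    elapsed = time.perf_counter() - start
    time_per_image = elapsed / n
    return time_per_image, 1.0 / time_per_image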
9) Mean Absolute Error and Root Mean Squared Error: The mean absolute error (MAE) and the root mean squared error (RMSE) are two further metrics employed in the evaluation of object detection algorithms. MAE represents the average absolute difference between the true values f_i and the predicted values y_i [see (6)], while RMSE is the square root of the average squared difference between the true and predicted values [see (7)]:

\text{MAE} = \frac{1}{n}\sum_{i=1}^{n} |f_i - y_i|    (6)

\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (f_i - y_i)^2}.    (7)
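As a small worked example of (6) and (7), the following snippet computes MAE and RMSE from per-image vehicle counts; the numbers are made up for illustration.

import math

def mae_rmse(true_values, predicted_values):
    """Compute MAE and RMSE as defined in (6) and (7)."""
    n = len(true_values)
    errors = [t - p for t, p in zip(true_values, predicted_values)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    return mae, rmse

# Example: ground-truth versus predicted vehicle counts on three images.
print(mae_rmse([12, 7, 30], [10, 7, 33]))  # approximately (1.67, 2.08)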
Table III shows the performance evaluation of various vehicle detection algorithms on the most used datasets, along with the main backbone of each detector. Note that the listed performances are achieved on a particular dataset with its own characteristics (number of classes, number of instances per image, and resolution).

VI. DISCUSSION

In recent years, vehicle detection systems based on UAV aerial images/videos have gained high importance due to their versatile use in many studies. In this article, we presented several vehicle detection algorithms and the improvements that came with the different deep learning architectures. Since there are many challenges related to the type of hardware used, the characteristics of UAV-based images/videos, and the quality, type, and size of the dataset, the appropriate system must be chosen for the required task. Based on the challenges presented in Section III-B, there is a wide range of research directions that merit exploring. Table III shows various metrics related to the performance of vehicle detection algorithms on different datasets.

A. Impact of Altitude

Unlike natural scene images/videos, UAV-based images contain vehicles of different sizes, scales, and orientations, which increases the detection difficulty and makes the task more challenging. The flying altitude is an essential parameter that affects the sizes of vehicles. Flying at low and medium altitudes provides vehicle sizes very similar to those in natural scene images.
However, the higher the distance from the ground, the smaller the size of the vehicles. From higher altitudes, the detector confuses vehicles with other similar objects of the same shape, resulting in a large number of false alarms (FP). As shown in Tables II and III, as the altitude increases, the detection performance generally decreases, and vice versa. The resolution of the captured images and the chosen algorithm should also be taken into account, as shown in [96], [125], [152], and [165]. For example, Li et al. [114] showed that the altitude can affect the detector accuracy: they applied the same architecture (R3-Net) on two different datasets, DLR and VEDAI, achieving 87% and 64.8%, respectively. These datasets have similar image resolutions (1080 × 540) but were captured at different altitudes; DLR was captured at an altitude of 1000 m, whereas VEDAI was captured through satellites. Most vehicle detection algorithms perform very well on low-altitude images, but this performance decreases on images taken from very high altitudes. In many works, such as [78], [165], and [167], the authors showed that vehicle detection from lower altitudes achieves better accuracy than from higher altitudes. In Sections IV-A and IV-B, we presented the most powerful techniques to improve small vehicle detection in aerial images. The best results on small-sized vehicle detection are obtained by detection algorithms that use FPN architectures, narrow CNN architectures, or added deconvolution modules that increase the feature map resolution. Yang et al. [119] achieved the best AP of 88.8% on the Stanford dataset among the detectors based on ResNet-101 as a feature extractor by combining it with an FPN architecture. Similarly, Li et al. [114] and Gu et al. [115] showed the performance of FPN-based detectors on small vehicle detection on the VEDAI and DLR datasets. However, applying a deconvolution module can provide even better results, as shown by Sommer et al. [125], who achieved an AP varying between 95% and 98% on the VEDAI and DLR datasets. Moreover, Sommer et al. showed in their studies [96] and [123] the impact of shallow architectures on small vehicle detection performance, achieving an AP that varies between 82.5% and 95.2% on the VEDAI and DLR datasets (see Table III).

B. Impact of View Angles

Another property that differentiates UAVs from on-ground vehicles is their degrees of freedom, which affect the view angles (top view, side view, and front view) and the orientation of the observed vehicles. As shown in Table III, most of the detectors achieved very acceptable results on datasets with top-view scenes, such as the DLR and Stanford datasets, where the AP can reach around 98% [125] and 91% [119], respectively. However, most of them gave poor results when the used dataset consists of images/videos captured from different angles, such as UAVDT, where the achieved AP ranges from 21% [1] to 60% [165]. In addition, the diversity of view angles is not the only reason for the poor performance; many other parameters can cause the low performance of detectors, including dataset size and quality.

C. Impact of Image Resolution

Running the same detector on the same dataset [96], [125] at different resolutions (VEDAI-512 and VEDAI-1024) gives different results. The results presented in Table III show that high-resolution images/videos give better results in terms of accuracy, where the AP improves from 89.5% to 95.2% in [96] and from 95% to 97.8% in [125]. Also, as the resolution decreases, the vehicle size decreases, which leads to an increase in the number of FP detections. In addition, as the image resolution increases, the detection speed decreases.
D. Impact of Dataset Properties

As shown in Table II, each dataset has different properties, such as occlusion, shadow, illumination, resolution, and viewpoint variation. These properties have an impact on the performance of the detection algorithm. Table III shows that vehicle detection on datasets that combine several of these properties achieves lower accuracy than on datasets with fewer of them. As presented in Section IV-C, data augmentation techniques also have a significant impact on the detection performance. The number of images used to train the model and the scene complexity are other parameters that should be taken into consideration because they can affect vehicle detection performance. The dataset is usually divided into training and testing sets, whose sizes depend on the total number of images in the dataset. Amato et al. [108], Yang et al. [119], and Chen et al. [163] split the dataset into a training set of 80% and a testing set of 20%, while Amato et al. [108] split the PUCPR+ dataset into 70% and 30% for the training and testing sets, respectively.
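As an illustration of these split protocols, an 80/20 partition of an image list can be produced as follows; the file names are hypothetical, and scikit-learn is only one of several ways to perform the split.

from sklearn.model_selection import train_test_split

# Hypothetical list of annotated image paths.
image_paths = [f"frames/img_{i:05d}.jpg" for i in range(10000)]
train_paths, test_paths = train_test_split(image_paths, test_size=0.2, random_state=42)
print(len(train_paths), len(test_paths))  # 8000 2000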
E. Impact of Backbone Architecture

Another important parameter is the chosen CNN architecture, which has an impact on the detection performance in terms of both accuracy and speed. Table III shows that the backbone network architecture has an obvious impact on detection accuracy. Kyrkou et al. [154] showed the effect of changing the CNN architecture on the detection performance. Yang et al. [119] showed that the ResNet-101 architecture provides slightly better accuracy than VGG-16 in the case of the Faster R-CNN detector; in the same paper, they showed that VGG-16 provides better results than ResNet-101 in the case of an SSD detector. However, in both cases, VGG-16 provides a better inference speed than ResNet-101. Table III confirms that the selection of the backbone architecture has a direct impact on vehicle detection performance. As an example, in Section IV-B, we showed that FPN and shallow CNN architectures provide better results on small vehicle detection. Yang et al. [119] and Kouris et al. [142] obtained better accuracy just by changing the backbone architecture, but this can affect the detection speed: Yang et al. [119] achieved seven and two fps with Faster R-CNN using VGG-16 and ResNet-101, respectively, and 59 and 11.2 fps with SSD using VGG-16 and ResNet-101 as backbone architectures.
to do the perception and decision-making and take actions in real time, especially in the case of platforms based on small devices. Image/video resolution and backbone architecture have serious impacts on the inference speed. As shown in Table III, two-stage detectors achieved interesting accuracies in vehicle detection from UAV imagery. However, they lack inference speed, which makes them unsuitable for real-time vehicle detection even on powerful GPUs. One-stage detectors are a solution to achieve real-time vehicle detection while keeping a competitive accuracy, although most detectors reach real-time speed only on powerful GPUs. The SSD detector in [163] achieved 43 FPS while keeping a good accuracy of around 89% on the UAV123 dataset. A modified version of RefineDet [119] achieved 58 FPS and an accuracy of around 90% on an Nvidia GTX 1080 Ti GPU. In [152], the proposed UAV-Net, which is based on a lightweight CNN architecture, achieved competitive accuracy on three different datasets, 97.2%, 95.7%, and 26.2% on DLR, VEDAI, and UAVDT, respectively, while providing a remarkable speed even on a small device (Jetson TX2), reaching 15.9, 9.9, and 18.3 FPS on the DLR, VEDAI-1024, and UAVDT datasets, respectively. Moreover, Ringwald et al. [152] showed that, as the image resolution increases, the detection speed decreases. The resolutions of the cropped DLR 3K (936 × 624) and UAVDT (1024 × 540) images are almost the same, which gave approximate detection speeds of 194.1 and 214 fps on a TITAN X GPU, respectively. However, VEDAI-1024 (1024 × 1024) has a larger resolution that requires more processing time, resulting in a lower detection speed of 123.5 fps on the same GPU platform (TITAN X).

The different algorithms developed to improve the performance of vehicle detection from UAV imagery/videos (see Section IV) have their own advantages and drawbacks. One of them is changing the anchor sizes, which can be a good solution for small vehicle detection in UAV-based scenes. However, it cannot be generalized to multiscale vehicle detection when the UAV flies at different altitudes, as was the case when Cai et al. [167] applied it on the UAVDT dataset (see Table III). Setting a large number of anchors with different scales could minimize this flexibility issue, but it is still not sufficient to enhance the detection accuracy.

Shallow architectures and feature maps from early layers can be another solution to improve small vehicle detection, but they still suffer from a high number of false positives due to the resemblance of some objects to vehicles. These drawbacks can lead to low recall and precision rates, as shown in [96], [129], and [123] (see Table III). Fusing feature maps from different layers can be a very good solution to improve the recall rate. However, it is not the only factor behind a low recall rate, which can result from a combination of different parameters, such as the image resolution, the GSD, and the flight altitude. Overall, FPN-based architectures provide the best results for multiscale vehicle detection.
Two- and one-stage algorithms also have their own advantages and drawbacks. The two-stage detectors achieved a higher accuracy of more than 88% using the FPN detector on the Stanford dataset, compared to around 80% when applying YOLOv3 on the same dataset. However, YOLOv3 achieved 76 fps against six fps for the FPN detector. Recently, one-stage detectors have begun to compete with two-stage detectors in terms of accuracy, especially with the advance of the new YOLOv4, which achieved remarkable results in both accuracy and detection speed.

Most reliable detectors are unsuitable for small devices and have real-time detection issues. For this reason, lightweight models have been developed, which provide remarkable results concerning real-time detection, even on small devices. However, they still have some critical issues concerning accuracy. For example, ShuffleNet [155] and MobileNet [116] achieve a near-real-time detection speed of 14 fps on a Jetson TX2 but with low accuracies of 62.9% on the DLR dataset and 29.2% on VisDrone-2018, respectively.

In order to build an application, all of these parameters need to be taken into consideration according to the chosen use case, including the used platform, the targeted application, and the required detection speed. For example, the PUCPR+ or CARPK datasets should be selected for parking-related applications, lightweight architectures must be selected for real-time vehicle detection from autonomous UAVs or other small systems, and deeper and heavier algorithms could be more efficient for offline applications that demand more accuracy.

Object detection from UAV imagery is still an open research field, where the performance of the developed detectors evolves continuously. The vehicle detection performance depends on several parameters, including the complexity of the scene, the size of the vehicles, and the dataset quality and size, among others.

The choice of the right dataset is one of the most important steps in vehicle detection from UAV imagery. The number of images in a dataset can affect the performance of the detector, especially in complex environments. A large training set may improve the detection accuracy, but the model takes a longer time to train, whereas a model trained on a small dataset can suffer in terms of accuracy due to the lack of information about the targeted object. Similarly, the image resolution can affect the speed and accuracy of the detector: low-resolution images are processed faster but achieve poor accuracy due to the lack of valuable information, whereas high-resolution images provide more valuable information but take much longer to process. Also, the flight altitude is another fundamental parameter that can affect the detection accuracy due to the variation of vehicle size and scene complexity, since detecting vehicles from a high altitude is more difficult than from a low altitude. Several methods were proposed in the literature to solve such problems, from the choice of the image resolution [96] to the choice of the right architecture [114], [119], [125].

VII. CONCLUSION

The UAV is one of the indispensable systems of the new intelligent era. UAVs are being adopted in several sensitive areas and fields, including agriculture, search and rescue, military operations, and even the exploration of Mars.
Artificial intelligence and deep learning techniques play an important role in developing these systems and making them smarter to facilitate many operations. In this review article, we provided an overview of some powerful deep learning architectures, including CNNs, RNNs, autoencoders, and GANs. These architectures are considered the cornerstone of modern vehicle detection algorithms, including region-based detectors (Faster R-CNN and R-FCN) and one-stage detectors (SSD, YOLO, and RefineDet). We showed that deep-learning-based vehicle detection from UAV images achieves interesting results in many tasks. Moreover, we presented different benchmark datasets along with their characteristics to evaluate the developed models. With this review article, we intend to help researchers choose the appropriate architecture and dataset for their applications.

Future work should focus on improving lightweight detectors for real-time vehicle detection, which could be implemented on the UAV itself instead of sending the captured videos to an on-ground workstation or doing the processing in the cloud. Moreover, improving vehicle detection in dense environments and vehicle tracking are fundamental issues that should be addressed in the future.

REFERENCES

[15] B. Kayalibay, G. Jensen, and P. van der Smagt, “CNN-based segmentation of medical imaging data,” CoRR, vol. abs/1701.03056, pp. 1–24, Jan. 2017. [Online]. Available: http://arxiv.org/abs/1701.03056
[16] M. Vardhana, N. Arunkumar, S. Lasrado, E. Abdulhay, and G. Ramirez-Gonzalez, “Convolutional neural network for bio-medical image segmentation with hardware acceleration,” Cognit. Syst. Res., vol. 50, pp. 10–14, Aug. 2018, doi: 10.1016/j.cogsys.2018.03.005.
[17] Y. Tian, G. Yang, Z. Wang, E. Li, and Z. Liang, “Instance segmentation of apple flowers using the improved mask R-CNN model,” Biosyst. Eng., vol. 193, pp. 264–278, May 2020, doi: 10.1016/j.biosystemseng.2020.03.008.
[18] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in Proc. Eur. Conf. Comput. Vis. (ECCV), vol. 8689, 2014, pp. 818–833, doi: 10.1007/978-3-319-10590-1_53.
[19] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” Commun. ACM, vol. 1, pp. 1097–1105, May 2012, doi: 10.1145/3065386.
[20] C. Szegedy et al., “Going deeper with convolutions,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 1–9, doi: 10.1109/CVPR.2015.7298594.
[21] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in Proc. 3rd Int. Conf. Learn. Represent. (ICLR), 2015, pp. 1–14.
[22] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778, doi: 10.1109/CVPR.2016.90.
[23] A. G. Howard et al., “MobileNets: Efficient convolutional neural networks for mobile vision applications,” 2017, arXiv:1704.04861. [Online]. Available: http://arxiv.org/abs/1704.04861
[24] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “MobileNetV2: Inverted residuals and linear bottlenecks,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 4510–4520, doi: 10.1109/CVPR.2018.00474.
[1] D. Du et al., “The unmanned aerial vehicle benchmark: Object detec- pp. 4510–4520, doi: 10.1109/CVPR.2018.00474.
tion and tracking,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 370–386, [25] A. Howard et al., “Searching for MobileNetV3,” in Proc. IEEE/CVF
doi: 10.1007/978-3-030-01249-6_23. Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 1314–1324, doi:
[2] M. Mueller, N. Smith, and B. Ghanem, “A benchmark and simulator 10.1109/ICCV.2019.00140.
for UAV tracking,” in Proc. Eur. Conf. Comput. Vis., vol. 9905, 2016, [26] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and
pp. 445–461, doi: 10.1007/978-3-319-46448-0_27. K. Keutzer, “SqueezeNet: AlexNet-level accuracy with 50x fewer
[3] A. Robicquet, A. Sadeghian, A. Alahi, and S. Savarese, “Learning parameters and <0.5 MB model size,” CoRR, vol. abs/1602.07360,
social etiquette: Human trajectory understanding in crowded scenes,” pp. 1–13, Nov. 2016. [Online]. Available: http://arxiv.org/abs/1602.
in Proc. Eur. Conf. Comput. Vis., in Lecture Notes in Computer Science, 07360
vol. 9912, 2016, pp. 549–565, doi: 10.1007/978-3-319-46484-8_33. [27] X. Zhang, X. Zhou, M. Lin, and J. Sun, “ShuffleNet: An extremely
[4] P. Zhu, L. Wen, X. Bian, H. Ling, and Q. Hu, “Vision meets efficient convolutional neural network for mobile devices,” in
drones: A challenge,” 2018, arXiv:1804.07437. [Online]. Available: Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018,
http://arxiv.org/abs/1804.07437 pp. 6848–6856, doi: 10.1109/CVPR.2018.00716.
[28] N. Ma, X. Zhang, H. T. Zheng, and J. Sun, “ShuffleNet V2: Practical
[5] K. Liu and G. Mattyus, “Fast multiclass vehicle detection on aer-
guidelines for efficient CNN architecture design,” in Proc. Eur. Conf.
ial images,” IEEE Geosci. Remote Sens. Lett., vol. 12, no. 9,
Comput. Vis. (ECCV), 2018, pp. 116–131, doi: 10.1007/978-3-030-
pp. 1938–1942, Sep. 2015, doi: 10.1109/LGRS.2015.2439517.
01264-9_8.
[6] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based [29] R. J. Wang, X. Li, and C. X. Ling, “Pelee: A real-time object detection
learning applied to document recognition,” Proc. IEEE, vol. 86, no. 11, system on mobile devices,” in Proc. Adv. Neural Inf. Process. Syst.,
pp. 2278–2324, Nov. 1998, doi: 10.1109/5.726791. 2018, pp. 1963–1972.
[7] S. Mujawar, D. Kiran, and H. Ramasangu, “An efficient CNN archi- [30] J. Li, R. Zhao, H. Hu, and Y. Gong, “Improving RNN transducer mod-
tecture for image classification on FPGA accelerator,” in Proc. 2nd eling for end-to-end speech recognition,” in Proc. IEEE Autom. Speech
Int. Conf. Adv. Electron., Comput. Commun. (ICAECC), Feb. 2018, Recognit. Understand. Workshop (ASRU), Dec. 2019, pp. 114–121.
pp. 2018–2021, doi: 10.1109/ICAECC.2018.8479517. [31] A. Amberkar, P. Awasarmol, G. Deshmukh, and P. Dave, “Speech
[8] Y. Peng et al., “FB-CNN: Feature fusion-based bilinear CNN for recognition using recurrent neural networks,” in Proc. Int. Conf.
classification of fruit fly image,” IEEE Access, vol. 8, pp. 3987–3995, Current Trends Towards Converging Technol. (ICCTCT), Mar. 2018,
2020, doi: 10.1109/ACCESS.2019.2961767. pp. 2018–2021, doi: 10.1109/ICCTCT.2018.8551185.
[9] M. Blot, M. Cord, and N. Thome, “Max-min convolutional neural [32] H. Wang, H. Wang, and K. Xu, “Evolutionary recurrent neural net-
networks for image classification,” in Proc. IEEE Int. Conf. Image work for image captioning,” Neurocomputing, vol. 401, pp. 249–256,
Process. (ICIP), Sep. 2016, pp. 3678–3682. Aug. 2020, doi: 10.1016/j.neucom.2020.03.087.
[10] A. G. Howard, “Some improvements on deep convolutional neural [33] M. Wang, L. Song, X. Yang, and C. Luo, “A parallel-fusion RNN-
network based image classification,” in Proc. 2nd Int. Conf. Learn. LSTM architecture for image caption generation,” in Proc. IEEE
Represent. (ICLR), 2014, pp. 1–6. Int. Conf. Image Process. (ICIP), Sep. 2016, pp. 4448–4452, doi:
[11] J. Li, X. Liang, S. Shen, T. Xu, J. Feng, and S. Yan, “Scale-aware fast 10.1109/ICIP.2016.7533201.
R-CNN for pedestrian detection,” IEEE Trans. Multimedia, vol. 20, [34] Y. Li and J. Yang, “Hydrological time series prediction model based on
no. 4, pp. 985–996, Apr. 2018, doi: 10.1109/TMM.2017.2759508. attention-LSTM neural network,” in Proc. 2nd Int. Conf. Mach. Learn.
[12] T. Agrawal and S. Urolagin, “Multi-angle parking detection system Mach. Intell., Sep. 2019, pp. 21–25, doi: 10.1145/3366750.3366756.
using mask R-CNN,” in Proc. 2nd Int. Conf. Big Data Eng. Technol., [35] Y. Chen and K. Wang, “Prediction of satellite time series data
Jan. 2020, pp. 76–80, doi: 10.1145/3378904.3378914. based on long short term memory-autoregressive integrated mov-
[13] W. Zhang, S. Wang, S. Thachan, J. Chen, and Y. Qian, “Deconv R-CNN ing average model (LSTM-ARIMA),” in Proc. IEEE 4th Int.
for small object detection on remote sensing images,” in Proc. IEEE Conf. Signal Image Process. (ICSIP), Jul. 2019, pp. 308–312, doi:
Int. Geosci. Remote Sens. Symp. (IGARSS), Jul. 2018, pp. 2491–2494. 10.1109/SIPROCESS.2019.8868350.
[14] B. N. K. Sai and T. Sasikala, “Object detection and count of objects [36] X. Song et al., “Time-series well performance prediction based
in image using tensor flow object detection API,” in Proc. Int. Conf. on long short-term memory (LSTM) neural network model,”
Smart Syst. Inventive Technol. (ICSSIT), Nov. 2019, pp. 542–546, doi: J. Petroleum Sci. Eng., vol. 186, Mar. 2020, Art. no. 106682, doi:
10.1109/ICSSIT46314.2019.8987942. 10.1016/j.petrol.2019.106682.
[37] L. Yao and Y. Guan, “An improved LSTM structure for nat- [57] C. Hu, X. Hou, and Y. Lu, “Improving the architecture of an
ural language processing,” in Proc. IEEE Int. Conf. Saf. Pro- autoencoder for dimension reduction,” in Proc. IEEE 11th Int. Conf.
duce Informatization (IICSPI), Dec. 2018, pp. 565–569, doi: Ubiquitous Intell. Comput., IEEE 11th Int. Conf. Autonomic Trusted
10.1109/IICSPI.2018.8690387. Comput., IEEE 14th Int. Conf. Scalable Comput. Commun. Asso-
[38] J. Li, Y. Xu, and H. Shi, “Bidirectional LSTM with hierarchical ciated Workshops, Dec. 2014, pp. 855–858, doi: 10.1109/UIC-ATC-
attention for text classification,” in Proc. IEEE 4th Adv. Inf. Technol., ScalCom.2014.50.
Electron. Automat. Control Conf. (IAEAC), Dec. 2019, pp. 456–459, [58] J. Zabalza et al., “Novel segmented stacked autoencoder for effec-
doi: 10.1109/IAEAC47372.2019.8997969. tive dimensionality reduction and feature extraction in hyperspec-
[39] A. F. Ganai and F. Khursheed, “Predicting next word using RNN tral imaging,” Neurocomputing, vol. 185, pp. 1–10, Apr. 2016, doi:
and LSTM cells: Stastical language modeling,” in Proc. 5th Int. 10.1016/j.neucom.2015.11.044.
Conf. Image Inf. Process. (ICIIP), Nov. 2019, pp. 469–474, doi: [59] V. Kuppili, D. R. Edla, and A. Bablani, “Novel fitness function for
10.1109/iciip47207.2019.8985885. 3D image reconstruction using bat algorithm based autoencoder,” in
[40] C. Su, H. Huang, S. Shi, P. Jian, and X. Shi, “Neural machine Proc. 23rd Int. ACM Conf. 3D Web Technol., Jun. 2018, pp. 2–3, doi:
translation with Gumbel tree-LSTM based encoder,” J. Vis. Com- 10.1145/3208806.3211218.
mun. Image Represent., vol. 71, Aug. 2020, Art. no. 102811, doi: [60] C. C. Tan and C. Eswaran, “Reconstruction of handwritten digit
10.1016/j.jvcir.2020.102811. images using autoencoder neural networks,” in Proc. Can. Conf.
[41] C. Zhang and J. Kim, “Modeling long- and short-term tempo- Electr. Comput. Eng., May 2008, pp. 465–470, doi: 10.1109/
ral context for video object detection,” in Proc. IEEE Int. Conf. CCECE.2008.4564577.
Image Process. (ICIP), Sep. 2019, pp. 71–75, doi: 10.1109/ICIP.2019. [61] Z. Dong and W. Qu, “Infrared image colorization using an edge
8802920. aware auto encoder decoder with the multi-resolution fusion,” in
[42] Y. Lu, C. Lu, and C.-K. Tang, “Online video object detection using Proc. Chin. Automat. Congr. (CAC), Nov. 2019, pp. 1011–1016, doi:
association LSTM,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), 10.1109/CAC48633.2019.8996588.
Oct. 2017, pp. 2344–2352, doi: 10.1109/ICCV.2017.257. [62] I. J. Goodfellow et al., “Generative adversarial nets,” in Proc. 27th Int.
[43] A. Carrio, C. Sampedro, A. Rodriguez-Ramos, and P. Campoy, Conf. Neural Inf. Process. Syst., vol. 2, 2014, pp. 2672–2680.
“A review of deep learning methods and applications for unmanned [63] F. Emmert-Streib, Z. Yang, H. Feng, S. Tripathi, and M. Dehmer,
aerial vehicles,” J. Sensors, vol. 2017, pp. 1–13, Aug. 2017, doi: “An introductory review of deep learning for prediction models with
10.1155/2017/3296874. big data,” Frontiers Artif. Intell., vol. 3, pp. 1–23, Feb. 2020, doi:
[44] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” 10.3389/frai.2020.00004.
Neural Comput., vol. 9, no. 8, pp. 1735–1780, Nov. 1997, doi: [64] M. Bugdol, Z. Segiet, M. Kre˛cichwost, and P. Kasperek, “Vehicle
10.1162/neco.1997.9.8.1735. detection system using magnetic sensors,” Transp. Problems, vol. 9,
[45] J. Li, X. Liang, Y. Wei, T. Xu, J. Feng, and S. Yan, “Percep- no. 1, pp. 49–60, 2014.
tual generative adversarial networks for small object detection,” in [65] S. S. M. Ali, B. George, L. Vanajakshi, and J. Venkatraman, “A multi-
Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, ple inductive loop vehicle detection system for heterogeneous and lane-
pp. 1222–1230. less traffic,” IEEE Trans. Instrum. Meas., vol. 61, no. 5, pp. 1353–1360,
[46] C.-Y. Liou, W.-C. Cheng, J.-W. Liou, and D.-R. Liou, “Autoencoder May 2012, doi: 10.1109/TIM.2011.2175037.
for words,” Neurocomputing, vol. 139, pp. 84–96, Sep. 2014, doi: [66] M. Hickman and P. Mirchandani, “Airborne traffic flow data and traffic
10.1016/j.neucom.2013.09.055. management,” in Proc. Greenshields Symp., 2008, pp. 121–132.
[47] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, [67] J. Leitloff, D. Rosenbaum, F. Kurz, O. Meynberg, and P. Reinartz,
“Stacked denoising autoencoders: Learning useful representations in a “An operational system for estimating road traffic information from aer-
deep network with a local denoising criterion,” J. Mach. Learn. Res., ial images,” Remote Sens., vol. 6, no. 11, pp. 11315–11341, Nov. 2014,
vol. 11, no. 12, pp. 3371–3408, Dec. 2010. doi: 10.3390/rs61111315.
[48] J. Masci, U. Meier, D. Cireşan, and J. Schmidhuber, “Stacked convo- [68] K. V. Najiya and M. Archana, “UAV video processing for traf-
lutional auto-encoders for hierarchical feature extraction,” in Proc. Int. fic surveillence with enhanced vehicle detection,” in Proc. 2nd Int.
Conf. Artif. Neural Netw., 2011, pp. 52–59, doi: 10.1007/978-3-642- Conf. Inventive Commun. Comput. Technol. (ICICCT), Apr. 2018,
21735-7_7. pp. 662–668, doi: 10.1109/ICICCT.2018.8473204.
[49] J. Xu et al., “Stacked sparse autoencoder (SSAE) for nuclei detec- [69] M. Elloumi, R. Dhaou, B. Escrig, H. Idoudi, and L. A. Saidane,
tion on breast cancer histopathology images,” IEEE Trans. Med. “Monitoring road traffic with a UAV-based system,” in Proc. IEEE
Imag., vol. 35, no. 1, pp. 119–130, Jan. 2016, doi: 10.1109/TMI. Wireless Commun. Netw. Conf. (WCNC), Apr. 2018, pp. 1–6, doi:
2015.2458702. 10.1109/WCNC.2018.8377077.
[50] M. Schreyer, T. Sattarov, D. Borth, A. Dengel, and B. Reimer, [70] H. Zhang, M. Liptrott, N. Bessis, and J. Cheng, “Real-time traffic
“Detection of anomalies in large scale accounting data using deep analysis using deep learning techniques and UAV based video,” in
autoencoder networks,” 2017, arXiv:1709.05254. [Online]. Available: Proc. 16th IEEE Int. Conf. Adv. Video Signal Based Surveill. (AVSS),
http://arxiv.org/abs/1709.05254 Sep. 2019, pp. 1–5, doi: 10.1109/AVSS.2019.8909879.
[51] Y. Pu et al., “Variational autoencoder for deep learning of images, [71] X. Zhao, F. Pu, Z. Wang, H. Chen, and Z. Xu, “Detection, track-
labels and captions,” in Proc. 30th Int. Conf. Neural Inf. Process., ing, and geolocation of moving vehicle from UAV using monoc-
2016, pp. 2360–2368. ular camera,” IEEE Access, vol. 7, pp. 101160–101170, 2019, doi:
[52] P. Zhu, L. Wen, D. Du, X. Bian, Q. Hu, and H. Ling, “Vision meets 10.1109/access.2019.2929760.
drones: Past, present and future,” 2020, arXiv:2001.06303. [Online]. [72] S. Li, W. Zhang, G. Li, L. Su, and Q. Huang, “Vehicle detection in
Available: http://arxiv.org/abs/2001.06303 UAV traffic video based on convolution neural network,” in Proc. IEEE
[53] K. Bajaj, D. K. Singh, and M. A. Ansari, “Autoencoders 1st Conf. Multimedia Inf. Process. Retr. (MIPR), Apr. 2018, pp. 1–6,
based deep learner for image denoising,” Procedia Comput. Sci., doi: 10.1109/MIPR.2018.00009.
vol. 171, pp. 1535–1541, Jan. 2020, doi: 10.1016/j.procs.2020. [73] B. Coifman, M. McCord, R. G. Mishalani, and K. Redmill, “Surface
04.164. transportation surveillance from unmanned aerial vehicles,” in Proc.
[54] Z. Fang, T. Jia, Q. Chen, M. Xu, X. Yuan, and C. Wu, “Laser 83rd Annu. Meeting Transp. Res. Board, 2004, p. 28.
stripe image denoising using convolutional autoencoder,” Results [74] B. Coifman, M. McCord, R. G. Mishalani, M. Iswalt, and
Phys., vol. 11, pp. 96–104, Dec. 2018, doi: 10.1016/j.rinp.2018. Y. Ji, “Roadway traffic monitoring from an unmanned aerial vehi-
08.023. cle,” IEE Proc. Intell. Transp. Syst., vol. 153, no. 1, pp. 11–20,
[55] Y. Qiu, Y. Yang, Z. Lin, P. Chen, Y. Luo, and W. Huang, “Improved Mar. 2006.
denoising autoencoder for maritime image denoising and semantic [75] X. Xi, Z. Yu, Z. Zhan, Y. Yin, and C. Tian, “Multi-task cost-sensitive-
segmentation of USV,” China Commun., vol. 17, no. 3, pp. 46–57, convolutional neural network for car detection,” IEEE Access, vol. 7,
Mar. 2020, doi: 10.23919/JCC.2020.03.005. pp. 98061–98068, 2019, doi: 10.1109/ACCESS.2019.2927866.
[56] R. K. Keser and B. U. Toreyin, “Autoencoder based dimensionality [76] R. Ke, Z. Li, S. Kim, J. Ash, Z. Cui, and Y. Wang, “Real-time bidi-
reduction of feature vectors for object recognition,” in Proc. 15th Int. rectional traffic flow parameter estimation from aerial videos,” IEEE
Conf. Signal-Image Technol. Internet-Based Syst. (SITIS), Nov. 2019, Trans. Intell. Transp. Syst., vol. 18, no. 4, pp. 890–901, Apr. 2017, doi:
pp. 577–584, doi: 10.1109/sitis.2019.00097. 10.1109/TITS.2016.2595526.
[77] Q. Pan, X. Wen, Z. Lu, L. Li, and W. Jing, “Dynamic speed control of [100] J. Redmon and A. Farhadi, “YOLO9000: Better, faster, stronger,” in
unmanned aerial vehicles for data collection under Internet of Things,” Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017,
Sensors, vol. 18, no. 11, pp. 1–18, 2018, doi: 10.3390/s18113951. pp. 6517–6525, doi: 10.1109/CVPR.2017.690.
[78] B. Benjdira, T. Khursheed, A. Koubaa, A. Ammar, and K. Ouni, [101] J. Redmon and A. Farhadi, “YOLOv3: An incremental improve-
“Car detection using unmanned aerial vehicles: Comparison ment,” 2018, arXiv:1804.02767. [Online]. Available: http://arxiv.
between faster R-CNN and YOLOv3,” in Proc. 1st Int. Conf. org/abs/1804.02767
Unmanned Vehicle Systems-Oman (UVS), Feb. 2019, pp. 1–6, doi: [102] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for
10.1109/UVS.2019.8658300. dense object detection,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV),
[79] A. Ayalew, “A review on object detection from unmanned aerial vehicle Oct. 2017, pp. 2980–2988, doi: 10.1109/ICCV.2017.324.
using CNN,” Int. J. Advance Res., Ideas Innov. Technol., vol. 5, no. 4, [103] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, “YOLOv4: Optimal
pp. 241–243, 2019. speed and accuracy of object detection,” 2020, arXiv:2004.10934.
[80] Y. Koga, H. Miyazaki, and R. Shibasaki, “A CNN-based method [Online]. Available: http://arxiv.org/abs/2004.10934
of vehicle detection from aerial images using hard example min- [104] A. Ammar, A. Koubaa, M. Ahmed, and A. Saad, “Aerial images
ing,” Remote Sens., vol. 10, no. 1, p. 124, Jan. 2018, doi: processing for car detection using convolutional neural networks: Com-
10.3390/rs10010124. parison between faster R-CNN and YoloV3,” 2019, arXiv:1910.07234.
[81] Y. Xu, G. Yu, Y. Wang, X. Wu, and Y. Ma, “Car detection from [Online]. Available: http://arxiv.org/abs/1910.07234
low-altitude UAV imagery with the faster R-CNN,” J. Adv. Transp., [105] M. Radovic, O. Adarkwa, and Q. Wang, “Object recognition in aerial
vol. 2017, pp. 1–10, Aug. 2017, doi: 10.1155/2017/2823617. images using convolutional neural networks,” J. Imag., vol. 3, no. 2,
[82] H. Nguyen, “Improving faster R-CNN framework for fast vehicle p. 21, Jun. 2017, doi: 10.3390/jimaging3020021.
detection,” Math. Problems Eng., vol. 2019, pp. 1–11, Nov. 2019, doi: [106] J. Lu et al., “A vehicle detection method for aerial image based on
10.1155/2019/3808064. YOLO,” J. Comput. Commun., vol. 6, no. 11, pp. 98–107, 2018, doi:
[83] K. Shi, H. Bao, and N. Ma, “Forward vehicle detection based on 10.4236/jcc.2018.611009.
incremental learning and fast R-CNN,” in Proc. 13th Int. Conf. Comput. [107] T. Tang, Z. Deng, S. Zhou, L. Lei, and H. Zou, “Fast vehicle
Intell. Secur. (CIS), Dec. 2017, pp. 73–76. detection in UAV images,” in Proc. Int. Workshop Remote Sens. With
[84] K. Sun, B. Xiao, D. Liu, and J. Wang, “Deep high-resolution Intell. Process. (RSIP), May 2017, pp. 1–5, doi: 10.1109/RSIP.2017.
representation learning for human pose estimation,” in Proc. 7958795.
IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, [108] G. Amato, L. Ciampi, F. Falchi, and C. Gennaro, “Counting vehi-
pp. 5693–5703. cles with deep learning in onboard UAV imagery,” in Proc. IEEE
[85] Y. Li, Y. Chen, N. Wang, and Z.-X. Zhang, “Scale-aware trident Symp. Comput. Commun. (ISCC), Jun. 2019, pp. 1–5, doi: 10.1109/
networks for object detection,” in Proc. IEEE/CVF Int. Conf. Comput. ISCC47284.2019.8969620.
Vis. (ICCV), Oct. 2019, pp. 6054–6063.
[109] M.-R. Hsieh, Y.-L. Lin, and W. H. Hsu, “Drone-based object counting
[86] Z. Cheng, Y. Wu, Z. Xu, T. Lukasiewicz, and W. Wang, “Segmen-
by spatially regularized regional proposal network,” in Proc. IEEE
tation is all you need,” 2019, arXiv:1904.13300. [Online]. Available: Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 4165–4173, doi:
http://arxiv.org/abs/1904.13300
10.1109/ICCV.2017.446.
[87] Z. Li, C. Peng, G. Yu, X. Zhang, Y. Deng, and J. Sun, “DetNet:
[110] T. Tang, S. Zhou, Z. Deng, H. Zou, and L. Lei, “Vehicle detection in
Design backbone for object detection,” in Proc. Eur. Conf. Comput.
aerial images based on region convolutional neural networks and hard
Vis. (ECCV), 2018, pp. 334–350.
negative example mining,” Sensors, vol. 17, no. 2, p. 336, Feb. 2017,
[88] H. Law and J. Deng, “CornerNet: Detecting objects as paired key-
doi: 10.3390/s17020336.
points,” in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 734–750.
[111] J. Shen, N. Liu, H. Sun, X. Tao, and Q. Li, “Vehicle detection
[89] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierar-
in aerial images based on hyper feature map in deep convolutional
chies for accurate object detection and semantic segmentation,” in Proc.
network,” IEEE Access, vol. 13, no. 4, pp. 1989–2011, 2019, doi:
IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 580–587,
10.3837/tiis.2019.04.014.
doi: 10.1109/CVPR.2014.81.
[90] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep [112] T. Tang, S. Zhou, Z. Deng, L. Lei, and H. Zou, “Arbitrary-oriented
convolutional networks for visual recognition,” IEEE Trans. Pattern vehicle detection in aerial imagery with single convolutional neural
Anal. Mach. Intell., vol. 37, no. 9, pp. 1904–1916, Sep. 2015, doi: networks,” Remote Sens., vol. 9, no. 11, p. 1170, Nov. 2017, doi:
10.1109/TPAMI.2015.2389824. 10.3390/rs9111170.
[91] R. Girshick, “Fast R-CNN,” in Proc. IEEE Int. Conf. Comput. Vis. [113] Y. Guo, Y. Xu, and S. Li, “Dense construction vehicle detection
(ICCV), Dec. 2015, pp. 1440–1448, doi: 10.1109/ICCV.2015.169. based on orientation-aware feature fusion convolutional neural net-
[92] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards work,” Autom. Construct., vol. 112, Apr. 2020, Art. no. 103124, doi:
real-time object detection with region proposal networks,” IEEE Trans. 10.1016/j.autcon.2020.103124.
Pattern Anal. Mach. Intell., vol. 39, no. 6, pp. 1137–1149, Jun. 2017, [114] Q. Li, L. Mou, Q. Xu, Y. Zhang, and X. X. Zhu, “R3-Net: A deep net-
doi: 10.1109/TPAMI.2016.2577031. work for multioriented vehicle detection in aerial images and videos,”
[93] J. Dai, Y. Li, K. He, and J. Sun, “R-FCN: Object detection via region- IEEE Trans. Geosci. Remote Sens., vol. 57, no. 7, pp. 5028–5042,
based fully convolutional networks,” in Proc. Adv. Neural Inf. Process. Jul. 2019, doi: 10.1109/TGRS.2019.2895362.
Syst., 2016, pp. 379–387. [115] Y. Gu, B. Wang, and B. Xu, “A FPN-based framework for vehicle
[94] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” in detection in aerial images,” in Proc. 2nd Int. Conf. Video Image
Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2961–2969, Process., Dec. 2018, pp. 60–64, doi: 10.1145/3301506.3301531.
doi: 10.1109/ICCV.2017.322. [116] S. Vaddi, C. Kumar, and A. Jannesari, “Efficient object detection model
[95] L. Sommer, T. Schuchert, and J. Beyerer, “Comprehensive analysis of for real-time UAV applications,” 2019, arXiv:1906.00786. [Online].
deep learning-based vehicle detection in aerial images,” IEEE Trans. Available: http://arxiv.org/abs/1906.00786
Circuits Syst. Video Technol., vol. 29, no. 9, pp. 2733–2747, Sep. 2019, [117] H. Tayara and K. Chong, “Object detection in very high-resolution
doi: 10.1109/TCSVT.2018.2874396. aerial images using one-stage densely connected feature pyramid
[96] L. W. Sommer, T. Schuchert, and J. Beyerer, “Fast deep vehicle network,” Sensors, vol. 18, no. 10, p. 3341, Oct. 2018, doi: 10.3390/
detection in aerial images,” in Proc. IEEE Winter Conf. Appl. Comput. s18103341.
Vis. (WACV), Mar. 2017, pp. 311–319, doi: 10.1109/WACV.2017.41. [118] S. Maiti, P. Gidde, S. Saurav, S. Singh, Dhiraj, and S. Chaudhury,
[97] J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, and “Real-time vehicle detection in aerial images using skip-connected
A. W. M. Smeulders, “Selective search for object recognition,” Int. convolution network with region proposal networks,” in Pattern Recog-
J. Comput. Vis., vol. 104, no. 2, pp. 154–171, Sep. 2013, doi: nition and Machine Intelligence (Lecture Notes in Computer Science),
10.1007/s11263-013-0620-5. vol. 11941. Cham, Switzerland: Springer, 2019.
[98] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only [119] J. Yang, X. Xie, and W. Yang, “Effective contexts for UAV vehi-
look once: Unified, real-time object detection,” in Proc. IEEE Conf. cle detection,” IEEE Access, vol. 7, pp. 85042–85054, 2019, doi:
Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 779–788, doi: 10.1109/ACCESS.2019.2923407.
10.1109/CVPR.2016.91. [120] X. Xie et al., “Real-time vehicle detection from UAV imagery,” in
[99] W. Liu et al., “SSD: Single shot MultiBox detector,” in Proc. Eur. Conf. Proc. IEEE 4th Int. Conf. Multimedia Big Data (BigMM), Sep. 2018,
Comput. Vis., 2016, pp. 21–37, doi: 10.1007/978-3-319-46448-0_2. pp. 1–5, doi: 10.1109/BigMM.2018.8499466.
[121] A. Ghodrati, A. Diba, M. Pedersoli, T. Tuytelaars, and L. Van Gool, [140] Y. Zhou, T. Rui, Y. Li, and X. Zuo, “A UAV patrol system using
“DeepProposals: Hunting objects and actions by cascading deep con- panoramic stitching and object detection,” Comput. Electr. Eng.,
volutional layers,” Int. J. Comput. Vis., vol. 124, no. 2, pp. 115–131, vol. 80, Dec. 2019, Art. no. 106473, doi: 10.1016/j.compeleceng.
Sep. 2017, doi: 10.1007/s11263-017-1006-x. 2019.106473.
[122] L. Wang, J. Liao, and C. Xu, “Vehicle detection based on drone [141] L. Sommer, K. Nie, A. Schumann, T. Schuchert, and J. Beyerer,
images with the improved faster R-CNN,” in Proc. 11th Int. Conf. “Semantic labeling for improved vehicle detection in aerial imagery,”
Mach. Learn. Comput. (ICMLC), 2019, pp. 466–471, doi: 10.1145/ in Proc. 14th IEEE Int. Conf. Adv. Video Signal Based Surveill. (AVSS),
3318299.3318383. Aug. 2017, pp. 1–6, doi: 10.1109/AVSS.2017.8078510.
[123] L. W. Sommer, T. Schuchert, and J. Beyerer, “Deep learning [142] A. Kouris, C. Kyrkou, and C.-S. Bouganis, “Informed region
based multi-category object detection in aerial images,” Proc. SPIE, selection for efficient UAV-based object detectors: Altitude-aware
vol. 10202, May 2017, Art. no. 1020209, doi: 10.1117/12.2262083. vehicle detection with CyCAR dataset,” in Proc. IEEE/RSJ Int.
[124] C. Herrmann, D. Willersinn, and J. Beyerer, “Low-resolution convo- Conf. Intell. Robots Syst. (IROS), Nov. 2019, pp. 51–58, doi:
lutional neural networks for video face recognition,” in Proc. 13th 10.1109/IROS40897.2019.8967722.
IEEE Int. Conf. Adv. Video Signal Based Surveill. (AVSS), Aug. 2016, [143] J. Shen, N. Liu, H. Sun, and H. Zhou, “Vehicle detection in aerial
pp. 221–227, doi: 10.1109/AVSS.2016.7738017. images based on lightweight deep convolutional network and generative
[125] L. Sommer, A. Schumann, T. Schuchert, and J. Beyerer, “Multi feature adversarial network,” IEEE Access, vol. 7, pp. 148119–148130, 2019,
deconvolutional faster R-CNN for precise vehicle detection in aerial doi: 10.1109/ACCESS.2019.2947143.
imagery,” in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), [144] A. Kompella and R. V. Kulkarni, “A semi-supervised recurrent neural
Mar. 2018, pp. 635–642, doi: 10.1109/WACV.2018.00075. network for video salient object detection,” Neural Comput. Appl.,
[126] O. Acatay, L. Sommer, A. Schumann, and J. Beyerer, “Compre- vol. 33, pp. 1–19, Jun. 2020, doi: 10.1007/s00521-020-05081-5.
hensive evaluation of deep learning based detection methods for
[145] X. Feng, Q. M. Jonathan Wu, Y. Yang, and L. Cao, “An autuencoder-
vehicle detection in aerial imagery,” in Proc. 15th IEEE Int. Conf.
based data augmentation strategy for generalization improvement of
Adv. Video Signal Based Surveill. (AVSS), Nov. 2018, pp. 1–6, doi:
DCNNs,” Neurocomputing, vol. 402, pp. 283–297, Aug. 2020, doi:
10.1109/AVSS.2018.8639127.
10.1016/j.neucom.2020.03.062.
[127] H. Tayara, K. Gil Soo, and K. T. Chong, “Vehicle detection and
counting in high-resolution aerial images using convolutional regres- [146] C. Shorten and T. M. Khoshgoftaar, “A survey on image data aug-
sion neural network,” IEEE Access, vol. 6, pp. 2220–2230, 2018, doi: mentation for deep learning,” J. Big Data, vol. 6, no. 1, Dec. 2019,
10.1109/ACCESS.2017.2782260. Art. no. 60, doi: 10.1186/s40537-019-0197-0.
[128] T. Tang, S. Zhou, Z. Deng, L. Lei, and H. Zou, “Fast multidirec- [147] Y. Chen, J. Li, Y. Niu, and J. He, “Small object detection networks
tional vehicle detection on aerial images using region based convo- based on classification-oriented super-resolution GAN for UAV aerial
lutional neural networks,” in Proc. IEEE Int. Geosci. Remote Sens. imagery,” in Proc. Chin. Control Decis. Conf. (CCDC), Jun. 2019,
Symp. (IGARSS), Jul. 2017, pp. 1844–1847, doi: 10.1109/IGARSS. pp. 4610–4615, doi: 10.1109/CCDC.2019.8832735.
2017.8127335. [148] K. Zheng, M. Wei, G. Sun, B. Anas, and Y. Li, “Using vehicle synthesis
[129] Z. Deng, H. Sun, S. Zhou, J. Zhao, and H. Zou, “Toward fast and generative adversarial networks to improve vehicle detection in remote
accurate vehicle detection in aerial images using coupled region- sensing images,” ISPRS Int. J. Geo-Inf., vol. 8, no. 9, p. 390, Sep. 2019,
based convolutional neural networks,” IEEE J. Sel. Topics Appl. Earth doi: 10.3390/ijgi8090390.
Observ. Remote Sens., vol. 10, no. 8, pp. 3652–3664, Aug. 2017, doi: [149] M. I. Lakhal, S. Escalera, and H. Cevikalp, “CRN: End-to-end convo-
10.1109/JSTARS.2017.2694890. lutional recurrent network structure applied to vehicle classification,” in
[130] S. Zhang, L. Wen, X. Bian, Z. Lei, and S. Z. Li, “Single-shot Proc. 13th Int. Joint Conf. Comput. Vis., Imag. Comput. Graph. Theory
refinement neural network for object detection,” in Proc. IEEE/CVF Appl., 2018, pp. 137–144, doi: 10.5220/0006533601370144.
Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 4203–4212, doi: [150] G. Ning et al., “Spatially supervised recurrent convolutional neural
10.1109/CVPR.2018.00442. networks for visual object tracking,” in Proc. IEEE Int. Symp. Circuits
[131] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, Syst. (ISCAS), May 2017, pp. 1–4, doi: 10.1109/ISCAS.2017.8050867.
“Feature pyramid networks for object detection,” in Proc. IEEE Conf. [151] S. Zhang, G. Wu, J. P. Costeira, and J. M. F. Moura, “FCN-rLSTM:
Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 2117–2125, doi: Deep spatio-temporal neural networks for vehicle counting in city
10.1109/CVPR.2017.106. cameras,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017,
Abdelmalek Bouguettaya received the master's and Ph.D. degrees in telecommunications from Badji Mokhtar University, Annaba, Algeria, in 2011 and 2017, respectively.

He was a member of the "Embedded and Detection" Advanced Systems Division, Laboratoire d'Etude et de Recherche en Instrumentation et en Communication d'Annaba (LERICA Laboratory), Annaba, from 2009 to 2018. He is currently a Senior Researcher with the Research Center in Industrial Technologies (CRTI), Chéraga, Algeria. His current research interests include deep learning, computer vision, intelligent embedded systems, and unmanned aerial vehicles (UAVs).
Hafed Zarzour received the Ph.D. degree in computer science from Annaba University, Annaba, Algeria, in 2013.

He is currently an Associate Professor of computer science with the University of Souk Ahras, Souk-Ahras, Algeria. He has published several research articles in international journals and conferences of high repute, including IEEE, Springer, Elsevier, Wiley, ACM, Taylor and Francis, IGI Global, and Inderscience. His research focuses on deep learning, artificial intelligence, and educational technology.
Ahmed Kechida received the M.S. degree in electronics from the Polytechnic Military School, Algiers, Algeria, in 2003, and the Ph.D. degree in electronics from Saad Dahleb Blida 1 University, Blida, Algeria, in 2016.

He is currently a Senior Researcher with the Research Center in Industrial Technologies (CRTI), Algiers. His research interests include signal processing, image analysis, embedded systems for drones, and nondestructive evaluation by ultrasound.
Amine Mohammed Taberkit was born in Tlemcen, Algeria, in 1988. He received the master's degree in electronic instrumentation and the Ph.D. degree from the University of Tlemcen, Tlemcen, in 2012 and 2018, respectively.

In 2010, he became an Electronic Engineer at the University of Tlemcen. He was a member of the Research Unit of Materials and Renewable Energy Research (URMER) and a part-time Teacher with the University of Tlemcen from 2013 to 2019. He is currently a Senior Researcher with the Research Center in Industrial Technologies (CRTI), Chéraga, Algiers. His earlier research interests included strained silicon technology, MOSFET transistors, and nanotechnology. His current research interests include artificial intelligence, intelligent embedded systems, unmanned aerial vehicles (UAVs), and computer vision.