Article
A Multi-Scale Target Detection Method Using an Improved
Faster Region Convolutional Neural Network Based on
Enhanced Backbone and Optimized Mechanisms
Qianyong Chen, Mengshan Li * , Zhenghui Lai, Jihong Zhu and Lixin Guan
College of Physics and Electronic Information, Gannan Normal University, Ganzhou 341000, China;
[email protected] (Q.C.); [email protected] (Z.L.); [email protected] (J.Z.);
[email protected] (L.G.)
* Correspondence: [email protected]
Abstract: Currently, existing deep learning methods exhibit many limitations in multi-target detection,
such as low accuracy and high rates of false detection and missed detections. This paper proposes
an improved Faster R-CNN algorithm, aiming to enhance the algorithm’s capability in detecting
multi-scale targets. This algorithm has three improvements based on Faster R-CNN. Firstly, the
new algorithm uses the ResNet101 network for feature extraction of the detection image, which
achieves stronger feature extraction capabilities. Secondly, the new algorithm integrates Online Hard
Example Mining (OHEM), Soft non-maximum suppression (Soft-NMS), and Distance Intersection
Over Union (DIOU) modules, which improves the positive and negative sample imbalance and the
problem of small targets being easily missed during model training. Finally, the Region Proposal
Network (RPN) is simplified to achieve a faster detection speed and a lower miss rate. The multi-scale
training (MST) strategy is also used to train the improved Faster R-CNN to achieve a balance between
detection accuracy and efficiency. Compared to the other detection models, the improved Faster
R-CNN demonstrates significant advantages in terms of mAP@0.5, F1-score, and log-average miss
rate (LAMR). The model proposed in this paper provides valuable insights and inspiration for many
fields, such as smart agriculture, medical diagnosis, and face recognition.
Keywords: DIoU; improved Faster R-CNN; multi-scale target detection; ResNet101; Soft-NMS

1. Introduction

Object detection has long been a research focus for computer vision, and it has been extensively applied in areas such as face recognition, medical image diagnosis, and road detection [1–3]. At present, deep learning-based target detection methods can be broadly categorized into two main classes [4,5]. One class is the two-stage object detection approach typified by Region-based Convolutional Neural Networks (R-CNNs), and the other class is the one-stage approach typified by You Only Look Once (YOLO). These two types of algorithms have their own characteristics and advantages [6]. One-stage target detection algorithms detect targets directly on the original image without a region proposal step, so these algorithms are relatively faster, but the detection accuracy decreases when detecting different multi-scale targets. Two-stage object detection algorithms exhibit relatively higher detection accuracy but at the cost of slower processing speeds. Driven by the rapid advancement of deep learning, target detection algorithms have achieved impressive gains in both accuracy and processing speed. R-CNN [7] is the seminal work in object detection algorithms, which computes over candidate regions generated by the selective search method, and further applies SVM classification and bounding box regression. Consequently, R-CNN consumes too much time in image processing, resulting in low detection efficiency. He et al. [8] introduced the Faster R-CNN on the basis of R-CNN, which introduced the Region Proposal Network (RPN) to generate candidate regions and utilized shared convolutional
features to further improve detection accuracy and efficiency. Cai et al. [9] introduced Cas-
cade R-CNN, which is best characterized by cascading classifiers and multi-stage training
to improve detection accuracy and speed. Wan et al. [10] proposed an improved version
of Faster R-CNN with optimized convolutional and pooling layers for detecting a wide
range of fruits and achieving higher accuracy. Yang et al. [11] introduced an improved
strawberry detection algorithm based on Mask R-CNN, which resulted in a substantial
improvement in model generalization and robustness. After the R-CNN family of algo-
rithms, Redmon et al. [12] proposed YOLO as an alternative to R-CNN. Unlike R-CNN,
YOLO directly predicts classifications and regressions from features, using a single fully
connected layer for both tasks. This design enhances speed and efficiency in processing.
However, the disadvantage of YOLO is that its generalization ability and robustness are not
strong, and it is easy to miss small targets. With the aim of improving the above problems
effectively, Liu et al. [13] introduced the Single Shot MultiBox Detector (SSD) family of
algorithms. Zhu et al. [14] used SSD to detect fruits on mango trees with an F1 of 0.91.
Anagnostis et al. [15] used SSD to categorize infected trees in walnut orchards and the
method detected whether walnut leaves were infected with 87% accuracy. Tian et al. [16]
proposed EasyRP-R-CNN, a convolution-based framework for cyclone detection. The
method was improved based on Region of Interest and achieved satisfactory detection
accuracy. Li et al. [17] proposed a lightweight convolutional neural network, WearNet, to
achieve automatic detection of scratches on contact sliding parts such as metal molding,
and the classification accuracy of the method can reach 94.16%.
All of the methods mentioned above have some problems. (1) These methods detect and recognize only a single target category, or none at all, so they cannot satisfy multi-target detection tasks and are not able to accurately localize and recognize small targets. (2) These
methods do not have strong feature extraction capabilities for small targets and cannot
extract enough information about the target features. Thus, they can generate noise in the
detection region, resulting in a decrease in accuracy. (3) These methods do not achieve a
balance between detection accuracy and speed to meet the real-time demands of detection
tasks. For the purpose of improving the detection accuracy in a multi-scale target environ-
ment, after considering accuracy and detection efficiency, this paper chooses to use Faster
R-CNN as a baseline to detect different multi-scale targets on the Pascal VOC (Visual Object
Classes) dataset [18]. In this paper, the following modifications are made while reducing
model computation and improving model detection performance:
(1) ResNet101 [19] is employed as the trunk network in the improved Faster R-CNN,
which enhances the feature extraction capabilities of the model.
(2) The Online Hard Example Mining (OHEM) algorithm [20] is used to help the model learn
hard-to-classify samples more efficiently, which in turn enhances the model’s capacity
for generalization. The Soft non-maximum suppression (Soft-NMS) algorithm [21] and
the Distance Intersection Over Union (DIOU) algorithm [22] are used to optimize the ex-
cessive bounding boxes generated by the RPN and their overlap degree, which enhances
the accuracy of detecting small targets and improves the issue of missed target detection.
(3) The RPN structure is optimized by adding an anchor box with a scale of 64 and using
a smaller convolutional kernel to achieve bounding box regression. Employing the
multi-scale training (MST) method to train the improved Faster R-CNN [23] achieves
a balance between detection accuracy and speed.
By comparing the four networks, VGG16, ResNet34, ResNet50, and ResNet101 [19], ResNet101 is chosen as the trunk network. The introduction of DIOU can increase the effectiveness of the improved Faster R-CNN regarding the problems of slow convergence of the target detection loss function and target regression localization accuracy. The introduction of OHEM enables the improved Faster R-CNN to mine difficult samples in the dynamic training process. This can improve the issue of imbalance between positive and negative samples during the training process. As depicted in Figure 1, the feature map from the ResNet101 trunk network is input into the optimized RPN. At this point, a large number of candidate proposal boxes are generated on the feature map. Soft-NMS is used to eliminate redundant target proposal boxes. It can reduce the miss rate of small targets by gradually decreasing the confidence score of overlapping proposal boxes.
2.1.1. Improved Backbone Network
VGG16 [25] serves as the trunk network for the original Faster R-CNN. In general, data expansion and increasing network depth methods can be used to improve model performance, and network depth is very important for optimizing network performance [26]. ResNet [27] makes it possible for information to skip certain layers directly by introducing direct connections across layers in the network. This effectively improves the problems of gradient vanishing and gradient explosion. ResNet mainly uses convolutional operations instead of fully connected layers, which reduces the number of network parameters and can effectively avoid overfitting problems. Common ResNet structures are ResNet34, ResNet50, and ResNet101 [28]. One of the most significant features of ResNet50 compared to ResNet34 is the introduction of a new bottleneck residual block structure. It comprises a sequence of a 1 × 1 convolutional layer, followed by a 3 × 3 convolutional layer, and then another 1 × 1 convolutional layer. This structure allows ResNet50 to have stronger feature representation while maintaining the depth of the model. ResNet101 has an additional set of convolutional blocks compared to ResNet50, which contains multiple residual units. The deeper network structure allows ResNet101 to further enhance its expressive and learning capabilities, allowing it to better capture image details and semantic information, as shown in Figure 2.

ResNet101 consists of two fundamental blocks, as depicted in Figure 2a and Figure 2b, respectively, named Conv Block and Identity Block. The Conv Block's input and output dimensions are different and cannot be connected in series; its function is to change the network dimension. The input dimensions and output dimensions of the Identity Block are the same and can be connected in series to deepen the network. ResNet101 consists of multiple residual units in each convolutional block, as illustrated in Figure 2c. Each residual unit performs three convolutional operations. The shortcut connections of residual units and the identity mapping help address the issues of gradient vanishing or exploding, which can lead to a decrease in detection accuracy. As shown in Figure 2d, ResNet101 consists of five convolutional layers and has a depth of 101 layers, which can provide stronger feature extraction capabilities.

Figure 2. ResNet101 network composition. (a) Conv Block. (b) Identity Block. (c) Residual block. (d) ResNet101 network structure.
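To make the bottleneck structure described above concrete, a minimal PyTorch sketch of one residual unit is given below. The channel sizes and the projection shortcut are illustrative assumptions, not the exact configuration used by the authors.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """One ResNet bottleneck residual unit: 1x1 -> 3x3 -> 1x1 convolutions
    plus a shortcut connection, as described for ResNet50/ResNet101."""
    def __init__(self, in_channels, mid_channels, out_channels, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, mid_channels, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(mid_channels)
        self.conv2 = nn.Conv2d(mid_channels, mid_channels, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(mid_channels)
        self.conv3 = nn.Conv2d(mid_channels, out_channels, kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        # "Conv Block" variant: projection shortcut when dimensions change;
        # "Identity Block" variant: plain identity when they match.
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size=1,
                          stride=stride, bias=False),
                nn.BatchNorm2d(out_channels))
        else:
            self.shortcut = nn.Identity()

    def forward(self, x):
        identity = self.shortcut(x)
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        return self.relu(out + identity)

# Example: one unit of a deeper stage (the channel numbers are placeholders).
block = Bottleneck(in_channels=256, mid_channels=128, out_channels=512, stride=2)
feat = block(torch.randn(1, 256, 100, 100))
```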
In the fourth chapter, comparative experiments are conducted for VGG16, ResNet34, ResNet50, and ResNet101. Experimental results indicate that ResNet101 outperforms the other three networks in overall detection performance. Therefore, ResNet101 has been selected as the trunk network for the improved Faster R-CNN.

2.1.2. Modifying the Region Proposal Network

In the object detection process, CNNs are usually used to extract image features. These features are convolved and pooled through several convolution and pooling operations to produce a smaller feature map (also referred to as an activation map). The activation map contains semantic information and location information taken from the image, and then the activation map is input into the RPN. The RPN [8,29] will first further extract features from the input activation map through a shared convolutional layer, and then it will initially generate a set of predefined anchor boxes at each spatial position of the activation map to accomplish object detection. These predefined anchor boxes generated during the detection process usually have different width-to-height ratios as well as areas. Figure 3 shows the schematic of the optimized RPN.

For each anchor box, the RPN uses a binary classifier to predict whether it contains a target. It outputs the probability that each anchor box belongs to the foreground or background to complete the initial classification prediction. The conventional RPN generates nine anchor boxes at each spatial position of the activation map, with aspect ratios of 1:1, 1:2, and 2:1 and scales of 128, 256, and 512, respectively, as shown in Figure 3a. In order to improve the accuracy of detecting small targets and reduce the miss rate, a new anchor box with a scale of 64 is added to the RPN while the aspect ratios remain unchanged [30], so that there are a total of 12 anchor boxes at each position of the feature map, which is then able to better capture small-scale targets, as shown in Figure 3b. In addition to the classification predictions, the RPN completes an initial bounding box regression on the positive samples (anchor boxes containing a target) with the goal of accurately predicting the target's bounding box position. The output is the translation and scaling parameters relative to each positive-sample anchor box. As depicted in Figure 3c, the structure of the RPN is simplified, and only a 3 × 3 convolutional kernel is used to generate 256 feature maps. The benefits of using a 3 × 3 convolution kernel include reducing the number of parameters and the computational burden, improving computational efficiency, capturing local features, effectively utilizing boundary information, and simplifying the network design. The 3 × 3 convolution kernel is able to maintain model simplicity and computational efficiency while ensuring the effectiveness and accuracy of feature extraction.

Figure 3. Optimized RPN schematic. (a) RPN sliding window and anchors. (b) Modified RPN anchors. (c) Modified RPN.
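A minimal sketch of how the extra 64 scale changes the per-position anchor set is shown below. The function name and the base-size convention are assumptions for illustration, not the exact anchor code of the modified RPN.

```python
import itertools
import torch

def make_anchors(scales=(64, 128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (len(scales)*len(ratios), 4) anchors centered at the origin,
    encoded as (x1, y1, x2, y2). Adding the 64 scale to the usual
    (128, 256, 512) gives 12 anchors per feature-map position."""
    anchors = []
    for scale, ratio in itertools.product(scales, ratios):
        w = scale * (ratio ** 0.5)   # width and height chosen so that
        h = scale / (ratio ** 0.5)   # w * h == scale**2 and w / h == ratio
        anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return torch.tensor(anchors)

anchors = make_anchors()
print(anchors.shape)  # torch.Size([12, 4])
```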
Figure 4. Schematic representation of the different levels of overlap of the bounding box. (a) Three possible scenarios when IoUs are identical. (b) DIOU loss for bounding box regression.
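The DIOU loss referenced in Figure 4 and in the ablation study augments the IoU term with a normalized center-point distance penalty [22]. A minimal sketch of that standard formulation is given below; it illustrates the published definition rather than the authors' exact implementation.

```python
import torch

def diou_loss(pred, target, eps=1e-7):
    """DIoU loss for boxes given as (x1, y1, x2, y2) tensors of shape (N, 4):
    1 - IoU + (squared center distance) / (squared diagonal of enclosing box)."""
    # Intersection and union for the IoU term.
    lt = torch.max(pred[:, :2], target[:, :2])
    rb = torch.min(pred[:, 2:], target[:, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # Squared distance between box centers.
    center_p = (pred[:, :2] + pred[:, 2:]) / 2
    center_t = (target[:, :2] + target[:, 2:]) / 2
    rho2 = ((center_p - center_t) ** 2).sum(dim=1)

    # Squared diagonal of the smallest box enclosing both boxes.
    enc_lt = torch.min(pred[:, :2], target[:, :2])
    enc_rb = torch.max(pred[:, 2:], target[:, 2:])
    c2 = ((enc_rb - enc_lt) ** 2).sum(dim=1) + eps

    return (1 - iou + rho2 / c2).mean()
```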
(2) OHEM

During the object detection training process, the activation maps from the trunk network are input into the RPN, at which time a large number of candidate proposal boxes are generated. Generally, the IOU threshold is set to 0.5 and proposal boxes with an IOU above 0.5 are retained as positive samples, while those below 0.5 are treated as negative samples. This leads to significantly more positive samples than negative ones. As a result, the model may overlook difficult negative samples that are challenging to detect. These difficult samples can contribute higher loss values, improving the overall detection performance of the model.

In order to enable more difficult samples to be used in the dynamic process of training, OHEM is introduced into the improved Faster R-CNN [20]. Figure 5 illustrates the working principle of OHEM. OHEM [33] improves the training effect of the model by dynamically selecting and processing those difficult samples that the model currently has difficulty classifying during the training process. Specifically, it first detects the input image during each training session and selects the samples with higher loss as difficult samples according to the loss value. These difficult samples are then utilized for backpropagation and parameter updating so that the model learns to correctly classify difficult cases faster, thus improving its overall performance. The OHEM training strategy in the dynamic process of training can send the target samples that are difficult to detect into the network again for deep learning training. This makes the network more sensitive to the detection target, which in turn improves the target detection accuracy.
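A minimal sketch of the hard-example selection step is shown below: per-ROI losses are computed, the highest-loss ROIs are kept, and only those contribute to the backward pass. The model interface, the fraction kept, and the helper names are assumptions for illustration.

```python
import torch

def select_hard_examples(per_roi_loss, num_hard):
    """Online hard example mining: given a 1-D tensor of per-ROI losses,
    return the indices of the num_hard ROIs with the highest loss."""
    num_hard = min(num_hard, per_roi_loss.numel())
    _, hard_idx = torch.topk(per_roi_loss, k=num_hard)
    return hard_idx

def ohem_step(model, images, rois, targets, loss_fn, optimizer, num_hard=128):
    """Illustrative training step; loss_fn is assumed to return one loss per ROI."""
    logits = model(images, rois)                  # forward pass on all ROIs
    per_roi_loss = loss_fn(logits, targets)       # shape: (num_rois,)
    hard_idx = select_hard_examples(per_roi_loss.detach(), num_hard)
    loss = per_roi_loss[hard_idx].mean()          # backpropagate only hard ROIs
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```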
Figure 5. Diagram of the OHEM algorithm.
(3) Soft-NMS

NMS is a very important algorithm in target detection. Its basic idea is to sort the proposal boxes by their confidence scores and retain the one with the highest score. In this process, if the overlap between two proposal boxes exceeds a set threshold (generally 0.5), the box with the lower score is discarded, and the one with the higher score is retained. Therefore, the NMS score is based solely on the classification confidence, without considering the localization accuracy of the bounding box. This means that the classification and localization confidences are not positively correlated.

To effectively address some of the limitations of NMS, Soft-NMS is adopted in the improved Faster R-CNN [21]. Its linear weighted equation is defined as Equation (2).
$$
S_i =
\begin{cases}
S_i, & \mathrm{IOU}(M, b_i) < N_t \\
S_i \left(1 - \mathrm{IOU}(M, b_i)\right), & \mathrm{IOU}(M, b_i) \ge N_t
\end{cases}
\tag{2}
$$
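A minimal sketch of this linear re-weighting rule is given below (the symbols are defined in the paragraph that follows). It is a direct transcription of Equation (2) for illustration, not an optimized implementation.

```python
import torch

def box_iou_single(m, boxes, eps=1e-7):
    """IoU between one box m (4,) and boxes (N, 4), all as (x1, y1, x2, y2)."""
    lt = torch.max(m[:2], boxes[:, :2])
    rb = torch.min(m[2:], boxes[:, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    area_m = (m[2] - m[0]) * (m[3] - m[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_m + area_b - inter + eps)

def soft_nms_linear(boxes, scores, iou_thresh=0.5, score_thresh=0.001):
    """Linear Soft-NMS: instead of deleting boxes that overlap the current best
    box by more than Nt, decay their scores by (1 - IoU)."""
    scores = scores.clone()
    keep = []
    idx = torch.arange(scores.numel())
    while idx.numel() > 0:
        best = int(torch.argmax(scores[idx]))
        m = idx[best]
        keep.append(int(m))
        idx = torch.cat([idx[:best], idx[best + 1:]])        # remaining boxes
        if idx.numel() == 0:
            break
        iou = box_iou_single(boxes[m], boxes[idx])
        decay = torch.where(iou >= iou_thresh, 1.0 - iou, torch.ones_like(iou))
        scores[idx] = scores[idx] * decay
        idx = idx[scores[idx] > score_thresh]                 # drop negligible boxes
    return keep
```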
Here, Si is the proposal box confidence score and M is the proposal box with the highest score. The set of all proposal boxes during training is denoted by b, the area of overlap between M and the proposal boxes in set b is denoted by IOU(M, bi), and Nt is the IOU threshold set at the beginning. The difference between Soft-NMS and NMS is the way overlapping bounding boxes are handled. Traditional NMS directly removes other bounding boxes that overlap with the highest-scoring box above a certain threshold, which may result in some correct detections being mistakenly removed. In contrast, Soft-NMS does not remove these overlapping bounding boxes, but gradually reduces their scores. Soft-NMS works by attenuating the scores proportionally to the degree of overlap between the bounding box and the highest-scoring box [34]. The larger the overlap, the more the score is attenuated. In this way, Soft-NMS is able to retain more valuable detection results, reduce missed detections, and improve the accuracy of target detection.

2.1.4. Multi-Scale Training

The MST [23] method can enhance the model's adaptability and generalization capabilities when detecting a variety of target sizes. An image pyramid called MST [35] is used in the training process of the CNN. As shown in Figure 6, image training is achieved by randomly inputting images of different scales within a given range of image sizes, which enables the target detection model to adapt to targets of different scales. In the testing phase, the same image at different scales is input for multiple detections. Finally, Soft-NMS is employed to integrate all detection results, which enables the detection model to cover as many targets as possible at multiple scales and improves the robustness of the detection model. During the feature extraction phase, the generated activation map will be significantly smaller than the original image. This can make it challenging for the model to focus on the details of small targets. Therefore, by providing the model with larger and richer images, its detection capabilities can be enhanced effectively [36]. In this paper, MST is used to train the improved Faster R-CNN. The training samples have image lengths ranging from 380 to 640 and image widths spanning 300 to 450.
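A minimal sketch of the multi-scale input step is shown below: each training image is resized to a randomly drawn size inside the stated length (380–640) and width (300–450) windows. The interpolation mode and the omission of bounding-box rescaling are simplifying assumptions for illustration.

```python
import random
import torch
import torch.nn.functional as F

def multi_scale_resize(image, length_range=(380, 640), width_range=(300, 450)):
    """Resize a (C, H, W) image tensor to a randomly sampled size so the detector
    sees targets at many scales during training. In a full pipeline the ground
    truth boxes would be rescaled by the same factors."""
    new_h = random.randint(*width_range)
    new_w = random.randint(*length_range)
    resized = F.interpolate(image.unsqueeze(0), size=(new_h, new_w),
                            mode="bilinear", align_corners=False)
    return resized.squeeze(0)

img = torch.rand(3, 400, 400)
print(multi_scale_resize(img).shape)  # e.g. torch.Size([3, 412, 563])
```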
Figure 6. Diagram of the multi-scale training strategy.
$$\mathrm{AP} = \int_0^1 p(r)\,\mathrm{d}r \times 100\%$$

$$F1 = \frac{2PR}{P + R} \times 100\%$$

$$\mathrm{LAMR} = \frac{1}{M} \sum_{i=1}^{M} \log(\mathrm{MR}_i)$$

$$\mathrm{mAP} = \frac{1}{K} \sum_{i=1}^{K} \mathrm{AP}_i \times 100\%$$
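As a quick illustration of how these evaluation metrics combine, a small sketch is given below; the variable names and the example numbers are placeholders rather than values from the paper.

```python
import math

def f1_score(precision, recall):
    """F1 = 2PR / (P + R), reported as a percentage."""
    return 2 * precision * recall / (precision + recall) * 100.0

def mean_ap(per_class_ap):
    """mAP: average of the K per-class AP values (given as fractions)."""
    return sum(per_class_ap) / len(per_class_ap) * 100.0

def log_average_miss_rate(miss_rates):
    """Average of the log miss rates over M operating points, per the formula above."""
    return sum(math.log(mr) for mr in miss_rates) / len(miss_rates)

print(f1_score(0.62, 0.59))             # placeholder precision/recall
print(mean_ap([0.71, 0.80, 0.76]))      # placeholder per-class APs
print(log_average_miss_rate([0.3, 0.25, 0.4]))
```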
Figure 7. Data enhancement effect diagram.
3.2. Results

All experiments are conducted on the Colab cloud server platform, using a Tesla T4 graphics card with 16 GB of video memory and Windows as the operating system. The deep learning framework used is PyTorch 1.9, with 100 training epochs and all images
uniformly resized to 400 × 400. To improve model robustness and efficiency, the expanded
dataset is divided into three categories, A, B, and C [30], and the proportional allocation of
the dataset in each category is depicted in Table 2.
The improved Faster R-CNN is first applied to the training set of three datasets and the
results are recorded. During the training process, model error is minimized by adjusting
various training parameters over 100 iterations of training. Finally, the improved Faster R-CNN is evaluated using the test data from the three datasets. Figure 8 shows the experimental
result curves and data distributions during the training process.
Figure 8. The result curves and data distribution for datasets A, B, and C. (a) Training accuracy curve. (b) Distribution of training accuracy data. (c) Training loss curve. (d) Distribution of training loss data. (e) Test accuracy curve. (f) Distribution of test accuracy data. (g) Test loss curve. (h) Distribution of test loss data.
By observing the changes and positions of the curves, we know that the accuracy curves of dataset A are steadily increasing and located at the top of the three curves. This indicates that the improved Faster R-CNN has the best performance on dataset A. Meanwhile, the loss value curve of dataset A decreases steadily. The fluctuation is small, located at the bottom of the three curves, and gradually tends to be stable. This also indicates that the improved Faster R-CNN has stronger stability and robustness on dataset A. The variation of these curves can be seen in Figure 8a,c,e,g. Accordingly, by observing the data distribution of the experimental results, it can be found that, compared to the training and testing sets of datasets B and C, the accuracy values of dataset A are more centrally distributed and have higher positions, while the loss values are at lower distribution positions and more compactly distributed. The improved Faster R-CNN is adequately trained on dataset A with better generalization performance. This is because 80% of the images in dataset A are used for training so that the model can extract richer information about image features. Based on these experimental results, dataset A is used for all subsequent experiments.

4. Discussion

4.1. Comparison of Trunk Networks

In this section, comparative experiments [24] are conducted for VGG16, ResNet34, ResNet50, and ResNet101. A total of 15,000 images of dataset A are divided in the ratio of 8:1:1, and SGD is employed to optimize the model [42]. The momentum is 0.9, the learning rate is 0.001, and the F1 threshold is 0.5. After 100 training epochs, the model training curve converges, indicating that the model has reached an optimal solution. The improved Faster R-CNN improves the capability of target detection by combining different trunk networks. Table 3 shows the experimental results on the test set.

Table 3. Predictive performance of different trunk networks.
Methods mAP@0.5 (%) F1 (%) LAMR (%)
VGG16 72.7 55.3 31.4
ResNet34 73.3 56.4 30.8
ResNet50 73.8 56.2 30.2
ResNet101 74.9 57.2 29.5
As can be seen in Table 3, the trunk network with a residual structure has stable detection
performance in the multi-scale target category. With the increase of network layers, the
detection accuracy of ResNet50 surpasses that of VGG16 and there are fewer missed detections
for targets. Since the residual unit module inside the ResNet101 network is connected by a
multilevel residual network, this makes it better able to capture the local feature information
of multi-scale targets. ResNet101 has the best detection effect, with mAP@0.5 reaching 74.9%
and the LAMR reaching 29.5%. Therefore, from Table 3, it can be seen that the improvement
of the Faster R-CNN trunk network is effective. ResNet101 performs better in multi-scale
target category detection, and the miss rate for small targets is lower.
Figure 9 compares the detection results of the four trunk networks. Focusing on the
first two columns, when using ResNet101 as the feature extractor, detection performance
is notably improved compared to the other trunk networks. The target detection scores
are generally higher, the target localization is more accurate, and there are no target
misdetections or missed detections. VGG16 and ResNet50 mistakenly detected the ponytail
as a person, while ResNet34 missed the couch and did not detect the chair in the upper right
corner. In the third resultant image containing only small targets, ResNet101 can accurately
detect the person and the occluded bicycle in the image. This further demonstrates better
detection performance when using ResNet101.
Figure 9. Visual comparison of detection results for four backbone networks.
4.2. Comparison of Different Object Detectors

To demonstrate that the proposed method has better detection effectiveness and accuracy, experiments are also performed on the test set of the expanded dataset for RetinaNet [43], Faster R-CNN [8], Mask R-CNN [44], YOLOv4 [45], and Cascade R-CNN [9]. The SGD optimizer is used to optimize the model with a momentum of 0.9 and a learning rate of 0.001, with the learning rate decaying by a factor of 0.1 every 20 epochs. By continuously adjusting the training parameters, after 100 epochs, the training curves of each comparison model gradually level off. This indicates that the model training process is relatively smooth. Table 4 shows the mean values of each metric for the 20 targets detected by the six comparison models on the test set. Figure 10 shows a comparison of the metric values and their data distributions for the 20 detection targets of the six comparison models on the test set.
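The optimizer settings described above map directly onto standard PyTorch components; a minimal sketch is shown below. The model being optimized and the absence of weight decay are placeholders, since they are not specified in the text.

```python
import torch

model = torch.nn.Linear(10, 2)  # placeholder for the detection model
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
# Decay the learning rate by a factor of 0.1 every 20 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)

for epoch in range(100):
    # ... one epoch of training on the expanded dataset ...
    scheduler.step()
```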
Table 4. Experimental mean results for six comparison models.

Methods mAP@0.5 (%) F1 (%) LAMR (%) T (s)
RetinaNet 75.6 58.5 29.2 0.155
Faster R-CNN 72.7 55.3 31.4 0.147
Mask R-CNN 75.8 58.2 28.5 0.153
YOLOv4 76.5 59.2 27.2 0.132
Cascade R-CNN 76.2 59.7 28.1 0.139
This Paper 77.8 60.6 26.5 0.163
The improved Faster R-CNN improves mAP@0.5 to 77.8%, F1 to 60.6%, and the LAMR to 26.5%. Compared to the other five detection models, the proposed method has better performance. Specifically, compared with the original Faster R-CNN, the mAP@0.5 is improved by 5.1%, the F1 is improved by 5.3%, and the LAMR is reduced by 4.9%. This indicates that the proposed method is effective in improving the detection rate of small targets.
Figure 10. Comparison of experimental result values and their data distribution for 20 detection targets for each comparison model. (a,b) Comparisons of AP values. (c,d) Comparisons of F1 values. (e,f) Comparisons of LAMR values.
Observing Figure 10a,b, it can be seen that 15 AP values of the proposed approach
are located in the first position, and the data distribution of the prediction results is more
centralized than the other compared models. This indicates that the overall detection accuracy
of the model is higher. From Figure 10c,d, it can be seen that 17 F1 values are in the leading
position, among which two F1 values exceed 0.8, and the distribution intervals of the F1 values
are relatively higher. This indicates that the model is more adaptable when facing multi-scale
target categories. As shown in Figure 10e,f, the prediction results of the proposed approach
all have LAMR values below 0.5, and the miss rate is also reduced for small targets such as
pottedplant (1), sheep (4), and bottle (5). The reason for this may be that the bounding box
optimization mechanism, as well as the introduced multi-scale training strategy, plays a role.
The data distribution interval of the LAMR is also relatively lower. This indicates that the
model is able to focus on small targets that are difficult to detect when performing multi-scale
target category detection, and the model is more robust and stable. Figure 11 shows the
visualization of the detection image after mosaic data enhancement. Compared to the other
detection models, although the proposed approach is not the best in each target category, the
overall effect is excellent. The very small car in the upper right corner, as well as the occluded
chair in the lower left corner, can be accurately detected, and both target detection scores are
relatively high. This demonstrates that the proposed approach can better adapt to the different sizes and shapes of targets in different multi-scale detection target environments and improve the accuracy of small target detection.
Figure 11. Visual comparison of detection results for six comparison models.
Table 5. Results of the ablation experiments.

Method ResNet101 RPN DIOU OHEM Soft-NMS MST mAP@0.5 (%) F1 (%) LAMR (%)
Faster R-CNN (VGG16) - - - - - - 72.7 55.3 31.4
Faster R-CNN (ResNet101) ✓ - - - - - 74.9 57.2 29.5
Improve1 ✓ ✓ - - - - 75.2 57.4 29.2
Improve2 ✓ - ✓ - - - 75.1 57.3 29.3
Improve3 ✓ - - ✓ - - 75.4 57.6 28.9
Improve4 ✓ - - - ✓ - 75.0 57.4 29.3
Improve5 ✓ ✓ ✓ ✓ ✓ - 75.5 58.1 28.7
This paper ✓ ✓ ✓ ✓ ✓ ✓ 77.8 60.6 26.5
As depicted in Table 5, the addition of the modified RPN to the Faster R-CNN
(ResNet101) increases the mAP@0.5 by 0.3% and the F1 by 0.2%. The introduction of DIOU improves the mAP@0.5 by 0.2% and the F1 by 0.1%. It can be seen that the introduction of OHEM increases the mAP@0.5 by 0.5% and the F1 by 0.4%. This shows
its improvement for the problems of sample imbalance and insufficient training of hard
case samples in the target detection task. After replacing NMS with Soft-NMS, the miss
rate is reduced, while performance and accuracy are improved. Finally, by combining
the modules with a multi-scale training strategy, an improved Faster R-CNN is obtained
with a mAP of 77.8%, which is 5.1% higher than the original Faster R-CNN, and the
F1 is improved by 5.3%. It is worth noting that the LAMR of the proposed approach
compared to Improve5 is lower. This improvement is attributed to the efficacy of the
MST [16] in reducing the miss rate and enhancing the model’s recognition capabilities for
small targets. The partial detection results for Faster R-CNN, Improved5, and Improved
Faster R-CNN are shown in Figure 12. A comparison shows that Faster R-CNN is not
very effective in detecting cows and bottles with small sizes and misses some small
targets. While Improved5 improves this phenomenon, the network can detect some
small targets that were originally missed. The best detection was achieved by Improved
Faster R-CNN, which had more accurate edge localization, detected more small targets,
and largely correctly identified overlapping and low-contrast targets. The detection
performance of this model is much stronger.
Figure 12. Visual comparison of detection results for ablation experiments.
5. Conclusions

This paper proposes a novel two-stage object detection model for detecting multi-scale objects from diverse categories. The model introduces improvements to the Faster R-CNN and its trunk feature extraction network. DIOU, OHEM, and Soft-NMS are used to improve the problems of unbalanced positive and negative samples and the target miss rate during model training. The RPN is also optimized, and the proposed approach is trained by employing a multi-scale training strategy. Comparison experiments with trunk networks verify that using the ResNet101 feature extraction network is more advantageous. The validity of the proposed approach is further confirmed by comparison experiments with other detection models. Ablation experiments are also conducted to verify that the modules in the proposed approach can indeed be useful.
Author Contributions: Q.C. and M.L. designed the study; Q.C. and Z.L. performed the research; M.L.
and J.Z. conceived the idea; Q.C., J.Z. and Z.L. provided and analyzed the data; Q.C., M.L. and L.G.
helped perform the analysis with constructive discussions. All authors have read and agreed to the
published version of the manuscript.
Funding: This research was funded by the National Natural Science Foundation of China (Grant
Numbers: 51663001, 52063002, 42061067 and 61741202).
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: The dataset is available free of charge at kaggle. (https://www.kaggle.
com/datasets/qianyongchen/dataset, accessed on 10 August 2024).
Acknowledgments: The authors thank the anonymous reviewers and editors for their valuable
comments and constructive suggestions.
Conflicts of Interest: The authors declare no conflicts of interest.
References
1. Zeng, N.; Wu, P.; Wang, Z.; Li, H.; Liu, W.; Liu, X. A Small-Sized Object Detection Oriented Multi-Scale Feature Fusion Approach
with Application to Defect Detection. IEEE Trans. Instrum. Meas. 2022, 71, 1–14. [CrossRef]
2. Deng, Y.; Hu, X.L.; Li, B.; Zhang, C.X.; Hu, W.M. Multi-scale self-attention-based feature enhancement for detection of targets
with small image sizes. Pattern Recogn. Lett. 2023, 166, 46–52. [CrossRef]
3. Ma, Y.L.; Wang, Q.Q.; Cao, L.; Li, L.; Zhang, C.J.; Qiao, L.S.; Liu, M.X. Multi-Scale Dynamic Graph Learning for Brain Disorder
Detection with Functional MRI. IEEE Trans. Neur. Syst. Rehabil. 2023, 31, 3501–3512. [CrossRef] [PubMed]
4. Menezes, A.G.; de Moura, G.; Alves, C.; de Carvalho, A. Continual Object Detection: A review of definitions, strategies, and
challenges. Neural Netw. 2023, 161, 476–493. [CrossRef] [PubMed]
5. Xu, S.B.; Zhang, M.H.; Song, W.; Mei, H.B.; He, Q.; Liotta, A. A systematic review and analysis of deep learning-based underwater
object detection. Neurocomputing 2023, 527, 204–232. [CrossRef]
6. Goswami, P.K.; Goswami, G. A Comprehensive Review on Real Time Object Detection using Deep Learing Model. In Proceedings
of the 2022 11th International Conference on System Modeling & Advancement in Research Trends (SMART), Moradabad, India,
16–17 December 2022; pp. 1499–1502.
7. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014;
IEEE: Piscataway, NJ, USA, 2014; pp. 580–587.
8. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Trans.
Pattern Anal. 2017, 28, 1137–1149. [CrossRef] [PubMed]
9. Cai, Z.; Vasconcelos, N. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition(CVPR), Salt Lake City, UT, USA, 18–22 June 2018; IEEE: Piscataway, NJ, USA, 2018;
pp. 6154–6162.
10. Wan, S.; Goudos, S. Faster R-CNN for multi-class fruit detection using a robotic vision system. Comput. Netw. 2020, 168, 107036.
[CrossRef]
11. Yu, Y.; Zhang, K.; Yang, L.; Zhang, D. Fruit detection for strawberry harvesting robot in non-structural environment based on
Mask-RCNN. Comput. Electron. Agric. 2019, 163, 104846. [CrossRef]
12. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE: Piscataway,
NJ, USA, 2016; pp. 779–788.
13. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of
the 2016 14th European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37.
14. Liang, Q.; Zhu, W.; Long, J.; Wang, Y.; Sun, W.; Wu, W. A real-time detection framework for on-tree mango based on SSD network.
In Proceedings of the 2018 11th International Conference on Intelligent Robotics and Applications (ICIRA), Newcastle, NSW,
Australia, 9–11 August 2018; pp. 423–436.
15. Anagnostis, A.; Tagarakis, A.C.; Asiminari, G.; Papageorgiou, E.; Kateris, D.; Moshou, D.; Bochtis, D. A deep learning approach
for anthracnose infected trees classification in walnut orchards. Comput. Electron. Agric. 2021, 182, 105998. [CrossRef]
16. Tian, X.X.; Bi, C.K.; Han, J.; Yu, C. EasyRP-R-CNN: A fast cyclone detection model. Vis. Comput. 2024, 40, 4829–4841. [CrossRef]
17. Li, W.; Zhang, L.C.; Wu, C.H.; Cui, Z.X.; Niu, C. A new lightweight deep neural network for surface scratch detection. Int. J. Adv.
Manuf. Tech. 2022, 123, 1999–2015. [CrossRef] [PubMed]
18. Tong, K.; Wu, Y. Rethinking PASCAL-VOC and MS-COCO dataset for small object detection. J. Vis. Commun. Image R. 2023,
93, 103830. [CrossRef]
19. Demir, A.; Yilmaz, F.; Kose, O. Early detection of skin cancer using deep learning architectures: Resnet-101 and inception-v3. In
Proceedings of the 2019 Medical Technologies Congress (TIPTEKNO 2019), Izmir, Turkey, 3–5 October 2019; pp. 1–4.
20. Shrivastava, A.; Gupta, A.; Girshick, R. Training region-based object detectors with online hard example mining. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE:
Piscataway, NJ, USA, 2016; pp. 761–769.
21. Bodla, N.; Singh, B.; Chellappa, R.; Davis, L.S. Soft-NMS--improving object detection with one line of code. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE: Piscataway,
NJ, USA, 2017; pp. 5561–5569.
22. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. In
Proceedings of the 2020 20th AAAI Conference on Artificial Intelligence (AAAI-20), New York, NY, USA, 7–12 February 2020;
pp. 12993–13000.
23. Tian, R.; Shi, H.; Guo, B.; Zhu, L. Multi-scale object detection for high-speed railway clearance intrusion. Appl. Intell. 2022, 52,
3511–3526. [CrossRef]
24. Wang, H.; Xiao, N. Underwater object detection method based on improved Faster RCNN. Appl. Sci. 2023, 13, 2746. [CrossRef]
25. Lu, X.; Wang, H.; Zhang, J.J.; Zhang, Y.T.; Zhong, J.; Zhuang, G.H. Research on J wave detection based on transfer learning and
VGG16. Biomed. Signal. Process. 2024, 95, 106420. [CrossRef]
26. Pal, S.K.; Pramanik, A.; Maiti, J.; Mitra, P. Deep learning in multi-object detection and tracking: State of the art. Appl. Intell. 2021,
51, 6400–6429. [CrossRef]
27. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition(CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE: Piscataway, NJ, USA, 2016;
pp. 770–778.
28. Corso, M.P.; Stefenon, S.F.; Singh, G.; Matsuo, M.V.; Perez, F.L.; Leithardt, V.R.Q. Evaluation of visible contamination on power
grid insulators using convolutional neural networks. Electr. Eng. 2023, 105, 3881–3894. [CrossRef]
29. Chan, S.X.; Tao, J.; Zhou, X.L.; Bai, C.; Zhang, X.Q. Siamese Implicit Region Proposal Network with Compound Attention for
Visual Tracking. IEEE Trans. Image Process. 2022, 31, 1882–1894. [CrossRef] [PubMed]
30. Sha, G.; Wu, J.; Yu, B. The improved faster-RCNN for spinal fracture lesions detection. J. Intell. Fuzzy Syst. 2022, 42, 5823–5837.
[CrossRef]
31. Zhou, D.; Fang, J.; Song, X.; Guan, C.; Yin, J.; Dai, Y.; Yang, R. Iou loss for 2d/3d object detection. In Proceedings of the 2019
International Conference on 3D Vision (3DV), Québec City, QC, Canada, 16–19 September 2019; pp. 85–94.
32. Shen, Y.Y.; Zhang, F.Z.; Liu, D.; Pu, W.H.; Zhang, Q.L. Manhattan-distance IOU loss for fast and accurate bounding box regression
and object detection. Neurocomputing 2022, 500, 99–114. [CrossRef]
33. Wang, Z.H.; Jiang, Q.P.; Zhao, S.S.; Feng, W.S.; Lin, W.S. Deep Blind Image Quality Assessment Powered by Online Hard Example
Mining. IEEE Trans. Multimed. 2023, 25, 4774–4784. [CrossRef]
34. Li, W.B.; Wang, Q.; Gao, S. PF-YOLOv4-Tiny: Towards Infrared Target Detection on Embedded Platform. Intell. Autom. Soft
Comput. 2023, 37, 921–938. [CrossRef]
35. Xiao, L.; Wu, B.; Hu, Y. Surface defect detection using image pyramid. IEEE Sens. J. 2020, 20, 7181–7188. [CrossRef]
36. Sun, P.; Zhang, R.; Jiang, Y.; Kong, T.; Xu, C.; Zhan, W.; Tomizuka, M.; Li, L.; Yuan, Z.; Wang, C.; et al. Sparse R-CNN: End-to-End
Object Detection with Learnable Proposals. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), Nashville, TN, USA, 20–25 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 14449–14458.
37. Fang, H.; Ding, L.; Wang, L.; Chang, Y.; Yan, L.; Han, J. Infrared Small UAV Target Detection Based on Depthwise Separable
Residual Dense Network and Multiscale Feature Fusion. IEEE Trans. Instrum. Meas. 2022, 71, 1–20. [CrossRef]
38. Smart, P.D.S.; Thanammal, K.K.; Sujatha, S.S. An Ontology Based Multilayer Perceptron for Object Detection. Comput. Syst. Sci.
Eng. 2023, 44, 2065–2080. [CrossRef]
39. Zhang, X.; Zhao, C.; Luo, H.Z.; Zhao, W.Q.; Zhong, S.; Tang, L.; Peng, J.Y.; Fan, J.P. Automatic learning for object detection.
Neurocomputing 2022, 484, 260–272. [CrossRef]
40. Chen, J.A.; Tam, D.; Raffel, C.; Bansal, M.; Yang, D.Y. An Empirical Survey of Data Augmentation for Limited Data Learning in
NLP. Trans. Assoc. Comput. Linguist 2023, 11, 191–211. [CrossRef]
41. Shi, J.; Ghazzai, H.; Massoud, Y. Differentiable Image Data Augmentation and Its Applications: A Survey. IEEE Trans. Pattern
Anal. 2024, 46, 1148–1164. [CrossRef]
42. Gower, R.M.; Loizou, N.; Qian, X.; Sailanbayev, A.; Shulgin, E.; Richtárik, P. SGD: General analysis and improved rates. In
Proceedings of the International Conference on Machine Learning (ICML), Long Beach, CA, USA, 9–15 June 2019; pp. 5200–5209.
43. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International
Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988.
44. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision
(ICCV), Venice, Italy, 22–29 October 2017; pp. 2961–2969.
45. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.