Deep Learning for Shark Detection Tasks

Wenlu Zhang∗, Xinyi Chen∗, Dhara Bhadani∗, Patrick Rex†, Yu Yang‡, Christopher G. Lowe†, Hen-Geul Yeh§
∗ Department of Computer Engineering and Computer Science, California State University Long Beach, CA, 90840
† Department of Biological Sciences, California State University Long Beach, CA, 90840
‡ Department of Chemical Engineering, California State University Long Beach, CA, 90840
§ Department of Electrical Engineering, California State University Long Beach, CA, 90840

2021 IEEE Green Energy and Smart Systems Conference (IGESSC) | 978-1-6654-3456-0/21/$31.00 ©2021 IEEE | DOI: 10.1109/IGESSC53124.2021.9618703

Abstract—Automatic detection of free-ranging sharks near beach areas is of great importance in maintaining safe human-shark interaction. The task is especially challenging due to the limitations of existing shark detection methods and the sparsity of field images collected from Unmanned Aerial Vehicles (UAVs). Recently, deep learning has been tremendously successful in various real-world applications such as automated driving, object detection, face recognition, and medical diagnosis. In this paper, we propose an automated pipeline for shark detection tasks. Specifically, we apply several state-of-the-art object detection models to our shark field data set: Faster R-CNN, Mask R-CNN, Feature Pyramid Network (FPN) and RetinaNet. We report quantitative comparison results for these models and provide example detection images. The experiments show that the models are capable of fast and efficient detection of shark and non-shark objects.

Index Terms—Object Detection, Shark Recognition, Deep Learning, Convolutional Neural Network

I. INTRODUCTION

Automatically identifying and detecting the activities of free-ranging sharks plays an essential role in maintaining a healthy marine ecosystem and reducing the risk to public safety for beachgoers [1]. The recent technical development of UAVs provides a new, low-cost opportunity for managing human-shark interaction [2]–[5]. Most existing research on shark detection has combined UAVs with machine learning techniques [6]–[8]. However, these methods did not fully investigate state-of-the-art object detection models.

From the perspective of computer vision, classical object detection methods usually involve three major stages: 1) a multi-scale sliding window scans the whole image to generate candidate region proposals; 2) hand-crafted feature extractors such as SIFT [9] and HOG [10] describe each candidate region; and 3) a shallow machine learning classifier, such as a Support Vector Machine [11], makes the prediction. However, the computational cost of these traditional models is high, and the hand-crafted features are often not robust.

Recently, deep learning has made significant gains across a broad range of models, including Convolutional Neural Networks (CNN) [12]–[14], Recurrent Neural Networks (RNN) [15]–[18], Transformers [19], and Generative Adversarial Networks (GAN) [20]. Deep learning has also made tremendous progress in object detection. Most state-of-the-art object detectors fall into two major categories: region-proposal-based detection models, and classification and bounding box regression based detection models [21], [22]. In this paper, we mainly focus on region-proposal-based Convolutional Neural Networks, including Faster R-CNN [23], Mask R-CNN [24] and RetinaNet [25], because these methods can handle imbalanced classes, uneven image illumination and sparsity challenges.

II. MATERIALS AND METHOD

A. Field Data Collection

Survey Protocol: Small Unmanned Aerial Vehicles (sUAVs) were used to conduct video surveys of the southern California coast between Point Conception, California (34.4486° N, 120.4716° W) and San Diego, California (32.7157° N, 117.1611° W) from January 2019 to December 2020. Beaches where large aggregations of juvenile white sharks (Carcharodon carcharias) were present were selected to increase the probability of shark observations. To ensure that data was collected under a wide range of environmental conditions, survey days were selected semi-haphazardly. The sUAV was flown at 5.5 to 6.0 m/s along a 1 km stretch of coastline, following the specific contour of each beach. Altitude of the sUAV varied from 30 m to 120 m, resulting in variation in the pixel silhouettes of subjects in the frame. The sUAV was positioned so that the shoreline and the outside of the wave break were within the same camera frame at all times. This ensured that all human subjects using the shoreline for recreation would be encompassed within the first transect of the survey. If no juvenile white sharks were observed during the first 1 km transect of the survey, the pilot flew the drone 75 m offshore and then returned to a position parallel to the start of the survey. This was repeated until a shark was spotted or until the pilot performed a transect 500 m offshore, in which case the survey would end. If a shark was observed, it was tracked by positioning the sUAV directly above the central point of the shark and following it for the remaining battery life of the sUAV. Survey duration ranged from 16 to 22 min.

Image Selection for Analysis: Video surveys were filmed at 4K resolution (3840 × 2160) at 30 frames per second using the stock onboard camera of the Phantom 4 Pro v2.0 (Da-Jiang Innovations) sUAV. Images were selected from the video using VLC media player (VideoLAN) during post-survey review of the video surveys. Instances where humans and sharks were in the same camera frame were prioritized for analysis; however, images where only humans or only sharks were within the frame were also analyzed. Images with varying environmental conditions were selected to ensure the algorithm was trained across a range of light levels, glare, wind waves, and water clarity. Images were only selected if all subjects in the frame were clearly visible. For example, images where sharks were too deep in the water column for their silhouettes to be fully outlined in the labeling software, or where humans were obscured by broken waves, were not analyzed.

B. Method

In this section, we describe several state-of-the-art region-proposal-based object detection models. R-CNN (Regions with CNN features) [26] is a two-stage detection algorithm. The first stage identifies region proposals in an image that may contain an object; the second stage classifies the object in each region. The R-CNN detector first generates region proposals using external methods such as Selective Search or Edge Boxes. A CNN then extracts a fixed-length feature vector from each region. Finally, region bounding boxes are refined by an SVM using the feature map generated by the CNN. However, training R-CNN is expensive and detection is slow at test time. The Fast Region-based Convolutional Network (Fast R-CNN) [27] addresses these issues. The approach is similar to R-CNN, but instead of feeding each region proposal to the CNN, the whole input image is fed to the CNN to generate a feature map. From this feature map, each region proposal is warped into a square and reshaped into a fixed-size feature vector by a Region of Interest (RoI) pooling layer. The RoI feature vector is used to predict the label of the proposed region and its bounding box. Fast R-CNN is much more efficient than R-CNN because the convolutional computations for overlapping regions are shared.
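To make the shared-feature-map idea concrete, the following minimal sketch (our own illustration, not code from [27]; the tensor sizes and box coordinates are invented) pools two overlapping proposals from a single feature map with torchvision's roi_pool operator. The roi_align operator, used later by Mask R-CNN, has the same interface but interpolates instead of quantizing.

import torch
from torchvision.ops import roi_pool

# One shared feature map for the whole image: batch size 1, 256 channels, 50x50 grid.
features = torch.randn(1, 256, 50, 50)

# Two heavily overlapping proposals in image coordinates (batch_index, x1, y1, x2, y2),
# assuming a hypothetical 800x800 input image.
proposals = torch.tensor([[0.0, 100.0, 120.0, 300.0, 340.0],
                          [0.0,  90.0, 110.0, 310.0, 350.0]])

# Both proposals are pooled from the same feature map into a fixed 7x7 grid, so the
# backbone convolutions are computed once no matter how many proposals overlap.
# spatial_scale maps 800-pixel image coordinates onto the 50x50 feature grid.
rois = roi_pool(features, proposals, output_size=(7, 7), spatial_scale=50.0 / 800.0)
print(rois.shape)  # torch.Size([2, 256, 7, 7])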
Although Fast R-CNN achieves promising experimental results, it still relies heavily on external models to construct region proposals, and region proposal generation becomes the bottleneck limiting the efficiency of the detection system. To solve this problem, Ren et al. [23] introduced the Region Proposal Network (RPN). The RPN shares convolutional layers with the object detection network. On top of these layers, the RPN has a few additional convolutional layers that regress a bounding box and an objectness score at each rectangular region. Unlike Selective Search [28] or Edge Boxes [29], the RPN is a Fully Convolutional Network (FCN) designed to be trained end-to-end for generating region proposals. Therefore, Faster R-CNN can be trained end-to-end by backpropagation and stochastic gradient descent (SGD) to learn shared features.
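As a rough sketch of the extra layers described above (our own illustration; the channel width and anchor count are assumed, not taken from [23]), an RPN-style head is a shared 3 × 3 convolution followed by two sibling 1 × 1 convolutions that output an objectness score and four box-regression offsets per anchor at every spatial location:

import torch
import torch.nn as nn

class TinyRPNHead(nn.Module):
    """Minimal RPN-style head: objectness score plus box deltas per anchor (illustrative only)."""
    def __init__(self, in_channels=256, num_anchors=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1)
        self.objectness = nn.Conv2d(in_channels, num_anchors, kernel_size=1)       # one score per anchor
        self.bbox_deltas = nn.Conv2d(in_channels, num_anchors * 4, kernel_size=1)  # (dx, dy, dw, dh) per anchor

    def forward(self, feature_map):
        t = torch.relu(self.conv(feature_map))
        return self.objectness(t), self.bbox_deltas(t)

# Shapes only: a 256-channel feature map of size 50x50 yields 9 objectness scores and
# 36 regression values at each of the 2500 spatial locations.
scores, deltas = TinyRPNHead()(torch.randn(1, 256, 50, 50))
print(scores.shape, deltas.shape)  # torch.Size([1, 9, 50, 50]) torch.Size([1, 36, 50, 50])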
Feature Pyramid Network (FPN) [30] introduced a multi-scale architecture for the feature extractor, intended to be combined with other, independent object detection architectures. A series of convolutional layers creates a "bottom-up" feature pyramid with different receptive fields. Then a "top-down" pyramidal series of layers is created via up-sampling to simulate a higher resolution for layers of higher semantic value. To this end, the convolution outputs from the first pyramid are laterally connected to the corresponding layer on the top-down pathway, in a manner similar to ResNet [31]. At each of these merged layers, a 3 × 3 convolution layer is applied to reduce the aliasing effect of upsampling. That output is the final feature map used for object detection at that layer's specific scale.
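A compact sketch of the top-down pathway with lateral connections (again our own illustration with assumed channel counts, not the exact configuration used in [30]): each lateral 1 × 1 convolution projects a bottom-up stage to a common width, the coarser map is upsampled and added, and a 3 × 3 convolution smooths the merged result.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFPN(nn.Module):
    """Two-level FPN fragment: lateral 1x1 convs, top-down upsampling, 3x3 anti-aliasing convs."""
    def __init__(self, c3_channels=512, c4_channels=1024, out_channels=256):
        super().__init__()
        self.lateral3 = nn.Conv2d(c3_channels, out_channels, kernel_size=1)
        self.lateral4 = nn.Conv2d(c4_channels, out_channels, kernel_size=1)
        self.smooth3 = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
        self.smooth4 = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)

    def forward(self, c3, c4):
        p4 = self.lateral4(c4)
        # Upsample the semantically stronger map and merge it with the finer lateral features.
        p3 = self.lateral3(c3) + F.interpolate(p4, size=c3.shape[-2:], mode="nearest")
        return self.smooth3(p3), self.smooth4(p4)

# Bottom-up maps from two hypothetical ResNet stages: 100x100 and 50x50.
p3, p4 = TinyFPN()(torch.randn(1, 512, 100, 100), torch.randn(1, 1024, 50, 50))
print(p3.shape, p4.shape)  # both maps have 256 channels, each at its own scale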
Mask R-CNN [24] is an extension of Faster R-CNN. It adds a branch for predicting an instance segmentation mask on each Region of Interest (RoI), in parallel with the existing branch for classification and bounding box regression. In addition to the two outputs of Faster R-CNN, Mask R-CNN therefore adds a third output, a binary mask for each RoI. This additional mask output requires the extraction of a finer spatial layout of an object, so Mask R-CNN introduces a simple, quantization-free layer called RoIAlign that preserves spatial information for objects at different scales. During training, Mask R-CNN uses a multi-task loss on each RoI, L = L_cls + L_box + L_mask [24]. Here the classification loss L_cls and the regression loss L_box are identical to those of Fast R-CNN, and L_mask is defined as the average per-pixel binary cross-entropy loss over the sigmoid outputs for the ground-truth class mask. The mask branch predicts an m × m mask for each RoI to retain spatial dimensions. RoIPool, a key operation of Faster R-CNN, performs coarse spatial quantization for feature extraction, which introduces misalignment. This may not impact classification, but it negatively affects pixel-accurate mask prediction. RoIAlign resolves this issue by replacing the harsh quantization of RoIPool with bilinear interpolation, computing the exact values of the input features. RoIAlign significantly improves mask accuracy. Therefore, Mask R-CNN is a simple, flexible and fast system for instance segmentation and object detection.

RetinaNet [25] is a simple one-stage detector that utilizes a novel loss function (the focal loss) to address the class imbalance problem during training. Class imbalance is a common problem in which the number of locations that do not contain objects (negative locations) dramatically surpasses the number of locations that contain objects (positive locations). The vast number of negative locations can overwhelm the model and lead to degenerate solutions. Recent two-stage detector models address this issue by filtering out most negative locations in the first stage [23], [28], [32], [33]; correspondingly, the speed of these detectors is compromised. Rather than using two stages, RetinaNet uses a modulating factor in the loss function to dynamically adjust the scaling of the cross-entropy loss, which down-weights the contribution of easily classified negative locations and highlights the contribution of positive locations.
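The focal loss can be written in a few lines. The sketch below is a minimal binary (sigmoid) version with the commonly used defaults γ = 2 and α = 0.25; it is our own illustration rather than the exact implementation in [25] or in Detectron2.

import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss: cross-entropy scaled by (1 - p_t)^gamma so easy negatives contribute little."""
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)              # probability assigned to the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

# A confidently classified (easy) negative is down-weighted almost to zero,
# while a hard positive still contributes a substantial loss.
easy_neg = focal_loss(torch.tensor([-4.0]), torch.tensor([0.0]))
hard_pos = focal_loss(torch.tensor([-1.0]), torch.tensor([1.0]))
print(easy_neg.item(), hard_pos.item())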

Fig. 1: Example image from Microsoft COCO [(a) original image; (b) RetinaNet detection result]

Fig. 2: Shark Example Image One [(a) original shark image; (b) RetinaNet detection result]

III. EXPERIMENTAL RESULTS AND EVALUATION

A. Experimental Setup

Our shark detection data set contains 1241 images of size 3840 × 2160. The data set was initially considered a multi-class, multi-scale and sparse learning problem: it includes multiple classes such as shark, wader, surfer, body-boarding and stand-up paddle-boarding (SUP); multiple objects have different scales within the same image; and each image is quite sparse, since only a small area contains labeled objects. In addition, the non-shark classes such as wader, surfer, body-boarding and SUP each contain only a small number of images. To mitigate this data imbalance, we decided to treat the task as a binary-class problem, i.e., shark versus non-shark object detection. In the end, about 1/3 of the images contain shark objects and 2/3 contain non-shark objects. We randomly split the whole data set into a training set of 803 images, a validation set of 109 images and a testing set of 329 images. During training, the images were augmented by flipping horizontally and vertically and rotating clockwise by 90 degrees. In addition, the image hue was shifted by a value between -30 and 30, and each image was randomly shifted and re-scaled by a value between -10% and 10%.
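The paper does not state which augmentation library was used; one way to express the listed transforms (horizontal/vertical flips, 90-degree rotation, a hue shift of ±30, and a ±10% shift and re-scale) while keeping the bounding boxes consistent with the transformed image is with albumentations, assuming annotations in (x1, y1, x2, y2) format. This is only a sketch of an equivalent pipeline, not the authors' code.

import albumentations as A

augment = A.Compose(
    [
        A.HorizontalFlip(p=0.5),
        A.VerticalFlip(p=0.5),
        A.RandomRotate90(p=0.5),
        A.HueSaturationValue(hue_shift_limit=30, sat_shift_limit=0, val_shift_limit=0, p=0.5),
        A.ShiftScaleRotate(shift_limit=0.1, scale_limit=0.1, rotate_limit=0, p=0.5),
    ],
    bbox_params=A.BboxParams(format="pascal_voc", label_fields=["labels"]),
)

# Usage sketch: augmented = augment(image=image, bboxes=[[x1, y1, x2, y2]], labels=["shark"])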
All models were implemented using pre-trained models from Detectron2 [34] and the PyTorch 1.1.0 [35] deep learning framework on a workstation with the following configuration: one NVIDIA RTX 2060 SUPER 8 GB GPU, an Intel i7-10700K 8-core 3.8 GHz CPU, and the Windows 10 operating system. In our experiments, it takes about 3 minutes to fine-tune each individual model for 500 epochs, and inference costs only around 300 milliseconds per image.

The Detectron2 library was developed by Facebook AI Research (FAIR) and aims to provide state-of-the-art instance segmentation, semantic segmentation and object detection models. Because of the small number of shark images, we use the Detectron2 Model Zoo to fine-tune pre-trained models on our own data. The Model Zoo models were trained on the Microsoft Common Objects in Context (COCO) data set [36]. The COCO data set consists of 330,000 images, more than 200,000 of which are labeled, covering 80 object categories at a resolution of 640 × 480. In Detectron2, images from train2017 and val2017 were used to train and validate all the pre-trained models; more than 500,000 object instances are segmented in the data set. In Fig. 1, we select one example image from COCO; the RetinaNet detection result shows that multiple objects such as a horse, people and umbrellas have been successfully identified and detected with bounding boxes and probabilities.
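A minimal fine-tuning recipe of the kind described above might look as follows. The dataset names, annotation paths, learning rate and iteration count are placeholders we chose for illustration; they are not values reported in the paper.

from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.data.datasets import register_coco_instances
from detectron2.engine import DefaultTrainer

# Hypothetical COCO-format annotations for the shark data set.
register_coco_instances("shark_train", {}, "annotations/train.json", "images/train")
register_coco_instances("shark_val", {}, "annotations/val.json", "images/val")

cfg = get_cfg()
# Start from a COCO pre-trained RetinaNet baseline in the Detectron2 Model Zoo.
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/retinanet_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Detection/retinanet_R_50_FPN_3x.yaml")
cfg.DATASETS.TRAIN = ("shark_train",)
cfg.DATASETS.TEST = ("shark_val",)
cfg.MODEL.RETINANET.NUM_CLASSES = 2   # shark vs. non-shark
cfg.SOLVER.IMS_PER_BATCH = 2
cfg.SOLVER.BASE_LR = 0.00025
cfg.SOLVER.MAX_ITER = 1000            # placeholder; the paper only reports ~3 minutes of fine-tuning per model

trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()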

Fig. 3: Shark Example Image Two [(a) original shark image; (b) RetinaNet detection result]

Fig. 4: Body-boarding Example Image [(a) original body-boarding image; (b) RetinaNet detection result]

Fig. 5: Standup Paddleboarding Example Image [(a) original standup paddleboarding image; (b) RetinaNet detection result]

B. Experimental Results and Discussion

For the experimental evaluation, we use the Average Precision (AP) metric, averaged over Intersection over Union (IoU) thresholds from 0.5 to 0.95 in increments of 0.05. AP50 and AP75 denote the AP score at IoU thresholds of 0.5 and 0.75, respectively, and AP-Shark and AP-NoShark denote the AP score calculated separately for each of the two categories. The Average Recall (AR) metric is the average recall over the whole training process; its formula is as follows, where TP represents true positives and FN represents false negatives:

recall = TP / (TP + FN)

Both the AP and AR score calculations are provided by Detectron2 and are specifically designed for COCO-detection evaluation.
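For reference, the sketch below (a self-contained toy example, not the evaluation code actually used) shows how a single IoU value and the recall formula above are computed; COCO-style AP then averages precision over IoU thresholds from 0.5 to 0.95 in steps of 0.05.

def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Toy example: this detection counts as a true positive at the 0.5 threshold but not at 0.75.
print(iou((10, 10, 50, 50), (15, 15, 55, 55)))  # ~0.62

# Recall over a set of ground-truth objects: matched detections (TP) vs. missed objects (FN).
tp, fn = 8, 2
print(tp / (tp + fn))  # 0.8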

In the experiments, we use several backbone models from Detectron2; for example, R-50 is the MSRA original ResNet-50 model [31] and R-101 is the MSRA original ResNet-101 model [31]. For Faster R-CNN and Mask R-CNN, Detectron2 provides baselines with three different backbone combinations: ResNet with FPN, ResNet with a conv4 layer, and ResNet with a dilated conv5 layer. The experimental results of the three major object detection models are listed in Tables I, II and III. RetinaNet achieves the best performance in terms of AP, AP50, AP75, AP-Shark, AP-NoShark and recall. However, due to the sparse nature of the images and the limited number of training images, the effectiveness and efficiency of RetinaNet are still limited.

TABLE I: Comparison of Experimental Results, Faster R-CNN

Model      AP(%)    AP50(%)  AP75(%)  AP-Shark(%)  AP-NoShark(%)  AR(%)  Time(s/img)
R50-C4     17.835   47.354   7.640    25.697       9.972          35.3   0.245
R50-DC5    14.656   42.711   6.384    23.671       5.641          34.3   0.139
R50-FPN    26.958   63.111   13.180   30.417       23.498         44.1   0.081
R101-FPN   26.389   62.835   11.469   32.399       20.379         43.7   0.105

Fig. 6: Wader Example Image [(a) original wader image; (b) RetinaNet detection result]

Fig. 7: Non-shark object falsely detected [(a) wave image without label; (b) RetinaNet detection result]

TABLE II: Comparison of Experimental Results, Mask R-CNN

Model      AP(%)    AP50(%)  AP75(%)  AP-Shark(%)  AP-NoShark(%)  AR(%)  Time(s/img)
R50-C4     16.384   45.871   6.285    20.588       12.180         35.1   0.463
R50-DC5    20.331   56.286   10.462   28.527       12.134         37.5   0.428
R50-FPN    23.932   60.747   12.878   26.984       20.897         43.0   0.251
R101-FPN   25.078   65.492   11.354   27.577       22.578         42.6   0.219

TABLE III: Comparison of Experimental Results, RetinaNet

Model      AP(%)    AP50(%)  AP75(%)  AP-Shark(%)  AP-NoShark(%)  AR(%)  Time(s/img)
R50-FPN    36.004   70.549   33.233   51.374       20.634         43.9   0.083
R101-FPN   38.390   73.219   34.438   52.521       24.259         50.5   0.110

We also provide some example detection results for easy visualization. In Fig. 2, the shark object has been successfully detected with high probability. In Fig. 3, the hue of the shark is quite close to that of the background, but our detection model is still able to handle this illumination challenge. The images in Figs. 4, 5 and 6 contain only non-shark objects such as body-boarding, stand-up paddle-boarding (SUP) and wader; RetinaNet can accurately detect each non-shark object with bounding boxes using region proposals of different sizes. However, due to the sparse nature of the images and the limited size of the training set, we also show one mistaken prediction made by RetinaNet: in Fig. 7, the model falsely identifies a wave as a non-shark object.
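Detection overlays such as those shown in Figs. 2-7 can be reproduced with Detectron2's predictor and visualizer. The snippet below is a generic sketch; the weight path, image path and score threshold are placeholders we chose for illustration, not the authors' settings.

import cv2
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.data import MetadataCatalog
from detectron2.engine import DefaultPredictor
from detectron2.utils.visualizer import Visualizer

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/retinanet_R_50_FPN_3x.yaml"))
cfg.MODEL.RETINANET.NUM_CLASSES = 2
cfg.MODEL.WEIGHTS = "output/model_final.pth"          # placeholder path to the fine-tuned weights
cfg.MODEL.RETINANET.SCORE_THRESH_TEST = 0.5           # only draw reasonably confident detections

predictor = DefaultPredictor(cfg)
image = cv2.imread("frames/example_survey_frame.jpg")  # placeholder 3840x2160 survey frame
outputs = predictor(image)

# Overlay predicted boxes, class names and scores on the frame and save it to disk.
viz = Visualizer(image[:, :, ::-1], MetadataCatalog.get("shark_train"), scale=0.5)
overlay = viz.draw_instance_predictions(outputs["instances"].to("cpu"))
cv2.imwrite("example_detection.jpg", overlay.get_image()[:, :, ::-1])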
IV. CONCLUSION AND FUTURE WORK

In this work, we apply deep learning networks to automatic shark object detection. We implement three major deep learning architectures based on ResNet and region-based object detection models. In particular, we utilize the Feature Pyramid Network (FPN) and the focal loss function to address the imbalanced shark detection problem. In the future, we plan to implement the You Only Look Once (YOLO) network and the Single-Shot multibox Detector (SSD) to further improve the performance of automatic shark object detection tasks.

ACKNOWLEDGMENT

We would like to acknowledge Adrian Campos and Bernardo Cobos for engineering setup and discussion.

REFERENCES

[1] P. Simmons and M. I. Mehmet, "Shark management strategy policy considerations: community preferences, reasoning and speculations," Marine Policy, vol. 96, pp. 111–119, 2018.
[2] G. Shrivakshan and C. Chandrasekar, "A comparison of various edge detection techniques used in image processing," International Journal of Computer Science Issues (IJCSI), vol. 9, no. 5, p. 269, 2012.
[3] J. C. van Gemert, C. R. Verschoor, P. Mettes, K. Epema, L. P. Koh, and S. Wich, "Nature conservation drones for automatic localization and counting of animals," in European Conference on Computer Vision. Springer, 2014, pp. 255–270.
[4] L. F. Gonzalez, G. A. Montes, E. Puig, S. Johnson, K. Mengersen, and K. J. Gaston, "Unmanned aerial vehicles (uavs) and artificial intelligence revolutionizing wildlife monitoring and conservation," Sensors, vol. 16, no. 1, p. 97, 2016.
[5] B. Kane, C. A. Zajchowski, T. R. Allen, G. McLeod, and N. H. Allen, "Is it safer at the beach? spatial and temporal analyses of beachgoer behaviors during the covid-19 pandemic," Ocean & Coastal Management, vol. 205, p. 105533, 2021.
[6] N. Sharma, P. Scully-Power, and M. Blumenstein, "Shark detection from aerial imagery using region-based cnn, a study," in Australasian Joint Conference on Artificial Intelligence. Springer, 2018, pp. 224–236.
[7] R. Gorkin, K. Adams, M. J. Berryman, S. Aubin, W. Li, A. R. Davis, and J. Barthelemy, "Sharkeye: real-time autonomous personal shark alerting via aerial surveillance," Drones, vol. 4, no. 2, p. 18, 2020.
[8] A. P. Colefax, B. P. Kelaher, D. E. Pagendam, and P. A. Butcher, "Assessing white shark (carcharodon carcharias) behavior along coastal beaches for conservation-focused shark mitigation," Frontiers in Marine Science, vol. 7, p. 268, 2020.
[9] P. C. Ng and S. Henikoff, "Sift: Predicting amino acid changes that affect protein function," Nucleic acids research, vol. 31, no. 13, pp. 3812–3814, 2003.

[10] X. Wang, T. X. Han, and S. Yan, "An hog-lbp human detector with partial occlusion handling," in 2009 IEEE 12th international conference on computer vision. IEEE, 2009, pp. 32–39.
[11] C. Cortes and V. Vapnik, "Support-vector networks," Machine learning, vol. 20, no. 3, pp. 273–297, 1995.
[12] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[13] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification
with deep convolutional neural networks,” in Advances in neural infor-
mation processing systems, 2012, pp. 1097–1105.
[14] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan,
V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,”
in Proceedings of the IEEE conference on computer vision and pattern
recognition, 2015, pp. 1–9.
[15] J. F. Kolen and S. C. Kremer, A field guide to dynamical recurrent
networks. John Wiley & Sons, 2001.
[16] R. Pascanu, T. Mikolov, and Y. Bengio, “On the difficulty of training
recurrent neural networks,” in International conference on machine
learning, 2013, pp. 1310–1318.
[17] ——, “Understanding the exploding gradient problem,” CoRR,
abs/1211.5063, vol. 2, p. 417, 2012.
[18] A. Karpathy, J. Johnson, and L. Fei-Fei, “Visualizing and understanding
recurrent networks,” arXiv preprint arXiv:1506.02078, 2015.
[19] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez,
L. Kaiser, and I. Polosukhin, “Attention is all you need,” arXiv preprint
arXiv:1706.03762, 2017.
[20] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley,
S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial networks,”
arXiv preprint arXiv:1406.2661, 2014.
[21] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look
once: Unified, real-time object detection,” in Proceedings of the IEEE
conference on computer vision and pattern recognition, 2016, pp. 779–
788.
[22] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C.
Berg, “Ssd: Single shot multibox detector,” in European conference on
computer vision. Springer, 2016, pp. 21–37.
[23] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-
time object detection with region proposal networks,” arXiv preprint
arXiv:1506.01497, 2015.
[24] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in
Proceedings of the IEEE international conference on computer vision,
2017, pp. 2961–2969.
[25] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss
for dense object detection,” in Proceedings of the IEEE international
conference on computer vision, 2017, pp. 2980–2988.
[26] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature
hierarchies for accurate object detection and semantic segmentation,”
in Proceedings of the IEEE conference on computer vision and pattern
recognition, 2014, pp. 580–587.
[27] R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE international
conference on computer vision, 2015, pp. 1440–1448.
[28] J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders,
“Selective search for object recognition,” International journal of com-
puter vision, vol. 104, no. 2, pp. 154–171, 2013.
[29] C. L. Zitnick and P. Dollár, “Edge boxes: Locating object proposals from
edges,” in European conference on computer vision. Springer, 2014,
pp. 391–405.
[30] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie,
“Feature pyramid networks for object detection,” in Proceedings of the
IEEE conference on computer vision and pattern recognition, 2017, pp.
2117–2125.
[31] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” in Proceedings of the IEEE conference on computer vision
and pattern recognition, 2016, pp. 770–778.
[32] P. O. Pinheiro, R. Collobert, and P. Dollár, “Learning to segment object
candidates,” arXiv preprint arXiv:1506.06204, 2015.
[33] P. O. Pinheiro, T.-Y. Lin, R. Collobert, and P. Dollár, “Learning to refine
object segments,” in European conference on computer vision. Springer,
2016, pp. 75–91.
[34] Y. Wu, A. Kirillov, F. Massa, W.-Y. Lo, and R. Girshick, “Detectron2,”
https://github.com/facebookresearch/detectron2, 2019.
[35] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., "Pytorch: An imperative style, high-performance deep learning library," arXiv preprint arXiv:1912.01703, 2019.
[36] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft coco: Common objects in context," in European conference on computer vision. Springer, 2014, pp. 740–755.

