Deep Learning for Shark Detection
Wenlu Zhang∗, Xinyi Chen∗, Dhara Bhadani∗, Patrick Rex†, Yu Yang‡, Christopher G. Lowe†, Hen-Geul Yeh§
∗ Department of Computer Engineering and Computer Science, California State University Long Beach, CA, 90840
† Department of Biological Sciences, California State University Long Beach, CA, 90840
‡ Department of Chemical Engineering, California State University Long Beach, CA, 90840
§ Department of Electrical Engineering, California State University Long Beach, CA, 90840
Abstract—Automatic detection of free-ranging sharks near beach areas is of great importance for maintaining safe human-shark interactions. The task is especially challenging because of the limitations of existing shark detection methods and the sparse features of field images collected from Unmanned Aerial Vehicles (UAVs). Recently, deep learning has been tremendously successful in real-world applications such as autonomous driving, object detection, face recognition, and medical diagnosis. In this paper, we propose an automated pipeline for shark detection. Specifically, we apply several state-of-the-art object detection models to our shark field data set: Faster R-CNN, Mask R-CNN, Feature Pyramid Network (FPN), and RetinaNet. We report quantitative comparison results for these object detection models and provide example detection images. The experiments show that the models are capable of fast and efficient detection of both shark and non-shark objects.

Index Terms—Object Detection, Shark Recognition, Deep Learning, Convolutional Neural Network
I. INTRODUCTION

Automatically identifying and detecting the activities of free-ranging sharks plays an essential role in maintaining a healthy marine ecosystem and reducing the risk to public safety for beachgoers [1]. Recent technical advances in UAVs provide a new, low-cost opportunity for managing human-shark interactions [2]–[5]. Most existing research on shark detection combines UAVs with machine learning techniques [6]–[8]; however, these methods have not fully investigated state-of-the-art object detection models.

From the perspective of computer vision, classical object detection methods usually involve three major stages: 1) a multi-scale sliding window scans the whole image to select informative region proposals; 2) hand-crafted feature extractors such as SIFT [9] and HOG [10] describe each region; and 3) a shallow machine learning classifier, such as a Support Vector Machine [11], makes the prediction. However, the computational cost of these traditional models is high, and the hand-crafted features are often not robust.

Recently, deep learning has made significant gains across a broad range of models, including the Convolutional Neural Network (CNN) [12]–[14], the Recurrent Neural Network (RNN) [15]–[18], the Transformer [19], and the Generative Adversarial Network (GAN) [20]. Deep learning has also been tremendously successful in object detection model design. State-of-the-art object detectors can be divided into two major categories: region-proposal-based detection models, and classification and bounding box regression based detection models [21], [22]. In this paper, we mainly focus on region-proposal-based Convolutional Neural Networks, including Faster R-CNN [23] and Mask R-CNN [24], as well as the one-stage detector RetinaNet [25], because these methods can handle imbalanced classes, image illumination changes, and sparsity challenges.

II. MATERIALS AND METHOD

A. Field Data Collection

Survey Protocol. Small Unmanned Aerial Vehicles (sUAVs) were used to conduct video surveys of the southern California coast between Point Conception, California (34.4486° N, 120.4716° W) and San Diego, California (32.7157° N, 117.1611° W) from January 2019 to December 2020. Specific beaches where large aggregations of juvenile white sharks (Carcharodon carcharias) were present were selected to increase the probability of shark observations. To ensure that data were collected under a wide range of environmental conditions, survey days were selected semi-haphazardly. The sUAV was flown at 5.5 to 6.0 m/s along a 1 km stretch of coastline, following the specific contour of each beach. The altitude of the sUAV varied from 30 m to 120 m, resulting in variation in the pixel silhouettes of subjects in the frame. The sUAV was positioned so that the shoreline and the outside of the wave break were within the same camera frame at all times. This ensured that all human subjects using the shoreline for recreation would be encompassed within the first transect of the survey. If no juvenile white sharks were observed during the first 1 km transect of the survey, the pilot flew the drone 75 m offshore and then returned to a position parallel to the start of the survey. This was repeated until a shark was spotted or until the pilot had performed a transect 500 m offshore, in which case the survey ended. If a shark was observed, it was tracked by positioning the sUAV directly above the central point of the shark and following the shark for the remaining battery life of the sUAV. Survey duration ranged from 16 to 22 min.

Image Selection for Analysis. Video surveys were filmed at 4K resolution (3840 × 2160) at 30 frames per second using the stock onboard camera of the Phantom 4 Pro v2.0 (Da-Jiang Innovations) sUAV.
Images were selected from the video using VLC media player (VideoLAN) during post-survey review of the video surveys. Instances where humans and sharks were in the same camera frame were prioritized for analysis; however, images where only humans or only sharks were within the frame were also analyzed. Images with varying environmental conditions were selected to ensure that the algorithm was trained on a range of light levels, glare, wind waves, and water clarity. Images were selected only if all subjects in the frame were clearly visible. For example, images where sharks were too deep in the water column for their silhouettes to be fully articulated in the labeling software, or where humans were obscured by broken waves, were not analyzed.
B. Method

In this section, we review several state-of-the-art region-proposal-based object detection models. R-CNN (Regions with CNN features) [26] is a two-stage detection algorithm: the first stage identifies region proposals in an image that may contain objects, and the second stage classifies the object in each region. The R-CNN detector first generates region proposals using external methods such as Selective Search or Edge Boxes. A CNN then extracts a fixed-length feature vector from each region. Finally, the region bounding boxes are refined by an SVM using the features generated by the CNN. However, training R-CNN is expensive, and detection is slow at test time. Fast R-CNN [27] addresses these issues. The approach is similar to R-CNN, but instead of feeding each region proposal to the CNN separately, the entire input image is fed to the CNN once to generate a feature map. From this feature map, each region proposal is warped and reshaped into a fixed-size feature vector by a Region of Interest (RoI) pooling layer. The RoI feature vector is then used to predict the label and bounding box of the proposed region. Fast R-CNN is much more efficient than R-CNN because the convolutional computation for overlapping regions is shared.
Although Fast R-CNN achieves promising experimental results, it still relies heavily on external region proposal models, and region proposal generation becomes the bottleneck that limits the efficiency of the detection system. To solve this problem, Ren et al. [23] introduced the Region Proposal Network (RPN). The RPN shares convolutional layers with the object detection network; on top of these layers, it adds a few convolutional layers that regress bounding boxes and objectness scores at each rectangular region. Unlike Selective Search [28] or Edge Boxes [29], the RPN is a Fully Convolutional Network (FCN) designed to be trained end-to-end for generating region proposals. Therefore, Faster R-CNN can be trained end-to-end by back-propagation and stochastic gradient descent (SGD) to learn shared features.
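As a concrete illustration (not the authors' exact training script), the following sketch shows how a Faster R-CNN model with an R50-FPN backbone could be configured and fine-tuned for a two-class shark/non-shark detection task using Detectron2 [34]; the dataset names, class count, and solver settings are illustrative assumptions:

# Hypothetical Detectron2 sketch: fine-tune Faster R-CNN (R50-FPN) for shark detection.
# Dataset names, class count, and solver settings are illustrative assumptions.
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"))
cfg.DATASETS.TRAIN = ("shark_train",)            # assumed registered dataset name
cfg.DATASETS.TEST = ("shark_val",)
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml")  # start from COCO-pretrained weights
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 2              # shark and non-shark
cfg.SOLVER.IMS_PER_BATCH = 2
cfg.SOLVER.BASE_LR = 0.00025
cfg.SOLVER.MAX_ITER = 5000

trainer = DefaultTrainer(cfg)                    # builds the model, optimizer, and data loaders
trainer.resume_or_load(resume=False)
trainer.train()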
The Feature Pyramid Network (FPN) [30] introduced a multi-scale architecture for the feature extractor that can be combined with independent object detection architectures. A series of convolutional layers creates a "bottom-up" feature pyramid with different receptive fields. A "top-down" pyramidal series of layers is then created via upsampling to simulate higher resolution for the layers of higher semantic value. To this end, the convolutional outputs from the bottom-up pyramid are laterally connected to the corresponding layers on the top-down pathway, in a manner similar to ResNet [31]. At each of these merged layers, a 3 × 3 convolution is applied to reduce the aliasing effect of upsampling; its output is the final feature map used for object detection at that layer's specific scale.
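For reference, torchvision ships a generic FPN module that implements exactly this lateral-connection and top-down scheme. The sketch below (with illustrative channel sizes, not the authors' implementation) shows how backbone feature maps at three scales are merged into a 256-channel pyramid:

import torch
from collections import OrderedDict
from torchvision.ops import FeaturePyramidNetwork

# Bottom-up feature maps from a backbone at three scales (channel counts are
# illustrative; ResNet stages C3-C5 would give 512/1024/2048 channels).
feats = OrderedDict()
feats["c3"] = torch.rand(1, 512, 64, 64)
feats["c4"] = torch.rand(1, 1024, 32, 32)
feats["c5"] = torch.rand(1, 2048, 16, 16)

# Lateral 1x1 convolutions project each level to 256 channels; the top-down
# pathway upsamples and merges, and a 3x3 convolution smooths each merged map.
fpn = FeaturePyramidNetwork(in_channels_list=[512, 1024, 2048], out_channels=256)
pyramid = fpn(feats)
for name, p in pyramid.items():
    print(name, tuple(p.shape))   # every level now has 256 channels at its own resolution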
Mask R-CNN [24] is an extension of Faster R-CNN. It adds a branch for predicting an instance segmentation mask on each Region of Interest (RoI), in parallel with the existing branches for classification and bounding box regression. In addition to the two outputs of Faster R-CNN, Mask R-CNN therefore adds a third output, a binary mask for each RoI. This additional mask output requires the extraction of a finer spatial layout of the object, so Mask R-CNN introduces a simple, quantization-free layer, called RoIAlign, that preserves spatial information for objects of different scales. During training, Mask R-CNN uses a multi-task loss on each RoI, L = Lcls + Lbox + Lmask [24]. Here the classification loss Lcls and the regression loss Lbox are identical to those of Fast R-CNN, while Lmask is the average per-pixel binary cross-entropy loss applied to the sigmoid outputs for the ground-truth class mask. The mask head predicts an m × m mask for each RoI to retain spatial dimensions. RoIPool, a key operation of Faster R-CNN, performs coarse spatial quantization for feature extraction, which introduces misalignment; this may not affect classification, but it degrades pixel-accurate mask prediction. RoIAlign resolves this issue by replacing the harsh quantization of RoIPool with bilinear interpolation, computing the exact values of the input features, and yields a significant improvement in mask accuracy. Therefore, Mask R-CNN is a simple, flexible, and fast system for instance segmentation and object detection.
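To make the RoIPool/RoIAlign distinction concrete, the short sketch below (an illustration, not the paper's code) compares torchvision's roi_pool and roi_align operators on a single feature map; roi_align samples with bilinear interpolation instead of snapping box coordinates to the feature grid:

import torch
from torchvision.ops import roi_pool, roi_align

# One feature map of shape (N, C, H, W); spatial_scale maps image-space boxes
# to feature-map coordinates (e.g. 1/16 for a stride-16 backbone stage).
features = torch.rand(1, 256, 50, 50)
# Boxes are (batch_index, x1, y1, x2, y2) in image coordinates (illustrative values).
boxes = torch.tensor([[0, 120.3, 80.7, 260.9, 210.2]])

pooled = roi_pool(features, boxes, output_size=(7, 7), spatial_scale=1.0 / 16)
aligned = roi_align(features, boxes, output_size=(7, 7), spatial_scale=1.0 / 16,
                    sampling_ratio=2, aligned=True)
print(pooled.shape, aligned.shape)  # both (1, 256, 7, 7); roi_align avoids quantization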
RetinaNet [25] is a simple one-stage detector that uses a novel loss function, the focal loss, to address the class imbalance problem during training. Class imbalance arises because the number of locations that do not contain objects (negative locations) dramatically exceeds the number of locations that do (positive locations). The vast number of negative locations can overwhelm the model and lead to degenerate solutions. Recent two-stage detectors address this issue by filtering out most negative locations in the first stage [23], [28], [32], [33], but at the cost of speed. Rather than using two stages, RetinaNet adds a modulating factor to the cross-entropy loss that dynamically adjusts its scale, down-weighting the contribution of easily classified negative locations and highlighting the contribution of positive locations.
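The modulating factor is easiest to see in code. The sketch below is a minimal binary focal loss in PyTorch following the formulation in [25], FL(pt) = -alpha * (1 - pt)^gamma * log(pt); it is illustrative rather than the authors' implementation (torchvision also provides an equivalent sigmoid_focal_loss):

import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss per anchor/location (illustrative sketch).

    logits: raw scores, shape (N,); targets: 0/1 labels, shape (N,).
    """
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)              # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    # (1 - p_t)^gamma is the modulating factor: it shrinks the loss of easily
    # classified (high-confidence) locations, which are mostly background.
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

# Example: 1000 anchors with only 3 positives -- the focal loss keeps the few
# positives from being drowned out by the many easy negatives.
logits = torch.randn(1000)
targets = torch.zeros(1000)
targets[:3] = 1.0
print(focal_loss(logits, targets))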
III. EXPERIMENTAL RESULTS AND EVALUATION

A. Experimental Setup

Our shark detection data set contains a total of 1241 images of size 3840 × 2160. The data set is inherently multi-class, multi-scale, and sparse.
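The backbone names in the tables below (R50-C4, R50-DC5, R50-FPN, R101-FPN) and the COCO-style AP/AR metrics correspond to the Detectron2 [34] model zoo and evaluator. As a hedged sketch (the annotation paths, dataset name, and checkpoint file are assumptions, not the authors' actual setup), a COCO-format version of the shark data set could be registered and evaluated as follows:

from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.data import build_detection_test_loader
from detectron2.data.datasets import register_coco_instances
from detectron2.engine import DefaultPredictor
from detectron2.evaluation import COCOEvaluator, inference_on_dataset

# Assumed COCO-format annotation file and image folder (illustrative paths).
register_coco_instances("shark_val", {}, "annotations/val.json", "images/val")

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 2
cfg.MODEL.WEIGHTS = "output/model_final.pth"     # assumed fine-tuned checkpoint
cfg.DATASETS.TEST = ("shark_val",)

predictor = DefaultPredictor(cfg)                # wraps the model for single-image inference
evaluator = COCOEvaluator("shark_val", output_dir="./eval")
val_loader = build_detection_test_loader(cfg, "shark_val")
# Reports COCO-style AP, AP50, AP75, per-category AP, and AR, as in the tables below.
print(inference_on_dataset(predictor.model, val_loader, evaluator))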
[Figure panels: (a) Original shark image; (a) Original standup paddleboarding image; (a) Wave image without label; (a) Original wader image]
Fig. 4: Body-boarding Example Image
Fig. 6: Wader Example Image

TABLE I: Comparison of Experimental Results Faster R-CNN
Model      AP(%)    AP50(%)  AP75(%)  AP-Shark(%)  AP-NoShark(%)  AR(%)  Time(s/img)
R50-DC5    14.656   42.711   6.384    23.671       5.641          34.3   0.139
R50-FPN    26.958   63.111   13.180   30.417       23.498         44.1   0.081
R101-FPN   26.389   62.835   11.469   32.399       20.379         43.7   0.105

TABLE II: Comparison of Experimental Results Mask R-CNN
Model      AP(%)    AP50(%)  AP75(%)  AP-Shark(%)  AP-NoShark(%)  AR(%)  Time(s/img)
R50-C4     16.384   45.871   6.285    20.588       12.180         35.1   0.463
R50-DC5    20.331   56.286   10.462   28.527       12.134         37.5   0.428
R50-FPN    23.932   60.747   12.878   26.984       20.897         43.0   0.251
R101-FPN   25.078   65.492   11.354   27.577       22.578         42.6   0.219

TABLE III: Comparison of Experimental Results RetinaNet
Model      AP(%)    AP50(%)  AP75(%)  AP-Shark(%)  AP-NoShark(%)  AR(%)  Time(s/img)
R50-FPN    36.004   70.549   33.233   51.374       20.634         43.9   0.083
R101-FPN   38.390   73.219   34.438   52.521       24.259         50.5   0.110
As shown in Figs. 4, 5, and 6, these images contain only non-shark objects such as body-boarders, stand-up paddle-boarders (SUP), and waders; RetinaNet accurately detects each non-shark object with bounding boxes of different sizes generated from the region proposals. However, due to the sparse nature of the images and the limited size of the training set, we also show one mistaken prediction made by RetinaNet: in Fig. 7, the model falsely identifies a wave as a non-shark object.
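The example detection images can be reproduced with a short inference-and-drawing loop. The sketch below is illustrative only: it assumes a `predictor` built as in the evaluation sketch of Section III-A and an assumed image path, and is not the authors' visualization code:

import cv2
from detectron2.data import MetadataCatalog
from detectron2.utils.visualizer import Visualizer

# `predictor` is assumed to be a DefaultPredictor configured as in Section III-A.
im = cv2.imread("images/val/bodyboarder_0001.jpg")   # assumed example frame
outputs = predictor(im)                               # predicted boxes, scores, classes

# Draw the predicted bounding boxes on the frame (BGR -> RGB for the visualizer).
v = Visualizer(im[:, :, ::-1], MetadataCatalog.get("shark_val"), scale=0.5)
vis = v.draw_instance_predictions(outputs["instances"].to("cpu"))
cv2.imwrite("bodyboarder_0001_pred.jpg", vis.get_image()[:, :, ::-1])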
IV. CONCLUSION AND FUTURE WORK

In this work, we apply deep learning networks to automatic shark detection. We implement three major deep learning architectures based on ResNet backbones and region-based object detection models. In particular, we utilize the Feature Pyramid Network (FPN) and the focal loss function to address the imbalanced shark detection problem. In the future, we plan to implement the You Only Look Once (YOLO) network and the Single Shot MultiBox Detector (SSD) to further improve the performance of automatic shark detection.

ACKNOWLEDGMENT

We would like to acknowledge Adrian Campos and Bernardo Cobos for the engineering setup and discussions.

REFERENCES

[1] P. Simmons and M. I. Mehmet, "Shark management strategy policy considerations: community preferences, reasoning and speculations," Marine Policy, vol. 96, pp. 111–119, 2018.
[2] G. Shrivakshan and C. Chandrasekar, "A comparison of various edge detection techniques used in image processing," International Journal of Computer Science Issues (IJCSI), vol. 9, no. 5, p. 269, 2012.
[3] J. C. van Gemert, C. R. Verschoor, P. Mettes, K. Epema, L. P. Koh, and S. Wich, "Nature conservation drones for automatic localization and counting of animals," in European Conference on Computer Vision. Springer, 2014, pp. 255–270.
[4] L. F. Gonzalez, G. A. Montes, E. Puig, S. Johnson, K. Mengersen, and K. J. Gaston, "Unmanned aerial vehicles (uavs) and artificial intelligence revolutionizing wildlife monitoring and conservation," Sensors, vol. 16, no. 1, p. 97, 2016.
[5] B. Kane, C. A. Zajchowski, T. R. Allen, G. McLeod, and N. H. Allen, "Is it safer at the beach? spatial and temporal analyses of beachgoer behaviors during the covid-19 pandemic," Ocean & Coastal Management, vol. 205, p. 105533, 2021.
[6] N. Sharma, P. Scully-Power, and M. Blumenstein, "Shark detection from aerial imagery using region-based cnn, a study," in Australasian Joint Conference on Artificial Intelligence. Springer, 2018, pp. 224–236.
[7] R. Gorkin, K. Adams, M. J. Berryman, S. Aubin, W. Li, A. R. Davis, and J. Barthelemy, "Sharkeye: real-time autonomous personal shark alerting via aerial surveillance," Drones, vol. 4, no. 2, p. 18, 2020.
[8] A. P. Colefax, B. P. Kelaher, D. E. Pagendam, and P. A. Butcher, "Assessing white shark (carcharodon carcharias) behavior along coastal beaches for conservation-focused shark mitigation," Frontiers in Marine Science, vol. 7, p. 268, 2020.
[9] P. C. Ng and S. Henikoff, "Sift: Predicting amino acid changes that affect protein function," Nucleic acids research, vol. 31, no. 13, pp. 3812–3814, 2003.
[10] X. Wang, T. X. Han, and S. Yan, "An hog-lbp human detector with partial occlusion handling," in 2009 IEEE 12th international conference on computer vision. IEEE, 2009, pp. 32–39.
[11] C. Cortes and V. Vapnik, "Support-vector networks," Machine learning, vol. 20, no. 3, pp. 273–297, 1995.
[12] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning
applied to document recognition,” Proceedings of the IEEE, vol. 86,
no. 11, pp. 2278–2324, 1998.
[13] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification
with deep convolutional neural networks,” in Advances in neural infor-
mation processing systems, 2012, pp. 1097–1105.
[14] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan,
V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,”
in Proceedings of the IEEE conference on computer vision and pattern
recognition, 2015, pp. 1–9.
[15] J. F. Kolen and S. C. Kremer, A field guide to dynamical recurrent
networks. John Wiley & Sons, 2001.
[16] R. Pascanu, T. Mikolov, and Y. Bengio, “On the difficulty of training
recurrent neural networks,” in International conference on machine
learning, 2013, pp. 1310–1318.
[17] ——, “Understanding the exploding gradient problem,” CoRR,
abs/1211.5063, vol. 2, p. 417, 2012.
[18] A. Karpathy, J. Johnson, and L. Fei-Fei, “Visualizing and understanding
recurrent networks,” arXiv preprint arXiv:1506.02078, 2015.
[19] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez,
L. Kaiser, and I. Polosukhin, “Attention is all you need,” arXiv preprint
arXiv:1706.03762, 2017.
[20] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley,
S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial networks,”
arXiv preprint arXiv:1406.2661, 2014.
[21] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look
once: Unified, real-time object detection,” in Proceedings of the IEEE
conference on computer vision and pattern recognition, 2016, pp. 779–
788.
[22] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C.
Berg, “Ssd: Single shot multibox detector,” in European conference on
computer vision. Springer, 2016, pp. 21–37.
[23] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-
time object detection with region proposal networks,” arXiv preprint
arXiv:1506.01497, 2015.
[24] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in
Proceedings of the IEEE international conference on computer vision,
2017, pp. 2961–2969.
[25] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss
for dense object detection,” in Proceedings of the IEEE international
conference on computer vision, 2017, pp. 2980–2988.
[26] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature
hierarchies for accurate object detection and semantic segmentation,”
in Proceedings of the IEEE conference on computer vision and pattern
recognition, 2014, pp. 580–587.
[27] R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE international
conference on computer vision, 2015, pp. 1440–1448.
[28] J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders,
“Selective search for object recognition,” International journal of com-
puter vision, vol. 104, no. 2, pp. 154–171, 2013.
[29] C. L. Zitnick and P. Dollár, “Edge boxes: Locating object proposals from
edges,” in European conference on computer vision. Springer, 2014,
pp. 391–405.
[30] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie,
“Feature pyramid networks for object detection,” in Proceedings of the
IEEE conference on computer vision and pattern recognition, 2017, pp.
2117–2125.
[31] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” in Proceedings of the IEEE conference on computer vision
and pattern recognition, 2016, pp. 770–778.
[32] P. O. Pinheiro, R. Collobert, and P. Dollár, “Learning to segment object
candidates,” arXiv preprint arXiv:1506.06204, 2015.
[33] P. O. Pinheiro, T.-Y. Lin, R. Collobert, and P. Dollár, “Learning to refine
object segments,” in European conference on computer vision. Springer,
2016, pp. 75–91.
[34] Y. Wu, A. Kirillov, F. Massa, W.-Y. Lo, and R. Girshick, “Detectron2,”
https://github.com/facebookresearch/detectron2, 2019.
[35] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., "Pytorch: An imperative style, high-performance deep learning library," arXiv preprint arXiv:1912.01703, 2019.
[36] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft coco: Common objects in context," in European conference on computer vision. Springer, 2014, pp. 740–755.