
Volume 6, Issue 4, April 2021
International Journal of Innovative Science and Research Technology
ISSN No: 2456-2165
Facemask Detection using MMdetection Toolbox


Mukul Kumar Vishwas
Department of Mathematics and Mechanics, Novosibirsk State University, Novosibirsk, Russia

Petr Menshanov
Novosibirsk State University, Novosibirsk, Russia

Aleksey Okunev
Vice Director, NSU Higher College of Computer Science, Novosibirsk, Russia

Abstract:- This paper describes and compares two object detection models for face mask detection. Using object detection, the model predicts whether a person in a picture is wearing a mask correctly, incorrectly, or not at all. In the current situation such a model is extremely useful, since this simple precaution helps stop the spread of the deadly coronavirus. The paper gives a comprehensive description of the two operational models: how data flows through each model, the operations performed on the data, how the input data was annotated, and the resulting output. Results are reported using the mAP metric.

Keywords:- MMdetection, Detectron2, COVID-19, Object Detection, Coronavirus, mAP.

I. INTRODUCTION

The COVID-19 pandemic, widely known as the coronavirus pandemic, is caused by Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2). The World Health Organization declared the outbreak a public health emergency on 30 January and a pandemic on 11 March [1]. It has had a severe social and economic impact and has cost more than 1 million lives [9]. On the other hand, by following some simple rules (listed below), the spread of this virus can be stopped:
• Face masks and respiratory hygiene.
• Social distancing.
• Self-isolation.

The goal of this work is to compare two object detection models and train the best neural network to discriminate between people who follow sanitary rules, such as wearing a face mask properly, and those who violate them. By following this simple rule the virus can be stopped from spreading; until a vaccine is available, these measures are our only option for staying safe. As of October, 321 vaccine candidates were under development, but none had completed the clinical trials proving its safety and efficacy [10]. The described model should be able to perform the following tasks:
• Identify the position of each person.
• Classify each person as wearing a correct mask, no mask, or an incorrect mask.
• Output a bounding region with the probability/confidence of the model's prediction (ranging from 0 to 1).
• Output a segmentation mask on the confident region.

II. MMDETECTION

A. What is MMdetection
As stated in the paper published by Chen et al. in 2019, "MMDetection is an object detection toolbox that contains a rich set of object detection and instance segmentation methods as well as related components and modules" [2]. The major features of MMdetection are:
• Modular design: Thanks to its design, the detection framework can easily be changed, and a flexible, customized version can be created as required by combining different kinds of modules such as the backbone, neck, and RoI extractor.
• Support of multiple frameworks: The toolbox and its simple architecture make it very easy to use; additionally, it provides a large variety of detection frameworks such as Fast R-CNN, Faster R-CNN, Mask R-CNN, RetinaNet, DCN, etc.
• High efficiency: All operations (masking, bounding box creation, prediction) run on GPUs, so the training speed is faster than or comparable to other codebases, including Detectron, maskrcnn-benchmark, and SimpleDet. It also provides weights for more than 200 network models.

B. Architecture
Although the model architectures of different detectors differ, they share common components, which can be roughly summarized into the following classes (a sketch of how they compose follows this list):
• Backbone: The part that transforms an image into feature maps, such as a ResNet-50 [8] without the last fully connected layer.
• Neck: The part that connects the backbone and the heads. It performs refinements or reconfigurations of the raw feature maps produced by the backbone. An example is the Feature Pyramid Network (FPN).
• Dense Head (Anchor Head / Anchor-Free Head): The part that operates on dense locations of the feature maps, including anchor-based and anchor-free heads, e.g., RPNHead, RetinaHead, FCOSHead.
• RoIExtractor: The part that extracts features for a region of interest from a single feature map or multiple feature maps. An example that extracts RoI features from the corresponding level of the feature pyramid is SingleRoIExtractor.
• RoIHead (BBoxHead / MaskHead): The part that takes RoI features as input and makes RoI-wise task-specific predictions, such as bounding box classification/regression or mask prediction.
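To illustrate how these components fit together in a two-stage detector, below is a minimal, framework-agnostic sketch in Python. It is not MMdetection's actual API; the class and method names are placeholders chosen for clarity, and each stage simply delegates to the corresponding module described above.

    # Illustrative composition of a two-stage detector's forward pass.
    # All module objects here are hypothetical placeholders, not mmdet APIs.
    class TwoStageDetector:
        def __init__(self, backbone, neck, dense_head, roi_extractor, roi_head):
            self.backbone = backbone            # e.g. ResNet-50 without the FC layer
            self.neck = neck                    # e.g. FPN
            self.dense_head = dense_head        # e.g. an RPN head proposing regions
            self.roi_extractor = roi_extractor  # picks features per region of interest
            self.roi_head = roi_head            # classifies/regresses each RoI

        def forward(self, image):
            feats = self.backbone(image)          # image -> raw feature maps
            pyramid = self.neck(feats)            # raw maps -> multi-scale maps
            proposals = self.dense_head(pyramid)  # dense maps -> candidate boxes
            roi_feats = self.roi_extractor(pyramid, proposals)
            return self.roi_head(roi_feats)       # per-RoI class, box, (mask)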



Fig. 1. Framework of single-stage and two-stage detectors.

C. ResNet-50 with FPN
In this section, the architecture of the object detector is explained with the Feature Pyramid Network (FPN); Fig. 2 shows how FPN works.

Identifying components of different scales and complexity is a difficult task. It can be tackled by processing the same picture at different sizes/scales, but this approach has some disadvantages, including high memory demand and high time consumption [3].

For this problem FPN has an elegant approach: it constructs a feature pyramid and uses it for object recognition. The Feature Pyramid Network (FPN) is a feature extractor designed with accuracy and speed in mind.

It substitutes the feature extractor of detectors such as Faster R-CNN and generates multiple feature map layers (multi-scale feature maps) with higher-quality information than the standard feature pyramid for object detection. FPN is built from a bottom-up and a top-down pathway, as shown in Fig. 2. The image resolution decreases as we go up the pyramid, while the semantic meaning of every layer increases as more high-level structures are detected.

Fig. 2. Feature Pyramid Network.

FPN extracts feature maps that are later fed into a detector, say an RPN, for object detection. The RPN applies a sliding window over the feature maps to predict, at each location, whether there is an object and what its bounding box is [3]. In the FPN framework, for each scale level a 3 × 3 convolution filter is applied over the feature maps, followed by separate 1 × 1 convolutions for object prediction and bounding box regression. These 3 × 3 and 1 × 1 convolutional layers are called the RPN head. The same head is applied to all scale levels of the feature maps.

D. Formula to pick a feature map
The equation for determining which feature map to use depends on the RoI's width w and height h:

k = k0 + log2(√(wh) / 224)    (1)

where k0 = 4 and k denotes the Pk layer in the FPN used to generate the feature patch. If the model assigns k = 2, it uses P2 as the feature map: RoI pooling is performed on it, and the result is fed to the head of the framework in use, e.g. the Fast R-CNN head (Fast R-CNN and Faster R-CNN have the same head), to finish the prediction.

E. Comparison
Table 1 compares several characteristics of the detector with and without FPN.
AR (average recall): the ability to capture objects.
Inference time: the time taken for a prediction.

Table 1: Comparison of Features With and Without FPN
Feature              | Without FPN | With FPN
Training time        | Normal      | Increased
Dataset requirement  | Big         | Small
Test/validation time | High        | Low
Accuracy             | Normal      | Increased
AR                   | 44.9        | 56.3
Inference time       | 0.32 sec.   | 0.148 sec.

III. MODEL

A. Model composition
In this section, a detailed description of the model used for mask detection is given. The model is described part by part, and in every section the corresponding architecture is given in dictionary format for ease of understanding.
• Backbone: The backbone of this model was created using ResNet-50 [8] with batch normalization. It is used for feature extraction from the image and can easily be replaced by another feature extraction network.
• Neck: The neck used in this model is an FPN; its full description is below:

    neck = dict(type='FPN',
                in_channels=[256, 512, 1024, 2048],
                out_channels=256,
                num_outs=5)
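The neck config above produces a five-level pyramid, and equation (1) from Section II-D decides which level serves each RoI. Below is a minimal sketch of that assignment rule in Python; the flooring and the clamping of k to the available levels follow the original FPN paper [3], and the function and argument names are illustrative, not MMdetection API.

    import math

    def fpn_level(w, h, k0=4, k_min=2, k_max=5):
        # Equation (1): k = k0 + log2(sqrt(w*h) / 224).
        # Flooring and clamping to the available P2..P5 levels follow
        # the original FPN paper [3].
        k = math.floor(k0 + math.log2(math.sqrt(w * h) / 224))
        return max(k_min, min(k, k_max))

    print(fpn_level(224, 224))  # a 224 x 224 RoI maps to the canonical P4
    print(fpn_level(56, 56))    # a small 56 x 56 RoI maps to the finer P2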



• Dense head: The dense head of this architecture uses RPNHead, which extracts features from the dense locations of the image:

    rpn_head = dict(type='RPNHead',
                    in_channels=256,
                    feat_channels=256)

• RoI head: The RoI head is a very interesting part of the model; here CascadeRoIHead was used. The main job of the RoI head is to handle the regions of interest in which the model should detect the object. Its description is below:

    roi_head = dict(type='CascadeRoIHead',
                    num_stages=3)

Table 2: Model Comparison of MMdetection and Detectron2
Module        | MMdetection | Detectron2
Base model    | ResNet-50   | ResNet-50
Neck (FPN)    | Yes         | Yes
Training time | More        | Less
RoI           | Yes         | Yes
Learning rate | 0.02        | 0.02
mAP           | 0.158       | 0.139

• Optimizer: The stochastic gradient descent (SGD) optimizer was used in this model. SGD is an iterative method for optimizing an objective function with adequate and appropriate (e.g. differentiable or subdifferentiable) smoothness properties. It can be described as a stochastic approximation of gradient descent optimization, since it replaces the actual gradient (calculated from the whole data set) with an estimate (calculated from a randomly selected subset of the data) [4]. In high-dimensional optimization problems this approach is very useful, as it reduces the computational burden and achieves faster iterations in exchange for a lower convergence rate.
• Process of training: Below is the flow of data during the training cycle, showing all the main operations and transformations performed on the data:

    train_pipeline = [
        dict(type='LoadImageFromFile'),
        dict(type='LoadAnnotations', with_bbox=True),
        dict(type='Resize', img_scale=(1333, 800), keep_ratio=True),
        dict(type='RandomFlip', flip_ratio=0.5),
        dict(type='Normalize', **img_norm_cfg),
        dict(type='Pad', size_divisor=32),
        dict(type='DefaultFormatBundle'),
        dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
    ]

• Process of testing: Below are all the operations performed on the data during testing:

    test_pipeline = [
        dict(type='LoadImageFromFile'),
        dict(type='MultiScaleFlipAug',
             img_scale=(1333, 800),
             flip=False,
             transforms=[
                 dict(type='Resize', keep_ratio=True),
                 dict(type='RandomFlip'),
                 dict(type='Normalize', **img_norm_cfg),
                 dict(type='Pad', size_divisor=32),
                 dict(type='ImageToTensor', keys=['img']),
                 dict(type='Collect', keys=['img'])
             ])
    ]

B. Model loss during training
To fine-tune our model, we used cross-entropy loss. Cross-entropy loss, or log loss, measures the performance of a classification model whose output is a probability value between zero and one. Cross-entropy loss increases as the predicted probability diverges from the actual label: estimating a probability of .015 when the actual observation label is 1 would be bad and would result in a high loss value. A perfect model would have a log loss of 0 [5] (Fig. 3).

Fig. 3. Log loss.

The graph in Fig. 3 shows the range of possible loss values given a true observation (masked = 1). As the predicted likelihood approaches 1, the log loss decreases steadily. However, the log loss increases rapidly as the predicted likelihood decreases. Log loss penalizes both types of errors, but especially predictions that are confident and wrong; a worked example follows this section.

Cross-entropy and log loss are slightly different depending on context, but when calculating error rates between 0 and 1 in prediction they resolve to the same thing.

Fig. 4 shows the epochs-versus-loss graph for our model.
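As a worked illustration of the "confident and wrong" penalty discussed above, the snippet below evaluates binary cross-entropy by hand for a few predicted probabilities against the true label 1 (masked). The probability values are invented for illustration only.

    import math

    def log_loss(y_true, p):
        # Binary cross-entropy for one sample: -[y*log(p) + (1-y)*log(1-p)].
        return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

    # True label: 1 (masked). Confident-correct predictions cost almost
    # nothing; confident-wrong predictions are penalized heavily.
    for p in (0.99, 0.9, 0.5, 0.1, 0.015):
        print(f"p = {p:5.3f} -> loss = {log_loss(1, p):.3f}")
    # p = 0.990 -> loss = 0.010
    # p = 0.900 -> loss = 0.105
    # p = 0.500 -> loss = 0.693
    # p = 0.100 -> loss = 2.303
    # p = 0.015 -> loss = 4.200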



We trained our model for 200 epochs; the loss value was recorded every 25 epochs to draw the graph.

Fig. 4. Loss during training.

IV. DETECTRON2

A. What is Detectron2
Detectron2 is Facebook AI Research's next-generation software system. It is a ground-up rewrite of the previous version of Detectron, and it originated from maskrcnn-benchmark [11].

The major features of Detectron2 are:
• It is based on PyTorch.
• It has more features, such as panoptic segmentation, DensePose, Cascade R-CNN, rotated bounding boxes, PointRend, DeepLab, etc.
• It can easily be integrated with different projects because of its usability as a library.
• Less training time.
• Models can be exported to TorchScript or Caffe2 format for deployment.

The main components of Detectron2 are shown in Fig. 5.

Fig. 5. Detectron2 architecture.

Both models, Detectron2 and MMdetection, share some common modules, such as FPN, RPN, RoI pooling, and the backend network (neck); all of these are explained in the MMdetection section.

We used ResNet-50 for feature extraction with a learning rate of 0.02.

V. DATASET

To train the model, annotated images are needed so that the model can extract features from the images and distinguish between masked and unmasked faces. We used a total of 8982 annotated images to train and validate the model, and we used the LabelImg software to annotate all images (Fig. 6). It creates an individual JSON file for each image file according to the annotation; these JSON files contain information such as the file/image name, the category (one or more than one), the bounded region in the image, the height and width of the bounded region, and the total area enclosed within the labelled region.

Fig. 6. Example of an annotated image.

All the individual JSON files were converted into one single COCO file using the labelme2coco converter package in Python. We need the images to be in the Common Objects in Context (COCO) format, which stores the annotation details for the bounding boxes in JSON. The main components of the COCO file are:
– Info: Contains high-level information about the data set.
– Licenses: Contains a list of image licenses that apply to images in the data set.
– Categories: Contains a list of categories. Categories can belong to a supercategory.
– Images: Contains all the image information in the data set without bounding box or segmentation information. Image IDs need to be unique.
– Annotations: A list of every individual object annotation from every image in the data set.

The skeleton of a COCO file looks as follows (a concrete example is given after it):

    "info": info,
    "licenses": [licenses],
    "categories": [categories],
    "images": [images],
    "annotations": [annotations]
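To make that structure concrete, here is a minimal COCO-style file written as a Python dict, with one license, one category, one image, and one annotation. The field names follow the standard COCO detection format; the specific values (file name, IDs, box coordinates) are invented for illustration.

    import json

    # Hypothetical minimal COCO detection file; all concrete values
    # (names, ids, box coordinates) are invented for illustration.
    coco = {
        "info": {"description": "Face mask dataset", "version": "1.0"},
        "licenses": [{"id": 1, "name": "example license"}],
        "categories": [{"id": 1, "name": "mask", "supercategory": "face"}],
        "images": [{
            "id": 1,                       # image ids must be unique
            "file_name": "img_0001.jpg",
            "width": 1333,
            "height": 800,
        }],
        "annotations": [{
            "id": 1,
            "image_id": 1,                 # points at the image above
            "category_id": 1,
            "bbox": [420, 180, 96, 128],   # [x, y, width, height]
            "area": 96 * 128,              # area enclosed by the box
            "iscrowd": 0,
        }],
    }

    print(json.dumps(coco, indent=2))  # serialize to the actual JSON format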




VI. RESULT

The final output is displayed in Fig. 7: the model successfully predicts and discriminates between masked and unmasked faces. A masked face is further classified as Correct (mask) or Incorrect.

Fig. 7. Final output of the model.

To measure the accuracy of our model we used mAP (mean average precision), a popular metric for object detectors such as Faster R-CNN, SSD, etc. It gives a calculated value between 0 and 1.

Calculating accuracy for an object detector is a little complicated, as we have to detect both the object class and the area where the object was detected.
– Precision: Measures the accuracy of the predictions.
– Recall: Measures how well the model finds all the positives.

Precision = TP / (TP + FP)    (2)
Recall = TP / (TP + FN)    (3)

where TP = true positive, TN = true negative, FP = false positive, FN = false negative.

Another important term we have to understand is IoU (Intersection over Union) [6]. To check the correctness of our model we first have to judge the correctness of each detection; the metric that tells us the correctness of a given bounding box is the IoU. A visual representation of IoU is shown in Fig. 8.

Fig. 8. Visual representation of IoU.

Fig. 9. Calculation of IoU.

To decide TP and FP we use the IoU: we have to identify whether each detection (a positive) is correct (true) or not (false). The most commonly used threshold is 0.5, i.e. if the IoU is > 0.5 the detection is considered a true positive; otherwise it is considered a false positive. A sketch of this computation follows Table 3.

Table 3: Comparison of mAP Between MMdetection and Detectron2
Metric (IoU)          | MMdetection | Detectron2
mAP @ [IoU=0.50:0.95] | 0.158       | 0.139
mAP @ [IoU=0.50]      | 0.238       | 0.277
mAP @ [IoU=0.75]      | 0.181       | 0.109

As per the mAP values in Table 3, we chose MMdetection for our face mask detection model.
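Below is a minimal sketch of how IoU and equations (2) and (3) combine: compute the IoU of each predicted box against the ground-truth boxes, call a prediction a true positive when its IoU with an unmatched ground-truth box exceeds 0.5, and accumulate precision and recall. Boxes are in [x, y, width, height] form, and all values are invented for illustration.

    def iou(box_a, box_b):
        # Intersection over Union of two [x, y, w, h] boxes.
        ax, ay, aw, ah = box_a
        bx, by, bw, bh = box_b
        iw = max(0, min(ax + aw, bx + bw) - max(ax, bx))
        ih = max(0, min(ay + ah, by + bh) - max(ay, by))
        inter = iw * ih
        union = aw * ah + bw * bh - inter
        return inter / union if union > 0 else 0.0

    # Invented example: two ground-truth boxes, two predictions.
    gt_boxes = [[100, 100, 80, 80], [400, 200, 60, 90]]
    pred_boxes = [[105, 95, 80, 85], [300, 300, 60, 60]]

    tp, matched = 0, set()
    for pred in pred_boxes:
        # A prediction is a true positive if it overlaps an unmatched
        # ground-truth box with IoU above the 0.5 threshold.
        for i, gt in enumerate(gt_boxes):
            if i not in matched and iou(pred, gt) > 0.5:
                tp += 1
                matched.add(i)
                break

    fp = len(pred_boxes) - tp    # unmatched predictions
    fn = len(gt_boxes) - tp      # missed ground-truth boxes
    precision = tp / (tp + fp)   # equation (2) -> 0.50
    recall = tp / (tp + fn)      # equation (3) -> 0.50
    print(f"precision = {precision:.2f}, recall = {recall:.2f}")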



VII. FUTURE WORK

We are working on a pipeline of two models that performs face mask detection and person re-identification. For that purpose we use the face mask detection model's output, using the bounding box information to create a database of every person who entered the premises. We then run torchreid [12], a Python library with pre-trained person re-identification models, on our collected data.

Initially our gallery size is 21127 images, and we use the top-10 search results to calculate accuracy. If we choose 10 images randomly we get 2% accuracy, but with torchreid we are able to reach 67% accuracy. Currently we are working on improving our person re-identification model and completing the pipeline (Fig. 10).

Fig. 10. Final pipeline of the model.
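As a sketch of the re-identification step, the snippet below uses torchreid's feature extractor to embed cropped person images and rank gallery candidates by cosine similarity. The model name, file paths, and crop names are assumptions for illustration; consult the torchreid documentation for the exact setup.

    # Hedged sketch of person re-identification with torchreid [12].
    # Model name and image file paths are illustrative assumptions.
    import torch
    from torchreid.utils import FeatureExtractor

    extractor = FeatureExtractor(
        model_name='osnet_x1_0',   # a pre-trained re-ID backbone
        device='cuda',             # a model_path to weights may also be given
    )

    # Crops produced from the face mask detector's bounding boxes.
    query_feats = extractor(['crops/query_person.jpg'])
    gallery_feats = extractor(['crops/g1.jpg', 'crops/g2.jpg', 'crops/g3.jpg'])

    # Rank gallery crops by cosine similarity to the query embedding.
    sims = torch.nn.functional.cosine_similarity(query_feats, gallery_feats)
    top10 = sims.argsort(descending=True)[:10]
    print(top10)  # indices of the best-matching gallery crops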

REFERENCES

[1]. "Naming the Coronavirus Disease (COVID-19) and the Virus That Causes It." World Health Organization, [Link] (accessed October 19, 2020).
[2]. K. Chen, J. Wang, J. Pang, Y. Cao, Y. Xiong, X. Li, Z. Zhang, et al., "MMDetection: Open MMLab Detection Toolbox and Benchmark," arXiv preprint arXiv:1906.07155, 2019.
[3]. T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature Pyramid Networks for Object Detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2117-2125, 2017.
[4]. Wikipedia contributors, "Stochastic gradient descent," Wikipedia, The Free Encyclopedia, [Link] (accessed October 19, 2020).
[5]. Wikipedia contributors, "Cross entropy," Wikipedia, The Free Encyclopedia, [Link] (accessed October 19, 2020).
[6]. T. Shah, "Measuring Object Detection Models, mAP: What Is Mean Average Precision?" Tarang Shah Blog, 26 Jan. 2018, [Link].
[7]. J. Hui, "mAP (Mean Average Precision) for Object Detection," Medium, 3 Apr. 2019, [Link] (accessed October 19, 2020).
[8]. K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.
[9]. Wikipedia contributors, "COVID-19 pandemic by country and territory," Wikipedia, The Free Encyclopedia, [Link] (accessed October 20, 2020).
[10]. Wikipedia contributors, "COVID-19 vaccine," Wikipedia, The Free Encyclopedia, [Link] (accessed October 20, 2020).
[11]. Y. Wu, A. Kirillov, F. Massa, W.-Y. Lo, and R. Girshick, "Detectron2," [Link], 2019.
[12]. K. Zhou and T. Xiang, "Torchreid: A Library for Deep Learning Person Re-Identification in Pytorch," arXiv preprint arXiv:1910.10093, 2019.

