0% found this document useful (0 votes)
80 views13 pages

Mask R-CNN for Indoor Scene Segmentation

This document discusses using Mask R-CNN for multi-class semantic segmentation on fused indoor images containing both depth and RGB information. Mask R-CNN combines region proposal, classification, and segmentation into a single network. The authors enhance Mask R-CNN by improving the fusion of depth and RGB images to handle illumination variations better and by enhancing non-maximum suppression and loss functions for multi-class object detection. The improved model achieves excellent performance on indoor scene instance segmentation, clearly separating objects of various sizes under uneven lighting conditions.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
80 views13 pages

Mask R-CNN for Indoor Scene Segmentation

This document discusses using Mask R-CNN for multi-class semantic segmentation on fused indoor images containing both depth and RGB information. Mask R-CNN combines region proposal, classification, and segmentation into a single network. The authors enhance Mask R-CNN by improving the fusion of depth and RGB images to handle illumination variations better and by enhancing non-maximum suppression and loss functions for multi-class object detection. The improved model achieves excellent performance on indoor scene instance segmentation, clearly separating objects of various sizes under uneven lighting conditions.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
You are on page 1/ 13

1

Instance Segmentation and Object Detection Using


Mask RCNN
Dr.N.Balaganesh, S.Aadith thillai arasu, S.Pranav and S.Prathes

I. INTRODUCTION
Abstract—The State-of-the-Art for object recognition has been
improving, and many algorithms are advancing to the point

O
where only a few number of them are capable of accurately
detecting objects in fused images, which are made up of depth
and RGB images together. In the current study, outdoor scenes bject detection is a hot topic of computer vision and
and single-task semantic segmentation are primarily trained
Finding interesting items in pictures or videos and concurrently
using color image data but the Mask-RCNN technique is used in
this research to implement a multi-class semantic segmentation
model in a challenging indoor environment is done for fused
determining their location and size is the major goal of object
image. We have shortened these detection networks running detection. Object detection combines segmentation and item
several times, revealing region proposal computation as a recognition by basing picture segmentation on the geometrical
bottleneck. In this study, we introduce the Region Proposal
and statistical properties of the object. The complete system's
Network (RPN), which collaborates with the detection network to
object identification capabilities, particularly its precision and
share full-image convolutional features and enable almost cost-
free region suggestions. A fully convolutional network known as
real-time speed, are crucial. Automatic object extraction and
an RPN predicts object limits and abjectness ratings at each
point at the same time. The RPN is fully trained to producerecognition are crucial when several objects need to be handled
superior region suggestions, which Fast R-CNN uses for in real-time, especially in complicated scenarios. The usage of
detection. By combining the convolutional features of RPN and
object detection in artificial intelligence, facial recognition,
Fast R-CNN into a single network, we further combine the two
autonomous driving, and other areas has increased significantly
networks. Following that, using a Region of Interest pooling
layer, the suggested areas are mapped to the feature map and the
in recent years. The object detection techniques that are now in
fixed-size proposal boxes (RoI). Then, after performing certain
use include both conventional and deep learning-based
classification tasks using the classification network, we compute
the bounding box regression. The approach of fusing RGB photos
algorithms. The majority of conventional object identification
and depth images is enhanced in light of the impact of ambient
methods rely on feature point matching or sliding window
illumination variations. It enhances the effectiveness of model
training while also improving the fusion image feature frames. The absence of pertinence when employing sliding
information. Simultaneously, the non-maximum value
suppression technique is enhanced to enhance the model's
windows for area selection results in significant temporal
complexity and window redundancy, despite the fact that this
performance in order to fulfil the requirements for working on
multi-class objects. The loss function has also been revised and
strategy has produced positive outcomes. Additionally, manual
optimised to realise the output of the model's multi-class
feature selection-based procedures are frequently not particularly
information. In addition to having excellent performance and
reliable. Deep neural network-based object recognition
high efficiency, the indoor scene instance segmentation model
developed in this study can clearly separate the shapes of objects
algorithms have replaced older ones that relied on manually
of various sizes and adapt to the interior environment's uneven
illumination. chosen characteristics as a result of advancements in deep
learning technology. Deep neural network-based detection
Inde x Terms — Re sne t 101 , RPN( Re gi on al Proposal techniques fall primarily into two categories: one is a one-stage
Network),FCN(Fully Connected Neural Network),ROI(Region of
Interest),IoU(Intersection over Union),NMS(Non-Maximum algorithm that turns object identification into a regression issue,
suppression)ROI align and the other is a two-stage strategy that combines area proposal
with convolutional neural networks (CNN), such as R-CNN used
a large-capacity CNN to analyse bottom-up area suggestions .
The R-CNN had an impact on all subsequent two-stage
algorithms and contributed to the progressive mainstreaming of
CNN-based object detection algorithms.The application of deep
learning has improved detection accuracy and detection speed.
2

based on R-CNN, which decomposed the output of the fully II. LITERATURE SURVEY

connected layer into (SVD) to obtain two output vectors: the Neural Networks is one of the prominent machine learning
classification score of soft-max and the window regression algorithms nowadays. So these Neural Networks and Deep
of the bounding rectangle of the Bounding-box. This layer learning have definitively demonstrated that they can exceed
simplified the spatial pyramid pooling (SPP) layer to the other algorithms when comes to accuracy and speed been at
region of interest (ROI) Pooling layer on the basis of SPP- the same time colossal amount of data has to be processed by
Net. The classification problem and the bounding box them i.e. incase of object detection methods. Actually any
regression problem are combined in this approach neural network comprises of neurons and activation functions
enhancement. Fast R- CNN significantly increases detection which are referred as basic building blocks of a neural
performance while consuming less storage space by network. First and foremost in order to understand a neural
switching from singular value decomposition (SVM) to network we need to checkout the layers that are present in a
soft-max and storing all of the features in the video neural network .It is an assortment of neurons when fed up
memory. The following issue, however, cannot be resolved with inputs leads to formation of outputs. Generally it has one
by R-CNN or Fast R-CNN: employing selective search and input layer ,one output layer and middle layer of nodes is
comparable approaches to choose region proposals would known as hidden layer.
result in a high number of erroneous regions. It causes
Generally neural network with more than one hidden layer is
inefficiencies and computer power waste. On the basis of
referred as deep neural network . Some applications require
Fast R-CNN, Ren et al. proposed replacing the Selective
high processing power such as image processing and object
Search method with Region Proposal Networks (RPN).
identification paved the way for development of deep neural
Scene Instance segmentation has more challenging
networks such as convolutional neural networks .So its named
problems when attempting to provide a preset semantic
after the hidden layers that comprises of convolutional layers,
category pixel label for a scene picture or video when
pooling layers, normalization layers and fully connected layers
compared to image segmentation for a single item.The
and also one of the downside of general CNN is that it can
abundance of instance categories, mutual occlusion, uneven
describe the class of objects present in that scene and it is
illumination, resemblance of various objects, and other
achievable to regress bounding boxes from the cnn. It can be
issues in interior scenes provide several obstacles. However,
done for one object at a time to may not give information
as instance segmentation uses a lot of computational power,
where the objects located but it can be done .For suppose
it is important to find ways to make the algorithm more
conglomerate of objects in the scene/field of view bounding
effective, particularly on edge devices. Many academics are
box regression may not work well due to the interference.
interested in interior scene comprehension, which is closely
connected to indoor scene semantic segmentation, as a In case of rcnn where cnn is contrived to concentrate on a
result of the widespread use of service robots. Based on this, single region of image/ frame in a video. So there is a
this study contributes to reducing the quantity of fusion contraction of interference to a maximum extent and here
information in order to minimise the consumption of image or frame in a video is divided into nearly 2000 regions
equipment computation, with a focus on the process of of recommendations . After that cnn is enforced for each and
fusing colour information and depth information. every region of image or frame in a video because only Single
The key contributions of this work are as follows: object of interest influence in a given region. Thereupon that a
(i) A multi-task instance segmentation model for multi- Selective search algorithm detects the regions in a given image
scale target is proposed on the basis of the Mask or video and pursued by rescaling leads to formation of regions
RCNN model of same sizes prior victualed to cnn for classification and
bounding box regressor when compared to RCNN here Fast
(ii) A fusion method of depth image and color image is RCNN uses single convolutional neural networks instead of
applied to improve the performance of the model. 2000 convolutional networks used by rcnn for each image and
fast rcnn uses selective search algorithm i.e softmax which
(iii) An improved NMS method is proposed for exceed in terms of performance when compared to svm
better selecting of local candidate regions. algorithm used by rcnn and also inorder to increase object
recognition accuracy fast rcnn employs multitask loss on
(iv) The performance of the proposed algorithm is training of deep convolutional neural networks. So here after
analyzed and verified with results of several rcnn fast rcnn, faster rcnn were discovered
experiments.
Here image is given as input to cnn which provides
convolutional feature map .Faster RCNN is faster when
3

compared to fast rcnn , rcnn because here selective search layers, Region Proposal Networks, Roi Pooling, and
algorithm is not deployed to predict and recognize the Classification [3]. To extract the feature map, a PxQ picture of
region of proposals and then they are mutated here after arbitrary size is first scaled to a fixed size MxN and delivered
that adopted by ROI pooling layer. So that is used to to Conv layers. The next RPN layer and fully linked layer both
classify the image with in prospective region to forecast the share this feature map. The Region Proposal Network then
offset values of bounding boxes. Mask rcnn is almost receives the feature maps and generates Region of Interest
similar to faster rcnn but some improvements are made. (ROIs). RPN first utilises soft-max to detect whether the
When compared to faster rcnn mask rcnn employs a distant anchors are positive or negative, and then it corrects the
branch was included following roi pooling layer and roi anchors in order to apply bounding box regression to get an
align. So that the branch added is binary the described pixel accurate proposal. The fully connected layer receives the
will give response to query i.e. whether to mask or not .So feature maps and proposals from the Roi Pooling layer,
if it is one it will be masked and if it is zero it so after processes them to extract the proposal feature maps, and then
getting won’t be masked so after getting feature map from uses them to identify the object category. The Classification
faster rcnn . We move to the masking step. component then determines the category of the proposal using
In the mask step of Mask R-CNN, a binary mask is proposal feature maps while simultaneously determining the
predicted for each detected object in the image. This mask exact end position of the detection frame using bounding box
identifies the pixels that belong to the object and those that regression. Deep learning is the most widely used technique
do not. This segmentation is performed in addition to the for instance segmentation of images. Due to the presence of
bounding box detection step, which identifies the location various lighting environments and scene mesoscale issues,
of the object in the image. segmentation efficiency, accuracy, and visualisation effect
cannot completely satisfy the criteria while dealing with
To produce these segmentation masks, the mask step in interior scenes. The form of the candidate areas, the quantity of
Mask R-CNN uses a fully convolutional neural network network computations, and the high number of created target
(FCN) architecture. The FCN takes as input the features candidate regions are still issues with RCNN. To lessen the
extracted by the region proposal network (RPN) and number of repeated computations in the network's feature
produces a mask for each detected object. The FCN is extraction procedure, He [27] et al. constructed a spatial
trained end-to-end with the rest of the network using a pyramid pooling network (SPPNet) and attached it to the back
multi-task loss function that combines the losses for object of the convolutional layer. To swiftly provide superior
detection and segmentation. candidate areas, Ren et al. added a Region Proposal Network
The mask step in Mask R-CNN is important because it (RPN) to the Fast-RCNN network. A pyramid module
allows the model to perform instance segmentation, which comprising the contextual information of several areas was
involves not only detecting the object but also segmenting proposed by Chen [29] et al., and it increased the calibre of
it at the pixel level. This can be useful in many tasks involving scene comprehension. Long [and others created
applications, such as medical imaging, where precise a fully convolutional neural network structure to address the
segmentation of objects is important for diagnosis and issues of low computing efficiency and large storage cost of
treatment planning. picture instance segmentation based on deep learning in the
early stages (FCNs). To summarise, the most common
technique for achieving picture instance segmentation is to
apply deep learning. Due to the presence of various lighting
III RELATED WORK environments and scene mesoscale issues, segmentation
Prior to the widespread adoption of deep learning efficiency, accuracy, and visualisation effect cannot completely
technology, traditional image semantic segmentation satisfy the criteria while dealing with interior scenes. As a
primarily carried out related operations on the target region result, this work focuses primarily on the deep learning
of the image, using artificially created feature extractors to approach, further strengthens the rgb-d information fusion
extract pertinent features like texture, colour, and shape of approach, minimises the quantity of information needed for
the image, which would be sent to a classifier (such as image processing, and increases the model's effectiveness. The
SVM, etc.) or other intelligent algorithms to predict the model's stability in the interior setting is improved at the same
outcome. Target category in the image. However, these time by enhancing another pertinent method.
methods often contain less information due to single
features, resulting in unsatisfactory segmentations.
According to the description of Ren et al. The four steps of
the original paper's Faster R-CNN procedure include Conv
4

IV PROPSED WORK

3.1 Fusing RGB-D image


the ground and angle between the surface)
The combination of RGB data with depth data seems
to hold promise because depth data suffer from low depth
resolution and are resilient to changes in light. Rich in colour
and texture, with a high angular resolution, image data
degrade fast under less-than-ideal lighting. Indoor scenes
frequently have uneven lighting, which interferes with the
extraction of RGB picture features and results in a loss of
feature information. This work integrated the colour and
depth photos early to provide a more feature-rich image
training in order to address the issue. The depth pictures in
the NUYDv2 dataset were transformed into HHA using Fig.1. RGB Image Fig.2. Depth Image
advice from Saurabh Gupta [32] et al. pictures in three
channels (horizontal difference, height to the ground and the
angle of the surface normal vector). This study examined
RGB pictures' frequency domain properties. In order to
combine the colour and depth images to create the fused
HHG image, this article opts to remove the third channel
from the HHA image and use the colour image's grayscale Fig.3. HHG Image Fig.4. HHA Image
version in place of the A channel (the first H is the horizontal
parallax, the second H is the height to the ground, and the G 3.2 Choosing backbone network
channel is the grayscale image of the colour image which can
reduce the amount of information while preserving image The feature extractor for the backbone network was
features as much as possible). It is not essential to train the chosen to be the common convolutional neural network
colour and depth pictures individually during training since (ResNet50 or ResNet101). ResNet, also known as
the HHG image completely takes into account the Residual Network, is a network that adds a direct
information from both photos. Better experimental findings connection channel in accordance with the Highway
will be attained when the model training speed is Network. It was first suggested by He [33] et al. in
increased.The Fig 1 represents the rgb image Fig 2 represents 2015.A fraction of the output from the prior network layer
the depth image Fig 3 represents HHG(horizontal disparity, can be obtained thanks to highway networks. As a result,
height above the ground and grey scale of the rgb-image) and rather of having to learn the entire output from the
Fig 4 represents the HHA(horizontal disparity, height above previous network layer, the present one may only learn
5

what is left of it. ResNet primarily creates the Skip classification.


Connection structure to obtain the better identity mapping Overall, ResNet-101 is a powerful and widely used
capabilities, enabling the network to be enlarged and its architecture for image-related tasks, with a strong
performance to be enhanced. performance record in several benchmarks.

3.3 Region Proposal Networks (RPN)

Fig. 5. Resnet-101
The Highway Network make it possible to gain a certain
proportion of the output of the previous network layer.
Therefore, the current network layer can learn the
remaining output of the before one rather than the
whole output. As shown in Fig. 5, define the underlying
Fig. 6. RPN Network
mapping so that the superimposed nonlinear layer
satisfies another mapping The Region Proposal Network (RPN) is a key component of
H(xinput) = G(xinput) − xinput, So the original feature is the Faster R-CNN object detection architecture that proposes
mapped to H(xinput) + xinput. The whole process can be object regions for further processing. The RPN takes an image
expressed as: as input and outputs a set of object proposals, each with an
youtput = H(xinput,{ωi})+xinput (1) objectness score. The RPN generates object proposals by
sliding a small network (usually a small CNN) over the
represents the residual block in a ResNet architecture,
convolutional feature map output by the backbone network. At
where xinput is the input to the block, H is a sequence of
each sliding position, the network simultaneously predicts
convolutional and activation layers that learn the residual
mapping, ωi are the learnable weights of the layers in H, objectness scores and object bounding box coordinates relative
and youtput is the output of the block. The addition of to a set of predefined anchor boxes which is shown in Fig 6.
xinput to the output of H forms a residual connection,
Mathematically, the RPN can be expressed as follows:
which allows the gradient to flow easily during
backpropagation and enables the network to learn deeper Let F be the convolutional feature map output by the backbone
and more complex features. network, and let x be a pixel location on F. At location x, the
The ResNet-101 architecture is composed of residual RPN applies a small network with parameters ω to a local
blocks, which allow for the construction of much deeper region of the feature map. The output of the network is a set of
networks than were previously feasible. The residual k bounding box offsets and objectness scores, denoted by {Δk,
blocks enable the network to learn residual mappings, pk}. The k bounding box offsets are defined as:
which are the difference between the input and output of a
block. The residual mappings are then added to the input Δk = (Δxk, Δyk, Δwk, Δhk) (2)
of the block to produce the output. This approach enables
the network to learn the identity mapping when it is where (Δxk, Δyk) is the offset from the center of the k-th
optimal to do so, thereby alleviating the problem of anchor box to the center of the proposed bounding box, and
vanishing gradients and enabling the construction of (Δwk, Δhk) is the width and height of the proposed bounding
deeper networks. ResNet-101 consists of a deep box relative to the size of the k-th anchor box.
convolutional backbone with 101 layers, followed by a
The objectness score pk represents the probability that an
region proposal network (RPN) and a region-based fully
convolutional network (R-CNN) for object detection. The object is present in the proposed bounding box. A higher score
backbone comprises several stages, each of which indicates a higher probability of objectness.The RPN generates
contains a sequence of residual blocks. The first stage a set of object proposals by applying a non-maximum
starts with a 7x7 convolutional layer followed by a 3x3 suppression (NMS) algorithm to the set of proposed bounding
max pooling layer. Subsequent stages down-sample the boxes. The NMS algorithm selects the proposals with the
feature maps by a factor of 2 and increase the number of highest objectness scores and eliminates highly overlapping
filters in each block. The final stage ends with a global proposals. Overall, the RPN is a powerful tool for object
average pooling layer and a fully connected layer for proposal generation and plays a critical role in the success of
6

Faster R-CNN and related object detection architectures. IoU threshold leads to fewer false positives but may result
in missed detections, while a lower IoU threshold leads to
more false positives but also more detections.
3.3.1 Non-maximum suppression algorithm
(NMS)
Non-Maximum Suppression (NMS) is a post-
processing step used in object detection algorithms to
eliminate duplicate detections of the same object. The
3.4 RoIAlign
basic idea behind NMS is to select the bounding box with
RoI Align is a method for aligning features with
the highest confidence score and eliminate other
the RoI boundaries in object detection and instance
bounding boxes that overlap with it above a certain
segmentation. It is an improvement over RoI pooling,
threshold.
which can cause misalignments between the RoI and the
The algorithm for NMS can be described with the
features due to the quantization of the RoI boundaries.
following steps:
RoI Align avoids this issue by computing the exact
locations of the features at the RoI boundaries using
1)Sort the detected bounding boxes by their confidence
bilinear interpolation.
scores in descending order.
The RoI Align operation can be defined as follows:
2)Select the bounding box with the highest confidence
score and save it as a detection result. Divide the RoI into equal-sized sections (e.g.,
7x7).
3)Calculate the Intersection over Union (IoU) between For each section, compute the bilinearly interpolated
the selected bounding box and all the remaining bounding value of the feature map at the corresponding location in
boxes. the input image. Pool the interpolated values using max
pooling to obtain a fixed-sized output. The output of RoI
4)Eliminate the bounding boxes with an IoU greater than Align can be fed into a fully connected layer or another
a pre-defined threshold, as they are considered to be type of head for object detection or instance
duplicates of the selected bounding box. segmentation.

5)Repeat steps 2-4 until all bounding boxes have been The formula for RoI Align can be expressed as:
processed.
RoI Align(x,k) = Σi,j w(i,j) * x(y(i,j))
(4)
3.3.2 Intersection-over-Union (IoU)
Intersection over Union (IoU) is a metric commonly where x is the input feature map, k is the RoI, y(i,j) is the
used in object detection to evaluate the overlap between location of the i-th row and j-th column of the RoI, and
the predicted bounding boxes and the ground truth w(i,j) is the bilinear interpolation weight of the
bounding boxes. It is defined as the ratio of the area of corresponding location. The output of RoI Align is a
overlap between the predicted and ground truth bounding fixed-sized feature map of the same spatial resolution for
boxes to the area of union between the two bounding each RoI. This can be visualized in Fig 7
boxes.
IoU = (Area of overlap) / (Area of union)
(3)
Area of overlap: the area of the intersection between the
predicted and ground truth bounding boxes.
Area of union: the area of the union between the predicted
and ground truth bounding boxes.
Fig.7. RoIAlign schematic
In practice, a threshold is set for the IoU value to
determine whether a predicted bounding box is
considered a true positive or a false positive. A higher
7

In this study, interior scene photos were captured


3.5 Loss function using the Kinect colour depth camera under a
variety of lighting conditions, and an indoor scene RGB-
The model constructed in this paper includes three D dataset was created as experimental data. 2900 colour
tasks (classification, detection and segmentation), so the photos and 2900 depth images made up the data set, of
loss function should also be divided into three parts, which 2100 were test sets and the remaining 1900 were
which are the corresponding errors. Thereby, The loss training sets The scene picture includes four categories of
function of the model is defined as follows: everyday objects: The remaining items were considered
background, including the Cup, Cap, Can, Bowl and
Ltotal = Lcls + Lbox + Lseg Cornflakes box. In this post, the Windows 10 x86
(5) hardware environment was configured and built using
Tensorflow. Among them, the CPU was an Intel Core i7,
where: the GPU was RTX 2070 super, the RAM 32GB, and the
Lcls measures the accuracy of object classification software versions of Cuda with Cudnn 9.0, 7.0,
(i.e., whether an object is present or not) and is typically Tensorflow 2.2, Python 3.7, and OpenCV 2.0 all the
computed using cross-entropy loss. details are available in Table 1. The model was trained
after the construction of the experimental setting and data
Lbox measures the error in predicting the object's set preparation. The following settings were made for the
bounding box location and size, and is usually computed experiment's parameters: batch size = 8, epoch = 19, steps
using a regression loss such as smooth L1 loss. per epoch = 10k, or 19w total iterations; learning rate =
0.001, learning momentum = 0.9, and weight decay =
0.0001 All the details are shown in Table 2.
Lseg measures the quality of the predicted instance
TABLE 1
segmentation mask and is computed using binary cross- EXPERIMENTAL ENVIRONMENT INFORMATION
entropy loss or dice loss. TABLE

Like the loss function defined in the target detection Attribute name Attribute value
algorithm Faster-RCNN, the fully connected layer is used
to predict the category of each RoI and the coordinate TensorFlow 1.14.0
position of its rectangular box in the figure; Lseg is the version 2.2.5
loss function of the semantic segmentation branch, the Keras version 32GB
output dimension of the segmentation branch for each RAM Inter(R)Core(TM)i76700KCP
Processor U
RoI is km2,which represents k binary semantic @4.00GHz x 8
segmentation masks with a resolution of m × m, and each Graphics RTX 2070 super
category has a binary mask with a resolution of, where Operating system Windows 10
k is the category quantity. Therefore, this experiment
uses sigmoid on each pixel, in which Lseg is defined as
the average cross-entropy binary loss function. For RoI
with a real category of k, only the loss function is TABLE 2
calculated on the kth instance segmentation mask, and the EXPERIMENTAL PART PARAMETER TABLE
output of other masks is not included
Parameter Value

A. Abbreviations and Acronyms


LEARNING_RATE 0.001
RPN(Regional Proposal Network),FCN(Fully Connected
LEARNING_MOMENTUM 0.9
Neural Network),ROI(Region of
Interest),IoU(Intersection over Union),NMS(Non- WEIGHT_DECAY 0.0001
Maximum suppression) DETECTION_MIN_CONFIDENCE 0.8
STEPS_PRE_EPOCH 100
NUM_CLASSES 2
MASK_POOL_SIZE 14
4 Experiment Data POOL_SIZE 7
VALIDATION_STEPS 50
4.1 DataSet Preparation
8

experiment is 4.

4.2 Algorithm evaluation criteria The enhanced algorithm is contrasted with the sophisticated
detection method Mask RCNN algorithm in order to examine
There are several techniques to assess the the detection performance of the upgraded algorithm on the
effectiveness of the segmentation algorithm, including RGB-D data set.
recall rate and precision rate. In F1 (or F-score) can be The P-R curve derived using Mask RCNN is displayed in
used to gauge them in the event that both are required to Figure 8.
meet high standards. But, mAP also addresses the issue of
their having a single point value constraint by rating the
retrieval impact. As a result, this article selects the MAP
value. Calculating the accuracy and recall is required
Bowl(AP) Cap(AP) Can(AP) Cup(AP) MAP
prior to determining the MAP value.
Fusion 0.994 0.997 0.987 0.923 0.949
To calculate the accuracy rate, which is the ratio of the Image

actual number of positive samples in the prediction RGB 0.925 0.925 0.969 0.833 0.921
sample to the total number of positive samples, and the Image
recall rate, which is the ratio of the actual number of
positive samples in the prediction sample to the total
number of predicted samples.

Formula precision = TPnum/(TPnum + FPnum)


(6)

recall = TPnum/(TPnum + FNnum)


(7)

where:
TPnum is the actual number of positive samples in Fig.8. Precision-Recall Curve
the right prediction sample, FP num is the number of 4.3. Results of the experiment
improperly divided positive samples, and NP num is the
number of errors that were split into negative samples. 4.3.1 Mean average accuracy

With precision and recall, you can calculate the average The test accuracy comparison of the two picture data
precision (AP), and draw the P–R curve according to the training models for instance segmentation. During colour
precision and recall. The area under the P–R curve is the image training, the instance segmentation algorithm's
average precision AP. The calculation formula is as MAP value is 92.1% The MAP value under image
follows: training has increased after fusion to 94.9% from when
utilised alone. Image training in colour and depth
1 increased by 2.65% and 17.675%, respectively. Also,
AP = ∫ p ( r ) dr compared to training using either colour , the accuracy of
0
the four categories acquired through fusion data training
(8) is substantially greater. Which can be observed in Table 3
MAP (Mean Average Precision) is to average all and also there is bar graph comparison between the fusion
categories, the formula is: model and the RGB model from which we can see that
n
fusion model perform much better compared to the RGB
MAP =
∑ AP(K ) model
k=1
N
(9)
TABLE 3
Among them, N represents the number of categories
in the picture, and the number of categories in this
9

Fusion RGB

99.7
99.4

98.7
96.9
92.5

92.5

92.3
83.3
Bowl C ap C an Cup

Fig.9. Comparison between Fusion model and RGB


model
Fig.11. Visualization based on RGB training model

4.3.2 Visualization of model prediction results 5. Conclusion

On the basis of the colour image training model, the Mask R-CNN is a practical method for object recognition and
visualisation results are displayed in Fig. 10 and Figure segmentation in indoor contexts with uneven lighting,
11 displays the visualisation outcomes based on the employing RGB-D datasets, according to tests and analyses
training model's fusion of colour and depth images. The provided in this journal publication. The suggested model uses
recognition impact of items in the scene was improved depth information to enhance object recognition and
following information fusion, according to the visual segmentation in a variety of lighting situations, particularly for
picture results. The Results revealed that the multitask objects with intricate forms and textures.
model that was built could not only provide multiple The tests revealed that the proposed model beat
outcomes but also perform more precise edge contour other cutting-edge techniques on several benchmark
segmentation of items of various scales, such as the Bowl, datasets with asymmetric lighting, proving the viability of
Cup, Cap and Can that were on exhibit in the picture, the suggested strategy. Moreover, the trials demonstrated
when paired with information fusion. that the suggested model is resilient to variations in
illumination, which is critical for real-world applications
where the lighting environment might be unexpected.The
suggested methodology is applicable to several fields,
including robotics, surveillance, and autonomous
vehicles, where precise and reliable object segmentation
is crucial. To enhance the performance of the model
under difficult lighting situations, more study may be
done to investigate the usage of various designs or fusion
techniques.

REFERENCES

J. You, W. Liu, J. Lee, A DNN-based semantic segmentation


for detecting weed and crop, Comput. Electron. Agric. 178
(2020) 105750
Fig.10. Visualization based on Fusion training model
L. Huang, M. He, C. Tan, D. Jiang, G. Li, H. Yu, Jointly
network image processing: multi-task image semantic
segmentation of indoor scene based on CNN, IET Image
Process. 14 (15) (2020) 3689–3697

P. Hu, F. Perazzi, F.C. Heilbron, O. Wang, Z. Lin, K. Saenko,


S. Sclaroff, Real- time semantic segmentation with fast
attention, IEEE Robot. Autom. Lett. 6 (1) (2020) 263–270 J.
1
0

You, W. Liu, J. Lee, A DNN-based semantic segmentation


for detecting weed and crop, Comput. Electron. Agric. 178
(2020) 105750

R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature


hierarchies for accurate object detection and semantic
segmentation, in: 27th IEEE Conference on Computer
Vision and Pattern Recognition, 2014, pp. 580–587

K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for


image recognition, in: Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, 2016, pp.
770–778.

S. Khan, K. Muhammad, S. Mumtaz, S.W. Baik, V.H.C. de


Albuquerque, Energy-efficient deep CNN for smoke
detection in foggy IoT environment, IEEE Internet Things J.
6 (6) (2019) 9237–9245, http://dx.doi.org/10.1109/
JIOT.2019.2896120

D. Jiang, G. Li, Y. Sun, J. Hu, J. Yun, Y. Liu, Manipulator


grabbing position detection with information fusion of color
image and depth image using deep learning, J. Ambient
Intell. Hum. Comput. (2021) http://dx.doi.org/10.
1007/s12652-020-02843-w
1
1

.
1
2
1
3

You might also like