
Satellite image segmentation using Unet++ and MobileNetV2 deep learning model


Babitha Lokula

Koneru Lakshmaiah Education Foundation


Ramakrishna Tirumuri
Koneru Lakshmaiah Education Foundation
Narasimha Prasad L V
Institute Of Aeronautical Engineering, Hyderabad

Research Article

Keywords: Satellite image segmentation, UNet++, MobileNet encoder, deep learning model, land cover classification, environmental monitoring, disaster management

Posted Date: March 26th, 2024

DOI: https://doi.org/10.21203/rs.3.rs-4144393/v1

License:   This work is licensed under a Creative Commons Attribution 4.0 International License.
Read Full License

Additional Declarations: No competing interests reported.


Satellite image segmentation using Unet++ and
MobileNetV2 deep learning model
Babitha Lokula1*, Ramakrishna Tirumuri2 and
L V Narasimha Prasad3
1,2* ECE, KL deemed to be university, Vaddeshwaram, Vijayawada,
522502, Andhra Pradesh, India.
3 ECE, Institute of Aeronautical Engineering, Dundigal, Hyderabad,

500043, Telangana, India.

*Corresponding author(s). E-mail(s): [email protected];


Contributing authors: [email protected];
[email protected];

Abstract
Satellite image segmentation plays a pivotal role in extracting valuable informa-
tion for applications like disaster management, environmental monitoring and
land cover classification. In this work, a comprehensive approach for satellite im-
age segmentation utilizing a fusion of UNet++ architecture and the lightweight
MobileNetv2 encoder deep learning model is proposed. The UNet++ architec-
ture, an extension of the widely adopted UNet, is employed for its ability to
capture hierarchical features and enhance segmentation performance. Integrat-
ing MobileNet as the encoder provides computational efficiency, making the
model well-suited for resource-constrained environments, such as satellite image
analysis on edge devices. The proposed model leverages the strengths of both
architectures, combining the expressive power of UNet++ with the efficiency
of MobileNet. Extensive experiments are conducted on a diverse satellite image
dataset, evaluating the model’s segmentation accuracy, computational efficiency,
and generalization capability. The results demonstrate the effectiveness of the
proposed approach in achieving accurate and efficient satellite image segmenta-
tion, making it a promising solution for real-world applications in remote sensing
and geospatial analysis.

Keywords: Satellite image segmentation, UNet++, MobileNet encoder, deep learning model, land cover classification, environmental monitoring, disaster management.

1 Introduction
Satellite imaging is useful in many different fields, including agriculture, urban plan-
ning, disaster monitoring, and environmental studies. Deriving relevant information from satellite imagery is essential for effective decision-making, resource management, and disaster response [1]. The
process of segmentation, which includes identifying and classifying all of the objects
and land cover elements included within the picture, is one of the most important steps
in satellite image processing. Significant problems for accurate and effective segmen-
tation are presented by the complexity of satellite imagery, which is characterized by
huge geographical extents, varied resolutions, and diverse landscapes [2]. Traditional
approaches often fail to cope with these obstacles, resulting in suboptimal solutions. In addition, the manual annotation of training data for supervised learning is not only labor-intensive but also time-consuming, which limits
the scalability of these approaches [3]. As a consequence of this, there is an urgent
need for robust and automated segmentation algorithms that are capable of adjusting
to the varied and ever-changing characteristics of satellite data.
Even though it has made great strides and is more successful than previous
methods, satellite image segmentation using deep learning models still has certain
limitations. The enormous amount of processing power that deep learning algorithms
frequently demand is one of the most prominent disadvantages [4]. These models’ in-
tricate designs need a large amount of processing power and memory, which makes
them computationally demanding and sometimes unsuitable for applications that have
restricted resources. Another disadvantage is that it requires a substantial quantity of
labeled training data to function properly. When it comes to successful training, deep
learning models, like those used for segmenting satellite images, flourish when given
access to large datasets. Obtaining and annotating such information may be a process
that is both time-consuming and resource-intensive. This is especially true for satellite
images that include a variety of different landscapes and features [5]. This dependence
on large amounts of labeled data might provide difficulties in circumstances in which
acquiring such data is difficult from a budgetary or logistical perspective.
The most recent investigations into the use of deep learning models for image seg-
mentation in satellite imagery are at an advanced stage of developments in remote
sensing and earth observation [6]. The complex and ever-changing nature of satellite imagery presents a number of issues, which researchers are actively addressing by exploring creative approaches [7]. There is now work being done to
improve previously developed deep learning architectures and to create new models
that are specifically suited for the segmentation of satellite pictures. The optimiza-
tion of computational efficiency to manage the huge volumes of data associated with
high-resolution satellite imagery [8] is one of the primary focuses of this research. This
optimization is done to ensure that these models can function well in real-time or
near-real-time applications.
In addition, researchers are looking at the use of multi-modal data sources in order
to improve the accuracy of the segmentation process. A more accurate depiction of the
Earth’s surfaces may be obtained by the combination of data from a variety of satellite
sensors and other datasets that are supplementary to those data. This comprehensive

approach adds to superior segmentation outcomes, especially in complicated situations
where standard approaches may fail to function well. The investigation of transfer
learning approaches is also gaining attention. The goal of transfer learning is to make
use of models that have been pre-trained on big datasets in order to improve segmen-
tation performance in situations when there are few annotated satellite pictures. The
direction that research in satellite image segmentation with deep learning models is currently taking demonstrates a dedication to overcoming these obstacles and pushing the limits of what is possible in remote sensing applications.
In this research, the difficult problem of segmenting satellite images is tackled,
and a unique method based on self-supervised learning and instance segmentation is
presented as a potential solution. The process of segmenting satellite images lends itself
especially well to the use of self-supervised learning, which is a promising method in
the area of computer vision. Self-supervised learning is a method of machine learning
that enables a model to acquire meaningful representations without the need for vast
human-labeled datasets. This method works by exploiting the innate patterns and
connections that are present within satellite images. Because of this modification in
the approach of learning, the model is now able to recognize and differentiate between
objects and aspects of land cover even when there is no pre-labeled training data
available. Furthermore, the method goes beyond standard segmentation methods by
including instance segmentation techniques. This enables the detection and distinction
of individual instances of the same object class included inside the picture. This is
very helpful in situations in which accurate and comprehensive segmentation is of
the utmost importance, such as when one is tasked with counting trees in a forest or
monitoring automobiles in a parking lot.
In this research, a novel Self-Supervised Learning Based Instance Segmentation
Method is presented. It is designed to particularly handle the one-of-a-kind difficulties
that are inherent in the processing of satellite images. By using the full potential of
self-supervised learning, this approach is able to make efficient use of the vast amounts
of information that are included inside satellite photos. Because of its decreased de-
pendence on manually annotated training data, this strategy mitigates a constraint
that is often experienced when using more standard supervised approaches, which is
one of the methodology’s significant advantages.

2 Related work
Tareque Bashar Ovi et al [9] introduced a novel tri-level attention-based DeepLabv3+
architecture, referred to as DeepTriNet, for the purpose of semantic segmentation of
satellite images. The hybrid technique under consideration integrates squeeze-and-
excitation networks (SENets) and tri-level attention units (TAUs) into the existing
DeepLabv3+ architecture. The TAUs are used to address the semantic feature dis-
parity between the output of encoders, while the SENets are utilized to assign more
importance to pertinent features. The DeepTriNet model, as presented, determines the
most significant characteristics in a more generic manner via self-supervision, rather
than relying on manual annotation.

Yann Fabel et al [10] presented a novel approach of self-supervised learning to
effectively use a much bigger dataset in comparison to traditional supervised training
methods, resulting in improved performance of the model. The first stage of the study
entails using over 300,000 all-sky images (ASIs) in two separate pretext
assignments as part of the pretraining process. One of the objectives focuses on the
process of reconstructing images, while the other job is concentrated on the utilization
of the DeepCluster model. The DeepCluster model is an iterative procedure that
includes grouping and categorizing the neural network’s output. Following that, the
model is subjected to a process of fine-tuning using a rather modest labeled dataset
consisting of 770 ASIs. Out of these, 616 ASIs are allocated for training purposes,
while the remaining 154 ASIs are reserved for validation. Every all-sky image (ASI) is linked to a ground truth mask that classifies individual pixels into
several categories, such as clear sky, low-layer clouds, mid-layer clouds, or high-layer
clouds. In order to evaluate the efficacy of self-supervised pretraining, a comparison
study is undertaken, whereby this methodology is contrasted with models that are
started with random weights and those that are pretrained using ImageNet data. All
models are trained and validated using identical datasets.
Fabien H.Wagner et al [11] presented the k-textures technique, which offers a self-
supervised approach for segmenting a 4-band picture (consisting of RGB and NIR
bands) into k distinct classes. An example of its use using high-resolution Planet satel-
lite images is shown. According to the algorithmic analysis, it has been determined
that the use of convolutional neural networks (CNN) in conjunction with gradient de-
scent renders discrete search a viable approach. The model is capable of identifying k
distinct clustering classes within the data. These classes are represented by k discrete
binary masks and their corresponding separately produced textures. When merged,
these masks and textures simulate the initial picture. The similarity loss refers to the
average squared error between the features of the actual picture and the simulated
image. These features are obtained from the penultimate convolutional block of two
different models: the Keras "imagenet" pre-trained VGG-16 model and a custom fea-
ture extractor created using Planet data. The primary advancements of the k-textures
model include the acquisition of k discrete binary masks inside the model via the use
of gradient descent. The proposed model facilitates the production of discrete binary
masks via the use of a unique approach including a hard sigmoid activation function.
Furthermore, the algorithm offers hard clustering classes, where each pixel is assigned
to a single class. In contrast to the k-means algorithm, which treats each pixel as an
independent entity, the approach discussed here incorporates contextual information
and associates each class not just with comparable color channel values, but also with
texture. The proposed methodology aims to facilitate the generation of training sam-
ples for satellite image segmentation. Additionally, the k-textures architecture may
be modified to accommodate varying numbers of bands and to address more intricate
self-segmentation problems, such as object self-segmentation.
Wadii Boulila et al [12] introduced a hybrid strategy for object categorization
in very-high-resolution satellite images, using the PPDL framework. The encryp-
tion technique under consideration involves the integration of Paillier homomorphic
encryption (PHE) and slightly homomorphic encryption (SHE). The objective of

this combination is to augment the encryption of satellite images while simultane-
ously maintaining optimal runtime and achieving a high level of accuracy in object
categorization. The encryption technique used for pictures is supported by the uti-
lization of the public keys associated with Partially Homomorphic Encryption (PHE)
and Somewhat Homomorphic Encryption (SHE). The researchers performed experi-
ments utilizing high-resolution satellite images obtained from the SPOT6 and SPOT7
satellites in real-world scenarios. This study examined four distinct convolutional neu-
ral network (CNN) architectures, namely ResNet50, InceptionV3, DenseNet169, and
MobileNetV2.
Wenyuan Li et al [13] proposed a self-supervised multitask methodology for
acquiring representations in remote sensing images that effectively captures visual as-
pects. The proposed approach entails the development of three separate pretext tasks
and the use of a triplet Siamese network to simultaneously capture both high-level
and low-level visual features. The training process of this network does not need the
use of labeled data. However, the resulting model may be further refined by the use
of annotated segmentation datasets during the fine-tuning phase. The efficacy of their
methodology is validated by empirical investigations carried out on several datasets,
including Potsdam, Vaihingen, and the Levir CS dataset, which focuses on cloud and
snow identification. The trial’s results demonstrate that the suggested approach ef-
fectively lowers the reliance on labeled datasets and improves the performance of
remote sensing semantic segmentation. When comparing their method to recent state-
of-the-art self-supervised representation learning methods and commonly employed
initialization methods such as random initialization and ImageNet pretraining, it is
observed that their method consistently outperforms the others in the majority of
experiments, particularly in situations where there is a scarcity of training data. Sur-
prisingly, their strategy demonstrates performance equivalent to randomly initialized models with as little as 10 to 50 labeled samples.
Haifeng Li et al [14] presented a new network called the Global Style and Local
Matching Contrastive Learning Network (GLCNet) for the task of semantic segmenta-
tion in Remote Sensing Images (RSIs). The GLCNet has been designed with a unique
structure to improve the segmentation of Remote Sensing Images (RSIs). During the
first stage, the use of the Global Style Contrastive Learning module is implemented to
enhance the process of acquiring image−level representations. This premise is based
on the notion that stylistic attributes have the capacity to accurately encapsulate the
holistic qualities of a picture. The subsequent module, known as the Local Features
Matching Contrastive Learning module, has been carefully developed to acquire rep-
resentations of local areas, which play a critical role in semantic segmentation tasks.
The authors of the study conducted a thorough evaluation of their technique by us-
ing four separate datasets for RSI semantic segmentation. The experimental findings
repeatedly demonstrate that their approach significantly outperforms both contem-
porary self−supervised approaches and the ImageNet pretraining method in terms of
performance.
Wenbo Sun et al [15] proposed a novel approach aimed at enhancing the accuracy
of picture segmentation by including depth estimation techniques into the analysis of

RGB images. Subsequently, the obtained depth map is utilized as input for a con-
volutional neural network (CNN) to facilitate the process of semantic segmentation.
Moreover, for the purpose of concurrently parsing the depth map and RGB pictures,
An encoder-decoder network with several branches is designed, and the RGB and
depth characteristics are progressively integrated. The results of the extensive exper-
imental assessment on four baseline networks indicate that the suggested technique
significantly improves the quality of segmentation and achieves superior performance
when compared to other segmentation networks.
Jannik Zurn et al [16] proposed a novel framework for terrain categorization that
leverages an unsupervised proprioceptive classifier. The classifier in question acquires
knowledge from the auditory signals generated during the interactions between vehicles
and the terrain. This allows for the autonomous training of a classifier that can perform
pixelwise semantic segmentation of pictures, based on external sensory information.
The methodology initiates by creating a discriminative embedding space for the sounds
produced during vehicle-terrain interaction. This is achieved by using triplets of audio
clips, which are constructed by combining the visual attributes of the respective ter-
rain patches. The produced embeddings are further subjected to clustering, whereby
these clusters are used as labels for the visual terrain patches. The assignment of these
labels is accomplished by projecting the pathways walked by the robot onto the cam-
era pictures. The use of weakly labeled pictures for training the semantic segmentation
network is achieved by the application of weak supervision. The study provides a thor-
ough collection of quantitative and qualitative results, illustrating the superiority of
their proprioceptive terrain classifier over current unsupervised approaches. Further-
more, the self−supervised exteroceptive semantic segmentation model developed by
the researchers demonstrates performance levels that are equivalent to those reached
by supervised learning using manually annotated data.
Huihui Dong et al [17] proposed an innovative approach to self-supervised rep-
resentation learning for remote sensing picture change detection. This technique is
centered on temporal prediction. The primary objective is to enhance the consistency
of feature representations in two satellite pictures via a self−supervised process, with-
out relying on semantic supervision or requiring extra computations. By using the
modified feature representations, it is possible to produce an improved difference im-
age (DI) that effectively minimizes the error transmitted by the DI in the end result of
detection. In the framework of self-supervision, the neural network is tasked with dis-
cerning distinct sample patches within a pair of temporal pictures, hence engaging in
temporal prediction. By using a network architecture that emulates the discriminator
component of generative adversarial networks, the temporal prediction task is able to
capture distribution−aware feature representations, leading to a resultant model that
exhibits strong resilience.

3 Existing models
CNN architectures are widely used for image classification. To achieve higher accuracy with a CNN, more layers are added to the architecture, producing a deeper CNN; this process of adding layers is known as depth scaling. A deeper CNN architecture is more powerful and more accurate, but as the number of layers keeps increasing the gains eventually saturate and the vanishing gradient problem appears. To mitigate the vanishing gradient problem, conventional deep CNNs introduced skip connections. In addition, a CNN contains many layers, so a large amount of processing and computation is required during training; this is time consuming and can lead to problems such as high complexity and long computational times.

3.1 EfficientNet
EfficientNet performs scaling on depth, width, and resolution.
Resolution scaling: Resolution scaling increases the size of low-DPI (dots per inch) images so that the network can capture more complex features and fine-grained patterns from the image. Low-resolution images have fewer features and require less training time, but the architecture cannot classify them well, so the accuracy of the algorithm is lower. On the other hand, high-resolution images have many features and require more training time, but the architecture learns more complex features and classifies better, so the accuracy of the algorithm improves.
Depth scaling: Depth scaling adds more layers to the architecture. Processing high-resolution images requires more neurons, so more layers need to be included in the architecture, which leads to the use of depth scaling. The question is how much depth scaling is required for a particular increase in image resolution.
Width scaling: Width scaling increases the number of channels or feature maps. High-resolution images have more pixels and therefore require more feature maps in order to capture all of the information in the image. Again, how much width scaling is required to increase performance? Mingxing Tan et al. [18] made two observations regarding EfficientNets:
1) Scaling up any dimension of width, depth, or resolution improves accuracy, but the accuracy gain diminishes for bigger models.
2) In order to pursue better accuracy and efficiency, it is critical to balance all dimensions of network width, depth, and resolution during scaling.
The compound scaling method is used to decide how much scaling is required for width, depth, and resolution. A baseline model designed by Neural Architecture Search (NAS) is required to perform compound scaling.
The compound scaling method scales all three dimensions with a single compound coefficient φ:

d = α^φ (depth scaling factor), w = β^φ (width scaling factor), r = γ^φ (resolution scaling factor)

where α, β, and γ are constants that can be determined by a small grid search, and φ is the network (compound) scaling factor.
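As a small worked example, the coefficients quoted for EfficientNet-B0 in the next subsection (α = 1.2, β = 1.1, γ = 1.15) can be plugged into the rule above; the following sketch simply evaluates the scaling factors and is not taken from the authors' code.

alpha, beta, gamma = 1.2, 1.1, 1.15   # constants found by grid search (EfficientNet-B0)

def compound_scale(phi):
    depth = alpha ** phi        # multiplier on the number of layers
    width = beta ** phi         # multiplier on the number of channels
    resolution = gamma ** phi   # multiplier on the input resolution
    return depth, width, resolution

for phi in (1, 2, 3):
    d, w, r = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}")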

3.1.1 Efficientnetb0 and Efficientnetb1

Figure 1: Architecture of Efficientnet

The architecture of EfficientNet is shown in Figure 1. When a CNN is made deeper, depth scaling, wider channels, and higher-resolution inputs are all added to the architecture, and adding all of these increases the chance of the vanishing gradient problem. To address these issues, Google implemented the EfficientNet B0-B7 models, and depth scaling is used in all of these architectures. The scaling for EfficientNet-B0 is determined by grid search; the chosen values are α = 1.2, β = 1.1, γ = 1.15, and φ = 1. Compound scaling suggests that the scaling of the network should be performed using a constant ratio in all dimensions, balancing the network's width, depth, and resolution together.
Disadvantages: The main drawbacks of EfficientNet-B0 are its large memory and computational requirements, its tendency to overfit, and high variance.

3.1.2 VGG19

Figure 2: Architecture of VGG 19 model

VGG, also known as VGGNet, is a classical convolutional neural network (CNN) architecture. VGG was designed to increase the depth of such CNNs in order to increase model performance. The architecture of VGG19 is shown in Figure 2. VGG19, implemented by Simonyan and Zisserman [19], is a convolutional neural network consisting of 19 layers: 16 convolution layers and 3 fully connected layers that classify pictures into 1000 object categories. It is a very popular image classification technique because it uses stacks of 3x3 filters in every convolution layer. As Figure 2 shows, the first 16 convolutional layers are used for feature extraction and the last 3 layers are used for classification. The feature extraction layers are segregated into 5 groups, where each group is followed by a max-pooling layer. An image of size 224×224 is input to this model, and the model outputs the label of the object in the image.

3.1.3 Resnet 50
ResNet50 is a deep convolutional neural network (CNN) architecture developed by Mi-
crosoft Research in 2015. The architecture of ResNet50 is shown in Figure 3. It is a variant of the widely used "Residual Network" (ResNet) architecture; the "50" in the name refers to the network's total of 50 layers. ResNet50 is a powerful image classification model that produces state-of-the-art results when trained on large datasets [20]. One of its primary innovations is the use of residual connections, through which the network learns a set of residual functions that transform the input into the intended output. These residual connections allow the network to learn much deeper structures without encountering
the vanishing gradient problem.

Figure 3: Architecture of ResNet50

The ResNet50 architecture has four main components: convolutional layers, the identity block, the convolutional block, and fully connected layers. The features that the convolution layers extract from the input image
are processed and transformed using the identity block and convolutional block. The
fully connected layers are in charge of the final classification. The convolutional layers are each followed by batch normalisation and ReLU activation, and are used to extract features such as edges, textures, and shapes from the input image. Max pooling layers, which reduce the spatial dimensions of the feature maps while retaining the most important properties, come after the convolutional layers. The two primary ResNet50 building blocks are the identity block and the convolutional block. The identity block is the basic component which adds the input back to the output after passing it through sev-
eral convolutional layers. As a result, the network is able to learn residual functions,

which convert the input into the desired output. The convolutional block resembles the identity block, with the addition of a 1x1 convolutional layer that reduces the number of filters before the 3x3 convolutional layer. The fully connected layers make up the last section of ResNet50 and determine the final classification: the output of the final fully connected layer is passed through a softmax activation function to obtain the final class probabilities.

4 Proposed UNet++ model


The UNet++ framework is an expansion of the UNet design, characterized by the
inclusion of an encoder-decoder structure that incorporates skip connections. The pro-
cess of feature extraction is performed by the encoder on the input image, while the
decoder utilizes these extracted features to build the segmentation mask. The UNet++
architecture incorporates nested skip connections to enhance the transmission of in-
formation between the encoder and decoder components. This methodology facilitates
the comprehensive capture of both low−level and high−level features, hence enhancing
the quality of segmentation outcomes.
Figure 4 shows the architecture of UNet++ with an integrated MobileNetV2 backbone.
The encoder component of the UNet++ design has the potential to be substituted
with a pre-trained MobileNetV2 backbone. This enables the model to use the efficiency
of MobileNetV2 in terms of its computational speed and resource utilization. The
MobileNetV2 encoder is responsible for extracting hierarchical features from the input
image, while the UNet++ decoder further refines these features in order to generate the
final segmentation mask. The inclusion of skip links between the encoder and decoder
components of the model facilitates the acquisition of both low-level and high-level
information, hence enhancing the performance of segmentation.
MobileNetV2 is used as an encoder for UNet++ in the proposed model. Mo-
bileNetV2 is a convolutional neural network (CNN) architecture specifically developed
to cater to the computational constraints of mobile and edge devices. The approach
employs depthwise separable convolutions and linear bottlenecks to effectively decrease
the quantity of parameters and computational burden, while still achieving satisfac-
tory performance. MobileNetV2 is renowned for its efficacy in terms of velocity and
resource consumption, rendering it well-suited for real-time applications on devices
with constrained processing capabilities. The model is trained using a dataset that
contains segmentation masks that have been labeled. During the training process, the
model acquires the ability to establish a mapping between input images and their
related segmentation masks. A commonly employed loss function for semantic segmentation is the cross-entropy loss, which measures the difference between the ground-truth and predicted pixel-wise labels.
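As an illustration of how such a model can be assembled, the following minimal sketch uses the segmentation_models_pytorch library (an assumption for illustration; the paper does not name its implementation) to combine a UNet++ decoder with a pretrained MobileNetV2 encoder and a pixel-wise cross-entropy loss.

import torch
import segmentation_models_pytorch as smp   # assumed library, not named by the authors

model = smp.UnetPlusPlus(
    encoder_name="mobilenet_v2",    # MobileNetV2 backbone as the encoder
    encoder_weights="imagenet",     # reuse ImageNet pretraining
    in_channels=3,                  # RGB satellite patches
    classes=6,                      # building, land, road, vegetation, water, unlabeled
)

x = torch.randn(1, 3, 256, 256)            # one 256x256 patch
logits = model(x)                          # shape: (1, 6, 256, 256)
loss_fn = torch.nn.CrossEntropyLoss()      # pixel-wise cross-entropy over class labels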

4.1 MobileNetV2 encoder


The encoder is made up of convolutional layers, batch normalisation, ReLU activation, and a series of 17 Inverted Residual (IR) blocks, followed by additional convolutional layers. Figure 5 shows the MobileNetV2 architecture. The encoder begins with a convolutional layer, followed by batch normalization and Rectified Linear

Figure 4: Architecture of unet++ with MobilenetV2 as encoder

Unit (ReLU) activation. The initial module is accountable for the processing of the
raw input image and producing a collection of feature maps. The core component of
the MobileNetV2 architecture is the Inverted Residual block. In the proposed model,
there are a total of 17 blocks that have been arranged in a sequential manner. The In-
verted Residual block effectively retains and enhances the features extracted from the
preceding layer. The architecture is comprised of depthwise separable convolutions,
linear bottlenecks, and skip connections.
After the series of Inverted Residual blocks, a final set of convolutional layers batch
normalization and ReLU activation layers are employed. The last stage of this block
involves enhancing the acquired features by the network and readying the output for
subsequent processing.
The initial layers and the first Inverted Residual block capture low-level features
from the input image. The subsequent Inverted Residual blocks progressively capture
more abstract and high-level features through their skip connections and hierarchical
processing. The final layers refine these features and prepare them for the transition
to the decoder part of the UNet++ architecture.
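The hierarchical feature extraction described above can be sketched with torchvision's MobileNetV2 (an illustration under the assumption of that implementation; the stage indices below are examples, not the exact taps used by the authors).

import torch
from torchvision.models import mobilenet_v2, MobileNet_V2_Weights

backbone = mobilenet_v2(weights=MobileNet_V2_Weights.DEFAULT).features  # conv stem + 17 IR blocks + final conv
x = torch.randn(1, 3, 256, 256)

features = []
for i, layer in enumerate(backbone):
    x = layer(x)
    if i in (1, 3, 6, 13, 18):       # example stages, from low-level to high-level features
        features.append(x)

for f in features:
    print(f.shape)                    # spatial size shrinks as the features become more abstract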

4.1.1 Convolution layer:


The convolutional layer is an essential component in convolutional neural networks
(CNNs), specifically intended for the purpose of analyzing 2D structures, such as
images. The input undergoes a convolution operation, wherein learnable filters or
kernels are employed to extract features from the input data.
• Filters: Filters are small matrices that are slid across the input data. Every filter learns to identify particular patterns or characteristics within the given input (see the short example after this list).

Figure 5: MobileNetV2 architecture

• Parameters: The learnable parameters of the filters in a convolutional layer are


subject to adjustment throughout the training process via back propagation. These
settings facilitate the network in autonomously acquiring hierarchical properties
from the input.
• Stride: Stride is responsible for determining the magnitude of the step taken by
the filter as it traverses the input data. Increasing the stride size leads to a decrease
in the spatial dimensions of the resulting feature map.
• Padding: Padding is a technique that entails the addition of additional border
pixels to the input in order to mitigate the risk of information loss occurring at the
edges.
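A short example of these ideas (illustrative values only, not taken from the paper):

import torch
import torch.nn as nn

x = torch.randn(1, 3, 256, 256)                       # one RGB input

conv = nn.Conv2d(in_channels=3, out_channels=16,      # 16 learnable 3x3 filters
                 kernel_size=3, stride=2, padding=1)
y = conv(x)
print(y.shape)   # torch.Size([1, 16, 128, 128]): stride 2 halves the spatial size,
                 # padding 1 limits information loss at the borders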

4.1.2 Batch Normalization


Batch Normalization serves to enhance the stability of training and expedite the pro-
cess of convergence. The process of normalizing involves the adjustment and scaling
of activations during the training phase. The fundamental components of the Batch
Normalization layer are outlined as follows:
• Normalization: During the training process, Normalization is applied to each mini-
batch by normalizing the input through the subtraction of the mean and division
by the standard deviation. This procedure effectively mitigates the issue of internal
covariate shift by maintaining a relatively consistent distribution of inputs to a layer
throughout successive batches.

• Parameters: Batch Normalization incorporates a pair of trainable parameters for
each feature (or channel) within the layer, namely scale (gamma) and shift (beta).
The utilization of these parameters enables the network to dynamically adjust the
normalized output, hence offering adaptability and maintaining the layer’s ability
to effectively represent information.

4.1.3 ReLu
The Rectified Linear Unit (ReLU) is an activation function that is frequently employed in artificial neural networks, particularly in deep learning architectures. Incorporating this function in the network introduces a non-linear element, enabling the network to learn the complex patterns and correlations inherent in the data.
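Taken together, sections 4.1.1-4.1.3 describe the Conv-BatchNorm-ReLU pattern; a small sketch with illustrative channel counts is given below.

import torch
import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(16, 32, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(32),    # per-channel normalization with learnable gamma (weight) and beta (bias)
    nn.ReLU(inplace=True)  # non-linearity
)

x = torch.randn(8, 16, 64, 64)    # a mini-batch of 8 feature maps
y = block(x)
print(y.shape)                     # torch.Size([8, 32, 64, 64])
print(block[1].weight.shape)       # gamma has one value per channel: torch.Size([32])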

4.1.4 Inverted Residual block


The Inverted Residual block is a fundamental component used in lightweight convo-
lutional neural network structures, notably in topologies like MobileNetV2. Figure 6
shows the inverted residual block. The block comprises a series of convolutional layers enclosed within a Sequential container. The initial ConvBNReLU submodule consists of a convolutional layer that modifies the number of input channels, followed by batch normalization and ReLU6 activation. The second ConvBNReLU submodule consists of a depthwise separable convolutional layer with a stride, which enhances the
efficiency of the block. The concluding component comprises a convolutional layer and
batch normalization. Skip connections are utilized in order to include the initial input
into the final output, hence facilitating the acquisition of residual mappings. In gen-
eral, the Inverted Residual block has been specifically devised to effectively capture
and process features, all the while preserving a lightweight architecture that is well-
suited for deployment on mobile and edge devices. Figure 7 shows the proposed UNet++ model with MobileNetV2.
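A hedged sketch of an Inverted Residual block in the MobileNetV2 style (expand, depthwise convolution, linear projection) is shown below; the channel sizes are illustrative and not taken from the paper.

import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1, expand_ratio=6):
        super().__init__()
        hidden = in_ch * expand_ratio
        self.use_skip = (stride == 1 and in_ch == out_ch)      # skip connection only when shapes match
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 1, bias=False),            # 1x1 expansion (ConvBNReLU)
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, stride=stride, padding=1,
                      groups=hidden, bias=False),               # 3x3 depthwise convolution
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, out_ch, 1, bias=False),            # 1x1 linear projection (bottleneck)
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_skip else out                # residual mapping when possible

y = InvertedResidual(32, 32)(torch.randn(1, 32, 64, 64))
print(y.shape)   # torch.Size([1, 32, 64, 64])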

4.2 Unet++ Decoder


The decoder block is designed to upsample and refine feature maps in the decoding
part of a neural network. Figure 8 shows the unet++ decoder architecture. The block
consists of two convolutional layers (conv1 and conv2) each followed by batch nor-
malization and ReLU activation, aiming to capture and enhance spatial features. The
attention mechanisms (attention1 and attention2) within the block are implemented
using an identity function, which effectively serves as a placeholder. These placeholders can later be replaced or modified to incorporate attention mechanisms that dynamically adjust the importance of different parts of the feature maps.
The attention layer is a component commonly used in neural network architectures,
especially in natural language processing and computer vision tasks, to selectively
focus on specific parts of the input or feature maps. The goal of attention mechanisms
is to assign varying degrees of importance to different elements in the input, allowing
the network to weigh and consider certain information more prominently.
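A sketch of such a decoder block (upsampling, concatenation with skip features, two Conv-BN-ReLU layers, and identity attention placeholders) is given below with illustrative channel sizes; it is not the authors' exact implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderBlock(nn.Module):
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.conv1 = nn.Sequential(nn.Conv2d(in_ch + skip_ch, out_ch, 3, padding=1, bias=False),
                                   nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.conv2 = nn.Sequential(nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
                                   nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.attention1 = nn.Identity()   # placeholder for a real attention module
        self.attention2 = nn.Identity()

    def forward(self, x, skip):
        x = F.interpolate(x, scale_factor=2, mode="nearest")    # upsample the deeper features
        x = self.attention1(torch.cat([x, skip], dim=1))        # merge with the skip connection
        x = self.conv1(x)
        return self.attention2(self.conv2(x))

y = DecoderBlock(96, 32, 64)(torch.randn(1, 96, 32, 32), torch.randn(1, 32, 64, 64))
print(y.shape)   # torch.Size([1, 64, 64, 64])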

Figure 6: Inverted Residual block

5 Methodology
5.1 Preprocessing of dataset
The dataset called "Semantic segmentation of aerial imagery" is downloaded from Kaggle. This dataset consists of aerial imagery of Dubai obtained by Mohammed Bin Rashid Space Centre (MBRSC) satellites. This dataset is annotated with pixel-wise
semantic segmentation into 6 classes. The total images in the dataset are 72 grouped
into 8 larger tiles. The classes of dataset are building, land, road, vegetation, water
and unlabeled. Each tile consists of 2 subfolders i.e., images and masks. Image sub-
folder consists of 9 images and masks subfolder consists of corresponding masks for
those images. The images which are present in dataset are of many different sizes like
797x644, 509x544, 682x658, 1099x846, 1126x1058, 859x838, 1817x2061, 2149x1479 in
each tile respectively. To process the pictures for training and testing, all of them must have the same dimensions, so the dataset needs to be preprocessed. The preprocessing is carried out by cropping each image and mask to a size divisible by 256; these images and masks are then patchified into 256x256 patches. Sample images before and after patchifying are shown in Figure 9. For example, tile 1 contains images and masks of size 797x644; choosing the nearest size divisible by 256 gives 768x512, which yields 6 patches per image. Similarly, tiles 2, 3, 4, 5, 6, 7 and 8 yield 2, 4, 12, 16, 9, 56 and 40 patches per image respectively. Each
tile consists of 9 images, so a total of 1305 patches are available for both images and masks after patchifying.

Figure 7: Proposed UNet++ model with MobileNetV2

Masks are in RGB form and the class information is in the form
of hexadecimal color codes, so the hexadecimal codes are converted to RGB values, the RGB labels are then converted to integer class values, and finally to one-hot encodings. Segmented outputs need to be converted back into the original RGB colors; otherwise the colors of an image and its mask would differ and the corresponding mask of each image could not be identified. Predicted tiles then need to be merged back into a large image while minimizing blending artifacts and edge effects.
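A minimal preprocessing sketch is given below, assuming the patchify and OpenCV libraries; the file paths and the building hex code are illustrative, since the full class palette is not listed in this section.

import cv2
import numpy as np
from patchify import patchify

image = cv2.imread("tile1/images/image_01.jpg")            # hypothetical path
mask  = cv2.imread("tile1/masks/image_01.png")

# Crop to the nearest size divisible by 256 (e.g. 797x644 -> 768x512).
h = (image.shape[0] // 256) * 256
w = (image.shape[1] // 256) * 256
image, mask = image[:h, :w], mask[:h, :w]

# Cut into non-overlapping 256x256 patches.
img_patches  = patchify(image, (256, 256, 3), step=256).reshape(-1, 256, 256, 3)
mask_patches = patchify(mask,  (256, 256, 3), step=256).reshape(-1, 256, 256, 3)

# Convert one hex class color to RGB, then mark matching pixels with an integer label.
building_rgb = np.array([int("3C1098"[i:i + 2], 16) for i in (0, 2, 4)])   # illustrative hex code
labels = np.zeros(mask_patches.shape[:3], dtype=np.uint8)
labels[np.all(mask_patches[..., ::-1] == building_rgb, axis=-1)] = 1       # OpenCV loads BGR, so reverse to RGB
# ...repeat for the remaining classes, then one-hot encode the integer labels.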

Figure 8: Unet++ decoder architecture

5.2 Environmental setup


This section is a description of the results obtained from the simulations conducted
using the proposed methodology. The programming language used is Python. The
execution is performed in Google Colab work environment with Python 3 Google
Compute Engine backend, a T4 GPU, 12.7 GB RAM and 78 GB of disk space. A training:testing ratio of 70:30 is used for all experiments.

5.3 Training and testing process


Figure 10 shows the training and validation loss of the proposed method. The concepts of
training loss and validation loss have significant importance within the field of machine
learning, specifically in the context of training and evaluating models such as neural
networks. The training loss is a statistic used to assess the discrepancy between the
anticipated output and the actual target values for the training dataset during the
training phase. In essence, it measures the extent of the model’s deviation from the
data it is undergoing training on. During the training phase, the model iteratively
modifies its internal parameters, such as weights and biases in the context of neural
networks, in order to minimize the training loss. The primary objective is to train the model in a manner that enables it to generalize well when presented with new, unfamiliar data.

Figure 9: Images 1 and 2 before and after patchifying; panels (a) and (c) show the original images, panels (b) and (d) show the corresponding patchified images.
However, the concept of validation loss is significant since it serves as an au-
tonomous metric for evaluating the effectiveness of a model . The measure is obtained
by the assessment of the model’s performance on a distinct dataset, known as the
validation dataset, which has not been previously encountered by the model during
the training process. In contrast to the training data, the validation dataset does not
have any impact on the updates of the model’s parameters. On the other hand, the
validation loss offers valuable information about the model’s potential performance on
entirely novel and unknown data. The use of this technique is crucial in the identifica-
tion of overfitting, which refers to a situation where a model demonstrates exceptional
performance on the training dataset but fails to properly generalize to new, unseen
data.The training loss and validation loss are recorded for every epoch, acting as cru-
cial metrics for evaluating the model’s performance. The training loss, which measures

Figure 10: Training and validation loss of proposed method

Table 1: Evaluation metrics of UNet++ with MobileNetV2

Image     Precision   Recall     F1 score   Mean IoU
Image 1   0.807041    0.807023   0.806393   0.482
Image 2   0.819062    0.796282   0.807511   0.417
Image 3   0.790811    0.783484   0.772145   0.411
Image 4   0.828221    0.821937   0.825067   0.508

the discrepancy between predicted and actual values on the training data, shows a
progressive decline from 0.83 at the tenth epoch to 0.65 at the hundredth epoch. In contrast, the validation loss, which evaluates the model's efficacy on a distinct dataset that was not used during the training process, exhibits an early decline from 0.83 to 0.72 and then undergoes a marginal rise by the hundredth epoch. Figure 11 shows the training and validation accuracy of the proposed method. The evaluation of
machine learning models, particularly in supervised learning settings, relies heavily on
the basic statistic of training accuracy. The performance of the model on the training
dataset is assessed by quantifying it via the calculation of the ratio between the num-
ber of properly predicted occurrences and the total number of examples in the training
dataset. The main objective of training accuracy is to assess the model’s proficiency
in acquiring the underlying patterns and connections present within the training data.
A high training accuracy is indicative of the model’s ability to effectively remember
the training examples and generate precise predictions on the data it was trained on.
Nevertheless, it is essential to acknowledge that achieving a high training accuracy does
not automatically ensure favorable performance when applied to novel, unobserved
data. Overfitting is a potential concern, since the model may inadvertently include
irrelevant noise or idiosyncratic features that hinder its ability to effectively generalize
to other datasets.
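The per-epoch loss and accuracy tracking described above can be sketched as follows (an illustrative training and validation loop; the authors' training code is not published in this paper).

import torch

def run_epoch(model, loader, loss_fn, optimizer=None, device="cuda"):
    training = optimizer is not None
    model.train(training)
    total_loss, total_acc, count = 0.0, 0.0, 0
    with torch.set_grad_enabled(training):
        for images, labels in loader:                  # labels: integer class map of shape (B, H, W)
            images, labels = images.to(device), labels.to(device)
            logits = model(images)                     # (B, classes, H, W)
            loss = loss_fn(logits, labels)
            if training:
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            total_loss += loss.item() * images.size(0)
            total_acc += (logits.argmax(1) == labels).float().mean().item() * images.size(0)
            count += images.size(0)
    return total_loss / count, total_acc / count       # epoch loss and pixel accuracy

# for epoch in range(100):
#     train_loss, train_acc = run_epoch(model, train_loader, loss_fn, optimizer)
#     val_loss, val_acc = run_epoch(model, val_loader, loss_fn)   # no optimizer: evaluation only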

Figure 11: Training and validation Accuracy of Proposed Method

Table 2: Comparative analysis

Method                          Accuracy   Mean IoU   Precision   Recall
UNet++ with EfficientNet-b0     0.79       0.367      0.7970      0.515
UNet++ with EfficientNet-b1     0.77       0.378      0.7807      0.689
UNet++ with VGG19               0.74       0.427      0.7987      0.746
UNet++ with MobileOne-s0        0.76       0.434      0.7789      0.774
UNet++ with MobileNetV2         0.83       0.521      0.8090      0.832

The validation accuracy serves as a complementary metric to the training accuracy,


since it evaluates the performance of the model on an independent dataset referred
to as the validation dataset. This dataset is kept separate from the training data and is not used during the model training phase. The validation accuracy functions as an indicator of how well the model generalizes to novel, unseen data. Throughout the training phase,
models undergo evaluation on both the training and validation datasets. When a
model possesses an elevated level of accuracy during training but a low level of accuracy
during validation, it is possible that the model is overfitting. Overfitting occurs when
the model becomes too specialized to the training data and encounters difficulties in
generalizing its predictions to new instances. The incorporation of a validation dataset
is crucial in the process of model selection, which aims to identify a model that exhibits
satisfactory performance not just on the training dataset but also on unobserved data.
During the tenth epoch, the validation accuracy was recorded as 0.54, whereas the
training accuracy exhibited a higher value of 0.83. In the subsequent epochs, the val-
idation accuracy demonstrates an improvement, reaching a value of 0.62. Conversely,

the training accuracy experiences a slight reduction, reaching a value of 0.77. As the training process advances, the validation accuracy demonstrates a consistent upward trend, attaining a value of 0.77 by the sixtieth epoch, while the training accuracy declines to a value of 0.69. At the ninetieth epoch there is a significant rise in the validation accuracy, reaching a value of 0.80. The maximum validation accuracy attained is 0.83, observed at the ninetieth epoch.

5.4 Evaluation metrics


This section describes the evaluation metrics used and gives an extensive description of the results. The most extensively utilised metric for segmentation performance is accuracy (A). Recall (R) and precision (P) are often used metrics for evaluating how well image classification systems work.
Accuracy (A): the proportion of accurately located and separated instances (images) in the dataset under investigation [21]. The mathematical expression for accuracy is

A = (tp + tn) / (tp + tn + fp + fn)
where the terms true positives (tp), true negatives (tn), false positives (fp), and false
negatives (fn) are used.
Precision (P): the proportion of accurately classified images to all classified images.

P = tp / (tp + fp)

Here, tp denotes the appropriately categorised image and fp denotes false positives,
or inaccurately classified photos.
Recall (R): the proportion of correctly classified images to all relevant images in the database. Recall is mathematically represented as

R = tp / (tp + fn)

false negatives (fn) are photos that belonged to the right class but were incorrectly
labelled by the classifier.
F1 score: the harmonic mean of precision and recall; a higher F1-score indicates that the system has more predictive power. Evaluating a system's performance requires more than just precision or recall alone. Mathematically, the F1-score is expressed as

F1 = 2 · (P · R) / (P + R)

In this case, P and R stand for precision and recall, respectively.


Mean IoU: determines the ratio of the area of overlap between the two regions (the ground-truth mask and the predicted mask) to the area of their union. The mathematical formula is:

J(A, B) = |A ∩ B| / |A ∪ B| = |A ∩ B| / (|A| + |B| − |A ∩ B|)

where |A ∪ B| is the total area covered by both regions (union) and |A ∩ B| is the area common to both regions (intersection).

Table 3: Original input images, ground-truth masks, and segmentation outputs of UNet++ with EfficientNet-b0, VGG19, ResNet50, and MobileNetV2 for Images 1 to 4 (image panels).
Table 1 shows the evaluation metrics of the proposed UNet++ with MobileNetV2 method. The metrics precision, Mean IoU, F1 score, and recall are determined for four different images of the corresponding dataset.
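For concreteness, the metrics above can be computed from pixel counts as in the sketch below (an illustrative implementation, not the authors' evaluation code).

import numpy as np

def pixel_metrics(pred, target, num_classes=6):
    precisions, recalls, f1s, ious = [], [], [], []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (target == c))
        fp = np.sum((pred == c) & (target != c))
        fn = np.sum((pred != c) & (target == c))
        p = tp / (tp + fp + 1e-9)
        r = tp / (tp + fn + 1e-9)
        precisions.append(p)
        recalls.append(r)
        f1s.append(2 * p * r / (p + r + 1e-9))
        ious.append(tp / (tp + fp + fn + 1e-9))        # |A ∩ B| / |A ∪ B| for class c
    accuracy = np.mean(pred == target)                  # overall pixel accuracy
    return accuracy, np.mean(precisions), np.mean(recalls), np.mean(f1s), np.mean(ious)

# Example with random 256x256 integer label maps:
pred = np.random.randint(0, 6, (256, 256))
target = np.random.randint(0, 6, (256, 256))
print(pixel_metrics(pred, target))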

6 Comparative analysis of Experimental Results


Table 2 shows a comparative examination of several methodologies, with a focus on
their respective levels of Accuracy, Precision, Recall and Mean IoU. The techniques
used in this study include Efficientnet-b0, Efficientnet-b1, VGG19, Mobileone s0, and
Mobilenet v2. These methods were evaluated based on their respective accuracies,
which were found to be 0.79, 0.77, 0.74, 0.76, and a maximum accuracy of 0.83, respectively. The methods were also evaluated on Mean IoU, with values of 0.367, 0.378, 0.427, 0.434 and a maximum Mean IoU of 0.521, respectively. Their precisions were found to be 0.7970, 0.7807, 0.7987, 0.7789 and a maximum precision of 0.8090, respectively, and their recalls were 0.515, 0.689, 0.746, 0.774 and a maximum recall of 0.832, respectively.
The table presents a concise overview of the performance of the various approaches,
revealing that Mobilenet v2 attained the greatest level of performance compared to
the other methods under consideration. This finding demonstrates a significant degree of proficiency in handling unfamiliar data, implying that the model generalizes well.

7 Conclusion
The proposed satellite image segmentation model, which combines the UNet++ ar-
chitecture with the lightweight MobileNet encoder, has demonstrated noteworthy
performance in accurately delineating features within satellite images. Through exten-
sive experiments conducted on a diverse satellite image dataset, the model achieved
an accuracy of 83 percent, a Mean IoU of 43.98, and a recall of 83.24. This level of accuracy is particularly promising for real-world applications such as disaster management, land cover classification, and environmental monitoring. The integration
of UNet++ for its hierarchical feature capturing capabilities and MobileNet for com-
putational efficiency has proven to be a successful fusion, striking a balance between
accuracy and resource efficiency. The achieved accuracy of 83 percent underscores the
model’s effectiveness in extracting valuable information from satellite imagery, making
it a compelling solution for remote sensing tasks in resource-constrained environments.

References
[1] Yuan, K., Zhuang, X., Schaefer, G., Feng, J., Guan, L., Fang, H.: Deep-learning-
based multispectral satellite image segmentation for water body detection. IEEE
Journal of Selected Topics in Applied Earth Observations and Remote Sensing
14, 7422–7434 (2021)

[2] Jia, H., Lang, C., Oliva, D., Song, W., Peng, X.: Dynamic harris hawks optimiza-
tion with mutation mechanism for satellite image segmentation. Remote sensing
11(12), 1421 (2019)

[3] Ghassemi, S., Fiandrotti, A., Francini, G., Magli, E.: Learning and adapting ro-
bust features for satellite image segmentation on heterogeneous data sets. IEEE
Transactions on Geoscience and Remote Sensing 57(9), 6517–6529 (2019)

[4] Rahaman, J., Sing, M.: An efficient multilevel thresholding based satellite image
segmentation approach using a new adaptive cuckoo search algorithm. Expert
Systems with Applications 174, 114633 (2021)

[5] Kotaridis, I., Lazaridou, M.: Remote sensing image segmentation advances: A
meta-analysis. ISPRS Journal of Photogrammetry and Remote Sensing 173, 309–
322 (2021)

[6] Pare, S., Kumar, A., Singh, G.K., Bajaj, V.: Image segmentation using multi-
level thresholding: a research review. Iranian Journal of Science and Technology,
Transactions of Electrical Engineering 44, 1–29 (2020)

[7] Gupta, A., Watson, S., Yin, H.: Deep learning-based aerial image segmenta-
tion with open data for disaster impact assessment. Neurocomputing 439, 22–33
(2021)

[8] Iqbal, J., Ali, M.: Weakly-supervised domain adaptation for built-up region seg-
mentation in aerial and satellite imagery. ISPRS Journal of Photogrammetry and
Remote Sensing 167, 263–275 (2020)

[9] Ovi, T.B., Mosharrof, S., Bashree, N., Islam, M.N., Islam, M.S.: Deeptrinet:
A tri-level attention-based deeplabv3+ architecture for semantic segmenta-
tion of satellite images. In: International Conference on Human-Centric Smart
Computing, pp. 373–384 (2023). Springer

[10] Fabel, Y., Nouri, B., Wilbert, S., Blum, N., Triebel, R., Hasenbalg, M., Kuhn,
P., Zarzalejo, L.F., Pitz-Paal, R.: Applying self-supervised learning for seman-
tic cloud segmentation of all-sky images. Atmospheric Measurement Techniques
15(3), 797–809 (2022)

[11] Wagner, F.H., Dalagnol, R., Sánchez, A.H., Hirye, M., Favrichon, S., Lee, J.H.,

Mauceri, S., Yang, Y., Saatchi, S.: K-textures, a self-supervised hard clus-
tering deep learning algorithm for satellite image segmentation. Frontiers in
Environmental Science 10, 946729 (2022)

[12] Boulila, W., Khlifi, M.K., Ammar, A., Koubaa, A., Benjdira, B., Farah, I.R.: A
hybrid privacy-preserving deep learning approach for object classification in very
high-resolution satellite images. Remote Sensing 14(18), 4631 (2022)

[13] Li, W., Chen, H., Shi, Z.: Semantic segmentation of remote sensing images with
self-supervised multitask representation learning. IEEE Journal of Selected Topics
in Applied Earth Observations and Remote Sensing 14, 6438–6450 (2021)

[14] Li, H., Li, Y., Zhang, G., Liu, R., Huang, H., Zhu, Q., Tao, C.: Global and
local contrastive self-supervised learning for semantic segmentation of hr remote
sensing images. IEEE Transactions on Geoscience and Remote Sensing 60, 1–14
(2022)

[15] Sun, W., Gao, Z., Cui, J., Ramesh, B., Zhang, B., Li, Z.: Semantic segmentation
leveraging simultaneous depth estimation. Sensors 21(3), 690 (2021)

[16] Zürn, J., Burgard, W., Valada, A.: Self-supervised visual terrain classification
from unsupervised acoustic feature learning. IEEE Transactions on Robotics
37(2), 466–481 (2020)

[17] Dong, H., Ma, W., Wu, Y., Zhang, J., Jiao, L.: Self-supervised representation
learning for remote sensing image change detection based on temporal prediction.
Remote Sensing 12(11), 1868 (2020)

[18] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neu-
ral networks. In: International Conference on Machine Learning, pp. 6105–6114
(2019). PMLR

[19] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale
image recognition. arXiv preprint arXiv:1409.1556 (2014)

[20] Ikechukwu, A.V., Murali, S., Deepu, R., Shivamurthy, R.: Resnet-50 vs vgg-19 vs
training from scratch: A comparative analysis of the segmentation and classifica-
tion of pneumonia from chest x-ray images. Global Transitions Proceedings 2(2),
375–381 (2021)

[21] Shabbir, A., Ali, N., Ahmed, J., Zafar, B., Rasheed, A., Sajid, M., Ahmed, A.,
Dar, S.H.: Satellite and scene image classification based on transfer learning and
fine tuning of resnet50. Mathematical Problems in Engineering 2021, 1–18 (2021)
