Satellite Image Segmentation Using Unet and MobileNet
Research Article
Keywords: Satellite image segmentation, UNet++, MobileNet encoder, deep learning model, land cover
classification, environmental monitoring, disaster management
DOI: https://doi.org/10.21203/rs.3.rs-4144393/v1
License: This work is licensed under a Creative Commons Attribution 4.0 International License.
Abstract
Satellite image segmentation plays a pivotal role in extracting valuable information for applications like disaster management, environmental monitoring, and
land cover classification. In this work, a comprehensive approach for satellite im-
age segmentation utilizing a fusion of UNet++ architecture and the lightweight
MobileNetv2 encoder deep learning model is proposed. The UNet++ architec-
ture, an extension of the widely adopted UNet, is employed for its ability to
capture hierarchical features and enhance segmentation performance. Integrat-
ing MobileNet as the encoder provides computational efficiency, making the
model well-suited for resource-constrained environments, such as satellite image
analysis on edge devices. The proposed model leverages the strengths of both
architectures, combining the expressive power of UNet++ with the efficiency
of MobileNet. Extensive experiments are conducted on a diverse satellite image
dataset, evaluating the model’s segmentation accuracy, computational efficiency,
and generalization capability. The results demonstrate the effectiveness of the
proposed approach in achieving accurate and efficient satellite image segmenta-
tion, making it a promising solution for real-world applications in remote sensing
and geospatial analysis.
1 Introduction
Satellite imaging is useful in many fields, including agriculture, urban planning, disaster monitoring, and environmental studies. The ability to derive relevant information from satellite images is essential for effective decision-making, resource management, and response to natural disasters [1]. The
process of segmentation, which includes identifying and classifying all of the objects
and land cover elements included within the picture, is one of the most important steps
in satellite image processing. Significant problems for accurate and effective segmen-
tation are presented by the complexity of satellite imagery, which is characterized by
huge geographical extents, varied resolutions, and diverse landscapes [2]. Traditional
approaches often fail to keep up with all of these different obstacles, which results in
solutions that are not ideal. In addition, the manual annotation of training data for supervised learning is not only labor-intensive but also time-consuming, which limits the scalability of these approaches [3]. As a consequence, there is an urgent
need for robust and automated segmentation algorithms that are capable of adjusting
to the varied and ever-changing characteristics of satellite data.
Even though it has made great strides and is more successful than previous
methods, satellite image segmentation using deep learning models still has certain
limitations. The enormous amount of processing power that deep learning algorithms
frequently demand is one of the most prominent disadvantages [4]. These models’ in-
tricate designs need a large amount of processing power and memory, which makes
them computationally demanding and sometimes unsuitable for applications that have
restricted resources. Another disadvantage is that it requires a substantial quantity of
labeled training data to function properly. When it comes to successful training, deep
learning models, like those used for segmenting satellite images, flourish when given
access to large datasets. Obtaining and annotating such information may be a process
that is both time-consuming and resource-intensive. This is especially true for satellite
images that include a variety of different landscapes and features [5]. This dependence
on large amounts of labeled data might provide difficulties in circumstances in which
acquiring such data is difficult from a budgetary or logistical perspective.
The most recent investigations into deep learning models for image segmentation in satellite imagery represent an advanced stage of development in remote sensing and earth observation [6]. The complex and ever-changing nature of satellite imagery presents a number of issues, which researchers are actively addressing by exploring creative approaches [7]. Work is ongoing to
improve previously developed deep learning architectures and to create new models
that are specifically suited for the segmentation of satellite pictures. The optimiza-
tion of computational efficiency to manage the huge volumes of data associated with
high-resolution satellite imagery [8] is one of the primary focuses of this research. This
optimization is done to ensure that these models can function well in real-time or
near-real-time applications.
In addition, researchers are looking at the use of multi-modal data sources in order
to improve the accuracy of the segmentation process. A more accurate depiction of the
Earth’s surfaces may be obtained by the combination of data from a variety of satellite
sensors and other datasets that are supplementary to those data. This comprehensive
approach adds to superior segmentation outcomes, especially in complicated situations
where standard approaches may fail to function well. The investigation of transfer
learning approaches is also gaining attention. The goal of transfer learning is to make
use of models that have been pre-trained on big datasets in order to improve segmen-
tation performance in situations when there are few annotated satellite pictures. The
direction that research in satellite image segmentation using deep learning models is
taking at the moment demonstrates a dedication to overcoming obstacles and pushing
the limits of what is possible in remote sensing applications.
In this research, the difficult problem of segmenting satellite images is tackled,
and a unique method based on self-supervised learning and instance segmentation is
presented as a potential solution. The process of segmenting satellite images lends itself
especially well to the use of self-supervised learning, which is a promising method in
the area of computer vision. Self-supervised learning is a method of machine learning
that enables a model to acquire meaningful representations without the need for vast
human-labeled datasets. This method works by exploiting the innate patterns and
connections that are present within satellite images. Because of this modification in
the approach of learning, the model is now able to recognize and differentiate between
objects and aspects of land cover even when there is no pre-labeled training data
available. Furthermore, the method goes beyond standard segmentation methods by
including instance segmentation techniques. This enables the detection and distinction
of individual instances of the same object class included inside the picture. This is
very helpful in situations in which accurate and comprehensive segmentation is of
the utmost importance, such as when one is tasked with counting trees in a forest or
monitoring automobiles in a parking lot.
In this research, a novel Self-Supervised Learning Based Instance Segmentation
Method is presented. It is designed to particularly handle the one-of-a-kind difficulties
that are inherent in the processing of satellite images. By using the full potential of
self-supervised learning, this approach is able to make efficient use of the vast amounts
of information that are included inside satellite photos. Because of its decreased de-
pendence on manually annotated training data, this strategy mitigates a constraint
that is often experienced when using more standard supervised approaches, which is
one of the methodology’s significant advantages.
2 Related work
Tareque Bashar Ovi et al [9] introduced a novel tri-level attention-based DeepLabv3+
architecture, referred to as DeepTriNet, for the purpose of semantic segmentation of
satellite images. The hybrid technique under consideration integrates squeeze-and-
excitation networks (SENets) and tri-level attention units (TAUs) into the existing
DeepLabv3+ architecture. The TAUs are used to address the semantic feature dis-
parity between the output of encoders, while the SENets are utilized to assign more
importance to pertinent features. The DeepTriNet model, as presented, determines the
most significant characteristics in a more generic manner via self-supervision, rather
than relying on manual annotation.
Yann Fabel et al [10] presented a novel approach of self-supervised learning to
effectively use a much bigger dataset in comparison to traditional supervised training
methods, resulting in improved performance of the model. The first stage of the study
entails using over 300,000 all-sky images (ASIs) in two separate pretext
assignments as part of the pretraining process. One of the objectives focuses on the
process of reconstructing images, while the other job is concentrated on the utilization
of the DeepCluster model. The DeepCluster model is an iterative procedure that
includes grouping and categorizing the neural network’s output. Following that, the
model is subjected to a process of fine-tuning using a rather modest labeled dataset
consisting of 770 ASIs. Out of these, 616 ASIs are allocated for training purposes,
while the remaining 154 ASIs are reserved for validation. Every ASI is linked to a ground truth mask that classifies individual pixels into
several categories, such as clear sky, low-layer clouds, mid-layer clouds, or high-layer
clouds. In order to evaluate the efficacy of self-supervised pretraining, a comparison
study is undertaken, whereby this methodology is contrasted with models that are
started with random weights and those that are pretrained using ImageNet data. All
models are trained and validated using identical datasets.
Fabien H. Wagner et al [11] presented the k-textures technique, which offers a self-
supervised approach for segmenting a 4-band picture (consisting of RGB and NIR
bands) into k distinct classes. An example of its use using high-resolution Planet satel-
lite images is shown. According to the algorithmic analysis, it has been determined
that the use of convolutional neural networks (CNN) in conjunction with gradient de-
scent renders discrete search a viable approach. The model is capable of identifying k
distinct clustering classes within the data. These classes are represented by k discrete
binary masks and their corresponding separately produced textures. When merged,
these masks and textures simulate the initial picture. The similarity loss refers to the
average squared error between the features of the actual picture and the simulated
image. These features are obtained from the penultimate convolutional block of two
different models: The Keras ”imagenet” pre-trained VGG-16 model and a custom fea-
ture extractor created using Planet data. The primary advancements of the k-textures
model include the acquisition of k discrete binary masks inside the model via the use
of gradient descent. The proposed model facilitates the production of discrete binary
masks via the use of a unique approach including a hard sigmoid activation function.
Furthermore, the algorithm offers hard clustering classes, where each pixel is assigned
to a single class. In contrast to the k-means algorithm, which treats each pixel as an
independent entity, the approach discussed here incorporates contextual information
and associates each class not just with comparable color channel values, but also with
texture. The proposed methodology aims to facilitate the generation of training sam-
ples for satellite image segmentation. Additionally, the k-textures architecture may
be modified to accommodate varying numbers of bands and to address more intricate
self-segmentation problems, such as object self-segmentation.
Wadii Boulila et al [12] introduced a hybrid strategy for object categorization in very-high-resolution satellite images using a privacy-preserving deep learning (PPDL) framework. The encryption technique under consideration integrates Paillier homomorphic encryption (PHE) and somewhat homomorphic encryption (SHE). The objective of
this combination is to augment the encryption of satellite images while simultane-
ously maintaining optimal runtime and achieving a high level of accuracy in object
categorization. The image encryption is supported by the utilization of the public keys associated with PHE and SHE. The researchers performed experi-
ments utilizing high-resolution satellite images obtained from the SPOT6 and SPOT7
satellites in real-world scenarios. This study examined four distinct convolutional neu-
ral network (CNN) architectures, namely ResNet50, InceptionV3, DenseNet169, and
MobileNetV2.
Wenyuan Li et al [13] proposed a self-supervised multitask methodology for
acquiring representations in remote sensing images that effectively captures visual as-
pects. The proposed approach entails the development of three separate pretext tasks
and the use of a triplet Siamese network to simultaneously capture both high-level
and low-level visual features. The training process of this network does not need the
use of labeled data. However, the resulting model may be further refined by the use
of annotated segmentation datasets during the fine-tuning phase. The efficacy of their
methodology is validated by empirical investigations carried out on several datasets,
including Potsdam, Vaihingen, and the Levir CS dataset, which focuses on cloud and
snow identification. The trial’s results demonstrate that the suggested approach ef-
fectively lowers the reliance on labeled datasets and improves the performance of
remote sensing semantic segmentation. When comparing their method to recent state-
of-the-art self-supervised representation learning methods and commonly employed
initialization methods such as random initialization and ImageNet pretraining, it is
observed that their method consistently outperforms the others in the majority of
experiments, particularly in situations where there is a scarcity of training data. Surprisingly, their strategy demonstrates performance equivalent to randomly initialized models with as little as 10 to 50 labeled samples.
Haifeng Li et al [14] presented a new network called the Global Style and Local
Matching Contrastive Learning Network (GLCNet) for the task of semantic segmenta-
tion in Remote Sensing Images (RSIs). The GLCNet has been designed with a unique
structure to improve the segmentation of Remote Sensing Images (RSIs). During the
first stage, the use of the Global Style Contrastive Learning module is implemented to
enhance the process of acquiring image-level representations. This premise is based
on the notion that stylistic attributes have the capacity to accurately encapsulate the
holistic qualities of a picture. The subsequent module, known as the Local Features
Matching Contrastive Learning module, has been carefully developed to acquire rep-
resentations of local areas, which play a critical role in semantic segmentation tasks.
The authors of the study conducted a thorough evaluation of their technique by us-
ing four separate datasets for RSI semantic segmentation. The experimental findings
repeatedly demonstrate that their approach significantly outperforms both contem-
porary self-supervised approaches and the ImageNet pretraining method in terms of
performance.
Wenbo Sun et al [15] proposed a novel approach aimed at enhancing the accuracy
of picture segmentation by including depth estimation techniques into the analysis of
RGB images. Subsequently, the obtained depth map is utilized as input for a con-
volutional neural network (CNN) to facilitate the process of semantic segmentation.
Moreover, for the purpose of concurrently parsing the depth map and RGB pictures,
An encoder-decoder network with several branches is designed, and the RGB and
depth characteristics are progressively integrated. The results of the extensive exper-
imental assessment on four baseline networks indicate that the suggested technique
significantly improves the quality of segmentation and achieves superior performance
when compared to other segmentation networks.
Jannik Zurn et al [16] proposed a novel framework for terrain categorization that
leverages an unsupervised proprioceptive classifier. The classifier in question acquires
knowledge from the auditory signals generated during the interactions between vehicles
and the terrain. This allows for the autonomous training of a classifier that can perform
pixelwise semantic segmentation of pictures, based on external sensory information.
The methodology initiates by creating a discriminative embedding space for the sounds
produced during vehicle-terrain interaction. This is achieved by using triplets of audio
clips, which are constructed by combining the visual attributes of the respective ter-
rain patches. The produced embeddings are further subjected to clustering, whereby
these clusters are used as labels for the visual terrain patches. The assignment of these
labels is accomplished by projecting the paths traversed by the robot onto the camera images. The weakly labeled images are then used to train the semantic segmentation network under weak supervision. The study provides a thor-
ough collection of quantitative and qualitative results, illustrating the superiority of
their proprioceptive terrain classifier over current unsupervised approaches. Further-
more, the self-supervised exteroceptive semantic segmentation model developed by
the researchers demonstrates performance levels that are equivalent to those reached
by supervised learning using manually annotated data.
Huihui Dong et al [17] proposed an innovative approach to self-supervised representation learning for remote sensing image change detection. This technique is centered on temporal prediction. The primary objective is to enhance the consistency of feature representations in two satellite images via a self-supervised process, with-
out relying on semantic supervision or requiring extra computations. By using the
modified feature representations, it is possible to produce an improved difference im-
age (DI) that effectively minimizes the error transmitted by the DI in the end result of
detection. In the framework of self-supervision, the neural network is tasked with dis-
cerning distinct sample patches within a pair of temporal pictures, hence engaging in
temporal prediction. By using a network architecture that emulates the discriminator
component of generative adversarial networks, the temporal prediction task is able to
capture distribution-aware feature representations, leading to a resultant model that
exhibits strong resilience.
3 Existing models
CNN architectures are used to perform image classification. To achieve high accuracy with a CNN, more layers are added to the architecture so that it becomes a deeper CNN; adding more layers corresponds to depth scaling. A deeper CNN architecture is more powerful and gives higher efficiency, but as the number of layers keeps increasing, the gains saturate after a certain point and the vanishing gradient problem appears. To mitigate the vanishing gradient problem, skip connections were used in conventional deep CNNs. In addition, a CNN with many layers requires a great deal of processing and computation during training, which is time-consuming and can lead to problems such as high complexity and long computational time.
3.1 EfficientNet
EfficientNet performs scaling on depth, width, and resolution.
Resolution scaling: Resolution scaling means increasing the size of low-resolution (low DPI, Dots Per Inch) images so that highly complex features and fine-grained patterns can be captured from the image. Low-resolution images have fewer features and require less training time, but the architecture cannot classify the images accurately, so the accuracy of the algorithm is also lower. On the other hand, high-resolution images have many features and require more training time, but the architecture learns more complex features and gives better classification, so the accuracy of the algorithm improves.
Depth scaling: Depth scaling means adding more layers to the architecture. Processing high-resolution images requires more neurons, so more layers need to be included in the architecture, which leads to the use of depth scaling. How much depth scaling is required for a particular increase in image resolution?
Width scaling: Width scaling means increasing the number of channels or feature maps. High-resolution images have more pixels and require many feature maps in order to capture all of the information in the image. How much width scaling is required to increase the performance? Mingxing Tan et al. [18] made two observations regarding EfficientNets.
1) Scaling up any dimension of width, depth, or resolution improves accuracy, but the accuracy gain diminishes for bigger models.
2) In order to pursue better accuracy and efficiency, it is critical to balance all dimensions of network width, depth, and resolution during scaling. The compound scaling method is used to decide how much scaling is required for width, depth, and resolution. The baseline model designed by Neural Architecture Search (NAS) is required to perform compound scaling.
The compound scaling method uses a compound coefficient φ to scale all three dimensions uniformly:

depth: d = α^φ, width: w = β^φ, resolution: r = γ^φ, subject to α · β² · γ² ≈ 2,

where α, β and γ are constants that can be determined by a small grid search, and φ is a user-specified coefficient that controls how many additional resources are available for model scaling.
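As a concrete illustration of this rule, the short Python sketch below (an illustration only, not the authors' code) computes the scaled depth, width and resolution multipliers from α, β, γ and φ; the constants used are the EfficientNetB0 values quoted in the next subsection.

# Minimal sketch of EfficientNet compound scaling (illustrative only).
# alpha, beta, gamma are the grid-searched constants; phi is the
# user-chosen compound coefficient that controls overall model size.
def compound_scale(alpha, beta, gamma, phi):
    depth_mult = alpha ** phi        # d = alpha^phi  (more layers)
    width_mult = beta ** phi         # w = beta^phi   (more channels)
    resolution_mult = gamma ** phi   # r = gamma^phi  (larger input images)
    return depth_mult, width_mult, resolution_mult

# EfficientNetB0 constants reported in the text: alpha=1.2, beta=1.1, gamma=1.15, phi=1
d, w, r = compound_scale(1.2, 1.1, 1.15, phi=1)
print(d, w, r)   # 1.2 1.1 1.15; alpha * beta**2 * gamma**2 is roughly 2,
                 # so each unit increase of phi roughly doubles the FLOPs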
3.1.1 Efficientnetb0 and Efficientnetb1
The architecture of EfficientNet is shown in figure 1. As discussed above, making a CNN deeper requires depth scaling, wider channels, and higher-resolution inputs, and adding all of these to the architecture raises the risk of the vanishing gradient problem. To address these issues, the EfficientNet B0-B7 models were introduced by Google. In all of these architectures depth scaling is used, and the scaling for EfficientNetB0 is determined by a grid search. The chosen values for EfficientNetB0 are α=1.2, β=1.1, γ=1.15 and φ=1. Compound scaling suggests that the network should be scaled using a constant ratio in all dimensions; the compound scaling method balances all the dimensions of the network, i.e., width, depth, and resolution.
Disadvantages: The main drawbacks of EfficientNetB0 are its large memory and computational requirements, the overfitting effect, and high variance.
3.1.2 VGG19
The architecture of VGG19 is shown in figure 2. VGG19, introduced by Simonyan and Zisserman [19], is a convolutional neural network consisting of 19 layers: 16 convolutional layers and 3 fully connected layers that classify pictures into 1000 object categories. It is a very popular technique for image classification because of its use of 3x3 filters in every convolutional layer. As figure 2 shows, the first 16 convolutional layers are used for feature extraction and the last 3 layers are used for classification. The layers used for feature extraction are segregated into 5 groups, where each group is followed by a max-pooling layer. An image of size 224×224 is input into this model, and the model outputs the label of the object in the image.
3.1.3 ResNet50
ResNet50 is a deep convolutional neural network (CNN) architecture developed by Microsoft Research in 2015. The architecture of ResNet50 is shown in figure 3. The name comes from the network's layer count: it is a variant of the widely used Residual Network (ResNet) architecture with 50 layers in total. ResNet50 is a powerful image classification model that produces state-of-the-art results when trained on huge datasets [20]. One of its primary advances is the use of residual connections, through which the network learns a set of residual functions that transform the input into the intended output. With these residual connections, the network is able to learn much deeper structures without encountering the vanishing gradient issue. The ResNet50 architecture has four main components: convolutional layers, identity blocks, convolutional blocks, and fully connected layers. The features extracted by the convolutional layers from the input image are processed and transformed by the identity blocks and convolutional blocks, and the fully connected layers are in charge of the final classification. The convolutional layers are followed by batch normalisation and ReLU activation and are used to extract features such as edges, textures, and shapes from the input image. Max-pooling layers, which reduce the spatial dimensions of the feature maps while retaining the most important properties, come after the convolutional layers. The two primary ResNet50 building blocks are the identity block and the convolutional block. The identity block is the basic component, which adds the input back to the output after passing it through several convolutional layers. As a result, the network is able to learn residual functions,
which convert the input into the desired output. The convolutional block resembles the identity block, except that it includes a 1x1 convolutional layer to reduce the number of filters before the 3x3 convolutional layer. The fully connected layers make up the last section of ResNet50 and determine the final classification; the output of the final fully connected layer is passed to a softmax activation function to obtain the final class probabilities.
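As an illustration of these two building blocks, the following PyTorch sketch shows a simplified ResNet-style bottleneck block with a residual connection; it is a minimal sketch for clarity, not the exact ResNet50 implementation.

import torch
import torch.nn as nn

class BottleneckBlock(nn.Module):
    # Simplified identity block: 1x1 -> 3x3 -> 1x1 convolutions, with the
    # input added back to the output (the residual/skip connection).
    def __init__(self, channels):
        super().__init__()
        mid = channels // 4  # the 1x1 conv reduces the filters before the 3x3 conv
        self.body = nn.Sequential(
            nn.Conv2d(channels, mid, kernel_size=1, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # The block learns the residual function F(x) and outputs F(x) + x.
        return self.relu(self.body(x) + x)

block = BottleneckBlock(256)
out = block(torch.randn(1, 256, 56, 56))   # output shape preserved: (1, 256, 56, 56)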
Figure 4: Architecture of unet++ with MobilenetV2 as encoder
The encoder begins with a standard convolutional layer followed by batch normalization and Rectified Linear Unit (ReLU) activation. The initial module is accountable for the processing of the
raw input image and producing a collection of feature maps. The core component of
the MobileNetV2 architecture is the Inverted Residual block. In the proposed model,
there are a total of 17 blocks that have been arranged in a sequential manner. The In-
verted Residual block effectively retains and enhances the features extracted from the
preceding layer. The architecture is comprised of depthwise separable convolutions,
linear bottlenecks, and skip connections.
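A minimal PyTorch sketch of such an inverted residual block (1x1 expansion, 3x3 depthwise convolution, 1x1 linear bottleneck, and a skip connection when the input and output shapes match) is given below; it illustrates the pattern rather than reproducing the exact MobileNetV2 layers.

import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    # MobileNetV2-style inverted residual block: expand -> depthwise -> project.
    def __init__(self, in_ch, out_ch, stride=1, expand_ratio=6):
        super().__init__()
        hidden = in_ch * expand_ratio
        self.use_skip = (stride == 1 and in_ch == out_ch)
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 1, bias=False),               # 1x1 expansion
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, stride, 1,
                      groups=hidden, bias=False),                  # 3x3 depthwise conv
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, out_ch, 1, bias=False),              # 1x1 linear bottleneck
            nn.BatchNorm2d(out_ch),                                # no activation here
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_skip else out   # skip connection when shapes match

blk = InvertedResidual(32, 32)
y = blk(torch.randn(1, 32, 128, 128))   # (1, 32, 128, 128)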
After the series of Inverted Residual blocks, a final set of convolutional layers with batch normalization and ReLU activation is employed. The last stage of this block refines the features acquired by the network and readies the output for subsequent processing.
The initial layers and the first Inverted Residual block capture low-level features
from the input image. The subsequent Inverted Residual blocks progressively capture
more abstract and high-level features through their skip connections and hierarchical
processing. The final layers refine these features and prepare them for the transition
to the decoder part of the UNet++ architecture.
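As an illustration, an encoder-decoder combination of this kind could be instantiated with the segmentation_models_pytorch library; this is only an assumption made for illustration, since the framework used for the experiments is not stated.

import torch
import segmentation_models_pytorch as smp

# Hypothetical instantiation: UNet++ decoder with a MobileNetV2 encoder
# pretrained on ImageNet, predicting the six land-cover classes of the dataset.
model = smp.UnetPlusPlus(
    encoder_name="mobilenet_v2",
    encoder_weights="imagenet",
    in_channels=3,
    classes=6,
)

with torch.no_grad():
    logits = model(torch.randn(1, 3, 256, 256))   # one 256x256 RGB patch
print(logits.shape)   # torch.Size([1, 6, 256, 256]) -> per-pixel class scores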
Figure 5: MobileNetV2 architecture
• Parameters: Batch Normalization incorporates a pair of trainable parameters for
each feature (or channel) within the layer, namely scale (gamma) and shift (beta).
The utilization of these parameters enables the network to dynamically adjust the
normalized output, hence offering adaptability and maintaining the layer’s ability
to effectively represent information.
4.1.3 ReLU
The Rectified Linear Unit (ReLU) is an activation function that is frequently employed in artificial neural networks, particularly in deep learning architectures. Incorporating this function into the network introduces a non-linear element, which facilitates the network's capacity to learn the complex patterns and correlations inherent within the data.
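To make the two operations concrete, the small PyTorch snippet below (illustrative only) shows the per-channel scale (gamma) and shift (beta) parameters of batch normalization and the element-wise behaviour of ReLU.

import torch
import torch.nn as nn

bn = nn.BatchNorm2d(64)                  # one gamma and one beta per channel
print(bn.weight.shape, bn.bias.shape)    # gamma: torch.Size([64]), beta: torch.Size([64])

relu = nn.ReLU()
x = torch.tensor([-2.0, 0.0, 3.0])
print(relu(x))                           # tensor([0., 0., 3.]) -> max(0, x) element-wise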
Figure 6: Inverted Residual block
5 Methodology
5.1 Preprocessing of dataset
The dataset called "Semantic segmentation of aerial imagery" is downloaded from Kaggle. This dataset consists of aerial imagery of Dubai obtained by Mohammad Bin Rashid Space Centre (MBRSC) satellites and is annotated with pixel-wise semantic segmentation into 6 classes. The dataset contains 72 images in total, grouped into 8 larger tiles. The classes are building, land, road, vegetation, water, and unlabeled. Each tile consists of 2 subfolders, images and masks; the images subfolder contains 9 images and the masks subfolder contains the corresponding masks. The images in the dataset have many different sizes, such as 797x644, 509x544, 682x658, 1099x846, 1126x1058, 859x838, 1817x2061, and 2149x1479 in the respective tiles. To process the pictures for training and testing, all pictures should have equal dimensions, so the dataset needs to be preprocessed. The preprocessing is carried out by cropping each image and mask to a size divisible by 256. These images and masks are then patchified to a size of 256x256.
The sample images and their patchified versions are shown in figure 9. For example, tile 1 consists of images and masks of size 797x644; choosing the nearest dimensions divisible by 256 gives 768x512, which yields 6 patches. Similarly, tiles 2, 3, 4, 5, 6, 7 and 8 yield 2, 4, 12, 16, 9, 56 and 40 patches respectively. Each tile consists of 9 images, so a total of 1305 patches are available for both images and masks after patchifying. Masks are in RGB form and the class information is encoded as hexadecimal color codes, so the hexadecimal codes are converted to RGB values, the RGB labels are then mapped to integer class values, and finally to one-hot encodings. Segmented outputs need to be converted back into the original RGB colors; otherwise the colors of an image and its mask will differ and the corresponding mask of each image cannot be identified. Predicted tiles are merged back into a large image while minimizing blending artifacts or edge effects.

Figure 7: Proposed UnetPlusPlus model with MobileNetV2
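A condensed sketch of this preprocessing pipeline is given below; it assumes the patchify library, and the class color codes shown are illustrative placeholders rather than the exact values from the dataset.

import numpy as np
from patchify import patchify

def crop_and_patch(image, patch=256):
    # Crop the image to dimensions divisible by the patch size, then cut
    # non-overlapping 256x256 patches (step == patch size).
    h = (image.shape[0] // patch) * patch
    w = (image.shape[1] // patch) * patch
    cropped = image[:h, :w]
    return patchify(cropped, (patch, patch, 3), step=patch).reshape(-1, patch, patch, 3)

# Illustrative hexadecimal color codes for the six classes (placeholders).
CLASS_COLORS = ["#3C1098", "#8429F6", "#6EC1E4", "#FEDD3A", "#E2A929", "#9B9B9B"]

def rgb_mask_to_labels(mask):
    # Convert an RGB mask to integer class labels using the class color table.
    labels = np.zeros(mask.shape[:2], dtype=np.uint8)
    for idx, hex_code in enumerate(CLASS_COLORS):
        rgb = np.array([int(hex_code[i:i + 2], 16) for i in (1, 3, 5)])
        labels[np.all(mask == rgb, axis=-1)] = idx
    return labels   # one-hot encoding can then be obtained with np.eye(6)[labels]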
Figure 8: Unet++ decoder architecture
Figure 9: (a) Image 1 before patchify (b) Image 1 after patchify
Figure 10: Training and validation loss of proposed method

Figure 10 shows the training and validation loss of the proposed method. The training loss, which measures the discrepancy between predicted and actual values on the training data, shows a progressive decline from 0.83 at the tenth epoch to 0.65 at the hundredth epoch. In contrast, the validation loss, which evaluates the model's efficacy on a distinct dataset that was not used during training, exhibits an early decline from 0.83 to 0.72 but then undergoes a marginal rise by the hundredth epoch. Figure 11 shows the training and validation accuracy of the proposed method.
machine learning models, particularly in supervised learning settings, relies heavily on
the basic statistic of training accuracy. The performance of the model on the training
dataset is assessed by quantifying it via the calculation of the ratio between the num-
ber of properly predicted occurrences and the total number of examples in the training
dataset. The main objective of training accuracy is to assess the model’s proficiency
in acquiring the underlying patterns and connections present within the training data.
A high training accuracy is indicative of the model’s ability to effectively remember
the training examples and generate precise predictions on the data it was trained on.
Nevertheless, it is essential to acknowledge that achieving a high training accuracy does
not automatically ensure favorable performance when applied to novel, unobserved
data. Overfitting is a potential concern, since the model may inadvertently include
irrelevant noise or idiosyncratic features that hinder its ability to effectively generalize
to other datasets.
Figure 11: Training and validation accuracy of proposed method
the training accuracy experiences a slight reduction, reaching a value of 0.77. As the training process advances, the validation accuracy demonstrates a consistent upward trend, ultimately attaining a value of 0.77 at the sixtieth epoch; however, the training accuracy declines to 0.69. During the ninetieth epoch, there is a significant rise in the validation accuracy, reaching a value of 0.80. The maximum validation accuracy attained is 0.83, observed at the ninetieth epoch.
Precision: Precision is mathematically represented as

P = tp / (tp + fp)

where tp denotes true positives (correctly classified images) and fp denotes false positives (incorrectly classified images).
Recall: The ratio of correctly classified images to all related images in the database is known as recall. Recall is mathematically represented as

R = tp / (tp + fn)

where false negatives (fn) are images that belonged to the correct class but were incorrectly labelled by the classifier.
F1 score: The F1-score is the harmonic mean of recall and precision; a higher F1-score indicates that the system has more predictive power. Evaluating a system's performance requires more than just precision or recall. Mathematically, the F1-score is expressed as

F1 = 2 · (P · R) / (P + R)
Table 3: Original input images, ground-truth masks, and segmentation outputs of the various architectures (UNet++ with EfficientNet-B0, VGG19, ResNet50, and MobileNetV2 encoders) for four sample images (Image 1 to Image 4).
Mean IoU: The Intersection over Union (Jaccard index) is the ratio between the area of overlap of two regions and the area of their union; the two regions are the ground-truth observation and the prediction. The mathematical formula is

J(A, B) = |A ∩ B| / |A ∪ B| = |A ∩ B| / (|A| + |B| − |A ∩ B|)

where |A ∪ B| is the total area covered by both regions (union) and |A ∩ B| is the area common to both (intersection).
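For reference, the NumPy sketch below (illustrative only) computes precision, recall, F1 score and IoU from a pair of binary masks for a single class.

import numpy as np

def segmentation_metrics(pred, truth):
    # Precision, recall, F1 and IoU for binary masks (1 = pixel belongs to the class).
    pred, truth = pred.astype(bool).ravel(), truth.astype(bool).ravel()
    tp = np.sum(pred & truth)     # true positives
    fp = np.sum(pred & ~truth)    # false positives
    fn = np.sum(~pred & truth)    # false negatives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou = tp / (tp + fp + fn)     # |A ∩ B| / |A ∪ B|
    return precision, recall, f1, iou

p, r, f1, iou = segmentation_metrics(np.array([1, 1, 0, 0]), np.array([1, 0, 1, 0]))
print(p, r, f1, iou)   # 0.5 0.5 0.5 0.333...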
Table 1 shows the evaluation metrics of the proposed UNet++ with MobileNetV2 method. The metrics precision, Mean IoU, F1 score, and recall are determined for four different images of the corresponding dataset.
7 Conclusion
The proposed satellite image segmentation model, which combines the UNet++ ar-
chitecture with the lightweight MobileNet encoder, has demonstrated noteworthy
performance in accurately delineating features within satellite images. Through exten-
sive experiments conducted on a diverse satellite image dataset, the model achieved
an accuracy of 83 percent, a Mean IoU of 43.98, and a recall of 83.24. This level of accuracy is particularly promising for real-world applications such as disaster management, land cover classification, and environmental monitoring. The integration
of UNet++ for its hierarchical feature capturing capabilities and MobileNet for com-
putational efficiency has proven to be a successful fusion, striking a balance between
accuracy and resource efficiency. The achieved accuracy of 83 percent underscores the
model’s effectiveness in extracting valuable information from satellite imagery, making
it a compelling solution for remote sensing tasks in resource-constrained environments.
References
[1] Yuan, K., Zhuang, X., Schaefer, G., Feng, J., Guan, L., Fang, H.: Deep-learning-
based multispectral satellite image segmentation for water body detection. IEEE
Journal of Selected Topics in Applied Earth Observations and Remote Sensing
14, 7422–7434 (2021)
[2] Jia, H., Lang, C., Oliva, D., Song, W., Peng, X.: Dynamic harris hawks optimiza-
tion with mutation mechanism for satellite image segmentation. Remote sensing
11(12), 1421 (2019)
[3] Ghassemi, S., Fiandrotti, A., Francini, G., Magli, E.: Learning and adapting ro-
bust features for satellite image segmentation on heterogeneous data sets. IEEE
Transactions on Geoscience and Remote Sensing 57(9), 6517–6529 (2019)
[4] Rahaman, J., Sing, M.: An efficient multilevel thresholding based satellite image
segmentation approach using a new adaptive cuckoo search algorithm. Expert
Systems with Applications 174, 114633 (2021)
[5] Kotaridis, I., Lazaridou, M.: Remote sensing image segmentation advances: A
meta-analysis. ISPRS Journal of Photogrammetry and Remote Sensing 173, 309–
322 (2021)
[6] Pare, S., Kumar, A., Singh, G.K., Bajaj, V.: Image segmentation using multi-
level thresholding: a research review. Iranian Journal of Science and Technology,
Transactions of Electrical Engineering 44, 1–29 (2020)
[7] Gupta, A., Watson, S., Yin, H.: Deep learning-based aerial image segmenta-
tion with open data for disaster impact assessment. Neurocomputing 439, 22–33
(2021)
[8] Iqbal, J., Ali, M.: Weakly-supervised domain adaptation for built-up region seg-
mentation in aerial and satellite imagery. ISPRS Journal of Photogrammetry and
Remote Sensing 167, 263–275 (2020)
[9] Ovi, T.B., Mosharrof, S., Bashree, N., Islam, M.N., Islam, M.S.: Deeptrinet:
A tri-level attention-based deeplabv3+ architecture for semantic segmenta-
tion of satellite images. In: International Conference on Human-Centric Smart
Computing, pp. 373–384 (2023). Springer
[10] Fabel, Y., Nouri, B., Wilbert, S., Blum, N., Triebel, R., Hasenbalg, M., Kuhn,
P., Zarzalejo, L.F., Pitz-Paal, R.: Applying self-supervised learning for seman-
tic cloud segmentation of all-sky images. Atmospheric Measurement Techniques
15(3), 797–809 (2022)
[11] Wagner, F.H., Dalagnol, R., Sánchez, A.H., Hirye, M., Favrichon, S., Lee, J.H.,
Mauceri, S., Yang, Y., Saatchi, S.: K-textures, a self-supervised hard clus-
tering deep learning algorithm for satellite image segmentation. Frontiers in
Environmental Science 10, 946729 (2022)
[12] Boulila, W., Khlifi, M.K., Ammar, A., Koubaa, A., Benjdira, B., Farah, I.R.: A
hybrid privacy-preserving deep learning approach for object classification in very
high-resolution satellite images. Remote Sensing 14(18), 4631 (2022)
[13] Li, W., Chen, H., Shi, Z.: Semantic segmentation of remote sensing images with
self-supervised multitask representation learning. IEEE Journal of Selected Topics
in Applied Earth Observations and Remote Sensing 14, 6438–6450 (2021)
[14] Li, H., Li, Y., Zhang, G., Liu, R., Huang, H., Zhu, Q., Tao, C.: Global and
local contrastive self-supervised learning for semantic segmentation of hr remote
sensing images. IEEE Transactions on Geoscience and Remote Sensing 60, 1–14
(2022)
[15] Sun, W., Gao, Z., Cui, J., Ramesh, B., Zhang, B., Li, Z.: Semantic segmentation
leveraging simultaneous depth estimation. Sensors 21(3), 690 (2021)
[16] Zürn, J., Burgard, W., Valada, A.: Self-supervised visual terrain classification
from unsupervised acoustic feature learning. IEEE Transactions on Robotics
37(2), 466–481 (2020)
[17] Dong, H., Ma, W., Wu, Y., Zhang, J., Jiao, L.: Self-supervised representation
learning for remote sensing image change detection based on temporal prediction.
Remote Sensing 12(11), 1868 (2020)
[18] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neu-
ral networks. In: International Conference on Machine Learning, pp. 6105–6114
(2019). PMLR
[19] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale
image recognition. arXiv preprint arXiv:1409.1556 (2014)
[20] Ikechukwu, A.V., Murali, S., Deepu, R., Shivamurthy, R.: Resnet-50 vs vgg-19 vs
training from scratch: A comparative analysis of the segmentation and classifica-
tion of pneumonia from chest x-ray images. Global Transitions Proceedings 2(2),
375–381 (2021)
[21] Shabbir, A., Ali, N., Ahmed, J., Zafar, B., Rasheed, A., Sajid, M., Ahmed, A.,
Dar, S.H.: Satellite and scene image classification based on transfer learning and
fine tuning of resnet50. Mathematical Problems in Engineering 2021, 1–18 (2021)