
Hanoi University of Science and Technology
School of Information and Communication Technology

Project Report - Deep Learning

Unsupervised Domain Adaptation


Using Generative Adversarial
Networks for Semantic Segmentation
of Aerial Images

Supervisor

Prof. Nguyen Thi Kim Anh

Students

Vo Minh Tri - Student’s ID : 20210817


Vu Trinh Kim - Student’s ID : 20215409
Nguyen Duc Thinh - Student’s ID : 20210817
Class: 144936

Hanoi, 12/2023
Contents

1 Introduction
  1.1 Problem description

2 Dataset
  2.1 Potsdam and Vaihingen datasets
  2.2 Domain Shift Analysis

3 Preliminaries
  3.1 CNN
  3.2 Attention and Self-attention
  3.3 GAN
    3.3.1 The Structure of GANs

4 Algorithm Proposal
  4.1 Unet
  4.2 UnetPlusPlus
  4.3 UperNet
  4.4 BANet (Bilateral Awareness Network)
  4.5 MANet (Multi-Attention-Network)
  4.6 UnetFormer
  4.7 CycleGAN
    4.7.1 Adversarial Loss
    4.7.2 Cycle Consistency Loss
    4.7.3 Generator network
    4.7.4 Discriminator network

5 Experimental Results
  5.1 Experimental settings
  5.2 Experimental results
    5.2.1 Segmentation on Potsdam datasets
    5.2.2 Segmentation on Vaihingen datasets

6 Conclusion and Discussion

7 Appendix

1 Introduction

1.1 Problem description


In the era when manual cartography shaped city maps organically, reflecting the grad-
ual expansion of urban landscapes, the current trend of rapid urbanization necessitates a
departure from traditional hand-drawn maps. To meet this challenge, we advocate for an
automated mapping approach driven by extensive city development data. However, the dy-
namic nature of urban growth poses a potential risk, challenging the generalization capacity
of conventional deep learning models over time. Recognizing this, we propose the implemen-
tation of adaptive models capable of dynamically adjusting to the evolving urban fabric.

This adaptive paradigm ensures the sustained accuracy and relevance of our mapping
systems, effectively capturing the rapid transformations inherent in modern urban environ-
ments. It provides a robust foundation for future urban planning endeavors. The continuous
development of cities results in ongoing changes in urban data, presenting diverse data for-
mats across different urban settings, which in turn gives rise to challenges. Additionally,
smaller regions often confront data limitations, creating complexity when training models for
specific datasets. The scarcity of data in these areas adds intricacy to the training process,
complicating the management and deployment of models.

To address this multifaceted challenge, we propose a flexible solution: employing a


model with the capability to adapt seamlessly to various data formats. This adaptive ap-
proach aims to streamline the training process, mitigate complexities associated with data
scarcity, and enhance the overall efficiency in managing and deploying urban mapping sys-
tems. By embracing flexibility in model compatibility, we strive to establish a resilient and
adaptable framework that can effectively address the dynamic nature of urban landscapes,
providing a solid foundation for future urban planning initiatives.

In our pursuit of an effective solution for modern urban mapping, we deliberately


choose to harness the power of deep learning, specifically leveraging segmentation models and
unsupervised adaptation techniques. Deep learning, as a field within artificial intelligence, has
exhibited remarkable capabilities in handling complex and intricate data patterns, making it
particularly well-suited for the nuanced intricacies of urban landscapes.

Segmentation models, a subset of deep learning, prove invaluable in the context of


urban mapping due to their ability to categorize and identify distinct features within an
image, such as roads, buildings, and green spaces. By employing segmentation models,
we aim to enhance the precision and granularity of our mapping process, ensuring a more

accurate representation of the evolving urban fabric. This approach allows for a detailed and
comprehensive understanding of the urban environment, enabling better-informed decision-
making in urban planning.

Unsupervised adaptation models further contribute to the efficacy of our mapping ap-
proach. The dynamic nature of urban growth often leads to variations in data distributions
across different cities and regions. Unsupervised adaptation techniques enable our models to
autonomously adjust and learn from the local characteristics of specific urban areas without
the need for extensive labeled datasets. This adaptability is crucial in maintaining the gener-
alization capacity of our models over time, mitigating the risks associated with the evolving
nature of urban landscapes.

By combining deep learning, segmentation models, and unsupervised adaptation tech-


niques, we create a robust and flexible framework that not only accurately captures the
intricate details of urban environments but also adapts to the unique characteristics of each
city. This approach ensures that our mapping systems remain relevant and effective, even in
the face of rapid urbanization and diverse data challenges, providing a solid foundation for
informed urban planning endeavors.

2 Dataset

2.1 Potsdam and Vaihingen datasets


To validate our methodology, we used the ISPRS (WGII/4) 2D semantic segmentation
benchmark dataset [9]. It is offered by the ISPRS 2D semantic labeling challenge that
currently provides the best platform to evaluate semantic segmentation algorithms for aerial
images. We used the Vaihingen and Potsdam datasets, which are publicly available to the
community. Although digital surface model (DSM) data are provided for every image, we
only used the image data as we were targeting domain adaptation using only image data.
Both datasets contain very-high resolution images with a resolution of 9 cm for Vaihingen
images and 5 cm for Potsdam images. Note that the resolutions are different in both datasets,
and this represents one of the factors that require domain adaptation. These resolutions are
categorized in aerial imagery as very high resolution (VHR) and are helpful in recognizing
objects clearly. In addition, this helps to maximize the intraclass variance and minimize
the interclass variance by providing more details about objects. All images in both datasets
are provided with their semantic segmentation labels, which comprise six classes of ground
objects: building, tree, car, impervious surfaces, low vegetation, and clutter/background.

Impervious surfaces indicate a paved area with no building on it. The clutter/background
category refers to all the ground objects that are not included in the other five categories. The
Vaihingen dataset includes 33 TOP images with sizes near to 2000 × 2000 pixels. All these
33 TOP images are released with the ground truth. The TOP file contains three channels:
Infrared, red, and green bands. Among the 33 TOP images, 27 TOP images were used for
training, and 6 images were used for the test. The Potsdam dataset is a larger dataset that
contains 38 TOP images with a fixed size of 6000 × 6000 pixels. All these images are released
with their ground truth. The TOP files for Potsdam contain 3 different spectral channels:
red, green, and blue. Among the 38 TOP images, 32 images were used for the training, and
6 images were used for the test. To train the segmentation model, we divided the images and
their labels into squares of a size of 256 x 256 and fed the network with uniform patches of
a size of 256 x 256.
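As a concrete illustration of this tiling step, the minimal sketch below cuts one large TOP tile into non-overlapping 256 x 256 patches. The function name, the use of NumPy arrays, and the choice to discard incomplete border patches are assumptions for illustration, not the exact preprocessing used in our experiments.

```python
import numpy as np

def tile_image(image: np.ndarray, patch: int = 256) -> np.ndarray:
    """Cut an (H, W, C) aerial tile into non-overlapping patch x patch squares.

    Border pixels that do not fill a complete patch are discarded here;
    padding or overlapping windows would be equally valid choices.
    """
    h, w = image.shape[:2]
    patches = []
    for top in range(0, h - patch + 1, patch):
        for left in range(0, w - patch + 1, patch):
            patches.append(image[top:top + patch, left:left + patch])
    return np.stack(patches)

# Example: a 6000 x 6000 Potsdam tile with 3 channels yields 23 * 23 = 529 patches.
dummy_tile = np.zeros((6000, 6000, 3), dtype=np.uint8)
print(tile_image(dummy_tile).shape)  # (529, 256, 256, 3)
```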

The datasets consist of 6 classes: Impervious surfaces, Buildings, Low vegetation, Trees, Cars, and Clutter. The distribution of the classes is as follows:

Category Percentage On Dataset (%)


Impervious surfaces 29.9
Buildings 28.2
Low vegetation 20.9
Tree 14.4
Cars 1.7
Clutter 4.8

2.2 Domain Shift Analysis


Digging deeper, the shift in domains from the source domain (Potsdam) to the target
domain (Vaihingen) is influenced by three key factors. First and foremost is the imaging
sensor factor; Vaihingen’s images are captured using a 3-band sensor (IRRG - infrared, red,
green), while Potsdam’s images use an RGB sensor (red, green, blue). Notably, the repre-
sentation of the class vegetation and trees, characterized by green color in the RGB sensor of
the Potsdam dataset, undergoes a transformation to red in the Vaihingen dataset due to the
change in sensor. This alteration significantly impacts the segmentation model’s accuracy,
contributing to a pronounced domain shift.

The second factor is the resolution, with Vaihingen’s images having a resolution of 9
cm per pixel and Potsdam’s images having a higher resolution of 5 cm per pixel. This change

in resolution can affect the segmentation model’s ability to precisely identify classes, thereby
introducing a domain shift.

The third factor contributing to the domain shift is the structural representation of
classes. Differences emerge in the representation of various classes when transitioning from
the Potsdam dataset to the Vaihingen dataset. For instance, buildings in both datasets are
similar, reflecting the modern German town architecture. However, classes like low vegetation
and trees display clear distinctions, particularly due to Vaihingen containing agricultural
areas, a feature absent in Potsdam. The types of trees and vegetation also differ between the
two datasets, with the contrast more evident in the low vegetation class.

The combined impact of these factors results in a domain shift between Potsdam and
Vaihingen, prompting an exploration of the effect of our proposed algorithm in mitigating
domain shifts related to each factor. Table 1 summarizes the influence of these factors on
the domain shift for each class, and our analysis is based on a thorough examination of each
class in both domains.

It is noteworthy that the resolution has a low effect on the domain shift for all classes,
as the model’s feature extraction layers can effectively handle the scale difference. The sensor
factor significantly influences the building class, making it an ideal case to study the impact
of our algorithm on reducing domain shifts caused by the sensor factor alone. Similarly, the
trees class, affected by both the sensor factor and class representation factor, will be studied
to assess the algorithm’s effectiveness. The impervious surfaces and cars classes, minimally
impacted by the three factors, serve as cases to explore the algorithm’s effect on classes
unaffected by domain shifts. Lastly, the low vegetation and clutter classes, highly influenced
by the sensor factor and class representation factor, are examined to evaluate the algorithm’s
efficacy in reducing domain shifts associated with these combined factors.

Factor of Domain Shift Resolution Sensor Class Representation


Impervious Surfaces low low low
Buildings low high low
Low Vegetation low high high
Trees low high medium
Cars low low low
Clutter low high high

Table 1: Effect of Domain Shift Factors on Each Class (Potsdam to Vaihingen)

3 Preliminaries

3.1 CNN
Convolutional Neural Networks (CNNs) are a deep learning algorithm specifically de-
signed for processing grid-like data such as images, audio signals, and time series data. The
CNN architecture used for image classification comprises convolutional layers, max-pooling
layers, and fully connected layers. The following provides a general description of each layer
in a CNN:

Convolutional layers A sequence of convolutional layers applies a set of filters to the input image. The purpose of these convolutional layers is to extract significant features from the input, including edges, lines, textures, and shapes. Each filter, represented as a small matrix of weights, is convolved with the input image, resulting in a feature map.

Figure 1: Convolutional layer

Filter A filter of size f × f applied to an input with C channels has a total size of f × f × C and performs convolution on an input of size H × W × C, resulting in a feature map of size P × Q × 1.

With k filters of size f × f, the result is a feature map of size P × Q × k.

Stride For convolution or pooling, the stride s represents the number of pixels the window moves after each calculation.

When s > 1, the resulting image is smaller than the input image.

When f ≥ s, the image is completely covered.
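The snippet below illustrates the filter and stride arithmetic described above in PyTorch; the concrete sizes (H = W = 256, C = 3, f = 3, s = 2, k = 64 filters) are example values only.

```python
import torch
import torch.nn as nn

def conv_output_size(n: int, f: int, s: int, p: int = 0) -> int:
    # Output spatial size along one dimension: floor((n + 2p - f) / s) + 1.
    return (n + 2 * p - f) // s + 1

# k filters of size f x f over an H x W x C input give a P x Q x k feature map.
H, W, C, f, s, k = 256, 256, 3, 3, 2, 64
conv = nn.Conv2d(in_channels=C, out_channels=k, kernel_size=f, stride=s)
x = torch.randn(1, C, H, W)
print(conv(x).shape)              # torch.Size([1, 64, 127, 127])
print(conv_output_size(H, f, s))  # 127
```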

Pooling layers

In addition to convolutional layers, pooling layers are commonly incorporated in CNNs
to downsample the feature maps produced by the convolutional layer. This downsampling
process aids in reducing the dimensionality of the feature maps.

Figure 2: Pooling Layer

Fully connected layer The fully connected layer receives flattened data as input, where each input element is connected to every neuron. In CNNs, fully connected layers are often used at the end of the network to produce the final outputs of the task, such as class scores.

Figure 3: Fully Connected Layer

* Common activation functions

ReLU (Rectified Linear Unit) is an activation function f applied element-wise to increase the nonlinearity of the network.

Figure 4: ReLU illustration

Creating nonlinearity in the model is necessary: for example, with datasets that have complex distributions, a linear function alone is not expressive enough to represent them.

Softmax can be thought of as a generalized logistic function that takes as input a vector of values $x \in \mathbb{R}^n$ and outputs a vector of probabilities $p \in \mathbb{R}^n$:

$$p = (p_1, \ldots, p_n)^\top \quad \text{with} \quad p_i = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}$$

The Softmax function limits each output value to the range (0, 1). This helps prevent the results of subsequent operations from becoming very large or very small, which would cause numerical problems and make it difficult for the network to converge.

* Popular techniques

Batch normalization Controlling the input distribution between network layers can significantly speed up training and improve overall performance. Accordingly, the input distribution of a layer, with statistics (σ, µ), is normalized so that it has zero mean and unit standard deviation:

$$y = \mathrm{BN}(x) = \gamma \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

where ε is a small constant to avoid numerical problems, and γ, β are parameters learned by the network during training: γ corrects the variance of the distribution and β is a bias that shifts the distribution left or right.
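A minimal sketch of this normalization formula, assuming an (N, C, H, W) activation tensor and training-mode statistics only (the running averages used at inference are omitted):

```python
import torch

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize per channel over the batch and spatial dims, then rescale and shift.

    x: (N, C, H, W); gamma, beta: learnable per-channel parameters of shape (C,).
    """
    mu = x.mean(dim=(0, 2, 3), keepdim=True)
    var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
    x_hat = (x - mu) / torch.sqrt(var + eps)
    return gamma.view(1, -1, 1, 1) * x_hat + beta.view(1, -1, 1, 1)

x = torch.randn(8, 64, 32, 32)
y = batch_norm(x, torch.ones(64), torch.zeros(64))
print(y.mean().item(), y.std().item())  # close to 0 and 1
```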

Dropout

Dropout is a technique to combat overfitting. During training, some output features are randomly omitted by the dropout layer with a dropout rate of p. This reduces the network's dependence on features whose values are much larger than the rest; otherwise those features would be treated as the only important information and assigned high weights, while other information would be treated as noise and not learned, leading the network to memorize examples and generalize poorly.

Figure 5: Dropout illustration

3.2 Attention and Self-attention


Attention is the process of computing a context vector for the next decoder step that
contains the most relevant information from all of the encoder hidden states by performing
a weighted average on the encoder hidden states. How much each encoder state contributes
to the weighted average is determined by an alignment score between that encoder state and
the previous hidden state of the decoder.

More generally, we can consider the previous decoder state as the query vector, and
the encoder hidden states as key and value vectors. The output is a weighted average of the
value vectors, where the weights are determined by the compatibility function between the
query and the keys. Note that the keys and values can be different sets of vectors.

The above can be summarized by the following equations. Given a query q, values (v1 ,
..., vn ), and keys (k1 , ..., kn ) we compute output z:

$$z = \sum_{j=1}^{n} \alpha_j \cdot v_j \qquad (1)$$

$$\alpha_j = \frac{\exp(f(k_j, q))}{\sum_{i=1}^{n} \exp(f(k_i, q))} \qquad (2)$$

$\alpha_j$ is computed using the softmax function, where $f(k_i, q)$ is the compatibility score between $k_i$ and $q$.

For the compatibility function, we will be using the scaled dot-product function

$$f(k, q) = \frac{k\, q^\top}{\sqrt{d_k}} \qquad (3)$$

where dk is the dimension of the key vectors. This scaling is done to improve numerical
stability as the dimension of keys, values, and queries grows.
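A small sketch of Eqs. (1)-(3) for a single query; the tensor shapes and variable names are illustrative assumptions.

```python
import torch

def attend(query, keys, values):
    """Single-query attention following Eqs. (1)-(3).

    query: (d_k,), keys: (n, d_k), values: (n, d_v)
    """
    d_k = query.shape[-1]
    scores = keys @ query / d_k ** 0.5     # scaled dot-product compatibility, Eq. (3)
    alphas = torch.softmax(scores, dim=0)  # attention weights, Eq. (2)
    return alphas @ values                 # weighted average of the values, Eq. (1)

q = torch.randn(64)
K, V = torch.randn(10, 64), torch.randn(10, 128)
print(attend(q, K, V).shape)  # torch.Size([128])
```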

In the process of development, attention mechanisms in computer vision have evolved

into 2 main categories:

• Soft attention: The model assigns a weight to each element in the input sequence,
and the final output is a weighted sum of these elements. The weights are determined
by a softmax function, ensuring that the weights sum to 1.

• Hard attention: The model selects a subset of elements from the input sequence, and
the output is based only on this selected subset. Unlike soft attention, hard attention
involves discrete decisions.

Figure 6: Soft attention vs Hard attention

Self-attention is the process of applying the attention mechanism outlined above


to every position of the source sequence. This is done by creating three vectors (query,
key, value) for each sequence position, and then applying the attention mechanism for each
position xi , using the xi query vector and key and value vectors for all other positions. As
a result, an input sequence X = (x1 , x2 , ..., xn ) of words is transformed into a sequence Y
= (y1 , y2 , ..., yn ) where yi incorporates the information of xi as well as how xi relates to
all other positions in X. The (query, key, value) vectors can be created by applying learned
linear projections or using feed-forward layers. This computation can be done for the entire
source sequence in parallel by grouping the queries, keys and values in Q, K, V matrices:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V \qquad (4)$$

Furthermore, instead of performing self-attention once for (Q, K, V) of dimension $d_{model}$, multi-head attention performs attention h times on projected (Q, K, V) matrices of dimension $d_{model}/h$. For each head, the (Q, K, V) matrices are uniquely projected to dimension $d_{model}/h$ and self-attention is performed to yield an output of dimension $d_{model}/h$. The outputs of the heads are then concatenated and once again a linear projection layer is applied, resulting in an output of the same dimensionality as performing self-attention once on the original (Q, K, V) matrices. This process is described by the following formulas:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O \qquad (5)$$

where
$$\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V) \qquad (6)$$

and the projections are parameter matrices $W_i^Q \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^K \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{model} \times d_v}$ and $W^O \in \mathbb{R}^{h d_v \times d_{model}}$.
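A compact PyTorch sketch of Eqs. (4)-(6); dropout and masking are omitted, and the chosen d_model = 64 and h = 8 are only example values.

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Minimal multi-head self-attention following Eqs. (4)-(6)."""

    def __init__(self, d_model: int = 64, heads: int = 8):
        super().__init__()
        assert d_model % heads == 0
        self.h, self.d_k = heads, d_model // heads
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)  # the output projection W^O

    def forward(self, x):  # x: (batch, seq, d_model)
        b, n, _ = x.shape
        # Project and split into h heads of dimension d_model / h.
        q, k, v = (w(x).view(b, n, self.h, self.d_k).transpose(1, 2)
                   for w in (self.w_q, self.w_k, self.w_v))
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_k ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, -1)  # concatenate the heads
        return self.w_o(out)

x = torch.randn(2, 100, 64)
print(MultiHeadSelfAttention()(x).shape)  # torch.Size([2, 100, 64])
```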

Figure 7: Multi-head attention

3.3 GAN
Generative Adversarial Networks (GANs) are a powerful class of neural networks that
are used for unsupervised learning. GANs are made up of two neural networks, a discrimina-
tor and a generator. They use adversarial training to produce artificial data that closely resembles real data. The Generator, starting from random noise samples, attempts to fool the Discriminator, which is tasked with accurately distinguishing between generated and genuine data. Realistic, high-quality samples are produced as a result of this competitive inter-
action, which drives both networks toward advancement. GANs are proving to be highly
versatile artificial intelligence tools, as evidenced by their extensive use in image synthesis,
style transfer, and text-to-image synthesis.

3.3.1 The Structure of GANs

Generative Adversarial Networks (GANs) can be broken down into three parts:

Figure 8: GAN structure

• Generative: To learn a generative model, which describes how data is generated in


terms of a probabilistic model.

• Adversarial: The word adversarial refers to setting one thing up against another. This
means that, in the context of GANs, the generative result is compared with the actual
images in the data set. A mechanism known as a discriminator is used to apply a model
that attempts to distinguish between real and fake images.

• Discriminator: Tries to evaluate the output generated by the generator against the original samples, and outputs a value between 0 and 1. If the value is close to 0, the generated sample is judged to be fake, and if the value is close to 1, the generated sample is judged to be real.

4 Algorithm Proposal
We first started by training a segmentation model on the source dataset. We chose Potsdam as the source dataset because it is far larger than the Vaihingen dataset. In fact, in real scenarios, target datasets are smaller and less structured than source datasets. We performed the segmentation using popular segmentation models, and then trained a CycleGAN model using two datasets: one for Potsdam and the other for Vaihingen.

Once the training of the proposed GAN architecture was done, we used it to translate
the full dataset of the source domain (Potsdam) to the target domain (Vaihingen). We note
that the global style of the translated image is imitating the style of the target domain. The
images generated are similar to what we can get as new images of the Potsdam town using
the IRRG sensor used for Vaihingen images.

When the translated dataset was ready, we used it to fine-tune the trained segmentation model. We did the fine-tuning process epoch by epoch, and we tested the model on the target dataset after every epoch to measure the improvement of average accuracy on the target dataset.

4.1 Unet
UNet is a convolutional neural network (CNN) architecture designed for semantic seg-
mentation tasks in image processing. Developed for biomedical image segmentation originally,
UNet has found widespread applications in various domains. Its distinctive architecture com-
prises a contracting path to capture context and a symmetric expansive path to enable precise
localization.

The primary strength of UNet lies in its ability to handle image segmentation, a task
where the objective is to classify each pixel in an image into a specific class. This architecture
excels in capturing intricate details and fine structures within images, making it particularly
well-suited for tasks such as medical image segmentation, where precise delineation of struc-
tures like organs, tumors, or cells is crucial.

The UNet architecture has been adapted and extended in numerous ways to address
specific challenges and requirements in different applications. Its versatility has led to its use
in diverse fields beyond medical imaging, such as satellite image analysis, scene segmentation
in autonomous vehicles, and more.

Figure 9: UNet architecture

The architecture of U-Net is unique in that it consists of a contracting path and an


expansive path. The contracting path contains encoder layers that capture contextual in-
formation and reduce the spatial resolution of the input, while the expansive path contains
decoder layers that decode the encoded data and use the information from the contracting
path via skip connections to generate a segmentation map.

The contracting path in UNet is responsible for identifying the relevant features in
the input image. The encoder layers perform convolutional operations that reduce the spatial
resolution of the feature maps while increasing their depth, thereby capturing increasingly
abstract representations of the input. This contracting path is similar to the feedforward
layers in other convolutional neural networks.

On the other hand, the expansive path works on decoding the encoded data and
locating the features while maintaining the spatial resolution of the input. The decoder
layers in the expansive path upsample the feature maps, while also performing convolutional
operations. The skip connections from the contracting path help to preserve the spatial
information lost in the contracting path, which helps the decoder layers to locate the features
more accurately.
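The toy module below sketches the contracting path, expansive path, and skip connection described above with a single level; a real UNet stacks several such levels, so this is only illustrative.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    # Two 3x3 convolutions with ReLU: the basic UNet building block.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True))

class TinyUNet(nn.Module):
    """One contracting and one expansive level joined by a skip connection."""

    def __init__(self, in_ch=3, n_classes=6):
        super().__init__()
        self.enc = conv_block(in_ch, 64)
        self.down = nn.MaxPool2d(2)
        self.bottleneck = conv_block(64, 128)
        self.up = nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2)
        self.dec = conv_block(128, 64)            # 128 = 64 (skip) + 64 (upsampled)
        self.head = nn.Conv2d(64, n_classes, 1)   # per-pixel class scores

    def forward(self, x):
        e = self.enc(x)                                       # contracting path
        b = self.bottleneck(self.down(e))
        d = self.dec(torch.cat([self.up(b), e], dim=1))       # skip connection
        return self.head(d)

print(TinyUNet()(torch.randn(1, 3, 256, 256)).shape)  # torch.Size([1, 6, 256, 256])
```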

4.2 UnetPlusPlus
U-Net++ or Nested U-Net is a deep learning architecture that was introduced in 2019.
In UNet, the encoder part captures high-level features from the input image through a series of
convolutional and pooling layers, while the decoder part upsamples these features to generate
a dense segmentation map. However, there can be a semantic gap between the encoder and

decoder features, meaning that the decoder may struggle to reconstruct fine-grained details
and produce accurate segmentation.

UNet++ introduces the concept of nested skip pathways to bridge this semantic gap.
It adds additional skip connections between the encoder and decoder blocks at multiple res-
olutions. These connections allow the decoder to access and incorporate both low-level and
high-level features from the encoder, providing a more detailed and comprehensive under-
standing of the image. They improved the traditional U-Net architecture by redesigning the
skip connections and introducing a deeply supervised nested encoder-decoder network.

Figure 10: UNet++ architecture

Instead of a traditional skip connection, the feature map from the lower level is also convolved with the upper-level feature, and the new combined feature map is then passed further. The basic idea behind UNet++ is to bridge the semantic gap between the feature maps of the encoder and decoder before the fusion.

4.3 UperNet
The UPerNet model is designed for the task of segmentation, which involves parsing
and comprehending various visual elements within a given scene. The model leverages the
concept of a unified framework that combines bottom-up and top-down pathways for feature
extraction and semantic segmentation. By integrating multi-scale features and contextual
information, UPerNet aims to achieve better performance in segmentation tasks.

UPerNet was introduced as an extension of the popular Fully Convolutional Network


(FCN) architecture, incorporating additional modules such as Feature Pyramid Network
(FPN) and Pyramid Pooling Module (PPM):

• Visual backbone: The UPerNet framework is compatible with multiple visual backbones, from simple ones such as ResNet-18 to modern backbones like Swin or the Vision Transformer. In this

Figure 11: UperNet general framework

project, we used ResNet-50, which is relatively lightweight compared to other networks.

• PPM Head: The Pyramid Pooling Module (PPM) is a feature extraction component
commonly used in computer vision models. First introduced in the PSPNet architec-
ture, the PPM aims to capture multi-scale contextual information from an input feature
map.

• Feature Pyramid Network: The Feature Pyramid Network was introduced in the paper "Feature Pyramid Networks for Object Detection". To address the challenge of detecting objects at different scales, the FPN creates a feature pyramid that combines features from different levels of spatial resolution while preserving semantic information.

The Structure of Resnet-50

ResNet-50 is a specific variant of the Residual Network (ResNet) architecture, which is


a deep convolutional neural network (CNN) that achieved remarkable performance in various
computer vision tasks. Here are some key characteristics and components of ResNet-50:

• Skip Connections: ResNet-50 introduces skip connections (also known as shortcut con-
nections or identity mappings) to address the vanishing gradient problem. By allowing
the flow of gradients directly from earlier layers to later layers, skip connections enable
the network to better retain and propagate information through the network, reducing
the degradation of performance as the network gets deeper.

• Bottleneck Architecture: ResNet-50 uses a bottleneck architecture in the residual


blocks, which consists of three convolutional layers: 1x1, 3x3, and 1x1. The 1x1 convo-
lutions are used to reduce and then restore the dimensions of the input feature maps,
reducing computational complexity while maintaining representational capacity.

Used as a backbone, instead of classifying the input image, the model outputs multiple feature maps of different sizes as it downsamples the image.

Figure 12: Resnet-50 backbone

Figure 13: Pyramid Pooling Module

Pyramid Pooling Module

The PPM aims to capture multi-scale contextual information from an input feature
map. It divides the feature map into multiple levels, with each level representing a different
scale. The PPM then performs pooling operations within each level to aggregate information
and generate fixed-length representations.

The Pyramid Pooling Module consists of the following steps:

• Dividing the feature map: The input feature map is divided into several levels, each
having a different spatial resolution. These levels are created by applying pooling
operations with different kernel sizes (1, 2, 3, 6).

• Convolution layer: Each level is then passed through a convolution layer to capture information from neighboring regions.

• Upsampling and concatenation: The representations from different levels are then up-
sampled to the original spatial resolution and concatenated together. This combination
of features from multiple scales allows for the integration of both local and global con-
textual information.
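A sketch of the PPM steps above, assuming a ResNet-50 style input with 2048 channels and 512 channels per pooled branch; the exact channel counts used inside UPerNet may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PPM(nn.Module):
    """Pyramid Pooling Module sketch with bin sizes (1, 2, 3, 6)."""

    def __init__(self, in_ch=2048, branch_ch=512, bins=(1, 2, 3, 6)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(nn.AdaptiveAvgPool2d(b),                    # pool to b x b bins
                          nn.Conv2d(in_ch, branch_ch, 1, bias=False),
                          nn.BatchNorm2d(branch_ch),
                          nn.ReLU(inplace=True))
            for b in bins)

    def forward(self, x):
        h, w = x.shape[-2:]
        # Upsample each pooled branch back to the input resolution and concatenate.
        pyramids = [F.interpolate(branch(x), size=(h, w), mode='bilinear',
                                  align_corners=False) for branch in self.branches]
        return torch.cat([x] + pyramids, dim=1)

feat = torch.randn(1, 2048, 8, 8)   # e.g. the last ResNet-50 feature map
print(PPM()(feat).shape)            # torch.Size([1, 4096, 8, 8])
```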

Feature Pyramid Network

The FPN introduces a top-down pathway to facilitate the integration of high-resolution


features with semantically strong, low-resolution features. This pathway involves upsampling
the feature maps from higher levels and fusing them with the corresponding feature maps
from lower levels through lateral connections. The fused feature maps form a feature pyramid
that includes features at multiple scales.

These feature maps are then rescaled and concatenated into a final feature map, which is passed through a segmentation head to produce the final result.

4.4 BANet (Bilateral Awareness Network)


BANet innovatively tackles the complexities of urban scene segmentation by construct-
ing two distinctive feature extraction paths. The first, a texture path, utilizes stacked con-
volution layers to adeptly extract textural features (TF) from the images. In parallel, a
dependency path employs Transformer blocks to capture long-range dependency features
(LDF), enhancing the network’s ability to understand intricate urban scenes. To harness the
complementary advantages provided by these two feature paths, BANet uses Feature Aggre-
gation Module (FAM). FAM incorporates a linear attention mechanism, mitigating fitting
residuals in the fused features and thereby fortifying the network’s generalization capabilities.
Furthermore, the uniquely designed bilateral structure proves to be a versatile solution with
applications extending beyond semantic segmentation. Its adaptability makes it well-suited
for tasks such as object detection and change detection, marking a significant stride in the
integration of deep learning techniques within the remote sensing domain.

Figure 14: BANet architecture

Dependency path: employs a stem block and four transformer stages (i.e., Stage 1-4) to extract long-range dependency features (LDF). Each stage consists of two efficient transformer blocks (ETB). In particular, Stage 2, Stage 3, and Stage 4 additionally involve patch embedding (PE) operations. Processed by the dependency path, two long-range dependency features (i.e., LDF3 and LDF4) are generated.
Stem block: The stem block aims to shrink the height and width dimensions and expand the channel dimension. To capture low-level information effectively, it introduces three 3x3 convolution layers with strides of [2, 1, 2]. The first two convolution layers are followed by batch norm (BN) and ReLU activation. Processed by the stem block, the spatial resolution is downscaled by a factor of 4, and the channel dimension is extended from 3 to 64.
Patch embedding (PE): The patch embedding aims to down-sample the feature map to build hierarchical features. The output of each patch embedding can be formalized as:

$$PE(X_1) = \mathrm{Sigmoid}\big(\mathrm{DWConv}(X_1)\big) \cdot X_1 \qquad (7)$$

$$X_1 = \mathrm{BN}(W_s \cdot X) \qquad (8)$$

where $W_s$ represents a convolution layer with a kernel size of s+1 and a stride of s. Here s is set to 2. DWConv denotes a 3x3 depth-wise convolution with a stride of 1.
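One possible PyTorch reading of Eqs. (7)-(8); the padding choice and the fixed channel count of 64 are assumptions and may not match the official BANet implementation.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Sketch of Eqs. (7)-(8): strided conv + BN, gated by a depth-wise convolution."""

    def __init__(self, channels: int = 64, s: int = 2):
        super().__init__()
        self.ws = nn.Conv2d(channels, channels, kernel_size=s + 1,
                            stride=s, padding=s // 2)             # W_s in Eq. (8)
        self.bn = nn.BatchNorm2d(channels)
        self.dwconv = nn.Conv2d(channels, channels, kernel_size=3,
                                padding=1, groups=channels)       # 3x3 depth-wise conv

    def forward(self, x):
        x1 = self.bn(self.ws(x))                                  # Eq. (8)
        return torch.sigmoid(self.dwconv(x1)) * x1                # Eq. (7)

print(PatchEmbedding()(torch.randn(1, 64, 64, 64)).shape)  # torch.Size([1, 64, 32, 32])
```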
Efficient transformer block (ETB): Each efficient transformer block is composed of efficient multi-head self-attention (EMSA), a multilayer perceptron (MLP) and layer norm (LN). The output of each efficient transformer block can be formalized as:

$$ETB(X) = G(X) + \mathrm{MLP}(\mathrm{LN}(G(X))) \qquad (9)$$

$$G(X) = X + \mathrm{EMSA}(Q, K, V) \qquad (10)$$

$$\mathrm{EMSA}(Q, K, V) = \mathrm{LP}\!\left(\mathrm{IN}\!\left(\mathrm{Softmax}\!\left(\mathrm{Conv}\!\left(\frac{Q K^\top}{\sqrt{m}}\right)\right)\right) \cdot V\right) \qquad (11)$$

Figure 15: EMSA flowchart

Texture path: is a lightweight convolutional network, which stacks four diverse convolutional layers to capture textural information. Each layer has a batch normalization and a ReLU activation. The convolutional layer of T1 has a kernel size of 7 and a stride of 2, which expands the channel dimension from 3 to 64. For T2 and T3, the kernel size and stride are 3 and 2, respectively, and the channel dimension is kept at 64. For T4, the convolutional layer is a standard 1x1 convolution with a stride of 1, expanding the channel dimension from 64 to 128. Thus, the output textural feature is downscaled 8 times and has a channel dimension of 128.
Feature aggregation module (FAM): FAM aims to leverage the benefits of the
dependent features and texture features comprehensively for powerful feature representation.
The input features for the FAM include the LDF3, LDF4 and TF. To fuse those features, we
first employ an attentional embedding module (AEM) to merge the LDF3 and LDF4. There-
after, the merged feature is upsampled to concatenate with the TF, obtaining the aggregated
feature. Finally, the linear attention module is deployed to reduce the fitting residual of the
aggregated feature (AF).

Figure 16: Feature aggregation module

Linear attention module (LAM): In the FAM, we employ LAM to enhance the
spatial relationships of AF, thereby suppressing the fitting residual. Then, a convolutional
layer with BN and ReLU is deployed to obtain the attention map. Finally, we apply a matrix
multiplication operation between AF and the attention map to obtain the attentional AF.

Figure 17: Linear attention

Attentional embedding module (AEM): The AEM adopts the LAM to enhance

the spatial relationships of LDF4. Then, we apply a matrix multiplication operation between
the upsampling attention map of LDF4 and LDF3 to produce the attentional LDF3. Finally,
we use an addition operation to fuse the original LDF3 and the attentional LDF3.

Figure 18: Attentional embedding module

4.5 MANet (Multi-Attention-Network)


Semantic segmentation plays a crucial role in various remote sensing applications like
land resource management and urban planning. Deep convolutional neural networks (CNNs)
have significantly improved segmentation accuracy, but they often struggle with domain
adaptation, meaning they perform poorly when applied to data from a different source (e.g.,
satellite vs. aerial images). MANet tackles this challenge by leveraging the power of attention
mechanisms. Attention helps the network focus on relevant contextual information within
the image, leading to more accurate and adaptable segmentation results.

MANet addresses the aforementioned limitations by incorporating two powerful atten-


tion mechanisms within a ResNet50-based architecture:

• Kernel Attention Mechanism (KAM): Overcoming the quadratic complexity of


traditional dot-product attention, KAM utilizes kernel smoothers to efficiently cap-
ture long-range dependencies between pixels within the image, fostering robust feature
representations across domains.

– Kernel attention is a novel attention mechanism designed for deep learning models,
particularly in computer vision tasks like semantic segmentation. Its key strengths
are:

∗ Linear complexity: Unlike standard self-attention, which has quadratic com-


plexity, kernel attention exhibits linear complexity in terms of computational
cost. This significantly reduces processing time and allows application to
larger datasets and models.

∗ Global dependency capture: While standard attention focuses on local re-

lationships between elements, kernel attention effectively captures long-range
dependencies within the data. This enables it to better understand the overall
context and relationships between distant parts of the image.

Working principles:

1. Input Feature Maps: The model starts with a set of feature maps extracted
from the input data.

2. Kernel Transformation: Each feature map is transformed using a small ker-


nel (e.g., 1x1 convolution). This kernel helps capture local information and
relationships within the data.

3. Similarity Matrix Calculation: The transformed feature maps are then used
to calculate a similarity matrix. This matrix reflects the degree of similarity
between every pair of elements in the data.

4. Attention Weights: The similarity matrix is further processed through a non-


linearity (e.g., ReLU) and scaled to generate attention weights. These weights
represent the importance of each element in the data.

5. Weighted Sum: Finally, the original feature maps are weighted by the atten-
tion weights and summed to create a new set of enhanced feature maps. These
new maps incorporate both local and global information, better representing
the context and relationships within the data.

Benefits:

∗ Improved accuracy: By capturing long-range dependencies and enhancing


feature representation, kernel attention leads to more accurate segmentation
results.

∗ Reduced computational cost: The linear complexity makes it significantly


faster than standard attention, allowing for efficient processing of large datasets.

∗ Wider applicability: This efficiency opens doors for utilizing kernel attention
in resource-constrained environments and larger models.

• Channel Attention Mechanism (CAM): CAM dynamically weights the impor-


tance of different feature channels, directing the model’s focus towards relevant infor-
mation and suppressing domain-specific artifacts. This enhances the discriminative
power of the extracted features and improves domain adaptation capability.

– Channel attention focuses on dynamically reweighting the importance of individual

channels within a feature map. This improves feature representation by amplifying
informative channels and suppressing less relevant ones.

Working principles (a short code sketch follows at the end of this bullet):

1. Squeeze: The first step is to ”squeeze” the feature map, typically using global
average pooling or a small 1x1 convolution. This operation reduces the spatial
dimensionality of the map, resulting in a single channel representing the overall
information across all spatial locations.

2. Excitation: The squeezed channel is then passed through a feed-forward net-


work, often composed of two fully connected layers and a non-linearity. This
network ”excites” the channel by adjusting its importance.

3. Scale: The output of the excitation network is typically a sigmoid or softmax


activation function, generating ”scale” factors between 0 and 1. These factors
determine the final weight of each channel in the original feature map.

4. Rescale: Finally, the original feature map is multiplied by the channel-wise


scale factors, effectively adjusting the importance of each channel based on
its global information content.

Benefits:

∗ Improved feature representation: By emphasizing informative channels and


suppressing less relevant ones, channel attention enhances the overall quality
and discriminativeness of the extracted features.

∗ Increased accuracy: This improved feature representation often translates to


better performance in downstream tasks like image classification and segmen-
tation.

∗ Lightweight and efficient: Channel attention adds minimal computational


overhead, making it easy to integrate into existing deep learning architec-
tures.
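A minimal squeeze-and-excitation style sketch of the squeeze/excite/scale/rescale steps listed under the working principles above; the reduction ratio of 16 is an assumption, and MANet's actual CAM is dot-product based, so this only illustrates the channel-reweighting idea.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-excite style channel attention: squeeze, excite, scale, rescale."""

    def __init__(self, channels: int = 64, reduction: int = 16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)  # global average pooling per channel
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):  # x: (N, C, H, W)
        n, c, _, _ = x.shape
        w = self.excite(self.squeeze(x).view(n, c))  # per-channel weights in (0, 1)
        return x * w.view(n, c, 1, 1)                # rescale each channel

print(ChannelAttention()(torch.randn(2, 64, 32, 32)).shape)  # torch.Size([2, 64, 32, 32])
```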

Dot-product-based attention With strong capabilities to capture long-range dependencies, dot-product attention mechanisms have been applied in vision and natural language processing tasks. Dot-product attention modified for computer vision has shown great potential in image classification, object detection, semantic segmentation and panoptic segmentation. However, utilization of the dot-product attention mechanism often comes with significant memory and computational costs, which increase quadratically with the size of the input over space and time. It remains an intractable problem to model global dependencies on large-scale inputs, such as fine-resolution images. To alleviate the substantial computational requirement, prior work has used a sparse factorization of the attention matrix to reduce the complexity from O(N^2) to O(N√N), and has expressed self-attention as a linear dot-product of kernel feature maps to further reduce the complexity to O(N).

MANet not only dramatically decreases this complexity but also fully exploits the potential of the attention mechanism through a multilevel framework. Specifically, the complexity of the dot-product attention mechanism is reduced to O(N) by treating attention as a kernel function. As the complexity of attention is reduced dramatically by kernel attention, the Multi-Attention-Network (MANet) with a ResNeXt-101 backbone explores combinations of attention mechanisms and deep networks for semantic segmentation of fine-resolution remote sensing images. The major contributions of this approach include: 1) a novel attention mechanism involving kernel attention with linear complexity, proposed to alleviate the attention module's huge computational demand; 2) to extract refined dense features, ResNet is replaced with ResNeXt-101 as the backbone, enhancing the ability for feature extraction; 3) based on kernel attention and ResNeXt-101, a Multi-Attention-Network (MANet) is built by extracting contextual dependencies using multi-kernel attention.

Multi-Attention-Network

For the spatial dimension, as the computational complexity of the dot-product attention mechanism exhibits a quadratic relationship with the size of the input (N = H × W), we use an attention mechanism based on kernel attention, named KAM. For the channel dimension, the number of input channels C is normally far less than the number of pixels contained in the feature maps (i.e., C ≪ N). Therefore, the complexity of the softmax function over channels, i.e., O(C^2), is not large. Thus, we utilize the channel attention mechanism based on the dot-product, named CAM. Like the dot-product attention mechanism, there is a residual connection in the KAM and CAM, adding the output directly to the input features.

Using the kernel attention mechanism (KAM) and channel attention mechanism (CAM)
which model the long-range dependencies of positions and channels, respectively, we design
an attention block to enhance the discriminative ability of feature maps extracted by each
layer. Features generated by the ResBlock are fed into the KAM and CAM to refine the
information in positions and channels, respectively. Thereafter, the refined feature maps are

Figure 19: Details of the channel attention mechanism

Figure 20: Illustration of the attention block.

added directly to obtain the output of the corresponding attention block, whose structure can be seen in Figure 21(b).

The structure of the proposed Multi-Attention-Network is illustrated in Figure 21(a). Specif-


ically, five feature maps at different scales acquired from the outputs of [Conv, ResBlock-
1, ResBlock-2, ResBlock-3, ResBlock-4] are adopted. The lowest level feature Res-4 is up-
sampled directly by the DeBlock 4 which is comprised of a 3 x 3 deconvolution layer with
stride = 2 and two 1 x 1 convolution layers before and after the deconvolution layer. The
feature maps generated by ResBlocks are then refined by corresponding attention blocks
and added with the up-sampled lower feature maps. Subsequently, the fused features are
up-sampled by the DeBlocks correspondingly. Finally, the output of the last DeBlock is
up-sampled to the identical spatial resolution of the input by employing a deconvolution
operation and fed into the final convolution layer to obtain the predicted segmentation map.

4.6 UnetFormer
The proposed UNetFormer is built with a CNN-based encoder and a Transformer-based
decoder, as shown in Figure 22. Each component is explained in detail in the sections that
follow.

Figure 21: The structure of (a) the proposed MANet, (b) the Attention block, (c) the Res-
Block, and (d) the DeBlock.

Figure 22: An overview of the UNetFormer

CNN-based encoder Here, we choose to use the pre-trained ResNet18 as the encoder in order to extract semantic features at a low computational cost. Each of ResNet18's four Resblock stages downsamples the feature map by a scale factor of two. The proposed UNetFormer uses a 1x1 convolution with a channel dimension of 64 as the skip connection, to fuse the feature maps produced by each stage with the corresponding feature maps of the decoder.

Global-local Transformer block (GLTB) first uses a standard 1x1 convolution to expand the channel dimension of the input 2D feature map $X \in \mathbb{R}^{B \times C \times H \times W}$ by a factor of three. Then, we apply the window partition operation to split the 1D sequence into the query (Q), key (K) and value (V) vectors. The channel dimension C is set to 64. The window size w and the number of heads h are both set to 8.

Figure 23: a global-local Transformer block.

Feature refinement head (FRH) the channel path employs a global average pooling layer to generate a channel-wise attention map $C \in \mathbb{R}^{1 \times 1 \times c}$, where c denotes the channel dimension. The reduce-and-expand operation contains two 1x1 convolutional layers, which first reduce the channel dimension c by a factor of 4 and then expand it back to the original size. The spatial path utilizes a depth-wise convolution to produce a spatial-wise attention map $S \in \mathbb{R}^{h \times w \times 1}$, where h and w represent the spatial resolution of the feature map. The attentional features generated by the two paths are further fused using a sum operation. Finally, a post-processing 1x1 convolutional layer and an upsampling operation are applied to produce the final segmentation map.
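A rough sketch of the FRH as described above (channel path, spatial path, sum fusion, post-processing 1x1 convolution and upsampling); how the attention maps gate the features and where the sigmoids sit are assumptions, so the official UNetFormer code may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureRefinementHead(nn.Module):
    """Sketch of the FRH: a channel path and a spatial path fused by a sum."""

    def __init__(self, channels: int = 64, n_classes: int = 6, up: int = 4):
        super().__init__()
        self.channel_path = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                                # global average pooling
            nn.Conv2d(channels, channels // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, channels, 1), nn.Sigmoid())    # reduce then expand
        self.spatial_path = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),  # depth-wise
            nn.Conv2d(channels, 1, 1), nn.Sigmoid())
        self.post = nn.Conv2d(channels, n_classes, 1)
        self.up = up

    def forward(self, x):
        fused = x * self.channel_path(x) + x * self.spatial_path(x)  # sum fusion
        return F.interpolate(self.post(fused), scale_factor=self.up,
                             mode='bilinear', align_corners=False)

print(FeatureRefinementHead()(torch.randn(1, 64, 64, 64)).shape)  # torch.Size([1, 6, 256, 256])
```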

Figure 24: Feature refinement head

4.7 CycleGAN
We know that standard GAN-based image translation models take paired inputs. This is a disadvantage for reaching high generalization and synthesis ability: such a model usually needs a large-scale dataset before the generated fake images look real, yet some data are scarce and hard to find, for example paintings by a deceased artist.

To overcome this disadvantage, CycleGAN introduces unpaired image-to-image translation learning, which supports translating the general characteristics of a source set to a target set without requiring the two sets to be related pair by pair.

Because the source-target data are unpaired, we do not know exactly which target image a specific source image maps to. Instead, the model learns a mapping at the set level, from set X to set Y, in an unsupervised manner. Our goal is to learn mapping functions between two domains X and Y given training samples $\{x_i\}_{i=1}^{N}$, where $x_i \in X$, and $\{y_j\}_{j=1}^{M}$, where $y_j \in Y$. We denote the data distributions as $x \sim p_{data}(x)$ and $y \sim p_{data}(y)$. As illustrated in Figure 26(a), our model includes two mappings $G : X \to Y$ and $F : Y \to X$. In addition, we introduce two adversarial discriminators $D_X$ and $D_Y$, where $D_X$ aims to distinguish between images $\{x\}$ and translated images $\{F(y)\}$; similarly, $D_Y$ aims to discriminate between $\{y\}$ and $\{G(x)\}$. Our objective contains two types of terms:

Figure 25: Paired (left) vs unpaired (right) image-to-image translation. A paired training dataset contains a correspondence between each input image $x_i$ and $y_i$. In contrast, an unpaired training dataset consists of two sets $X = \{x_i\}_{i=1}^{N}$ and $Y = \{y_j\}_{j=1}^{M}$ with no information about which image maps to which.

adversarial losses for matching the distribution of generated images to the data distribution
in the target domain, and cycle consistency losses to prevent the learned mappings G and F
from contradicting each other.

4.7.1 Adversarial Loss

We apply adversarial losses to both mapping functions. For the mapping function $G : X \to Y$ and its discriminator $D_Y$, we express the objective as:

$$\mathcal{L}_{GAN}(G, D_Y, X, Y) = \mathbb{E}_{y \sim p_{data}(y)}[\log D_Y(y)] + \mathbb{E}_{x \sim p_{data}(x)}[\log(1 - D_Y(G(x)))] \qquad (12)$$

where G tries to generate images $G(x)$ that look similar to images from domain Y, while $D_Y$ aims to distinguish between translated samples $G(x)$ and real samples y. G aims to minimize this objective against an adversary $D_Y$ that tries to maximize it, i.e., $\min_G \max_{D_Y} \mathcal{L}_{GAN}(G, D_Y, X, Y)$. We introduce a similar adversarial loss for the mapping function $F : Y \to X$ and its discriminator $D_X$ as well:

$$\mathcal{L}_{GAN}(F, D_X, Y, X) = \mathbb{E}_{x \sim p_{data}(x)}[\log D_X(x)] + \mathbb{E}_{y \sim p_{data}(y)}[\log(1 - D_X(F(y)))] \qquad (13)$$

where F tries to generate images $F(y)$ that look similar to images from domain X, and $D_X$ aims to distinguish between translated samples $F(y)$ and real samples x. F aims to minimize

Figure 26: Two mapping functions $G : X \to Y$ and $F : Y \to X$, and associated adversarial discriminators $D_Y$ and $D_X$. $D_Y$ encourages G to translate X into outputs indistinguishable from domain Y, and vice versa for $D_X$ and F. To further regularize the mappings, we introduce two cycle consistency losses that capture the intuition that if we translate from one domain to the other and back again we should arrive where we started: (b) forward cycle-consistency loss: $x \to G(x) \to F(G(x)) \approx x$, and (c) backward cycle-consistency loss: $y \to F(y) \to G(F(y)) \approx y$.

this objective against an adversary $D_X$ that tries to maximize it, i.e., $\min_F \max_{D_X} \mathcal{L}_{GAN}(F, D_X, Y, X)$.

4.7.2 Cycle Consistency Loss

Adversarial training can, in theory, learn mappings G and F that produce outputs identically distributed as the target domains Y and X respectively (strictly speaking, this requires G and F to be stochastic functions). However, with large enough capacity, a network can map the same set of input images to any random permutation of images in the target domain, where any of the learned mappings can induce an output distribution that matches the target distribution. Thus, adversarial losses alone cannot guarantee that the learned function can map an individual input $x_i$ to a desired output $y_i$. To further reduce the space of possible mapping functions, we argue that the learned mapping functions should be cycle-consistent: as shown in Figure 26(b), for each image x from domain X, the image translation cycle should be able to bring x back to the original image, i.e., $x \to G(x) \to F(G(x)) \approx x$. We call this forward cycle consistency. Similarly, as illustrated in Figure 26(c), for each image y from domain Y, G and F should also satisfy backward cycle consistency: $y \to F(y) \to G(F(y)) \approx y$. We incentivize this behavior using a cycle consistency loss:

$$\mathcal{L}_{cyc}(G, F) = \mathbb{E}_{x \sim p_{data}(x)}[\| F(G(x)) - x \|_1] + \mathbb{E}_{y \sim p_{data}(y)}[\| G(F(y)) - y \|_1] \qquad (14)$$

In preliminary experiments, we also tried replacing the L1 norm in this loss with an adversarial loss between $F(G(x))$ and x, and between $G(F(y))$ and y, but did not observe improved performance.

The behavior induced by the cycle consistency loss can be observed in Figure 27: the

Figure 27: The input images x, output images G(x), and the reconstructed images F(G(x)) from various experiments. From top to bottom: photo ↔ Cezanne, horses ↔ zebras, winter → summer Yosemite, aerial photos ↔ Google maps.

reconstructed images F(G(x)) end up matching closely to the input images x.

Our full objective is:

$$\mathcal{L}(G, F, D_X, D_Y) = \mathcal{L}_{GAN}(G, D_Y, X, Y) + \mathcal{L}_{GAN}(F, D_X, Y, X) + \lambda \mathcal{L}_{cyc}(G, F) \qquad (15)$$

where λ controls the relative importance of the two objectives. We aim to solve:

$$G^{*}, F^{*} = \arg \min_{G, F} \max_{D_X, D_Y} \mathcal{L}(G, F, D_X, D_Y) \qquad (16)$$
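The generator-side objective of Eq. (15) could be computed per batch roughly as follows; the non-saturating BCE form of the adversarial terms and λ = 10 are common practical choices rather than details taken from this report.

```python
import torch

bce = torch.nn.BCELoss()
l1 = torch.nn.L1Loss()

def cyclegan_generator_loss(G, F, D_X, D_Y, x, y, lam: float = 10.0):
    """Sketch of the generator-side objective of Eq. (15) for one batch.

    G: X -> Y, F: Y -> X; D_X and D_Y output probabilities of being real.
    """
    fake_y, fake_x = G(x), F(y)
    # Adversarial terms (Eqs. (12)-(13)), non-saturating form: the generators
    # are pushed to make the discriminators output "real" (label 1).
    d_y_out, d_x_out = D_Y(fake_y), D_X(fake_x)
    adv = bce(d_y_out, torch.ones_like(d_y_out)) + bce(d_x_out, torch.ones_like(d_x_out))
    # Cycle-consistency term (Eq. (14)): translate to the other domain and back.
    cyc = l1(F(fake_y), x) + l1(G(fake_x), y)
    return adv + lam * cyc

# Toy check with identity "generators" and a mean-based "discriminator":
ident = lambda t: t
disc = lambda t: torch.sigmoid(t.mean(dim=(1, 2, 3)))
x, y = torch.rand(2, 3, 8, 8), torch.rand(2, 3, 8, 8)
print(cyclegan_generator_loss(ident, ident, disc, disc, x, y))
```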

4.7.3 Generator network

A common architecture for the CycleGAN generator is the U-Net. U-Net is a network which consists of a sequence of downsampling blocks followed by a sequence of upsampling blocks, giving it the U-shaped architecture. In the upsampling path, we concatenate the outputs of the upsampling blocks with the outputs of the downsampling blocks symmetrically. This can be seen as a kind of skip connection, facilitating information flow in deep networks and reducing the impact of vanishing gradients.

4.7.4 Discriminator network

Unlike conventional networks that output a single probability of the input image being
real or fake, CycleGAN uses the PatchGAN discriminator that outputs a matrix of values.
Intuitively, each value of the output matrix checks the corresponding portion of the input
image. Values closer to 1 indicate real classification and values closer to 0 indicate fake
classification.
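A sketch of a PatchGAN-style discriminator in the spirit of the description above; the exact layer counts, instance normalization, and roughly 70x70 receptive field are assumptions borrowed from common CycleGAN implementations.

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """PatchGAN sketch: each value of the output matrix judges one image patch."""

    def __init__(self, in_ch: int = 3, base: int = 64):
        super().__init__()

        def block(c_in, c_out, stride):
            return nn.Sequential(nn.Conv2d(c_in, c_out, 4, stride, 1),
                                 nn.InstanceNorm2d(c_out), nn.LeakyReLU(0.2, True))

        self.net = nn.Sequential(
            nn.Conv2d(in_ch, base, 4, 2, 1), nn.LeakyReLU(0.2, True),
            block(base, base * 2, 2),
            block(base * 2, base * 4, 2),
            block(base * 4, base * 8, 1),
            nn.Conv2d(base * 8, 1, 4, 1, 1))  # one real/fake score per patch

    def forward(self, x):
        return torch.sigmoid(self.net(x))     # values near 1 = real, near 0 = fake

print(PatchDiscriminator()(torch.randn(1, 3, 256, 256)).shape)  # torch.Size([1, 1, 30, 30])
```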

Figure 28: Discriminator architecture

5 Experimental Results
In this section we conduct extensive experiments on the algorithms proposed.

5.1 Experimental settings


Segmentation model

For the settings, the training set and the test set contain 5000 and 1400 256x256 images, respectively.

Due to time and resource constraints, our model tuning efforts are focused on key pa-
rameters within the baseline Unet architecture. We specifically target adjustments to the
learning rate, number of layers, filter ratio, and the incorporation of dropout and normal-
ization techniques. Optimizing for the best performance, we have configured the following

Figure 29: Performance with different learning rate, number of layers, filter ratio, and the
incorporation of dropout and normalization techniques

parameters for our model: we set the learning rate to 8e-3 and the filter ratio to 1, use both dropout and normalization techniques, and train the models until convergence. The loss we use is a joint loss of Cross Entropy Loss and Dice Loss.

Cross Entropy Loss is a widely used loss function in classification tasks, particularly
in the context of machine learning and deep learning. It measures the dissimilarity between
the predicted probability distribution and the true probability distribution of the classes in a
classification problem. The cross-entropy loss is designed to penalize the model more heavily
when it makes confident incorrect predictions.

In a binary classification scenario, where there are only two classes (commonly denoted

as 0 and 1), the cross-entropy loss for a single sample can be defined as:

L(y, \hat{y}) = -[y \cdot \log(\hat{y}) + (1 - y) \cdot \log(1 - \hat{y})]    (17)

where L is the cross-entropy loss, y is the true label (0 or 1) and ŷ is the predicted probability
of belonging to class 1.

For a multi-class classification problem with C classes, the cross-entropy loss is generalized
as:

L(y, \hat{y}) = -\sum_{i=1}^{C} y_i \cdot \log(\hat{y}_i)    (18)

where L is the cross-entropy loss, y_i is the true probability of class i, and ŷ_i is the predicted
probability of belonging to class i.
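A small numeric illustration of Eq. (18) with invented values shows why confident wrong predictions are penalized so heavily:

import torch

y_true = torch.tensor([0.0, 1.0, 0.0])              # one-hot target, true class is index 1
confident_wrong = torch.tensor([0.90, 0.05, 0.05])  # puts 90% mass on the wrong class
uncertain = torch.tensor([0.30, 0.40, 0.30])        # hesitant but leaning toward the true class

for y_hat in (confident_wrong, uncertain):
    loss = -(y_true * torch.log(y_hat)).sum()        # Eq. (18)
    print(round(loss.item(), 3))                     # prints ~2.996, then ~0.916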

Dice Loss, based on the Sørensen–Dice coefficient (and closely related to the F1 score), is
often used in image segmentation tasks to measure the overlap between the predicted
segmentation masks and the ground truth masks. The formula for Dice Loss is defined as
follows:

\text{Dice Loss} = 1 - \frac{2\sum_i^N p_i \cdot g_i}{\sum_i^N p_i^2 + \sum_i^N g_i^2}    (19)

where N is the total number of pixels in the image, p_i is the predicted probability of pixel i
being part of the object in the segmentation mask, and g_i is the ground truth binary label
indicating whether pixel i belongs to the object.

We combine Dice Loss and Cross Entropy Loss for this particular problem by taking a
weighted sum of the two losses, with the weight of each loss set to 1.
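A minimal PyTorch sketch of this joint loss is shown below; the softmax/one-hot handling of the multi-class masks and the smoothing constant eps are illustrative assumptions.

import torch
import torch.nn.functional as F

def joint_loss(logits, target, eps=1e-6):
    # logits: (B, C, H, W) raw scores; target: (B, H, W) integer class indices.
    ce = F.cross_entropy(logits, target)                    # Eq. (18), averaged over pixels

    num_classes = logits.shape[1]
    probs = torch.softmax(logits, dim=1)
    onehot = F.one_hot(target, num_classes).permute(0, 3, 1, 2).float()

    inter = (probs * onehot).sum(dim=(0, 2, 3))             # per-class overlap
    denom = (probs ** 2).sum(dim=(0, 2, 3)) + (onehot ** 2).sum(dim=(0, 2, 3))
    dice = 1.0 - (2.0 * inter / (denom + eps)).mean()       # Eq. (19), averaged over classes

    return 1.0 * ce + 1.0 * dice                            # equal weights of 1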

CycleGAN model

Regarding the configurations, we randomly selected 400 images, each of size 256×256, from
the original TOP image dataset. These images were then divided into two subsets: a training
subset comprising 300 images and a test subset consisting of 100 images; this split was applied
to both the Potsdam and Vaihingen datasets. For optimization, we opted for Adam with a
learning rate of 2e-4, a choice motivated by the good performance observed with this learning
rate on similar datasets.
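The sketch below mirrors this optimizer setting in PyTorch; the beta values follow common CycleGAN practice, and the tiny placeholder modules standing in for the generators and discriminators are assumptions, not our actual networks.

import itertools
import torch
import torch.nn as nn

# Placeholders for the two generators and two discriminators.
G, F_net = nn.Conv2d(3, 3, 3, padding=1), nn.Conv2d(3, 3, 3, padding=1)
D_X, D_Y = nn.Conv2d(3, 1, 3, padding=1), nn.Conv2d(3, 1, 3, padding=1)

# Adam with learning rate 2e-4; generators and discriminators get separate optimizers.
opt_g = torch.optim.Adam(itertools.chain(G.parameters(), F_net.parameters()),
                         lr=2e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(itertools.chain(D_X.parameters(), D_Y.parameters()),
                         lr=2e-4, betas=(0.5, 0.999))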

5.2 Experimental results

5.2.1 Segmentation on Potsdam datasets

For result evaluation, we test the models in two settings: without tuning and with
tuning. For tuning, we use data augmentation and Xavier weight initialization in the model;
a minimal sketch of these techniques is given below, and the results are reported in Table 2.
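The sketch assumes PyTorch; the specific augmentation transforms are illustrative assumptions rather than our exact pipeline (for segmentation, the same geometric transform must also be applied to the label mask).

import torch.nn as nn
import torchvision.transforms as T

# Xavier (Glorot) initialization, applied recursively with model.apply(init_xavier).
def init_xavier(module):
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.xavier_uniform_(module.weight)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

# Simple geometric augmentation for the 256x256 tiles.
augment = T.Compose([
    T.RandomHorizontalFlip(),
    T.RandomVerticalFlip(),
    T.RandomRotation(degrees=15),
])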

Model        Without tuning    With tuning

BANet        75.9              76.0
Unet         76.6              76.7
UperNet      79.5              81.1
Unet++       85.7              86.32
MANet        87.8              88.0
UnetFormer   91.0              91.7

Table 2: Segmentation accuracy (%) on the Potsdam dataset without and with tuning

We can see that the baseline UNet achieves a competitive performance of 76.7%. We
then tried to improve on this result with BANet, UperNet, Unet++, MANet, and UnetFormer.
BANet, despite its use of efficient transformers, reached only 76.0% on the test set, slightly
underperforming UNet. Its segmentation head appears unable to fully replace a decoder
block, which may make the model less effective than traditional encoder-decoder models for
image segmentation: such architectures rely on upsampling, and sufficient parameters in the
upsampling path seem necessary to maintain the efficiency of the model. UperNet achieves a
better performance than UNet, 81.1%, thanks to the feature extraction of its Pyramid Pooling
Module head. Meanwhile, Unet++ and MANet clearly outperform the previous models:
Unet++ leverages nested convolutional blocks, which narrow the semantic gap between the
encoder and decoder feature maps before they are fused, while MANet performs even better
(88.0%) thanks to its attention modules, which capture global information in the feature maps
before the deconvolution block at each stage, making the model reliable and accurate on the
test set. Finally, the fusion of a ResNet encoder with a Transformer-based decoder gives
UnetFormer a decoder that learns global context efficiently, and it achieves the best
performance of all tested models at 91.7%.

There are some key features of the data that contribute to the decent results of these solutions:

• The dataset is reasonably large and the class distribution is balanced.

• The images overall have clear, distinguishable features: for example, tree contours are not
too hard to extract, and roads and houses appear as regular, square-shaped regions, which
helps the models pick up the significant features in the images.

5.2.2 Segmentation on Vaihingen datasets

For the evaluation of results, we conduct testing on two distinct scenarios: the original
models and the models after the fine-tuning process. During the fine-tuning phase, we train
our models until they converge, ensuring that the training process captures the intricacies of
the data and reaches a stable state. The result can be observed as follows:

Model        Before fine-tuning    After fine-tuning

BANet        26.84                 68.3
Unet         27.19                 68.2
UperNet      27.89                 70.4
Unet++       28.76                 72.4
MANet        29.62                 73.5
UnetFormer   31.6                  75.3

Table 3: Segmentation accuracy (%) on the Vaihingen dataset before and after fine-tuning on
the translated data

We can observe a significant improvement in the results of the algorithms after fine-
tuning with the translated dataset, rising from approximately 26-30% to 68-75%. The relative
ordering of the models is consistent with that observed when segmenting the Potsdam dataset,
indicating a substantial impact of the network architecture: stronger networks yield better
performance.

Class                   Before    After

Building                23        64.5
Tree                    4         46.3
Impervious surfaces     47.5      48.2
Car                     34.6      36.8
Clutter background      78        83.6
Low vegetation          31.7      28.4

Table 4: Accuracy of the segmentation on every class before and after applying our domain
adaptation algorithm

Delving into a more in-depth analysis, we examined the impact of our algorithm on individual
classes, using the results of the baseline UNet for estimation and analysis (refer to Table 4).
Two distinct effects are identified in this study.

Firstly, there are classes for which our algorithm substantially increases model accuracy
(specifically, the classes building and tree). A comparison with Table 1 reveals that these
classes are predominantly characterized by a domain shift related to the sensor factor. In
cases where the domain shift is primarily linked to the sensor factor, our algorithm proves
highly effective in boosting model accuracy. For instance, the class building is solely affected
by the sensor factor, leading to a notable increase in average accuracy from 0.27 to 0.68.
Similarly, for the class tree, our algorithm is effective in enhancing accuracy, albeit with some
limitations attributed to other domain shift factors.

On the other hand, for classes such as impervious surfaces, car, clutter background, and
low vegetation, our algorithm has little effect on accuracy; it essentially preserves it. As
indicated in Table 1, these classes are either not affected by any domain shift factor (e.g.,
car and impervious surfaces) or are highly influenced by a factor other than the sensor (e.g.,
clutter background and low vegetation).

In essence, our algorithm maintains model accuracy when there is either no domain
shift or when the domain shift is predominantly related to a factor other than the sensor. This
characteristic is valuable as it allows for the integration of our algorithm with other techniques
that may target different domain shift factors. The algorithm successfully addresses the
elimination of the sensor factor without adversely affecting other factors. If the domain shift
primarily stems from the sensor factor between source and target datasets, our algorithm
proves capable of significantly improving accuracy, comparable to training the model on a
fully labeled target dataset, as exemplified in the class building.

6 Conclusion and Discussion
In this report, we introduced several supervised and unsupervised learning algorithms.
The overall performance of the tested methods is decent; the best performance was achieved
by the UnetFormer model with 91.7% accuracy on the Potsdam dataset, and the performance
of all models increased after applying Unsupervised Domain Adaptation on the Vaihingen
dataset. After conducting the experiments, we analysed the advantages and disadvantages of
each method.

During the process of working on a project, the group has accomplished the following
tasks:

• Researched the semantic segmentation problem.

• Studied how to apply deep learning approaches to solve the problem.

• Conducted experiments to measure the effectiveness of the proposed models.

• Analyzed and discussed the results achieved.

In the future, the group wants to experiment with the models used here on more advanced
techniques and problems in deep learning, especially semantic segmentation and GANs (for
example, applying a pretrained U-Net model as the generator network of a conditional
Cycle-Consistent Generative Adversarial Network (CycleGAN) for particular problems),
while also exploring other approaches.

We want to take a moment to express our heartfelt gratitude for all the support and
guidance our teacher has provided us throughout the challenging topics of this course. Your
expertise and dedication have truly made a difference in our journey, and we are incredibly
grateful for your unwavering commitment to our education.

7 Appendix
Here are some demo inferences of our work:

