Deep Learning UDA
School of Information and Communication Technology
Supervisor
Students
Hanoi, 12/2023
Contents
1 Introduction
1.1 Problem description
2 Dataset
2.1 Potsdam and Vaihingen datasets
2.2 Domain Shift Analysis
3 Preliminaries
3.1 CNN
3.2 Attention and Self-attention
3.3 GAN
3.3.1 The Structure of GANs
4 Algorithm Proposal
4.1 Unet
4.2 UnetPlusPlus
4.3 UperNet
4.4 BANet (Bilateral Awareness Network)
4.5 MANet (Multi-Attention-Network)
4.6 UnetFormer
4.7 CycleGAN
4.7.1 Adversarial Loss
4.7.2 Cycle Consistency Loss
4.7.3 Generator network
4.7.4 Discriminator network
5 Experimental Results
5.1 Experimental settings
5.2 Experimental results
5.2.1 Segmentation on Potsdam datasets
5.2.2 Segmentation on Vaihingen datasets
6 Conclusion and Discussion
7 Appendix
1 Introduction
The continuous development of cities results in ongoing changes in urban data, presenting diverse data formats across different urban settings, which in turn gives rise to challenges. Additionally, smaller regions often confront data limitations, creating complexity when training models for specific datasets. The scarcity of data in these areas adds intricacy to the training process, complicating the management and deployment of models. An adaptive paradigm ensures the sustained accuracy and relevance of our mapping systems, effectively capturing the rapid transformations inherent in modern urban environments, and provides a robust foundation for future urban planning endeavors.
Such an approach yields an accurate representation of the evolving urban fabric, allowing for a detailed and comprehensive understanding of the urban environment and enabling better-informed decision-making in urban planning.
Unsupervised adaptation models further contribute to the efficacy of our mapping ap-
proach. The dynamic nature of urban growth often leads to variations in data distributions
across different cities and regions. Unsupervised adaptation techniques enable our models to
autonomously adjust and learn from the local characteristics of specific urban areas without
the need for extensive labeled datasets. This adaptability is crucial in maintaining the gener-
alization capacity of our models over time, mitigating the risks associated with the evolving
nature of urban landscapes.
2 Dataset
2.1 Potsdam and Vaihingen datasets
Impervious surfaces indicate a paved area with no building on it. The clutter/background category refers to all the ground objects that are not included in the other five categories. The Vaihingen dataset includes 33 TOP images with sizes close to 2000 × 2000 pixels. All these
33 TOP images are released with the ground truth. The TOP file contains three channels:
Infrared, red, and green bands. Among the 33 TOP images, 27 TOP images were used for
training, and 6 images were used for the test. The Potsdam dataset is a larger dataset that
contains 38 TOP images with a fixed size of 6000 × 6000 pixels. All these images are released
with their ground truth. The TOP files for Potsdam contain 3 different spectral channels:
red, green, and blue. Among the 38 TOP images, 32 images were used for the training, and
6 images were used for the test. To train the segmentation model, we divided the images and their labels into squares of size 256 × 256 and fed the network with uniform patches of size 256 × 256.
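As an illustration, the following is a minimal sketch of how a large TOP image can be split into non-overlapping 256 × 256 patches. The non-overlapping, border-dropping strategy shown here is an assumption made only to keep the example short; the exact cropping used in our pipeline may differ.

```python
import numpy as np

def tile_image(image: np.ndarray, patch: int = 256):
    """Split an H x W x C image into non-overlapping patch x patch tiles.

    Tiles that would extend past the border are dropped, which is one simple
    way to handle images whose sides are not exact multiples of the patch size.
    """
    h, w = image.shape[:2]
    tiles = []
    for top in range(0, h - patch + 1, patch):
        for left in range(0, w - patch + 1, patch):
            tiles.append(image[top:top + patch, left:left + patch])
    return tiles

# Example: a dummy 6000 x 6000 RGB image (Potsdam-sized) yields 23 * 23 = 529 tiles.
dummy = np.zeros((6000, 6000, 3), dtype=np.uint8)
patches = tile_image(dummy)
print(len(patches), patches[0].shape)
```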
Both datasets consist of six classes: Impervious surfaces, Buildings, Low vegetation, Trees, Cars, and Clutter/background. The distribution of the classes is displayed as follows:
2.2 Domain Shift Analysis
The first factor contributing to the domain shift is the sensor: Vaihingen images are captured in IRRG (infrared, red, green) channels, whereas Potsdam images are captured in RGB channels, so the spectral appearance of the two domains differs.
The second factor is the resolution, with Vaihingen's images having a resolution of 9 cm per pixel and Potsdam's images having a higher resolution of 5 cm per pixel. This change
in resolution can affect the segmentation model’s ability to precisely identify classes, thereby
introducing a domain shift.
The third factor contributing to the domain shift is the structural representation of
classes. Differences emerge in the representation of various classes when transitioning from
the Potsdam dataset to the Vaihingen dataset. For instance, buildings in both datasets are
similar, reflecting the modern German town architecture. However, classes like low vegetation
and trees display clear distinctions, particularly due to Vaihingen containing agricultural
areas, a feature absent in Potsdam. The types of trees and vegetation also differ between the
two datasets, with the contrast more evident in the low vegetation class.
The combined impact of these factors results in a domain shift between Potsdam and
Vaihingen, prompting an exploration of the effect of our proposed algorithm in mitigating
domain shifts related to each factor. Table 2 summarizes the influence of these factors on
the domain shift for each class, and our analysis is based on a thorough examination of each
class in both domains.
It is noteworthy that the resolution has a low effect on the domain shift for all classes,
as the model’s feature extraction layers can effectively handle the scale difference. The sensor
factor significantly influences the building class, making it an ideal case to study the impact
of our algorithm on reducing domain shifts caused by the sensor factor alone. Similarly, the
trees class, affected by both the sensor factor and class representation factor, will be studied
to assess the algorithm’s effectiveness. The impervious surfaces and cars classes, minimally
impacted by the three factors, serve as cases to explore the algorithm’s effect on classes
unaffected by domain shifts. Lastly, the low vegetation and clutter classes, highly influenced
by the sensor factor and class representation factor, are examined to evaluate the algorithm’s
efficacy in reducing domain shifts associated with these combined factors.
3 Preliminaries
3.1 CNN
Convolutional Neural Networks (CNNs) are a class of deep learning models specifically designed for processing grid-like data such as images, audio signals, and time series data. The
CNN architecture used for image classification comprises convolutional layers, max-pooling
layers, and fully connected layers. The following provides a general description of each layer
in a CNN:
Filter A filter of size $f \times f$ applied to an input with $C$ channels has a total size of $f \times f \times C$ and performs convolution on an input of size $H \times W \times C$, resulting in a feature map of size $P \times Q \times 1$.
Stride For convolution or pooling, the stride $s$ represents the number of pixels the window will move after each calculation.
Pooling layers
In addition to convolutional layers, pooling layers are commonly incorporated in CNNs
to downsample the feature maps produced by the convolutional layer. This downsampling
process aids in reducing the dimensionality of the feature maps.
Fully connected layer The fully connected layer receives flattened data as input, where each input element is connected to all neurons. In CNNs, fully connected layers are typically placed at the end of the network and map the extracted features to the final outputs, such as class scores.
Activation functions introduce nonlinearity into the model; for datasets with complex distributions, a purely linear function is not expressive enough to represent the relationship between inputs and outputs.
Softmax can be thought of as a general logistic function that takes as input a vector of values $x \in \mathbb{R}^n$ and outputs a vector of probabilities $p \in \mathbb{R}^n$, where
$$p_i = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}, \quad i = 1, \dots, n.$$
The Softmax function limits each output value to the range (0, 1). This helps prevent intermediate results from becoming very large or very small, which would cause numerical problems and make it harder for the network to converge.
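As an illustration, a minimal implementation of the Softmax function; the max-subtraction trick is a standard practical detail for numerical stability and is not specific to our models.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    """Softmax with the usual max-subtraction trick for numerical stability."""
    shifted = x - np.max(x)        # does not change the result, avoids overflow in exp
    exp = np.exp(shifted)
    return exp / np.sum(exp)

print(softmax(np.array([2.0, 1.0, 0.1])))  # probabilities summing to 1
```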
* Popular techniques
Batch normalization Controlling the input distribution between network layers can significantly speed up training and improve overall performance. Accordingly, the distribution of the input of a layer, $(\sigma, \mu)$, is normalized such that it has zero mean and unit standard deviation.
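A minimal sketch of this normalization step, assuming a simple (batch, features) layout and scalar scale/shift parameters for brevity; real batch-normalization layers learn a per-channel $\gamma$ and $\beta$ and track running statistics.

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a batch of activations to zero mean / unit variance, then rescale.

    x has shape (batch, features); gamma and beta are the learnable scale and
    shift parameters (plain floats here for brevity).
    """
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

batch = np.random.randn(8, 4) * 3.0 + 5.0
normed = batch_norm(batch)
print(normed.mean(axis=0).round(6), normed.std(axis=0).round(6))  # ~0 mean, ~1 std
```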
Dropout Dropout randomly deactivates a fraction of neurons during training, which reduces overfitting by preventing the network from relying too heavily on any individual unit.
Figure 5: Dropout illustration
3.2 Attention and Self-attention
More generally, we can consider the previous decoder state as the query vector, and the encoder hidden states as key and value vectors. The output is a weighted average of the value vectors, where the weights are determined by the compatibility function between the query and the keys. Note that the keys and values can be different sets of vectors.
The above can be summarized by the following equations. Given a query $q$, values $(v_1, \dots, v_n)$, and keys $(k_1, \dots, k_n)$, we compute the output $z$:
$$z = \sum_{j=1}^{n} \alpha_j \cdot v_j \qquad (1)$$
$\alpha_j$ is computed using the softmax function, where $f(k_i, q)$ is the compatibility score between $k_i$ and $q$.
For the compatibility function, we will be using the scaled dot-product function
$$f(k, q) = \frac{k q^T}{\sqrt{d_k}} \qquad (3)$$
where $d_k$ is the dimension of the key vectors. This scaling is done to improve numerical stability as the dimension of keys, values, and queries grows.
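For illustration, a single-query version of Equations (1) and (3) in NumPy; the batched matrix form used in practice follows the same pattern.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """q: (d_k,), k: (n, d_k), v: (n, d_v) -- single-query attention."""
    d_k = k.shape[-1]
    scores = k @ q / np.sqrt(d_k)             # compatibility f(k_i, q), Eq. (3)
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()         # softmax weights alpha_j
    return weights @ v                        # z = sum_j alpha_j v_j, Eq. (1)

rng = np.random.default_rng(0)
q, k, v = rng.normal(size=4), rng.normal(size=(5, 4)), rng.normal(size=(5, 3))
print(scaled_dot_product_attention(q, k, v))
```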
Attention mechanisms can be divided into two main categories:
• Soft attention: The model assigns a weight to each element in the input sequence,
and the final output is a weighted sum of these elements. The weights are determined
by a softmax function, ensuring that the weights sum to 1.
• Hard attention: The model selects a subset of elements from the input sequence, and
the output is based only on this selected subset. Unlike soft attention, hard attention
involves discrete decisions.
In multi-head self-attention, the queries, keys, and values are linearly projected $h$ times, each time to dimension $d_{model}/h$. For each head, the $(Q, K, V)$ matrices are uniquely projected to dimension $d_{model}/h$ and self-attention is performed to yield an output of dimension $d_{model}/h$. The outputs of each head are then concatenated and once again a linear projection layer is applied, resulting in an output of the same dimensionality as performing self-attention once on the original $(Q, K, V)$ matrices. This process is described by the following formulas:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O$$
where
$$\mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V) \qquad (6)$$
and the projections are parameter matrices $W_i^Q \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^K \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{model} \times d_v}$, and $W^O \in \mathbb{R}^{h d_v \times d_{model}}$.
3.3 GAN
Generative Adversarial Networks (GANs) are a powerful class of neural networks that
are used for unsupervised learning. GANs are made up of two neural networks, a discrimina-
tor and a generator. They use adversarial training to produce artificial data that closely resembles real data. The Generator, starting from random noise samples, attempts to fool the Discriminator, which is tasked with accurately distinguishing between generated and genuine data. Realistic, high-quality samples are produced as a result of this competitive inter-
action, which drives both networks toward advancement. GANs are proving to be highly
versatile artificial intelligence tools, as evidenced by their extensive use in image synthesis,
style transfer, and text-to-image synthesis.
3.3.1 The Structure of GANs
Generative Adversarial Networks (GANs) can be broken down into three parts:
• Generative: A generative model learns the distribution of the training data and produces new samples from it; in a GAN this role is played by the generator network.
• Adversarial: The word adversarial refers to setting one thing up against another. This means that, in the context of GANs, the generative result is compared with the actual images in the data set. A mechanism known as a discriminator is used to apply a model that attempts to distinguish between real and fake images.
• Discriminator: Tries to distinguish the output generated by the generator from the original samples, and outputs a value between 0 and 1. If the value is close to 0, the sample is judged fake, and if the value is close to 1, the sample is judged real (a minimal training sketch follows this list).
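For illustration, a minimal adversarial training loop on toy one-dimensional data; the architectures, optimizer settings, and data below are placeholders chosen only to keep the sketch short and runnable, not the configuration used elsewhere in this report.

```python
import torch
import torch.nn as nn

# Toy 1-D GAN: the generator maps noise to scalars, the discriminator scores them.
G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

real = torch.randn(64, 1) * 0.5 + 2.0            # stand-in for real data samples

for step in range(200):
    # Discriminator update: real samples -> 1, generated samples -> 0.
    fake = G(torch.randn(64, 8)).detach()
    loss_d = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator update: try to make the discriminator output 1 on fake samples.
    fake = G(torch.randn(64, 8))
    loss_g = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

print(fake.mean().item())   # with enough steps this drifts toward the real mean (~2.0)
```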
4 Algorithm Proposal
We first started by training a segmentation model on the source dataset. We chose Potsdam as the source dataset because it is far larger than the Vaihingen dataset; in real scenarios, target datasets are typically smaller and less structured than source datasets. We performed the segmentation using popular segmentation models and then trained a CycleGAN model using two datasets: one for Potsdam and the other for Vaihingen.
Once the training of the proposed GAN architecture was done, we used it to translate
the full dataset of the source domain (Potsdam) to the target domain (Vaihingen). We note
that the global style of the translated image is imitating the style of the target domain. The
images generated are similar to what we can get as new images of the Potsdam town using
the IRRG sensor used for Vaihingen images.
When the translated dataset was ready, we used it to fine-tune the trained segmentation model. We performed the fine-tuning epoch by epoch and tested the model on the target dataset after every epoch to measure the improvement in average accuracy on the target dataset.
4.1 Unet
UNet is a convolutional neural network (CNN) architecture designed for semantic segmentation tasks in image processing. Originally developed for biomedical image segmentation, UNet has found widespread applications in various domains. Its distinctive architecture comprises a contracting path to capture context and a symmetric expansive path to enable precise localization.
The primary strength of UNet lies in its ability to handle image segmentation, a task
where the objective is to classify each pixel in an image into a specific class. This architecture
excels in capturing intricate details and fine structures within images, making it particularly
well-suited for tasks such as medical image segmentation, where precise delineation of struc-
tures like organs, tumors, or cells is crucial.
The UNet architecture has been adapted and extended in numerous ways to address
specific challenges and requirements in different applications. Its versatility has led to its use
in diverse fields beyond medical imaging, such as satellite image analysis, scene segmentation
in autonomous vehicles, and more.
Figure 9: UNet architecture
The contracting path in UNet is responsible for identifying the relevant features in
the input image. The encoder layers perform convolutional operations that reduce the spatial
resolution of the feature maps while increasing their depth, thereby capturing increasingly
abstract representations of the input. This contracting path is similar to the feedforward
layers in other convolutional neural networks.
On the other hand, the expansive path decodes the encoded data and locates the features while restoring the spatial resolution of the input. The decoder layers in the expansive path upsample the feature maps while also performing convolutional operations. The skip connections from the contracting path help to recover the spatial information lost during downsampling, which helps the decoder layers to locate the features more accurately.
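For illustration, a minimal two-level UNet-style network with skip connections, written in PyTorch; this is a simplified sketch, not the exact UNet configuration used in our experiments.

```python
import torch
import torch.nn as nn

def double_conv(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    """Two-level UNet: contracting path, expansive path, one skip connection per level."""
    def __init__(self, in_ch=3, n_classes=6):
        super().__init__()
        self.enc1 = double_conv(in_ch, 32)
        self.enc2 = double_conv(32, 64)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = double_conv(64, 128)
        self.up2 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec2 = double_conv(128, 64)        # 64 (upsampled) + 64 (skip)
        self.up1 = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec1 = double_conv(64, 32)         # 32 + 32
        self.head = nn.Conv2d(32, n_classes, 1)

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        b = self.bottleneck(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))   # skip connection from enc2
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))  # skip connection from enc1
        return self.head(d1)

logits = TinyUNet()(torch.randn(1, 3, 256, 256))
print(logits.shape)   # torch.Size([1, 6, 256, 256])
```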
4.2 UnetPlusPlus
U-Net++, or Nested U-Net, is a deep learning architecture that was introduced in 2018.
In UNet, the encoder part captures high-level features from the input image through a series of
convolutional and pooling layers, while the decoder part upsamples these features to generate
a dense segmentation map. However, there can be a semantic gap between the encoder and
decoder features, meaning that the decoder may struggle to reconstruct fine-grained details
and produce accurate segmentation.
UNet++ introduces the concept of nested skip pathways to bridge this semantic gap.
It adds additional skip connections between the encoder and decoder blocks at multiple res-
olutions. These connections allow the decoder to access and incorporate both low-level and
high-level features from the encoder, providing a more detailed and comprehensive under-
standing of the image. They improved the traditional U-Net architecture by redesigning the
skip connections and introducing a deeply supervised nested encoder-decoder network.
Instead of a traditional skip connection, the feature map from the lower level is also convolved with the upper-level feature, and the new combined feature is then passed on. The basic idea behind UNet++ is to bridge the semantic gap between the feature maps of the encoder and decoder before the fusion.
4.3 UperNet
The UPerNet model is designed for the task of segmentation, which involves parsing
and comprehending various visual elements within a given scene. The model leverages the
concept of a unified framework that combines bottom-up and top-down pathways for feature
extraction and semantic segmentation. By integrating multi-scale features and contextual
information, UPerNet aims to achieve better performance in segmentation tasks.
Figure 11: UperNet general framework
• PPM Head: The Pyramid Pooling Module (PPM) is a feature extraction component
commonly used in computer vision models. First introduced in the PSPNet architec-
ture, the PPM aims to capture multi-scale contextual information from an input feature
map.
• Feature Pyramid Network: The Feature Pyramid Network (FPN) was introduced in the paper "Feature Pyramid Networks for Object Detection". To address the challenge of detecting objects at different scales, the FPN creates a feature pyramid that combines features from different levels of spatial resolution while preserving semantic information.
• Skip Connections: ResNet-50 introduces skip connections (also known as shortcut con-
nections or identity mappings) to address the vanishing gradient problem. By allowing
the flow of gradients directly from earlier layers to later layers, skip connections enable
the network to better retain and propagate information through the network, reducing
the degradation of performance as the network gets deeper.
When used as a backbone, instead of classifying the input image, the model outputs multiple feature maps of different sizes as the image is progressively downsampled.
Figure 12: Resnet-50 backbone
The PPM aims to capture multi-scale contextual information from an input feature
map. It divides the feature map into multiple levels, with each level representing a different
scale. The PPM then performs pooling operations within each level to aggregate information
and generate fixed-length representations.
• Dividing the feature map: The input feature map is divided into several levels, each
having a different spatial resolution. These levels are created by applying pooling
operations with different kernel sizes (1, 2, 3, 6).
• Convolution layer: Each level is then passed through a convolution layer to capture information from neighboring positions.
• Upsampling and concatenation: The representations from different levels are then upsampled to the original spatial resolution and concatenated together. This combination of features from multiple scales allows for the integration of both local and global contextual information (a minimal sketch of the module is given below).
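A minimal sketch of such a Pyramid Pooling Module in PyTorch; the channel sizes here are illustrative assumptions rather than the exact UperNet configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PPM(nn.Module):
    """Pyramid Pooling Module: pool at several scales, 1x1-convolve, upsample, concatenate."""
    def __init__(self, in_ch=512, branch_ch=128, bins=(1, 2, 3, 6)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.AdaptiveAvgPool2d(b),                     # pool the map down to b x b
                nn.Conv2d(in_ch, branch_ch, 1, bias=False),
                nn.BatchNorm2d(branch_ch),
                nn.ReLU(inplace=True),
            )
            for b in bins
        )

    def forward(self, x):
        h, w = x.shape[-2:]
        outs = [x]
        for branch in self.branches:
            y = branch(x)
            outs.append(F.interpolate(y, size=(h, w), mode="bilinear", align_corners=False))
        return torch.cat(outs, dim=1)   # in_ch + len(bins) * branch_ch channels

feat = torch.randn(1, 512, 32, 32)
print(PPM()(feat).shape)   # torch.Size([1, 1024, 32, 32])
```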
Feature Pyramid Network
The feature maps are then rescaled and concatenated into a final feature map, which is passed through a segmentation head to produce the final result.
4.4 BANet (Bilateral Awareness Network)
Figure 14: BANet architecture
Dependency path: employs a stem block and four transformer stages (i.e., Stages 1-4) to extract long-range dependent features (LDF). Each stage consists of two efficient transformer blocks (ETB). In particular, Stage 2, Stage 3, and Stage 4 additionally involve patch embedding (PE) operations. After the dependency path, two long-range dependent features (i.e., LDF3 and LDF4) are generated.
Stem block: The stem block aims to shrink the height and width dimensions and expand the channel dimension. To capture low-level information effectively, it introduces three 3x3 convolution layers with strides of [2, 1, 2]. The first two convolution layers are followed by batch normalization (BN) and ReLU activation. After the stem block, the spatial resolution is downscaled by a factor of 4, and the channel dimension is extended from 3 to 64.
Patch embedding (PE): The patch embedding aims to down-sample the feature map to build hierarchical feature representations. The output of each patch embedding can be formalized as:
$$PE(X_1) = \mathrm{Sigmoid}\big(\mathrm{DWConv}(X_1)\big) \cdot X_1 \qquad (7)$$
$$X_1 = \mathrm{BN}(W_s \cdot X) \qquad (8)$$
where $W_s$ represents a convolution layer with a kernel size of $s+1$ and a stride of $s$; here $s$ is set to 2. DWConv denotes a 3x3 depth-wise convolution with a stride of 1.
Efficient transformer block (ETB): Each efficient transformer block is composed of efficient multi-head self-attention (EMSA), a multilayer perceptron (MLP), and layer normalization (LN).
Texture path: is a lightweight convolutional network, which builds four diverse convolutional layers to capture textural information. Each layer has batch normalization and a ReLU activation. The convolutional layer of T1 has a kernel size of 7 and a stride of 2, which expands the channel dimension from 3 to 64. For T2 and T3, the kernel size and stride are 3 and 2, respectively, and the channel dimension is kept at 64. For T4, the convolutional layer is a standard 1x1 convolution with a stride of 1, expanding the channel dimension from 64 to 128. Thus, the output textural feature is downscaled by a factor of 8 and has a channel dimension of 128.
Feature aggregation module (FAM): FAM aims to leverage the benefits of the
dependent features and texture features comprehensively for powerful feature representation.
The input features for the FAM include the LDF3, LDF4 and TF. To fuse those features, we
first employ an attentional embedding module (AEM) to merge the LDF3 and LDF4. There-
after, the merged feature is upsampled to concatenate with the TF, obtaining the aggregated
feature. Finally, the linear attention module is deployed to reduce the fitting residual of the
aggregated feature (AF).
Linear attention module (LAM): In the FAM, we employ LAM to enhance the
spatial relationships of AF, thereby suppressing the fitting residual. Then, a convolutional
layer with BN and ReLU is deployed to obtain the attention map. Finally, we apply a matrix
multiplication operation between AF and the attention map to obtain the attentional AF.
Attentional embedding module (AEM): The AEM adopts the LAM to enhance
the spatial relationships of LDF4. Then, we apply a matrix multiplication operation between
the upsampling attention map of LDF4 and LDF3 to produce the attentional LDF3. Finally,
we use an addition operation to fuse the original LDF3 and the attentional LDF3.
4.5 MANet (Multi-Attention-Network)
– Kernel attention is a novel attention mechanism designed for deep learning models, particularly in computer vision tasks like semantic segmentation. Its key strengths are:
∗ By modeling the relationships between elements, kernel attention effectively captures long-range dependencies within the data. This enables it to better understand the overall context and relationships between distant parts of the image.
Working principles:
1. Input Feature Maps: The model starts with a set of feature maps extracted
from the input data.
3. Similarity Matrix Calculation: The transformed feature maps are then used
to calculate a similarity matrix. This matrix reflects the degree of similarity
between every pair of elements in the data.
5. Weighted Sum: Finally, the original feature maps are weighted by the atten-
tion weights and summed to create a new set of enhanced feature maps. These
new maps incorporate both local and global information, better representing
the context and relationships within the data.
Benefits:
∗ Wider applicability: This efficiency opens doors for utilizing kernel attention
in resource-constrained environments and larger models.
– Channel attention adaptively reweights the channels within a feature map. This improves feature representation by amplifying informative channels and suppressing less relevant ones.
Working principles:
1. Squeeze: The first step is to "squeeze" the feature map, typically using global average pooling or a small 1x1 convolution. This operation reduces the spatial dimensionality of the map, producing a single value per channel that summarizes the information across all spatial locations.
Benefits:
Attention mechanisms model dependencies across the whole input over space and time. It remains an intractable problem to model global dependency on large-scale inputs, such as fine-resolution images. To alleviate the substantial computational requirement, prior work has used a sparse factorization of the attention matrix to reduce the complexity from $O(N^2)$ to $O(N\sqrt{N})$, and has expressed self-attention as a linear dot-product of kernel feature maps to further reduce the complexity to $O(N)$.
MANet not only dramatically decreases the complexity, but also fully exploits the potential of the attention mechanism by designing a multilevel framework. Specifically, we reduce the complexity of the dot-product attention mechanism to O(N) by treating attention as a kernel function. As the complexity of attention is reduced dramatically by kernel attention, we propose a Multi-Attention-Network (MANet) with a ResNeXt-101 backbone which explores the complex combinations between attention mechanisms and deep networks for the task of semantic segmentation using fine-resolution remote sensing images. The major contributions of this research include: 1) a novel attention mechanism involving kernel attention with linear complexity is proposed to alleviate the attention module's huge computational demand; 2) to extract refined dense features, we replace ResNet with ResNeXt-101 as the backbone, enhancing the ability for feature extraction; 3) based on kernel attention and ResNeXt-101, we propose a Multi-Attention-Network (MANet) that extracts contextual dependencies using multi-kernel attention.
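For illustration, a sketch of linear (kernelized) attention using the $\phi(x) = \mathrm{elu}(x) + 1$ feature map from the linear-transformer literature; MANet's kernel attention follows the same O(N) idea, though its exact kernel differs, so this should be read as an assumption-laden example rather than the paper's formulation.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """Kernelized attention computed in O(N) rather than O(N^2).

    q, k: (N, d_k), v: (N, d_v). Instead of forming the N x N similarity matrix,
    we first contract keys with values (d_k x d_v) and then apply the queries.
    """
    phi_q = F.elu(q) + 1.0
    phi_k = F.elu(k) + 1.0
    kv = phi_k.t() @ v                              # (d_k, d_v)
    normalizer = phi_q @ phi_k.sum(dim=0) + eps     # (N,), row-wise normalization
    return (phi_q @ kv) / normalizer.unsqueeze(-1)

q, k, v = (torch.randn(4096, 64) for _ in range(3))
print(linear_attention(q, k, v).shape)   # torch.Size([4096, 64])
```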
Multi-Attention-Network
For the spatial dimension, as the computational complexity of the dot-product attention mechanism exhibits a quadratic relationship with the size of the input $(N = H \times W)$, we use an attention mechanism based on kernel attention, named KAM. For the channel dimension, the number of input channels $C$ is normally far less than the number of pixels contained in the feature maps (i.e., $C \ll N$). Therefore, the complexity of the softmax function over channels, i.e., $O(C^2)$, is not large. Thus, we utilize the channel attention mechanism based on the dot-product, named CAM. Like the dot-product attention
mechanism, there exists a residual connection in the KAM and CAM, adding output with
the input features directly.
Using the kernel attention mechanism (KAM) and channel attention mechanism (CAM)
which model the long-range dependencies of positions and channels, respectively, we design
an attention block to enhance the discriminative ability of feature maps extracted by each
layer. Features generated by the ResBlock are fed into the KAM and CAM to refine the
information in positions and channels, respectively. Thereafter, the refined feature maps are
added directly to obtain the output of the corresponding attention block, whose structure can be seen in Fig. 21(b).
Figure 19: Details of the channel attention mechanism
4.6 UnetFormer
The proposed UNetFormer is built with a CNN-based encoder and a Transformer-based decoder, as shown in Fig. 22. Each component is explained in detail in the sections that follow.
Figure 21: The structure of (a) the proposed MANet, (b) the Attention block, (c) the Res-
Block, and (d) the DeBlock.
Figure 22: An overview of the UNetFormer
CNN-based encoder Here, we choose to use the pre-trained ResNet18 as the encoder in order to extract semantic features at a low computational cost. Each of ResNet18's four Resblock stages downsamples the feature map by a factor of two. The proposed UNetFormer uses a 1x1 convolution with a channel dimension of 64 as the skip connection to fuse the feature maps produced by each stage with the corresponding feature maps of the decoder.
Figure 23: A global-local Transformer block.
Feature refinement head (FRH): The channel path employs a global average pooling layer to generate a channel-wise attentional map $C \in \mathbb{R}^{1 \times 1 \times c}$, where $c$ denotes the channel dimension. The reduce-and-expand operation contains two 1x1 convolutional layers, which first reduce the channel dimension $c$ by a factor of 4 and then expand it back to the original size. The spatial path utilizes a depth-wise convolution to produce a spatial-wise attentional map $S \in \mathbb{R}^{h \times w \times 1}$, where $h$ and $w$ represent the spatial resolution of the feature map. The attentional features generated by the two paths are further fused using a sum operation. Finally, a post-processing 1x1 convolutional layer and an upsampling operation are applied to produce the final segmentation map.
Figure 24: Feature refinement head
4.7 CycleGAN
Standard image-to-image translation GANs take paired inputs, which is a disadvantage: achieving strong generalization and synthesis ability usually requires a large-scale paired dataset before the model can generate convincingly realistic images. However, some data are scarce and hard to obtain, for example paintings by an artist who has passed away.
Because the source-target pairs are unpaired, we do not know exactly which target image a specific source image maps to. On the other hand, the model can learn a mapping at the set level between set $X$ and set $Y$ in an unsupervised manner. Our goal is to learn mapping functions between two domains $X$ and $Y$ given training samples $\{x_i\}_{i=1}^{N}$, where $x_i \in X$, and $\{y_j\}_{j=1}^{M}$, where $y_j \in Y$. We denote the data distributions as $x \sim p_{data}(x)$ and $y \sim p_{data}(y)$. As illustrated in Figure 26(a), our model includes two mappings $G : X \to Y$ and $F : Y \to X$. In addition, we introduce two adversarial discriminators $D_X$ and $D_Y$, where $D_X$ aims to distinguish between images $\{x\}$ and translated images $\{F(y)\}$; similarly, $D_Y$ aims to discriminate between $\{y\}$ and $\{G(x)\}$. Our objective contains two types of terms:
adversarial losses for matching the distribution of generated images to the data distribution in the target domain, and cycle consistency losses to prevent the learned mappings $G$ and $F$ from contradicting each other.
Figure 25: Paired (left) vs. unpaired (right) image-to-image translation. A paired training dataset provides a correspondence between each input image $x_i$ and $y_i$. In contrast, an unpaired training dataset consists of two sets $X = \{x_i\}_{i=1}^{N}$ and $Y = \{y_j\}_{j=1}^{M}$ with no information about the mapping between them.
4.7.1 Adversarial Loss
We apply adversarial losses to both mapping functions. For the mapping function $G : X \to Y$ and its discriminator $D_Y$, we express the objective as:
$$\mathcal{L}_{GAN}(G, D_Y, X, Y) = \mathbb{E}_{y \sim p_{data}(y)}[\log D_Y(y)] + \mathbb{E}_{x \sim p_{data}(x)}[\log(1 - D_Y(G(x)))] \qquad (12)$$
where $G$ tries to generate images $G(x)$ that look similar to images from domain $Y$, while $D_Y$ aims to distinguish between translated samples $G(x)$ and real samples $y$. $G$ aims to minimize this objective against an adversary $D_Y$ that tries to maximize it, i.e., $\min_G \max_{D_Y} \mathcal{L}_{GAN}(G, D_Y, X, Y)$. We introduce a similar adversarial loss for the mapping function $F : Y \to X$ and its discriminator $D_X$ as well:
$$\mathcal{L}_{GAN}(F, D_X, Y, X) = \mathbb{E}_{x \sim p_{data}(x)}[\log D_X(x)] + \mathbb{E}_{y \sim p_{data}(y)}[\log(1 - D_X(F(y)))] \qquad (13)$$
where $F$ tries to generate images $F(y)$ that look similar to images from domain $X$, and $D_X$ aims to distinguish between translated samples $F(y)$ and real samples $x$. $F$ aims to minimize
this objective against an adversary $D_X$ that tries to maximize it, i.e., $\min_F \max_{D_X} \mathcal{L}_{GAN}(F, D_X, Y, X)$.
Figure 26: Two mapping functions $G : X \to Y$ and $F : Y \to X$, and associated adversarial discriminators $D_Y$ and $D_X$. $D_Y$ encourages $G$ to translate $X$ into outputs indistinguishable from domain $Y$, and vice versa for $D_X$ and $F$. To further regularize the mappings, we introduce two cycle consistency losses that capture the intuition that if we translate from one domain to the other and back again we should arrive where we started: (b) forward cycle-consistency loss: $x \to G(x) \to F(G(x)) \approx x$, and (c) backward cycle-consistency loss: $y \to F(y) \to G(F(y)) \approx y$.
4.7.2 Cycle Consistency Loss
Adversarial training can, in theory, learn mappings $G$ and $F$ that produce outputs identically distributed as the target domains $Y$ and $X$ respectively (strictly speaking, this requires $G$ and $F$ to be stochastic functions). However, with large enough capacity, a network can map the same set of input images to any random permutation of images in the target domain, where any of the learned mappings can induce an output distribution that matches the target distribution. Thus, adversarial losses alone cannot guarantee that the learned function can map an individual input $x_i$ to a desired output $y_i$. To further reduce the space of possible mapping functions, we argue that the learned mapping functions should be cycle-consistent: as shown in Figure 26(b), for each image $x$ from domain $X$, the image translation cycle should be able to bring $x$ back to the original image, i.e., $x \to G(x) \to F(G(x)) \approx x$. We call this forward cycle consistency. Similarly, as illustrated in Figure 26(c), for each image $y$ from domain $Y$, $G$ and $F$ should also satisfy backward cycle consistency: $y \to F(y) \to G(F(y)) \approx y$. We incentivize this behavior using a cycle consistency loss:
$$\mathcal{L}_{cyc}(G, F) = \mathbb{E}_{x \sim p_{data}(x)}[\|F(G(x)) - x\|_1] + \mathbb{E}_{y \sim p_{data}(y)}[\|G(F(y)) - y\|_1] \qquad (14)$$
In preliminary experiments, we also tried replacing the L1 norm in this loss with an adversarial loss between $F(G(x))$ and $x$, and between $G(F(y))$ and $y$, but did not observe improved performance.
The behavior induced by the cycle consistency loss can be observed in Figure 27: the reconstructed images $F(G(x))$ end up close to the original input images $x$.
Figure 27: The input images $x$, output images $G(x)$, and the reconstructed images $F(G(x))$ from various experiments. From top to bottom: photo ↔ Cezanne, horses ↔ zebras, winter → summer Yosemite, aerial photos ↔ Google maps.
The full objective combines the two adversarial losses and the cycle consistency loss,
$$\mathcal{L}(G, F, D_X, D_Y) = \mathcal{L}_{GAN}(G, D_Y, X, Y) + \mathcal{L}_{GAN}(F, D_X, Y, X) + \lambda \, \mathcal{L}_{cyc}(G, F)$$
where $\lambda$ controls the relative importance of the two objectives. We aim to solve:
$$G^*, F^* = \arg\min_{G, F} \max_{D_X, D_Y} \mathcal{L}(G, F, D_X, D_Y)$$
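For illustration, a sketch of how these losses can be assembled in PyTorch; the non-saturating BCE form and the value λ = 10 are common practical choices for CycleGAN and are assumptions here, not settings taken from our experiments.

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()   # assumes the discriminators end with a Sigmoid
l1 = nn.L1Loss()

def cyclegan_losses(G, F_, D_X, D_Y, x, y, lam=10.0):
    """Generator and discriminator objectives built from Eqs. (12)-(14)."""
    fake_y, fake_x = G(x), F_(y)

    # Generator side (non-saturating BCE form): push D outputs on fakes toward 1,
    # plus the cycle consistency term weighted by lambda.
    pred_fy, pred_fx = D_Y(fake_y), D_X(fake_x)
    adv = bce(pred_fy, torch.ones_like(pred_fy)) + bce(pred_fx, torch.ones_like(pred_fx))
    cyc = l1(F_(fake_y), x) + l1(G(fake_x), y)
    loss_gen = adv + lam * cyc

    # Discriminator side: real -> 1, translated -> 0 (fakes detached from the generators).
    ry, fy = D_Y(y), D_Y(fake_y.detach())
    rx, fx = D_X(x), D_X(fake_x.detach())
    loss_D_Y = bce(ry, torch.ones_like(ry)) + bce(fy, torch.zeros_like(fy))
    loss_D_X = bce(rx, torch.ones_like(rx)) + bce(fx, torch.zeros_like(fx))
    return loss_gen, loss_D_X, loss_D_Y

# Tiny toy networks and 8x8 single-channel "images", just to exercise the function.
G  = nn.Sequential(nn.Conv2d(1, 1, 3, padding=1), nn.Tanh())
F_ = nn.Sequential(nn.Conv2d(1, 1, 3, padding=1), nn.Tanh())
D_X = nn.Sequential(nn.Conv2d(1, 1, 3, padding=1), nn.AdaptiveAvgPool2d(1), nn.Sigmoid())
D_Y = nn.Sequential(nn.Conv2d(1, 1, 3, padding=1), nn.AdaptiveAvgPool2d(1), nn.Sigmoid())
x, y = torch.rand(2, 1, 8, 8), torch.rand(2, 1, 8, 8)
print([round(t.item(), 3) for t in cyclegan_losses(G, F_, D_X, D_Y, x, y)])
```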
4.7.3 Generator network
We use the U-Net, a common architecture for the CycleGAN generator. U-Net is a network which consists of a sequence of downsampling blocks followed by a sequence of upsampling blocks, giving it the U-shaped architecture. In the upsampling path, we concatenate the outputs of the upsampling blocks with the outputs of the corresponding downsampling blocks symmetrically. This can be seen as a kind of skip connection, facilitating information flow in deep networks and reducing the impact of vanishing gradients.
4.7.4 Discriminator network
Unlike conventional discriminators that output a single probability of the input image being real or fake, CycleGAN uses the PatchGAN discriminator, which outputs a matrix of values. Intuitively, each value of the output matrix scores the corresponding portion of the input image: values closer to 1 indicate a real classification and values closer to 0 indicate a fake classification.
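For illustration, a minimal PatchGAN-style discriminator in PyTorch; the layer widths and 70x70-style receptive field are conventional choices and should be read as assumptions rather than the exact discriminator used in our experiments.

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """PatchGAN: outputs a grid of real/fake scores, one per input patch."""
    def __init__(self, in_ch=3, base=64):
        super().__init__()
        def block(i, o, stride):
            return nn.Sequential(nn.Conv2d(i, o, 4, stride, 1), nn.InstanceNorm2d(o),
                                 nn.LeakyReLU(0.2, inplace=True))
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, base, 4, 2, 1), nn.LeakyReLU(0.2, inplace=True),
            block(base, base * 2, 2),
            block(base * 2, base * 4, 2),
            block(base * 4, base * 8, 1),
            nn.Conv2d(base * 8, 1, 4, 1, 1),
            nn.Sigmoid(),            # each cell of the output map scores one image patch
        )

    def forward(self, x):
        return self.net(x)

scores = PatchDiscriminator()(torch.randn(1, 3, 256, 256))
print(scores.shape)   # a matrix of per-patch scores, e.g. torch.Size([1, 1, 30, 30])
```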
5 Experimental Results
In this section we conduct extensive experiments on the proposed algorithms.
5.1 Experimental settings
For the settings, the training set and the test set contain 5000 and 1400 images of size 256x256, respectively.
Due to time and resource constraints, our model tuning efforts are focused on key pa-
rameters within the baseline Unet architecture. We specifically target adjustments to the
learning rate, number of layers, filter ratio, and the incorporation of dropout and normal-
ization techniques. Optimizing for the best performance, we have configured the following parameters for our model: we set the learning rate to 8e-3 and the filter ratio to 1, use both dropout and normalization techniques, and train the models until convergence. The loss we use is a joint loss of Cross Entropy Loss and Dice Loss.
Figure 29: Performance with different learning rates, numbers of layers, filter ratios, and the incorporation of dropout and normalization techniques
Cross Entropy Loss is a widely used loss function in classification tasks, particularly
in the context of machine learning and deep learning. It measures the dissimilarity between
the predicted probability distribution and the true probability distribution of the classes in a
classification problem. The cross-entropy loss is designed to penalize the model more heavily
when it makes confident incorrect predictions.
In a binary classification scenario, where there are only two classes (commonly denoted as 0 and 1), the cross-entropy loss for a single sample can be defined as:
$$L(y, \hat{y}) = -\big[\, y \log(\hat{y}) + (1 - y)\log(1 - \hat{y}) \,\big]$$
where $L$ is the cross-entropy loss, $y$ is the true label (0 or 1) and $\hat{y}$ is the predicted probability of belonging to class 1.
For a multi-class classification problem with C classes, the cross-entropy loss is gener-
alized as:
$$L(y, \hat{y}) = -\sum_{i=1}^{C} y_i \cdot \log(\hat{y}_i) \qquad (18)$$
where $L$ is the cross-entropy loss, $y_i$ is the true probability of class $i$ and $\hat{y}_i$ is the predicted probability of belonging to class $i$.
Dice Loss, also known as Sorensen-Dice coefficient or F1 score loss, is often used in
image segmentation tasks to measure the overlap between the predicted segmentation masks
and the ground truth masks. The formula for Dice Loss is defined as follows:
$$\text{Dice Loss} = 1 - \frac{2\sum_{i}^{N} p_i \cdot g_i}{\sum_{i}^{N} p_i^2 + \sum_{i}^{N} g_i^2} \qquad (19)$$
where $N$ is the total number of pixels in the images, $p_i$ is the predicted probability of a pixel being part of the object in the segmentation mask and $g_i$ is the ground truth binary label indicating whether a pixel belongs to the object.
We leverage both Dice Loss and Cross Entropy Loss for this particular problem by taking a weighted sum of the two losses, with a weight of 1 on each loss.
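For illustration, a sketch of the joint loss in PyTorch; the multi-class extension of Eq. (19) shown here (per-class soft Dice averaged over classes) is an assumption about implementation details not spelled out above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointLoss(nn.Module):
    """Equal-weight sum of Cross Entropy Loss and a soft multi-class Dice Loss."""
    def __init__(self, eps=1e-6):
        super().__init__()
        self.ce = nn.CrossEntropyLoss()
        self.eps = eps

    def dice(self, logits, target):
        num_classes = logits.shape[1]
        probs = torch.softmax(logits, dim=1)
        one_hot = F.one_hot(target, num_classes).permute(0, 3, 1, 2).float()
        inter = (probs * one_hot).sum(dim=(0, 2, 3))
        denom = (probs ** 2).sum(dim=(0, 2, 3)) + (one_hot ** 2).sum(dim=(0, 2, 3))
        return 1.0 - (2.0 * inter / (denom + self.eps)).mean()

    def forward(self, logits, target):
        return self.ce(logits, target) + self.dice(logits, target)   # weight 1 on each term

logits = torch.randn(2, 6, 256, 256)                 # 6 classes, as in Potsdam/Vaihingen
target = torch.randint(0, 6, (2, 256, 256))
print(JointLoss()(logits, target).item())
```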
CycleGAN model Regarding the configurations, we randomly selected 400 images, each sized 256 x 256, from the original TOP images. These images were then divided into two subsets: a training subset comprising 300 images and a test subset consisting of 100 images, applied to both the Potsdam and Vaihingen datasets. For optimization, we opted for the Adam optimizer, setting the learning rate to 2e-4. This choice is influenced by the positive performance observed with this learning rate on similar datasets.
5.2 Experimental results
5.2.1 Segmentation on Potsdam datasets
For result evaluation, we test the models in two settings: without tuning and with tuning. For tuning we use data augmentation and Xavier weight initialization in the model. The results can be observed as follows:
We can see that the baseline UNet model achieves a competitive performance of 76.7%. We tried to achieve better results by testing BANet, UperNet, UNet++, MANet, and UNetFormer. BANet, despite its use of efficient transformers, unexpectedly reached only 76.0% on the test set, slightly underperforming UNet. Its segmentation head does not seem able to replace a full decoder block, which may make the model less effective than the traditional encoder-decoder design for image segmentation; traditional encoder-decoder models rely on upsampling, which implies that more parameters are needed in the upsampling path to maintain the model's effectiveness. UperNet achieved a better performance than UNet, with 81.1%, thanks to the efficient feature extraction of its Pyramid Pooling Module head. Meanwhile, UNetPlusPlus and MANet show superior performance compared to the other models: UNetPlusPlus leverages its nested CNN blocks, which narrow the semantic gap between the feature maps of the encoder and decoder before fusion. MANet has the best performance among the tested models, which we attribute to its attention mechanism capturing the global information of the feature maps before the deconvolution block of each encoding stage, making the model reliable and yielding high accuracy on the test set. Finally, the simple fusion between a ResNet block encoder and a Transformer decoder gives UNetFormer a strong decoder that learns the global context efficiently, allowing the model to perform very well.
There are some key features of the data which led to the decent results of the solutions:
• The images overall have good and distinguishable features: for example, tree contours are not too hard to extract, and the roads and houses have regular, square-shaped layouts, which helps the model pick up the significant features in the image.
5.2.2 Segmentation on Vaihingen datasets
For the evaluation of results, we conduct testing on two distinct scenarios: the original
models and the models after the fine-tuning process. During the fine-tuning phase, we train
our models until they converge, ensuring that the training process captures the intricacies of
the data and reaches a stable state. The result can be observed as follows:
We can observe a significant improvement in the results of the algorithms after fine-
tuning with the translated dataset, rising from approximately 26-30% to 68-75%. The accu-
racy of the models remains consistent when segmenting the Potsdam dataset, indicating a
substantial impact of the network architecture. This suggests that with stronger networks,
the performance is better.
Firstly, there are classes where our algorithm substantially increased model accuracy
(specifically, classes building and tree). A comparison with Table 1 reveals that these classes
are predominantly characterized by a domain shift primarily related to the sensor factor. In
cases where the domain shift is primarily linked to the sensor factor, our algorithm proves
highly effective in boosting model accuracy. For instance, the class building is solely affected
by the sensor factor, leading to a notable increase in average accuracy from 0.27 to 0.68.
Similarly, for the class tree, our algorithm demonstrates efficiency in enhancing accuracy,
albeit with some limitations attributed to other domain shift factors.
On the other hand, for classes like impervious surfaces, cars, clutter background, and
low vegetation, our algorithm exhibits no practical effect in altering accuracy; it effectively
conserves accuracy. As indicated in Table 1, these classes are either not affected by any
domain shift factor (e.g., classes cars and impervious surfaces) or are highly influenced by a
factor other than the sensor (e.g., clutter background or low vegetation).
In essence, our algorithm maintains model accuracy when there is either no domain
shift or when the domain shift is predominantly related to a factor other than the sensor. This
characteristic is valuable as it allows for the integration of our algorithm with other techniques
that may target different domain shift factors. The algorithm successfully addresses the
elimination of the sensor factor without adversely affecting other factors. If the domain shift
primarily stems from the sensor factor between source and target datasets, our algorithm
proves capable of significantly improving accuracy, comparable to training the model on a
fully labeled target dataset, as exemplified in the class building.
6 Conclusion and Discussion
In this report, we introduced several supervised and unsupervised learning algorithms. The overall performance of the tested methods is decent: the best performance was obtained by the UNetFormer model with 91.7% accuracy on the Potsdam dataset, and the performance of all models increased after applying Unsupervised Domain Adaptation on the Vaihingen dataset. After conducting the experiments, we analysed the advantages and disadvantages of each method.
During the process of working on the project, the group has accomplished the following tasks:
In the future, the group wants to experiment with the models on more advanced techniques and problems in deep learning, especially semantic segmentation and GANs (e.g., applying a pretrained U-Net model as the generator network of a Conditional Cycle-Consistent Generative Adversarial Network (CycleGAN) for particular problems), while also exploring other approaches.
We want to take a moment to express our heartfelt gratitude for all the support and guidance our teacher has provided us throughout the challenging topics of this course. Your expertise and dedication have truly made a difference in our journey, and we are incredibly grateful for your unwavering commitment to our education.
7 Appendix
Here are some demo inferences of our work:
References
[1] Nicolas Audebert, Bertrand Le Saux, and Sébastien Lefèvre. Beyond RGB: very high
resolution urban remote sensing with multimodal deep networks. CoRR, abs/1711.08681,
2017.
[2] Yifeng Chen, Guangchen Lin, Songyuan Li, Omar El Farouk Bourahla, Yiming Wu,
Fangfang Wang, Junyi Feng, Mingliang Xu, and Xi Li. Banet: Bidirectional aggregation
network with occlusion handling for panoptic segmentation. CoRR, abs/2003.14031, 2020.
[3] Rui Li, Shunyi Zheng, Ce Zhang, Chenxi Duan, Jianlin Su, Libo Wang, and Peter M
Atkinson. Multiattention network for semantic segmentation of fine-resolution remote
sensing images. IEEE Transactions on Geoscience and Remote Sensing, 60:1–13, 2021.
[4] Libo Wang, Shenghui Fang, Ce Zhang, Rui Li, and Chenxi Duan. Efficient hybrid transformer: Learning global-local context for urban scene segmentation. CoRR, abs/2109.08937, 2021.
[5] Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, and Jian Sun. Unified perceptual
parsing for scene understanding. CoRR, abs/1807.10221, 2018.
[6] Cheng Zhang, Wanshou Jiang, Yuan Zhang, Wei Wang, Qing Zhao, and Chenjie Wang.
Transformer and cnn hybrid deep neural network for semantic segmentation of very-
high-resolution remote sensing imagery. IEEE Transactions on Geoscience and Remote
Sensing, 60:1–20, 2022.
[7] Zongwei Zhou, Md Mahfuzur Rahman Siddiquee, Nima Tajbakhsh, and Jianming
Liang. Unet++: A nested u-net architecture for medical image segmentation. CoRR,
abs/1807.10165, 2018.