An Ensemble Architecture of Deep Convolutional Segnet and Unet Networks for Building Semantic Segmentation from High-resolution Aerial Images

To cite this article: Abolfazl Abdollahi, Biswajeet Pradhan & Abdullah M. Alamri (2020): An Ensemble Architecture of Deep Convolutional Segnet and Unet Networks for Building Semantic Segmentation from High-resolution Aerial Images, Geocarto International, DOI: 10.1080/10106049.2020.1856199
Department of Energy and Mineral Resources Engineering, Sejong University, Choongmu-gwan, 209 Neungdong-ro, Gwangjin-gu, Seoul 05006, Korea
4 Earth Observation Center, Institute of Climate Change, Universiti Kebangsaan Malaysia, 43600 UKM, Bangi, Selangor, Malaysia
5 Dept. of Geology & Geophysics, College of Science, King Saud Univ., P.O. Box 2455, Riyadh 11451, Saudi Arabia
*Correspondence: [email protected] or [email protected]
Abstract
Buildings are among the principal features essential for updating geospatial databases. Extracting building features from high-resolution imagery automatically and accurately is challenging because of obstacles in these images, such as shadows, trees, and cars. Although deep learning approaches have markedly improved image segmentation results in recent years, most deep neural networks still cannot produce highly accurate results with a correct segmentation map when processing high-resolution remote sensing images. Therefore, we implemented a new deep neural network named Seg-Unet, a composition of the Segnet and Unet techniques, to extract building objects from high-resolution aerial imagery. The method achieved 92.73% accuracy on the Massachusetts building dataset and improved performance by 0.44%, 1.17%, and 0.14% compared with the fully convolutional network (FCN), Segnet, and Unet methods, respectively.
1. Introduction
Highly accurate feature extraction from high-resolution remote sensing imagery produces reliable information for various applications (Shrestha and Vanneschi 2018). The extraction of small ground objects, such as buildings, from imagery of the earth's surface is one such application (Krizhevsky et al. 2012). High-precision building extraction from high-resolution satellite images plays an essential role in several applications, such as disaster management, geospatial database updating, urban planning, and navigation (Mayer 1999). Raw data should be converted into sensible information by using a geospatial information system (GIS) to enable the quantification process. This transformation often requires time-consuming and labor-intensive data interpretation and digitization. Although Yuan (2017) introduced a source called volunteered geographic information (VGI) as an alternative option, its availability is restricted because of differences in positional and completeness accuracy. Participation inequality, in terms of varying impressions, cultures, and judgments, can be the principal reason for this issue (Shrestha and Vanneschi 2018), thereby restricting the accessibility of dependable and up-to-date building maps. Automatic building extraction from remote sensing imagery needs a promising approach, which remains underdeveloped in spite of a decade of research in this field (Marcu and Leordeanu 2016). The main element that makes this process challenging is the wide variation in building appearance in images, caused by factors such as shadows, cars, building structures, various roofing materials, and illumination conditions (Yuan and Cheriyadat 2014). Traditional methods have been combined with genetic algorithms (Sumer and Turker 2013) and the support vector machine (SVM) method (Inglada 2007) to detect buildings. Other characteristics of remote sensing images, such as multi-spectral features, textures (Levitt and Aghdasi 1998), and shadow properties (Peng and Liu 2005), as well as local structures, such as corners, lines, and edges (Huertas and Nevatia 1988), have been utilized as main factors for extracting building objects. The efficiency of these approaches is restricted because their performance depends on low-level local characteristics. Thus, exploiting representative high-level features, which play a principal role in image segmentation, is favorable for distinguishing features well.
Deep learning approaches, particularly the convolutional neural network (CNN), have demonstrated that they can achieve reliable results in image classification for computer vision (He et al. 2015, Szegedy et al. 2015) and feature semantic segmentation (Vakalopoulou et al. 2015, Alshehhi et al. 2017, Abdollahi et al. 2020). The CNN model is efficient in image processing because of its capability to learn from raw images without pre-processing steps. In addition, the deep convolutional neural network (DCNN) has become a promising technique in image processing because of its ability to efficiently combine spatial and spectral features from raw input data without preprocessing (Alshehhi et al. 2017). Recent works have revealed that different kinds of CNN-based deep learning approaches, such as the deep convolutional encoder–decoder architecture and the fully convolutional network (FCN), have brought significant improvements to the remote sensing field. In terms of computational proficiency and accuracy, FCN is the most proficient approach for pixel-wise semantic segmentation. However, several problems restrict model performance in detection, leading to inadequate or redundant predictions and failure to identify numerous objects (Shrestha and Vanneschi 2018, Abdollahi et al. 2020). In the next section, previous studies on applying promising CNN methods to remote sensing image classification and building semantic segmentation are discussed.
Deep neural network features have illustrated their ability in semantic segmentation (Long et al. 2015, Chen et al. 2017), object detection (Girshick et al. 2014), and visual identification (Sharif Razavian et al. 2014, Audebert et al. 2016). Deep convolutional frameworks can be utilized in different remote sensing tasks, such as data merging (Kussul et al. 2016), image classification (Yang et al. 2018), and detection (Audebert et al. 2017). These networks have been successfully utilized to label and classify high-resolution remote sensing images (Penatti et al. 2015). Marmanis et al. (2018) introduced a deep neural network on the basis of an end-to-end trainable network (DCNN) for detecting boundaries and improving semantic image segmentation. Farabet et al. (2012) mixed conditional random fields (CRFs) with multi-scale CNNs to classify dense street scenes. Vakalopoulou et al. (2015) implemented a deep convolutional model to identify building features from high-resolution multi-spectral images. Previous works have confirmed that the results of remote sensing imagery classification cannot be decisive (Wilkinson 2005). Improving the resolution of remote sensing images is useful in the identification and detection of different features on the ground; however, such improvements make separating objects with the same spectral values more difficult, decreasing the inter-class difference and increasing the intra-class difference of objects such as cars, shadows, streets, and buildings (Paisitkriangkrai et al. 2016). That is, extracting sensible spatial features for pixel classification in building extraction has become challenging because various objects may present similar spectral classes in remote sensing images. Reliable results have recently been achieved by FCN for semantic image segmentation (Fu et al. 2017). The method can identify various object classes, including their shapes, such as trees, road objects, and building curves. The model can not only identify the structures of spatial objects but also learn how to categorize pixels and detect what they are (Audebert et al. 2016). However, the outcomes are visually degraded during image classification and segmentation when using FCN, because the model cannot detect objects with multiple borders or small objects when object boundaries are blurred (Maggiori et al. 2017). The structures of deep convolutional frameworks have been developed in certain research either by utilizing CRFs mixed with dilated convolution (Chen et al. 2014) or by appending a skip-layer structure after up-sampling to regenerate high-frequency and comprehensive image information (Marmanis et al. 2016), thereby improving the performance of semantic segmentation and the accuracy of image classification (Sherrah 2016).
Recent works have attempted to boost precision in areas such as pixel labeling, feature extraction from raw data, and image encoding, specifically for high-resolution remote sensing imagery, on the basis of deep convolutional techniques such as FCN and CNN (Volpi and Tuia 2016). However, impervious and building objects extracted from high-resolution remote sensing images are difficult to handle because of their various geometric shapes and spatial and spectral features. That is, similar objects in urban areas can have various spectral values because high-resolution remote sensing images are usually restricted to three or four channels, and these spectral characteristics may lack the capability to distinguish objects. Various objects may also present the same spectral values (e.g., roofs and roads) (Bakhtiari et al. 2017, Abdollahi et al. 2018).
Although prior scholars have presented helpful insights into different approaches that can be utilized in pixel labeling, these approaches misclassify certain pixels with the same spectral values and lack the capability to eliminate salt-and-pepper classification noise and to clearly identify object boundaries. To solve these issues, we present a new deep neural network called SegUnet, a combination of the Segnet and Unet architectures, for building object extraction from high-resolution aerial imagery. The proposed network is dedicated to restoring pixel position information and produces a high-resolution segmentation map. The model has an encoder–decoder architecture that incorporates index pooling (Segnet) and skip connections (Unet) to generate and disseminate image spatial information. As the aforementioned literature review shows, the proposed method has not been used before, and this study is the first to propose this kind of approach for the given task. The proposed approach is compared with other state-of-the-art deep learning-based techniques, such as FCN (Long et al. 2015), Segnet (Badrinarayanan et al. 2017), and Unet (Ronneberger et al. 2015), on the same dataset to demonstrate the ability of the method in building extraction. The outcomes prove that the newly proposed network is efficient in building extraction. The remainder of the paper is organized as follows. Section 2 outlines the methodology of the suggested SegUnet approach. Section 3 highlights the results and discussion. Section 4 provides the conclusion.
2. Methodology
2.1. Unet architecture
The Unet model is an elegant DCNN that can yield accurate image segmentations. The main concept of the Unet model is to supplement a typical contracting network with successive layers in which pooling operators are replaced by up-sampling operators, thereby increasing the resolution of the output. For localization, the high-resolution features of the contracting part are combined with the up-sampled output, and a successive convolution layer can then assemble an accurate outcome on the basis of this information (Long et al. 2015). One significant factor in the Unet model is the large number of feature channels in the up-sampling section, which allow the network to propagate context information to high-resolution layers. The Unet deep learning model comprises two principal sections: an expansive part (right side) and a contracting part (left side). Given that the contracting and expansive parts are symmetric, a U-shaped CNN is formed. The model utilizes only the valid part of every convolution and does not have any fully connected layers; that is, the segmentation map contains only the pixels for which the entire context is available in the input image. Therefore, an overlap-tile strategy is utilized to provide a seamless segmentation of arbitrarily large images. To extrapolate the missing context and predict pixels in the border section of images, input image mirroring is also utilized. Without this tiling strategy, applying the network to large images would be restricted by the GPU memory.
The generic framework of a Unet model begins with a contracting path that includes two repeated convolution layers of 3×3 window size, followed by a down-sampling layer of 2×2 window size. Activation function (1), which is a kind of transformation function, is used in the convolution process. Assuming that $w$ is a weight vector, $b$ is a bias vector, and $x_k(ii, jj)$ is the output of the convolution operation and the input of the activation function:

$$Z = f\Big(\sum_{k=1}^{K} x_k(ii, jj)\, w_k + b_k\Big), \quad \text{i.e.,}\quad Z = f(XW + b). \tag{1}$$

For $f(\cdot)$, activation function (2), that is, the rectified linear unit (ReLU), $f(x) = \max(0, x)$, is used in the Unet model.
Neurons do not confront the gradient vanishing issue, which arises when the gradient norm declines after sequential updates in the back-propagation process. Neurons also operate efficiently with the rectified function because it encourages sparsity in the hidden layers and prevents saturation during the learning process (Zhou et al. 2014). In each down-sampling stage, the number of feature channels is doubled. As previously mentioned, max-pooling layers are utilized to decrease the image size, the number of parameters, and the network computation. In the down-sampling process, images are sampled using their principal local correlations. This approach retains efficient information while lessening data processing and allowing the features obtained through convolution to have spatial uniformity (Maggiori et al. 2017).
An up-sampling step, followed by a 2×2 convolution that halves the number of feature channels, is used in each step of the expansive path. Two convolution layers of 3×3 kernel size, followed by the ReLU activation function and a concatenation with the correspondingly cropped feature map from the contracting path, are utilized in the expansive path. Eventually, a convolution layer of 1×1 window size and a sigmoid function (3) are utilized for mapping every 32-component feature vector to the desired number of classes (building and non-building) and for mapping the predicted values to probabilities, respectively (Hu et al. 2015). The generic framework of the Unet model is illustrated in Figure 2.

$$S(z) = \frac{1}{1 + e^{-z}}. \tag{3}$$
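To make this structure concrete, the following is a minimal Keras sketch of the Unet pattern described above. The depth, channel counts, and input size are illustrative assumptions rather than the authors' exact configuration, and same-padded convolutions are used so that no cropping of skip features is needed.

```python
# Minimal Keras sketch of the Unet pattern described above (assumed,
# illustrative depth and channel counts, not the paper's exact network).
from tensorflow.keras import layers, Model

def conv_block(x, filters):
    # Two repeated 3x3 convolutions, each followed by ReLU (Equation 2).
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return x

def build_unet(input_shape=(384, 384, 3), base_filters=32):
    inputs = layers.Input(input_shape)

    # Contracting path: 2x2 max pooling halves the resolution, and the
    # number of feature channels is doubled at each stage.
    c1 = conv_block(inputs, base_filters)
    p1 = layers.MaxPooling2D(2)(c1)
    c2 = conv_block(p1, base_filters * 2)
    p2 = layers.MaxPooling2D(2)(c2)

    b = conv_block(p2, base_filters * 4)  # bottleneck

    # Expansive path: a 2x2 up-convolution halves the channel count, and
    # the corresponding contracting-path feature map is concatenated
    # (skip connection) before two further 3x3 convolutions.
    u2 = layers.Conv2DTranspose(base_filters * 2, 2, strides=2)(b)
    c3 = conv_block(layers.Concatenate()([u2, c2]), base_filters * 2)
    u1 = layers.Conv2DTranspose(base_filters, 2, strides=2)(c3)
    c4 = conv_block(layers.Concatenate()([u1, c1]), base_filters)

    # 1x1 convolution + sigmoid (Equation 3) maps each feature vector to
    # a building / non-building probability.
    outputs = layers.Conv2D(1, 1, activation="sigmoid")(c4)
    return Model(inputs, outputs)
```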
2.2. Segnet architecture
The Segnet encoder network comprises 13 convolutional layers that correspond to the first 13 convolutional layers in the model named the VGG16 network (Simonyan and Zisserman 2014), which is outlined for feature classification. A multi-class classifier named Softmax (Equation 4) is fed into the last decoder network to generate independent classification probabilities for individual pixels. The Softmax output forms a probability distribution because its values always range within [0, 1] and add up to 1. The output of the Softmax classifier is an n-channel image of probabilities, where n presents the number of classes, x is the output vector of the model, and index i is in the range (0, …, n-1).

$$s(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}. \tag{4}$$
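As a quick numeric illustration (our sketch, not from the paper), the following snippet evaluates Equation (4) and confirms that the outputs lie in [0, 1] and sum to 1:

```python
# Numeric check of Equation (4): softmax turns an n-class score vector
# into a probability distribution.
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e / e.sum()

probs = softmax(np.array([2.0, 0.5]))
print(probs, probs.sum())  # approx. [0.8176 0.1824] 1.0
```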
To produce and batch normalize a collection of feature maps, every encoder in the encoder network implements a filter bank with a convolution. Subsequently, ReLU is utilized as an activation function, followed by max-pooling layers with a kernel size of 2×2, and the resulting output is sub-sampled by a factor of 2. Max-pooling layers are utilized to achieve translation invariance over small spatial changes in the input data. Although additional translation invariance for robust classification can be obtained by multiple max-pooling layers, a corresponding loss of spatial resolution of the feature maps occurs. Therefore, it is essential to store and capture boundary information in the encoder feature maps before sub-sampling is implemented. To up-sample the input feature maps in the decoder network, the memorized sub-sampling indices from the corresponding encoder feature maps are utilized. Dense feature maps are produced by convolving these feature maps with a trainable decoder filter bank, and a batch normalization step is then applied to each map. In the Unet model (Ronneberger et al. 2015), the whole feature map is first transferred to the corresponding decoders and then concatenated with the up-sampled decoder feature maps (using deconvolution), whereas the Segnet model reutilizes pooling indices. In addition, the Segnet model utilizes the whole weights of the pre-trained convolutional layers from the VGG network as pre-trained weights, whereas the Unet model has no max-pool 5 block and conv 5 as in the architecture of the VGG network.
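The index-pooling mechanism can be sketched with low-level TensorFlow ops as follows. Keras has no built-in index-unpooling layer, so the two helpers below (hypothetical names, our sketch rather than the authors' implementation) only illustrate the idea:

```python
# Sketch of Segnet-style max pooling with stored indices and the matching
# unpooling step (an illustration, not the authors' implementation).
import tensorflow as tf

def pool_with_indices(x):
    # Max pooling that also returns the flattened positions of the maxima;
    # the decoder reuses these indices to restore boundary locations.
    pooled, indices = tf.nn.max_pool_with_argmax(
        x, ksize=2, strides=2, padding="SAME", include_batch_in_index=True)
    return pooled, indices

def unpool_with_indices(pooled, indices, output_shape):
    # Scatter the pooled values back to their remembered positions,
    # leaving zeros elsewhere; no upsampling weights are learned, unlike
    # Unet's deconvolution. output_shape is the pre-pooling tensor shape,
    # e.g. tf.shape(x) captured in the encoder.
    flat_size = tf.reduce_prod(output_shape)
    flat = tf.scatter_nd(tf.reshape(indices, [-1, 1]),
                         tf.reshape(pooled, [-1]),
                         tf.expand_dims(flat_size, 0))
    return tf.reshape(flat, output_shape)
```

A subsequent trainable convolution then densifies the sparse unpooled map, as described above.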
2.4. Dataset
To apply the proposed SegUnet model for building extraction, the Massachusetts building dataset (Mnih 2013) is used. Given the computational restriction, the original dataset, which contains 137, 4, and 10 aerial images of 1500×1500 pixels for training, validation, and test, respectively, is divided into patches of 384×384 pixels. The total number of images used in this study is 1,564, where 1,532, 24, and 8 images are considered for training, validation, and test, respectively. Certain samples of the building dataset with various scenes are depicted in Figure 5.
Figure 5. around here
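The patch preparation can be sketched as below. The paper does not state whether patches overlap or how tile borders are handled, so the non-overlapping grid here, which drops the right and bottom remainders of a 1500×1500 tile, is only an assumption.

```python
# Sketch of splitting a large aerial tile (e.g. 1500x1500) into 384x384
# patches on a non-overlapping grid (assumed; stride/overlap not stated).
import numpy as np

def tile_image(image, patch=384):
    h, w = image.shape[:2]
    patches = []
    for top in range(0, h - patch + 1, patch):
        for left in range(0, w - patch + 1, patch):
            patches.append(image[top:top + patch, left:left + patch])
    return np.stack(patches)  # e.g. 9 patches from a 1500x1500 tile
```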
To assess the model performance for extracting building features from high-resolution aerial imagery, four accuracy metrics, namely, overall accuracy (OA) (5), F1 score (6), recall (7), and precision (8), are utilized on the basis of the confusion matrix (Ghasemkhani et al. 2020) with four main factors: false negative (FN), false positive (FP), true negative (TN), and true positive (TP). OA is specified as the number of correctly identified pixels divided by the entire number of pixels. Precision is the percentage of correctly identified pixels among all pixels identified as building, whereas recall is the percentage of correctly predicted pixels among all actual building pixels. F1 score is the combination of the recall and precision metrics (Wang et al. 2020).
$$OA = \frac{TP + TN}{TP + TN + FP + FN} \tag{5}$$

$$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall} \tag{6}$$

$$Recall = \frac{TP}{TP + FN} \tag{7}$$

$$Precision = \frac{TP}{TP + FP} \tag{8}$$
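A direct NumPy translation of Equations (5)-(8) on binary masks is sketched below; the 0.5 threshold for binarizing the sigmoid output is an assumed value.

```python
# Sketch of the metrics in Equations (5)-(8), computed from a predicted
# probability map and a binary ground-truth mask (threshold assumed).
import numpy as np

def building_metrics(pred_prob, label, thresh=0.5):
    pred = pred_prob >= thresh
    label = label.astype(bool)
    tp = np.sum(pred & label)    # building pixels predicted as building
    tn = np.sum(~pred & ~label)  # background predicted as background
    fp = np.sum(pred & ~label)   # background predicted as building
    fn = np.sum(~pred & label)   # building predicted as background
    oa = (tp + tn) / (tp + tn + fp + fn)                # Equation (5)
    recall = tp / (tp + fn)                             # Equation (7)
    precision = tp / (tp + fp)                          # Equation (8)
    f1 = 2 * precision * recall / (precision + recall)  # Equation (6)
    return oa, f1, recall, precision
```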
3. Results and discussion
3.1. Results
The parameters and framework of the proposed approach, such as the number of blocks and the size of each block, are illustrated in Figure 4. To update the parameters of the proposed model and minimize the energy function during training, an effective optimization algorithm is needed. Therefore, in our network, we utilized one of the most common optimizers, called adaptive moment estimation (Adam), to update parameters such as biases and weights and to lessen the losses. We set the learning rate of the SegUnet network to 1e-4 during training to speed up processing and achieve improved performance. In this study, the whole process of the introduced network for extracting building features from aerial imagery was performed on an Nvidia Quadro P5000 GPU with a compute capability of 6.1 and a memory of 16 GB under the Keras framework with the TensorFlow backend.
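This training setup can be reproduced with a few lines of Keras; the loss function, batch size, and epoch count below are assumptions, since only the optimizer (Adam) and learning rate (1e-4) are stated.

```python
# Sketch of the stated training configuration: Keras with the TensorFlow
# backend and the Adam optimizer at a learning rate of 1e-4. The loss,
# batch size, and epochs are assumed values, not taken from the paper.
from tensorflow.keras.optimizers import Adam

model = build_unet()  # stand-in; the paper trains the Seg-Unet of Figure 4
model.compile(optimizer=Adam(learning_rate=1e-4),
              loss="binary_crossentropy",  # assumed for a binary mask
              metrics=["accuracy"])

# train_x/train_y and val_x/val_y: 384x384 patch arrays and binary masks
# prepared as in Section 2.4 (placeholders here).
model.fit(train_x, train_y, validation_data=(val_x, val_y),
          batch_size=4, epochs=100)  # illustrative values
```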
Figure 6 depicts the results of two images obtained by the proposed approach for building extraction. The figure is presented in three columns and four rows. The first, second, and third columns represent the original image, the ground truth image, and the building segmentation results obtained by the SegUnet model, respectively, whereas the second and fourth rows show the zoomed results. Figure 6 shows that the proposed SegUnet model achieves an OA of 92.33% and 91.3% for Image 1 and Image 2, respectively, proving that the model can generally extract buildings from high-resolution aerial images accurately. However, the FN (illustrated as blue pixels) and FP (illustrated as green pixels) of the identified pixels reveal several failures of our suggested approach and several issues with the data. The proposed approach can merge tiny nearby buildings into a single joined area, which increases the FPs in the spaces between the buildings. The proposed model is also charged with FP predictions where a building exists in the original image but is absent from the label image.
3.2. Discussion
To verify the performance of the proposed SegUnet technique for extracting building objects from high-resolution remote sensing aerial imagery, we compared the method with other DCNNs. Specifically, we compared the suggested SegUnet model with the deep convolutional encoder–decoder approach called the Segnet model, the FCN technique, and the deep convolutional Unet model. By comparing the results achieved via the Segnet and Unet models with those of the proposed SegUnet model, the difference in accuracy for building extraction can be observed.
The visual outcomes of building extraction using the suggested SegUnet model and the other comparative techniques, used to evaluate the efficiency of the SegUnet approach in building extraction, are illustrated in Figure 7. The obtained outcomes demonstrate that the influence of shortcomings can be reduced to a certain degree by the proposed methods because these methods consider spatial information for semantic segmentation. However, the FCN and Segnet approaches predict additional FNs and FPs, which are depicted in blue and green, respectively. Thus, these methods cannot precisely preserve boundary information, leading to the detection of FNs and FPs and the production of a low-resolution segmentation map. The Unet model, which utilizes deconvolution layers and skip connections, can preserve boundary information with higher accuracy than the FCN and Segnet methods, thus obtaining a more correct segmentation map. By contrast, the proposed SegUnet model, which utilizes skip connections (Unet) and index pooling (Segnet), predicts fewer FNs and FPs, preserves boundary information, and produces a more correct segmentation map than the other comparative approaches.
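Under the combination just described, one decoder step of the proposed model could look like the sketch below. This is our illustration, reusing the unpool_with_indices helper sketched in Section 2.2, not the authors' published code: the feature map is first unpooled with the encoder's stored indices (Segnet) and then concatenated with the matching encoder feature map (Unet).

```python
# Sketch of one Seg-Unet decoder step: index unpooling (Segnet) followed
# by a skip-connection concatenation (Unet). Illustrative only.
import tensorflow as tf
from tensorflow.keras import layers

def segunet_decoder_step(x, pool_indices, encoder_features, filters):
    # Restore spatial positions with the remembered max-pooling indices,
    # which preserves boundary locations.
    x = unpool_with_indices(x, pool_indices,
                            output_shape=tf.shape(encoder_features))
    # Reintroduce high-resolution detail through the skip connection.
    x = layers.Concatenate()([x, encoder_features])
    # Densify the sparse unpooled map with a trainable convolution.
    return layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
```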
To test the efficiency of the introduced SegUnet approach for building extraction in comparison with other DCNNs, we present the quantitative results of the techniques in Table 1. The first eight rows of Table 1 give the quantitative accuracy of the four main metrics achieved by the comparative approaches for the eight test images, whereas the last row gives the average accuracy of the metrics. As shown in Table 1, the FCN model obtains higher accuracy for the recall factor than the other methods, although it does so by labeling many pixels as building, which also produces additional FPs. By contrast, the Unet method obtains higher accuracy for the precision and OA factors than the FCN and Segnet methods; it is the second-best approach in building extraction and can obtain a correct segmentation map. Finally, the average accuracy for the F1 score and OA factors achieved by the proposed SegUnet model is the highest, almost 0.14%, 1.17%, and 0.44% higher than that of the Unet, Segnet, and FCN approaches, respectively. These results indicate that the proposed model can improve the results and exceed other state-of-the-art techniques in building extraction from high-resolution remote sensing imagery. Figure 8 plots the clear differences between the introduced SegUnet model and the other deep learning approaches for building object segmentation. Figure 8 also illustrates that the proposed SegUnet network achieves higher accuracy for the OA factor than the other techniques.
4. Conclusion
To extract building objects from high-resolution aerial imagery, we presented in this work a new deep neural network called the SegUnet model, which is a combination of the Segnet and Unet techniques. We applied the proposed model to the Massachusetts building dataset. After training and validating the method, we utilized four accuracy metrics to assess the efficiency of the proposed technique in building extraction, which achieved a 92.73% average OA. This result indicates that the proposed model can produce a correct segmentation map and accurately extract building objects. Furthermore, we compared the visual and quantitative results of the proposed SegUnet model with those of other deep learning techniques, such as the Segnet, FCN, and Unet models, to show its effectiveness. The results confirmed that the proposed method obtained the best quantitative and visual performance and outperformed the other DCNNs in building extraction from high-resolution aerial imagery.
Author Contributions: Conceptualization, A.A. and B.P.; methodology and formal analysis, A.A.;
data curation, A.A.; writing—original draft preparation, A.A.; writing—review and editing, B.P.;
supervision, B.P.; funding, B.P. and A.A.
Funding: This research is supported by the Centre for Advanced Modelling and Geospatial
Information Systems (CAMGIS), Faculty of Engineering and IT, the University of Technology
Sydney (UTS). This research is also supported by Researchers Supporting Project (RSP) number
RSP-2020/14, King Saud University, Riyadh, Saudi Arabia.
References
1. Abdollahi, A., Bakhtiari, H.R.R. & Nejad, M.P., 2018. Investigation of svm and level set
interactive methods for road extraction from google earth images. Journal of the Indian Society
of Remote Sensing, 46 (3), 423-430.
2. Abdollahi, A., Pradhan, B., Shukla, N., Chakraborty, S. & Alamri, A., 2020. Deep learning approaches applied to remote sensing datasets for road extraction: A state-of-the-art review. Remote Sensing, 12, 1444.
3. Alshehhi, R., Marpu, P.R., Woon, W.L. & Dalla Mura, M., 2017. Simultaneous extraction of roads and buildings in remote sensing imagery with convolutional neural networks. ISPRS Journal of Photogrammetry and Remote Sensing, 130, 139-149.
4. Audebert, N., Boulch, A., Lagrange, A., Le Saux, B. & Lefevre, S., 2016. Deep learning for remote sensing. Technical Report. DOI: 10.1109/JURSE.2017.7924536.
5. Audebert, N., Boulch, A., Randrianarivo, H., Le Saux, B., Ferecatu, M., Lefèvre, S. & Marlet, R., 2017. Deep learning for urban remote sensing. Joint Urban Remote Sensing Event (JURSE), IEEE, 1-4. DOI: 10.1109/JURSE.2017.7924536.
6. Audebert, N., Le Saux, B. & Lefèvre, S., 2017. Semantic segmentation of earth observation data using multimodal and multi-scale deep networks. Asian Conference on Computer Vision, 180-196. DOI: 10.1007/978-3-319-54181-5_12.
7. Badrinarayanan, V., Kendall, A. & Cipolla, R., 2017. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39 (12), 2481-2495.
8. Bakhtiari, H.R.R., Abdollahi, A. & Rezaeian, H., 2017. Semi automatic road extraction from digital images. The Egyptian Journal of Remote Sensing and Space Science, 20 (1), 117-123. Available from: http://www.sciencedirect.com/science/article/pii/S1110982317300820.
9. Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K. & Yuille, A.L., 2014. Semantic image segmentation with deep convolutional nets and fully connected crfs. 834-848. DOI: 10.1109/TPAMI.2017.2699184.
10. Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K. & Yuille, A.L., 2017. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40 (4), 834-848.
11. Do, N.-T., Joo, S.-D., Yang, H.-J., Jung, S.T. & Kim, S.-H., 2019. Knee bone tumor
segmentation from radiographs using seg-unet with dice loss. 25th International Workshop on
Frontiers of Computer Vision (IW-FCV), Gangneung, South Korea.
12. Farabet, C., Couprie, C., Najman, L. & Lecun, Y., 2012. Learning hierarchical features for scene labeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35 (8), 1915-1929.
13. Fu, G., Liu, C., Zhou, R., Sun, T. & Zhang, Q., 2017. Classification for high resolution remote
sensing imagery using a fully convolutional network. Remote Sensing, 9 (5), 498.
14. Ghasemkhani, N., Vayghan, S.S., Abdollahi, A., Pradhan, B. & Alamri, A., 2020. Urban
development modeling using integrated fuzzy systems, ordered weighted averaging (owa), and
geospatial techniques. Sustainability, 12 (3), 809.
15. Girshick, R., Donahue, J., Darrell, T. & Malik, J., 2014. Rich feature hierarchies for accurate
object detection and semantic segmentation. Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, 580-587. Available from: https://arxiv.org/abs/1311.2524.
16. He, K., Zhang, X., Ren, S. & Sun, J., 2015. Delving deep into rectifiers: Surpassing human-level
performance on imagenet classification. Proceedings of the IEEE International Conference on
Computer Vision, 1026-1034. Available from: https://arxiv.org/abs/1502.01852.
17. Hu, F., Xia, G.-S., Hu, J. & Zhang, L., 2015. Transferring deep convolutional neural networks
for the scene classification of high-resolution remote sensing imagery. Remote Sensing, 7 (11),
14680-14707.
18. Huertas, A. & Nevatia, R., 1988. Detecting buildings in aerial images. Computer Vision, Graphics, and Image Processing, 41 (2), 131-152.
19. Inglada, J., 2007. Automatic recognition of man-made objects in high resolution optical remote sensing images by svm classification of geometric image features. ISPRS Journal of Photogrammetry and Remote Sensing, 62 (3), 236-248.
20. Krizhevsky, A., Sutskever, I. & Hinton, G.E., 2012. Imagenet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 1097-1105. DOI: 10.1145/3065386.
21. Kussul, N., Shelestov, A., Lavreniuk, M., Butko, I. & Skakun, S., 2016. Deep learning approach for large scale land cover mapping based on remote sensing data fusion. IEEE International Geoscience and Remote Sensing Symposium (IGARSS), 198-201. DOI: 10.1109/IGARSS.2016.7729043.
22. Levitt, S. & Aghdasi, F., 1998. An investigation into the use of wavelets and scaling for the extraction of buildings in aerial images. Proceedings of the 1998 South African Symposium on Communications and Signal Processing-COMSIG'98 (Cat. No. 98EX214), 133-138. DOI: 10.1109/COMSIG.1998.736936.
23. Long, J., Shelhamer, E. & Darrell, T., 2015. Fully convolutional networks for semantic segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3431-3440.
28. Mayer, H., 1999. Automatic object extraction from aerial imagery—a survey focusing on buildings. Computer Vision and Image Understanding, 74 (2), 138-149.
29. Mnih, V., 2013. Machine learning for aerial image labeling. Ph.D. Dissertation, Dept. of Computer Science, University of Toronto, Canada.
30. Paisitkriangkrai, S., Sherrah, J., Janney, P. & Van Den Hengel, A., 2016. Semantic labeling of aerial and satellite imagery. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 9 (7), 2868-2881.
31. Penatti, O.A., Nogueira, K. & Dos Santos, J.A., 2015. Do deep features generalize from
everyday objects to remote sensing and aerial scenes domains? Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition Workshops, 44-51. DOI:
10.1109/CVPRW.2015.7301382.
32. Peng, J. & Liu, Y., 2005. Model and context‐driven building extraction in dense urban aerial
images. International Journal of Remote Sensing, 26 (7), 1289-1307.
33. Ronneberger, O., Fischer, P. & Brox, T., 2015. U-net: Convolutional networks for biomedical image segmentation. International Conference on Medical Image Computing and Computer-assisted Intervention, 234-241. DOI: 10.1007/978-3-319-24574-4_28.
34. Sharif Razavian, A., Azizpour, H., Sullivan, J. & Carlsson, S., 2014. Cnn features off-the-shelf: An astounding baseline for recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 806-813. DOI: 10.1109/CVPRW.2014.131.
35. Sherrah, J., 2016. Fully convolutional networks for dense semantic labelling of high-resolution aerial imagery. Available from: https://arxiv.org/abs/1606.02585.
36. Shrestha, S. & Vanneschi, L., 2018. Improved fully convolutional network with conditional random fields for building extraction. Remote Sensing, 10 (7), 1135.
37. Simonyan, K. & Zisserman, A., 2014. Very deep convolutional networks for large-scale image
recognition. 1-14. Available from: https://arxiv.org/abs/1409.1556.
38. Sumer, E. & Turker, M., 2013. An adaptive fuzzy-genetic algorithm approach for building detection using high-resolution satellite images. Computers, Environment and Urban Systems, 39, 48-62.
39. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V. & Rabinovich, A., 2015. Going deeper with convolutions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1-9.
40. Vakalopoulou, M., Karantzalos, K., Komodakis, N. & Paragios, N., 2015. Building detection in very high resolution multispectral data with deep learning features. IEEE International Geoscience and Remote Sensing Symposium (IGARSS), IEEE, 1873-1876. DOI: 10.1109/IGARSS.2015.7326158.
41. Volpi, M. & Tuia, D., 2016. Dense semantic labeling of subdecimeter resolution images with convolutional neural networks. IEEE Transactions on Geoscience and Remote Sensing, 55 (2), 881-893.
42. Wang, S., Hou, X. & Zhao, X., 2020. Automatic building extraction from high-resolution aerial
imagery via fully convolutional encoder-decoder network with non-local block. IEEE Access, 8,
7313-7322.
43. Wilkinson, G.G., 2005. Results and implications of a study of fifteen years of satellite image classification experiments. IEEE Transactions on Geoscience and Remote Sensing, 43 (3), 433-440.
44. Yang, X., Ye, Y., Li, X., Lau, R.Y., Zhang, X. & Huang, X., 2018. Hyperspectral image classification with deep learning models. IEEE Transactions on Geoscience and Remote Sensing, 56 (9), 5408-5423.
45. Yuan, J., 2017. Learning building extraction in aerial scenes with convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40 (11), 2793-2798.
46. Yuan, J. & Cheriyadat, A.M., 2014. Learning to count buildings in diverse aerial scenes.
Proceedings of the 22nd ACM SIGSPATIAL International Conference on Advances in
Geographic Information Systems, 271-280. DOI: 10.1145/2666310.2666389.
47. Zhou, B., Lapedriza, A., Xiao, J., Torralba, A. & Oliva, A., 2014. Learning deep features for
scene recognition using places database. 27th International Conference on Neural Information
Processing Systems, Montreal, Canada, 1, 487-495.
Table 1. Quantitative results obtained by the suggested Seg-Unet approach and other deep-based neural networks.

         Metric    FCN     Segnet  Unet    Seg-Unet
Image 4  Accuracy  0.9179  0.9055  0.9227  0.9209
         Recall    0.8029  0.7848  0.7859  0.8037
         F1        0.8256  0.8192  0.8255  0.8353
Image 5  Accuracy  0.914   0.9121  0.9157  0.9196
         Recall    0.8195  0.7838  0.8154  0.8292
Figure 1. Overall methodology of the proposed Seg-Unet model for building extraction.
Figure 2. Architecture of Unet CNN.
Figure 3. Architecture of Segnet CNN.
Figure 4. Architecture of the proposed Seg-Unet model with a combination of Segnet (pooling
indices) and Unet (skip connection) networks.
Figure 5. Image samples in the Massachusetts building dataset; the first and second columns present
the main imagery and corresponding label imagery, respectively.
Figure 6. Extracted buildings achieved by the proposed Seg-Unet method. The zoomed outcomes of the first and third rows are shown in the second and fourth rows.
Figure 7. Visual comparison of the outcomes obtained via the proposed Seg-Unet model and other techniques such as Segnet, FCN, and Unet. The original imagery is presented in the first row. The outcomes obtained by the FCN, Segnet, and Unet methods are depicted in the second, third, and fourth rows, respectively. The last row shows the results of the suggested Seg-Unet model. The black (background), blue, and green colors illustrate TNs, FNs, and FPs, respectively.
Figure 8. Evaluation factors of the suggested Seg-Unet technique and other deep learning techniques
for building extraction.