Keywords Medical image segmentation, Dual-branch feature encoding sub-network, Gating mechanism,
Attention mechanism
According to the survey report by Sabina Umirzakova’s team1, in the domain of medical image analysis, achieving
high-precision image restoration remains a significant challenge. Image segmentation, in particular, is a critical
and highly challenging research direction that holds substantial practical application value. This includes tumor
image segmentation, heart image segmentation, abdominal multi-organ image segmentation, etc. In today’s
rapidly advancing computer technology era, harnessing the power of image segmentation to assist physicians in
diagnosis not only enhances diagnostic speed but also augments the accuracy of diagnostic outcomes.
With the continuous advancement of modern computing power, deep learning is constantly evolving and
innovating, particularly in the field of image segmentation where deep convolutional neural networks (CNN) have
found widespread applications. Medical image segmentation is no exception to this trend. Since the introduction
of U-Net2, numerous models based on its enhancements have emerged. This can be attributed primarily to the
typical encoder-decoder structure and skip connections in U-Net2, which facilitate the propagation of fine-
grained information and prevent information loss, thereby enhancing the accuracy of model segmentation.
Building upon these advantages, several models have been developed including AFFU-Net3, OAU-Net4, ISTD-
UNet5, MultiResU-Net6, SAU-Net7, KiU-Net8, and UCR-Net9. However, owing to inherent limitations of CNNs themselves, such as fixed convolutional kernel size and stride settings, they can only capture local features in the input data and cannot directly establish the long-range dependencies needed to process global information. To address this
issue, researchers have proposed improved network structures and methods like CAT10 with a cross-attention
mechanism and LMISA11, which exploits gradient magnitude information, to improve CNN models' ability to capture
global information. Nevertheless, when it comes to handling volumetric medical images specifically, ineffective
establishment of long-range dependencies continues to hinder CNN’s development in volumetric medical image
processing.
With the successful application of the Transformer12 architecture to the image domain by the Vision Transformer (ViT)13, which replaces the convolution operations of CNNs with self-attention mechanisms, models are now capable
of establishing long-range dependencies between pixels globally. This enables a more effective capture of global
information in images. Consequently, there has been an increasing number of researchers exploring segmentation
models based on the Transformer architecture. For instance, RT-UNet14 introduces a Skip-Transformer that
utilizes a multi-head self-attention mechanism to mitigate the impact of shallow features on overall network
performance. SWTRU15 proposes a star-shaped window self-attention mechanism that expands the attention
region further to achieve a global attention effect while maintaining computational efficiency. Zhu Zhiqin et al.16
introduced a multi-modal spatial information enhancement and boundary shape correction method, thereby
developing an end-to-end brain tumor segmentation model. This model consists of three essential components:
information extraction, spatial information enhancement, and boundary shape correction. Furthermore, the
LMIS17 framework features a streamlined core feature extraction network and establishes a multi-scale feature
interactive guidance architecture. CFNet18 presents the CFF method, which balances spatial-level effectiveness
by calculating multi-view features and optimizing semantic gap between shallow and deep features. DFTNet19
designs a dual-path feature exploration network that incorporates both shallow and deep features using a
sliding window approach to access deeper information. SDV-TUNet20 introduces an advanced sparse dynamic
volume TransUNet architecture, integrating Sparse Dynamic (SD) and Multi-Level Edge Feature Fusion (MEFF)
modules to achieve precise segmentation. Zhu Zhiqin’s team21 proposed a brain tumor segmentation method
that leverages deep semantic and edge information from multimodal MRI. This approach incorporates a Multi-
Feature Inference Block (MFIB), which utilizes graph convolution for efficient feature inference and fusion.
Additionally, there exist hybrid image segmentation techniques such as UNETR22, which transform 3D medical
images into a sequence and convert the segmentation task into a sequence prediction problem. It leverages
the Transformer12 architecture to acquire multi-scale global information and employs a U-shaped network
with encoder and decoder for pixel-level prediction. UNETR++23 introduces an efficient Enhanced Pairwise
Attention (EPA) block comprising spatial attention modules and channel attention modules. In contrast to
CBAM24, where these modules are concatenated, UNETR++23 processes them in parallel, sharing weights for
attention calculations, thereby reducing overall network parameters and significantly decreasing computational
complexity while maintaining exceptional model performance.
However, current hybrid image segmentation methods still possess certain limitations. Firstly, in the context
of multi-scale image fusion, relying solely on convolution operations for local feature extraction may result in
the loss of global information. Additionally, convolutions are insensitive to scale variations and thus ineffective
in handling objects with different scales, potentially leading to inaccurate segmentation results. Secondly, when
it comes to feature extraction based on Transformer12 architectures, the significance of spatial and channel
layers is often overlooked. While certain models do incorporate both the spatial and channel dimensions,
their utilization remains relatively simple, thus failing to fully leverage the potential of these layers. Moreover,
given the diverse types and sizes of numerous organs within the human body, the resulting model may exhibit
inadequate robustness, thereby complicating the successful completion of multi-organ segmentation tasks.
To tackle these challenges, this paper presents a 3D medical image segmentation network based on gated
attention blocks and dual-scale cross-attention mechanism. Building upon the UNETR++23 architecture, this
method introduces a dual-branch feature encoding sub-network that aggregates feature information from
both coarse and fine scales. These features are then merged through convolution to capture context-rich spatial
information more effectively. To encode feature information, an improved gated shared weighted Paired attention
(G-SWPA) block is utilized. Additionally, a gated double-scale cross attention module (G-DSCAM) is designed
for the bottleneck part of the network to facilitate dual scale feature fusion of volume images. The performance
of DS-UNETR++ is thoroughly evaluated on four public medical datasets. In particular, on the BraTS dataset,
HD95 decreased by 0.96 and DSC increased by 0.7% compared with the baseline model, reaching 4.98 and 83.19%, respectively; on the Synapse dataset, HD95 reached 6.67 and DSC reached 87.75%, a reduction of 0.86 and an improvement of 0.53% over the baseline model, respectively. These
results demonstrate that DS-UNETR++ outperforms other models in terms of segmentation effectiveness as
illustrated in Fig. 1 where visual segmentation results closely resemble ground truth labels. The contributions
made by this paper can be summarized into three main aspects:
(1) The DS-UNETR++ method is introduced, featuring a dual-branch feature encoding subnetwork. This approach uniquely segments the input image into coarse-grained and fine-grained features during preprocessing, constructing specialized encoding blocks for each scale. In medical imaging, disease or lesion areas exhibit distinct characteristics at varying scales, such as the fine structures at tumor edges and the overall tumor morphology. This design not only captures both macroscopic structures and microscopic details simultaneously but also enhances the model's capability to detect target features across different scales while preserving spatial continuity and correlation.
(2) Constructing the Gated Shared Weighted Paired Attention (G-SWPA) Block: This design incorporates parallel spatial and channel attention mechanisms, both of which are regulated by a gating mechanism. This allows the model to dynamically adjust the emphasis on spatial versus channel attention based on the characteristics of the input data. For instance, in certain image cases, channel information may be more critical, whereas in others, spatial details may hold greater importance. Consequently, the G-SWPA block enhances the efficiency and quality of feature extraction, thereby improving the accuracy and reliability of segmentation outcomes.
(3) Design of a Gated Dual-Scale Cross-Attention Module (G-DSCAM): To address feature fusion at the bottleneck layer, we propose a dimensionality-reduction cross-attention mechanism to perform self-attention calculations. Additionally, a gating mechanism is introduced to dynamically adjust the proportion of the two types of feature information. Given that boundary information is typically multi-scale, with different scales collectively determining the accurate placement of segmentation boundaries, this approach enhances sensitivity to detailed features and significantly improves segmentation boundary accuracy.

Fig. 1. The segmented images in the BraTS dataset are visualized and compared with the corresponding ground truth labels. The red area corresponds to Edema, the green area represents Non-Enhanced Tumor, and the blue area represents Enhanced Tumor.
Related work
This section primarily examines various segmentation methods for medical images. In accordance with the
main research focus of this article, it predominantly analyzes image segmentation models based on multi-scale
feature extraction. Additionally, it explores diverse attention mechanisms for integrating multi-scale context and
provides comparative explanations alongside the model proposed in this article.
Single-branch encoder networks are more lightweight, offering enhanced efficiency in both training and
inference processes. For instance, V-Net25 leverages multi-scale features to improve model perception and
employs the Dice loss function for measuring similarity between predicted segmentation results and ground
truth labels. nnUNet26 argues against overly complex model structures that can lead to overfitting, proposing
an adaptive architecture capable of extracting features at multiple scales while enhancing model generalization
through cross-validation. DeepLab27 utilizes dilated convolution operations to expand the receptive field and
introduces spatial pyramid pooling (the SPP layer) for the first time in image segmentation tasks. DeepLabv3+28
incorporates dilated convolutions at different rates for feature map extraction and further integrates local and
global features.
Multiple-branch encoders demonstrate enhanced adaptability for segmentation tasks in complex scenes. For
instance, DS-TransUnet29 employs dual-scale encoders to extract feature information across different semantic
scales and incorporates a Transformer Interactive Fusion (TIF) module, effectively establishing global dependency
relationships between features of varying scales. DHRNet30 proposes a hybrid reinforcement learning dual-
branch network, primarily consisting of a multi-scale feature extraction branch and a global context detail
enhancement branch. ESDMR-Net31 utilizes squeeze operations and dual-scale residual connections to design
an efficient lightweight network. SBCNet32 introduces both a context encoding module and a boundary
enhancement module, employing advanced multi-scale adaptive kernels for tumor identification from two
branches. MILU-Net33 designs a dual-path upsampling and multi-scale adaptive detail feature fusion module
to minimize information loss. MP-LiTS34 introduces an innovative multi-phase channel-stacked dual-attention
module integrated into a multi-scale architecture. BSCNet35 designs a dual-resolution lightweight network
capable of capturing top-level low-resolution branch’s multi-scale semantic context.
However, these networks still possess certain limitations. Firstly, the complexity and migration challenges
associated with the two-branch encoders designed for these networks remain significant. Secondly, some
encoders are only suitable for specific datasets and lack robust generalizability. In contrast, DS-UNETR++
proposed in this paper adopts a simpler and more efficient dual-branch encoder that exhibits wide applicability
and can seamlessly integrate as a plug-and-play module within various networks.
Method
Overall architecture
The overall network architecture of DS-UNETR++ is illustrated in Fig. 2, resembling the architecture of
UNETR++23. The primary structure of this study maintains a U-shaped design, comprising three main
components: encoder, bottleneck, and decoder. Initially, the input medical three-dimensional images are
categorized into coarse-grained and fine-grained feature inputs before entering the encoder section. The encoder
consists of three stages, each with two branches containing two types of encoding blocks for handling coarse-
grained and fine-grained feature inputs. Each type of encoding block comprises a downsampling layer and
two consecutive Gate-Shared Weight Pair Attention (G-SWPA) sub-modules. Following the initial encoding
in each stage, a Convolutional Fusion module (Conv-Fusion) further consolidates information from both
scales of features to capture comprehensive contextual information in adjacent spatial positions. The bottleneck
section includes a single stage processed by two types of encoding blocks for coarse-grained and fine-grained
feature inputs while introducing the Gate-Dual-Scale Cross Attention Module (G-DSCAM) for processing. In
the decoder section, a symmetric structure to the encoder is adopted with three stages, each composed of two
consecutive G-SWPA blocks and upsampling layers. To further restore image details and integrate deeper feature information, we adopt skip connections similar to those in U-Net2, followed by two convolution operations to fully recover the image size and obtain the segmentation result. The main logical relationship is shown in Fig. 3.

Fig. 2. Overall network architecture diagram. A sub-network for dual-branch feature encoding is designed to extract feature information from two scales simultaneously, effectively capturing target features at different levels. A gated shared weighted pairwise attention (G-SWPA) block is introduced, which utilizes a gating mechanism to automatically adjust the influence of spatial and channel attention on feature extraction, resulting in more effective feature information. Finally, a gated dual-scale cross-attention module (G-DSCAM) enhances the effect of feature fusion.
The subsequent sections of this paper primarily demonstrate the processing techniques employed on the
Medical Segmentation Decathlon-Heart segmentation dataset. By leveraging the image segmentation process
utilized in the Heart dataset, these techniques can be readily extended to handle other datasets.
3D patch partitioning
The input image $X \in \mathbb{R}^{H \times W \times D \times C}$ to DS-UNETR++ is a three-dimensional image obtained after preprocessing, primarily involving steps such as cropping and resampling, where H, W, D, and C respectively refer to the height, width, depth, and number of channels of the input image.
For the Heart dataset, the three-dimensional patch partition projects the input at two scales, coarse-grained and fine-grained. These are transformed into two different high-dimensional tensors, denoted $X_1 \in \mathbb{R}^{(H/P_1 \times W/P_2 \times D/P_3) \times C_1}$ and $X_2 \in \mathbb{R}^{(H/Q_1 \times W/Q_2 \times D/Q_3) \times C_1}$, where $H/P_1 \times W/P_2 \times D/P_3$ and $H/Q_1 \times W/Q_2 \times D/Q_3$ represent the output shapes of $X_1$ and $X_2$, respectively, and $P_1 \sim P_3$ and $Q_1 \sim Q_3$ denote the scaling factors applied to the different side lengths at the two scales. $C_1$ represents the sequence length of $X_1$ and $X_2$. It is worth noting that setting the sequence length to be the same for both scales facilitates subsequent fusion processing, which enhances efficiency in data integration. For different datasets, these values may require fine-tuning according to each dataset's characteristics. To limit computational complexity, the three-dimensional patch partition primarily comprises a convolutional layer and a normalization layer. In practice, the convolutional stride in the embedding block also varies according to the input patch sizes.
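As a concrete illustration, the sketch below shows how such a dual-scale 3D patch partition could be realized in PyTorch: a Conv3d whose kernel and stride equal the patch size performs the partition and linear projection in one step, followed by layer normalization of each token. The patch sizes, channel width, input resolution, and class name are illustrative assumptions, not the paper's exact settings.

```python
import torch
import torch.nn as nn

class PatchEmbed3D(nn.Module):
    """Project a 3D volume into a sequence of patch tokens (illustrative sketch)."""
    def __init__(self, in_ch=1, embed_dim=32, patch_size=(4, 4, 4)):
        super().__init__()
        # non-overlapping partition + linear projection in a single strided conv
        self.proj = nn.Conv3d(in_ch, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):                      # x: (B, C, H, W, D)
        x = self.proj(x)                       # (B, E, H/P1, W/P2, D/P3)
        B, E, h, w, d = x.shape
        x = x.flatten(2).transpose(1, 2)       # (B, N, E) token sequence
        return self.norm(x), (h, w, d)

# Dual-scale partition: two embeddings with different patch sizes but the same
# embedding dimension, so the two token streams can later be fused.
coarse_embed = PatchEmbed3D(in_ch=1, embed_dim=32, patch_size=(4, 4, 4))
fine_embed   = PatchEmbed3D(in_ch=1, embed_dim=32, patch_size=(2, 2, 2))

vol = torch.randn(1, 1, 128, 128, 64)          # e.g. a preprocessed cardiac MRI crop
x1, shape1 = coarse_embed(vol)                 # coarse-grained tokens
x2, shape2 = fine_embed(vol)                   # fine-grained tokens
```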
Encoder
The encoder is mainly divided into three stages, each consisting of two encoding blocks and one feature fusion
block. The two encoding blocks receive the high-dimensional tensors $X_1$ and $X_2$ of different scales from the three-dimensional patch partition.

Fig. 3. Logical diagram. It primarily demonstrates the feature extraction and fusion logic of the encoder component, as well as the connectivity established through skip connections in the decoder component.

The downsampling layers inside the encoding blocks are used to capture the
hierarchical object characteristics of different scales. Then, through the two gate-shared-weight paired attention
(G-SWPA) submodules inside the encoding blocks, long-range dependency relationships between patches are
captured. After the two encoding blocks undergo the above operations, the high-dimensional tensors X1 and
X2 complete the preliminary acquisition of long-range dependency relationships and are entangled with the
initial high-dimensional tensors. Then, the high-dimensional tensors X1 and X2 are output to the convolution
fusion block (Conv-Fusion). The Conv-Fusion block downsamples the feature information of one scale, then
concatenates the two kinds of feature information, and finally completes the fusion of the two-scale feature
information through two convolution operations, outputting it to the decoder stage.
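A minimal PyTorch sketch of such a Conv-Fusion block is given below (the Downsample/Cat/Conv description later in this section follows the same pattern). The pooling ratio, channel widths, and module name are assumptions; the actual scaling factors are dataset-dependent.

```python
import torch
import torch.nn as nn

class ConvFusion(nn.Module):
    """Sketch of a Conv-Fusion block: pool the larger-resolution branch to match
    the other, concatenate along channels, then fuse with two 3x3x3 convolutions
    (Conv_3 and Conv_4 in the text's notation)."""
    def __init__(self, ch_a, ch_b, out_ch, pool=(2, 2, 2)):
        super().__init__()
        self.down = nn.MaxPool3d(kernel_size=pool, stride=pool)   # Downsample
        self.conv3 = nn.Conv3d(ch_a + ch_b, out_ch, kernel_size=3, padding=1)
        self.conv4 = nn.Conv3d(out_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, f_large, f_small):
        # f_large: (B, ch_a, 2H, 2W, 2D) higher-resolution branch
        # f_small: (B, ch_b, H, W, D) lower-resolution branch
        fused = torch.cat([self.down(f_large), f_small], dim=1)   # Cat
        return self.conv4(self.conv3(fused))                      # two-conv fusion

# Example with assumed shapes for the two branches of one encoder stage.
fusion = ConvFusion(ch_a=32, ch_b=32, out_ch=64)
out = fusion(torch.randn(1, 32, 32, 32, 16), torch.randn(1, 32, 16, 16, 8))
```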
Where $X_s$ and $X_c$ represent the spatial and channel attention maps, respectively; SAM and CAM represent the spatial attention module and the channel attention module, respectively; $Q_{SW}$ and $K_{SW}$ represent the matrices of shared queries and shared keys, respectively; $V_{Spatial}$ and $V_{Channel}$ represent the matrices of the spatial and channel value layers, respectively; $G_{SA}$ indicates the gating parameter for spatial attention, and $G_{CA}$ indicates the gating parameter for channel attention.
It is worth noting that the EPA23 module effectively employs parallel processing of spatial attention and
channel attention, utilizing shared parameters to significantly reduce parameter count. However, in the output
stage of this module, the results of spatial attention and channel attention are not weighted but rather added and
merged together. This operational approach hinders the model’s ability to discern which component contributes
more prominently to the final outcome. To address this limitation, gating parameters $G_{SA}$ and $G_{CA}$ are introduced at the output stages of spatial attention and channel attention, respectively, providing a weighting mechanism for their results. Both $G_{SA}$ and $G_{CA}$ are learnable parameters that are automatically created with the appropriate size during computation, with gradients calculated during backpropagation for parameter updates. The gating mechanism assigns a higher weight to whichever branch performs better at feature extraction and a correspondingly lower weight to the other. Through this process, the model learns the relative contributions of spatial attention and channel attention, thereby achieving a learned weighting of the two. A detailed explanation of both attention calculation processes, along with the gating mechanism, is provided below.

Fig. 4. G-SWPA network architecture. The parallel integration of spatial attention and channel attention is achieved through a gating mechanism, which dynamically regulates the impact of both types of attention on feature extraction to enhance the effectiveness of the extracted feature information. Both $G_{SA}$ and $G_{CA}$ are learnable parameters that are automatically generated with the appropriate size during computation, with gradients computed during backpropagation for parameter updates.
Where $W_Q$, $W_K$, and $W_{V_{Spatial}}$ are the projection weight matrices of $Q_{SW}$, $K_{SW}$, and $V_{Spatial}$, respectively.
The spatial attention is computed by first projecting $K_{SW}$ and $V_{Spatial}$ onto $m$ dimensions to obtain $K_{Proj}$ and $V_{Proj}$. $K_{Proj}$ is then transposed and multiplied with $Q_{SW}$, Softmax is applied, and the result is multiplied with $V_{Proj}$ to derive the spatial attention map $\tilde{X}_s$. Finally, multiplication with $G_{SA}$ yields the ultimate spatial attention map $X_s$. The specific calculation formula is as follows:

$\tilde{X}_s = \mathrm{Softmax}\left(\frac{Q_{SW} \times K_{Proj}^{T}}{\sqrt{d}}\right) \times V_{Proj}, \qquad X_s = \tilde{X}_s \times G_{SA}$  (4)

Where $Q_{SW}$, $K_{Proj}$, and $V_{Proj}$ represent the shared query matrix, the projected shared key matrix, and the projected spatial value-layer matrix, respectively, $d$ is the size of each vector, and $G_{SA}$ represents the spatial attention gating parameter.
Where $Q_{SW}$, $K_{SW}$, and $V_{Channel}$ represent the shared query, shared key, and channel value-layer matrices, respectively, $d$ is the size of each vector, and $G_{CA}$ represents the channel attention gating parameter.
After calculating the spatial attention and channel attention, it is essential to combine the results of both
attentions. Since the final results of these calculations have identical shapes, this paper employs a direct addition
method for integration. This approach effectively demonstrates the role of gating parameters set in this study
and ensures that the final output attention results are not weakened to any extent. Once fusion is completed, the
original input is merged based on the ResNet45 structure. Subsequently, two convolutional layers are applied to
transform the merged result again, facilitating a more comprehensive entanglement of both types of attention
information. The specific calculation formula is as follows:
$\tilde{X} = X_s + X_c + X, \qquad \hat{X} = \mathrm{Conv}_2\big(\mathrm{Conv}_1(\tilde{X})\big) + \tilde{X}$  (7)

Where $X_s$, $X_c$, and $X$ represent the spatial attention map, the channel attention map, and the input feature map, respectively, and $\mathrm{Conv}_1$ and $\mathrm{Conv}_2$ are $3 \times 3 \times 3$ and $1 \times 1 \times 1$ convolution blocks, respectively.
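To make the gating idea concrete, the following PyTorch sketch outlines a G-SWPA-style block: spatial and channel attention share the query/key projections, the spatial branch attends over projected keys and values in the spirit of Eq. (4), and each branch output is scaled by a learnable scalar gate before being added to the input as in Eq. (7). The projection size, the omission of the 3×3×3/1×1×1 refinement convolutions, and all module names are simplifying assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GSWPA(nn.Module):
    """Sketch of a gated shared-weight paired attention block."""
    def __init__(self, dim, num_tokens, proj_tokens=64):
        super().__init__()
        self.q = nn.Linear(dim, dim)            # shared query weights W_Q
        self.k = nn.Linear(dim, dim)            # shared key weights W_K
        self.v_spatial = nn.Linear(dim, dim)    # W_{V-Spatial}
        self.v_channel = nn.Linear(dim, dim)    # W_{V-Channel}
        # reduce K and V_spatial from N tokens to m tokens for spatial attention
        self.k_proj = nn.Linear(num_tokens, proj_tokens)
        self.v_proj = nn.Linear(num_tokens, proj_tokens)
        self.gate_sa = nn.Parameter(torch.ones(1))   # G_SA
        self.gate_ca = nn.Parameter(torch.ones(1))   # G_CA

    def forward(self, x):                       # x: (B, N, C), N == num_tokens
        B, N, C = x.shape
        q, k = self.q(x), self.k(x)
        v_s, v_c = self.v_spatial(x), self.v_channel(x)

        # spatial attention, Eq. (4): attend from N queries to m projected tokens
        k_p = self.k_proj(k.transpose(1, 2)).transpose(1, 2)    # (B, m, C)
        v_p = self.v_proj(v_s.transpose(1, 2)).transpose(1, 2)  # (B, m, C)
        attn_s = F.softmax(q @ k_p.transpose(1, 2) / C ** 0.5, dim=-1)
        x_s = (attn_s @ v_p) * self.gate_sa                     # gated spatial map

        # channel attention: attend over the C x C channel similarity matrix
        attn_c = F.softmax(q.transpose(1, 2) @ k / C ** 0.5, dim=-1)
        x_c = (attn_c @ v_c.transpose(1, 2)).transpose(1, 2) * self.gate_ca

        # fusion, Eq. (7): add both gated maps back to the input
        return x + x_s + x_c

block = GSWPA(dim=32, num_tokens=4096)
out = block(torch.randn(1, 4096, 32))           # (1, 4096, 32)
```

Because gate_sa and gate_ca are ordinary nn.Parameter tensors, they receive gradients during backpropagation and are updated automatically, which is how the block learns the relative weighting of the two attention branches described above.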
Where Downsample denotes the downsampling layer, in which max pooling is used to scale along the width, height, and depth directions (the specific scaling ratio is determined by the dataset), Cat denotes the concatenation operation, and $\mathrm{Conv}_3$ and $\mathrm{Conv}_4$ are two $3 \times 3 \times 3$ convolution blocks.
thereby achieving a more effective long-term dependency relationship. To regulate the impact of these two types
of feature information on the final output after self-attention calculation, two gated valves are designed in this
study. Similar to those in G-SWPA, these gated valves automatically adjust the proportion of each type of feature
information to enhance model generalization. Furthermore, inspired by ResNet45, this paper ensures that newly
obtained features are at least as good as the original ones by adding them back to the input after completing
self-attention calculation, thus preserving the validity of feature information. Finally, the two features of the self-
attention calculation are spliced and output after two convolution layers.
The specific architecture of G-DSCAM is illustrated in Fig. 5. The proposed G-DSCAM in this paper facilitates
the interaction between two branches with different scales. Subsequently, this paper elaborates on the interactive
operation of the coarse-grained feature information branch as an exemplification, which can also be applied
to the fine-grained feature information branch. During actual operation, both branches execute operations
simultaneously. In this study, we denote the input fine-grained feature information as $F_1 \in \mathbb{R}^{(H/2 \times W/2 \times D/2) \times C}$ and the coarse-grained feature information as $F_2 \in \mathbb{R}^{(H \times W \times D) \times C}$. Initially, $F_1$ is average-pooled and then flattened to obtain a vector $\tilde{F}_1 \in \mathbb{R}^{1 \times 1 \times 1 \times C}$ that represents the entire fine-grained feature information. The specific calculation formula is presented below:

$\tilde{F}_1 = \mathrm{Flatten}\big(\mathrm{Avgpool}(F_1)\big)$  (9)

Where Avgpool refers to the average pooling operation, Flatten refers to the flattening operation, and the resulting $\tilde{F}_1$ is a global abstract representation of $F_1$ that will interact with $F_2$ at the pixel level.
Next, $F_2$ is flattened from a 3D structure into a 1D sequence. $\tilde{F}_1$ is then concatenated onto this sequence, and the result is fed into the Transformer for self-attention calculation. The output is multiplied by the gating parameter $G_2$ designed in this paper, yielding a new $\tilde{F}_2 \in \mathbb{R}^{(1 + H \times W \times D) \times C}$, from which the $\tilde{F}_1$ token is removed after the calculation to obtain the final $\tilde{F}_2$. $\tilde{F}_2$ is then restored to its original 3D structure and added back to the original input feature, producing the interacted feature $\hat{F}_2$. The specific calculation formula is as follows:

$\tilde{F}_2 = \mathrm{Transformer}\big(\mathrm{Cat}(\tilde{F}_1, \mathrm{Flatten}(F_2))\big) \times G_2, \qquad \hat{F}_2 = \mathrm{Reshape}(\tilde{F}_2) + F_2$  (10)

Where $G_2$ refers to the gating parameter of the coarse-grained branch, Flatten refers to the flattening operation, Cat refers to the concatenation operation, Transformer refers to the self-attention calculation, Reshape restores the feature information from one dimension back to three dimensions, $F_2$ is the original input, $\tilde{F}_2$ is the intermediate variable after the self-attention calculation, and $\hat{F}_2$ is the final output. After the aforementioned
sequence of operations, coarse-grained features are able to acquire information from fine-grained branches, and
vice versa. Consequently, the G-DSCAM module proposed in this study facilitates effective fusion of multi-scale
branch features, thereby achieving enhanced segmentation performance. The proposed gating parameters G1
and G2 , similar to the aforementioned gating mechanism, are learnable parameters that can be automatically
updated during backpropagation. It is worth noting that if one party demonstrates superior feature extraction
performance, the gating mechanism will assign a higher weight to it while assigning a lower weight to the other
party, thereby achieving weight measurement.
Finally, the coarse-grained feature information is downsampled to match the size of the fine-grained feature
information. The two types of features are concatenated and passed through two convolutional layers for final
fusion. The specific calculation process follows the Conv-Fusion method described, which will not be reiterated
here.
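A minimal sketch of one direction of this interaction (the coarse-grained branch receiving the fine-grained summary token) is given below; the symmetric direction is obtained by swapping the roles of the two inputs. The use of nn.TransformerEncoderLayer as the "Transformer" in Eq. (10), as well as the layer sizes and class name, are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class GDSCAMBranch(nn.Module):
    """One direction of a gated dual-scale cross-attention interaction (Eqs. 9-10):
    a global token summarising the other branch is prepended to the flattened
    feature map, a Transformer layer mixes them, the result is scaled by a
    learnable gate and added back residually."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.summarise = nn.AdaptiveAvgPool3d(1)              # Avgpool to 1x1x1
        self.mixer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True)   # "Transformer" in Eq. (10)
        self.gate = nn.Parameter(torch.ones(1))               # G_2

    def forward(self, f_other, f_main):
        # f_other, f_main: (B, C, H, W, D) features of the two scales
        B, C, H, W, D = f_main.shape
        token = self.summarise(f_other).flatten(2).transpose(1, 2)  # (B, 1, C), Eq. (9)
        seq = f_main.flatten(2).transpose(1, 2)                     # (B, HWD, C)
        mixed = self.mixer(torch.cat([token, seq], dim=1)) * self.gate
        mixed = mixed[:, 1:, :]                                     # drop the summary token
        out = mixed.transpose(1, 2).reshape(B, C, H, W, D)          # back to 3D
        return out + f_main                                         # residual add, Eq. (10)

branch = GDSCAMBranch(dim=64)
fine = torch.randn(1, 64, 8, 8, 4)       # fine-grained bottleneck features (assumed shapes)
coarse = torch.randn(1, 64, 16, 16, 8)   # coarse-grained bottleneck features
updated_coarse = branch(f_other=fine, f_main=coarse)
```

In the full module, both branches run this interaction in parallel with their own gates ($G_1$ and $G_2$), after which the coarse-grained output is downsampled, concatenated with the fine-grained output, and fused by two convolutions as in the Conv-Fusion step described above.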
Fig. 5. G-DSCAM network architecture. By reducing the dimensionality of coarse-grained and fine-grained
features, self-attention calculation is performed, while utilizing two gates to dynamically adjust the relative
proportions of these feature types. Both G1 and G2 serve as learnable parameters that can be automatically
generated based on current size during calculations, with gradients computed during backpropagation for
parameter updates.
Decoder
In the decoder part, this paper adopts a symmetric structure with the encoder, which also consists of three
stages. Each stage comprises a decoding block composed of an upsampling layer and a G-SWPA submodule. Unlike the downsampling layers, the upsampling layers employ deconvolution to progressively restore
low-resolution features to high resolution. Additionally, skip connections are utilized to facilitate cross-level
information transfer for preserving more details from low-level features, mitigating gradient vanishing issues and
addressing sample imbalance problems. This approach aims to enhance model performance and generalization
ability by promoting better preservation of fine-grained segmentation details. Following the incorporation of
skip connections and upsampling operations, the G-SWPA submodule is employed again in order to extract
feature information while effectively integrating it with that from the encoder part.
Finally, to further integrate the high-resolution information from the original input with the semantic
information extracted by the decoder, this paper employs a direct convolution of the original input and
subsequently concatenates it with the output of the decoder. This is followed by two convolution operations to
generate the final mask prediction.
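A hedged sketch of this final prediction step is shown below; the channel widths, kernel sizes, and module name are illustrative assumptions rather than the exact configuration used in the paper.

```python
import torch
import torch.nn as nn

class OutputHead(nn.Module):
    """Sketch of the final prediction step: the raw input volume is convolved,
    concatenated with the decoder output, and two further convolutions produce
    the segmentation logits."""
    def __init__(self, in_ch, dec_ch, num_classes, hidden=16):
        super().__init__()
        self.stem = nn.Conv3d(in_ch, hidden, kernel_size=3, padding=1)
        self.fuse = nn.Sequential(
            nn.Conv3d(hidden + dec_ch, hidden, kernel_size=3, padding=1),
            nn.Conv3d(hidden, num_classes, kernel_size=1),
        )

    def forward(self, raw, decoded):
        return self.fuse(torch.cat([self.stem(raw), decoded], dim=1))

head = OutputHead(in_ch=1, dec_ch=16, num_classes=2)
logits = head(torch.randn(1, 1, 64, 64, 32), torch.randn(1, 16, 64, 64, 32))
```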
Loss function
The loss function in this paper combines the soft dice loss25 and the cross entropy loss, thereby integrating both
local pixel-level similarity and global category information. By incorporating these two loss functions, the model
can effectively leverage contextual information surrounding each pixel for accurate classification and precise
assignment of category labels. The specific calculation method is as follows:
$L(G, P) = 1 - \frac{2}{J}\sum_{j=1}^{J}\frac{\sum_{i=1}^{I} G_{i,j} P_{i,j}}{\sum_{i=1}^{I} G_{i,j}^{2} + \sum_{i=1}^{I} P_{i,j}^{2}} - \frac{1}{I}\sum_{i=1}^{I}\sum_{j=1}^{J} G_{i,j} \log P_{i,j}$  (11)
Where G refers to the set of ground-truth results, P refers to the set of predicted results, $P_{i,j}$ and $G_{i,j}$ denote the predicted probability and the one-hot encoded ground-truth value of class j at voxel i, respectively, I is the number of voxels, and J is the number of classes.
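A straightforward PyTorch realization of this combined loss, assuming softmax probabilities and one-hot encoded labels, might look as follows; the small epsilon is a numerical-stability addition not present in Eq. (11).

```python
import torch
import torch.nn.functional as F

def dice_ce_loss(logits, target, eps=1e-5):
    """Soft Dice + cross-entropy loss in the spirit of Eq. (11).

    logits: (B, J, H, W, D) raw network outputs for J classes.
    target: (B, H, W, D) integer class labels (torch.long).
    """
    probs = torch.softmax(logits, dim=1)                      # P_{i,j}
    onehot = F.one_hot(target, num_classes=logits.shape[1])   # G_{i,j}
    onehot = onehot.permute(0, 4, 1, 2, 3).float()

    voxel_dims = (0, 2, 3, 4)                                 # sum over voxels i
    inter = (probs * onehot).sum(voxel_dims)
    denom = (probs ** 2).sum(voxel_dims) + (onehot ** 2).sum(voxel_dims)
    soft_dice = 1.0 - (2.0 * inter / (denom + eps)).mean()    # average over classes j

    ce = F.cross_entropy(logits, target)                      # -(1/I) sum_i sum_j G log P
    return soft_dice + ce

loss = dice_ce_loss(torch.randn(1, 3, 32, 32, 16), torch.randint(0, 3, (1, 32, 32, 16)))
```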
Experiments
In order to validate the performance of the DS-UNETR++ model proposed in this paper, a comparative analysis
is conducted with convolution-based image segmentation models including U-Net2, nnUNet26, 3D UX-Net43,
as well as Transformer-based image segmentation models such as SETR47, TransUNet42, TransBTS48, CoTr49,
UNETR22, UNETR++23, IMS2Trans50, Hsienchih Ting’s Model51, Swin-UNet37, MISSFormer52, Swin-UNETR44,
LeVit-UNet53, PCCTrans54, Guanghui Fu’s Model55, MedNeXt56 and nnFormer39. The experiments are carried
out on four diverse datasets, namely Brain Tumor Segmentation Task (BraTS), Multi-organ CT Segmentation
Task (Synapse), automatic cardiac Diagnostic Segmentation Task (ACDC), and Medical Image Segmentation
Decathlon (MSD) Heart Segmentation task (Heart). The datasets used are briefly introduced in this section,
followed by an explanation of the experimental methodology employed in this paper, and finally an outline of
the metrics adopted.
Dataset introduction
Brain tumor segmentation dataset (BraTS)57
The dataset comprises 484 MRI images, each containing four modalities: FLAIR, T1w, T1gd, and T2w. There are
three label categories: edema (ED) as label 1, non-enhanced tumor (NET) as label 2, and enhanced tumor (ET)
as label 3. The competition necessitates region-based segmentation with the identification of three sub-regions:
whole tumor (WT), which includes labels 1, 2, and 3; tumor core (TC), comprising labels 2 and 3; and enhanced
tumor (ET), consisting of label 3 exclusively. For data partitioning in this study, the dataset is divided into a training set and a test set at a ratio of 80% to 20%, respectively. The evaluation metrics employed are the Dice similarity coefficient (DSC) and the 95th percentile Hausdorff distance (HD95).
Dataset name  Sample quantity (examples)  Split categories (types)  Image modality (type)  Training set (examples)  Evaluation indicators: HD95  Evaluation indicators: DSC
BraTS  484  3  4  387  ✓  ✓
Synapse  30  8  1  18  ✓  ✓
ACDC  150  3  1  80  –  ✓
Heart  30  1  1  20  –  ✓
Table 1. Main indicators of the four types of data sets. There are six categories in total, and the detailed data
for each category is shown in the following table.
System environment  GPU  PyTorch  Learning rate  Weight decay  Loss function
Ubuntu 20.04  NVIDIA RTX 4090 (24 GB)  torch 1.11.0 + cu113  0.01  3e-5  Dice + CE Loss
Table 2. Main experimental environment settings. There are six categories in total, and the detailed data for
each category is shown in the following table.
The indicators of the above four data sets are elucidated in Table 1, aiming to facilitate the visualization of the
primary indicator data for each dataset.
Evaluation indicators
The paper introduces two primary evaluation indices: the Dice Similarity Coefficient (DSC) and the 95th percentile Hausdorff Distance (HD95). HD95 is employed to quantify the dissimilarity between two sets, with smaller values indicating higher similarity; the 95th percentile is chosen to exclude the outliers constituting the top 5% of distances. HD95 can be computed as follows:
$\mathrm{HD95}(G', P') = \max(d_{G'P'}, d_{P'G'}) = \max\Big\{\max_{g' \in G'}\min_{p' \in P'} \|g' - p'\|,\; \max_{p' \in P'}\min_{g' \in G'} \|p' - g'\|\Big\}$  (12)
Where $G'$ and $P'$ represent the true label set and the predicted label set over all voxels, respectively; $d_{G'P'}$ is the 95th percentile of the directed Hausdorff distance from the true set to the predicted set, and $d_{P'G'}$ is the 95th percentile of the directed Hausdorff distance from the predicted set to the true set; $g'$ is a single true voxel, $p'$ is a single predicted voxel, and $\|g' - p'\|$ is the distance between them.
The Dice similarity coefficient is primarily utilized for quantifying the similarity between two sets, with a
value range of [0, 1]. The higher the value, the greater the resemblance between the two sets. The calculation
formula for DSC is as follows:
$\mathrm{DSC}(G, P) = \frac{2 \times |G \cap P|}{|G| + |P|} = \frac{2\sum_{i=1}^{I} G_i P_i}{\sum_{i=1}^{I} G_i + \sum_{i=1}^{I} P_i}$  (13)

Where G refers to the set of ground-truth results, P refers to the set of predicted results, $G_i$ and $P_i$ represent the true and predicted values of voxel i, respectively, and I is the number of voxels.
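For reference, both metrics can be computed along the lines sketched below. Note that this sketch evaluates HD95 over all foreground voxels rather than surface voxels only, which is a simplification of common practice, and that dedicated evaluation libraries are typically used in practice.

```python
import numpy as np
from scipy.spatial import cKDTree

def dice_coefficient(gt, pred):
    """Dice similarity coefficient between two binary masks, as in Eq. (13)."""
    gt, pred = gt.astype(bool), pred.astype(bool)
    denom = gt.sum() + pred.sum()
    return 2.0 * np.logical_and(gt, pred).sum() / denom if denom else 1.0

def hd95(gt, pred, spacing=(1.0, 1.0, 1.0)):
    """95th-percentile Hausdorff distance, as in Eq. (12); both masks assumed non-empty."""
    g_pts = np.argwhere(gt) * np.asarray(spacing)    # true foreground coordinates
    p_pts = np.argwhere(pred) * np.asarray(spacing)  # predicted foreground coordinates
    d_gp = cKDTree(p_pts).query(g_pts)[0]            # nearest predicted point per true voxel
    d_pg = cKDTree(g_pts).query(p_pts)[0]            # nearest true point per predicted voxel
    return max(np.percentile(d_gp, 95), np.percentile(d_pg, 95))
```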
Results
In this paper, the experimental results of four datasets are summarized and analyzed. Subsequently, a
comparative analysis was conducted on various convolution-based image segmentation models including
U-Net2, nnUNet26, 3D UX-Net43, as well as Transformer-based image segmentation models such as SETR47,
TransUNet42, TransBTS48, CoTr49, UNETR22, UNETR++23, IMS2Trans50, Hsienchih Ting’s Model51, Swin-
UNet37, MISSFormer52, Swin-UNETR44, LeVit-UNet53, PCCTrans54, Guanghui Fu’s Model55, MedNeXt56 and
nnFormer39. Ablation experiments were then performed to further validate the effectiveness of the enhanced
module proposed in this paper. Finally, visual segmentation results were presented to intuitively demonstrate
the model’s performance.
Synapse dataset
Table 4 presents a comparative analysis of various models based on the DSC and the HD95 metrics on the
Synapse dataset, including the overall average performance across eight segmentation categories. Due to the
diversity of segmentation categories, different models exhibit varying degrees of effectiveness across these
categories. Notably, UNETR++23 demonstrates superior performance across multiple categories, achieving the
highest average DSC and HD95 values, recorded at 87.22% and 7.53, respectively. In comparison, DS-UNETR++
surpasses these results, attaining an average DSC of 87.75%, which is a 0.53% improvement over the baseline
model, and an average HD95 of 6.67, indicating a 0.86 reduction compared to the baseline. Furthermore, in
the evaluation of specific category metrics, DS-UNETR++ excels in six categories: right kidney, left kidney,
gallbladder, liver, aorta, and pancreas. Particularly, the segmentation performance for the right kidney and
pancreas shows significant improvement, with increases of 4.87% and 1.05%, respectively, over the baseline
model. These findings underscore the enhanced segmentation capabilities of the DS-UNETR++ model on the
Synapse dataset, highlighting its superiority and robustness in medical image segmentation tasks.
ACDC dataset
Table 5 presents the DSC index results of various models on the ACDC dataset and calculates their final
average values for three categories. The data in the table indicates that each model performs well overall, with
good segmentation performance across all categories. Notably, nnFormer39 and UNETR++23 stand out with
UNETR++23 achieving an impressive average DSC of 92.83%. DS-UNETR++, however, surpasses both models
with an average DSC of 93.03%, which is 0.2% higher than the baseline model’s score. In terms of specific
category segmentation, DS-UNETR++ improves right ventricular segmentation by 0.34% compared to the
baseline model, demonstrating more effective performance in this area. Figure 6 compares training curves
between UNETR++23 and DS-UNETR++ using ACDC data set (Figure (a) and Figure (b), respectively). As
shown in the figure, DS-UNETR++ exhibits a more stable training process with relatively small fluctuations after
approximately 500 rounds.
Heart dataset
The results of DSC indexes for various models on the Heart dataset are presented in Table 6, along with the
calculation of the final average per category. It is evident from the data that nnUNet26 and UNETR++23 exhibit
excellent performance, achieving DSC indexes of 93.87% and 94.41%, respectively. However, DS-UNETR++
outperforms them all with a remarkable DSC index of 94.51%.
WT ET TC Average
Methods HD95 DSC HD95 DSC HD95 DSC HD95 DSC
SETR NUP47 14.419 69.70 11.72 54.40 15.19 66.90 13.78 63.70
SETR PUP47 15.245 69.60 11.76 54.90 15.023 67.00 14.01 63.80
SETR MLA47 15.503 69.80 10.24 55.40 14.72 66.50 13.49 63.90
TransUNet42 14.03 70.60 10.42 54.20 14.50 68.40 12.98 64.40
TransBTS48 10.03 77.90 9.97 57.40 8.95 73.50 9.65 69.60
CoTr w/o CNN encoder49 11.49 71.20 9.59 52.30 12.58 69.80 11.22 64.40
CoTr49 9.20 74.60 9.45 55.70 10.45 74.80 9.70 68.30
UNETR22 8.27 78.90 9.35 58.50 8.85 76.10 8.82 71.10
UNETR++23 4.76 91.07 4.86 78.31 8.19 78.08 5.94 82.49
IMS2Trans50 – 86.57 – 58.28 – 75.67 – 73.51
Hsienchih Ting’s Model51 – 87.24 – 62.47 – 77.90 – 75.87
Our model 4.07 91.68 4.08 78.58 6.80 79.30 4.98 83.19
Table 3. The analysis of the BraTS validation dataset presents the results based on the evaluation indexes
DSC and HD95. Here, WT, ET, and TC respectively represent whole tumor, enhanced tumor, and tumor core.
The reported results represent the average performance across all samples in the validation set, with the best
outcomes highlighted in bold.
Average
Methods Spl RKid LKid Gal Liv Sto Aor Pan HD95 DSC
U-Net2 86.67 68.60 77.77 69.72 93.43 75.58 89.07 53.98 – 76.85
TransUNet42 85.08 77.02 81.87 63.16 94.08 75.62 87.23 55.86 31.69 77.49
Swin-UNet37 90.66 79.61 83.28 66.53 94.29 76.60 85.47 56.58 21.55 79.13
UNETR22 85.00 84.52 85.60 56.30 94.57 70.46 89.80 60.47 18.59 78.35
MISSFormer52 91.92 82.00 85.21 68.65 94.41 80.81 86.99 65.67 18.20 81.96
Swin-UNETR44 95.37 86.26 86.99 66.54 95.72 77.01 91.12 68.80 10.55 83.48
UNETR++23 95.77 87.18 87.54 71.25 96.42 86.01 92.52 81.10 7.53 87.22
LeVit-UNet53 88.86 80.25 84.61 62.23 93.11 72.76 87.33 59.07 16.84 78.53
PCCTrans54 88.84 82.64 85.49 68.79 93.45 71.88 86.59 66.31 17.10 80.50
Our model 95.68 92.05 87.63 71.51 96.65 83.52 92.85 82.15 6.67 87.75
Table 4. The results analysis of the synapse validation dataset is presented based on the evaluation indexes of
DSC and HD95. The evaluated organs include spleen (Spl), right kidney (RKid), left kidney (LKid), gallbladder (Gal), liver (Liv), stomach (Sto), aorta (Aor), and pancreas (Pan). The reported results represent the average
performance across all samples in the validation set, with the best outcomes highlighted in bold and the next
best results underlined.
Table 5. Presents the results analysis of the ACDC validation dataset based on DSC evaluation indexes.
The abbreviations for each category are as follows: RV (right ventricle), Myo (myocardium), and LV (left
ventricle). The reported results represent the average of all samples in the validation set, with the best outcomes
highlighted in bold.
Fig. 6. Illustrates the comparison of training curves for different methods on the ACDC dataset: (a) depicts
the training curve of UNETR++ and (b) shows the training curve of DS-UNETR++.
Methods DSC
U-Net2 89.22
nnUNet26 93.87
3D UX-Net43 89.14
UNETR++23 94.41
Guanghui Fu’s Model55 82.10
Our model 94.51
Table 6. Presents the analysis of DSC evaluation indicators for the results of the Heart validation dataset. This
dataset exclusively focuses on a single category, namely the left atrium, with the best outcomes highlighted in
bold.
Fig. 7. Illustrates the comparison of training curves for different methods applied to the Heart dataset: (a)
depicts the training curve of UNETR++ on the Heart dataset, while (b) shows the training curve of DS-
UNETR++ on the same dataset.
Despite being a single-category segmentation task with relatively well-defined segmented areas, there is a noticeable disparity among the models' performances.
Nevertheless, DS-UNETR++ still achieves the best outcome, demonstrating its superior generalization capability
and overall stronger performance. Figure 7 illustrates a comparison between training curves for UNETR++23
and DS-UNETR++ models on the Heart dataset (as shown in Figures (a) and (b), respectively). Notably,
UNETR++23 exhibits considerable fluctuations even after 300 rounds, while DS-UNETR++, on the other hand,
achieves stability faster—demonstrating its ability to converge more efficiently.
Table 7 presents the parameter counts and computational complexities of each model. “Params” denotes
the total number of parameters during model training, while "FLOPs" denotes the number of floating-point operations. The data in Table 7 indicate that, compared to the baseline model UNETR++23, our proposed model
exhibits a modest increase in both parameter count and computational complexity but achieves a higher Dice
Similarity Coefficient (DSC) value. This suggests that the additional parameters and computational resources
allocated to our model are effectively utilized. Furthermore, when compared to UNETR22 and nnFormer39, the
parameter counts of these models are 1.4 and 2.2 times higher, respectively, and their computational complexities
are 1.9 and 5.2 times greater than those of our proposed model. Despite this, neither UNETR nor nnFormer
achieve as high a DSC value as our model, further underscoring its good performance.
To further validate the robust performance of the proposed model under varying image quality conditions,
this study employs the Heart dataset for comparative experiments. Prior to model training, we increased the
sampling stride parameter and reduced the resolution of the input data to assess the model’s performance under
these altered conditions. As shown in Table 8, DS Str. denotes the downsampling stride. Based on the dual-
branch encoding network presented in this paper, each experimental group utilizes two stride parameters. This
study conducted three sets of experiments, with the last two serving as comparative verifications. The data in the
table indicate that for the first set of comparative experiments (Resolution Reduction 1), the input data resolution
for the first branch was reduced to half of the original, yielding a DSC value of 94.47%, which is only marginally lower than the best result of 94.51%. For the second set of comparative experiments (Resolution Reduction 2), the input data resolution for the second branch was reduced to one-quarter of the original, resulting in a DSC value of 94.45%, again only marginally lower than the best result. These two sets
of comparative experiments demonstrate that the proposed model exhibits strong robustness and can maintain
high performance even when the quality of the input images changes.
Table 7. Comparison of Model parameters and computational complexity. The overall computational
efficiency of the model is evaluated based on the number of parameters and FLOPs, while the dice similarity
coefficient (DSC) is utilized to illustrate the model’s performance.
Table 8. Verification of model robustness. Assessing the impact of varying input image resolution on overall
model performance.
Average
Serial number Models DSC HD95
0 UNETR++23 (Baseline Model) 92.54 1.20
1 Dual scale EPA module 92.78 1.15
2 Dual scale G-SWPA module 92.90 1.11
3 Dual scale G-SWPA module + G-DSCAM 93.03 1.10
Table 9. Presents the utilization of the ACDC dataset to investigate the impact of DS-UNETR++ across
various modules, with the average DSC and HD95 values for the three categories of ACDC dataset used
as analytical results. Specifically, ‘Dual scale EPA module’ denotes the dual-scale efficient paired attention
block, ‘Dual scale G-SWPA module’ refers to the dual-scale gated shared weight paired attention block, and
‘G-DSCAM’ represents the gated dual-scale feature image fusion module.
Considering all ablation experiments conducted herein, it can be concluded that the modules designed in this
study played pivotal roles and ultimately led to superior performance of the DS-UNETR++ model compared to
the baseline model as a whole.
Discussion
The limitations of pure convolutional architectures are evident, as they fail to establish effective long-range dependencies. Although the skip connections introduced in U-Net2 mitigate this issue, a semantic gap arises from the disparity in semantic information between shallow and deep features. However, employing a
pure Transformer12 architecture for 3D image segmentation tasks would entail significant computational costs.
Fig. 8. Visualization results of BraTS and Synapse datasets. (a) Illustrates a visual comparison of the
segmentation results obtained from BraTS. (b) Illustrates a visual comparison of the results obtained from
Synapse segmentation.
Fig. 9. Visualization results of ACDC and Heart datasets. (a) illustrates a visual comparison of the
segmentation results for ACDC. (b) illustrates a visual comparison of the segmentation results for the Heart.
Hence, combining convolution with Transformer12 and utilizing a hybrid image segmentation approach proves
to be an optimal choice.
The research in this paper focuses on the hybrid image segmentation method. Based on the experimental
results presented in the previous chapter, it is evident that the proposed method is highly effective. This paper
introduces three improvements. Firstly, a transition from single-branch to two-branch image segmentation is
implemented. During the experimentation phase, various combinations of scale information from multiple
branches are explored for different datasets. It was observed that incorporating small-scale feature information
at an early stage of the encoder does not contribute positively to model performance. After conducting numerous
experiments and considering scale information integration, an appropriate combination was ultimately selected.
Secondly, this paper enhances the gated shared weighted paired attention (G-SWPA) block. In comparison to
the EPA23 module in the baseline model, this study introduces a gated valve for each attention output to regulate
its impact on model performance. During experimentation, various positions within the G-SWPA module
were considered for incorporating these gated valves. After evaluating their effects at different stages such as
input initialization, post dot product operation, and final output stage of attention components, it was observed
that adding the gating valve in the output stage yielded optimal results. Consequently, this paper concludes
by proposing the inclusion of this gating valve in the output stage. Finally, a gated dual-scale cross-attention module (G-DSCAM) is devised to effectively integrate feature information from two scales. Various
experiments are conducted in this study to explore the optimal position of the gating mechanism. Additionally,
a ResNet45 structure is incorporated after the attention calculation stage to ensure output reliability.
The improvement of the aforementioned three aspects can be observed from the results of the ablation
experiment presented in Table 9. The incorporation and enhancement of each module positively contribute
to enhancing model performance. In the final experiment, this study evaluates the model’s performance on
four datasets. Each dataset exhibits varying degrees of improvement based on DSC and HD95 metrics, as
evident from data provided in Tables 3, 4, 5, and 6. Notably, significant improvements are observed for the
BraTS dataset (Table 3). To gain a more intuitive understanding of DS-UNETR++’s specific effects on image
segmentation process, this paper visualizes each dataset with accompanying real labels for comparison purposes.
The method proposed in this paper can be seamlessly integrated into clinical environments and readily adapted
for application in various other domains. Specifically, the G-SWPA module designed herein serves as an effective
feature extraction unit. The weight sharing and gating mechanisms embedded within it are versatile enough to
be applied across diverse fields, including urban landscape imagery, remote sensing image segmentation, natural
scene images, and even feature extraction in natural language processing. Furthermore, this study implements a
modular integration design at the code level for both G-SWPA and G-DSCAM, facilitating ease of deployment
and adaptability.
Although DS-UNETR++ has achieved satisfactory results, it still possesses certain limitations. Firstly, in
comparison to numerous lightweight model architectures, this study may necessitate a greater allocation of
computing resources. Although the use of a dual-branch feature extraction subnetwork undoubtedly enhances
the model’s ability to extract features, it also introduces additional computing overhead. Despite significant
advancements in computing power and memory capacity, the pursuit of enhanced performance metrics through
reduced parameter counts and computational complexity remains a key developmental direction for models and
a continuous innovation goal for researchers. Secondly, in the first three stages of the encoder, only convolution
is used to fuse the dual-scale features. Compared to the approach utilizing attention mechanisms, the proposed
method indeed reduces computational complexity; however, the fusion efficacy may not be optimal, potentially
impacting subsequent decoding performance. Therefore, this limitation warrants further enhancement.
Moreover, despite the model’s relatively low parameter count and computational complexity, there remains a
risk of overfitting, necessitating the implementation of strategies to mitigate this concern.
In light of the current limitations of this method, we will focus on enhancing our approach in the following
three areas: (1) By leveraging existing lightweight structural designs, we aim to optimize the dual-scale feature
encoding network presented in this paper, thereby reducing the computational complexity of the model; (2)
We will refine the feature fusion within the encoder and introduce a lightweight attention mechanism based
on convolutional operations to further enhance the fusion effect without increasing model complexity. (3) To
mitigate the risk of overfitting, we will employ techniques such as regularization and multiple cross-validation,
and explore the design of data augmentation methods specifically for medical images to further minimize the
risk of overfitting.
Conclusions
This paper introduces a 3D medical image segmentation network based on gated attention blocks and dual-scale
cross-attention mechanism. This architecture is designed to extract multi-scale features from two perspectives,
thereby reducing the model’s reliance on specific feature sets. A novel Gated Shared Weighted Paired Attention
block (G-SWPA) is proposed, which employs parallel spatial and channel attention modules to share query and
key weights. This design not only captures comprehensive spatial and channel information but also reduces the
number of parameters. Additionally, a gating mechanism is incorporated to dynamically adjust the influence of
spatial and channel attention, enhancing the effectiveness of feature extraction. In the bottleneck stage, a Gated
Dual-Scale Cross-Attention Module (G-DSCAM) is introduced to expand the receptive field by dimensionality
reduction, promoting the fusion and entanglement of feature information across different scales. The proposed
method was rigorously evaluated on four public medical datasets. On the BraTS dataset, the HD95 index drops to 4.98 while the DSC rises to 83.19%, which is 0.96 lower and 0.7% higher than the baseline model, respectively. On the Synapse dataset, the HD95 value is 6.67 and the DSC value is 87.75%, which is 0.86 lower and 0.53% higher
than the baseline model. These experimental results demonstrate that DS-UNETR++ excels in multi-organ
segmentation tasks, offering a promising approach for this domain.
Data availability
Data underlying the results presented in this paper may be obtained from the corresponding authors upon
reasonable request.
References
1. Umirzakova, S., Ahmad, S., Khan, L. U. & Whangbo, T. Medical image super-resolution for smart healthcare applications: a
comprehensive survey. Inf. Fusion 103. https://doi.org/10.1016/j.inffus.2023.102075 (2024).
2. Ronneberger, O., Fischer, P. & Brox, T. U-Net: Convolutional networks for biomedical image segmentation. Med. Image Comput.
Comput.-Assist. Interv. PT III 9351, 234–241. https://doi.org/10.1007/978-3-319-24574-4_28 (2015).
3. Zheng, Z. Z. et al. AFFU-Net: attention feature fusion U-Net with hybrid loss for winter jujube crack detection. Comput. Electron.
Agric. https://doi.org/10.1016/j.compag.2022.107049 (2022).
4. Song, H. J., Wang, Y. F., Zeng, S. J., Guo, X. Y. & Li, Z. H. OAU-net: outlined attention U-net for biomedical image segmentation.
Biomed. Signal. Process. Control https://doi.org/10.1016/j.bspc.2022.104038 (2023).
5. Hou, Q. Y. et al. ISTDU-Net: infrared small-target detection U-Net. IEEE Geosci. Remote Sens. Lett. https://doi.org/10.1109/LGRS.2022.3141584 (2022).
6. Lan, C. F., Zhang, L., Wang, Y. Q. & Liu, C. D. Research on improved DNN and MultiResU_Net network speech enhancement
effect. Multimed. Tools Appl. 81, 26163–26184. https://doi.org/10.1007/s11042-022-12929-6 (2022).
7. Chen, M. et al. SAU-Net: a novel network for building extraction from high-resolution remote sensing images by reconstructing fine-grained semantic features. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 17, 6747–6761. https://doi.org/10.1109/JSTARS.2024.3371427 (2024).
8. Valanarasu, J. M. J., Sindagi, V. A., Hacihaliloglu, I. & Patel, V. M. KiU-Net: Overcomplete convolutional architectures for biomedical
image and volumetric segmentation. IEEE Trans. Med. Imaging 41, 965–976. https://doi.org/10.1109/TMI.2021.3130469 (2022).
9. Sun, Q. et al. UCR-Net: U-shaped context residual network for medical image. Comput. Biol. Med. https://doi.org/10.1016/j.compbiomed.2022.106203 (2022).
10. Lin, H., Cheng, X., Wu, X. & Shen, D. CAT: Cross attention in vision transformer. In IEEE International Conference on Multimedia and Expo (ICME). https://doi.org/10.1109/icme52920.2022.9859720 (2022).
11. Jafari, M., Francis, S., Garibaldi, J. M. & Chen, X. LMISA: a lightweight multi-modality image segmentation network via domain
adaptation using gradient magnitude and shape constraint. Med. Image Anal., 102536. https://doi.org/10.1016/j.media.2022.102536
(2022).
12. Vaswani, A. et al. Attention is all you need. Adv. Neural Inf. Process. Syst. 30 (NIPS 2017). https://doi.org/10.48550/ARXIV.1706.03762 (2017).
13. Dosovitskiy, A. et al. An image is worth 16x16 words: transformers for image recognition at scale. arXiv:2010.11929 (2021).
14. Li, B. et al. RT-Unet: an advanced network based on residual network and transformer for medical image segmentation. Int. J.
Intell. Syst. 37, 8565–8582. https://doi.org/10.1002/int.22956 (2022).
15. Zhang, J. Y. et al. SWTRU: Star-shaped window transformer reinforced U-Net for medical image segmentation. Comput. Biol. Med. https://doi.org/10.1016/j.compbiomed.2022.105954 (2022).
16. Zhu, Z. et al. Brain tumor segmentation in MRI with multi-modality spatial information enhancement and boundary shape
correction. Pattern Recognit. https://doi.org/10.1016/j.patcog.2024.110553 (2024).
17. Zhu, Z. et al. Lightweight medical image segmentation network with multi-scale feature-guided fusion. Comput. Biol. Med. 182,
109204–109204. https://doi.org/10.1016/j.compbiomed.2024.109204 (2024).
18. Zhan, B. C. et al. CFNet: a medical image segmentation method using the multi-view attention mechanism and adaptive fusion
strategy. Biomed. Signal. Process. Control. https://doi.org/10.1016/j.bspc.2022.104112 (2023).
19. Cai, W. T. et al. DFTNet: dual-path feature transfer network for weakly supervised medical image segmentation. IEEE-ACM Trans.
Comput. Biol. Bioinform. 20, 2530–2540. https://doi.org/10.1109/TCBB.2022.3198284 (2023).
20. Zhu, Z. et al. Sparse dynamic volume TransUNet with multi-level edge fusion for brain tumor segmentation. Comput. Biol. Med.
172, 108284–108284. https://doi.org/10.1016/j.compbiomed.2024.108284 (2024).
21. Zhu, Z. et al. Brain tumor segmentation based on the fusion of deep semantics and edge information in multimodal MRI. Inf.
Fusion. 91, 376–387. https://doi.org/10.1016/j.inffus.2022.10.022 (2023).
22. Hatamizadeh, A. et al. UNETR: Transformers for 3D medical image segmentation. In 2022 IEEE Winter Conference on Applications
of Computer Vision (WACV 2022), 1748–1758. https://doi.org/10.1109/WACV51458.2022.00181 (2022).
23. Shaker, A. et al. UNETR++: Delving into efficient and accurate 3D medical image segmentation. arXiv:2212.04497 (2023).
24. Woo, S., Park, J., Lee, J. Y. & Kweon, I. S. CBAM: convolutional block attention module. In Computer Vision—ECCV 2018, Lecture Notes in Computer Science, 3–19. https://doi.org/10.1007/978-3-030-01234-2_1 (2018).
25. Milletari, F., Navab, N. & Ahmadi, S. A. V-Net: fully convolutional neural networks for volumetric medical image segmentation. In Proceedings of the Fourth International Conference on 3D Vision (3DV), 565–571. https://doi.org/10.1109/3DV.2016.79 (2016).
26. Isensee, F. et al. nnU-Net: Self-adapting framework for U-Net-based medical image segmentation. arXiv:1809.10486 (2018).
27. Chen, L. C. et al. Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE
Trans. Pattern Anal. Mach. Intell. 40, 834–848. https://doi.org/10.1109/TPAMI.2017.2699184 (2018).
28. Chen, L. C., Zhu, Y., Papandreou, G., Schroff, F. & Adam, H. Encoder-decoder with atrous separable convolution for semantic
image segmentation. Comput. Vis. ECCV 2018 PT VII. 11211, 833–851. https://doi.org/10.1007/978-3-030-01234-2_49 (2018).
29. Lin, A. et al. DS-TransUNet: dual swin transformer U-Net for medical image segmentation. IEEE Trans. Instrum. Meas. https://doi.org/10.1109/TIM.2022.3178991 (2022).
30. Bai, Q. Y., Luo, X. B., Wang, Y. X. & Wei, T. F. DHRNet: a dual-branch hybrid reinforcement network for semantic segmentation of
remote sensing images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 17, 4176–4193. https://doi.org/10.1109/JSTARS.2024.3357216
(2024).
31. Khan, T. M., Naqvi, S. S. & Meijering, E. ESDMR-Net: A lightweight network with expand-squeeze and dual multiscale residual
connections for medical image segmentation. Eng. Appl. Artif. Intell. https://doi.org/10.1016/j.engappai.2024.107995 (2024).
32. Wang, K. N. et al. SBCNet: scale and boundary context attention dual-branch network for liver tumor segmentation. IEEE J.
Biomed. Health Inf. 28, 2854–2865. https://doi.org/10.1109/JBHI.2024.3370864 (2024).
33. He, Z. X., Li, X. X., Lv, N. Z., Chen, Y. L. & Cai, Y. Retinal vascular segmentation network based on multi-scale adaptive feature
fusion and dual-path upsampling. IEEE Access 12, 48057–48067. https://doi.org/10.1109/ACCESS.2024.3383848 (2024).
34. Kuang, H. P., Yang, X., Li, H. J., Wei, J. W. & Zhang, L. H. Adaptive multiphase liver tumor segmentation with multiscale supervision.
IEEE Signal. Process. Lett. 31, 426–430. https://doi.org/10.1109/LSP.2024.3356414 (2024).
35. Zhou, Q. et al. Boundary-guided lightweight semantic segmentation with multi-scale semantic context. IEEE Trans. Multimed. 26,
7887–7900. https://doi.org/10.1109/TMM.2024.3372835 (2024).
36. Liu, Z. et al. Swin transformer: hierarchical vision transformer using shifted windows. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV 2021), 9992–10002. https://doi.org/10.1109/ICCV48922.2021.00986 (2021).
37. Cao, H. et al. Swin-Unet: Unet-like pure transformer for medical image segmentation. In Computer Vision—ECCV 2022 Workshops, Lecture Notes in Computer Science, 205–218. https://doi.org/10.1007/978-3-031-25066-8_9 (2023).
38. Peiris, H., Hayat, M., Chen, Z., Egan, G. & Harandi, M. A robust volumetric transformer for accurate 3D tumor segmentation.
Med. Image Comput. Comput. Assist. Interv. MICCAI 2022 PT V 13435, 162–172. https://doi.org/10.1007/978-3-031-16443-9_16
(2022).
39. Zhou, H. Y. et al. nnFormer: volumetric medical image segmentation via a 3D transformer. IEEE Trans. Image Process. 32, 4036–4045. https://doi.org/10.1109/TIP.2023.3293771 (2023).
40. Wang, H. et al. Axial-DeepLab: stand-alone axial-attention for panoptic segmentation. arXiv:2003.07853 (2020).
41. Valanarasu, J. M. J., Oza, P., Hacihaliloglu, I. & Patel, V. M. Medical transformer: gated axial-attention for medical image segmentation. Med. Image Comput. Comput. Assist. Interv. MICCAI 2021 PT I 12901, 36–46. https://doi.org/10.1007/978-3-030-87193-2_4 (2021).
42. Chen, J. et al. TransUNet: transformers make strong encoders for medical image segmentation. arXiv:2102.04306. https://doi.org/10.48550/arXiv.2102.04306 (2021).
43. Lee, H. H., Bao, S., Huo, Y. & Landman, B. A. 3D UX-Net: a large kernel volumetric ConvNet modernizing hierarchical transformer for medical image segmentation. arXiv:2209.15076 (2023).
44. Hatamizadeh, A. et al. Swin UNETR: Swin transformers for semantic segmentation of brain tumors in MRI images. In Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries, BrainLes 2021, PT I 12962, 272–284. https://doi.org/10.1007/978-3-031-08999-2_22 (2022).
45. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and
Pattern Recognition (CVPR). https://doi.org/10.1109/cvpr.2016.90 (2016).
46. Hu, J., Shen, L., Albanie, S., Sun, G. & Wu, E. Squeeze-and-excitation networks. IEEE Trans. Pattern Anal. Mach. Intell. 42, 2011–2023. https://doi.org/10.1109/tpami.2019.2913372 (2020).
47. Zheng, S. X. et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In 2021 IEEE/CVF
Conference on Computer Vision and Pattern Recognition, CVPR 2021, 6877–6886. https://doi.org/10.1109/CVPR46437.2021.00681
(2021).
48. Wang, W. X. et al. TransBTS: multimodal brain tumor segmentation using transformer. Med. Image Comput. Comput. Assist. Interv.
MICCAI 2021 PT I 12901, 109–119. https://doi.org/10.1007/978-3-030-87193-2_11 (2021).
49. Xie, Y. T., Zhang, J. P., Shen, C. H. & Xia, Y. CoTr: efficiently bridging CNN and transformer for 3D medical image segmentation. Med. Image Comput. Comput. Assist. Interv. MICCAI 2021 PT III 12903, 171–180. https://doi.org/10.1007/978-3-030-87199-4_16 (2021).
50. Zhang, D., Wang, C., Chen, T., Chen, W. & Shen, Y. Scalable swin transformer network for brain tumor segmentation from
incomplete MRI modalities. Artif. Intell. Med. 149, 102788–102788. https://doi.org/10.1016/j.artmed.2024.102788 (2024).
51. Ting, H. & Liu, M. Multimodal transformer of incomplete MRI data for brain tumor segmentation. IEEE J. Biomed. Health Inf. 28,
89–99. https://doi.org/10.1109/JBHI.2023.3286689 (2024).
52. Huang, X. H., Deng, Z. F., Li, D. D., Yuan, X. G. & Fu, Y. MISSFormer: an effective transformer for 2D medical image segmentation.
IEEE Trans. Med. Imaging 42, 1484–1494. https://doi.org/10.1109/TMI.2022.3230943 (2023).
53. Xu, G., Zhang, X., He, X., Wu, X. & Liu, Q. LeViT-UNet: make faster encoders with transformer for medical image segmentation.
Pattern Recognit. Comput. Vis. PRCV 2023 PT VIII 14432, 42–53. https://doi.org/10.1007/978-981-99-8543-2_4 (2024).
54. Feng, Y., Su, J., Zheng, J., Zheng, Y. & Zhang, X. A parallelly contextual convolutional transformer for medical image segmentation.
Biomed. Signal. Process. Control 98. https://doi.org/10.1016/j.bspc.2024.106674 (2024).
55. Fu, G. et al. Projected pooling loss for red nucleus segmentation with soft topology constraints. J. Med. Imaging (Bellingham Wash).
11, 044002 (2024).
56. Roy, S. et al. MedNeXt: Transformer-driven scaling of ConvNets for medical image segmentation. Med. Image Comput. Comput. Assist. Interv. MICCAI 2023 PT IV 14223, 405–415. https://doi.org/10.1007/978-3-031-43901-8_39 (2023).
57. Antonelli, M. et al. The medical segmentation decathlon. Nat. Commun. https://doi.org/10.1038/s41467-022-30695-9 (2022).
58. Landman, B., Xu, Z., Igelsias, J., Styner, M., Langerak, T. & Klein, A. MICCAI multi-atlas labeling beyond the cranial vault – workshop and challenge. In MICCAI Multi-Atlas Labeling Beyond the Cranial Vault – Workshop and Challenge (2015).
59. Bernard, O. et al. Deep learning techniques for automatic MRI cardiac multi-structures segmentation and diagnosis: is the problem
solved? IEEE Trans. Med. Imaging 2514–2525. https://doi.org/10.1109/tmi.2018.2837502 (2018).
Author contributions
Chunhui Jiang: Conceptualization, Methodology, Writing, Data testing, Reviewing, Software. Yi Wang: Conceptualization, Supervision, Formal analysis. Qingni Yuan: Conceptualization, Resources, Reviewing. Pengju Qu: Provided critical revisions to the article. Heng Li: Provided critical revisions to the article.
Declarations
Competing interests
The authors declare no competing interests.
Additional information
Correspondence and requests for materials should be addressed to Y.W. or Q.Y.
Reprints and permissions information is available at www.nature.com/reprints.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives
4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in
any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide
a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have
permission under this licence to share adapted material derived from this article or parts of it. The images or
other third party material in this article are included in the article’s Creative Commons licence, unless indicated
otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence
and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to
obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.