Pre-Trained Image Processing Transformer
Hanting Chen1,2, Yunhe Wang2*, Tianyu Guo1,2, Chang Xu3, Yiping Deng4, Zhenhua Liu2,5,6, Siwei Ma5,6, Chunjing Xu2, Chao Xu1, Wen Gao5,6
1 Key Lab of Machine Perception (MOE), Dept. of Machine Intelligence, Peking University. 2 Noah's Ark Lab, Huawei Technologies.
3 School of Computer Science, Faculty of Engineering, The University of Sydney. 4 Central Software Institution, Huawei Technologies.
5 Institute of Digital Media, School of Electronic Engineering and Computer Science, Peking University. 6 Peng Cheng Laboratory.
[email protected], [email protected]
* Corresponding author
Abstract

In this paper, we study low-level computer vision tasks (e.g., denoising, super-resolution and deraining) and develop a new pre-trained model, namely, the image processing transformer (IPT). To maximally excavate the capability of the transformer, we propose to utilize the well-known ImageNet benchmark for generating a large amount of corrupted image pairs. The IPT model is trained on these images with multi-heads and multi-tails. In addition, contrastive learning is introduced for well adapting to different image processing tasks. The pre-trained model can therefore be efficiently employed on the desired task after fine-tuning. With only one pre-trained model, IPT outperforms the current state-of-the-art methods on various low-level benchmarks. Code is available at https://github.com/huawei-noah/Pretrained-IPT and https://gitee.com/mindspore/mindspore/tree/master/model_zoo/research/cv/IPT.

Figure 1. Comparison on the performance of the proposed IPT and the state-of-the-art image processing models on different tasks.

1. Introduction

Image processing is one component of the low-level part of a more global image analysis or computer vision system. Results from image processing can largely influence the subsequent high-level part that performs recognition and understanding of the image data. Recently, deep learning has been widely applied to solve low-level vision tasks, such as image super-resolution, inpainting, deraining and colorization. As many image processing tasks are related, it is natural to expect a model pre-trained on one dataset to be helpful for another. But few studies have generalized pre-training across image processing tasks.

Pre-training has the potential to provide an attractive solution to image processing tasks by addressing the following two challenges. First, task-specific data can be limited. This problem is exacerbated in image processing tasks that involve paid-for data or data privacy, such as medical images [8] and satellite images [73]. Various inconsistent factors (e.g., camera parameters, illumination and weather) can further perturb the distribution of the captured data for training. Second, it is unknown which type of image processing job will be requested until the test image is presented. We therefore have to prepare a series of image processing modules at hand. They have distinct aims, but some underlying operations could be shared.
It is now common to have pre-training in natural language processing and computer vision [12]. For example, the backbones of object detection models [86, 85] are often pre-trained on ImageNet classification [18]. A number of well-trained networks can now be easily obtained from the Internet, including AlexNet [41], VGGNet [56] and ResNet [33]. The seminal Transformer [61] has been widely used in many natural language processing (NLP) tasks, such as translation [64] and question answering [58]. The secret of its success is to pre-train transformer-based models on a large text corpus and fine-tune them on task-specific datasets. Variants of Transformers, like BERT [19] and GPT-3 [5], further enriched the training data and improved the pre-training skills. There have been interesting attempts to extend the success of Transformers to the computer vision field. For example, Wang et al. [62] and Fu et al. [25] applied self-attention based models to capture global information on images. Carion et al. [7] proposed DETR, which uses transformer architectures for end-to-end object detection. Most recently, Dosovitskiy et al. [22] introduced the Vision Transformer (ViT), which treats input images as 16x16 words and attains excellent results on image recognition.

The aforementioned pre-training in computer vision and natural language mostly investigates a pretext classification task, but both the input and the output of an image processing task are images. A straightforward application of these existing pre-training strategies might therefore not be feasible. Further, how to effectively address different target image processing tasks in the pre-training stage remains a hard challenge. It is also instructive to note that the pre-training of image processing models enjoys the convenience of self-generating training instances based on the original real images: the synthetically manipulated images are taken for training, while the original image itself is the ground-truth to be reconstructed.

In this paper, we develop a pre-trained model for image processing using the transformer architecture, namely, the Image Processing Transformer (IPT). As the pre-trained model needs to be compatible with different image processing tasks, including super-resolution, denoising, and deraining, the entire network is composed of multiple pairs of heads and tails corresponding to the different tasks and a single shared body. Since the potential of the transformer needs to be excavated using large-scale data, we should prepare a great number of images with considerable diversity for training the IPT model. To this end, we select the ImageNet benchmark, which contains various high-resolution images from 1,000 categories. For each image in ImageNet, we generate multiple corrupted counterparts using several carefully designed operations to serve the different tasks. For example, training samples for the super-resolution task are generated by downsampling the original images. The entire dataset we use for training IPT contains over 10 million images.

Then, the transformer architecture is trained on this huge dataset as follows. The training images are input to the task-specific head, and the generated features are cropped into patches (i.e., "words") and subsequently flattened to sequences. The transformer body is employed to process the flattened features, in which position and task embeddings are utilized for the encoder and decoder, respectively. In addition, tails are forced to predict the original images with different output sizes according to the specific task. Moreover, a contrastive loss on the relationship between patches of different inputs is introduced for well adapting to different image processing tasks. The proposed image processing transformer is learned in an end-to-end manner. Experimental results on several benchmarks show that the pre-trained IPT model can surpass most existing methods on their own tasks by a significant margin after fine-tuning.

2. Related Works

2.1. Image Processing

Image processing consists of the manipulation of images, including super-resolution, denoising, dehazing, deraining, deblurring, etc. A variety of deep-learning-based methods have been proposed to conduct one or several kinds of image processing tasks. For super-resolution, Dong et al. propose SRCNN [20, 21], considered a pioneering work that introduces end-to-end models reconstructing HR images from their LR counterparts. Kim et al. [39] further explore the capacity of deep neural networks with a deeper convolutional network. Ahn et al. [2] and Lim et al. [47] propose to introduce residual blocks into the SR task. Zhang et al. [80] and Anwar and Barnes [3] utilize the power of attention to enhance performance on the SR task. Various excellent works have also been proposed for the other tasks, such as denoising [60, 31, 36, 42, 24], dehazing [6, 43, 74, 71], deraining [35, 69, 55, 29, 65, 44], and deblurring [59, 50, 23, 10]. Different from the above methods, we dig into the capacity of both big models and huge volumes of data, and introduce a pre-trained model that handles several image processing tasks.

2.2. Transformer

Transformer [61] and its variants have proven their success as powerful unsupervised or self-supervised pre-training frameworks in various natural language processing tasks. For example, GPTs [52, 53, 5] are pre-trained in an autoregressive way by predicting the next word on huge text datasets. BERT [19] learns from data without explicit supervision and predicts a masked word based on context. Colin et al. [54] propose a universal pre-training framework for several downstream tasks. Yinhan et al. [49] propose a robust variant of the original BERT.
Figure 2. The diagram of the proposed image processing transformer (IPT). The IPT model consists of multi-head and multi-tail for
different tasks and a shared transformer body including encoder and decoder. The input images are first converted to visual features and
then divided into patches as visual words for subsequent processing. The resulting images with high visual quality are reconstructed by
ensembling output patches.
Due to the success of Transformer-based models in the NLP field, there have been many attempts to explore the benefits of the Transformer in computer vision tasks. These attempts can be roughly divided into two types. The first is to introduce self-attention into the traditional convolutional neural network. Yuan et al. [72] introduce spatial attention for image segmentation. Fu et al. [26] propose DANET, which utilizes context information by combining spatial and channel attention. Wang et al. [66], Chen et al. [15], Jiang et al. [37] and Zhang et al. [79] also augment features with self-attention to enhance model performance on several high-level vision tasks. The other type is to replace the convolutional neural network with self-attention blocks. For instance, Kolesnikov et al. [40] and Dosovitskiy et al. [22] conduct image classification with transformer blocks. Carion et al. [7] and Zhu et al. [88] implement transformer-based models for detection. Chen et al. [11] propose a pre-trained GPT model for generative and classification tasks. Wu et al. [68] and Zhao et al. [84] propose pre-training methods for transformer-based models for the image recognition task. Jiang et al. [38] propose TransGAN to generate images using the Transformer. However, few related works focus on low-level vision tasks. In this paper, we explore a universal pre-training approach for image processing tasks.

3. Image Processing Transformer

To excavate the potential of the transformer for image processing tasks and achieve better results, we present the image processing transformer, pre-trained on a large-scale dataset.

3.1. IPT architecture

The overall architecture of our IPT consists of four components: heads for extracting features from the corrupted input images (e.g., images with noise or low-resolution images), a transformer encoder and decoder for recovering the missing information in the input data, and tails for mapping the features into restored images. Here we briefly introduce the architecture; details can be found in the supplementary material.

Heads. To adjust to different image processing tasks, we use a multi-head architecture to deal with each task separately, where each head consists of three convolutional layers. Denote the input image as x ∈ R^{3×H×W} (3 means R, G, and B); the head generates a feature map f_H ∈ R^{C×H×W} with C channels and the same height and width (typically we use C = 64). The calculation can be formulated as f_H = H^i(x), where H^i (i = {1, ..., N_t}) denotes the head for the i-th task and N_t denotes the number of tasks.

Transformer encoder. Before inputting the features into the transformer body, we split the given features into patches, and each patch is regarded as a "word". Specifically, the features f_H ∈ R^{C×H×W} are reshaped into a sequence of patches, i.e., f_{p_i} ∈ R^{P^2×C}, i = {1, ..., N}, where N = HW/P^2 is the number of patches (i.e., the length of the sequence) and P is the patch size. To maintain the position information of each patch, we add a learnable position encoding E_{p_i} ∈ R^{P^2×C} to each patch feature f_{p_i}, following [22, 7], and E_{p_i} + f_{p_i} is directly input into the transformer encoder.
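For illustration, the following PyTorch-style sketch shows one way to implement the patch splitting and learnable position encodings described above. The class name PatchEmbedding, the patch size of 4 and the cap of 144 patches (48 x 48 features with P = 4) are our own assumptions for this sketch, not the released implementation.

    import torch
    import torch.nn as nn

    class PatchEmbedding(nn.Module):
        """Split a C x H x W feature map into N = HW / P^2 patches of shape
        P^2 x C and add a learnable position encoding to each patch
        (a sketch, not the authors' released code)."""

        def __init__(self, channels=64, patch_size=4, max_patches=144):
            super().__init__()
            self.patch_size = patch_size
            # One learnable encoding per patch position, each of shape P^2 x C.
            self.pos_embed = nn.Parameter(
                torch.zeros(max_patches, patch_size * patch_size, channels))

        def forward(self, feat):                  # feat: (B, C, H, W)
            b, c, h, w = feat.shape
            p = self.patch_size
            # (B, C, H/P, P, W/P, P) -> (B, H/P * W/P, P*P, C)
            patches = feat.reshape(b, c, h // p, p, w // p, p)
            patches = patches.permute(0, 2, 4, 3, 5, 1).reshape(b, -1, p * p, c)
            n = patches.shape[1]                  # number of patches N (<= max_patches assumed)
            return patches + self.pos_embed[:n]   # E_p + f_p, input to the encoder

The resulting sequence of N patch tokens is what the encoder layers below operate on.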
The architecture of an encoder layer follows the original structure in [61], with a multi-head self-attention module and a feed-forward network. The output of the encoder, f_{E_i} ∈ R^{P^2×C}, for each patch has the same size as that of the input patch f_{p_i}. The calculation can be formulated as

  y_0 = [E_{p_1} + f_{p_1}, E_{p_2} + f_{p_2}, ..., E_{p_N} + f_{p_N}],
  q_i = k_i = v_i = LN(y_{i-1}),
  y'_i = MSA(q_i, k_i, v_i) + y_{i-1},
  y_i = FFN(LN(y'_i)) + y'_i,        i = 1, ..., l,
  [f_{E_1}, f_{E_2}, ..., f_{E_N}] = y_l,        (1)

where l denotes the number of layers in the encoder, MSA denotes the multi-head self-attention module of the conventional transformer model [61], LN denotes layer normalization [4] and FFN denotes the feed-forward network, which contains two fully connected layers.

Transformer decoder. The decoder follows the same architecture and takes the output of the encoder as input in the transformer body; each layer consists of two multi-head self-attention (MSA) modules and one feed-forward network (FFN). The difference from the original transformer is that we utilize a task-specific embedding as an additional input to the decoder. These task-specific embeddings E_t^i ∈ R^{P^2×C}, i = {1, ..., N_t}, are learned to decode features for the different tasks. The calculation of the decoder can be formulated as:

  z_0 = [f_{E_1}, f_{E_2}, ..., f_{E_N}],
  q_i = k_i = LN(z_{i-1}) + E_t,   v_i = LN(z_{i-1}),
  z'_i = MSA(q_i, k_i, v_i) + z_{i-1},
  q'_i = LN(z'_i) + E_t,   k'_i = v'_i = LN(z_0),
  z''_i = MSA(q'_i, k'_i, v'_i) + z'_i,
  z_i = FFN(LN(z''_i)) + z''_i,        i = 1, ..., l,
  [f_{D_1}, f_{D_2}, ..., f_{D_N}] = z_l,        (2)

where f_{D_i} ∈ R^{P^2×C} denotes the outputs of the decoder. The N decoded patch features of size P^2 × C are then reshaped into the feature map f_D of size C × H × W.
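The decoder computation in Eq. (2), with the task embedding added to queries and keys, can be sketched as follows. Layer names, the head count and the FFN width are assumed values, and the choice of separate layer norms for the decoder state and the encoder output is a simplification of ours rather than the authors' exact design.

    import torch
    import torch.nn as nn

    class IPTDecoderLayer(nn.Module):
        """One decoder layer following Eq. (2): two multi-head self-attention
        blocks and a feed-forward network, with a task-specific embedding E_t
        added to queries and keys (a sketch, not the released code)."""

        def __init__(self, dim, heads=8, ffn_dim=2048):
            super().__init__()
            self.norm1 = nn.LayerNorm(dim)
            self.attn1 = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm2 = nn.LayerNorm(dim)
            self.norm_enc = nn.LayerNorm(dim)
            self.attn2 = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm3 = nn.LayerNorm(dim)
            self.ffn = nn.Sequential(nn.Linear(dim, ffn_dim), nn.ReLU(),
                                     nn.Linear(ffn_dim, dim))

        def forward(self, z, z0, task_embed):
            # z, z0: (B, N, dim) decoder state and encoder output; task_embed: (N, dim).
            y = self.norm1(z)
            # First MSA: queries/keys carry E_t, values do not (Eq. 2).
            z = z + self.attn1(y + task_embed, y + task_embed, y,
                               need_weights=False)[0]
            # Second MSA: query from the decoder state (+ E_t), keys/values from z0.
            q = self.norm2(z) + task_embed
            kv = self.norm_enc(z0)
            z = z + self.attn2(q, kv, kv, need_weights=False)[0]
            # Feed-forward network with a residual connection.
            return z + self.ffn(self.norm3(z))

At inference, only the embedding of the selected task is fed to every decoder layer, so the same body serves all tasks.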
Tails. The properties of the tails are the same as those of the heads; we use multiple tails to deal with the different tasks. The calculation can be formulated as f_T = T^i(f_D), where T^i (i = {1, ..., N_t}) denotes the tail for the i-th task and N_t denotes the number of tasks. The output f_T is the restored image of size 3 × H' × W', which is determined by the specific task. For example, H' = 2H, W' = 2W for a 2× super-resolution task.

3.2. Pre-training on ImageNet

Besides the architecture of the transformer itself, one of the key factors for successfully training an excellent transformer is the good use of large-scale datasets. Compared with image classification, the amount of data available for image processing tasks is relatively small (e.g., only 2000 images in the DIV2K dataset for image super-resolution). We therefore propose to utilize the well-known ImageNet as the baseline dataset for pre-training our IPT model, and we generate the data for several tasks (e.g., super-resolution and denoising) as follows.

The images in the ImageNet benchmark are of high diversity: there are over 1 million natural images from 1,000 different categories, with abundant texture and color information. We first remove the semantic labels and manually synthesize a variety of corrupted images from these unlabeled images with a variety of degradation models for different tasks. Note that synthesized datasets are commonly used in these image processing tasks, and we use the same degradation methods as suggested in [30, 1]. For example, super-resolution tasks often take bicubic degradation to generate low-resolution images, and denoising tasks add Gaussian noise with different noise levels to clean images to generate the noisy images. These synthesized images can significantly improve the performance of learned deep networks, including both CNN and transformer architectures, as will be shown in the experiment part. Basically, the corrupted images are synthesized as

  I_corrupted = f(I_clean),        (3)

where f denotes the degradation transformation, which depends on the specific task: for the super-resolution task, f^sr is exactly the bicubic interpolation; for image denoising, f^noise(I) = I + η, where η is additive Gaussian noise; for deraining, f^rain(I) = I + r, in which r is a hand-crafted rain streak. The loss function for learning our IPT in the supervised fashion can be formulated as

  L_supervised = Σ_{i=1}^{N_t} L1(IPT(I_corrupted^i), I_clean),        (4)

where L1 denotes the conventional L1 loss for reconstructing the desired images and I_corrupted^i denotes the corrupted image for task i. Eq. (4) implies that the proposed framework is trained with multiple image processing tasks simultaneously. Specifically, for each batch, we randomly select one task from the N_t supervised tasks for training, and each task is processed using the corresponding head, tail and task embedding. After pre-training, the IPT model captures the intrinsic features and transformations for a large variety of image processing tasks and can thus be further fine-tuned on the desired task using newly provided data. During fine-tuning, the other heads and tails are dropped to save computation costs, and the parameters in the remaining head, tail and body are updated according to back-propagation.

However, due to the variety of degradation models, we cannot synthesize images for all image processing tasks.
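A minimal sketch of the data synthesis in Eq. (3) and the multi-task supervised step in Eq. (4) is given below. The task list, the noise level, the rain-streak model and the call signature ipt(corrupted, task_id) are illustrative assumptions; the latter stands for a model that routes the input through the head and tail of the selected task.

    import random
    import torch
    import torch.nn.functional as F

    def degrade(clean, task):
        """Synthesize a corrupted input from a clean batch, Eq. (3).
        'clean' is a (B, 3, H, W) tensor in [0, 1]; the exact settings
        are assumptions for illustration."""
        if task == "sr_x2":                       # bicubic downsampling
            return F.interpolate(clean, scale_factor=0.5, mode="bicubic",
                                 align_corners=False)
        if task == "denoise_30":                  # additive Gaussian noise, sigma = 30
            return clean + torch.randn_like(clean) * (30.0 / 255.0)
        if task == "derain":                      # crude stand-in for hand-crafted rain streaks r
            rain = torch.relu(torch.randn_like(clean[:, :1]) - 1.5)
            return clean + rain                   # I + r
        raise ValueError(task)

    def pretrain_step(ipt, batch, tasks, optimizer):
        """One pre-training step: sample a single task for the whole batch and
        minimize the L1 reconstruction loss of Eq. (4)."""
        task_id = random.randrange(len(tasks))
        corrupted = degrade(batch, tasks[task_id])
        restored = ipt(corrupted, task_id)        # routed through the task's head/tail
        loss = F.l1_loss(restored, batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

Sampling one task per batch, as in this sketch, matches the batching strategy described above and keeps the memory cost of a step bounded by a single head/tail pair.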
For example, there is a wide range of possible noise levels in practice. Therefore, the generalization ability of the resulting IPT should be further enhanced. Similar to the pre-training of natural language processing models, the relationship between patches of images is also informative: a patch in the image scenario can be considered as a word in natural language processing. For example, patches cropped from the same feature map are more likely to appear together and should be embedded into similar positions. Therefore, we introduce contrastive learning [13, 32] for learning universal features so that the pre-trained IPT model can be utilized for unseen tasks. In practice, denote the output patch features generated by the IPT decoder for a given input x_j as f_{D_i}^j ∈ R^{P^2×C}, i = {1, ..., N}, where x_j is selected from a batch of training images X = {x_1, x_2, ..., x_B}. We aim to minimize the distance between patch features from the same image while maximizing the distance between patches from different images. The loss function for contrastive learning is formulated as:

  l(f_{D_{i_1}}^j, f_{D_{i_2}}^j) = -log [ exp(d(f_{D_{i_1}}^j, f_{D_{i_2}}^j)) / ( Σ_{k=1}^{B} 1_{k≠j} exp(d(f_{D_{i_1}}^j, f_{D_{i_2}}^k)) ) ],
  L_contrastive = (1 / (B N^2)) Σ_{i_1=1}^{N} Σ_{i_2=1}^{N} Σ_{j=1}^{B} l(f_{D_{i_1}}^j, f_{D_{i_2}}^j),        (5)

where d(a, b) = a^T b / (‖a‖‖b‖) denotes the cosine similarity. Moreover, to make full use of both the supervised and the self-supervised information, we reformulate the loss function as

  L_IPT = λ · L_contrastive + L_supervised,        (6)

wherein we combine the λ-balanced contrastive loss with the supervised loss as the final objective function of IPT. Thus, the proposed transformer network trained using Eq. (6) can be effectively exploited on various existing image processing tasks.
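Eq. (5) can be computed directly on the decoder outputs. The sketch below keeps the explicit loops over patch indices for clarity rather than speed, and assumes the per-patch features have been flattened to vectors of dimension D = P^2 * C; it is an illustration, not the released implementation.

    import torch
    import torch.nn.functional as F

    def patch_contrastive_loss(feats):
        """Contrastive loss of Eq. (5).
        feats: (B, N, D) tensor, one D-dimensional feature per decoded patch."""
        B, N, D = feats.shape
        f = F.normalize(feats, dim=-1)             # dot product then equals cosine similarity
        total = 0.0
        for i1 in range(N):
            for i2 in range(N):
                # sim[j, k] = d(f^j_{i1}, f^k_{i2}) for all pairs of images in the batch
                sim = f[:, i1] @ f[:, i2].t()      # (B, B)
                pos = torch.diagonal(sim)          # same image (k = j): positive pair
                neg = sim.exp().sum(dim=1) - pos.exp()   # k != j: negatives
                total = total + (-(pos.exp() / neg).log()).sum()
        return total / (B * N * N)

The total pre-training loss of Eq. (6) is then lambda * patch_contrastive_loss(decoder_feats) + supervised_l1_loss.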
4. Experiments

In this section, we evaluate the performance of the proposed IPT on various image processing tasks including super-resolution and image denoising. We show that the pre-trained IPT model can achieve state-of-the-art performance on these tasks. Moreover, extensive ablation experiments show that transformer-based models perform better than convolutional neural networks when a large-scale dataset is used to solve the image processing problem.

Datasets. To obtain better pre-trained results of the IPT model, we use the well-known ImageNet dataset, which consists of over 1M color images of high diversity. The training images are cropped into 48 × 48 patches with 3 channels for training, i.e., there are over 10M patches for training the IPT model. We then generate the corrupted images with 6 types of degradation: 2×, 3× and 4× bicubic interpolation, Gaussian noise with noise levels 30 and 50, and added rain streaks, respectively. For the rain-streak generation, we follow the method described in [70]. During the test, we crop the images in the test set into 48 × 48 patches with a 10-pixel overlap. Note that the same testing strategy is also adopted for the CNN-based models for a fair comparison, and the resulting PSNR values of the CNN models are the same as those of their baselines.
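The overlapped-crop testing described above can be implemented as in the following sketch. Averaging the overlapping predictions is our assumption (the paper only states 48 x 48 test crops with a 10-pixel overlap), and the sketch assumes the model output has the same spatial size as its input, as in denoising and deraining.

    import torch

    def test_by_patches(model, image, patch=48, overlap=10):
        """Restore an image by running the model on overlapping crops and
        averaging the overlapping predictions (a sketch under the stated
        assumptions). image: (B, C, H, W) with H, W >= patch."""
        b, c, h, w = image.shape
        out = torch.zeros_like(image)
        weight = torch.zeros_like(image[:, :1])
        step = patch - overlap
        # Crop origins on a regular grid, plus the bottom/right border crops.
        ys = sorted(set(list(range(0, h - patch + 1, step)) + [h - patch]))
        xs = sorted(set(list(range(0, w - patch + 1, step)) + [w - patch]))
        for y in ys:
            for x in xs:
                pred = model(image[:, :, y:y + patch, x:x + patch])
                out[:, :, y:y + patch, x:x + patch] += pred
                weight[:, :, y:y + patch, x:x + patch] += 1
        return out / weight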
Training & Fine-tuning. We use 32 NVIDIA Tesla V100 cards to train our IPT model using the conventional Adam optimizer with β1 = 0.9, β2 = 0.999 for 300 epochs on the modified ImageNet dataset. The initial learning rate is set to 5e-5 and decayed to 2e-5 at epoch 200, with a batch size of 256. Since the training set consists of different tasks, we cannot put all of them into a single batch due to the expensive memory cost. Therefore, we stack a batch of images from one randomly selected task in each iteration. After pre-training on the entire synthesized dataset, we fine-tune the IPT model on the desired task (e.g., ×3 single image super-resolution) for 30 epochs with a learning rate of 2e-5. Note that SRCNN [20] also found that ImageNet training can improve the performance of the super-resolution task, whereas we propose a model fitting general low-level vision tasks.
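For reference, the reported optimization settings could be set up as in the following sketch; build_optimizer is a hypothetical helper, and the single-step decay at epoch 200 is our reading of the reported schedule.

    import torch

    def build_optimizer(ipt_model):
        """Pre-training optimizer and schedule as reported above (a sketch)."""
        optimizer = torch.optim.Adam(ipt_model.parameters(),
                                     lr=5e-5, betas=(0.9, 0.999))
        # Learning rate multiplied by 0.4 (5e-5 -> 2e-5) once epoch 200 is reached.
        scheduler = torch.optim.lr_scheduler.MultiStepLR(
            optimizer, milestones=[200], gamma=2e-5 / 5e-5)
        return optimizer, scheduler

Fine-tuning then reuses the same optimizer type with a constant learning rate of 2e-5 for 30 epochs on the target task.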
4.1. Super-resolution

We compare our model with several state-of-the-art CNN-based SR methods. As shown in Table 1, our pre-trained IPT outperforms all the other methods and achieves the best performance at the ×2, ×3 and ×4 scales on all datasets. It is worth highlighting that our model achieves 33.76dB PSNR on the ×2 scale Urban100 dataset, surpassing the other methods by more than ~0.4dB, while previous SOTA methods only achieve a <0.2dB improvement over one another, which indicates the superiority of the proposed model by utilizing large-scale pre-training.

We further present visualization results of our model at the 4× scale on the Urban100 dataset. As shown in Figure 3, it is difficult to recover the original high-resolution images since much information is lost due to the high scaling factor. Previous methods generate blurry images, while the super-resolution images produced by our model recover the details from the low-resolution images well.

4.2. Denoising

Since our pre-trained model can be well adapted to many tasks, we then evaluate the performance of our model on the image denoising task. The training and testing data are generated by adding Gaussian noise with σ = 30, 50 to the clean images.
Figure 3. Visual results with bicubic downsampling (×4) from Urban100. The proposed method recovers more details. Compared images are derived from [87].

Figure 4. Color image denoising results with noise level σ = 50. Compared images are derived from [78].

To verify the effectiveness of the proposed method, we compare our results with various state-of-the-art models. Table 2 reports the color image denoising results on the BSD68 and Urban100 datasets. As a result, our IPT achieves the best results among all denoising methods at different Gaussian noise levels. Moreover, we surprisingly find that our model improves the state-of-the-art performance by ~2dB on the Urban100 dataset, which demonstrates the effectiveness of pre-training and the superiority of our transformer-based model.

Figure 4 shows visualizations of the resulting images. As shown in the figure, noisy images are hard to recognize and it is difficult to recover the clean images. Existing methods therefore fail to reconstruct enough details and generate abnormal pixels. By contrast, our pre-trained model can well recover several details in the hair of this cat, and its visual quality clearly beats all the previous models.

4.3. Deraining

For the image deraining task, we evaluate our model on the synthesized Rain100L dataset [70], which consists of 100 rainy images. Quantitative results can be viewed in Table 3. Compared with the state-of-the-art methods, we achieve the best performance (41.62dB) with a 1.62dB improvement.

Figure 5 shows the visualization results. Previous methods fail to reconstruct the original clean images since they lack image priors. In contrast, our IPT model can present exactly the same image as the ground-truth and surpasses all the previous algorithms in visual quality. This result substantiates the generality of the proposed model.
Figure 5. Image deraining results on the Rain100L dataset. Compared images are derived from [63].

Table 1. Quantitative results on image super-resolution. Best and second best results are highlighted and underlined.

Method          Scale   Set5    Set14   B100    Urban100
VDSR [39]       ×2      37.53   33.05   31.90   30.77
EDSR [48]       ×2      38.11   33.92   32.32   32.93
RCAN [80]       ×2      38.27   34.12   32.41   33.34
RDN [82]        ×2      38.24   34.01   32.34   32.89
OISR-RK3 [34]   ×2      38.21   33.94   32.36   33.03
RNAN [81]       ×2      38.17   33.87   32.32   32.73
SAN [17]        ×2      38.31   34.07   32.42   33.10
HAN [51]        ×2      38.27   34.16   32.41   33.35
IGNN [87]       ×2      38.24   34.07   32.41   33.23
IPT (ours)      ×2      38.37   34.43   32.48   33.76
VDSR [39]       ×3      33.67   29.78   28.83   27.14
EDSR [48]       ×3      34.65   30.52   29.25   28.80
RCAN [80]       ×3      34.74   30.65   29.32   29.09
RDN [82]        ×3      34.71   30.57   29.26   28.80
OISR-RK3 [34]   ×3      34.72   30.57   29.29   28.95
RNAN [81]       ×3      34.66   30.52   29.26   28.75
SAN [17]        ×3      34.75   30.59   29.33   28.93
HAN [51]        ×3      34.75   30.67   29.32   29.10
IGNN [87]       ×3      34.72   30.66   29.31   29.03
IPT (ours)      ×3      34.81   30.85   29.38   29.49
VDSR [39]       ×4      31.35   28.02   27.29   25.18
EDSR [48]       ×4      32.46   28.80   27.71   26.64
RCAN [80]       ×4      32.63   28.87   27.77   26.82
SAN [17]        ×4      32.64   28.92   27.78   26.79
RDN [82]        ×4      32.47   28.81   27.72   26.61
OISR-RK3 [34]   ×4      32.53   28.86   27.75   26.79
RNAN [81]       ×4      32.49   28.83   27.72   26.61
HAN [51]        ×4      32.64   28.90   27.80   26.85
IGNN [87]       ×4      32.57   28.85   27.77   26.84
IPT (ours)      ×4      32.64   29.01   27.82   27.26

Table 2. Quantitative results on color image denoising. Best and second best results are highlighted and underlined.

Method          BSD68 (σ=30)   BSD68 (σ=50)   Urban100 (σ=30)   Urban100 (σ=50)
CBM3D [16]      29.73          27.38          30.36             27.94
TNRD [14]       27.64          25.96          27.40             25.52
DnCNN [75]      30.40          28.01          30.28             28.16
MemNet [57]     28.39          26.33          28.93             26.53
IRCNN [76]      30.22          27.86          30.28             27.69
FFDNet [77]     30.31          27.96          30.53             28.05
SADNet [9]      30.64          28.32          N/A               N/A
RDN [83]        30.67          28.31          31.69             29.29
IPT (ours)      32.32          29.88          33.75             31.12

4.4. Generalization Ability

Although we can generate various corrupted images, natural images are of high complexity and we cannot synthesize all possible images for pre-training the transformer model. However, a good pre-trained model should have the capacity to adapt well to other tasks, as in the field of NLP. To this end, we conduct several experiments to verify the generalization ability of our model. In practice, we test corrupted images that are not included in our synthesized ImageNet dataset, i.e., image denoising with noise levels 10 and 70, respectively. We use the heads and tails of the image denoising tasks in the pre-trained model.
The detailed results are shown in Table 4, where we compare the performance of the pre-trained IPT model with state-of-the-art methods for image denoising. Obviously, the IPT model outperforms the other conventional methods, which demonstrates the effectiveness of our IPT model for pre-training.

Table 3. Quantitative results of image deraining on the Rain100L dataset. Best and second best results are highlighted and underlined.

Method          PSNR    SSIM
Input           26.90   0.8384
DSC [28]        27.34   0.8494
GMM [46]        29.05   0.8717
JCAS [30]       28.54   0.8524
Clear [27]      30.24   0.9344
DDN [28]        32.38   0.9258
RESCAN [45]     38.52   0.9812
PReNet [55]     37.45   0.9790
JORDER_E [70]   38.59   0.9834
SPANet [65]     35.33   0.9694
SSIR [67]       32.37   0.9258
RCDNet [63]     40.00   0.9860
IPT (ours)      41.62   0.9880

Table 4. Generalization ability of our IPT model on color image denoising with different noise levels. Best and second best results are highlighted and underlined.

Method          BSD68 (σ=10)   BSD68 (σ=70)   Urban100 (σ=10)   Urban100 (σ=70)
CBM3D [16]      35.91          26.00          36.00             26.31
TNRD [14]       33.36          23.83          33.60             22.63
DnCNN [75]      36.31          26.56          36.21             26.17
MemNet [57]     N/A            25.08          N/A               24.96
IRCNN [76]      36.06          N/A            35.81             N/A
FFDNet [77]     36.14          26.53          35.77             26.39
RDN [83]        36.47          26.85          36.69             27.63
IPT (ours)      38.30          28.21          39.07             28.80

Table 5. Impact of λ for contrastive learning.

λ       0       0.05    0.1     0.2     0.5
PSNR    38.27   38.32   38.37   38.33   38.26

Impact of contrastive learning. As discussed above, to improve the representation ability of our pre-trained model, we embed the contrastive learning loss (Eq. 6) into the training procedure. We then evaluate its effectiveness on the ×2 scale super-resolution task using the Set5 dataset. Table 5 shows the impact of the hyper-parameter λ for balancing the two terms in Eq. 6. When λ = 0, the IPT model is trained using only the supervised learning approach, and the resulting PSNR value is 38.27dB. When employing the contrastive loss for self-supervised learning, the model achieves a 38.37dB PSNR value (λ = 0.1), which is about 0.1dB higher than that of the purely supervised baseline (λ = 0).
References global reasoning networks. In Proceedings of the IEEE Con-
ference on Computer Vision and Pattern Recognition, pages
[1] Eirikur Agustsson and Radu Timofte. Ntire 2017 challenge 433–442, 2019. 3
on single image super-resolution: Dataset and study. In Pro-
[16] Kostadin Dabov, Alessandro Foi, Vladimir Katkovnik, and
ceedings of the IEEE Conference on Computer Vision and
Karen Egiazarian. Color image denoising via sparse 3d col-
Pattern Recognition Workshops, pages 126–135, 2017. 4
laborative filtering with grouping constraint in luminance-
[2] Namhyuk Ahn, Byungkon Kang, and Kyung-Ah Sohn. Fast,
chrominance space. In 2007 IEEE International Conference
accurate, and lightweight super-resolution with cascading
on Image Processing, volume 1, pages I–313. IEEE, 2007.
residual network. In Proceedings of the European Confer-
6, 7, 8
ence on Computer Vision (ECCV), pages 252–268, 2018. 2
[17] Tao Dai, Jianrui Cai, Yongbing Zhang, Shu-Tao Xia, and
[3] Saeed Anwar and Nick Barnes. Densely residual laplacian
Lei Zhang. Second-order attention network for single im-
super-resolution. IEEE Transactions on Pattern Analysis and
age super-resolution. In Proceedings of the IEEE conference
Machine Intelligence, 2020. 2
on computer vision and pattern recognition, pages 11065–
[4] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hin-
11074, 2019. 6, 7
ton. Layer normalization. arXiv preprint arXiv:1607.06450,
[18] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li,
2016. 4
and Li Fei-Fei. Imagenet: A large-scale hierarchical image
[5] Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Sub-
database. In 2009 IEEE conference on computer vision and
biah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakan-
pattern recognition, pages 248–255. Ieee, 2009. 1
tan, Pranav Shyam, Girish Sastry, Amanda Askell, et al.
Language models are few-shot learners. arXiv preprint [19] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina
arXiv:2005.14165, 2020. 2 Toutanova. Bert: Pre-training of deep bidirectional
transformers for language understanding. arXiv preprint
[6] Bolun Cai, Xiangmin Xu, Kui Jia, Chunmei Qing, and
arXiv:1810.04805, 2018. 2
Dacheng Tao. Dehazenet: An end-to-end system for single
image haze removal. IEEE Transactions on Image Process- [20] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou
ing, 25(11):5187–5198, 2016. 2 Tang. Learning a deep convolutional network for image
[7] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas super-resolution. In European conference on computer vi-
Usunier, Alexander Kirillov, and Sergey Zagoruyko. End- sion, pages 184–199. Springer, 2014. 2, 5
to-end object detection with transformers. arXiv preprint [21] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou
arXiv:2005.12872, 2020. 2, 3 Tang. Image super-resolution using deep convolutional net-
[8] Gabriella Castellano, Leonardo Bonilha, LM Li, and Fer- works. IEEE transactions on pattern analysis and machine
nando Cendes. Texture analysis of medical images. Clinical intelligence, 38(2):295–307, 2015. 2
radiology, 59(12):1061–1069, 2004. 1 [22] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov,
[9] Meng Chang, Qi Li, Huajun Feng, and Zhihai Xu. Spatial- Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner,
adaptive network for single image denoising. arXiv preprint Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl-
arXiv:2001.10291, 2020. 7 vain Gelly, et al. An image is worth 16x16 words: Trans-
[10] Liang Chen, Faming Fang, Shen Lei, Fang Li, and Guixu formers for image recognition at scale. arXiv preprint
Zhang. Enhanced sparse model for blind deblurring. 2020. arXiv:2010.11929, 2020. 2, 3
2 [23] Thomas Eboli, Jian Sun, and Jean Ponce. End-to-end in-
[11] Mark Chen, Alec Radford, Rewon Child, Jeff Wu, Hee- terpretable learning of non-blind image deblurring. arXiv
woo Jun, Prafulla Dhariwal, David Luan, and Ilya Sutskever. preprint arXiv:2007.01769, 2020. 2
Generative pretraining from pixels. In Proceedings of the [24] Yuchen Fan, Honghui Shi, Jiahui Yu, Ding Liu, Wei
37th International Conference on Machine Learning, vol- Han, Haichao Yu, Zhangyang Wang, Xinchao Wang, and
ume 1, 2020. 3 Thomas S Huang. Balanced two-stage residual networks for
[12] Tianlong Chen, Jonathan Frankle, Shiyu Chang, Sijia Liu, image super-resolution. In Proceedings of the IEEE Con-
Yang Zhang, Michael Carbin, and Zhangyang Wang. The ference on Computer Vision and Pattern Recognition Work-
lottery tickets hypothesis for supervised and self-supervised shops, pages 161–168, 2017. 2
pre-training in computer vision models. arXiv preprint [25] Jun Fu, Jing Liu, Haijie Tian, Yong Li, Yongjun Bao, Zhi-
arXiv:2012.06908, 2020. 1 wei Fang, and Hanqing Lu. Dual attention network for
[13] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Ge- scene segmentation. In Proceedings of the IEEE Conference
offrey Hinton. A simple framework for contrastive learning on Computer Vision and Pattern Recognition, pages 3146–
of visual representations. arXiv preprint arXiv:2002.05709, 3154, 2019. 2
2020. 5 [26] Jun Fu, Jing Liu, Haijie Tian, Yong Li, Yongjun Bao, Zhi-
[14] Yunjin Chen and Thomas Pock. Trainable nonlinear reaction wei Fang, and Hanqing Lu. Dual attention network for
diffusion: A flexible framework for fast and effective image scene segmentation. In Proceedings of the IEEE Conference
restoration. IEEE transactions on pattern analysis and ma- on Computer Vision and Pattern Recognition, pages 3146–
chine intelligence, 39(6):1256–1272, 2016. 6, 7, 8 3154, 2019. 3
[15] Yunpeng Chen, Marcus Rohrbach, Zhicheng Yan, Yan [27] Xueyang Fu, Jiabin Huang, Xinghao Ding, Yinghao Liao,
Shuicheng, Jiashi Feng, and Yannis Kalantidis. Graph-based and John Paisley. Clearing the skies: A deep network archi-
tecture for single-image rain removal. IEEE Transactions on Big transfer (bit): General visual representation learning.
Image Processing, 26(6):2944–2956, 2017. 8 arXiv preprint arXiv:1912.11370, 6, 2019. 3
[28] Xueyang Fu, Jiabin Huang, Delu Zeng, Yue Huang, Xinghao [41] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton.
Ding, and John Paisley. Removing rain from single images Imagenet classification with deep convolutional neural net-
via a deep detail network. In Proceedings of the IEEE Con- works. Communications of the ACM, 60(6):84–90, 2017. 2
ference on Computer Vision and Pattern Recognition, pages [42] Stamatios Lefkimmiatis. Non-local color image denoising
3855–3863, 2017. 8 with convolutional neural networks. In Proceedings of the
[29] Xueyang Fu, Borong Liang, Yue Huang, Xinghao Ding, and IEEE Conference on Computer Vision and Pattern Recogni-
John Paisley. Lightweight pyramid networks for image de- tion, pages 3587–3596, 2017. 2
raining. IEEE transactions on neural networks and learning [43] Boyi Li, Xiulian Peng, Zhangyang Wang, Jizheng Xu, and
systems, 2019. 2 Dan Feng. An all-in-one network for dehazing and beyond.
[30] Shuhang Gu, Deyu Meng, Wangmeng Zuo, and Lei Zhang. arXiv preprint arXiv:1707.06543, 2017. 2
Joint convolutional analysis and synthesis sparse representa- [44] Siyuan Li, Wenqi Ren, Feng Wang, Iago Breno Araujo,
tion for single image layer separation. In Proceedings of the Eric K Tokuda, Roberto Hirata Junior, Roberto M Cesar-
IEEE International Conference on Computer Vision, pages Jr, Zhangyang Wang, and Xiaochun Cao. A comprehen-
1708–1716, 2017. 4, 8 sive benchmark analysis of single image deraining: Current
[31] Shi Guo, Zifei Yan, Kai Zhang, Wangmeng Zuo, and Lei challenges and future perspectives. International Journal of
Zhang. Toward convolutional blind denoising of real pho- Computer Vision, pages 1–22, 2021. 2
tographs. In Proceedings of the IEEE Conference on Com- [45] Xia Li, Jianlong Wu, Zhouchen Lin, Hong Liu, and Hongbin
puter Vision and Pattern Recognition, pages 1712–1722, Zha. Recurrent squeeze-and-excitation context aggregation
2019. 2 net for single image deraining. In Proceedings of the Euro-
[32] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross pean Conference on Computer Vision (ECCV), pages 254–
Girshick. Momentum contrast for unsupervised visual rep- 269, 2018. 8
resentation learning. In Proceedings of the IEEE/CVF Con- [46] Yu Li, Robby T Tan, Xiaojie Guo, Jiangbo Lu, and Michael S
ference on Computer Vision and Pattern Recognition, pages Brown. Rain streak removal using layer priors. In Proceed-
9729–9738, 2020. 5 ings of the IEEE conference on computer vision and pattern
[33] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. recognition, pages 2736–2744, 2016. 8
Deep residual learning for image recognition. In Proceed- [47] Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, and
ings of the IEEE conference on computer vision and pattern Kyoung Mu Lee. Enhanced deep residual networks for single
recognition, pages 770–778, 2016. 2 image super-resolution. In Proceedings of the IEEE confer-
[34] Xiangyu He, Zitao Mo, Peisong Wang, Yang Liu, Mingyuan ence on computer vision and pattern recognition workshops,
Yang, and Jian Cheng. Ode-inspired network design for sin- pages 136–144, 2017. 2
gle image super-resolution. In Proceedings of the IEEE Con- [48] Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, and
ference on Computer Vision and Pattern Recognition, pages Kyoung Mu Lee. Enhanced deep residual networks for single
1732–1741, 2019. 6, 7 image super-resolution. In Proceedings of the IEEE confer-
[35] Xiaowei Hu, Chi-Wing Fu, Lei Zhu, and Pheng-Ann Heng. ence on computer vision and pattern recognition workshops,
Depth-attentional features for single-image rain removal. In pages 136–144, 2017. 6, 7
Proceedings of the IEEE Conference on Computer Vision [49] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar
and Pattern Recognition, pages 8022–8031, 2019. 2 Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettle-
[36] Xixi Jia, Sanyang Liu, Xiangchu Feng, and Lei Zhang. Foc- moyer, and Veselin Stoyanov. Roberta: A robustly optimized
net: A fractional optimal control network for image denois- bert pretraining approach. arXiv preprint arXiv:1907.11692,
ing. In Proceedings of the IEEE Conference on Computer 2019. 2
Vision and Pattern Recognition, pages 6054–6063, 2019. 2 [50] Boyu Lu, Jun-Cheng Chen, and Rama Chellappa. Unsuper-
[37] Peng-Tao Jiang, Qibin Hou, Yang Cao, Ming-Ming Cheng, vised domain-specific deblurring via disentangled represen-
Yunchao Wei, and Hong-Kai Xiong. Integral object min- tations. In Proceedings of the IEEE Conference on Computer
ing via online attention accumulation. In Proceedings of the Vision and Pattern Recognition, pages 10225–10234, 2019.
IEEE/CVF International Conference on Computer Vision, 2
pages 2070–2079, 2019. 3 [51] Ben Niu, Weilei Wen, Wenqi Ren, Xiangde Zhang, Lianping
[38] Yifan Jiang, Shiyu Chang, and Zhangyang Wang. Transgan: Yang, Shuzhen Wang, Kaihao Zhang, Xiaochun Cao, and
Two transformers can make one strong gan. arXiv preprint Haifeng Shen. Single image super-resolution via a holistic
arXiv:2102.07074, 2021. 3 attention network. In European Conference on Computer
[39] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Accurate Vision, pages 191–207. Springer, 2020. 7
image super-resolution using very deep convolutional net- [52] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya
works. In Proceedings of the IEEE conference on computer Sutskever. Improving language understanding by generative
vision and pattern recognition, pages 1646–1654, 2016. 2, pre-training, 2018. 2
6, 7 [53] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario
[40] Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Amodei, and Ilya Sutskever. Language models are unsuper-
Puigcerver, Jessica Yung, Sylvain Gelly, and Neil Houlsby. vised multitask learners. OpenAI blog, 1(8):9, 2019. 2
[54] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, moval. In Proceedings of the IEEE Conference on Computer
Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Vision and Pattern Recognition, pages 3877–3886, 2019. 8
Peter J Liu. Exploring the limits of transfer learning with a [68] Bichen Wu, Chenfeng Xu, Xiaoliang Dai, Alvin Wan,
unified text-to-text transformer. Journal of Machine Learn- Peizhao Zhang, Masayoshi Tomizuka, Kurt Keutzer, and Pe-
ing Research, 21(140):1–67, 2020. 2 ter Vajda. Visual transformers: Token-based image repre-
[55] Dongwei Ren, Wangmeng Zuo, Qinghua Hu, Pengfei Zhu, sentation and processing for computer vision. arXiv preprint
and Deyu Meng. Progressive image deraining networks: A arXiv:2006.03677, 2020. 3
better and simpler baseline. In Proceedings of the IEEE con- [69] Wenhan Yang, Jiaying Liu, Shuai Yang, and Zongming Guo.
ference on computer vision and pattern recognition, pages Scale-free single image deraining via visibility-enhanced re-
3937–3946, 2019. 2, 8 current wavelet learning. IEEE Transactions on Image Pro-
[56] Karen Simonyan and Andrew Zisserman. Very deep convo- cessing, 28(6):2948–2961, 2019. 2
lutional networks for large-scale image recognition. arXiv [70] Wenhan Yang, Robby T Tan, Jiashi Feng, Zongming Guo,
preprint arXiv:1409.1556, 2014. 2 Shuicheng Yan, and Jiaying Liu. Joint rain detection and
[57] Ying Tai, Jian Yang, Xiaoming Liu, and Chunyan Xu. Mem- removal from a single image with contextualized deep net-
net: A persistent memory network for image restoration. In works. IEEE transactions on pattern analysis and machine
Proceedings of the IEEE international conference on com- intelligence, 42(6):1377–1393, 2019. 5, 6, 8
puter vision, pages 4539–4547, 2017. 6, 7, 8
[71] Xitong Yang, Zheng Xu, and Jiebo Luo. Towards percep-
[58] Hao Tan and Mohit Bansal. Lxmert: Learning cross- tual image dehazing by physics-based disentanglement and
modality encoder representations from transformers. arXiv adversarial training. In AAAI, pages 7485–7492, 2018. 2
preprint arXiv:1908.07490, 2019. 2
[72] Yuhui Yuan and Jingdong Wang. Ocnet: Object context net-
[59] Xin Tao, Hongyun Gao, Xiaoyong Shen, Jue Wang, and Ji- work for scene parsing. arXiv preprint arXiv:1809.00916,
aya Jia. Scale-recurrent network for deep image deblurring. 2018. 3
In Proceedings of the IEEE Conference on Computer Vision
[73] Yongnian Zeng, Wei Huang, Maoguo Liu, Honghui Zhang,
and Pattern Recognition, pages 8174–8182, 2018. 2
and Bin Zou. Fusion of satellite images in urban area: As-
[60] Chunwei Tian, Yong Xu, Zuoyong Li, Wangmeng Zuo,
sessing the quality of resulting images. In 2010 18th Inter-
Lunke Fei, and Hong Liu. Attention-guided cnn for image
national Conference on Geoinformatics, pages 1–4. IEEE,
denoising. Neural Networks, 124:117–129, 2020. 2
2010. 1
[61] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko-
[74] He Zhang and Vishal M Patel. Densely connected pyramid
reit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia
dehazing network. In Proceedings of the IEEE conference on
Polosukhin. Attention is all you need. In Advances in neural
computer vision and pattern recognition, pages 3194–3203,
information processing systems, pages 5998–6008, 2017. 2,
2018. 2
4
[62] Fei Wang, Mengqing Jiang, Chen Qian, Shuo Yang, Cheng [75] Kai Zhang, Wangmeng Zuo, Yunjin Chen, Deyu Meng, and
Li, Honggang Zhang, Xiaogang Wang, and Xiaoou Tang. Lei Zhang. Beyond a gaussian denoiser: Residual learning of
Residual attention network for image classification. In Pro- deep cnn for image denoising. IEEE Transactions on Image
ceedings of the IEEE conference on computer vision and pat- Processing, 26(7):3142–3155, 2017. 6, 7, 8
tern recognition, pages 3156–3164, 2017. 2 [76] Kai Zhang, Wangmeng Zuo, Shuhang Gu, and Lei Zhang.
[63] Hong Wang, Qi Xie, Qian Zhao, and Deyu Meng. A model- Learning deep cnn denoiser prior for image restoration. In
driven deep neural network for single image rain removal. Proceedings of the IEEE conference on computer vision and
In Proceedings of the IEEE/CVF Conference on Computer pattern recognition, pages 3929–3938, 2017. 6, 7, 8
Vision and Pattern Recognition, pages 3103–3112, 2020. 7, [77] Kai Zhang, Wangmeng Zuo, and Lei Zhang. Ffdnet: Toward
8 a fast and flexible solution for cnn-based image denoising.
[64] Qiang Wang, Bei Li, Tong Xiao, Jingbo Zhu, Changliang IEEE Transactions on Image Processing, 27(9):4608–4622,
Li, Derek F Wong, and Lidia S Chao. Learning deep 2018. 6, 7, 8
transformer models for machine translation. arXiv preprint [78] Kai Zhang, Wangmeng Zuo, and Lei Zhang. Ffdnet: Toward
arXiv:1906.01787, 2019. 2 a fast and flexible solution for cnn-based image denoising.
[65] Tianyu Wang, Xin Yang, Ke Xu, Shaozhe Chen, Qiang IEEE Transactions on Image Processing, 27(9):4608–4622,
Zhang, and Rynson WH Lau. Spatial attentive single-image 2018. 6
deraining with a high quality real rain dataset. In Proceed- [79] Songyang Zhang, Xuming He, and Shipeng Yan. Latent-
ings of the IEEE Conference on Computer Vision and Pattern gnn: Learning efficient non-local relations for visual recog-
Recognition, pages 12270–12279, 2019. 2, 8 nition. In International Conference on Machine Learning,
[66] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaim- pages 7374–7383, 2019. 3
ing He. Non-local neural networks. In Proceedings of the [80] Yulun Zhang, Kunpeng Li, Kai Li, Lichen Wang, Bineng
IEEE conference on computer vision and pattern recogni- Zhong, and Yun Fu. Image super-resolution using very deep
tion, pages 7794–7803, 2018. 3 residual channel attention networks. In Proceedings of the
[67] Wei Wei, Deyu Meng, Qian Zhao, Zongben Xu, and Ying European Conference on Computer Vision (ECCV), pages
Wu. Semi-supervised transfer learning for image rain re- 286–301, 2018. 2, 7
[81] Yulun Zhang, Kunpeng Li, Kai Li, Bineng Zhong, and Yun
Fu. Residual non-local attention networks for image restora-
tion. arXiv preprint arXiv:1903.10082, 2019. 6, 7
[82] Yulun Zhang, Yapeng Tian, Yu Kong, Bineng Zhong, and
Yun Fu. Residual dense network for image super-resolution.
In Proceedings of the IEEE conference on computer vision
and pattern recognition, pages 2472–2481, 2018. 6, 7
[83] Yulun Zhang, Yapeng Tian, Yu Kong, Bineng Zhong, and
Yun Fu. Residual dense network for image restoration. IEEE
Transactions on Pattern Analysis and Machine Intelligence,
2020. 7, 8
[84] Hengshuang Zhao, Jiaya Jia, and Vladlen Koltun. Explor-
ing self-attention for image recognition. In Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pages 10076–10085, 2020. 3
[85] Jia-Xing Zhao, Yang Cao, Deng-Ping Fan, Ming-Ming
Cheng, Xuan-Yi Li, and Le Zhang. Contrast prior and fluid
pyramid integration for rgbd salient object detection. In Pro-
ceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition, pages 3927–3936, 2019. 1
[86] Jia-Xing Zhao, Jiang-Jiang Liu, Deng-Ping Fan, Yang Cao,
Jufeng Yang, and Ming-Ming Cheng. Egnet: Edge guid-
ance network for salient object detection. In Proceedings
of the IEEE/CVF International Conference on Computer Vi-
sion, pages 8779–8788, 2019. 1
[87] Shangchen Zhou, Jiawei Zhang, Wangmeng Zuo, and
Chen Change Loy. Cross-scale internal graph neural network
for image super-resolution. Advances in Neural Information
Processing Systems, 33, 2020. 6, 7
[88] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang
Wang, and Jifeng Dai. Deformable detr: Deformable trans-
formers for end-to-end object detection. arXiv preprint
arXiv:2010.04159, 2020. 3