0% found this document useful (0 votes)
61 views23 pages

Image-To-Image Translation Methods and Applications

This paper provides a comprehensive overview of image-to-image translation (I2I), which focuses on transferring images between source and target domains while maintaining content representation. It discusses key techniques, categorizes I2I tasks into two-domain and multi-domain, and highlights various applications in computer vision and image processing. The paper also addresses the impact of I2I on research and industry, as well as the challenges that remain in the field.

Uploaded by

melkamu.swe21
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
61 views23 pages

Image-To-Image Translation Methods and Applications

This paper provides a comprehensive overview of image-to-image translation (I2I), which focuses on transferring images between source and target domains while maintaining content representation. It discusses key techniques, categorizes I2I tasks into two-domain and multi-domain, and highlights various applications in computer vision and image processing. The paper also addresses the impact of I2I on research and industry, as well as the challenges that remain in the field.

Uploaded by

melkamu.swe21
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

IEEE TRANSACTIONS ON MULTIMEDIA, VOL.

24, 2022 3859

Image-to-Image Translation: Methods


and Applications
Yingxue Pang , Graduate Student Member, IEEE, Jianxin Lin, Tao Qin , Senior Member, IEEE,
and Zhibo Chen , Senior Member, IEEE

Abstract—Image-to-image translation (I2I) aims to transfer


images from a source domain to a target domain while preserving
the content representations. I2I has drawn increasing attention and
made tremendous progress in recent years because of its wide range
of applications in many computer vision and image processing
problems, such as image synthesis, segmentation, style transfer,
restoration, and pose estimation. In this paper, we provide an
overview of the I2I works developed in recent years. We will analyze
the key techniques of the existing I2I works and clarify the main Fig. 1. An example of image-to-image translation (I2I) for illustration. (Left):
progress the community has made. Additionally, we will elaborate How to make your selfie more artistic as drawings from cartoonists? This type
on the effect of I2I on the research and industry community and of research work can be broadly deemed the I2I problem. (Right): You can take
point out remaining challenges in related fields. a selfie as a source image and a cartoon as a target reference to “translate” it into
desired artistic style image.
Index Terms—Image-to-image translation, two-domain I2I,
multi-domain I2I, supervised methods, unsupersived methods,
semi-supervised methods, few-shot methods.
image xB ∈ B given the input source image xA ∈ A. Mathe-
matically, we can model this translation process as
I. INTRODUCTION
xAB ∈ B : xAB = GA→B (xA ). (1)
MAGINE if you take a selfie and want to make it more artistic
I as a drawing from a cartoonist, how can you automatically
achieve that with a computer? This type of research work can
From the above basic definition of I2I, we see that convert-
ing an image from one source domain to another target do-
be broadly deemed the image-to-image translation (I2I) ([1], main can cover many problems in image processing, computer
[2]) problem. In general, the goal of I2I is to convert an input graphics, computer vision and so on. Specifically, I2I has been
image xA from a source domain A to a target domain B with broadly applied in semantic image synthesis [3]–[7], image seg-
the intrinsic source content preserved and the extrinsic target mentation [8]–[10], style transfer [2], [11]–[14], image inpaint-
style transferred. For example, one can take selfie images as the ing [15]–[19], 3D pose estimation [20], [21], image/video col-
source domain and “translate” them to desired artistic style im- orization [22]–[27], image super-resolution [28], [29], domain
ages given some cartoons as target domain references, as shown adaptation [30]–[32], cartoon generation [33]–[38] and image
in Fig. 1. To this end, we need to train a mapping GA→B that registration [39]. We will analyze and discuss these related ap-
generates image xAB ∈ B indistinguishable from target domain plications in detail in Section VI.
In this paper, we aim to provide a comprehensive review of the
recent progress in image-to-image translation research works. To
Manuscript received 28 January 2021; revised 5 June 2021 and 14 July 2021;
accepted 24 August 2021. Date of publication 3 September 2021; date of current the best of our knowledge, this is the first overview paper to cover
version 9 August 2022. This work was supported in part by NSFC under Grants the analysis, methodology, and related applications of I2I. In
U1908209, 61632001, and 62021001 and in part by the National Key Research detail, we present our survey with the following organization:
and Development Program of China under Grant 2018AAA0101400. The As- r First, we briefly introduce the two most representative
sociate Editor coordinating the review of this manuscript and approving it for
publication was Dr. Wen-Huang Cheng. (Corresponding author: Zhibo Chen.) and commonly adopted generative models, as well as
Yingxue Pang and Zhibo Chen are with the Department of Electronic some well-known evaluation metrics, applied for image-
Engineer and Information Science, University of Science and Technology
of China, Hefei, Anhui 230026, China (e-mail: pangyx@[Link]; to-image translation, and then we analyze how these gen-
chenzhibo@[Link]). erative models learn to represent and acquire the desired
Jianxin Lin is with the School of Computer Science and Elec- translation results.
tronic Engineering, Hunan University, Changsha 410000, China (e-mail: r Second, we categorize the I2I problem into two main sets
linjx@[Link]).
Tao Qin is with the Microsoft Research Asia, Beijing 100080, China (e-mail: of tasks, i.e., two-domain I2I tasks and multi-domain I2I
taoqin@[Link]). tasks, in which numerous I2I works have appeared for each
Color versions of one or more figures in this article are available at
[Link] set of I2I tasks and brought far-reaching influence on other
Digital Object Identifier 10.1109/TMM.2021.3109419 research fields, as shown in Fig. 2.

1520-9210 © 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See [Link] for more information.

Authorized licensed use limited to: National Taipei Univ. of Technology. Downloaded on January 09,2025 at [Link] UTC from IEEE Xplore. Restrictions apply.
3860 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 24, 2022

TABLE I
LIST OF TWO-DOMAIN I2I METHODS INCLUDING MODEL NAME, PUBLICATION YEAR, THE TYPE OF TRAINING DATA, WHETHER MULTIMODAL OR
NOT AND CORRESPONDING INSIGHTS

r Last but not least, we provide a thorough taxonomy of the parameters (i.e., a Gaussian distribution) or non-parametric
I2I applications following the same categorizations of I2I variants (each instance has its own contribution to the distri-
methods, as illustrated in Table V. bution), and it approximates that underlying distribution with
In general, our paper is organized as follows. Section I pro- particular algorithms. This approach enables the generative
vides the problem setting of the image-to-image translation task. model to generate data rather than only discriminate between
Section II introduces the generative models used for I2I meth- data (classification). For instance, the deep generative mod-
ods. Section III discusses the works on the two-domain I2I task. els have shown substantial performance improvements in mak-
Section IV focuses on works related to the multi-domain I2I task. ing predictions [43], estimating missing data [44], compressing
Then, Section VI reviews the various and fruitful applications datasets [45] and generating invisible data. In an I2I task, a gen-
of I2I tasks. Summary and outlook are given in Section VII. erative model can model the distribution of the target domain
by producing convincing “fake” data, namely, the translated im-
ages, that appear to be drawn from the distribution of the target
II. THE BACKBONE OF I2I domain.
Because an I2I task aims to learn the mappings between However, considering the length of this article and the dif-
different image domains, how to represent these mappings to ference in research foci, we inevitably omit those generative
generate the desirable results is explicitly related to the gen- models that are vaguely connected with the theme of I2I,
erative models. The generative model [40]–[42] assumes that such as deep Boltzmann machines (DBMs) [46]–[48], deep
data is created by a particular distribution that is defined by two autoregressive models (DARs) [49]–[51] and normalizing flow
Authorized licensed use limited to: National Taipei Univ. of Technology. Downloaded on January 09,2025 at [Link] UTC from IEEE Xplore. Restrictions apply.
PANG et al.: IMAGE-TO-IMAGE TRANSLATION: METHODS AND APPLICATIONS 3861

TABLE II
LIST OF MULTI-DOMAIN I2I METHODS INCLUDING MODEL NAME, PUBLICATION YEAR, THE TYPE OF TRAINING DATA, WHETHER MULTIMODAL OR
NOT AND CORRESPONDING INSIGHTS

TABLE III
THE AVERAGE FID, IS, LPIPS SCORES OF DIFFERENT TWO-DOMAIN I2I METHODS TRAINED ON UT-ZAP50 K DATASET [188] IN TASK EDGE → SHOES.
THE BEST SCORES ARE IN BOLD

TABLE IV
THE AVERAGE FID, IS, LPIPS SCORES OF DIFFERENT MULTI-DOMAIN I2I METHODS TRAINED ON CELEBA DATASET [186] IN 5 DOMAINS INCLUDING BLACK
HAIR, BLOND HAIR, BROWN HAIR, GENDER (MALE OR FEMALE) AND AGE (YOUNG OR OLD). IN ADDITION TO THE FINAL AVERAGE METRIC SCORES, WE ALSO
REPORT TWO DOMAINS RESULTS (BLACK HAIR AND GENDER) FOR REFERENCE. THE BEST SCORES ARE IN BOLD

models (NFMs) [52], [53]. Therefore, we will briefly introduce A. Variational AutoEncoder
two of the most commonly used and efficient deep generative
Inspired by the Helmholtz machine [54], the variational au-
models in I2I tasks, variational autoencoders (VAEs) [52], [54]– toencoder (VAE) [55], [56] was initially proposed for a varia-
[62] and generative adversarial networks (GANs) [63]–[72], as tional inference problem in deep latent Gaussian models.
well as the intuition behind them. Both models basically aim
As shown in Fig 3, a VAE [55], [56] adopts a recognition
to construct a replica x = g(z) for generating the desired sam- model (encoder) qφ (z|x) to approximate the posterior distribu-
ples x from the latent variable z, but their specific approaches tion p(z|x) and a generative model (decoder) pθ (x|z) to map
are different. A VAE models data distribution by maximizing
the latent variable z to the data x. Specifically, a VAE trains its
the lower bound of the data log-likelihood, whereas a GAN generative model to learn a distribution p(x) to be near the given
tries to find the Nash equilibrium between a generator and data x by maximizing the log-likelihood function logpθ (x):
discriminator.
On the other hand, after obtaining the translated results from
the generative model, we need subjective and objective metrics 
N
for evaluating the quality of translated images. Therefore, we logpθ (x) = logpθ (xi ),
will also briefly present common evaluation metrics in the I2I i=1

problem.
logpθ (xi ) = log pθ (xi |z)p(z)dz. (2)

Authorized licensed use limited to: National Taipei Univ. of Technology. Downloaded on January 09,2025 at [Link] UTC from IEEE Xplore. Restrictions apply.
3862 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 24, 2022

TABLE V
APPLICATIONS OF I2I DISCUSSED IN SECTION VI

Stochastic gradient ascent (SGA) combined with the naive where DKL denotes the KL divergence that is non-negative, and
Monte Carlo gradient estimator (MCGE) can be used to find the θ and φ are neural network parameters. Naturally, we can obtain
optimal solution in Eqn. (2). However, it often fails because of a variational lower bound on the log-likelihood:
the highly skewed samples pθ (x|z) that exhibit a very high vari-
ance. A VAE therefore introduces the recognition model qφ (z|x) logpθ (xi ) ≥ L(xi , θ, φ). (5)
as a multivariate Gaussian distribution with a diagonal covari-
ance structure: Hence, a VAE differentiates and optimizes the lower bound
L(xi , θ, φ) instead of logpθ (xi ). Here is the final objective func-
qφ (z|x) = N (z|μz (x, φ), σz2 (x, φ)I). (3) tion of a VAE:

Eqn. (2) can be rewritten as: L(xi , θ, φ) = Ez∼qφ (z|xi ) [logpθ (xi |z)]

logpθ (xi ) = L(xi , θ, φ) + DKL [qφ (z|xi )||pθ (z|xi )]. (4) − DKL [qφ (z|xi )||pθ (z|xi )]. (6)

Authorized licensed use limited to: National Taipei Univ. of Technology. Downloaded on January 09,2025 at [Link] UTC from IEEE Xplore. Restrictions apply.
PANG et al.: IMAGE-TO-IMAGE TRANSLATION: METHODS AND APPLICATIONS 3863

Fig. 2. An overview of image-to-image translation methods. This figure shows the relationship between different methods and where they intersect with each
other.

Fig. 3. The structure of a VAE.

As stated in [73], VAEs provide more stable training than


generative adversarial networks (GANs) [63] and more efficient
sampling mechanisms than autoregressive models [57], [58]. Fig. 4. The structure of unconditional GANs, where z, G and D denote the
However, several practical and theoretical challenges of VAEs random noise, generator, and discriminator, respectively.
remain unsolved. The main drawback of variational methods
is their tendency to strike an unsatisfactory trade-off between
the sample quality and the reconstruction quality because of
player is represented by a differentiable function controlled
the weak approximate posterior distribution or overly simplis-
by a set of parameters. Generator G tries to generate fake but
tic posterior distribution. The studies in [52], [59], [60] enrich
plausible images, while discriminator D is trained to distinguish
the variational posterior to alleviate the blurriness of generated
the difference between real and fake images. The solution of
samples. Tomczak et al. [61] proposed a new prior, VampPrior,
this game is to find a Nash equilibrium between the two players.
to learn more powerful hidden representations. In addition, [62]
In the following subsection, we will discuss the unconditional
claimed that the inherent over-regularization induced by the KL
GANs, the conditional GANs and the way to train GANs.
divergence term in the VAE objective often leads to a gap be-
1) Unconditional GANs: The original GAN proposed
tween L(xi , θ, φ) and the true likelihood.
by [63] can be considered an unconditional GAN. It adopts
Generally, with the development of VAEs, this type of gen-
the multilayer perceptron (MLP) [81] to construct a structured
erative model constitutes one well-established approach for I2I
probabilistic model taking latent noise variables z and observed
tasks [16], [74]–[77]. Next, we will introduce another important
real data x as inputs. Because the convolutional neural network
generative model, generative adversarial networks, which have
(CNN) [82] has been demonstrated to be more effective than
been widely used in multiple I2I models [1], [2], [78]–[80].
the MLP in representing image features, the studies in [66] pro-
posed the deep convolutional generative adversarial networks
B. Generative Adversarial Networks (DGANs) to learn a better representation of images and improve
The main idea of generative adversarial networks the original GAN performance.
(GANs) [63]–[65] is to establish a zero-sum game between two As illustrated in Fig 4, the generator G inputs a random noise
players, namely, a generator and discriminator, in which each z sampled from the model’s prior distribution p(z) to generate

Authorized licensed use limited to: National Taipei Univ. of Technology. Downloaded on January 09,2025 at [Link] UTC from IEEE Xplore. Restrictions apply.
3864 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 24, 2022

is often trapped in mode collapse, and it is difficult to achieve


convergence.
To address the stability problem, many recent studies focus
on finding new cost functions with smoother non-vanishing or
non-exploding gradients everywhere. WGAN [64] proposes a
new cost function using the Wasserstein distance to address
the mode collapse problem appearing in naive GAN [63], and
WGAN-GP [68] uses a gradient penalty instead of the weight
clipping to enforce the Lipschitz constraint in WGAN. LS-
GAN [65] finds that optimizing the least squares cost function is
identical to optimizing a Pearson χ2 divergence. EBGAN [69]
replaces the discriminator with an autoencoder and uses the re-
Fig. 5. The structure of conditional GANs, where z, G and D denote the construction cost (MSE) to criticize the real and generated im-
random noise, generator, and discriminator, respectively. Conditional GANs
usually add additional information y (such as data labels, text or attributes of ages. BEGAN [70] builds with the same EBGAN autoencoder
images) to the generator and discriminator to generate desirable results. concept for the discriminator but with different cost functions.
RSGAN [71] measures the probability that the real data is more
realistic than the generated data, making the cost function rel-
a fake image G(z) to fit the distribution of real data as much ativistic. SNGAN [72] proposes a weight normalization tech-
as possible. Then, the discriminator D randomly takes the real nique called spectral normalization to stabilize the training of
sample x from the dataset and the fake sample G(z) as input the discriminator.
to output a probability between 0 and 1, indicating whether the
input is a real or fake image. In other words, D wants to discrim- C. Evaluation Metrics
inate the generated fake sample G(z) while G intends to create To reflect the visual quality of the translation performance
samples to confuse D. Consequently, the objective optimization more comprehensively, we also introduce some common eval-
problem is as shown below: uation metrics used in I2I, including subjective and objective
metrics.
min max L(D, G) = Ex∼pdata (x) [logD(x)] 1) Subjective Image Quality Assessment:
G D
r AMT perceptual studies: This test is a “real or fake”
− Ez∼pz (z) [log(1 − D(G(z)))]. (7) two-alternative forced choice experiment on the Amazon
Mechanical Turk (AMT) used in many I2I algorithms [1],
where x denotes the real data, z denotes the random noise vec-
[2], [85], [86]. Turkers are presented a series of pairs of
tor, and G(z) are the fake samples generated by the generator
images, one real and one fake (generated by the I2I mod-
G. D(x) indicates the probability that D’s input is real, and
els). Participants are asked to choose the photo they think
D(G(z)) denotes the probability that D discriminates between
is real and then obtain the feedback to compute the scores.
the input generated by G.
2) Objective Image Quality Assessment:
2) Conditional GANs: In the unconditional GAN, there is r Peak signal-to-noise ratio (PSNR): PSNR is one of the
no control of what we want to generate because the only input
most widely used full-reference quality metrics. It reflects
is the random noise vector z. Therefore, [67] proposed adding
the intensity differences between the translated image and
additional information y concatenated with z to generate image
its ground truth. A higher PSNR score means that the in-
G(z|y) shown in Fig 5. The conditional input y can be any
tensity of two images is closer.
information, such as data labels, text and attributes of images. r Structural similarity index (SSIM) [87]: I2I uses SSIM
In this way, we can use the additional information to adjust the
to compute the perceptual distance between the translated
generated results in a desirable direction. The objective function
image and its ground truth. The higher the SSIM is, the
is described as:
greater the similarity of the luminance, contrast and struc-
min max L(D, G) = Ex∼pdata (x) [logD(x|y)] ture of two images will be.
G D r Inception score (IS) [88]: IS encodes the diversity across
+ Ez∼pz (z) [log(1 − D(G(z|y)))]. (8) all translated outputs. It exploits a pretrained inception
classification model to predict the domain label of the
Note that the real data is also under the control of the same translated images. A higher score indicates a better trans-
conditional variable y, i.e., D(x|y). lated performance.
3) The Way to Train GANs: In the training process, GAN up- r Mask-SSIM and Mask-IS [89]: These two metrics are the
dates the parameters of G along with D using gradient-based op- masked versions of SSIM and IS to reduce the background
timization methods, such as stochastic gradient descent (SGD), influence by masking it out. They are designed for evalu-
Adam [83] and RMSProp [84]. The entire optimization goal ating the performance of person image generation task.
is achieved when D cannot distinguish between the generated r Conditional inception score (CIS) [90]: CIS is modified
sample x = G(z) and real sample x, i.e., when the Nash equi- from IS to better evaluate the multimodal I2I works. It
librium is found in this status. In practice, the training of GANs encodes the diversity of the translated output conditioned

Authorized licensed use limited to: National Taipei Univ. of Technology. Downloaded on January 09,2025 at [Link] UTC from IEEE Xplore. Restrictions apply.
PANG et al.: IMAGE-TO-IMAGE TRANSLATION: METHODS AND APPLICATIONS 3865

accuracy indicates that the model learns more deterministic


patterns to be represented in the target domain.
r Density and Coverage (DC) [98]: It is the latest metric for
simultaneously judging the diversity and fidelity of gener-
ative models. It measures the distance between real images
and generated images by introducing a manifold estima-
tion procedure. Higher scores indicate larger diversity and
better coverage to the ground-truth domain, respectively.
Fig. 6. Examples of two-domain I2I.
III. TWO-DOMAIN IMAGE-TO-IMAGE TRANSLATION
In this section, we focus on introducing the two-domain I2I
on a single input image. A higher score indicates a better methods. As shown in Fig. 6, two-domain I2I can solve many
translated performance. problems in computer vision, computer graphics and image pro-
r Perceptual distance (PD) [91]: PD computes the percep- cessing, such as image style transfer (f.) [2], [75], which can be
tual distance between the translated image and correspond- used in photo editor apps to promote user experience and se-
ing source image. A lower PD score indicates that the con- mantic segmentation (c.) [4], [5], which benefits the autonomous
tents of two images are more similar. driving and image colorization (d.) [23], [27]. If low-resolution
r Fréchet inception distance (FID) [92]: The FID measures images are taken as the source domain and high-resolution im-
the distance between the distributions of synthesized im- ages are taken as the target domain, we can naturally achieve
ages and real images. A lower FID score means a better image super-resolution through I2I (e.) [28], [29]. Indeed, two-
performance. domain I2I can be used for many different types of applications
r Kernel inception distance (KID) [93]: The KID com- as long as the appropriate type and amount of data are provided
putes the squared maximum mean discrepancy between as the source-target images. Therefore, we refer to the univer-
the feature representations of real and generated images sal taxonomy in machine learning, such as the categorizations
in which feature representations are extracted from the used in [143]–[144], and classify two-domain I2I methods into
Inception network [94]. A lower KID indicates more four categories based on the different ways of leveraging var-
shared visual similarities between real and generated ious sources of information: supervised I2I, unsupervised I2I,
images. semi-supervised I2I and few-shot I2I, as described in following
r Single image Fréchet inception distance (SIFID) [95]: paragraph. We also provide the summary of these two-domain
The SIFID captures the difference between the internal dis- I2I methods in Table I including method name, publication year,
tributions of two images, which is implemented by com- the type of training data, whether multi-modal or not and corre-
puting the Fréchet inception distance (FID) between the sponding insights.
deep features of two images. The SIFID is computed us- r Supervised I2I In the earlier I2I works [1], researchers
ing the translated image and corresponding target image. used many aligned image pairs as the source domain and
A lower SIFID score indicates that the styles of two images target domain to obtain the translation model that translates
are more similar. the source images to the desired target images.
r LPIPS [96]: LPIPS evaluates the diversity of the translated r Unsupervised I2I Training supervised translation is not
images and is demonstrated to correlate well with human very practical because of the difficulty and high cost of
perceptual similarity. It is computed as the average LPIPS acquiring these large, paired training data in many tasks.
distance between pairs of randomly sampled translation Taking photo-to-painting translation as an example (e.g.,
outputs from the same input. A higher LPIPS score means f. in Fig. 6), it is almost impossible to collect massive
a more realistic, diverse translated result. amounts of labeled paintings that match the input land-
r FCN scores [1]: This metric is mostly used in the trans- scapes. Hence, unsupervised methods [2], [11], [12] have
lation of semantic maps ↔real photos (e.g., c. in Fig. 6). gradually attracted more attention. In an unsupervised
It uses the FCN-8 s architecture [97] for semantic seg- learning setting, I2I methods use two large but unpaired
mentation to predict a label map for a translated photo sets of training images to convert images between repre-
and then compares this label map with ground truth la- sentations.
bels with standard semantic segmentation metrics, such r Semi-supervised I2I In some special scenarios, we still
as per-pixel accuracy, per-class accuracy, and mean class need a little expensive human labeling or expert guid-
intersection-over-union (class IOU). A higher score indi- ance, as well as abundant unlabeled data, such as those
cates a better translated result. of old movie restoration [107] or genomics [145]. There-
r Classification accuracy [74]: This metric adapts a clas- fore, researchers consider introducing semi-supervised
sifier pretrained on target domain images to classify the learning [146]–[148] into I2I to further promote the per-
translated images. The intuition behind this metric is that formance of image translation. Semi-supervised I2I ap-
a well-trained I2I model would generate outputs that can proaches leverage only source images alongside a few
be classified as an image from the target domain. A higher source-target aligned image pairs for training but can

Authorized licensed use limited to: National Taipei Univ. of Technology. Downloaded on January 09,2025 at [Link] UTC from IEEE Xplore. Restrictions apply.
3866 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 24, 2022

supervised I2I tasks. In addition to the pixelwise regression loss


L1 between the translated image and the ground truth, pix2pix
leverages adversarial training loss LcGAN to ensure that the out-
puts cannot be distinguished from “real” images. The objective
is:

L = min max LcGAN (G, D) + λLL1 (G). (9)


G D

Pix2pix is also a strong baseline image translation framework


that inspires many improved I2I works based on it, as described
in following parts.
Fig. 7. Examples of multimodal outputs in two-domain I2I. Wang et al. [86] claim that the GAN loss and pixelwise loss
used in pix2pix often lead to blurry results. They present dis-
achieve more promoted translated results than their un-
criminative region proposal adversarial networks (DRPANs) to
supervised counterpart.
r Few-shot I2I Nonetheless, several problems remain re- address it by adding a reviser (R) to distinguish real from masked
fake samples. Wang et al. [99] argue that the adversarial train-
garding translation using a supervised, unsupervised or
ing in pix2pix [1] might be unstable and prone to failure for
semi-supervised I2I method with extremely limited data.
high-resolution image generation tasks. They propose an HD
In contrast, humans can learn from only one or limited ex-
version of pix2pix that can increase the photo realism and resolu-
emplars to achieve remarkable learning results. As noted
tion of the results to 2048 × 1024. Moreover, AlBahar et al. [100]
by meta-learning [149], [150] and few-shot learning [151],
take an important step toward addressing the controllable or
[152], humans can effectively use prior experiences and
user-specific generation based on pix2pix [1] via respecting the
knowledge when learning new tasks, while artificial learn-
constraints provided by an external, user-provided guidance im-
ers usually severely overfit without the necessary prior
age.
knowledge. Inspired by the human learning strategy, few-
Unfortunately, pix2pix [1] and its improved variants [86],
and one-shot I2I algorithms [18], [79], [80], [137] have
[99], [100] still fail to capture the complex scene structural re-
been proposed to translate from very few (or even one)
lationships through a single translation network when the two
in the limit unpaired training examples of the source and
domains have drastically different views and severe deforma-
target domains.
tions. Tang et al. [101] therefore proposed SelectionGAN to
Although learning settings may differ, most of these I2I tech-
solve the cross-view translation problem, i.e., translating source
niques tend to learn a deterministic one-to-one mapping and
view images to target view scenes in which the fields of views
only generate single-modal output, as shown in Fig. 6. How-
have little or no overlap. It was the first attempt to combine the
ever, in practice, the two-domain I2I is inherently ambiguous,
multichannel attention selection module with GAN to solve the
as one input image may correspond to multiple possible outputs,
I2I problem.
namely, multimodal outputs, as shown in Fig. 7. Multimodal I2I
What’s more, SPADE [4] proposes the spatially-adaptive nor-
translates the input image from one domain to a distribution of
malization layer to further improve the quality of the synthesized
potential outputs in the target domain while remaining faithful
images. But SPADE uses only one style code to control the en-
to the input. These diverse outputs represent different color or
tire style of an image and inserts style information only in the
style texture themes (i.e., multimodal) but still preserve the sim-
beginning of a network. SEAN [5] therefore designs semantic
ilar semantic content as the input source image. Therefore, we
region-adaptive normalization layer to alleviate the two short-
actually view multimodal I2I as a special two-domain I2I and
comings.
discuss it in supervised (Subsection III-A) and unsupervised set-
Having said that, Shaham et al. [104] claim that traditional
tings (Subsection III-B).
I2I networks [1], [4], [99] suffer from acute computational cost
when operating on high-resolution images. They propose to de-
A. Supervised Image-to-Image Translation
sign a more lightweight but efficient enough network ASAPNet
Supervised I2I aims to translate source images into the target for fast high-resolution I2I.
domain with many aligned image pairs as the source domain and Recently, Zhang et al. [102] proposed an exemplar-based I2I
target domain for training. In this subsection, we further divide framework, CoCosNet, to translate images by establishing the
the supervised I2I in two categories: methods with single-modal dense semantic correspondence between cross-domain images.
output and methods with multimodal outputs. However, the semantic matching process may lead to a pro-
1) Single-Modal Output: The idea of I2I can be traced back hibitive memory footprint when estimating a high-resolution
to Hertzmann et al.’s image analogies [153], which use a non- correspondence. Zhou et al. [103] therefore proposed a GRU-
parametric texture model for a wide variety of “image filter” assisted refinement module that applies PatchMatch in a hierar-
effects with an image pair input. More recent research on I2I chy to first learn the full-resolution, 1024 × 1024, cross-domain
mainly leverages the deep convolutional neural network to learn semantic correspondence, namely CoCosNetv2.
the mapping function. Isola et al. [1] first apply conditional GAN 2) Multimodal Outputs: As shown in Fig. 7, multimodal I2I
to an I2I problem by proposing pix2pix to solve a wide range of translates the input image from one domain to a distribution of

Authorized licensed use limited to: National Taipei Univ. of Technology. Downloaded on January 09,2025 at [Link] UTC from IEEE Xplore. Restrictions apply.
PANG et al.: IMAGE-TO-IMAGE TRANSLATION: METHODS AND APPLICATIONS 3867

potential outputs in the target domain while remaining faithful


to the input.
Actually, this multimodal translation benefits from the so-
lutions of mode collapse problem [64], [68], [154], in which
the generator tends to learn to map different input samples to
the same output. Thus, many multimodal I2I methods [16],
[105] focus on solving the mode collapse problem to lead to Fig. 8. Taking edge → face translation as an example, we use a cycle-
diverse outputs naturally. BicycleGAN [16] became the first su- consistency constraint between a source image xA and its cyclic reconstructed
pervised multimodal I2I work by combining cVAE-GAN [55], image xABA , termed cyclic loss through two translators GA→B and GB→A .
[155], [156] and cLR-GAN [157]–[159] to systematically study
a family of solutions to the mode collapse problem and generate
diverse and realistic outputs.
Similarly, Bansal et al. [105] proposed PixelNN to achieve
multimodal and controllable translated results in I2I. They pro-
posed a nearest-neighbor (NN) approach combining pixelwise
matching to translate the incomplete, conditioned input to mul-
tiple outputs and allow a user to control the translation through
on-the-fly editing of the exemplar set.
Another solution for producing diverse outputs is to use disen-
tangled representation [157], [160]–[162] which aims to break Fig. 9. Examples of I2I with large domain gaps, where the translated images
down, or disentangle, each feature into narrowly defined vari- are sufficiently realistic in the target domain and preserve semantic information
learned from the source domain.
ables and encodes them as separate dimensions. When combin-
ing it with I2I, researchers disentangle the representation of the
source and target domains into two parts: domain-invariant fea- constraint was proposed, as shown in Fig. 8 and was proved
tures content, which are preserved during the translation, and to be an effective strategy for overcoming the lack of su-
domain-specific features style, which are changed during the pervised pairing.
translation. In other words, I2I aims to transfer images from the r Translation beyond Cycle-consistency Constraint How-
source domain to the target domain by preserving content while ever, while the cycle-consistency constraint can eliminate
replacing style. Therefore, one can achieve multimodal outputs the dependence on supervised paired data, it tends to force
by randomly choosing the style features that are often regular- the model to generate a translated image that contains all
ized to be drawn from a prior Gaussian distribution N (0, 1). the information of the input image for reconstructing the
Gonzalez-Garcia et al. [106] disentangled the representation of input image. Approaches using cyclic loss are typically un-
two domains into three parts: the shared part containing com- successful when the two domains require substantial clut-
mon information of both domains, and two exclusive parts that ter and heterogeneity instead of small, simple changes in
only represent those factors of variation that are particular to low-level shape and context. Therefore, many UI2I meth-
each domain. In addition to the bi-directional multimodal trans- ods focus on the translation beyond the cycle-consistency
lation and retrieval of similar images across domains, they can constraint, as shown in Fig. 9, to solve the homogeneous
also transfer a domain-specific transfer and interpolation across limitation, as well as the large shape deformation problem
two domains. between the source and target domains.
r Translation of Fine-grained Objects Most UI2I mod-
els using or beyond the cycle-consistency constraint tend
B. Unsupervised Image-to-Image Translation (UI2I)
to directly synthesis a new domain with the global target
UI2I uses two large but unpaired sets of training images to style translated and give little thought of the local objects
convert images from one representation to another. In this sub- or fine-grained instances during translation. However, in
section, we follow the same categories in Subsection III-A: some application scenarios, such as virtual try-on, we may
single-modal output and multimodal outputs. only need to change a local object, such as changing pants
1) Single-Modal Output: UI2I methods have been explored to a skirt with other parts unchanged, as shown in Fig. 10.
primarily by focusing on different issues. We will introduce In this case, severe setbacks are incurred when the transla-
those methods with single-modal output in the following four tion involves large shape changes of instances or multiple
categories: translation using a cycle-consistency constraint, discrepant objects. Hence, research on applying instances
translation beyond a cycle-consistency constraint, translation of or objects information in UI2I is a growing trend.
fine-grained objects and translation by combining knowledge in r Translation by combining knowledge in other fields In
other fields. addition, some UI2Is try to improve the network efficiency
r Translation using a Cycle-consistency Constraint In the or translation performance by combining knowledge from
beginning, researchers tried to find new frameworks or other research areas.
constraints to establish the I2I mapping without labels or a) Translation using a Cycle-consistency Constraint: The
pairings. Based on this motivation, the cycle-consistency popularly known strategy for tackling an unsupervised setting

Authorized licensed use limited to: National Taipei Univ. of Technology. Downloaded on January 09,2025 at [Link] UTC from IEEE Xplore. Restrictions apply.
3868 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 24, 2022

the generator such that each original image shares semantics


with its generated version. Wu et al. [77] present TransGaGa
to solve the large geometry variations in I2I. They disentangle
each domain into a Cartesian product of the geometry space
and appearance space by the VAE. In each space, they apply a
bi-direction geometry transformation and appearance transfor-
mation with two transformers, respectively. Zhao et al. [112]
argue that I2I can barely perform shape changes, remove ob-
jects or ignore irrelevant texture because of the strict pixel-level
constraint of cycle-consistent loss. They propose ACL-GAN,
which uses a novel adversarial-consistency loss to replace the
cyclic loss to maintain the commonalities across two domains.
Fig. 10. Examples of I2I focusing on fine-grained objects. The top examples
generate results with the global style (weather or foundation) translated and the Recently, Katzir et al. [113] mitigated shape translation in a
local object instance (car or lipstick) changed. The bottom examples achieve cascaded, deep-to-shallow fashion, in which they exploited the
remarkable translated results for fine-grained local objects. deep features extracted from a pretrained VGG-19 and translated
them at the feature level.
Moreover, some UI2I works try to design a one-side transla-
is to use the cycle-consistency constraint (cyclic loss) shown in
tion process to remove the cycle-consistency constraint. These
Fig. 8. Cyclic loss uses two translators, GA→B and GB→A , to
methods usually take into account some kind of geometry dis-
define a cycle-consistency loss between the source image xA
tance as content loss between the original source image and
and its reconstruction xABA when the pairs are not available,
translated results. Benaim et al. [114] propose DistanceGAN
and the objective can be written as:
to achieve one-side translation by maintaining the distances
Lcyc = L(xA , GB→A (GA→B (xA ))). (10) between images within domains. Fu et al. [115] propose GC-
GAN to preserve the given geometric transformation between
Taigman et al. [108] present a domain transfer network (DTN) the input images before and after translation. Zheng et al. [119]
for unpaired cross-domain image generation by assuming con- propose F-LSeSim to learn a domain-invariant representation
stant latent space between two domains, which could generate to precisely express scene structure via self-similarity. Liang
images of the target domains’ style and preserve their identity. et al. [120] propose a Laplacian Pyramid Translation Network
Similar to the idea of dual learning in neural machine transla- (LPTN) to achieve photorealistic I2I by decomposing the input
tion, DualGAN [12], DiscoGAN [11] and CycleGAN [2] are into a Laplacian pyramid and translating on the low-frequency
proposed to train two cross-domain transfer GANs with two component. Park et al. [116] propose CUT to maximize the mu-
cyclic losses at the same time. Liu et al. [74] propose UNIT tual information between the input-output pairs via contrastive
to make a shared latent space assumption that a pair of corre- learning [163] in a patch-based way rather than operating on en-
sponding images in different domains can be mapped to the tire images. Jiang et al. [118] propose a symmetrical two-stream
same latent code in a shared latent space. They show that framework (TSIT) to learn feature-level semantic structure in-
the shared-latent space constraint implies the cycle-consistency formation and style representation, and then they exploit the
constraint. Li et al. [109] claim that these single-stage unsu- generator to fuse content and style feature maps from coarse to
pervised approaches are difficult to use for translating two- fine. Park et al. [117] also propose a swapping autoencoder for
domain images with high-resolution or a substantial visual texture swapping by enforcing the output and reference patches
gap. They hence propose a stacked cycle-consistent adversar- to appear indistinguishable via the patch co-occurrence discrim-
ial network (SCAN) to decompose the single complex image inator.
translation process into multistage transformations. More re- c) Translation of Fine-grained Objects: Some I2I works try
cently, Kim et al. [37] proposed U-GAT-IT to incorporate a to translate on a higher semantic level by replacing the local
novel attention module to force the generator and discrimi- texture information of object instances as well as the global
nator to focus on more important regions via the auxiliary style translated, as shown in Fig. 10. Ma et al. [121] propose
classifier. DAGAN to construct a deep attention encoder to enable the
b) Translation beyond Cycle-consistency Constraint: To ad- instance-level translation. Chen et al. [122] and Mejjati [13]
dress the challenging shape deformation problem (i.e., large do- almost simultaneously proposed attention GAN and attention-
main gaps) in I2I shown in Fig. 9, Gokaslan et al. [110] pro- guided I2I, respectively, focusing on achieving an I2I translation
pose GANimorph to reframe the discrimination problem from of individual objects without altering the background, as shown
determining real or fake images into a semantic segmentation in Fig. 10 (local). Mo et al. [123] propose InstaGAN, which
task of finding real or fake regions of the image with dilated is the first work to solve multi-instance transfiguration tasks
convolutions. Amodio et al. [111] introduce TraVeLGAN to ad- in UI2I. It uses the object segmentation masks to translate an
dress the challenge. In addition to the generator and discrim- image and the corresponding set of instance attributes while
inator, they add a Siamese network to define a transformation maintaining the permutation invariance property of instances.
vector between two images of each domain and minimize the Shen et al. [124] propose the instance-aware I2I approach (INIT)
distance between the two vectors. The Siamese network guides to use the fine-grained local instances based on MUNIT [90] and

Authorized licensed use limited to: National Taipei Univ. of Technology. Downloaded on January 09,2025 at [Link] UTC from IEEE Xplore. Restrictions apply.
PANG et al.: IMAGE-TO-IMAGE TRANSLATION: METHODS AND APPLICATIONS 3869

losses, content adversarial loss and cross-cycle consistency loss.


Through content adversarial loss, it applies weight sharing and
a content discriminator to force content representation to be
mapped onto the same shared space, as well as to guarantee the
same content representations encode the same information for
both domains. Then, with the constraint of cross-cycle consis-
tency loss, it performs forward-backward translation by swap-
ping domain-specific representations. At test time, DRIT can
use different attribute vectors randomly sampled from domain-
specific attribute space to generated diverse outputs.
Fig. 11. The architecture of cd-GAN [132]. However, these aforementioned methods still cannot solve
the problem of target domain images that are content-rich with
DRIT [75]. DUNIT [125] incorporates an object detector within multiple discrepant objects. Shen et al. [124] therefore propose
the I2I architecture used in DRIT [75] to leverage the object INIT to translate instance-level objects and background/global
instances to reassemble the resulting representation. areas separately with different style codes. Chang et al. [135]
d) Translation by Combining Knowledge in Other Fields: declare that the shared domain-invariant content space in disen-
Tomei et al. [14] present Art2Real to translate artistic paintings tangled representations could limit the ability to represent con-
to real photos using a weakly supervised segmentation model tent because these representations ignore the relationship be-
and memory banks. Inspired by the style transfer, Cho et al. [126] tween content and style. They present DSMAP to leverage two
propose GDWCT, which extends whitening-and-coloring trans- extra domain-specific mappings to remap the content features
formation (WCT) to I2I to achieve a highly competitive im- from shared domain-invariant content space to two independent
age quality. Chen et al. [128] use the knowledge distillation domain-specific content spaces for two domains.
scheme to define a teacher generator and student discrimina- Other attempts to address multimodal UI2I are proposed by
tor. A distilling portable model is shown to achieve a compa- Mao et al. [133] and Alharbi et al. [134]. The study in [133]
rable performance with substantially lower memory usage and uses MSGAN to employ a mode-seeking regularization method
computational cost. With the help of domain adaptation, Chen to solve the mode collapse problem in cGANs, and the proposed
et al. [129] develop DAI2I to adapt a given I2I model trained on regularization method can be readily integrated with an exist-
the source domain to a new domain, which improves the gen- ing cGANs framework, such as DRIT [75], to generate more
eralization capacity of existing models. RevGAN [127] inter- diverse translated images. In addition, [134] uses latent filter
polates the invertible neural networks (INNs) into I2I to reduce scaling (LFS) to perform multimodal UI2I, which is the first
memory overhead, as well as increase the fidelity of the output. multimodal UI2I framework that does not require autoencoding
NICE-GAN [130] first reuses the discriminator for embedding or reconstruction losses for the latent codes or images.
from images to hidden vectors (as encoding) in which the dis-
criminator is conducted using introspective adversarial networks C. Semi-Supervised Image-to-Image Translation
(IANs). It derives a more compact and effective architecture for
generating translated images. Semi-supervised I2I draws much attention in some special
2) Multi-Modal Outputs: Kazemi et al. [78] show that applications, such as old movie restoration or artistic reconstruc-
shared-latent space assumptions only model the domain- tions. In these scenarios, one needs few human-labeled data for
invariant information across two domains and fail to capture guidance and abundant, unlabeled other data for automatic trans-
the domain-specific information. They argue for learning a lation. Unpaired data, when used in conjunction with a small
one-to-many UI2I mapping by extending CycleGAN to learn amount of paired data, can produce considerable improvement
a domain-specific code for each domain jointly with a domain- in translation performance.
invariant code. Similarly, Almahairi et al. [131] also claim that Mustafa et al. [107] first study the applicability of semi-
the mapping across two domains should be characterized as supervised learning in a two-domain I2I setting. They introduce
many-to-many instead of one-to-one. They therefore introduce a regularization term, transformation consistency regularization
the augmented CycleGAN model to capture the diversity of the (TCR), to force a model’s prediction to remain unchanged for
outputs, i.e., multimodal, by extending CycleGAN’s training the perturbed (geometric transform) input sample and its recon-
procedure to the augmented spaces. struction version. In detail, they train an I2I mapping model fθ
Disentangled representations [157], [160], [161] also offer by minimizing the supervised loss Ls with paired source-target
a solution to this problem in unsupervised settings. They in- images (xi , yi ):
spire multimodal UI2I advances, such as cd-GAN [132], MU- Ls = mse(yi , fθ (xi )). (11)
NIT [90], DRIT [75] and EGSC-IT [76], which were pro-
posed almost simultaneously and present similar model design- Then, they leverage unsupervised data to regularize the model’s
ings, such as that of Fig. 11. For example, DRIT [75] assumes predictions over varied forms of geometric transformations Tm .
embedding images onto two spaces, domain-invariant content They make use of Tm to process the unlabeled input samples
space and domain-specific attribute space, via a content encoder and feed these transformed samples into the I2I model fθ to
and attribute encoder. Specifically, DRIT learns two objective obtain the disturbed outputs fθ (Tm (ui )). On the other hand, they

Authorized licensed use limited to: National Taipei Univ. of Technology. Downloaded on January 09,2025 at [Link] UTC from IEEE Xplore. Restrictions apply.
3870 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 24, 2022

directly feed the unlabeled samples into fθ to acquire primary


outputs and apply geometric transformations Tm onto them to
obtain another type of perturbed outputs Tm (fθ (ui ). The TCR
of unlabeled data guarantees the consistency between the two
outputs to learn more about the inherent structure of the source
and target domain distributions. The detailed unsupervised TCR
regularization loss Lus is as follows:

Lus = mse(Tm (fθ (ui ), fθ (Tm (ui )))). (12)

Their method can use unlabeled data and less than 1% of labeled
data to complete several I2I tasks, such as image colorization,
image denoising and image super-resolution.

D. Few-Shot Image-to-Image Translation


Existing I2I models cannot translate images from very few
(even one) training examples of the source and target domains.
In contrast, humans can learn from very limited exemplars to Fig. 12. The architecture of TuiGAN [80].
obtain extraordinary learning results. For example, a child can
recognize what a “zebra” and “rhino” are with only a few design two pyramids of generators and discriminators to transfer
pictures. Inspired by the rapid learning ability of humans, re- the domain distribution of the input image to the target domain
searchers expect that after the machine learning model has by progressively translating the image from coarse to fine. Using
learned a large amount of data in a certain category, for new this progressive translation, their model can extract the underly-
categories, it only needs a few samples to learn quickly. In other ing relationship between two images by continuously varying the
words, few-shot I2I wants to solve the transfer capability or gen- receptive fields at different scales. All in all, TuiGAN represents
eralization ability of the I2I model given very few samples. a further step toward the possibility of unsupervised learning
Drawing inspiration from domain adaptation or transfer learn- with extremely limited data.
ing, some methods solve the few-shot translation by adapting a
pretrained-network trained on large-scale source domain to the IV. MULTI-DOMAIN IMAGE-TO-IMAGE TRANSLATION
target domain with a few images only. Wang et al. [136] pro-
In this section, we will discuss the I2I problem on multiple
pose Transferring GAN (TGAN) to successful combine transfer
domains and list correlative algorithms in Table II convering
learning with GAN. It transfers images from source domain to
model name, publication year, the type of training data, whether
target domain with a pretrained network when data is limited.
multi-modal or not and corresponding insights. We have dis-
Lin et al. propose MT-GAN [137] to incorporate prior from pre-
cussed a series of attractive I2I works for translating two do-
vious domain translation tasks when assuming a new domain
mains. However, these methods can only capture the relationship
translation task from a perspective of meta-learning. Likewise,
of two domains based on one model at a time. Given n domains,
Li et al. [138] propose to utilize EWC to adapt the weights
the network requires n × (n − 1) generators to be trained, which
of pretrained network on source domain to a new target do-
leads to an unavoidable burden. Moreover, it fails to fully use
main. Ojha et al. [139] achieve few-shot adaptation by a novel
the entire training data from all the domains. Even if there exists
cross-domain distance consistency loss and an anchor-based
global information learned from all the domains that can be ap-
strategy.
plied to promote the translation performance, the network still
An extreme scenario of few-shot I2I is one-shot I2I. Benaim
only learns from two domains, and it is difficult to acquire that
et al. [140] propose OST to solve the one-shot cross-domain
global multi-domain information. How to further reduce the net-
translation problem, which aims to learn a unidirectional map-
work complexity and improve the efficiency to handle multiple
ping function given a single source image and a set of images
domains remains unaddressed.
from the target domain. By sharing layers between the two au-
Therefore, researchers study the multi-domain I2I prob-
toencoders and selective backpropagation, OST enforces the
lem. It focuses on handling multiple domains using a sin-
same structure on the encoding of both domains, which benefits
gle unified model in which multiple outputs contain differ-
the translation. As an extension of OST, Cohen et al. [141] pro-
ent semantic contents or style textures. We divide multi-
pose BiOST to translate in bi-direction without a weight sharing
domain I2I research into three categories: unsupervised multi-
strategy via a feature-cycle consistency term.
domain I2I, semi-supervised multi-domain I2I and few-shot
In contrast to the one-shot setting in [140], [141] that uses
multi-domain I2I.
a single image from the source domain and a set of images
from the target domain, Lin et al. propose TuiGAN [80] to
A. Unsupervised Multi-Domain Image-to-Image Translation
achieve one-shot UI2I with only two unpaired images from two
domains. They train TuiGAN in a coarse-to-fine manner using In this subsection, we introduce unsupervised multi-domain
the cycle-consistency constraint shown in Fig. 12. In detail, they I2I (multi-domain UI2I) in two aspects: single-modal output

Authorized licensed use limited to: National Taipei Univ. of Technology. Downloaded on January 09,2025 at [Link] UTC from IEEE Xplore. Restrictions apply.
PANG et al.: IMAGE-TO-IMAGE TRANSLATION: METHODS AND APPLICATIONS 3871

r Training by combining knowledge in other fields Some


algorithms try to introduce knowledge from other research
areas to facilitate multi-domain UI2I. To some extent, these
methods have indeed brought us new insight.
a) Training with Multimodules: Based on the shared-latent
space assumption [74], Hui et al. [164] propose a unified frame-
work named Domain-Bank. Given n domains, Domain-Bank
obtains n pairs of translated results by training the network
only once, while the two-domain I2I methods require the train-
ing of n models for translations between different pairs of do-
Fig. 13. Illustration of the dataset used in multi-domain I2I scenarios: Each mains. By leveraging several reusable and composable modules,
domain usually means a set of images sharing the same attribute value. Images
are from the CelebA dataset [186]. Zhao et al. [165] propose ModularGAN to translate an image
to multiple domains efficiently. They predefine an attribute set
A = {A1 , . . . , An } in which each attribute Ai represents mean-
and multimodal outputs. First, we explain how to achieve this ingful inherent property of each domain with different attribute
translation with multiple domains. For example, the CelebA values.
dataset [186] contains 40 facial attributes, and each domain usu- b) Training with One Generator and Discriminator Pair: Choi
ally means a set of images sharing the same attribute value et al. propose StarGAN [166], which fully proves the effective-
in [166]–[169]. We therefore can obtain numerous unpaired ness of the auxiliary domain label by mappings between all avail-
translation domains based on different attribute values, as shown able domains using only a single model. They design a special
in Fig. 13. Notwithstanding the demonstrated success of unsu- discriminator and introduce an auxiliary classifier on top of it,
pervised two-domain I2I (two-domain UI2I) in Subsection III-B, in which the discriminator not only justifies whether an image
when we have multiple unpaired domains, do these two-domain is a natural or fake one Dsrc (xA ) but also distinguishes which
UI2I methods still work? The answer may be “no.” The typ- domain the input belongs to Dcls (xA ):
ical problems are the efficiency and network burden in which
D : xA → {Dsrc (xA ), Dcls (xA )}. (13)
these two-domain UI2I can only transfer one pair of different
domains through one training. The multi-domain UI2I hence at- To translate input image xA ∈ A to target domain B, StarGAN
tracts much attention, and we will provide detailed illustrations learns an adversarial loss and a cycle-consistency loss condi-
from two aspects: multi-domain UI2I with single-modal output tioned with input domain label cA and target domain label cB .
and with multimodal outputs. Through the three loss functions, StarGAN can achieve a scal-
1) Single-Modal Output: In addition, a large variety of multi- able translation for multiple domains and obtain results with
domain UI2I methods with a single-modal output have been higher visual quality. Sharing an extremely similar idea, He
proposed to obtain image representations in an unsupervised et al. [167] propose AttGAN to address this problem. However,
way. We classify them into three categories: training with multi- AttGAN uses an encoder-decoder architecture to model the re-
modules, training with one generator and discriminator pair and lationship between the latent representation and the attributes.
training by combining knowledge in other fields. The target-attribute constraint used in StarGAN and AttGAN
r Training with multimodules In earlier times, methods fails to provide a more fine-grained control and often requires
mainly designed complex multiple modules to address specifying the complete target attributes, even if most of the at-
multi-domain UI2I by regarding it as a composition of sev- tributes are not changed. Wu et al. [168] and Liu et al. [169]
eral two-domain UI2Is, in which each module represents therefore consider a novel attribute description termed relative-
each domain information. Compared with directly applied attribute that represents the desired change of attributes. They
two-domain UI2I methods, these works can train all the propose RelGAN and STGAN, respectively, to satisfy arbitrary
domains at one time, which saves much training time and image attribute editing with relative attributes. Given input do-
many model parameters. main attribute cA and target domain attribute cB , the relative
r Training with one generator and discriminator pair Un- attribute c is formulated as:
fortunately, methods training with multimodules can train
c = cB − c A . (14)
translations between multiple domains at one time, but they
still need to define multiple-domain modules to represent StarGAN or AttGAN may perform worse when multiple in-
the corresponding domains. Is there a model that can train puts are required to obtain a desired output. Any missing in-
all domains at one time using the same module to process put data will introduce a large bias and lead to terrible re-
multiple-domain information? A more effective solution is sults. Therefore, CollaGAN [170] has been proposed to process
to use an auxiliary label (binary or one-shot attribute vec- multiple-inputs from multiple domains instead of only handling
tor) to represent domain information that leads to a more single-input and single-output.
flexible translation. After randomly choosing the target do- Rather than introducing an auxiliary domain classifier, Lin
main label as conditional input, we can translate the source et al. [171] propose introducing an additional auxiliary domain
domain input to this target domain without extra translators and constructing a multipath consistency loss for multi-domain
using one generator and discriminator pair. I2I. Their work is motivated by an important property shown in

Authorized licensed use limited to: National Taipei Univ. of Technology. Downloaded on January 09,2025 at [Link] UTC from IEEE Xplore. Restrictions apply.
3872 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 24, 2022

Fig. 14. The illustration of multipath consistency regularization [171] on the


translation between different hair colors. Ideally, they consider that the direct
translation (i.e., one-hop translation) from brown hair to blonde should be iden-
tical to the indirect translation (i.e., two-hop translation) from brown to black to
blonde.

Fig. 14, namely, the direct translation (i.e., one-hop translation)


from brown hair to blonde should ideally be identical to the in-
Fig. 15. The training architecture of DosGAN [178]: (1) top: DosGAN for
direct translation (i.e., two-hop translation) from brown to black unpaired I2I; (2) bottom: DosGAN-c for unpaired conditional I2I.
to blonde. Their multipath consistency loss evaluates the differ-
ences between direct two-domain translation A → C and indi-
rect multiple-domain translations A → B → C with domain B
as an auxiliary domain. The method regularizes the training of GMM-UNIT [182] are proposed to perform multimodal out-
each task and obtains a better performance. puts in a multi-domain UI2I setting. For example, DRIT++
c) Training by Combining Knowledge in Other Fields: By consists of two content encoders {EA c
, EB c
}, two attribute en-
expanding the concept of a multi-domain from data to the loss coders {EA , EB }, two generators {GA , GB }, two discrimina-
a a
c
area, Chang et al. [172] introduce the sym-parameter to syn- tors DA , DB and a content discriminator Dadv . Through weight
c
chronize various mixed losses with input conditions. Siddiquee sharing and a content discriminator Dadv , the network can
et al. [173] propose Fixed-Point GAN, which uses a trainable achieve representation disentanglement with content adversarial
generator and a frozen discriminator to perform fixed-point loss. Then, it leverages cross-cycle consistency loss for forward
translation learning. He et al. [174] propose deliberation learn- and backward translations. In addition to these two losses, these
ing for I2I by adding a polishing step on the output image. methods also use domain adversarial loss, self-reconstruction
Cao et al. [176] propose the informative sample mining net- loss, latent regression loss and an extra mode-seeking regular-
work (INIT) to analyze the importance of sample selection ization to effectively improve the sample diversity and visual
and select the informative samples for multihop training. Wu quality.
et al. [175] propose ADSPM to learn attribute-driven deforma-
tion by a spontaneous motion (SPM) estimation module and
a refinement part (R) with much consideration for geometric B. Semi-Supervised Multi-Domain Image-to-Image
transform. Translation
2) Multimodal Outputs: However, all of the multi-domain Li et al. [184] propose an attribute guided I2I (AGUIT) model
approaches mentioned above still learn a deterministic mapping that is the first work to handle multimodal and multi-domain I2I
between two arbitrary domains. Researchers therefore consider with semi-supervised learning. AGUIT is trained following three
addressing the multi-domain I2I, as well as the outputs of mul- steps. The first step is representation decomposition, which ex-
timodal results. tracts content and style features with two encoders, a content
Lin et al. observe that if the pretrained CNN network can ac- discriminator and a label predictor, and the style code includes
curately classify the domain of an image, then the output of the a noise part and an attribute part. The second step is reconstruc-
second-to-last layer of the classifier should well capture the do- tion and translation using AdaIN [187] and a discriminator and
main information of this image. Combining this network with a a domain classifier. The third step is consistency reconstruc-
domain classifier, they propose domain-supervised GAN (Dos- tion with cycle consistency loss and feature consistency loss.
GAN) [178] to use the domain label as an explicit supervision AGUIT is trained in a training set containing labeled images
and pretrain a deep CNN to predict which domain an image is mixed with unlabeled images so that it can translate attributes
from. The detailed training architecture is shown in Fig. 15. well. By going one step further to reduce the amount of la-
In addition, GANimation [177] is proposed to generate beled data required in the training process and source domain,
anatomically aware facial animation. It continuously synthe- Wang et al. [185] propose SEMIT to address the challenging
sizes anatomical facial movements by controlling the magnitude problem combined with few-shot I2I. SEMIT initially applies
of activation of each action unit (AU). semi-supervised learning via a noise-tolerant pseudo-labeling
By exploiting the disentanglement assumption, UFDN [32], procedure to assign pseudo-labels to the unlabeled training data.
DMIT [179], StarGAN v2 [180], DRIT++ [181] and Then, it performs UI2I using adversarial loss, classification loss

Authorized licensed use limited to: National Taipei Univ. of Technology. Downloaded on January 09,2025 at [Link] UTC from IEEE Xplore. Restrictions apply.
PANG et al.: IMAGE-TO-IMAGE TRANSLATION: METHODS AND APPLICATIONS 3873

translator trained on seen domains to translate images of unseen


domains with annotated attributes.

V. EXPERIMENTAL EVALUATION
In this section, we evaluate twelve I2I models on two tasks,
including seven two-domain algorithms on edge-to-shoes trans-
Fig. 16. Illustration of the dataset used in few-shot multi-domain I2I scenarios:
lation task and five multi-domain algorithms on attribute manip-
The training set consists of multiple domains in which the source and target ulation task. We train all the models following their default set-
images are randomly sampled from arbitrary n domains; given very few images tings as original papers except the same dataset and implemen-
of the target unseen domain (unseen in the training process), few-shot multi-
domain I2I aims to translate a source content image (randomly sampled from n
tation environments. The selection criteria of methods mainly
domains) into an image analogous to this unseen domain. takes into account algorithm categories and publication years.
All experimental codes come from the public official version.

A. Datasets
UT-Zap50 K We utilize the UT-Zap50 K dataset [188] to eval-
uate the performance of two-domain I2I methods. The number
of training pairs is 49826 where each pair consists of a shoes im-
age and its corresponding edge map. And the number of testing
images is 200. We resize all images to 256 × 256 for training
and testing. In unsupervised setting, images from source domain
and target domain are not paired.
Fig. 17. The architecture of ZstGAN [79].
CelebA We employ the CelebFaces Attributes (CelebA)
and reconstruction loss with only a few labeled examples during dataset [186] to compare the performance of multi-domain I2I
training. methods. It contains 202,599 face images of celebrities with 40
with/without attribute labels for each image. We randomly di-
C. Few-Shot Multi-Domain Image-to-Image Translation vide all images into training set, validation set and test set with
ratio 8 : 1 : 1. Next, we center-crop the initial 178 × 218 size
Although prolific, the aforementioned successful multi- images to 178 × 178. Finally, after resizing all images to 128 ×
domain I2I techniques can hardly rapidly generalize from a few 128 by bicubic interpolation, we construct the multiple domains
examples. In contrast, humans can learn new tasks rapidly using dataset using the following attributes: Black hair, Blond hair,
what they learned in the past. Given a static picture of a butterfly, Brown hair, gender (male/female), and age (young/old).
you can easily imagine it flying similar to a bird or a bee after
watching a video of a flock of birds or a swarm of bees in flight.
Hence, few-shot multi-domain I2I attracts much attention. Fig. B. Metrics
16. shows the illustration of dataset usually used in this scenario. We evaluate both the visual quality and the diversity of gen-
Liu et al. [18] seek a few-shot UI2I algorithm, FUNIT, to erated images using Frechét inception distance (FID), Incep-
successfully translate source images to analogous images of the tion score (IS) and Learned Perceptual Image Patch Similarity
target class with many source class images but few target class (LPIPS).
images available. FUNIT first trains a multiclass UI2I with mul- Fréchet inception distance (FID) [92] is computed by mea-
tiple classes of images, such as those of various animal species, suring the mean and variance distance of the generated and real
based on a few-shot image translator and a multitask adversarial images in a deep feature space. A lower score means a better per-
discriminator. In the test time, it can translate any source class formance. (1) For single-modal two-domain setting, we directly
image to analogous images of the target class with a few images compared the mean and variance of generated and real sets. (2)
from a novel object class (namely, the unseen target class). For multi-modal two-domain setting, we sample the same test-
However, FUNIT fails to preserve domain invariant appear- ing set 19 times. Then compute the FID for each testing set and
ance information in the content image because of the severe in- average the scores to get the final FID score. (3) For single-modal
fluence of the style code extracted from the target image, namely, multi-domain setting, we compute the FID score in each domain
the content loss problem. Saito et al. [183] therefore proposed and then average the scores. (4) For multi-modal multi-domain
COCO-FUNIT to redesign a content-conditioned style encoder setting, we first sample each image in each domain 19 times.
that interpolates content information into a style code. Then we compute the average FID scores within each domain.
Moreover, Lin et al. found that the current I2I works trans- Finally, we average the scores again to get the result.
late from random noise, which, unlike humans, cannot easily Inception score (IS) [88] encodes the diversity across all
adapt acquired prior knowledge to solve new problems. They translated outputs. It exploits a pretrained inception classifica-
hence proposed the unsupervised zero-shot I2I (ZstGAN) [79] tion model to predict the domain label of the translated images.
shown in Fig. 17. ZstGAN uses meta-learning to transfer trans- A higher score indicates a better translated performance. The
lation knowledge from seen domains to unseen classes using a evaluation process is just as similar as FID.

Authorized licensed use limited to: National Taipei Univ. of Technology. Downloaded on January 09,2025 at [Link] UTC from IEEE Xplore. Restrictions apply.
3874 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 24, 2022

Fig. 18. Qualitative comparison on single modal two-domain I2I methods.


Here we show the examples of edge → shoes.

Fig. 19. Qualitative comparison on multi-modal two-domain I2I methods.


Here we show the examples of edge → shoes, where * indicates addtionally
injecting noise vectors to the translation network.
Learned Perceptual Image Patch Similarity (LPIPS) [96]
evaluates the diversity of the translated images and is demon-
strated to correlate well with human perceptual similarity. It
is computed as the average LPIPS distance between pairs of
randomly sampled translation outputs from the same input.
Specifically, we sample 100 images with 19 pairs of outputs
(randomly sample two style vectors or inputs added random
Guassian noise). We then compute the distance between two gen-
erated results and get an average. A higher LPIPS score means
a more realistic, diverse translated result.

C. Results
A fair comparison is only possible by keeping all the param-
eters consistent. That said, it is difficult to declare that one algo-
rithm has an absolute superiority over the others. Besides model Fig. 20. Qualitative comparison on single modal multi-domain I2I methods.
design itself, there are still many factors influencing the per- Here we show the examples of 5 attributes.
formance, such as training time, batch size and iteration times,
FLOPs and number of parameters, etc. Therefore, our conclu-
sion only build on current experimental settings, models and multi-modal and realistic results. Also supervised method Bicy-
tasks. cleGAN achieves 0.047 more LPIPS scores than unsupervised
Two-domain I2I We qualitatively and quantitatively compare method MUNIT as shown in Table III.
pix2pix [1], BicycleGAN [16], CycleGAN [2], U-GAT-IT [37], Multi-domain I2I We qualitatively and quantitatively com-
GDWCT [126], CUT [116] and MUNIT [90] in single-modal pare StarGAN [166], AttGAN [167], STGAN [169] Dos-
and multi-modal setting respectively. GAN [178] and StarGANv2 [180] in single-modal and multi-
The single-modal qualitative comparisons are shown in modal setting respectively.
Fig. 18 where two supervised methods pix2pix and Bicycle- In Fig. 20, all methods can successfully achieve multiple
GAN achieve better FID, IS and LPIPS scores than unsuper- domains translation. However, StarGAN and AttGAN gener-
vised methods CycleGAN, U-GAT-IT and GDWCT. Without ate obvious visible artifacts while DosGAN leads to blurry re-
any supervision, the newest method CUT gets the best FID and sults. The results of STGAN are pretty excellent whereas Star-
IS scores in Table III than the rest methods including supervised GANv2 can generate realistic and vivid translated results by
and unsupervised. There could be a couple reasons for that. First, changing the image style latent code. Table IV shows that Star-
the backbone of CUT, namely StyleGAN, is a strong powerful GANv2 acquires the best FID and IS scores. Similarly, there are
GAN model for image synthesis compared to others. Besides, also many factors contributed to such result including stronger
the contrastive learning they used is an effective content con- GAN backbone, more effective training strategies, higher quality
straint for translation. dataset, etc.
As for the multi-modal setting shown in Fig. 19, we inject the We also conduct the multimodal multi-domain I2I experi-
Gaussian noise into the input of pix2pix, CycleGAN, U-GAT-IT, ments for comparison. In detail, we additionally inject noise
GDWCT and CUT to get multi-modal results. However, they vectors to StarGAN for multi-modal translation. As for AttGAN
can hardly generate the diverse outputs. On the contrary, the and STGAN, we apply the linear interpolation between two at-
multi-modal algorithms BicycleGAN and MUNIT can acquire tributes: Brown-Hair → Black-Hair as multi-modal results. As

Authorized licensed use limited to: National Taipei Univ. of Technology. Downloaded on January 09,2025 at [Link] UTC from IEEE Xplore. Restrictions apply.
PANG et al.: IMAGE-TO-IMAGE TRANSLATION: METHODS AND APPLICATIONS 3875

same person with a new outfit as well as diverse poses by manip-


ulating the target clothes or poses. In addition, sketch-to-image
translation [1], [75], [78], [86], [90], [105], [123], [129], [131],
[196], [247] text-to-image translation [79], [121], [133], [179],
[248], [249], audio-to-image translation [237] and painting-
to-image translation [14], [75], [117], [140] aim to translate
human-drawn sketches, text, audio and artwork paintings to re-
alistic images of the real world. Before I2I, most methods relied
on the retrieval of existing photographs and copying the im-
age patches to the corresponding location in an inefficient and
time-consuming manner.
Using I2I for image manipulation focuses on altering or mod-
ifying an image while keeping the unedited factors unchanged.
Semantic manipulation tries to edit the high-level semantics of
an image, such as the presence and appearance of objects (image
composition) [198] with or without makeup [199]. Attribute ma-
nipulation [167], [173], [175] varies the binary representations
or utilizes the landmark to edit image attributes, such as the
gender of the subject [250], the color of hair [126], [180], the
Fig. 21. Qualitative comparison on multi-modal two-domain I2I methods. presence of glasses [251] and the facial expression [252]–[254],
Here we show the examples of 5 domain translation, where * indicates addi- and performs image relabeling [198] as well as gaze correction
tionally injecting noise vectors to the translation network,  denotes the linear and animation in the wild [255]. Moreover, image/video retar-
interpolation between two attributes: Brown-Hair → Black-Hair.
geting [201] enables the transfer of sequential content from one
domain to another while preserving the style of the target do-
main. Much of the I2I research focuses on filling in missing pix-
shown in Fig 21 and Table IV, StarGAN fails to generate di- els, i.e, image inpainting [15]–[19] and image outpainting [202],
verse outputs and get the worst LPIPS score despite the injected but they treat different occluded images. Taking the image of a
randomness. AttGAN and STGAN can generate multi-modal human face as an example, the image inpainting task produces
results, but the gap is very small meanwhile DosGAN performs visually realistic and semantically correct results from the input
worse translation quality. In comparison, StarGANv2 can gen- with a masked nose, mouth and eyes, while the image outpaint-
erate totally different modality leading to the best LPIPS score. ing task translates a highly occluded face image that only has a
Conclusion Generally, supervised methods usually produce nose, mouth and eyes.
better translated results than unsupervised methods on simi- I2I has made great contributions to artistic creation. In the
lar network structure. However, in some special cases, super- past, redrawing an image in a particular form of art requires a
vised methods do not always perform better than unsupervised well-trained artist and much time. In contrast, many I2I studies
methods such as CUT which benefits from the development of can automatically turn photo-realistic images into synthetic art-
network architecture (StyleGAN) and more effective training works without human intervention. Using the I2I methods for
strategies (contrastive learning). As reported in Table III and artistic creation can directly translate real-world photographic
Table IV, choosing an updated model to train I2I task may be a works into illustrations in children’s books [256], cartoon im-
good idea because such model is usually trained with some of ages [12], [33], [35]–[38], [75], [80], [108], [110], [112], [121],
the latest training strategies and well-designed network archi- [135], comics [34], [203] or a multichirography of Chinese char-
tecture. Moreover, the high-quality dataset plays a crucial role acters [204]. Additionally, the style transfer task achieves re-
in I2I task. markable success through I2I methods. It contains two main
objectives: artistic style transfer [2], [11], [12], [74], [75], [90],
[126], [118], [181], which involves translating the input image
VI. APPLICATION to the desired artistic style, such as that of Monet or van Gogh;
In this section, we review the various and fruitful applica- and photo-realistic style transfer [2], [12], [74], [117], [118],
tions of I2I shown in Table V. We classify the main applications [116], [124], [135], which must clearly maintain the original
following the taxonomy of I2I methods. edge structure when transferring a style.
For realistic-looking image synthesis, the related I2I works We can also exploit I2I for image restoration. The goal of
tend to generate photos of real-world scenes given differ- image restoration is to restore a degraded image to its orig-
ent forms of input data. A typical task involves translating a inal form via the degradation model. Specifically, the image
semantic segmentation mask [4]–[7], [239], [240] into real- super-resolution task [28], [29] involves increasing the resolu-
world images, that is, semantic synthesis. Person image syn- tion of an image, which is usually trained with down-scaled ver-
thesis, including virtual try-on [89], [190]–[194], [241]–[245] sions of the target image as inputs. Image denoising [205]–[207]
and skeleton/keypoint-to-person translation [102], [103], [246] aims to remove artificially added noise from the images. Image
learns to translate an image of a person to another image of the deraining [208]–[210], image dehazing [211]–[215] and image

Authorized licensed use limited to: National Taipei Univ. of Technology. Downloaded on January 09,2025 at [Link] UTC from IEEE Xplore. Restrictions apply.
3876 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 24, 2022

deblurring [216]–[219] aim to remove optical distortions from REFERENCES


photos that were taken out of focus or while the camera was [1] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation
moving, or from photos of faraway geographical or astronomi- with conditional adversarial networks,” in Proc. IEEE Conf. Comput. Vis.
cal features. Pattern Recognit., 2017, pp. 1125–1134.
[2] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image
Image enhancement is a subjective process that involves translation using cycle-consistent adversarial networks,” in Proc. IEEE
heuristic procedures designed to process an image to satisfy the Int. Conf. Comput. Vis., 2017, pp. 2223–2232.
human visual system. I2I proves its effectiveness in this field [3] K. Regmi and A. Borji, “Cross-view image synthesis using condi-
tional gans,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018,
including image colorization and image quality improvement. pp. 3501–3510.
Image colorization [22]–[27] involves imagining the color of [4] T. Park, M.-Y. Liu, T.-C. Wang, and J.-Y. Zhu, “Semantic image syn-
each pixel, given only its luminosity. It is trained on images with thesis with spatially-adaptive normalization,” in Proc. IEEE/CVF Conf.
Comput. Vis. Pattern Recognit., Jun. 2019, pp. 2337–2346.
their color artificially removed. Image quality improvement [2], [5] P. Zhu, R. Abdal, Y. Qin, and P. Wonka, “SEAN: Image synthesis with
[220]–[222] focuses on producing noticeably fewer colored ar- semantic region-adaptive normalization,” in Proc. IEEE/CVF Conf. Com-
tifacts around hard edges and more accurate colors, as well as put. Vis. Pattern Recognit., Jun. 2020, pp. 5104–5113.
[6] C.-H. Lee, Z. Liu, L. Wu, and P. Luo, “MaskGAN: Towards diverse
reduced noise in smooth shadows. Moreover, [257] Learns to and interactive facial image manipulation,” in Proc. IEEE/CVF Conf.
fuse multi-focus image using I2I methods. Comput. Vis. Pattern Recognit., 2020, pp. 5549–5558.
We also notice that two special types of data are used in [7] H. Tang, D. Xu, Y. Yan, P. H. Torr, and N. Sebe, “Local class-specific and
global image-level generative adversarial networks for semantic-guided
I2I algorithms for particular tasks: remote sensing imaging for scene generation,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recog-
wildlife habitat analysis [223] and building extraction [258]; nit., Jun. 2020, pp. 7870–7879.
medical imaging for disease diagnosis [205], [224]–[226], dose [8] Q. Yang et al., “MRI cross-modality neuroimage-to-neuroimage transla-
tion,” 2018, arXiv:1801.06940.
calculation [227] and surgical training phantoms improve- [9] X. Guo et al., “GAN-based virtual-to-real image translation for urban
ment [228]. scene semantic segmentation,” Neurocomputing, vol. 394, pp. 127–135,
I2I methods can also contribute to other visual tasks, such 2020.
[10] R. Li, W. Cao, Q. Jiao, S. Wu, and H.-S. Wong, “Simplified unsuper-
as transfer learning for reinforcement learning [229], im- vised image translation for semantic segmentation adaptation,” Pattern
age registration [39], domain adaptation [30]–[32], person re- Recognit., vol. 105, 2020, Art. no. 107343.
identification [230]–[234], image segmentation [8]–[10], facial [11] T. Kim, M. Cha, H. Kim, J. K. Lee, and J. Kim, “Learning to discover
cross-domain relations with generative adversarial networks,” in Proc.
geometry reconstruction [235], 3D pose estimation [20], [21], Int. Conf. Mach. Learn., 2017, pp. 1857–1865.
neural talking head generation [236] and hand gesture-to-gesture [12] Z. Yi, H. Zhang, P. Tan, and M. Gong, “DualGAN: Unsupervised dual
translation [238]. learning for image-to-image translation,” in Proc. IEEE Int. Conf. Com-
put. Vis., 2017, pp. 2849–2857.
[13] Y. A. Mejjati, C. Richardt, J. Tompkin, D. Cosker, and K. I. Kim, “Un-
supervised attention-guided image-to-image translation,” in Proc. Adv.
VII. SUMMARY AND OUTLOOK Neural Inf. Process. Syst., 2018, pp. 3693–3703.
[14] M. Tomei, M. Cornia, L. Baraldi, and R. Cucchiara, “Art2Real: Un-
In recent years, the image-to-image translation (I2I) task has folding the reality of artworks via semantically-aware image-to-image
achieved great success and benefited many computer visual translation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019,
tasks. I2I is attracting increasing attention because of its wide pp. 5849–5859.
[15] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros, “Con-
practical application value and scope. We therefore conduct this text encoders: Feature learning by inpainting,” in Proc. IEEE Conf. Com-
comprehensive review of the analysis, methodology, and related put. Vis. Pattern Recognit., Jun. 2016, pp. 2536–2544.
applications of I2I to clarify the main progress the community [16] J.-Y. Zhu et al., “Toward multimodal image-to-image translation,” in
Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 465–476.
has made. In detail, we first briefly introduce the two most rep- [17] Y. Song et al., “Contextual-based image inpainting: Infer, match, and
resentative generative models that are widely used as the back- translate,” in Proc. Eur. Conf. Comput. Vis., Sep. 2018, pp. 3–19.
bone of I2I and some well-known evaluation metrics. Then, we [18] M.-Y. Liu et al., “Few-shot unsupervised image-to-image translation,” in
Proc. IEEE/CVF Int. Conf. Comput. Vis., Oct. 2019, pp. 10 551–10 560.
elaborate on the methodology of I2I regarding two-domain and [19] L. Zhao et al., “UCTGAN: Diverse image inpainting based on unsuper-
multi-domain tasks. In addition, we provide a thorough taxon- vised cross-space translation,” in Proc. IEEE/CVF Conf. Comput. Vis.
omy of the I2I applications. Pattern Recognit., Jun. 2020, pp. 5741–5750.
[20] H.-Y. Fish Tung, A. W. Harley, W. Seto, and K. Fragkiadaki, “Adversar-
Looking forward, there are still many challenges in I2I, which ial inverse graphics networks: Learning 2D-to-3D lifting and image-to-
need further explorations and investigations. The most iconic image translation from unpaired supervision,” in Proc. IEEE Int. Conf.
dilemma is the trade-off between network complexity and result Comput. Vis., Oct. 2017, pp. 4354–4362.
[21] S. Li, S. Gunel, M. Ostrek, P. Ramdya, P. Fua, and H. Rhodin>,
quality with higher resolution. Similarly, the efficiency should “Deformation-aware unpaired image translation for pose estimation on
also be considered when the I2I framework attempts to gener- laboratory animals,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.,
ate diverse and high-fidelity outputs. We believe that a more Jun. 2020, pp. 13 158–13 168.
[22] R. Zhang et al., “Real-time user-guided image colorization with learned
lightweight I2I network would attract more attention for prac- deep priors,” ACM Trans. Graph., vol. 36, no. 4, pp. 1–11, 2017.
tical application. Moreover, it is an interesting research trend [23] P. L. Suárez, A. D. Sappa, and B. X. Vintimilla, “Infrared image col-
to generalize the aforementioned methods to domains beyond orization based on a triplet DCGAN architecture,” in Proc. IEEE Conf.
Comput. Vis. Pattern Recognit. Workshops, 2017, pp. 18–23.
images, such as those of language, text and speech, termed [24] M. He, D. Chen, J. Liao, P. V. Sander, and L. Yuan, “Deep exemplar-
cross-modality translation tasks. Overall, we hope that this arti- based colorization,” ACM Trans. Graph., vol. 37, no. 4, pp. 1–16,
cle can serve as a basis for the development of better methods 2018.
[25] B. Zhang et al., “Deep exemplar-based video colorization,” in Proc. IEEE
for I2I and inspire researchers in more domains in addition to Conf. Comput. Vis. Pattern Recognit., 2019, pp. 8052–8061.
images.

Authorized licensed use limited to: National Taipei Univ. of Technology. Downloaded on January 09,2025 at [Link] UTC from IEEE Xplore. Restrictions apply.
PANG et al.: IMAGE-TO-IMAGE TRANSLATION: METHODS AND APPLICATIONS 3877

[26] Z. Xu, T. Wang, F. Fang, Y. Sheng, and G. Zhang, “Stylization-based [52] L. Dinh, D. Krueger, and Y. Bengio, “Nice: Non-linear independent com-
architecture for fast deep exemplar colorization,” in Proc. IEEE/CVF ponents estimation,” 2014, arXiv:1410.8516.
Conf. Comput. Vis. Pattern Recognit., 2020, pp. 9363–9372. [53] A. Abdelhamed, M. A. Brubaker, and M. S. Brown, “Noise flow: Noise
[27] J. Lee et al., “Reference-based sketch image colorization using modeling with conditional normalizing flows,” in Proc. IEEE/CVF Int.
augmented-self reference and dense semantic correspondence,” in Proc. Conf. Comput. Vis., Oct. 2019, pp. 3165–3173.
IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2020, pp. 5801– [54] P. Dayan, G. E. Hinton, R. M. Neal, and R. S. Zemel, “The helmholtz
5810. machine,” Neural Comput., vol. 7, no. 5, pp. 889–904, 1995.
[28] Y. Yuan et al., “Unsupervised image super-resolution using cycle-in- [55] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” Proc.
cycle generative adversarial networks,” in Proc. IEEE Conf. Comput. 2nd Int. Conf. Learn. Representations, ICLR 2014, Banff, AB, Canada,
Vis. Pattern Recognit. Workshops, Jun. 2018, pp. 701–710. vol. 1050, p. 1, 2014.
[29] Y. Zhang, S. Liu, C. Dong, X. Zhang, and Y. Yuan, “Multiple cycle- [56] D. J. Rezende, S. Mohamed, and D. Wierstra, “Stochastic backpropaga-
in-cycle generative adversarial networks for unsupervised image super- tion and approximate inference in deep generative models,” in Proc. Int.
resolution,” IEEE Trans. Image Process., vol. 29, pp. 1101–1112, Conf. Mach. Learn., 2014, pp. 1278–1286.
Sep. 2019. [57] H. Larochelle and I. Murray, “The neural autoregressive distribu-
[30] Z. Murez, S. Kolouri, D. Kriegman, R. Ramamoorthi, and K. Kim, “Image tion estimator,” in Proc. 14th Int. Conf. Artif. Intell. Statist., 2011,
to image translation for domain adaptation,” in Proc. IEEE Conf. Comput. pp. 29–37.
Vis. Pattern Recognit., Jun. 2018, pp. 4500–4509. [58] M. Germain, K. Gregor, I. Murray, and H. Larochelle, “Made: Masked
[31] J. Cao et al., “DIDA: Disentangled synthesis for domain adaptation,” autoencoder for distribution estimation,” in Proc. Int. Conf. Mach. Learn.,
CoRR, 2018, arXiv:1805.08019. 2015, pp. 881–889.
[32] A. H. Liu, Y.-C. Liu, Y.-Y. Yeh, and Y.-C. F. Wang, “A unified feature [59] E. Nalisnick, L. Hertel, and P. Smyth, “Approximate inference for deep la-
disentangler for multi-domain image translation and manipulation,” in tent gaussian mixtures,” in Proc. NIPS Workshop Bayesian Deep Learn.,
Proc. Adv. Neural Inf. Process. Syst., 2018, pp. 2590–2599. vol. 2, pp. 131–134, 2016.
[33] Y. Shi, D. Deb, and A. K. Jain, “WarpGAN: Automatic caricature gen- [60] D. Rezende and S. Mohamed, “Variational inference with normalizing
eration,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, flows,” in Proc. Int. Conf. Mach. Learn., 2015, pp. 1530–1538.
pp. 10 762–10 771. [61] J. Tomczak and M. Welling, “Vae with a vampprior,” in Proc. Int. Conf.
[34] M. Pesko, A. Svystun, P. Andruszkiewicz, P. Rokita, and T. Trzcinski, Artif. Intell. Statist., 2018, pp. 1214–1223.
“Comixify: Transform video into comics,” Fundamenta Informaticae, [62] I. Tolstikhin, O. Bousquet, S. Gelly, and B. Schoelkopf, “Wasserstein
vol. 168, no. 2-4, pp. 311–333, 2019. auto-encoders,” in Proc. Int. Conf. Learn. Representations, 2018.
[35] Z. Zheng et al., “Unpaired photo-to-caricature translation on faces in the [63] I. Goodfellow et al., “Generative adversarial nets,” in Proc. Adv. Neural
wild,” Neurocomputing, vol. 355, pp. 71–81, 2019. Inf. Process. Syst., 2014, pp. 2672–2680.
[36] Y. Chen, Y.-K. Lai, and Y.-J. Liu, “CartoonGAN: Generative adversarial [64] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein generative adver-
networks for photo cartoonization,” in Proc. IEEE Conf. Comput. Vis. sarial networks,” in Proc. Int. Conf. Mach. Learn., 2017, pp. 214–223.
Pattern Recognit., 2018, pp. 9465–9474. [65] X. Mao et al., “Least squares generative adversarial networks,” in Proc.
[37] J. Kim, M. Kim, H. Kang, and K. H. Lee, “U-GAT-IT: Unsupervised gen- IEEE Int. Conf. Comput. Vis., 2017, pp. 2794–2802.
erative attentional networks with adaptive layer-instance normalization [66] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation
for image-to-image translation,” in Proc. Int. Conf. Learn. Representa- learning with deep convolutional generative adversarial networks,” in
tions, 2019. Proc. 4th Int. Conf. Learn. Representations, San Juan, Puerto Rico, May
[38] X. Wang and J. Yu, “Learning to cartoonize using white-box cartoon rep- 2016.
resentations,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., [67] M. Mirza and S. Osindero, “Conditional generative adversarial
2020, pp. 8090–8099. nets,” 2014, arXiv:1411.1784.
[39] M. Arar, Y. Ginger, D. Danon, A. H. Bermano, and D. Cohen-Or, [68] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville,
“Unsupervised multi-modal image registration via geometry preserving “Improved training of wasserstein GANs,” in Proc. Adv. Neural Inf. Pro-
image-to-image translation,” in Proc. IEEE/CVF Conf. Comput. Vis. Pat- cess. Syst., 2017, pp. 5767–5777.
tern Recognit., Jun. 2020, pp. 13410–13419. [69] J. J. Zhao, M. Mathieu, and Y. LeCun, “Energy-based generative adver-
[40] Y. LeCun, S. Chopra, R. Hadsell, M. Ranzato, and F. Huang, “A tutorial sarial networks,” in Proc. 5th Int. Conf. Learn. Representations, Toulon,
on energy-based learning,” Predicting Structured Data, vol. 1, no. 0, France, Apr. 2017.
2006. [70] D. Berthelot, T. Schumm, and L. Metz, “BeGAN: Boundary equilibrium
[41] J. Xu, H. Li, and S. Zhou, “An overview of deep generative models,” generative adversarial networks,” 2017, arXiv:1703.10717.
IETE Tech. Rev., vol. 32, no. 2, pp. 131–139, 2015. [71] A. Jolicoeur-Martineau, “The relativistic discriminator: A key element
[42] A. Oussidi and A. Elhassouny, “Deep generative models: Survey,” in missing from standard GAN,” in Proc. Int. Conf. Learn. Representations,
Proc. Int. Conf. Intell. Syst. Comput. Vis., 2018, pp. 1–8. 2018.
[43] H.-M. Chu, C.-K. Yeh, and Y.-C. Frank Wang, “Deep generative mod- [72] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida, “Spectral normal-
els for weakly-supervised multi-label classification,” in Proc. Eur. Conf. ization for generative adversarial networks,” in Proc. Int. Conf. Learn.
Comput. Vis., 2018, pp. 400–415. Representations, 2018.
[44] R. A. Yeh et al., “Semantic image inpainting with deep generative [73] P. Ghosh, M. S. Sajjadi, A. Vergari, M. Black, and B. Scholkopf, “From
models,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, variational to deterministic autoencoders,” in Int. Conf. Learn. Represen-
pp. 5485–5493. tations, 2019.
[45] M. Tschannen, E. Agustsson, and M. Lucic, “Deep generative models [74] M.-Y. Liu, T. Breuel, and J. Kautz, “Unsupervised image-to-image trans-
for distribution-preserving lossy compression,” in Proc. Adv. Neural Inf. lation networks,” in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 700–
Process. Syst., 2018, pp. 5929–5940. 708.
[46] R. Salakhutdinov and G. Hinton, “Deep boltzmann machines,” in Proc. [75] H.-Y. Lee, H.-Y. Tseng, J.-B. Huang, M. Singh, and M.-H. Yang, “Diverse
Artif. Intell. Statist., 2009, pp. 448–455. image-to-image translation via disentangled representations,” in Proc.
[47] R. Salakhutdinov, A. Mnih, and G. Hinton, “Restricted boltzmann ma- Eur. Conf. Comput. Vis., 2018, pp. 35–51.
chines for collaborative filtering,” in Proc. 24th Int. Conf. Mach. Learn., [76] L. Ma, X. Jia, S. Georgoulis, T. Tuytelaars, and L. Van Gool, “Exemplar
2007, pp. 791–798. guided unsupervised image-to-image translation with semantic consis-
[48] G. E. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algorithm for tency,” in Proc. Int. Conf. Learn. Representations, 2018.
deep belief nets,” Neural Comput., vol. 18, no. 7, pp. 1527–1554, 2006. [77] W. Wu, K. Cao, C. Li, C. Qian, and C. C. Loy, “Transgaga: Geometry-
[49] A. Van den Oord et al., “Conditional image generation with pix- aware unsupervised image-to-image translation,” in Proc. IEEE Conf.
elcnn decoders,” in Proc. Adv. Neural Inf. Process. Syst., 2016, Comput. Vis. Pattern Recognit., 2019, pp. 8012–8021.
pp. 4790–4798. [78] H. Kazemi, S. Soleymani, F. Taherkhani, S. Iranmanesh, and N.
[50] A. Van Oord, N. Kalchbrenner, and K. Kavukcuoglu, “Pixel recurrent Nasrabadi, “Unsupervised image-to-image translation using domain-
neural networks,” in Proc. Int. Conf. Mach. Learn., 2016, pp. 1747–1756. specific variational information bound,” in Proc. Adv. Neural Inf. Process.
[51] X. Chen, N. Mishra, M. Rohaninejad, and P. Abbeel, “Pixelsnail: An Syst., 2018, pp. 10 348–10 358.
improved autoregressive generative model,” in Proc. Int. Conf. Mach. [79] J. Lin, Y. Xia, S. Liu, T. Qin, and Z. Chen, “ZSTGAN: An adversarial
Learn., 2018, pp. 864–872. approach for unsupervised zero-shot image-to-image translation,” 2019,
arXiv:1906.00184.

Authorized licensed use limited to: National Taipei Univ. of Technology. Downloaded on January 09,2025 at [Link] UTC from IEEE Xplore. Restrictions apply.
3878 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 24, 2022

[80] J. Lin, Y. Pang, Y. Xia, Z. Chen, and J. Luo, “TuiGAN: Learning versa- [106] A. Gonzalez-Garcia, J. Van De Weijer, and Y. Bengio, “Image-to-image
tile image-to-image translation with two unpaired images,” in Proc. Eur. translation for cross-domain disentanglement,” in Proc. Adv. Neural Inf.
Conf. Comput. Vis. Cham: Springer, 2020, pp. 18–35. Process. Syst., 2018, pp. 1287–1298.
[81] S. Pal and S. Mitra, “Multilayer perceptron, fuzzy sets, and classification,” [107] A. Mustafa and R. [Link], “Transformation consistency regulariza-
IEEE Trans. Neural Netw., vol. 3, no. 5, pp. 683–697, Sep. 1992. tion - a semi-supervised paradigm for image-to-image translation,” Com-
[82] Y. LeCun et al., “Backpropagation applied to handwritten zip code recog- puter Vision–ECCV 2020: 16th European Conference, Glasgow, UK,
nition,” Neural Comput., vol. 1, no. 4, pp. 541–551, 1989. Aug. 23–28, 2020, Proceedings, Part XVIII 16, A. Vedaldi, H. Bischof,
[83] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” T. Brox, and J.-M. Frahm, Eds. Cham: Springer International Publishing,
in Proc. 3rd Int. Conf. Learn. Representations, San Diego, CA, USA, 2020, pp. 599–615.
May 2015. [108] Y. Taigman, A. Polyak, and L. Wolf, “Unsupervised cross-domain im-
[84] T. Tieleman and G. Hinton, “Lecture 6.5-RmsProp: Divide the gradient age generation,” in Proc. 5th Int. Conf. Learn. Representations, Toulon,
by a running average of its recent magnitude,” COURSERA: Neural Netw. France, Apr. 2017.
Mach. Learn., vol. 4, no. 2, pp. 26–31, 2012. [109] M. Li et al., “Unsupervised image-to-image translation with stacked
[85] R. Zhang, P. Isola, and A. A. Efros, “Colorful image colorization,” in cycle-consistent adversarial networks,” in Proc. Eur. Conf. Comput.
Proc. Eur. Conf. Comput. Vis. Cham: Springer, 2016, pp. 649–666. Vis., 2018, pp. 184–199.
[86] C. Wang et al., “Discriminative region proposal adversarial networks for [110] A. Gokaslan, V. Ramanujan, D. Ritchie, K. In Kim, and J. Tompkin, “Im-
high-quality image-to-image translation,” in Proc. Eur. Conf. Comput. proving shape deformation in unsupervised image-to-image translation,”
Vis., Sep. 2018, pp. 770–785. in Proc. Eur. Conf. Comput. Vis., 2018, pp. 649–665.
[87] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality [111] M. Amodio and S. Krishnaswamy, “TravelGAN: Image-to-image trans-
assessment: From error visibility to structural similarity,” IEEE Trans. lation by transformation vector learning,” in Proc. IEEE Conf. Comput.
Image Process., vol. 13, no. 4, pp. 600–612, Apr. 2004. Vis. Pattern Recognit., 2019, pp. 8983–8992.
[88] T. Salimans et al., “Improved techniques for training gans,” in Proc. Adv. [112] Y. Zhao, R. Wu, and H. Dong, “Unpaired image-to-image translation us-
Neural Inf. Process. Syst., 2016, pp. 2234–2242. ing adversarial consistency loss,” in Proc. Eur. Conf. Comput. Vis. Cham:
[89] L. Ma et al., “Pose guided person image generation,” in Proc. Adv. Neural Springer, 2020, pp. 800–815.
Inf. Process. Syst., vol. 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wal- [113] O. Katzir, D. Lischinski, and D. Cohen-Or, “Cross-domain cascaded deep
lach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds. Curran Associates, translation,” Computer Vision - ECCV 2020-16th European Conference,
Inc., 2017. Glasgow, U.K., Aug. 23-28, 2020, Proceedings, Part II, Ser. Lecture Notes
[90] X. Huang, M.-Y. Liu, S. Belongie, and J. Kautz, “Multimodal unsuper- in Computer Science, vol. 12347, A. Vedaldi, H. Bischof, T. Brox, and J.
vised image-to-image translation,” in Proc. Eur. Conf. Comput. Vis., 2018, Frahm, Eds. Springer, 2020, pp. 673–689.
pp. 172–189. [114] S. Benaim and L. Wolf, “One-sided unsupervised domain mapping,” in
[91] J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style Proc. Adv. Neural Inf. Process. Syst., vol. 30, I. Guyon, U. V. Luxburg,
transfer and super-resolution,” in Proc. Eur. Conf. Comput. Vis. Cham: S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds.
Springer, 2016, pp. 694–711. Curran Associates, Inc., 2017.
[92] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, [115] H. Fu et al., “Geometry-consistent generative adversarial networks for
“GANs trained by a two time-scale update rule converge to a local nash one-sided unsupervised domain mapping,” in Proc. IEEE/CVF Conf.
equilibrium,” in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 6626– Comput. Vis. Pattern Recognit., 2019, pp. 2427–2436.
6637. [116] T. Park, A. A. Efros, R. Zhang, and J.-Y. Zhu, “Contrastive learning for
[93] M. Bińkowski, D. J. Sutherland, M. Arbel, and A. Gretton, “Demystifying unpaired image-to-image translation,” in Proc. Eur. Conf. Comput. Vis.
MMD GANs,” in Proc. Int. Conf. Learn. Representations, 2018. Cham: Springer, 2020, pp. 319–345.
[94] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethink- [117] T. Park et al., “Swapping autoencoder for deep image manipulation,” Adv.
ing the inception architecture for computer vision,” in Proc. IEEE Conf. Neural Inf. Proc. Syst., Curran Associates, Inc., vol. 33, pp. 7198–7211,
Comput. Vis. Pattern Recognit., 2016, pp. 2818–2826. 2020.
[95] T. R. Shaham, T. Dekel, and T. Michaeli, “SinGAN: Learning a generative [118] L. Jiang et al., “Tsit: A simple and versatile framework for image-to-
model from a single natural image,” in Proc. IEEE Int. Conf. Comput. image translation,” in Proc. Eur. Conf. Comput. Vis. Cham: Springer,
Vis., 2019, pp. 4570–4580. 2020, pp. 206–222.
[96] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unrea- [119] C. Zheng, T.-J. Cham, and J. Cai, “The spatially-correlative loss for vari-
sonable effectiveness of deep features as a perceptual metric,” in Proc. ous image translation tasks,” Proc. IEEE/CVF Conf. Compu. Vis. Pattern
IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 586–595. Recognition (CVPR), Jun. 2021, pp. 16407–16417.
[97] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks [120] J. Liang, H. Zeng, and L. Zhang, “High-resolution photorealistic image
for semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern translation in real-time: A laplacian pyramid translation network,” Proc.
Recognit., 2015, pp. 3431–3440. IEEE/CVF Conf. Compu. Vis. Pattern Recognition (CVPR), Jun. 2021,
[98] M. F. Naeem, S. J. Oh, Y. Uh, Y. Choi, and J. Yoo, “Reliable fidelity pp. 9392–9400.
and diversity metrics for generative models,” in Proc. Int. Conf. Mach. [121] S. Ma, J. Fu, C. W. Chen, and T. Mei, “Da-GAN: Instance-
Learn., 2020, pp. 7176–7185. level image translation by deep attention generative adversarial net-
[99] T.-C. Wang et al., “High-resolution image synthesis and semantic manip- works,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018,
ulation with conditional gans,” in Proc. IEEE Conf. Comput. Vis. Pattern pp. 5657–5666.
Recognit., 2018, pp. 8798–8807. [122] X. Chen, C. Xu, X. Yang, and D. Tao, “Attention-GAN for object trans-
[100] B. AlBahar and J.-B. Huang, “Guided image-to-image translation with figuration in wild images,” in Proc. Eur. Conf. Comput. Vis., Sep. 2018,
bi-directional feature transformation,” in Proc. IEEE/CVF Int. Conf. pp. 164–180.
Comput. Vis., Oct. 2019, pp. 9016–9025. [123] S. Mo, M. Cho, and J. Shin, “InstaGAN: Instance-aware image-to-image
[101] H. Tang et al., “Multi-channel attention selection GAN with cascaded translation,” in Proc. Int. Conf. Learn. Representations, 2018.
semantic guidance for cross-view image translation,” in Proc. IEEE Conf. [124] Z. Shen, M. Huang, J. Shi, X. Xue, and T. S. Huang, “Towards instance-
Comput. Vis. Pattern Recognit., 2019, pp. 2417–2426. level image-to-image translation,” in Proc. IEEE Conf. Comput. Vis. Pat-
[102] P. Zhang, B. Zhang, D. Chen, L. Yuan, and F. Wen, “Cross- tern Recognit., 2019, pp. 3683–3692.
domain correspondence learning for exemplar-based image transla- [125] D. Bhattacharjee, S. Kim, G. Vizier, and M. Salzmann, “DUNIT:
tion,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, Detection-based unsupervised image-to-image translation,” in Proc.
pp. 5143–5153. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2020, pp. 4787–
[103] X. Zhou et al., “CoCosNet v2: Full-resolution correspondence learn- 4796.
ing for image translation,” Proc. IEEE/CVF Conf. Compu. Vis. Pattern [126] W. Cho, S. Choi, D. K. Park, I. Shin, and J. Choo, “Image-to-image
Recognition (CVPR), Jun. 2021, pp. 11465–11475. translation via group-wise deep whitening-and-coloring transformation,”
[104] T. R. Shaham, M. Gharbi, R. Zhang, E. Shechtman, and T. Michaeli, in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 10 639–
“Spatially-adaptive pixelwise networks for fast image translation,” Proc. 10 647.
IEEE/CVF Conf. Compu. Vis. Pattern Recognition (CVPR), Jun. 2021, [127] T. F. van der Ouderaa and D. E. Worrall, “Reversible GANs for memory-
pp. 14882–14891. efficient image-to-image translation,” in Proc. IEEE Conf. Comput. Vis.
[105] A. Bansal, Y. Sheikh, and D. Ramanan, “PixeINN: Example-based image Pattern Recognit., 2019, pp. 4720–4728.
synthesis,” in Proc. Int. Conf. Learn. Representations, 2018.

Authorized licensed use limited to: National Taipei Univ. of Technology. Downloaded on January 09,2025 at [Link] UTC from IEEE Xplore. Restrictions apply.
PANG et al.: IMAGE-TO-IMAGE TRANSLATION: METHODS AND APPLICATIONS 3879

[128] H. Chen et al., “Distilling portable generative adversarial networks for [154] I. Goodfellow, “Nips 2016 tutorial: Generative adversarial networks,”
image translation,” in Proc. AAAI Conf. Artif Intell., vol. 34, no. 04, 2020, 2017.
pp. 3585–3592. [155] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of
[129] Y.-C. Chen, X. Xu, and J. Jia, “Domain adaptive image-to-image data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507,
translation,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2006.
Jun. 2020, pp. 5274–5283. [156] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther, “Au-
[130] R. Chen, W. Huang, B. Huang, F. Sun, and B. Fang, “Reusing dis- toencoding beyond pixels using a learned similarity metric,” in Proc. Int.
criminators for encoding: Towards unsupervised image-to-image trans- Conf. Mach. Learn., 2016, pp. 1558–1566.
lation,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, [157] X. Chen et al., “InfoGAN: Interpretable representation learning by in-
pp. 8168–8177. formation maximizing generative adversarial nets,” in Proc. Adv. Neural
[131] A. Almahairi, S. Rajeshwar, A. Sordoni, P. Bachman, and A. Courville, Inf. Process. Syst., 2016, pp. 2172–2180.
“Augmented cyclegan: Learning many-to-many mappings from un- [158] J. Donahue, P. Krähenbühl, and T. Darrell, “Adversarial feature learning,”
paired data,” in Proc. Mach. Learn. Res., vol. 80, J. Dy and A. Proc. 5th Int. Conf. Learn. Representations, ICLR 2017, Toulon, France:
Krause, Eds. Stockholmsmässan, Stockholm Sweden: PMLR, Jul. 2018, [Link]. Apr. 2017.
pp. 195–204. [159] V. Dumoulin et al., “Adversarially learned inference,” Proc. 5th Int. Conf.
[132] J. Lin, Y. Xia, T. Qin, Z. Chen, and T.-Y. Liu, “Conditional image-to- Learn. Representations, ICLR 2017, vol. 1050, Apr. 2017.
image translation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., [160] I. Higgins et al., “Beta-VAE: Learning basic visual concepts with a con-
Jun. 2018, pp. 5524–5532. strained variational framework,” in Proc. Int. Conf. Learn. Representa-
[133] Q. Mao, H.-Y. Lee, H.-Y. Tseng, S. Ma, and M.-H. Yang, “Mode seeking tions, 2017.
generative adversarial networks for diverse image synthesis,” in Proc. [161] H. Kim and A. Mnih, “Disentangling by factorising,” in Proc. Int. Conf.
IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 1429–1437. Mach. Learn., 2018, pp. 2649–2658.
[134] Y. Alharbi, N. Smith, and P. Wonka, “Latent filter scaling for multimodal [162] E. L. Denton and V. Birodkar, “Unsupervised learning of disentan-
unsupervised image-to-image translation,” in Proc. IEEE Conf. Comput. gled representations from video,” in Proc. Adv. Neural Inf. Process.
Vis. Pattern Recognit., 2019, pp. 1458–1466. Syst. 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus,
[135] H.-Y. Chang, Z. Wang, and Y.-Y. Chuang, “Domain-specific mappings S. Vishwanathan, and R. Garnett, Eds. Curran Associates, Inc., 2017,
for generative adversarial style transfer,” in Proc. Eur. Conf. Comput. Vis. pp. 4414–4423.
Cham: Springer, 2020, pp. 573–589. [163] A. V. D. Oord, Y. Li, and O. Vinyals, “Representation learning with
[136] Y. Wang et al., “Transferring GANs: Generating images from limited contrastive predictive coding,” 2018, arXiv:1807.03748.
data,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 218–234. [164] L. Hui, X. Li, J. Chen, H. He, and J. Yang, “Unsupervised multi-domain
[137] J. Lin, Y. Wang, T. He, and Z. Chen, “Learning to transfer: Unsupervised image translation with domain-specific encoders/decoders,” in Proc. 24th
meta domain translation,” Proc. AAAI Conf. Artif. Intell., vol. 34, no. 7, Int. Conf. Pattern Recognit., 2018, pp. 2044–2049.
pp. 11507–11514, Apr. 2020. [165] B. Zhao, B. Chang, Z. Jie, and L. Sigal, “Modular generative adversarial
[138] Y. Li, R. Zhang, J. C. Lu, and E. Shechtman, “Few-shot image generation networks,” in Proc. Eur. Conf. Comput. Vis., Sep. 2018, pp. 150–165.
with elastic weight consolidation,” in Proc. Adv. Neural Inf. Process. [166] Y. Choi et al., “StarGAN: Unified generative adversarial networks for
Syst., 2020, pp. 15885–15896. multi-domain image-to-image translation,” in Proc. IEEE Conf. Comput.
[139] U. Ojha et al., “Few-shot image generation via cross-domain correspon- Vis. Pattern Recognit., Jun. 2018, pp. 8789–8797.
dence,” Proc. IEEE/CVF Conf. Compu. Vis. Pattern Recognition (CVPR), [167] Z. He, W. Zuo, M. Kan, S. Shan, and X. Chen, “AttGAN: Facial attribute
Jun. 2021, pp. 10 743–10 752. editing by only changing what you want,” IEEE Trans. Image Process.,
[140] S. Benaim and L. Wolf, “One-shot unsupervised cross domain transla- vol. 28, no. 11, pp. 5464–5478, Nov. 2019.
tion,” in Proc. Adv. Neural Inf. Process. Syst., 2018, pp. 2104–2114. [168] P.-W. Wu, Y.-J. Lin, C.-H. Chang, E. Y. Chang, and S.-W. Liao, “REL-
[141] T. Cohen and L. Wolf, “Bidirectional one-shot unsupervised domain map- GAN: Multi-domain image-to-image translation via relative attributes,”
ping,” in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 1784–1792. in Proc. IEEE/CVF Int. Conf. Comput. Vis., Oct. 2019, pp. 5914–5922.
[142] R. Navigli, “Word sense disambiguation: A survey,” ACM Comput. Surv., [169] M. Liu et al., “STGAN: A unified selective transfer network for arbitrary
vol. 41, no. 2, pp. 1–69, 2009. image attribute editing,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern
[143] G.-J. Qi and J. Luo, “Small data challenges in big data era: A survey of Recognit., Jun. 2019, pp. 3673–3682.
recent progress on unsupervised and semi-supervised methods,” IEEE [170] D. Lee, J. Kim, W.-J. Moon, and J. C. Ye, “CollaGAN: Collaborative
Trans. Pattern Anal. Mach. Intell., pp. 1–1, 2020. GAN for missing image data imputation,” in Proc. IEEE/CVF Conf. Com-
[144] L. Schmarje, M. Santarossa, S.-M. Schröder, and R. Koch, “A survey put. Vis. Pattern Recognit., Jun. 2019, pp. 2487–2496.
on semi-, self- and unsupervised learning for image classification,” IEEE [171] J. Lin, Y. Xia, Y. Wang, T. Qin, and Z. Chen, “Image-to-image translation
Access, vol. 9, pp. 82146–82168, 2021. with multi-path consistency regularization,” in Proc. Int. Joint Conf. Artif.
[145] M. Shi and B. Zhang, “Semi-supervised learning improves gene Intell., 2019, pp. 2980–2986.
expression-based prediction of cancer recurrence,” Bioinformatics, [172] S. Chang, S. Park, J. Yang, and N. Kwak, “Sym-parameterized dynamic
vol. 27, no. 21, pp. 3017–3023, 2011. inference for mixed-domain image translation,” in Proc. IEEE/CVF Int.
[146] D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling, “Semi- Conf. Comput. Vis., Oct. 2019, pp. 4803–4811.
supervised learning with deep generative models,” in Proc. Adv. Neural [173] M. M. R. Siddiquee et al., “Learning fixed points in generative adver-
Inf. Process. Syst., 2014, pp. 3581–3589. sarial networks: From image-to-image translation to disease detection
[147] A. Rasmus, M. Berglund, M. Honkala, H. Valpola, and T. Raiko, “Semi- and localization,” in Proc. IEEE/CVF Int. Conf. Comput. Vis., Oct. 2019,
supervised learning with ladder networks,” in Proc. Adv. Neural Inf. Pro- pp. 191–200.
cess. Syst., 2015, pp. 3546–3554. [174] T. He et al., “Deliberation learning for image-to-image translation.” in
[148] D. Berthelot et al., “Mixmatch: A holistic approach to semi-supervised Proc. Int. Joint Conf. Artif. Intell., 2019, pp. 2484–2490.
learning,” in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 5049–5059. [175] R. Wu, X. Tao, X. Gu, X. Shen, and J. Jia, “Attribute-driven spontaneous
[149] R. Zhang, T. Che, Z. Ghahramani, Y. Bengio, and Y. Song, “MetaGAN: motion in unpaired image translation,” in Proc. IEEE/CVF Int. Conf.
An adversarial approach to few-shot learning,” in Proc. Adv. Neural Inf. Comput. Vis., Oct. 2019, pp. 5923–5932.
Process. Syst., 2018, pp. 2365–2374. [176] J. Cao, H. Huang, Y. Li, R. He, and Z. Sun, “Informative sample min-
[150] Q. Sun, Y. Liu, T.-S. Chua, and B. Schiele, “Meta-transfer learning for ing network for multi-domain image-to-image translation,” in Proc. Eur.
few-shot learning,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Conf. Comput. Vis. Cham: Springer, 2020, pp. 404–419.
2019, pp. 403–412. [177] A. Pumarola, A. Agudo, A. M. Martinez, A. Sanfeliu, and F. Moreno-
[151] J. Snell, K. Swersky, and R. Zemel, “Prototypical networks for few-shot Noguer, “Ganimation: Anatomically-aware facial animation from a sin-
learning,” in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 4077–4087. gle image,” in Proc. Eur. Conf. Comput. Vis., Sep. 2018, pp. 818–833.
[152] F. Sung et al., “Learning to compare: Relation network for few-shot [178] J. Lin et al., “Exploring explicit domain supervision for latent space
learning,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, disentanglement in unpaired image-to-image translation,” IEEE Trans.
pp. 1199–1208. Pattern Anal. Mach. Intell., vol. 43, no. 4, pp. 1254–1266, Apr. 2021.
[153] A. Hertzmann, C. E. Jacobs, N. Oliver, B. Curless, and D. H. Salesin, [179] X. Yu, Y. Chen, S. Liu, T. Li, and G. Li, “Multi-mapping image-to-
“Image analogies,” in Proc. 28th Annu. Conf. Comput. Graph. Interactive image translation via learning disentanglement,” in Proc. Adv. Neural
Techn., 2001, pp. 327–340. Inf. Process. Syst., 2019, pp. 2994–3004.

Authorized licensed use limited to: National Taipei Univ. of Technology. Downloaded on January 09,2025 at [Link] UTC from IEEE Xplore. Restrictions apply.
3880 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 24, 2022

[180] Y. Choi, Y. Uh, J. Yoo, and J.-W. Ha, “StarGAN V2: Diverse image [205] K. Armanious et al., “MedGAN: Medical image translation using GANs,”
synthesis for multiple domains,” in Proc. IEEE/CVF Conf. Comput. Vis. Computerized Med. Imag. Graph., vol. 79, 2020, Art. no. 101684.
Pattern Recognit., 2020, pp. 8188–8197. [206] I. Manakov et al., “Noise as domain shift: Denoising medical images
[181] H.-Y. Lee et al., “DRIT: Diverse image-to-image translation via dis- by unpaired image translation,” Domain Adaptation and Representation
entangled representations,” Int. J. Comput. Vis., vol. 128, no. 10, Transfer and Medical Image Learning with Less Labels and Imperfect
pp. 2402–2417, 2020. Data. Cham: Springer, 2019, pp. 3–10.
[182] Y. Liu et al., “GMM-unit: Unsupervised multi-domain and multi-modal [207] H. Touvron, M. Douze, M. Cord, and H. Jégou, “Powers of layers for
image-to-image translation via attribute gaussian mixture modeling,” image-to-image translation,” 2020, arXiv:2008.05763.
2020, arXiv:2003.06788. [208] H. Zhang, V. Sindagi, and V. M. Patel, “Image de-raining using a condi-
[183] K. Saito, K. Saenko, and M.-Y. Liu, “COCO-FUNIT: Few-shot unsu- tional generative adversarial network,” IEEE Trans. Circuits Syst. Video
pervised image translation with a content conditioned style encoder,” Technol., vol. 30, no. 11, pp. 3943–3956, Nov. 2020.
Computer Vision-ECCV 2020: 16th European Conference, Glasgow, UK, [209] R. Li, L.-F. Cheong, and R. T. Tan, “Heavy rain image restoration: In-
Aug. 23-28, 2020, Proceedings, Part III 16, A. Vedaldi, H. Bischof, T. tegrating physics model and conditional adversarial learning,” in Proc.
Brox, and J.-M. Frahm, Eds. Cham: Springer International Publishing, IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 1633–1642.
2020, pp. 382–398. [210] H. Zhu et al., “Singe image rain removal with unpaired information:
[184] X. Li et al., “Attribute guided unpaired image-to-image translation with A differentiable programming perspective,” in Proc. AAAI Conf. Artif
semi-supervised learning,” 2019, arXiv:1904.12428. Intell., vol. 33, 2019, pp. 9332–9339.
[185] Y. Wang, S. Khan, A. Gonzalez-Garcia, J. van de Weijer, and F. S. Khan, [211] A. Dudhane, H. S. Aulakh, and S. Murala, “Ri-GAN: An end-to-end net-
“Semi-supervised learning for few-shot image-to-image translation,” work for single image haze removal,” in Proc. IEEE/CVF Conf. Comput.
in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2020, Vis. Pattern Recognit. Workshops, 2019, pp. 2014–2023.
pp. 4453–4462. [212] D. Engin, A. Genç, and H. Kemal Ekenel, “Cycle-dehaze: Enhanced
[186] Z. Liu, P. Luo, X. Wang, and X. Tang, “Deep learning face attributes in the cyclegan for single image dehazing,” in Proc. IEEE Conf. Comput. Vis.
wild,” in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2015, pp. 3730–3738. Pattern Recognit. Workshops, 2018, pp. 825–833.
[187] X. Huang and S. Belongie, “Arbitrary style transfer in real-time with [213] Y. Cho, R. Malav, G. Pandey, and A. Kim, “DehazeGAN: Underwa-
adaptive instance normalization,” in Proc. IEEE Int. Conf. Comput. Vis., ter haze image restoration using unpaired image-to-image translation,”
2017, pp. 1501–1510. IFAC-PapersOnLine, vol. 52, no. 21, pp. 82–85, 2019.
[188] A. Yu and K. Grauman, “Fine-grained visual comparisons with local [214] Y.-F. Chen, A. K. Patel, and C.-P. Chen, “Image haze removal by adaptive
learning,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014, cyclegan,” in Proc. Asia-Pacific Signal Inf. Process. Assoc. Annu. Summit
pp. 192–199. Conf., 2019, pp. 1122–1127.
[189] Y. Li et al., “Asymmetric GAN for unpaired image-to-image trans- [215] Y. Cho, H. Jang, R. Malav, G. Pandey, and A. Kim, “Underwater im-
lation,” IEEE Trans. Image Process., vol. 28, no. 12, pp. 5881–5896, age dehazing via unpaired image-to-image translation,” Int. J. Control,
Dec. 2019. Automat. Syst., vol. 18, no. 3, pp. 605–614, 2020.
[190] Y. Yan, J. Xu, B. Ni, W. Zhang, and X. Yang, “Skeleton-aided articulated [216] O. Kupyn, V. Budzan, M. Mykhailych, D. Mishkin, and J. Matas, “De-
motion generation,” in Proc. 25th ACM Int. Conf. Multimedia. New York, blurGAN: Blind motion deblurring using conditional adversarial net-
NY, USA: Association for Computing Machinery, 2017, pp. 199–207. works,” in Proc. IEEE Conf. Comput. Vis. pattern Recognit., 2018,
[191] A. Siarohin, E. Sangineto, S. Lathuiliere, and N. Sebe, “Deformable pp. 8183–8192.
GANs for pose-based human image generation,” in Proc. IEEE Conf. [217] H. Liu, P. Navarrete Michelini, and D. Zhu, “Deep networks for image-
Comput. Vis. Pattern Recognit., 2018, pp. 3408–3416. to-image translation with mux and demux layers,” in Proc. Eur. Conf.
[192] L. Ma et al., “Disentangled person image generation,” in Proc. IEEE Comput. Vis. Workshops, Sep. 2018.
Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 99–108. [218] O. Kupyn, T. Martyniuk, J. Wu, and Z. Wang, “DeblurGAN-V2: Deblur-
[193] P. Esser, E. Sutter, and B. Ommer, “A variational u-net for conditional ring (orders-of-magnitude) faster and better,” in Proc. IEEE Int. Conf.
appearance and shape generation,” in Proc. IEEE Conf. Comput. Vis. Comput. Vis., 2019, pp. 8878–8887.
Pattern Recognit., Jun. 2018, pp. 8857–8866. [219] T. Madam Nimisha, K. Sunil, and A. Rajagopalan, “Unsupervised class-
[194] H. Dong et al., “Towards multi-pose guided virtual try-on network,” in specific deblurring,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 353–369.
Proc. IEEE/CVF Int. Conf. Comput. Vis., Oct. 2019, pp. 9026–9035. [220] A. Ignatov, N. Kobyshev, R. Timofte, K. Vanhoey, and L. Van Gool,
[195] J. Huang, J. Liao, and S. Kwong, “Semantic example guided image- “DSLR-quality photos on mobile devices with deep convolutional net-
to-image translation,” IEEE Trans. Multimedia, vol. 23, pp. 1654–1665, works,” in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 3277–3285.
2021. [221] E. de Stoutz, A. Ignatov, N. Kobyshev, R. Timofte, and L. Van Gool,
[196] W. Chen and J. Hays, “Sketchygan: Towards diverse and realistic sketch “Fast perceptual image enhancement,” in Proc. Eur. Conf. Comput. Vis.,
to image synthesis,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018.
2018, pp. 9416–9425. [222] Y.-S. Chen, Y.-C. Wang, M.-H. Kao, and Y.-Y. Chuang, “Deep photo
[197] Z. Li, C. Deng, E. Yang, and D. Tao, “Staged sketch-to-image synthe- enhancer: Unpaired learning for image enhancement from photographs
sis via semi-supervised generative adversarial networks,” IEEE Trans. with GANs,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018,
Multimedia, vol. 23, pp. 2694–2705, 2021. pp. 6306–6314.
[198] A. Shocher et al., “Semantic pyramid for image generation,” in Proc. [223] R. Zheng, Z. Luo, and B. Yan, “Exploiting time-series image-to-image
IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2020, pp. 7457– translation to expand the range of wildlife habitat analysis,” in Proc. AAAI
7466. Conf. Artif Intell., vol. 33, 2019, pp. 825–832.
[199] H. Chang, J. Lu, F. Yu, and A. Finkelstein, “Pairedcyclegan: Asymmetric [224] R. Zhang, T. Pfister, and J. Li, “Harmonic unpaired image-to-image trans-
style transfer for applying and removing makeup,” in Proc. IEEE Conf. lation,” in Proc. Int. Conf. Learn. Representations, 2018.
Comput. Vis. Pattern Recognit., 2018, pp. 40–48. [225] M. M. R. Siddiquee et al., “Learning fixed points in generative adversar-
[200] H. Emami, M. M. Aliabadi, M. Dong, and R. B. Chinnam, “SPA-GAN: ial networks: From image-to-image translation to disease detection and
Spatial attention GAN for image-to-image translation,” IEEE Trans. Mul- localization,” in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 191–200.
timedia, vol. 23, pp. 391–401, 2020. [226] K. Armanious et al., “Unsupervised medical image translation using
[201] A. Bansal, S. Ma, D. Ramanan, and Y. Sheikh, “Recycle-GAN: Unsu- cycle-medgan,” in Proc. 27th Eur. Signal Process. Conf., 2019, pp. 1–5.
pervised video retargeting,” in Proc. Eur. Conf. Comput. Vis., Sep. 2018, [227] S. Kaji and S. Kida, “Overview of image-to-image translation by use of
pp. 119–135. deep neural networks: Denoising, super-resolution, modality conversion,
[202] L. Zhang et al., “Nested scale-editing for conditional image synthesis,” and reconstruction in medical imaging,” Radiological Phys. Technol.,
in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2020, vol. 12, no. 3, pp. 235–248, 2019.
pp. 5477–5487. [228] S. Engelhardt, R. De Simone, P. M. Full, M. Karck, and I. Wolf, “Improv-
[203] H. Su et al., “MangaGAN: Unpaired photo-to-manga translation based ing surgical training phantoms by hyperrealism: Deep unpaired image-
on the methodology of manga drawing,” in Proc. AAAI Conf. Artif Intell., to-image translation from real surgeries,” in Proc. Int. Conf. Med. Image
vol. 35, no. 3, 2021, pp. 2611–2619. Comput. Comput.- Assist. Interv. Cham: Springer, 2018, pp. 747–755.
[204] Y. Gao and J. Wu, “GAN-based unpaired chinese character image trans- [229] S. Gamrian and Y. Goldberg, “Transfer learning for related reinforcement
lation via skeleton transformation and stroke rendering,” in Proc. AAAI learning tasks via image-to-image translation,” in Proc. Int. Conf. Mach.
Conf. Artif Intell., vol. 34, no. 01, 2020, pp. 646–653. Learn., 2019, pp. 2063–2072.

Authorized licensed use limited to: National Taipei Univ. of Technology. Downloaded on January 09,2025 at [Link] UTC from IEEE Xplore. Restrictions apply.
PANG et al.: IMAGE-TO-IMAGE TRANSLATION: METHODS AND APPLICATIONS 3881

[230] W. Deng et al., “Image-image domain adaptation with preserved self- [255] J. Zhang et al., “Dual in-painting model for unsupervised gaze correction
similarity and domain-dissimilarity for person re-identification,” in Proc. and animation in the wild,” in Proc. 28th ACM Int. Conf. Multimedia. New
IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 994–1003. York, NY, USA: Association for Computing Machinery, 2020, pp. 1588–
[231] Z. Zhong, L. Zheng, Z. Zheng, S. Li, and Y. Yang, “Camera style adap- 1596.
tation for person re-identification,” in Proc. IEEE Conf. Comput. Vis. [256] S. Hicsonmez, N. Samet, E. Akbas, and P. Duygulu, “Ganilla: Generative
Pattern Recognit., Jun. 2018, pp. 5157–5166. adversarial networks for image to illustration translation,” Image Vis.
[232] Z. Zhong, L. Zheng, S. Li, and Y. Yang, “Generalizing a person retrieval Comput., vol. 95, 2020, Art. no. 103886.
model hetero- and homogeneously,” in Proc. Eur. Conf. Comput. Vis., [257] X. Guo et al., “FuseGAN: Learning to fuse multi-focus image via condi-
Sep. 2018, pp. 172–188. tional generative adversarial network,” IEEE Trans. Multimedia, vol. 21,
[233] Z. Zheng et al., “Joint discriminative and generative learning for person no. 8, pp. 1982–1996, Aug. 2019.
re-identification,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recog- [258] L. Ding, H. Tang, Y. Liu, Y. Shi, X. X. Zhu, and L. Bruzzone, “Adversarial
nit., Jun. 2019, pp. 2138–2147. shape learning for building extraction in VHR remote sensing images,”
[234] J. Liu, Z.-J. Zha, D. Chen, R. Hong, and M. Wang, “Adaptive transfer 2021, arXiv:2102.11262.
network for cross-domain person re-identification,” in Proc. IEEE/CVF
Conf. Comput. Vis. Pattern Recognit., Jun. 2019, pp. 7202–7211.
[235] M. Sela, E. Richardson, and R. Kimmel, “Unrestricted facial geome-
try reconstruction using image-to-image translation,” in Proc. IEEE Int. Yingxue Pang (Graduate Student Member, IEEE) re-
Conf. Comput. Vis., Oct. 2017, pp. 1576–1585. ceived the B.E. degree in electronic and information
[236] E. Zakharov, A. Shysheya, E. Burkov, and V. Lempitsky, “Few-shot engineering from the Beijing University of Chemi-
adversarial learning of realistic neural talking head models,” in Proc. cal Technology, Beijing, China, in 2019. She is cur-
IEEE/CVF Int. Conf. Comput. Vis., Oct. 2019, pp. 9459–9468. rently working toward the [Link]. degree with the De-
[237] B. Duan, W. Wang, H. Tang, H. Latapie, and Y. Yan, “Cascade attention partment of Electronic Engineer and Information Sci-
guided residue learning gan for cross-modal translation,” in Proc. 25th ence, University of Science and Technology of China,
Int. Conf. Pattern Recognit., 2021, pp. 1336–1343. Hefei, China. Her current research interests include
[238] H. Tang, W. Wang, D. Xu, Y. Yan, and N. Sebe, “Gesturegan for hand image video processing, computer vision, and deep
gesture-to-gesture translation in the wild,” in Proc. 26th ACM Int. Conf. learning.
Multimedia. New York, NY, USA: Association for Computing Machin-
ery, 2018, pp. 774–782.
[239] H. Tang, S. Bai, and N. Sebe, “Dual attention GANs for semantic image
synthesis,” in Proc. 28th ACM Int. Conf. Multimedia. New York, NY,
USA: Association for Computing Machinery, 2020, pp. 1994–2002. Jianxin Lin received the B.E. and Ph.D. degrees
[240] H. Tang, X. Qi, D. Xu, P. H. Torr, and N. Sebe, “Edge guided in information and communication engineering from
GANs with semantic preserving for semantic image synthesis,” 2020, the University of Science and Technology of China,
arXiv:2003.13898. Hefei, China, in 2015 and 2020, respectively. He is
[241] G. Balakrishnan, A. Zhao, A. V. Dalca, F. Durand, and J. Guttag, “Synthe- currently an Associate Professor with the School of
sizing images of humans in unseen poses,” in Proc. IEEE Conf. Comput. Computer Science and Electronic Engineering, Hu-
Vis. Pattern Recognit., Jun. 2018, pp. 8340–8348. nan University, Changsha, China. His research in-
[242] Z. Zhu et al., “Progressive pose attention transfer for person image terests include image video processing, image video
generation,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., synthesis, and few-shot learning.
Jun. 2019, pp. 2347–2356.
[243] W. Liu et al., “Liquid warping GAN: A unified framework for human
motion imitation, appearance transfer and novel view synthesis,” in Proc.
IEEE/CVF Int. Conf. Comput. Vis., 2019, pp. 5904–5913. Tao Qin (Senior Member, IEEE) received the bache-
[244] H. Tang, S. Bai, L. Zhang, P. H. Torr, and N. Sebe, “XingGAN for person lor’s and Ph.D. degrees in electronic engineering from
image generation,” in Proc. Eur. Conf. Comput. Vis. Cham: Springer, Tsinghua University, Beijing, China. He is an Adjunct
2020, pp. 717–734. Professor (Ph.D. advisor) with the University of Sci-
[245] H. Tang, S. Bai, P. H. Torr, and N. Sebe, “Bipartite graph reasoning GANs ence and Technology of China, Hefei, China. He is a
for person image generation,” in Proc. BMVC, 2020, pp. 0–12. Senior Principal Researcher and managing the Deep
[246] H. Tang et al., “Cycle in cycle generative adversarial networks for and Reinforcement Group with Microsoft Research
keypoint-guided image generation,” in Proc. 27th ACM Int. Conf. Multi- Asia. His research interests include machine learning
media, 2019, pp. 2052–2060. (with the focus on deep learning and reinforcement
[247] H. Tang et al., “Attribute-guided sketch generation,” in Proc. 14th IEEE learning), artificial intelligence (with applications to
Int. Conf. Autom. Face Gesture Recognit., 2019, pp. 1–7. language understanding and computer vision), game
[248] M. Tao et al., “DF-GAN: Deep fusion generative adversarial networks theory and multiagent systems (with applications to cloud computing, online
for text-to-image synthesis,” 2020, arXiv:2008.05865. and mobile advertising, ecommerce), information retrieval, and computational
[249] B. Li, X. Qi, T. Lukasiewicz, and P. H. Torr, “Manigan: Text-guided image advertising. He is a Senior Member of ACM.
manipulation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun.
2020, pp. 7880–7889.
[250] G. Lample et al., “Fader networks: Manipulating images by sliding at-
tributes,” in Proc. Adv. Neural Inf. Process. Syst. 30, I. Guyon, U. V. Zhibo Chen (Senior Member, IEEE) received the
Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. [Link]., and Ph.D. degrees in electronic engineering
Garnett, Eds. Curran Associates, Inc., 2017, pp. 5967–5976. from the Department of Electrical Engineering, Ts-
[251] O. Press, T. Galanti, S. Benaim, and L. Wolf, “Emerging disentanglement inghua University, Beijing, China, in 1998 and 2003,
in auto-encoder based unsupervised image content transfer,” in Proc. Int. respectively. He is currently a Professor with the Uni-
Conf. Learn. Representations, 2018. versity of Science and Technology of China, Hefei,
[252] H. Tang et al., “Expression conditional GAN for facial expression-to- China. He has more than 100 publications and more
expression translation,” in Proc. IEEE Int. Conf. Image Process., 2019, than 50 granted EU and US patent applications. His
pp. 4449–4453. research interests include image and video compres-
[253] W. Wang et al., “Every smile is unique: Landmark-guided diverse smile sion, visual quality of experience assessment, immer-
generation,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., sive media computing, and intelligent media comput-
2018, pp. 7083–7092. ing. He is a Member of the IEEE Visual Signal Processing and Communications
[254] L. Song, Z. Lu, R. He, Z. Sun, and T. Tan, “Geometry guided adversar- Committee, and a Member of the IEEE Multimedia System and Applications
ial facial expression synthesis,” in Proc. 26th ACM Int. Conf. Multime- Committee. He was a TPC Chair of IEEE PCS 2019 and an Organization Com-
dia. New York, NY, USA: Association for Computing Machinery, 2018, mittee Member of ICIP 2017 and ICME 2013, a TPC Member in IEEE ISCAS
pp. 627–635. and IEEE VCIP.

Authorized licensed use limited to: National Taipei Univ. of Technology. Downloaded on January 09,2025 at [Link] UTC from IEEE Xplore. Restrictions apply.

You might also like