Image-to-Image Translation: Methods and Applications
TABLE I: LIST OF TWO-DOMAIN I2I METHODS, INCLUDING MODEL NAME, PUBLICATION YEAR, THE TYPE OF TRAINING DATA, WHETHER MULTIMODAL OR NOT, AND CORRESPONDING INSIGHTS
• Last but not least, we provide a thorough taxonomy of I2I applications following the same categorization as the I2I methods, as illustrated in Table V.

In general, our paper is organized as follows. Section I provides the problem setting of the image-to-image translation task. Section II introduces the generative models used for I2I methods. Section III discusses works on the two-domain I2I task. Section IV focuses on works related to the multi-domain I2I task. Section V presents an experimental evaluation of representative two-domain and multi-domain methods. Then, Section VI reviews the various and fruitful applications of I2I. A summary and outlook are given in Section VII.

II. THE BACKBONE OF I2I

Because an I2I task aims to learn the mappings between different image domains, how to represent these mappings so as to generate the desired results is closely tied to generative models. A generative model [40]–[42] assumes that the data is created by a particular distribution that is defined by two parameters (i.e., a Gaussian distribution) or by non-parametric variants (each instance has its own contribution to the distribution), and it approximates that underlying distribution with particular algorithms. This approach enables the generative model to generate data rather than only discriminate between data (classification). For instance, deep generative models have shown substantial performance improvements in making predictions [43], estimating missing data [44], compressing datasets [45] and generating unseen data. In an I2I task, a generative model can model the distribution of the target domain by producing convincing "fake" data, namely the translated images, which appear to be drawn from the distribution of the target domain.

However, considering the length of this article and the difference in research foci, we inevitably omit those generative models that are only vaguely connected with the theme of I2I, such as deep Boltzmann machines (DBMs) [46]–[48], deep autoregressive models (DARs) [49]–[51] and normalizing flow models (NFMs) [52], [53].

TABLE II: LIST OF MULTI-DOMAIN I2I METHODS, INCLUDING MODEL NAME, PUBLICATION YEAR, THE TYPE OF TRAINING DATA, WHETHER MULTIMODAL OR NOT, AND CORRESPONDING INSIGHTS

TABLE III: THE AVERAGE FID, IS AND LPIPS SCORES OF DIFFERENT TWO-DOMAIN I2I METHODS TRAINED ON THE UT-ZAP50K DATASET [188] ON THE EDGE → SHOES TASK. THE BEST SCORES ARE IN BOLD

TABLE IV: THE AVERAGE FID, IS AND LPIPS SCORES OF DIFFERENT MULTI-DOMAIN I2I METHODS TRAINED ON THE CELEBA DATASET [186] OVER 5 DOMAINS: BLACK HAIR, BLOND HAIR, BROWN HAIR, GENDER (MALE OR FEMALE) AND AGE (YOUNG OR OLD). IN ADDITION TO THE FINAL AVERAGE METRIC SCORES, WE ALSO REPORT THE RESULTS FOR TWO DOMAINS (BLACK HAIR AND GENDER) FOR REFERENCE. THE BEST SCORES ARE IN BOLD

Therefore, we will briefly introduce two of the most commonly used and efficient deep generative models in I2I tasks, variational autoencoders (VAEs) [52], [54]–[62] and generative adversarial networks (GANs) [63]–[72], as well as the intuition behind them. Both models basically aim to construct a mapping x = g(z) that generates the desired samples x from a latent variable z, but their specific approaches are different. A VAE models the data distribution by maximizing a lower bound of the data log-likelihood, whereas a GAN tries to find the Nash equilibrium between a generator and a discriminator.

On the other hand, after obtaining the translated results from the generative model, we need subjective and objective metrics for evaluating the quality of the translated images. Therefore, we will also briefly present the evaluation metrics commonly used for the I2I problem.

A. Variational AutoEncoder

Inspired by the Helmholtz machine [54], the variational autoencoder (VAE) [55], [56] was initially proposed for a variational inference problem in deep latent Gaussian models.

As shown in Fig. 3, a VAE [55], [56] adopts a recognition model (encoder) $q_\phi(z|x)$ to approximate the posterior distribution $p(z|x)$ and a generative model (decoder) $p_\theta(x|z)$ to map the latent variable $z$ to the data $x$. Specifically, a VAE trains its generative model so that the learned distribution $p(x)$ is close to the given data $x$ by maximizing the log-likelihood $\log p_\theta(x)$:

$$\log p_\theta(x) = \sum_{i=1}^{N} \log p_\theta(x_i), \qquad \log p_\theta(x_i) = \log \int p_\theta(x_i \mid z)\, p(z)\, dz. \tag{2}$$
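As an aside, Eqn. (2) can in principle be estimated by plain Monte Carlo sampling from the prior. The following sketch is our own illustration, not code from any of the surveyed works; `decoder_log_prob` is a hypothetical callable returning $\log p_\theta(x|z)$ (up to an additive constant). The paragraph below explains why this naive estimator usually fails in practice.

```python
import torch

def naive_log_marginal(x, decoder_log_prob, latent_dim, num_samples=1000):
    """Naive Monte Carlo estimate of Eqn. (2):
        log p_theta(x) ~= log( (1/K) * sum_k p_theta(x | z_k) ),  z_k ~ p(z) = N(0, I).
    Most z_k drawn from the prior explain x poorly, so this estimate (and its
    gradient) has very high variance.
    """
    z = torch.randn(num_samples, latent_dim)        # z_k ~ N(0, I)
    log_px_given_z = decoder_log_prob(x, z)         # shape: (num_samples,)
    # log-mean-exp for numerical stability
    return torch.logsumexp(log_px_given_z, dim=0) - torch.log(torch.tensor(float(num_samples)))

# Toy usage with a linear-Gaussian decoder p(x|z) = N(x; W z, I) (hypothetical weights,
# log-density written up to its additive normalizing constant).
W = torch.randn(16, 4)
x = torch.randn(16)
estimate = naive_log_marginal(
    x, lambda x_, z: -0.5 * ((x_ - z @ W.t()) ** 2).sum(dim=1), latent_dim=4)
```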
TABLE V: APPLICATIONS OF I2I DISCUSSED IN SECTION VI
Stochastic gradient ascent (SGA) combined with the naive Monte Carlo gradient estimator (MCGE) can be used to find the optimal solution of Eqn. (2). However, it often fails because the samples of $p_\theta(x|z)$ are highly skewed and the resulting estimator exhibits very high variance. A VAE therefore introduces the recognition model $q_\phi(z|x)$ as a multivariate Gaussian with a diagonal covariance structure:

$$q_\phi(z|x) = \mathcal{N}\!\left(z \mid \mu_z(x,\phi),\, \sigma_z^2(x,\phi)\, I\right). \tag{3}$$

Eqn. (2) can then be rewritten as

$$\log p_\theta(x_i) = \mathcal{L}(x_i, \theta, \phi) + D_{\mathrm{KL}}\!\left[q_\phi(z|x_i)\,\|\,p_\theta(z|x_i)\right], \tag{4}$$

where $D_{\mathrm{KL}}$ denotes the KL divergence, which is non-negative, and $\theta$ and $\phi$ are neural network parameters. Naturally, we can obtain a variational lower bound on the log-likelihood:

$$\log p_\theta(x_i) \geq \mathcal{L}(x_i, \theta, \phi). \tag{5}$$

Hence, a VAE differentiates and optimizes the lower bound $\mathcal{L}(x_i, \theta, \phi)$ instead of $\log p_\theta(x_i)$. The final objective function of a VAE is

$$\mathcal{L}(x_i, \theta, \phi) = \mathbb{E}_{z \sim q_\phi(z|x_i)}\!\left[\log p_\theta(x_i|z)\right] - D_{\mathrm{KL}}\!\left[q_\phi(z|x_i)\,\|\,p_\theta(z)\right]. \tag{6}$$
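To connect Eqns. (3)–(6) with an implementation, the following minimal sketch (our illustration, not the original VAE code) computes the per-batch negative ELBO with a diagonal-Gaussian encoder and the reparameterization trick; `encoder` and `decoder` are assumed, hypothetical network modules.

```python
import torch

def vae_loss(x, encoder, decoder):
    """Negative ELBO for one batch, i.e. the negation of L(x_i, theta, phi) in Eqn. (6).

    encoder(x) -> (mu, log_var): parameters of q_phi(z|x) = N(mu, diag(exp(log_var)))
    decoder(z) -> x_hat:         mean of p_theta(x|z), treated as a unit-variance Gaussian
    """
    mu, log_var = encoder(x)
    # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)
    eps = torch.randn_like(mu)
    z = mu + torch.exp(0.5 * log_var) * eps

    # One-sample estimate of E_q[log p_theta(x|z)] (Gaussian likelihood, up to a constant)
    x_hat = decoder(z)
    recon = -0.5 * ((x - x_hat) ** 2).sum(dim=-1)

    # Closed-form KL( q_phi(z|x) || p(z) ) between diagonal Gaussians, with p(z) = N(0, I)
    kl = -0.5 * (1 + log_var - mu ** 2 - log_var.exp()).sum(dim=-1)

    elbo = recon - kl              # Eqn. (6)
    return -elbo.mean()            # minimize the negative lower bound
```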
Fig. 2. An overview of image-to-image translation methods. This figure shows the relationship between different methods and where they intersect with each other.
Their method can use unlabeled data and less than 1% of labeled
data to complete several I2I tasks, such as image colorization,
image denoising and image super-resolution.
Fig. 16. Illustration of the dataset used in few-shot multi-domain I2I scenarios: the training set consists of multiple domains in which the source and target images are randomly sampled from arbitrary n domains; given very few images of the target unseen domain (unseen during training), few-shot multi-domain I2I aims to translate a source content image (randomly sampled from the n domains) into an image analogous to this unseen domain.

Fig. 17. The architecture of ZstGAN [79].

… and reconstruction loss with only a few labeled examples during training.

C. Few-Shot Multi-Domain Image-to-Image Translation

Although prolific, the aforementioned successful multi-domain I2I techniques can hardly generalize rapidly from a few examples. In contrast, humans can learn new tasks rapidly using what they learned in the past: given a static picture of a butterfly, you can easily imagine it flying like a bird or a bee after watching a video of a flock of birds or a swarm of bees in flight. Hence, few-shot multi-domain I2I has attracted much attention. Fig. 16 illustrates the dataset setup usually used in this scenario.

Liu et al. [18] propose a few-shot UI2I algorithm, FUNIT, to translate source images into analogous images of a target class when many source-class images but only a few target-class images are available. FUNIT first trains a multiclass UI2I model on multiple classes of images, such as those of various animal species, based on a few-shot image translator and a multitask adversarial discriminator. At test time, it can translate any source-class image into analogous images of the target class given only a few images from a novel object class (namely, the unseen target class).

However, FUNIT fails to preserve domain-invariant appearance information in the content image because of the severe influence of the style code extracted from the target image, namely, the content loss problem. Saito et al. [183] therefore proposed COCO-FUNIT, which redesigns a content-conditioned style encoder that interpolates content information into the style code.

Moreover, Lin et al. observed that current I2I models translate from random noise and, unlike humans, cannot easily adapt acquired prior knowledge to solve new problems. They hence proposed the unsupervised zero-shot I2I model ZstGAN [79], shown in Fig. 17. ZstGAN uses meta-learning to transfer translation knowledge from seen domains to unseen classes.

V. EXPERIMENTAL EVALUATION

In this section, we evaluate twelve I2I models on two tasks: seven two-domain algorithms on the edge-to-shoes translation task and five multi-domain algorithms on the attribute manipulation task. We train all models following the default settings of their original papers, except that they share the same datasets and implementation environment. The methods are selected mainly according to algorithm category and publication year. All experimental code comes from the official public implementations.

A. Datasets

UT-Zap50K: We utilize the UT-Zap50K dataset [188] to evaluate the performance of two-domain I2I methods. There are 49,826 training pairs, each consisting of a shoe image and its corresponding edge map, and 200 testing images. We resize all images to 256 × 256 for training and testing. In the unsupervised setting, images from the source domain and the target domain are not paired.

CelebA: We employ the CelebFaces Attributes (CelebA) dataset [186] to compare the performance of multi-domain I2I methods. It contains 202,599 celebrity face images with 40 binary (with/without) attribute labels per image. We randomly divide all images into training, validation and test sets with a ratio of 8 : 1 : 1. Next, we center-crop the initial 178 × 218 images to 178 × 178. Finally, after resizing all images to 128 × 128 by bicubic interpolation, we construct the multi-domain dataset using the following attributes: black hair, blond hair, brown hair, gender (male/female), and age (young/old).

B. Metrics

We evaluate both the visual quality and the diversity of generated images using the Fréchet inception distance (FID), the Inception score (IS) and the Learned Perceptual Image Patch Similarity (LPIPS).

Fréchet inception distance (FID) [92] is computed by measuring the distance between the means and covariances of the generated and real images in a deep feature space; a lower score indicates better performance. (1) For the single-modal two-domain setting, we directly compare the statistics of the generated and real sets. (2) For the multi-modal two-domain setting, we sample the testing set 19 times, compute the FID for each sampled set, and average the scores to obtain the final FID. (3) For the single-modal multi-domain setting, we compute the FID in each domain and then average the scores. (4) For the multi-modal multi-domain setting, we first sample each image in each domain 19 times, compute the average FID within each domain, and finally average the scores again to obtain the result.

Inception score (IS) [88] encodes the diversity across all translated outputs. It exploits a pretrained Inception classification model to predict the domain labels of the translated images; a higher score indicates better translation performance. The evaluation protocol is similar to that of FID.
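For reference, the scores reported below can be reproduced in spirit with the following sketch (our illustration, not the exact evaluation code behind Tables III and IV), which computes FID from Inception feature statistics and IS from predicted class probabilities; the feature and probability arrays are assumed to come from a pretrained Inception network.

```python
import numpy as np
from scipy import linalg

def fid(real_feats: np.ndarray, fake_feats: np.ndarray) -> float:
    """Frechet inception distance between two sets of Inception features of shape (N, D)."""
    mu_r, mu_f = real_feats.mean(axis=0), fake_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_f = np.cov(fake_feats, rowvar=False)
    # ||mu_r - mu_f||^2 + Tr( C_r + C_f - 2 (C_r C_f)^{1/2} )
    covmean, _ = linalg.sqrtm(cov_r @ cov_f, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))

def inception_score(probs: np.ndarray, eps: float = 1e-12) -> float:
    """Inception score from predicted class probabilities p(y|x) of shape (N, C):
    exp( E_x KL( p(y|x) || p(y) ) )."""
    p_y = probs.mean(axis=0, keepdims=True)
    kl = (probs * (np.log(probs + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))
```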
C. Results

A fair comparison is only possible by keeping all parameters consistent. That said, it is difficult to declare that one algorithm has absolute superiority over the others. Besides the model design itself, many factors influence performance, such as training time, batch size, number of iterations, FLOPs and number of parameters. Therefore, our conclusions hold only for the current experimental settings, models and tasks.

Two-domain I2I: We qualitatively and quantitatively compare pix2pix [1], BicycleGAN [16], CycleGAN [2], U-GAT-IT [37], GDWCT [126], CUT [116] and MUNIT [90] in the single-modal and multi-modal settings, respectively.

The single-modal qualitative comparisons are shown in Fig. 18, where the two supervised methods, pix2pix and BicycleGAN, achieve better FID, IS and LPIPS scores than the unsupervised methods CycleGAN, U-GAT-IT and GDWCT. Without any supervision, the newest method, CUT, obtains the best FID and IS scores in Table III among all methods, supervised and unsupervised. There could be a couple of reasons for this. First, the backbone of CUT, namely StyleGAN, is a powerful GAN model for image synthesis compared with the others. Besides, the contrastive learning it uses is an effective content constraint for translation.

As for the multi-modal setting shown in Fig. 19, we inject Gaussian noise into the inputs of pix2pix, CycleGAN, U-GAT-IT, GDWCT and CUT to obtain multi-modal results. However, they can hardly generate diverse outputs. On the contrary, the multi-modal algorithms BicycleGAN and MUNIT produce multi-modal and realistic results. Moreover, the supervised method BicycleGAN achieves a 0.047 higher LPIPS score than the unsupervised method MUNIT, as shown in Table III.

Fig. 20. Qualitative comparison of single-modal multi-domain I2I methods. Here we show examples of 5 attributes.

Multi-domain I2I: We qualitatively and quantitatively compare StarGAN [166], AttGAN [167], STGAN [169], DosGAN [178] and StarGANv2 [180] in the single-modal and multi-modal settings, respectively.

In Fig. 20, all methods successfully achieve multi-domain translation. However, StarGAN and AttGAN generate obvious visible artifacts, while DosGAN leads to blurry results. The results of STGAN are quite good, whereas StarGANv2 generates realistic and vivid translations by changing the image style latent code. Table IV shows that StarGANv2 acquires the best FID and IS scores. Similarly, many factors contribute to this result, including a stronger GAN backbone, more effective training strategies, a higher-quality dataset, etc.

We also conduct multi-modal multi-domain I2I experiments for comparison. In detail, we additionally inject noise vectors into StarGAN for multi-modal translation. As for AttGAN and STGAN, we apply linear interpolation between two attributes, Brown-Hair → Black-Hair, to obtain multi-modal results.
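The attribute interpolation used above can be sketched as follows; this is a hypothetical illustration in which `generator`, `attr_src` and `attr_dst` stand in for an AttGAN/STGAN-style conditional generator and its source/target attribute vectors, none of which are specified in the text.

```python
import torch

def attribute_interpolation(generator, x, attr_src, attr_dst, steps=5):
    """Generate multiple outputs for one input by linearly interpolating between
    two attribute vectors (e.g., Brown-Hair -> Black-Hair).

    generator(x, attr) -> translated image conditioned on an attribute vector.
    """
    outputs = []
    for alpha in torch.linspace(0.0, 1.0, steps):
        attr = (1.0 - alpha) * attr_src + alpha * attr_dst   # interpolated condition
        outputs.append(generator(x, attr))
    return torch.stack(outputs)                              # (steps, ...) multi-modal results
```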
[26] Z. Xu, T. Wang, F. Fang, Y. Sheng, and G. Zhang, “Stylization-based [52] L. Dinh, D. Krueger, and Y. Bengio, “Nice: Non-linear independent com-
architecture for fast deep exemplar colorization,” in Proc. IEEE/CVF ponents estimation,” 2014, arXiv:1410.8516.
Conf. Comput. Vis. Pattern Recognit., 2020, pp. 9363–9372. [53] A. Abdelhamed, M. A. Brubaker, and M. S. Brown, “Noise flow: Noise
[27] J. Lee et al., “Reference-based sketch image colorization using modeling with conditional normalizing flows,” in Proc. IEEE/CVF Int.
augmented-self reference and dense semantic correspondence,” in Proc. Conf. Comput. Vis., Oct. 2019, pp. 3165–3173.
IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2020, pp. 5801– [54] P. Dayan, G. E. Hinton, R. M. Neal, and R. S. Zemel, “The helmholtz
5810. machine,” Neural Comput., vol. 7, no. 5, pp. 889–904, 1995.
[28] Y. Yuan et al., “Unsupervised image super-resolution using cycle-in- [55] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” Proc.
cycle generative adversarial networks,” in Proc. IEEE Conf. Comput. 2nd Int. Conf. Learn. Representations, ICLR 2014, Banff, AB, Canada,
Vis. Pattern Recognit. Workshops, Jun. 2018, pp. 701–710. vol. 1050, p. 1, 2014.
[29] Y. Zhang, S. Liu, C. Dong, X. Zhang, and Y. Yuan, “Multiple cycle- [56] D. J. Rezende, S. Mohamed, and D. Wierstra, “Stochastic backpropaga-
in-cycle generative adversarial networks for unsupervised image super- tion and approximate inference in deep generative models,” in Proc. Int.
resolution,” IEEE Trans. Image Process., vol. 29, pp. 1101–1112, Conf. Mach. Learn., 2014, pp. 1278–1286.
Sep. 2019. [57] H. Larochelle and I. Murray, “The neural autoregressive distribu-
[30] Z. Murez, S. Kolouri, D. Kriegman, R. Ramamoorthi, and K. Kim, “Image tion estimator,” in Proc. 14th Int. Conf. Artif. Intell. Statist., 2011,
to image translation for domain adaptation,” in Proc. IEEE Conf. Comput. pp. 29–37.
Vis. Pattern Recognit., Jun. 2018, pp. 4500–4509. [58] M. Germain, K. Gregor, I. Murray, and H. Larochelle, “Made: Masked
[31] J. Cao et al., “DIDA: Disentangled synthesis for domain adaptation,” autoencoder for distribution estimation,” in Proc. Int. Conf. Mach. Learn.,
CoRR, 2018, arXiv:1805.08019. 2015, pp. 881–889.
[32] A. H. Liu, Y.-C. Liu, Y.-Y. Yeh, and Y.-C. F. Wang, “A unified feature [59] E. Nalisnick, L. Hertel, and P. Smyth, “Approximate inference for deep la-
disentangler for multi-domain image translation and manipulation,” in tent gaussian mixtures,” in Proc. NIPS Workshop Bayesian Deep Learn.,
Proc. Adv. Neural Inf. Process. Syst., 2018, pp. 2590–2599. vol. 2, pp. 131–134, 2016.
[33] Y. Shi, D. Deb, and A. K. Jain, “WarpGAN: Automatic caricature gen- [60] D. Rezende and S. Mohamed, “Variational inference with normalizing
eration,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, flows,” in Proc. Int. Conf. Mach. Learn., 2015, pp. 1530–1538.
pp. 10 762–10 771. [61] J. Tomczak and M. Welling, “Vae with a vampprior,” in Proc. Int. Conf.
[34] M. Pesko, A. Svystun, P. Andruszkiewicz, P. Rokita, and T. Trzcinski, Artif. Intell. Statist., 2018, pp. 1214–1223.
“Comixify: Transform video into comics,” Fundamenta Informaticae, [62] I. Tolstikhin, O. Bousquet, S. Gelly, and B. Schoelkopf, “Wasserstein
vol. 168, no. 2-4, pp. 311–333, 2019. auto-encoders,” in Proc. Int. Conf. Learn. Representations, 2018.
[35] Z. Zheng et al., “Unpaired photo-to-caricature translation on faces in the [63] I. Goodfellow et al., “Generative adversarial nets,” in Proc. Adv. Neural
wild,” Neurocomputing, vol. 355, pp. 71–81, 2019. Inf. Process. Syst., 2014, pp. 2672–2680.
[36] Y. Chen, Y.-K. Lai, and Y.-J. Liu, “CartoonGAN: Generative adversarial [64] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein generative adver-
networks for photo cartoonization,” in Proc. IEEE Conf. Comput. Vis. sarial networks,” in Proc. Int. Conf. Mach. Learn., 2017, pp. 214–223.
Pattern Recognit., 2018, pp. 9465–9474. [65] X. Mao et al., “Least squares generative adversarial networks,” in Proc.
[37] J. Kim, M. Kim, H. Kang, and K. H. Lee, “U-GAT-IT: Unsupervised gen- IEEE Int. Conf. Comput. Vis., 2017, pp. 2794–2802.
erative attentional networks with adaptive layer-instance normalization [66] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation
for image-to-image translation,” in Proc. Int. Conf. Learn. Representa- learning with deep convolutional generative adversarial networks,” in
tions, 2019. Proc. 4th Int. Conf. Learn. Representations, San Juan, Puerto Rico, May
[38] X. Wang and J. Yu, “Learning to cartoonize using white-box cartoon rep- 2016.
resentations,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., [67] M. Mirza and S. Osindero, “Conditional generative adversarial
2020, pp. 8090–8099. nets,” 2014, arXiv:1411.1784.
[39] M. Arar, Y. Ginger, D. Danon, A. H. Bermano, and D. Cohen-Or, [68] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville,
“Unsupervised multi-modal image registration via geometry preserving “Improved training of wasserstein GANs,” in Proc. Adv. Neural Inf. Pro-
image-to-image translation,” in Proc. IEEE/CVF Conf. Comput. Vis. Pat- cess. Syst., 2017, pp. 5767–5777.
tern Recognit., Jun. 2020, pp. 13410–13419. [69] J. J. Zhao, M. Mathieu, and Y. LeCun, “Energy-based generative adver-
[40] Y. LeCun, S. Chopra, R. Hadsell, M. Ranzato, and F. Huang, “A tutorial sarial networks,” in Proc. 5th Int. Conf. Learn. Representations, Toulon,
on energy-based learning,” Predicting Structured Data, vol. 1, no. 0, France, Apr. 2017.
2006. [70] D. Berthelot, T. Schumm, and L. Metz, “BeGAN: Boundary equilibrium
[41] J. Xu, H. Li, and S. Zhou, “An overview of deep generative models,” generative adversarial networks,” 2017, arXiv:1703.10717.
IETE Tech. Rev., vol. 32, no. 2, pp. 131–139, 2015. [71] A. Jolicoeur-Martineau, “The relativistic discriminator: A key element
[42] A. Oussidi and A. Elhassouny, “Deep generative models: Survey,” in missing from standard GAN,” in Proc. Int. Conf. Learn. Representations,
Proc. Int. Conf. Intell. Syst. Comput. Vis., 2018, pp. 1–8. 2018.
[43] H.-M. Chu, C.-K. Yeh, and Y.-C. Frank Wang, “Deep generative mod- [72] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida, “Spectral normal-
els for weakly-supervised multi-label classification,” in Proc. Eur. Conf. ization for generative adversarial networks,” in Proc. Int. Conf. Learn.
Comput. Vis., 2018, pp. 400–415. Representations, 2018.
[44] R. A. Yeh et al., “Semantic image inpainting with deep generative [73] P. Ghosh, M. S. Sajjadi, A. Vergari, M. Black, and B. Scholkopf, “From
models,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, variational to deterministic autoencoders,” in Int. Conf. Learn. Represen-
pp. 5485–5493. tations, 2019.
[45] M. Tschannen, E. Agustsson, and M. Lucic, “Deep generative models [74] M.-Y. Liu, T. Breuel, and J. Kautz, “Unsupervised image-to-image trans-
for distribution-preserving lossy compression,” in Proc. Adv. Neural Inf. lation networks,” in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 700–
Process. Syst., 2018, pp. 5929–5940. 708.
[46] R. Salakhutdinov and G. Hinton, “Deep boltzmann machines,” in Proc. [75] H.-Y. Lee, H.-Y. Tseng, J.-B. Huang, M. Singh, and M.-H. Yang, “Diverse
Artif. Intell. Statist., 2009, pp. 448–455. image-to-image translation via disentangled representations,” in Proc.
[47] R. Salakhutdinov, A. Mnih, and G. Hinton, “Restricted boltzmann ma- Eur. Conf. Comput. Vis., 2018, pp. 35–51.
chines for collaborative filtering,” in Proc. 24th Int. Conf. Mach. Learn., [76] L. Ma, X. Jia, S. Georgoulis, T. Tuytelaars, and L. Van Gool, “Exemplar
2007, pp. 791–798. guided unsupervised image-to-image translation with semantic consis-
[48] G. E. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algorithm for tency,” in Proc. Int. Conf. Learn. Representations, 2018.
deep belief nets,” Neural Comput., vol. 18, no. 7, pp. 1527–1554, 2006. [77] W. Wu, K. Cao, C. Li, C. Qian, and C. C. Loy, “Transgaga: Geometry-
[49] A. Van den Oord et al., “Conditional image generation with pix- aware unsupervised image-to-image translation,” in Proc. IEEE Conf.
elcnn decoders,” in Proc. Adv. Neural Inf. Process. Syst., 2016, Comput. Vis. Pattern Recognit., 2019, pp. 8012–8021.
pp. 4790–4798. [78] H. Kazemi, S. Soleymani, F. Taherkhani, S. Iranmanesh, and N.
[50] A. Van Oord, N. Kalchbrenner, and K. Kavukcuoglu, “Pixel recurrent Nasrabadi, “Unsupervised image-to-image translation using domain-
neural networks,” in Proc. Int. Conf. Mach. Learn., 2016, pp. 1747–1756. specific variational information bound,” in Proc. Adv. Neural Inf. Process.
[51] X. Chen, N. Mishra, M. Rohaninejad, and P. Abbeel, “Pixelsnail: An Syst., 2018, pp. 10 348–10 358.
improved autoregressive generative model,” in Proc. Int. Conf. Mach. [79] J. Lin, Y. Xia, S. Liu, T. Qin, and Z. Chen, “ZSTGAN: An adversarial
Learn., 2018, pp. 864–872. approach for unsupervised zero-shot image-to-image translation,” 2019,
arXiv:1906.00184.
[80] J. Lin, Y. Pang, Y. Xia, Z. Chen, and J. Luo, “TuiGAN: Learning versa- [106] A. Gonzalez-Garcia, J. Van De Weijer, and Y. Bengio, “Image-to-image
tile image-to-image translation with two unpaired images,” in Proc. Eur. translation for cross-domain disentanglement,” in Proc. Adv. Neural Inf.
Conf. Comput. Vis. Cham: Springer, 2020, pp. 18–35. Process. Syst., 2018, pp. 1287–1298.
[81] S. Pal and S. Mitra, “Multilayer perceptron, fuzzy sets, and classification,” [107] A. Mustafa and R. [Link], “Transformation consistency regulariza-
IEEE Trans. Neural Netw., vol. 3, no. 5, pp. 683–697, Sep. 1992. tion - a semi-supervised paradigm for image-to-image translation,” Com-
[82] Y. LeCun et al., “Backpropagation applied to handwritten zip code recog- puter Vision–ECCV 2020: 16th European Conference, Glasgow, UK,
nition,” Neural Comput., vol. 1, no. 4, pp. 541–551, 1989. Aug. 23–28, 2020, Proceedings, Part XVIII 16, A. Vedaldi, H. Bischof,
[83] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” T. Brox, and J.-M. Frahm, Eds. Cham: Springer International Publishing,
in Proc. 3rd Int. Conf. Learn. Representations, San Diego, CA, USA, 2020, pp. 599–615.
May 2015. [108] Y. Taigman, A. Polyak, and L. Wolf, “Unsupervised cross-domain im-
[84] T. Tieleman and G. Hinton, “Lecture 6.5-RmsProp: Divide the gradient age generation,” in Proc. 5th Int. Conf. Learn. Representations, Toulon,
by a running average of its recent magnitude,” COURSERA: Neural Netw. France, Apr. 2017.
Mach. Learn., vol. 4, no. 2, pp. 26–31, 2012. [109] M. Li et al., “Unsupervised image-to-image translation with stacked
[85] R. Zhang, P. Isola, and A. A. Efros, “Colorful image colorization,” in cycle-consistent adversarial networks,” in Proc. Eur. Conf. Comput.
Proc. Eur. Conf. Comput. Vis. Cham: Springer, 2016, pp. 649–666. Vis., 2018, pp. 184–199.
[86] C. Wang et al., “Discriminative region proposal adversarial networks for [110] A. Gokaslan, V. Ramanujan, D. Ritchie, K. In Kim, and J. Tompkin, “Im-
high-quality image-to-image translation,” in Proc. Eur. Conf. Comput. proving shape deformation in unsupervised image-to-image translation,”
Vis., Sep. 2018, pp. 770–785. in Proc. Eur. Conf. Comput. Vis., 2018, pp. 649–665.
[87] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality [111] M. Amodio and S. Krishnaswamy, “TravelGAN: Image-to-image trans-
assessment: From error visibility to structural similarity,” IEEE Trans. lation by transformation vector learning,” in Proc. IEEE Conf. Comput.
Image Process., vol. 13, no. 4, pp. 600–612, Apr. 2004. Vis. Pattern Recognit., 2019, pp. 8983–8992.
[88] T. Salimans et al., “Improved techniques for training gans,” in Proc. Adv. [112] Y. Zhao, R. Wu, and H. Dong, “Unpaired image-to-image translation us-
Neural Inf. Process. Syst., 2016, pp. 2234–2242. ing adversarial consistency loss,” in Proc. Eur. Conf. Comput. Vis. Cham:
[89] L. Ma et al., “Pose guided person image generation,” in Proc. Adv. Neural Springer, 2020, pp. 800–815.
Inf. Process. Syst., vol. 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wal- [113] O. Katzir, D. Lischinski, and D. Cohen-Or, “Cross-domain cascaded deep
lach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds. Curran Associates, translation,” Computer Vision - ECCV 2020-16th European Conference,
Inc., 2017. Glasgow, U.K., Aug. 23-28, 2020, Proceedings, Part II, Ser. Lecture Notes
[90] X. Huang, M.-Y. Liu, S. Belongie, and J. Kautz, “Multimodal unsuper- in Computer Science, vol. 12347, A. Vedaldi, H. Bischof, T. Brox, and J.
vised image-to-image translation,” in Proc. Eur. Conf. Comput. Vis., 2018, Frahm, Eds. Springer, 2020, pp. 673–689.
pp. 172–189. [114] S. Benaim and L. Wolf, “One-sided unsupervised domain mapping,” in
[91] J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style Proc. Adv. Neural Inf. Process. Syst., vol. 30, I. Guyon, U. V. Luxburg,
transfer and super-resolution,” in Proc. Eur. Conf. Comput. Vis. Cham: S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds.
Springer, 2016, pp. 694–711. Curran Associates, Inc., 2017.
[92] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, [115] H. Fu et al., “Geometry-consistent generative adversarial networks for
“GANs trained by a two time-scale update rule converge to a local nash one-sided unsupervised domain mapping,” in Proc. IEEE/CVF Conf.
equilibrium,” in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 6626– Comput. Vis. Pattern Recognit., 2019, pp. 2427–2436.
6637. [116] T. Park, A. A. Efros, R. Zhang, and J.-Y. Zhu, “Contrastive learning for
[93] M. Bińkowski, D. J. Sutherland, M. Arbel, and A. Gretton, “Demystifying unpaired image-to-image translation,” in Proc. Eur. Conf. Comput. Vis.
MMD GANs,” in Proc. Int. Conf. Learn. Representations, 2018. Cham: Springer, 2020, pp. 319–345.
[94] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethink- [117] T. Park et al., “Swapping autoencoder for deep image manipulation,” Adv.
ing the inception architecture for computer vision,” in Proc. IEEE Conf. Neural Inf. Proc. Syst., Curran Associates, Inc., vol. 33, pp. 7198–7211,
Comput. Vis. Pattern Recognit., 2016, pp. 2818–2826. 2020.
[95] T. R. Shaham, T. Dekel, and T. Michaeli, “SinGAN: Learning a generative [118] L. Jiang et al., “Tsit: A simple and versatile framework for image-to-
model from a single natural image,” in Proc. IEEE Int. Conf. Comput. image translation,” in Proc. Eur. Conf. Comput. Vis. Cham: Springer,
Vis., 2019, pp. 4570–4580. 2020, pp. 206–222.
[96] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unrea- [119] C. Zheng, T.-J. Cham, and J. Cai, “The spatially-correlative loss for vari-
sonable effectiveness of deep features as a perceptual metric,” in Proc. ous image translation tasks,” Proc. IEEE/CVF Conf. Compu. Vis. Pattern
IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 586–595. Recognition (CVPR), Jun. 2021, pp. 16407–16417.
[97] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks [120] J. Liang, H. Zeng, and L. Zhang, “High-resolution photorealistic image
for semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern translation in real-time: A laplacian pyramid translation network,” Proc.
Recognit., 2015, pp. 3431–3440. IEEE/CVF Conf. Compu. Vis. Pattern Recognition (CVPR), Jun. 2021,
[98] M. F. Naeem, S. J. Oh, Y. Uh, Y. Choi, and J. Yoo, “Reliable fidelity pp. 9392–9400.
and diversity metrics for generative models,” in Proc. Int. Conf. Mach. [121] S. Ma, J. Fu, C. W. Chen, and T. Mei, “Da-GAN: Instance-
Learn., 2020, pp. 7176–7185. level image translation by deep attention generative adversarial net-
[99] T.-C. Wang et al., “High-resolution image synthesis and semantic manip- works,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018,
ulation with conditional gans,” in Proc. IEEE Conf. Comput. Vis. Pattern pp. 5657–5666.
Recognit., 2018, pp. 8798–8807. [122] X. Chen, C. Xu, X. Yang, and D. Tao, “Attention-GAN for object trans-
[100] B. AlBahar and J.-B. Huang, “Guided image-to-image translation with figuration in wild images,” in Proc. Eur. Conf. Comput. Vis., Sep. 2018,
bi-directional feature transformation,” in Proc. IEEE/CVF Int. Conf. pp. 164–180.
Comput. Vis., Oct. 2019, pp. 9016–9025. [123] S. Mo, M. Cho, and J. Shin, “InstaGAN: Instance-aware image-to-image
[101] H. Tang et al., “Multi-channel attention selection GAN with cascaded translation,” in Proc. Int. Conf. Learn. Representations, 2018.
semantic guidance for cross-view image translation,” in Proc. IEEE Conf. [124] Z. Shen, M. Huang, J. Shi, X. Xue, and T. S. Huang, “Towards instance-
Comput. Vis. Pattern Recognit., 2019, pp. 2417–2426. level image-to-image translation,” in Proc. IEEE Conf. Comput. Vis. Pat-
[102] P. Zhang, B. Zhang, D. Chen, L. Yuan, and F. Wen, “Cross- tern Recognit., 2019, pp. 3683–3692.
domain correspondence learning for exemplar-based image transla- [125] D. Bhattacharjee, S. Kim, G. Vizier, and M. Salzmann, “DUNIT:
tion,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, Detection-based unsupervised image-to-image translation,” in Proc.
pp. 5143–5153. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2020, pp. 4787–
[103] X. Zhou et al., “CoCosNet v2: Full-resolution correspondence learn- 4796.
ing for image translation,” Proc. IEEE/CVF Conf. Compu. Vis. Pattern [126] W. Cho, S. Choi, D. K. Park, I. Shin, and J. Choo, “Image-to-image
Recognition (CVPR), Jun. 2021, pp. 11465–11475. translation via group-wise deep whitening-and-coloring transformation,”
[104] T. R. Shaham, M. Gharbi, R. Zhang, E. Shechtman, and T. Michaeli, in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 10 639–
“Spatially-adaptive pixelwise networks for fast image translation,” Proc. 10 647.
IEEE/CVF Conf. Compu. Vis. Pattern Recognition (CVPR), Jun. 2021, [127] T. F. van der Ouderaa and D. E. Worrall, “Reversible GANs for memory-
pp. 14882–14891. efficient image-to-image translation,” in Proc. IEEE Conf. Comput. Vis.
[105] A. Bansal, Y. Sheikh, and D. Ramanan, “PixeINN: Example-based image Pattern Recognit., 2019, pp. 4720–4728.
synthesis,” in Proc. Int. Conf. Learn. Representations, 2018.
[128] H. Chen et al., “Distilling portable generative adversarial networks for [154] I. Goodfellow, “Nips 2016 tutorial: Generative adversarial networks,”
image translation,” in Proc. AAAI Conf. Artif Intell., vol. 34, no. 04, 2020, 2017.
pp. 3585–3592. [155] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of
[129] Y.-C. Chen, X. Xu, and J. Jia, “Domain adaptive image-to-image data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507,
translation,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2006.
Jun. 2020, pp. 5274–5283. [156] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther, “Au-
[130] R. Chen, W. Huang, B. Huang, F. Sun, and B. Fang, “Reusing dis- toencoding beyond pixels using a learned similarity metric,” in Proc. Int.
criminators for encoding: Towards unsupervised image-to-image trans- Conf. Mach. Learn., 2016, pp. 1558–1566.
lation,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, [157] X. Chen et al., “InfoGAN: Interpretable representation learning by in-
pp. 8168–8177. formation maximizing generative adversarial nets,” in Proc. Adv. Neural
[131] A. Almahairi, S. Rajeshwar, A. Sordoni, P. Bachman, and A. Courville, Inf. Process. Syst., 2016, pp. 2172–2180.
“Augmented cyclegan: Learning many-to-many mappings from un- [158] J. Donahue, P. Krähenbühl, and T. Darrell, “Adversarial feature learning,”
paired data,” in Proc. Mach. Learn. Res., vol. 80, J. Dy and A. Proc. 5th Int. Conf. Learn. Representations, ICLR 2017, Toulon, France:
Krause, Eds. Stockholmsmässan, Stockholm Sweden: PMLR, Jul. 2018, [Link]. Apr. 2017.
pp. 195–204. [159] V. Dumoulin et al., “Adversarially learned inference,” Proc. 5th Int. Conf.
[132] J. Lin, Y. Xia, T. Qin, Z. Chen, and T.-Y. Liu, “Conditional image-to- Learn. Representations, ICLR 2017, vol. 1050, Apr. 2017.
image translation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., [160] I. Higgins et al., “Beta-VAE: Learning basic visual concepts with a con-
Jun. 2018, pp. 5524–5532. strained variational framework,” in Proc. Int. Conf. Learn. Representa-
[133] Q. Mao, H.-Y. Lee, H.-Y. Tseng, S. Ma, and M.-H. Yang, “Mode seeking tions, 2017.
generative adversarial networks for diverse image synthesis,” in Proc. [161] H. Kim and A. Mnih, “Disentangling by factorising,” in Proc. Int. Conf.
IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 1429–1437. Mach. Learn., 2018, pp. 2649–2658.
[134] Y. Alharbi, N. Smith, and P. Wonka, “Latent filter scaling for multimodal [162] E. L. Denton and V. Birodkar, “Unsupervised learning of disentan-
unsupervised image-to-image translation,” in Proc. IEEE Conf. Comput. gled representations from video,” in Proc. Adv. Neural Inf. Process.
Vis. Pattern Recognit., 2019, pp. 1458–1466. Syst. 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus,
[135] H.-Y. Chang, Z. Wang, and Y.-Y. Chuang, “Domain-specific mappings S. Vishwanathan, and R. Garnett, Eds. Curran Associates, Inc., 2017,
for generative adversarial style transfer,” in Proc. Eur. Conf. Comput. Vis. pp. 4414–4423.
Cham: Springer, 2020, pp. 573–589. [163] A. V. D. Oord, Y. Li, and O. Vinyals, “Representation learning with
[136] Y. Wang et al., “Transferring GANs: Generating images from limited contrastive predictive coding,” 2018, arXiv:1807.03748.
data,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 218–234. [164] L. Hui, X. Li, J. Chen, H. He, and J. Yang, “Unsupervised multi-domain
[137] J. Lin, Y. Wang, T. He, and Z. Chen, “Learning to transfer: Unsupervised image translation with domain-specific encoders/decoders,” in Proc. 24th
meta domain translation,” Proc. AAAI Conf. Artif. Intell., vol. 34, no. 7, Int. Conf. Pattern Recognit., 2018, pp. 2044–2049.
pp. 11507–11514, Apr. 2020. [165] B. Zhao, B. Chang, Z. Jie, and L. Sigal, “Modular generative adversarial
[138] Y. Li, R. Zhang, J. C. Lu, and E. Shechtman, “Few-shot image generation networks,” in Proc. Eur. Conf. Comput. Vis., Sep. 2018, pp. 150–165.
with elastic weight consolidation,” in Proc. Adv. Neural Inf. Process. [166] Y. Choi et al., “StarGAN: Unified generative adversarial networks for
Syst., 2020, pp. 15885–15896. multi-domain image-to-image translation,” in Proc. IEEE Conf. Comput.
[139] U. Ojha et al., “Few-shot image generation via cross-domain correspon- Vis. Pattern Recognit., Jun. 2018, pp. 8789–8797.
dence,” Proc. IEEE/CVF Conf. Compu. Vis. Pattern Recognition (CVPR), [167] Z. He, W. Zuo, M. Kan, S. Shan, and X. Chen, “AttGAN: Facial attribute
Jun. 2021, pp. 10 743–10 752. editing by only changing what you want,” IEEE Trans. Image Process.,
[140] S. Benaim and L. Wolf, “One-shot unsupervised cross domain transla- vol. 28, no. 11, pp. 5464–5478, Nov. 2019.
tion,” in Proc. Adv. Neural Inf. Process. Syst., 2018, pp. 2104–2114. [168] P.-W. Wu, Y.-J. Lin, C.-H. Chang, E. Y. Chang, and S.-W. Liao, “REL-
[141] T. Cohen and L. Wolf, “Bidirectional one-shot unsupervised domain map- GAN: Multi-domain image-to-image translation via relative attributes,”
ping,” in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 1784–1792. in Proc. IEEE/CVF Int. Conf. Comput. Vis., Oct. 2019, pp. 5914–5922.
[142] R. Navigli, “Word sense disambiguation: A survey,” ACM Comput. Surv., [169] M. Liu et al., “STGAN: A unified selective transfer network for arbitrary
vol. 41, no. 2, pp. 1–69, 2009. image attribute editing,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern
[143] G.-J. Qi and J. Luo, “Small data challenges in big data era: A survey of Recognit., Jun. 2019, pp. 3673–3682.
recent progress on unsupervised and semi-supervised methods,” IEEE [170] D. Lee, J. Kim, W.-J. Moon, and J. C. Ye, “CollaGAN: Collaborative
Trans. Pattern Anal. Mach. Intell., pp. 1–1, 2020. GAN for missing image data imputation,” in Proc. IEEE/CVF Conf. Com-
[144] L. Schmarje, M. Santarossa, S.-M. Schröder, and R. Koch, “A survey put. Vis. Pattern Recognit., Jun. 2019, pp. 2487–2496.
on semi-, self- and unsupervised learning for image classification,” IEEE [171] J. Lin, Y. Xia, Y. Wang, T. Qin, and Z. Chen, “Image-to-image translation
Access, vol. 9, pp. 82146–82168, 2021. with multi-path consistency regularization,” in Proc. Int. Joint Conf. Artif.
[145] M. Shi and B. Zhang, “Semi-supervised learning improves gene Intell., 2019, pp. 2980–2986.
expression-based prediction of cancer recurrence,” Bioinformatics, [172] S. Chang, S. Park, J. Yang, and N. Kwak, “Sym-parameterized dynamic
vol. 27, no. 21, pp. 3017–3023, 2011. inference for mixed-domain image translation,” in Proc. IEEE/CVF Int.
[146] D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling, “Semi- Conf. Comput. Vis., Oct. 2019, pp. 4803–4811.
supervised learning with deep generative models,” in Proc. Adv. Neural [173] M. M. R. Siddiquee et al., “Learning fixed points in generative adver-
Inf. Process. Syst., 2014, pp. 3581–3589. sarial networks: From image-to-image translation to disease detection
[147] A. Rasmus, M. Berglund, M. Honkala, H. Valpola, and T. Raiko, “Semi- and localization,” in Proc. IEEE/CVF Int. Conf. Comput. Vis., Oct. 2019,
supervised learning with ladder networks,” in Proc. Adv. Neural Inf. Pro- pp. 191–200.
cess. Syst., 2015, pp. 3546–3554. [174] T. He et al., “Deliberation learning for image-to-image translation.” in
[148] D. Berthelot et al., “Mixmatch: A holistic approach to semi-supervised Proc. Int. Joint Conf. Artif. Intell., 2019, pp. 2484–2490.
learning,” in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 5049–5059. [175] R. Wu, X. Tao, X. Gu, X. Shen, and J. Jia, “Attribute-driven spontaneous
[149] R. Zhang, T. Che, Z. Ghahramani, Y. Bengio, and Y. Song, “MetaGAN: motion in unpaired image translation,” in Proc. IEEE/CVF Int. Conf.
An adversarial approach to few-shot learning,” in Proc. Adv. Neural Inf. Comput. Vis., Oct. 2019, pp. 5923–5932.
Process. Syst., 2018, pp. 2365–2374. [176] J. Cao, H. Huang, Y. Li, R. He, and Z. Sun, “Informative sample min-
[150] Q. Sun, Y. Liu, T.-S. Chua, and B. Schiele, “Meta-transfer learning for ing network for multi-domain image-to-image translation,” in Proc. Eur.
few-shot learning,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Conf. Comput. Vis. Cham: Springer, 2020, pp. 404–419.
2019, pp. 403–412. [177] A. Pumarola, A. Agudo, A. M. Martinez, A. Sanfeliu, and F. Moreno-
[151] J. Snell, K. Swersky, and R. Zemel, “Prototypical networks for few-shot Noguer, “Ganimation: Anatomically-aware facial animation from a sin-
learning,” in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 4077–4087. gle image,” in Proc. Eur. Conf. Comput. Vis., Sep. 2018, pp. 818–833.
[152] F. Sung et al., “Learning to compare: Relation network for few-shot [178] J. Lin et al., “Exploring explicit domain supervision for latent space
learning,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, disentanglement in unpaired image-to-image translation,” IEEE Trans.
pp. 1199–1208. Pattern Anal. Mach. Intell., vol. 43, no. 4, pp. 1254–1266, Apr. 2021.
[153] A. Hertzmann, C. E. Jacobs, N. Oliver, B. Curless, and D. H. Salesin, [179] X. Yu, Y. Chen, S. Liu, T. Li, and G. Li, “Multi-mapping image-to-
“Image analogies,” in Proc. 28th Annu. Conf. Comput. Graph. Interactive image translation via learning disentanglement,” in Proc. Adv. Neural
Techn., 2001, pp. 327–340. Inf. Process. Syst., 2019, pp. 2994–3004.
[180] Y. Choi, Y. Uh, J. Yoo, and J.-W. Ha, “StarGAN V2: Diverse image [205] K. Armanious et al., “MedGAN: Medical image translation using GANs,”
synthesis for multiple domains,” in Proc. IEEE/CVF Conf. Comput. Vis. Computerized Med. Imag. Graph., vol. 79, 2020, Art. no. 101684.
Pattern Recognit., 2020, pp. 8188–8197. [206] I. Manakov et al., “Noise as domain shift: Denoising medical images
[181] H.-Y. Lee et al., “DRIT: Diverse image-to-image translation via dis- by unpaired image translation,” Domain Adaptation and Representation
entangled representations,” Int. J. Comput. Vis., vol. 128, no. 10, Transfer and Medical Image Learning with Less Labels and Imperfect
pp. 2402–2417, 2020. Data. Cham: Springer, 2019, pp. 3–10.
[182] Y. Liu et al., “GMM-unit: Unsupervised multi-domain and multi-modal [207] H. Touvron, M. Douze, M. Cord, and H. Jégou, “Powers of layers for
image-to-image translation via attribute gaussian mixture modeling,” image-to-image translation,” 2020, arXiv:2008.05763.
2020, arXiv:2003.06788. [208] H. Zhang, V. Sindagi, and V. M. Patel, “Image de-raining using a condi-
[183] K. Saito, K. Saenko, and M.-Y. Liu, “COCO-FUNIT: Few-shot unsu- tional generative adversarial network,” IEEE Trans. Circuits Syst. Video
pervised image translation with a content conditioned style encoder,” Technol., vol. 30, no. 11, pp. 3943–3956, Nov. 2020.
Computer Vision-ECCV 2020: 16th European Conference, Glasgow, UK, [209] R. Li, L.-F. Cheong, and R. T. Tan, “Heavy rain image restoration: In-
Aug. 23-28, 2020, Proceedings, Part III 16, A. Vedaldi, H. Bischof, T. tegrating physics model and conditional adversarial learning,” in Proc.
Brox, and J.-M. Frahm, Eds. Cham: Springer International Publishing, IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 1633–1642.
2020, pp. 382–398. [210] H. Zhu et al., “Singe image rain removal with unpaired information:
[184] X. Li et al., “Attribute guided unpaired image-to-image translation with A differentiable programming perspective,” in Proc. AAAI Conf. Artif
semi-supervised learning,” 2019, arXiv:1904.12428. Intell., vol. 33, 2019, pp. 9332–9339.
[185] Y. Wang, S. Khan, A. Gonzalez-Garcia, J. van de Weijer, and F. S. Khan, [211] A. Dudhane, H. S. Aulakh, and S. Murala, “Ri-GAN: An end-to-end net-
“Semi-supervised learning for few-shot image-to-image translation,” work for single image haze removal,” in Proc. IEEE/CVF Conf. Comput.
in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2020, Vis. Pattern Recognit. Workshops, 2019, pp. 2014–2023.
pp. 4453–4462. [212] D. Engin, A. Genç, and H. Kemal Ekenel, “Cycle-dehaze: Enhanced
[186] Z. Liu, P. Luo, X. Wang, and X. Tang, “Deep learning face attributes in the cyclegan for single image dehazing,” in Proc. IEEE Conf. Comput. Vis.
wild,” in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2015, pp. 3730–3738. Pattern Recognit. Workshops, 2018, pp. 825–833.
[187] X. Huang and S. Belongie, “Arbitrary style transfer in real-time with [213] Y. Cho, R. Malav, G. Pandey, and A. Kim, “DehazeGAN: Underwa-
adaptive instance normalization,” in Proc. IEEE Int. Conf. Comput. Vis., ter haze image restoration using unpaired image-to-image translation,”
2017, pp. 1501–1510. IFAC-PapersOnLine, vol. 52, no. 21, pp. 82–85, 2019.
[188] A. Yu and K. Grauman, “Fine-grained visual comparisons with local [214] Y.-F. Chen, A. K. Patel, and C.-P. Chen, “Image haze removal by adaptive
learning,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014, cyclegan,” in Proc. Asia-Pacific Signal Inf. Process. Assoc. Annu. Summit
pp. 192–199. Conf., 2019, pp. 1122–1127.
[189] Y. Li et al., “Asymmetric GAN for unpaired image-to-image trans- [215] Y. Cho, H. Jang, R. Malav, G. Pandey, and A. Kim, “Underwater im-
lation,” IEEE Trans. Image Process., vol. 28, no. 12, pp. 5881–5896, age dehazing via unpaired image-to-image translation,” Int. J. Control,
Dec. 2019. Automat. Syst., vol. 18, no. 3, pp. 605–614, 2020.
[190] Y. Yan, J. Xu, B. Ni, W. Zhang, and X. Yang, “Skeleton-aided articulated [216] O. Kupyn, V. Budzan, M. Mykhailych, D. Mishkin, and J. Matas, “De-
motion generation,” in Proc. 25th ACM Int. Conf. Multimedia. New York, blurGAN: Blind motion deblurring using conditional adversarial net-
NY, USA: Association for Computing Machinery, 2017, pp. 199–207. works,” in Proc. IEEE Conf. Comput. Vis. pattern Recognit., 2018,
[191] A. Siarohin, E. Sangineto, S. Lathuiliere, and N. Sebe, “Deformable pp. 8183–8192.
GANs for pose-based human image generation,” in Proc. IEEE Conf. [217] H. Liu, P. Navarrete Michelini, and D. Zhu, “Deep networks for image-
Comput. Vis. Pattern Recognit., 2018, pp. 3408–3416. to-image translation with mux and demux layers,” in Proc. Eur. Conf.
[192] L. Ma et al., “Disentangled person image generation,” in Proc. IEEE Comput. Vis. Workshops, Sep. 2018.
Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 99–108. [218] O. Kupyn, T. Martyniuk, J. Wu, and Z. Wang, “DeblurGAN-V2: Deblur-
[193] P. Esser, E. Sutter, and B. Ommer, “A variational u-net for conditional ring (orders-of-magnitude) faster and better,” in Proc. IEEE Int. Conf.
appearance and shape generation,” in Proc. IEEE Conf. Comput. Vis. Comput. Vis., 2019, pp. 8878–8887.
Pattern Recognit., Jun. 2018, pp. 8857–8866. [219] T. Madam Nimisha, K. Sunil, and A. Rajagopalan, “Unsupervised class-
[194] H. Dong et al., “Towards multi-pose guided virtual try-on network,” in specific deblurring,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 353–369.
Proc. IEEE/CVF Int. Conf. Comput. Vis., Oct. 2019, pp. 9026–9035. [220] A. Ignatov, N. Kobyshev, R. Timofte, K. Vanhoey, and L. Van Gool,
[195] J. Huang, J. Liao, and S. Kwong, “Semantic example guided image- “DSLR-quality photos on mobile devices with deep convolutional net-
to-image translation,” IEEE Trans. Multimedia, vol. 23, pp. 1654–1665, works,” in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 3277–3285.
2021. [221] E. de Stoutz, A. Ignatov, N. Kobyshev, R. Timofte, and L. Van Gool,
[196] W. Chen and J. Hays, “Sketchygan: Towards diverse and realistic sketch “Fast perceptual image enhancement,” in Proc. Eur. Conf. Comput. Vis.,
to image synthesis,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018.
2018, pp. 9416–9425. [222] Y.-S. Chen, Y.-C. Wang, M.-H. Kao, and Y.-Y. Chuang, “Deep photo
[197] Z. Li, C. Deng, E. Yang, and D. Tao, “Staged sketch-to-image synthe- enhancer: Unpaired learning for image enhancement from photographs
sis via semi-supervised generative adversarial networks,” IEEE Trans. with GANs,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018,
Multimedia, vol. 23, pp. 2694–2705, 2021. pp. 6306–6314.
[198] A. Shocher et al., “Semantic pyramid for image generation,” in Proc. [223] R. Zheng, Z. Luo, and B. Yan, “Exploiting time-series image-to-image
IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2020, pp. 7457– translation to expand the range of wildlife habitat analysis,” in Proc. AAAI
7466. Conf. Artif Intell., vol. 33, 2019, pp. 825–832.
[199] H. Chang, J. Lu, F. Yu, and A. Finkelstein, “Pairedcyclegan: Asymmetric [224] R. Zhang, T. Pfister, and J. Li, “Harmonic unpaired image-to-image trans-
style transfer for applying and removing makeup,” in Proc. IEEE Conf. lation,” in Proc. Int. Conf. Learn. Representations, 2018.
Comput. Vis. Pattern Recognit., 2018, pp. 40–48. [225] M. M. R. Siddiquee et al., “Learning fixed points in generative adversar-
[200] H. Emami, M. M. Aliabadi, M. Dong, and R. B. Chinnam, “SPA-GAN: ial networks: From image-to-image translation to disease detection and
Spatial attention GAN for image-to-image translation,” IEEE Trans. Mul- localization,” in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 191–200.
timedia, vol. 23, pp. 391–401, 2020. [226] K. Armanious et al., “Unsupervised medical image translation using
[201] A. Bansal, S. Ma, D. Ramanan, and Y. Sheikh, “Recycle-GAN: Unsu- cycle-medgan,” in Proc. 27th Eur. Signal Process. Conf., 2019, pp. 1–5.
pervised video retargeting,” in Proc. Eur. Conf. Comput. Vis., Sep. 2018, [227] S. Kaji and S. Kida, “Overview of image-to-image translation by use of
pp. 119–135. deep neural networks: Denoising, super-resolution, modality conversion,
[202] L. Zhang et al., “Nested scale-editing for conditional image synthesis,” and reconstruction in medical imaging,” Radiological Phys. Technol.,
in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2020, vol. 12, no. 3, pp. 235–248, 2019.
pp. 5477–5487. [228] S. Engelhardt, R. De Simone, P. M. Full, M. Karck, and I. Wolf, “Improv-
[203] H. Su et al., “MangaGAN: Unpaired photo-to-manga translation based ing surgical training phantoms by hyperrealism: Deep unpaired image-
on the methodology of manga drawing,” in Proc. AAAI Conf. Artif Intell., to-image translation from real surgeries,” in Proc. Int. Conf. Med. Image
vol. 35, no. 3, 2021, pp. 2611–2619. Comput. Comput.- Assist. Interv. Cham: Springer, 2018, pp. 747–755.
[204] Y. Gao and J. Wu, “GAN-based unpaired chinese character image trans- [229] S. Gamrian and Y. Goldberg, “Transfer learning for related reinforcement
lation via skeleton transformation and stroke rendering,” in Proc. AAAI learning tasks via image-to-image translation,” in Proc. Int. Conf. Mach.
Conf. Artif Intell., vol. 34, no. 01, 2020, pp. 646–653. Learn., 2019, pp. 2063–2072.
[230] W. Deng et al., “Image-image domain adaptation with preserved self- [255] J. Zhang et al., “Dual in-painting model for unsupervised gaze correction
similarity and domain-dissimilarity for person re-identification,” in Proc. and animation in the wild,” in Proc. 28th ACM Int. Conf. Multimedia. New
IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 994–1003. York, NY, USA: Association for Computing Machinery, 2020, pp. 1588–
[231] Z. Zhong, L. Zheng, Z. Zheng, S. Li, and Y. Yang, “Camera style adap- 1596.
tation for person re-identification,” in Proc. IEEE Conf. Comput. Vis. [256] S. Hicsonmez, N. Samet, E. Akbas, and P. Duygulu, “Ganilla: Generative
Pattern Recognit., Jun. 2018, pp. 5157–5166. adversarial networks for image to illustration translation,” Image Vis.
[232] Z. Zhong, L. Zheng, S. Li, and Y. Yang, “Generalizing a person retrieval Comput., vol. 95, 2020, Art. no. 103886.
model hetero- and homogeneously,” in Proc. Eur. Conf. Comput. Vis., [257] X. Guo et al., “FuseGAN: Learning to fuse multi-focus image via condi-
Sep. 2018, pp. 172–188. tional generative adversarial network,” IEEE Trans. Multimedia, vol. 21,
[233] Z. Zheng et al., “Joint discriminative and generative learning for person no. 8, pp. 1982–1996, Aug. 2019.
re-identification,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recog- [258] L. Ding, H. Tang, Y. Liu, Y. Shi, X. X. Zhu, and L. Bruzzone, “Adversarial
nit., Jun. 2019, pp. 2138–2147. shape learning for building extraction in VHR remote sensing images,”
[234] J. Liu, Z.-J. Zha, D. Chen, R. Hong, and M. Wang, “Adaptive transfer 2021, arXiv:2102.11262.
network for cross-domain person re-identification,” in Proc. IEEE/CVF
Conf. Comput. Vis. Pattern Recognit., Jun. 2019, pp. 7202–7211.
[235] M. Sela, E. Richardson, and R. Kimmel, “Unrestricted facial geome-
try reconstruction using image-to-image translation,” in Proc. IEEE Int. Yingxue Pang (Graduate Student Member, IEEE) re-
Conf. Comput. Vis., Oct. 2017, pp. 1576–1585. ceived the B.E. degree in electronic and information
[236] E. Zakharov, A. Shysheya, E. Burkov, and V. Lempitsky, “Few-shot engineering from the Beijing University of Chemi-
adversarial learning of realistic neural talking head models,” in Proc. cal Technology, Beijing, China, in 2019. She is cur-
IEEE/CVF Int. Conf. Comput. Vis., Oct. 2019, pp. 9459–9468. rently working toward the [Link]. degree with the De-
[237] B. Duan, W. Wang, H. Tang, H. Latapie, and Y. Yan, “Cascade attention partment of Electronic Engineer and Information Sci-
guided residue learning gan for cross-modal translation,” in Proc. 25th ence, University of Science and Technology of China,
Int. Conf. Pattern Recognit., 2021, pp. 1336–1343. Hefei, China. Her current research interests include
[238] H. Tang, W. Wang, D. Xu, Y. Yan, and N. Sebe, “Gesturegan for hand image video processing, computer vision, and deep
gesture-to-gesture translation in the wild,” in Proc. 26th ACM Int. Conf. learning.
Multimedia. New York, NY, USA: Association for Computing Machin-
ery, 2018, pp. 774–782.
[239] H. Tang, S. Bai, and N. Sebe, “Dual attention GANs for semantic image
synthesis,” in Proc. 28th ACM Int. Conf. Multimedia. New York, NY,
USA: Association for Computing Machinery, 2020, pp. 1994–2002. Jianxin Lin received the B.E. and Ph.D. degrees
[240] H. Tang, X. Qi, D. Xu, P. H. Torr, and N. Sebe, “Edge guided in information and communication engineering from
GANs with semantic preserving for semantic image synthesis,” 2020, the University of Science and Technology of China,
arXiv:2003.13898. Hefei, China, in 2015 and 2020, respectively. He is
[241] G. Balakrishnan, A. Zhao, A. V. Dalca, F. Durand, and J. Guttag, “Synthe- currently an Associate Professor with the School of
sizing images of humans in unseen poses,” in Proc. IEEE Conf. Comput. Computer Science and Electronic Engineering, Hu-
Vis. Pattern Recognit., Jun. 2018, pp. 8340–8348. nan University, Changsha, China. His research in-
[242] Z. Zhu et al., “Progressive pose attention transfer for person image terests include image video processing, image video
generation,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., synthesis, and few-shot learning.
Jun. 2019, pp. 2347–2356.
[243] W. Liu et al., “Liquid warping GAN: A unified framework for human
motion imitation, appearance transfer and novel view synthesis,” in Proc.
IEEE/CVF Int. Conf. Comput. Vis., 2019, pp. 5904–5913. Tao Qin (Senior Member, IEEE) received the bache-
[244] H. Tang, S. Bai, L. Zhang, P. H. Torr, and N. Sebe, “XingGAN for person lor’s and Ph.D. degrees in electronic engineering from
image generation,” in Proc. Eur. Conf. Comput. Vis. Cham: Springer, Tsinghua University, Beijing, China. He is an Adjunct
2020, pp. 717–734. Professor (Ph.D. advisor) with the University of Sci-
[245] H. Tang, S. Bai, P. H. Torr, and N. Sebe, “Bipartite graph reasoning GANs ence and Technology of China, Hefei, China. He is a
for person image generation,” in Proc. BMVC, 2020, pp. 0–12. Senior Principal Researcher and managing the Deep
[246] H. Tang et al., “Cycle in cycle generative adversarial networks for and Reinforcement Group with Microsoft Research
keypoint-guided image generation,” in Proc. 27th ACM Int. Conf. Multi- Asia. His research interests include machine learning
media, 2019, pp. 2052–2060. (with the focus on deep learning and reinforcement
[247] H. Tang et al., “Attribute-guided sketch generation,” in Proc. 14th IEEE learning), artificial intelligence (with applications to
Int. Conf. Autom. Face Gesture Recognit., 2019, pp. 1–7. language understanding and computer vision), game
[248] M. Tao et al., “DF-GAN: Deep fusion generative adversarial networks theory and multiagent systems (with applications to cloud computing, online
for text-to-image synthesis,” 2020, arXiv:2008.05865. and mobile advertising, ecommerce), information retrieval, and computational
[249] B. Li, X. Qi, T. Lukasiewicz, and P. H. Torr, “Manigan: Text-guided image advertising. He is a Senior Member of ACM.
manipulation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun.
2020, pp. 7880–7889.
[250] G. Lample et al., “Fader networks: Manipulating images by sliding at-
tributes,” in Proc. Adv. Neural Inf. Process. Syst. 30, I. Guyon, U. V. Zhibo Chen (Senior Member, IEEE) received the
Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. [Link]., and Ph.D. degrees in electronic engineering
Garnett, Eds. Curran Associates, Inc., 2017, pp. 5967–5976. from the Department of Electrical Engineering, Ts-
[251] O. Press, T. Galanti, S. Benaim, and L. Wolf, “Emerging disentanglement inghua University, Beijing, China, in 1998 and 2003,
in auto-encoder based unsupervised image content transfer,” in Proc. Int. respectively. He is currently a Professor with the Uni-
Conf. Learn. Representations, 2018. versity of Science and Technology of China, Hefei,
[252] H. Tang et al., “Expression conditional GAN for facial expression-to- China. He has more than 100 publications and more
expression translation,” in Proc. IEEE Int. Conf. Image Process., 2019, than 50 granted EU and US patent applications. His
pp. 4449–4453. research interests include image and video compres-
[253] W. Wang et al., “Every smile is unique: Landmark-guided diverse smile sion, visual quality of experience assessment, immer-
generation,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., sive media computing, and intelligent media comput-
2018, pp. 7083–7092. ing. He is a Member of the IEEE Visual Signal Processing and Communications
[254] L. Song, Z. Lu, R. He, Z. Sun, and T. Tan, “Geometry guided adversar- Committee, and a Member of the IEEE Multimedia System and Applications
ial facial expression synthesis,” in Proc. 26th ACM Int. Conf. Multime- Committee. He was a TPC Chair of IEEE PCS 2019 and an Organization Com-
dia. New York, NY, USA: Association for Computing Machinery, 2018, mittee Member of ICIP 2017 and ICME 2013, a TPC Member in IEEE ISCAS
pp. 627–635. and IEEE VCIP.