Generative AI in Vision: A Survey On Models, Metrics and Applications
Abstract
1. Introduction
Generative models have long been at the forefront of artificial intelligence, enabling the creation of synthetic data samples with remarkable realism and diversity. Initially introduced as a method for denoising images, diffusion models have evolved into a versatile framework for generating high-quality image, text, and audio data. Over the years, they have garnered significant attention from researchers and practitioners alike for their ability to capture complex data distributions and produce realistic samples.

Figure 1. a) Images generated using Stable Diffusion[70]; b) Image super-resolution results from SR3[72]; c) Image inpainting results from Palette[73]

Early generative models for computer vision, such as Hidden Markov Models (HMMs) and Gaussian Mixture Models (GMMs), relied on hand-designed features and offered limited complexity and diversity. With the advent of deep learning, Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs) enabled impressive image generation. In practice, however, GANs suffered several architectural shortcomings [17]: the simultaneous training of the generator and discriminator was inherently unstable, and the generator would sometimes "collapse" and output many near-identical samples. Then came diffusion models, which were inspired by physics. Diffusion systems borrow from diffusion in non-equilibrium thermodynamics, where the process increases the entropy, or randomness, of the system over time. Recent diffusion models from OpenAI have made the approach practical in everyday applications. This paper presents a systematic review of the techniques and methodologies behind state-of-the-art (SOTA) diffusion models.

The main contributions of this work can be summarized as follows:
• An overview of generative vision models, to bring readers up to speed with the theoretical prerequisites for following the latest trends in diffusion models.
• An in-depth survey of the SOTA approaches for diffusion models, including their applications.
• A discussion of current research gaps and future research directions, to encourage researchers to advance the field of generative vision modeling further.

2. Generative Models in Vision
Figure 2. An extension of the classification of generative models based on [25]

Generative models in vision can be broadly classified into two categories (Fig. 2): models that explicitly learn the probability density function by maximizing the likelihood, and implicit models, which do not directly target the density but learn to sample from it using other strategies.

Normalizing flow and autoregressive models are exact density estimators that model the likelihood of a distribution directly. However, they are limited by the complexity of the data distribution: converging their objective to an exact density representation of high-dimensional, complex data such as images can yield computationally heavy and impractical models.

Variational autoencoders (VAEs) alleviate the computational issue by approximating the intractable density distribution. This allows for a more efficient generative model, with the trade-off of struggling to capture complex data distributions.

Energy-based models offer a flexible modeling objective without architectural restrictions. They use an unnormalized representation of the probability distribution, making them excellent density estimators. However, the intractable objective makes them computationally inefficient for both training and sampling.

GANs[26], on the other hand, do not model the density objective directly. They rely on an adversarial approach, a minimax game between a discriminator and a generator, to learn the data distribution implicitly. Although they have been largely successful across a wide set of applications, training them is difficult and suffers from problems such as vanishing gradients and mode collapse.

Diffusion models use variational or score-based approaches to model the probability density function. They work by perturbing data with continuous or discrete noise injection, applied either to the data directly or to a latent representation (as in the latent diffusion model [70]), and learning the reverse denoising process.

2.1. Diffusion models

2.1.1 Denoising Diffusion Probabilistic Models

Denoising Diffusion Probabilistic Models, or DDPMs for short, are a class of diffusion models based on slowly introducing Gaussian noise to a training data sample x_0 over a large number of time steps (1 → T), thereby obtaining a set of noise-perturbed latent intermediates of the original sample (x_1, x_2, ..., x_T). This forward process is governed by the forward diffusion kernel (FDK). At the end of T time steps, the resulting sample x_T can be approximated as isotropic Gaussian noise. A deep neural network is then tasked with learning to reverse the FDK; this learned reverse transition is called the Reverse Diffusion Kernel (RDK). The aim of the RDK is to predict the noise introduced by the FDK at each time step, starting from x_T, and to slowly remove this noise so as to generate new samples that belong to the same probability distribution as the training dataset [33, 82].
The forward process: Let q(x_0) represent the given sample's probability distribution before the noise perturbation. We define a Markov chain with the FDK given by a Gaussian distribution,

q(x_{1:T} | x_0) := ∏_{t=1}^{T} q(x_t | x_{t−1})    (1)

q(x_t | x_{t−1}) := N(x_t; √(1 − β_t) x_{t−1}, β_t I)    (2)

x_t = √(1 − β_t) x_{t−1} + √(β_t) ε    (3)

where Eq. (2) defines the FDK and ε ∼ N(0, I). The term β_t is a hyper-parameter of the diffusion process controlled by the variance scheduler. By applying this kernel repeatedly, after T time steps we obtain q(x_T), which approximates an isotropic Gaussian distribution provided the covariance matrix of the FDK is isotropic. Since the process is a Markov chain, we can obtain the latent representation and distribution of x_t at any time step directly from x_0 by substituting α_t := 1 − β_t and ᾱ_t := ∏_{s=1}^{t} α_s:

q(x_t | x_0) := N(x_t; √ᾱ_t x_0, (1 − ᾱ_t) I)    (4)

x_t = √ᾱ_t x_0 + √(1 − ᾱ_t) ε    (5)
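To make the closed form in Eqs. (4)-(5) concrete, the following is a minimal sketch (not taken from the cited papers) of the forward noising step in PyTorch. The linear β_t schedule and variable names such as betas and alpha_bars are illustrative assumptions.

import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # variance schedule beta_t (assumed linear)
alphas = 1.0 - betas                           # alpha_t = 1 - beta_t
alpha_bars = torch.cumprod(alphas, dim=0)      # cumulative product alpha_bar_t

def q_sample(x0, t, noise=None):
    """Sample x_t ~ q(x_t | x_0) using the closed form of Eq. (5)."""
    if noise is None:
        noise = torch.randn_like(x0)           # epsilon ~ N(0, I)
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)    # broadcast over image dimensions
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

# usage: noise a batch of (stand-in) images to random time steps
x0 = torch.randn(8, 3, 32, 32)
t = torch.randint(0, T, (8,))
xt = q_sample(x0, t)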
The reverse process: Starting from the approximately isotropic Gaussian distribution obtained at time step T, the aim is to learn the RDK p_θ(x_{t−1} | x_t), which predicts the noise injected at each time step and recovers the original sample distribution q(x_0) within the finite T time steps. The fact that the reverse process starts from random isotropic Gaussian noise is what allows a trained network to produce new samples from the probability distribution of the training dataset at sampling time.

The distribution of the reverse process can be written by integrating over all possible paths taken from p_θ(x_T) to reach p_θ(x_0) ∼ q(x_0) over the time steps T → 1:

p_θ(x_0) := ∫ p_θ(x_{0:T}) dx_{1:T}    (6)

This integral is intractable, since it integrates over a complex, high-dimensional space. To solve this, the authors [33, 82] introduced a variational lower bound (or evidence lower bound, ELBO) on the negative log-likelihood, similar to VAEs [42]. From it we obtain the variational lower bound loss [15],

L_VLB = −log p_θ(x_0 | x_1)  [L_0]
        + D_KL(q(x_T | x_0) ∥ p(x_T))  [L_T]
        + Σ_{t>1} D_KL(q(x_{t−1} | x_t, x_0) ∥ p_θ(x_{t−1} | x_t))  [L_{t−1}]    (7)

Here, the term L_T does not depend on any learnable parameters of the network and can therefore be omitted from the loss. The term D_KL is the Kullback-Leibler divergence, a non-symmetric measure of the statistical distance between two probability distributions. This loss trains the network to estimate the forward-process posterior. The RDK learned by the network is defined as a Gaussian distribution p_θ(x_{t−1} | x_t) = N(x_{t−1}; µ_θ(x_t, t), σ_θ(x_t, t)), where the network, by minimizing the variational lower bound loss, learns to estimate the mean µ_θ and the covariance σ_θ of the forward-process posterior distribution. The loss is further simplified in [33], where the variance β_t from Eq. (3) is held constant rather than learned; the mean µ_θ is then re-parameterized in terms of the noise, yielding the simplified loss

L_simple := E_{x_0, ε}[ ∥ε − ε_θ(x_t, t)∥² ],    (8)

which is equivalent to the loss introduced in the score-based models [83].
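As an illustration of how Eq. (8) is used in practice, here is a minimal training-step sketch in PyTorch. It is an assumption-level example rather than the implementation of [33]: model stands for any ε_θ(x_t, t) network (e.g., a U-Net), and q_sample is the forward-noising helper sketched above.

import torch
import torch.nn.functional as F

def ddpm_loss(model, x0, T, q_sample):
    """L_simple of Eq. (8): predict the injected noise and regress it with MSE."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,), device=x0.device)   # uniformly sampled time steps
    noise = torch.randn_like(x0)                       # epsilon ~ N(0, I)
    xt = q_sample(x0, t, noise)                        # x_t via the closed form of Eq. (5)
    noise_pred = model(xt, t)                          # epsilon_theta(x_t, t)
    return F.mse_loss(noise_pred, noise)

# one optimization step (optimizer and data loader assumed to exist):
# loss = ddpm_loss(model, batch, T=1000, q_sample=q_sample)
# loss.backward(); optimizer.step(); optimizer.zero_grad()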
2.1.2 Noise Conditional Score Models

Score-based models define the dataset's probability distribution as q(x). The score of this distribution is the gradient of its log-density, ∇_x log q(x) [83]. The objective is to train a deep neural network, parameterized by θ, to approximate the score function of the dataset's distribution, s_θ(x) ≈ ∇_x log q(x); this is known as score matching. Using the score function instead of the probability distribution gives the network a tractable objective by eliminating the normalizing constant [35]. The loss function is the Fisher divergence between the true score and the learned score, which can be derived as

E_{q(x)}[ tr(∇_x s_θ(x)) + ½ ∥s_θ(x)∥² ]    (9)

With a trained network, Langevin dynamics allows us to generate new samples using only the score function [83]:

x̃_t = x̃_{t−1} + (ε/2) ∇_x log q(x̃_{t−1}) + √ε z_t,  z_t ∼ N(0, I)    (10)

Under the ideal conditions t → ∞ and ε → 0, this procedure generates exact samples from the dataset's distribution q(x) [97]. New samples can then be generated by simply substituting the learned score function s_θ(x) into Eq. (10).

However, the authors of [83] observed that this approach does not do well in practice, because the estimated scores are often inaccurate in low-density regions. To address this, two solutions are employed to enhance score matching: denoising score matching [83] or sliced score matching [86], where the loss in Eq. (9) is either bypassed or approximated using random projections.

The idea of denoising score matching is to perturb the dataset with isotropic Gaussian noise N(0, σ_i I) at a sequence of scales σ_1 < . . . < σ_L, chosen such that the noise-perturbed distribution stays close to the data distribution when the noise is inserted gradually over a large number of scales, ∇_x log q_σ(x) ≈ ∇_x log q(x). This results in the modified loss

Σ_{i=1}^{L} λ(i) E_{q_{σ_i}(x)}[ ∥∇_x log q_{σ_i}(x) − s_θ(x, i)∥² ]    (11)

where λ(i) ∝ σ_i² is a positive weighting function. For sampling, the Langevin dynamics of Eq. (10) are run with the trained network for each i = L, . . . , 1, using the output of the previous run as the input to the next. This procedure is called annealed Langevin dynamics, and it produces a progressively less noisy sample as i goes from L down to 1.
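The following is a minimal sketch of annealed Langevin sampling with a learned noise-conditional score network score_net(x, i). It is an assumption-level example, not code from [83]: the step-size rule and the number of steps per noise scale are illustrative choices.

import torch

@torch.no_grad()
def annealed_langevin_sample(score_net, shape, sigmas, steps_per_scale=100, eps=2e-5):
    """Annealed Langevin dynamics: run Eq. (10) at each noise scale, largest to smallest.
    `sigmas` is assumed sorted from largest to smallest; the step-size rule is one
    common heuristic, not prescribed here."""
    x = torch.rand(shape)                                   # arbitrary starting sample
    smallest = sigmas[-1]
    for i, sigma in enumerate(sigmas):
        step = eps * (sigma / smallest) ** 2                # larger steps at larger noise scales
        for _ in range(steps_per_scale):
            z = torch.randn_like(x)
            score = score_net(x, torch.full((shape[0],), i, dtype=torch.long))
            x = x + 0.5 * step * score + (step ** 0.5) * z
    return x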
2.1.3 Stochastic Differential Equation Generative Models

So far, we have seen diffusion models that perturb data over a discrete range of time steps or noise scales [33, 83, 84]. In [87], the authors generalized these approaches by defining a stochastic differential equation (SDE) that perturbs the data with noise in a continuum. The aim is to describe the diffusion process, in both the forward and the reverse direction, as an SDE:

dx = f(x, t) dt + g(t) dw    (12)

where w is a Brownian motion, f(x, t) is called the drift coefficient, and g(t) is the diffusion coefficient [87]. This is the SDE defining the forward process. To sample, we need to solve this SDE backward in time; intuitively, the reverse-time process drives a random sample toward regions of high probability density. The reverse process is also an SDE [1, 87], given by

dx = [f(x, t) − g(t)² ∇_x log q_t(x)] dt + g(t) dw    (13)

This SDE is integrated with a negative time step, since the solution runs backward in time (t = T → t = 0). To sample from it, we need to learn the score, as defined in [83], as a function of time. Analogous to the noise-conditional models, where the score function depends on the noise scales σ_L, . . . , σ_1, the score function here depends on time, s_θ(x, t), giving the loss

E_t E_{q_t(x)}[ λ(t) ∥∇_x log q_t(x) − s_θ(x, t)∥² ]    (14)

After training the network so that s_θ(x, t) ≈ ∇_x log q_t(x), we can generate a new sample by solving the reverse SDE starting from pure noise q_T(x). Many SDE solvers exist; the simplest is the Euler-Maruyama method. Similar to [83], we choose a small step ∆t and iterate the discretized reverse SDE:

∆x = [f(x, t) − g²(t) s_θ(x, t)] ∆t + g(t) √|∆t| z_t
x ← x + ∆x
t ← t + ∆t    (15)

Following this, the authors improve the sampling by introducing a predictor-corrector sampler: at every step, the SDE solver (the predictor) proposes x(t + ∆t), and the corrector then uses this as an initial sample and refines it with the score s_θ(x, t + ∆t), for example through Langevin-type updates.
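To illustrate Eq. (15), here is a minimal Euler-Maruyama sketch for integrating the reverse SDE. The drift f(x, t) = 0 and the diffusion coefficient g(t) below correspond to a simple variance-exploding assumption and are illustrative only; they are not the exact setup of [87].

import torch

@torch.no_grad()
def reverse_sde_sample(score_net, shape, n_steps=500, sigma_max=25.0):
    """Euler-Maruyama integration of the reverse SDE (Eq. 15), from t=1 down to t=0.
    Assumes a variance-exploding SDE with f(x, t) = 0 and g(t) = sigma_max**t (illustrative)."""
    dt = -1.0 / n_steps                                   # negative step: backward in time
    x = torch.randn(shape) * sigma_max                    # start from the noise prior q_T(x)
    t = 1.0
    for _ in range(n_steps):
        g = sigma_max ** t                                # g(t), assumed schedule
        score = score_net(x, torch.full((shape[0],), t))
        z = torch.randn_like(x)
        dx = (-(g ** 2) * score) * dt + g * abs(dt) ** 0.5 * z
        x = x + dx
        t = t + dt
    return x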
2.2. Generative Adversarial Networks

Generative Adversarial Networks (GANs) [26] are a class of generative models that implicitly learn the probability distribution q(x) of the dataset using an adversarial approach, in which two networks, the discriminator D and the generator G, play a two-player min-max game. The discriminator's objective is to maximize the binary classification accuracy of distinguishing real from generated images, whereas the generator's objective is to fool the discriminator into misclassifying the generated images:

min_G max_D V(D, G) = E_{x∼q(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))]    (16)

GANs are notoriously difficult to train [2, 26, 62, 76]. The objective is to find the Nash equilibrium (also referred to as a saddle point) of the two-player game in Eq. (16), in which both networks try to minimize their own cost simultaneously. This results in a highly unstable training process, as optimizing D can lead to the deterioration of G and vice versa. Mode collapse is another problem: the generator converges to a small subset of the data distribution instead of covering the whole training set, and thus only generates images belonging to that subset. Furthermore, when the discriminator is trained to optimality, its gradients approach zero; this causes vanishing gradients for the generator, which is then left without guidance on which direction to follow to improve.

To mitigate these problems, various modifications of the vanilla GAN have been introduced, proposing either architectural optimizations or loss function optimizations [36]. As studied in [14, 101], the loss function optimizations target either the discriminator's or the generator's objective. These include minimizing f-divergences beyond the Jensen-Shannon divergence [59]; weight normalization for stabilizing D's training [55]; WGAN and WGAN-GP [3, 30], which replace D's binary classification objective with a critic score based on the Earth Mover (EM), or Wasserstein, distance; EBGAN [113], which introduces an energy-based formulation of D's objective and modifies D's architecture to be an auto-encoder; BEGAN [5], which uses the same auto-encoder architecture for D as EBGAN but changes the objective to use the Wasserstein distance; and SAGAN [111], which introduces self-attention modules in both D and G to enhance the feature maps and uses spectral normalization [55].
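The following is a minimal, hedged sketch of one alternating update of the vanilla objective in Eq. (16) in its binary cross-entropy form. G, D, and the two optimizers are assumed to be defined elsewhere, D is assumed to return raw logits, and the non-saturating generator loss is used, as is common in practice.

import torch
import torch.nn.functional as F

def gan_step(G, D, opt_g, opt_d, real, z_dim=128):
    """One alternating discriminator/generator update for the min-max game of Eq. (16)."""
    b = real.size(0)
    z = torch.randn(b, z_dim, device=real.device)
    fake = G(z)

    # discriminator: maximize log D(x) + log(1 - D(G(z)))
    d_real = D(real)
    d_fake = D(fake.detach())                       # detach so G receives no gradient here
    loss_d = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) + \
             F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # generator: non-saturating variant, maximize log D(G(z))
    d_fake = D(fake)
    loss_g = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()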
2.3. Variational Autoencoders

Variational Autoencoders (VAEs) [41, 42] are generative models based on learning a latent-space representation by placing a prior on the latent vector z before generating a distribution over data. They share the encoder-decoder architecture of auto-encoders but differ considerably in their mathematical formulation [21]. Instead of learning the latent vector as a deterministic, discrete representation of the dataset, VAEs learn a probability distribution over this space: q(x) is mapped to a multivariate Gaussian distribution represented by a mean µ_z and covariance σ_z in the latent space. When generating new samples, we want to be able to start from an isotropic Gaussian latent distribution N(0, I). A regularization term and a reconstruction loss are introduced to achieve this; the regularization term is the KL divergence between the encoder's estimate of the latent distribution and the standard Gaussian. The loss function is formulated as

L(θ, ϕ; x) = −D_KL(p_ϕ(z|x) ∥ p_θ(z))  [L_KL]  +  E_{p_ϕ(z|x)}[log p_θ(x|z)]  [L_reconstruction]    (17)

where ϕ and θ denote the encoder and decoder parameters, respectively. One remaining problem is that sampling z from p_ϕ(z|x) is a stochastic operation that cannot be back-propagated through directly. To allow backpropagation, the authors of [42] introduced the reparameterization trick: the stochastic part ε ∼ N(0, I) is separated out, and the latent vector is reconstructed as z = µ + σ ∗ ε. When the network has been trained to optimality, p_θ(x|z) approximates q(x).
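As a concrete, hedged illustration of Eq. (17) and the reparameterization trick, here is a minimal PyTorch sketch. The closed-form Gaussian KL and the Bernoulli (binary cross-entropy) reconstruction term are standard assumptions rather than the specific design of [42]; encoder and decoder are assumed to exist.

import torch
import torch.nn.functional as F

def vae_loss(encoder, decoder, x):
    """Negative ELBO of Eq. (17): KL regularizer plus reconstruction term."""
    mu, log_var = encoder(x)                      # encoder outputs mean and log-variance
    eps = torch.randn_like(mu)                    # reparameterization trick:
    z = mu + (0.5 * log_var).exp() * eps          #   z = mu + sigma * eps, differentiable in mu and sigma
    x_hat = decoder(z)                            # decoder output assumed in [0, 1]

    # KL( N(mu, sigma^2) || N(0, I) ) in closed form
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp(), dim=1).mean()
    # reconstruction term (Bernoulli decoder assumed)
    recon = F.binary_cross_entropy(x_hat, x, reduction="none").flatten(1).sum(1).mean()
    return recon + kl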
2.4. Autoregressive models

Autoregressive models are generative models that treat the data as a sequence and model the likelihood of each value conditioned on the previous ones [46, 90]. The joint distribution of a data point can be factorized as

q(x) = ∏_{i=1}^{N} q(x_i | x_1, . . . , x_{i−1})    (18)

In vision, this translates to generating images by producing one pixel at a time, conditioned on the previously generated pixels [77, 92, 93]. More formally, the autoregressive network consists of recurrent or convolutional layers that jointly learn the dataset's density in a tractable manner and, during inference, must be run N = n² times to generate a sample image of size n × n. Images are treated much like audio [91] or text [27, 95], where the data is structured and sequenced. [92] introduces a sequential approach to image synthesis by masking the pixels to the right of and below the pixel to be predicted, so that only the pixels above it and to its left are taken into account.

2.5. Normalizing flow models

Normalizing flows map a complex data distribution q(x) to a simple latent distribution p(z) through a set of invertible, bijective and continuous functions z = f(x) such that both f and f⁻¹ are differentiable [19, 20, 69]. When the function f is a deep neural network, the model is called a normalizing flow model. Using the change-of-variables rule, the probability density can be written explicitly as

p_x(x; θ) = p_z(f_θ(x)) |det (∂f_θ(x) / ∂x)|    (19)

Since normalizing flow models estimate the exact likelihood of the distribution, training is done by minimizing the negative log-likelihood,

L_NF = −log p_x(x; θ) = −log p_z(f_θ(x)) − log |det (∂f_θ(x) / ∂x)|    (20)

Once the model is trained, the latent distribution, often chosen as a multivariate Gaussian [43, 60], can be used to generate a sample from the dataset's distribution by simply applying the inverse function f_θ⁻¹ to a latent sample. Although flow models represent the exact data distribution, they are often computationally expensive because they depend on computing Jacobian determinants, they face scalability and expressiveness issues on large and complex data distributions, and, because of the invertibility constraint, they require the input and output dimensions to be the same [8].
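The following is a small sketch, under assumptions, of how the change-of-variables likelihood of Eqs. (19)-(20) is evaluated for a flow built from invertible layers. The layer interface (each layer returning its output together with the log-determinant of its Jacobian) is an assumption, not the API of the cited flow papers.

import torch

def flow_nll(layers, base_log_prob, x):
    """Negative log-likelihood of Eq. (20) for a composed flow z = f_theta(x).
    Each layer is assumed to implement forward(x) -> (y, log_det_jacobian)."""
    log_det_total = torch.zeros(x.shape[0], device=x.device)
    z = x
    for layer in layers:
        z, log_det = layer(z)          # invertible transform plus its log |det J|
        log_det_total += log_det
    # -log p_x(x) = -log p_z(f(x)) - log |det df/dx|
    return -(base_log_prob(z) + log_det_total).mean()

# sampling: draw z from the base Gaussian and apply the inverse layers in reverse order, e.g.
# z = torch.randn(batch, dim); for layer in reversed(layers): z = layer.inverse(z)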
2.6. Energy Based models

Put simply, energy-based models (EBMs) [47, 56, 85] are generative models that assign an energy function E_θ(x) to the dataset's probability distribution q(x): the energy is minimized for samples drawn from the dataset and is high for samples that do not belong to it,

p_θ(x) = e^{−E_θ(x)} / Z_θ    (21)

EBMs are extremely flexible with respect to the type of data they can model [23, 56, 85] and, in vision, are an excellent choice for anomaly detection, since an optimally trained model can distinguish anomalies from nominal samples [107].

Because the energy function is unnormalized, the partition function Z_θ = ∫_x e^{−E_θ(x)} dx is needed in Eq. (21) to normalize the likelihood, and it is usually intractable. This makes training and sampling of EBMs difficult, forcing reliance on computationally heavy methods such as MCMC, contrastive divergence (CD), and score matching, which make EBMs an impractical choice for fast-inference use cases.
3. Evaluation metrics for vision generative models

Evaluating generative models in vision is an active research topic, and different tasks involve specialized metrics such as DrawBench[72], PartiPrompts[106], and CLIPScore[31] for text-to-image tasks, or PSNR for image reconstruction tasks. Here, we mainly introduce metrics that measure image fidelity and model diversity.

3.1. Inception Score (IS)

The Inception Score (IS) was first introduced to assess the quality of images generated by GANs, as an automated alternative to human annotators [76]. Using a feature-extraction network (often the Inception model [88]) trained on the same dataset as the generative model, the score measures two components of the generated samples: the entropy of the class predictions for a single sample, and the entropy of the class distribution over a large number of samples (close to 50K samples is suggested), which measures diversity. For a well-trained model, the entropy of the class prediction for a single sample should be low, and the entropy of the class distribution over all generated samples should be high; this indicates that the network generates images that are both meaningful and diverse. The score is calculated as

IS = exp(E_x[ D_KL(p_θ(y|x) ∥ p_θ(y)) ])    (22)

where D_KL is the KL divergence, p_θ(y|x) is the predicted class probability for a sample, and p_θ(y) is the marginal class distribution over all generated images. The higher the IS, the better the model's generative capabilities.
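For concreteness, a small sketch of Eq. (22) computed from a matrix of predicted class probabilities is given below. It assumes the class probabilities have already been produced by an Inception-style classifier and omits the common practice of averaging the score over several splits.

import numpy as np

def inception_score(p_yx, eps=1e-12):
    """Eq. (22): exp( E_x[ KL( p(y|x) || p(y) ) ] ).
    p_yx: array of shape (num_samples, num_classes) with per-sample class probabilities."""
    p_y = p_yx.mean(axis=0, keepdims=True)                   # marginal class distribution
    kl = (p_yx * (np.log(p_yx + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))

# usage, with predictions from a (hypothetical) classifier helper:
# p = softmax_outputs_from_inception(generated_images)
# print(inception_score(p))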
3.2. Fréchet Inception Distance

A drawback of the Inception Score is that it only considers the generated samples and never compares them with the actual dataset. Moreover, the IS compares class probabilities rather than image feature distributions, which causes it to miss more relevant image features, and it requires a labeled dataset. The Fréchet Inception Distance (FID) [22, 32] overcomes this by comparing features extracted from a chosen layer of the feature extractor. More specifically, it assumes that the extracted features follow a Gaussian distribution with mean µ and covariance Σ, and computes the Fréchet distance (a measure of similarity between two probability distributions) between the features of samples generated by the model (µ_g, Σ_g) and those of the training dataset (µ_t, Σ_t), typically using ∼50K samples:

D_FID = ∥µ_t − µ_g∥² + tr(Σ_t + Σ_g − 2 (Σ_t Σ_g)^{1/2})    (23)

Since FID is a distance, lower is better. For zero-shot applications [57, 65, 105], a modified version called the zero-shot FID is used, where µ_t, Σ_t describe a target distribution defined by textual cues rather than by the training dataset. The Kernel Inception Distance (KID) [7] is another variant of the FID, in which the distance between the Inception feature distributions is measured with a polynomial-kernel maximum mean discrepancy instead of the Fréchet distance.
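Below is a hedged numerical sketch of Eq. (23), assuming the Inception features of real and generated images have already been extracted into two arrays. The matrix square root is taken with SciPy, and the small real-part correction is a common numerical safeguard rather than part of the metric's definition.

import numpy as np
from scipy import linalg

def frechet_inception_distance(feat_real, feat_gen):
    """Eq. (23) on two feature matrices of shape (num_samples, feature_dim)."""
    mu_t, mu_g = feat_real.mean(axis=0), feat_gen.mean(axis=0)
    sigma_t = np.cov(feat_real, rowvar=False)
    sigma_g = np.cov(feat_gen, rowvar=False)

    covmean = linalg.sqrtm(sigma_t @ sigma_g)        # matrix square root of (Sigma_t Sigma_g)
    if np.iscomplexobj(covmean):                     # numerical safeguard
        covmean = covmean.real

    diff = mu_t - mu_g
    return float(diff @ diff + np.trace(sigma_t + sigma_g - 2.0 * covmean))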
3.3. Precision and Recall

Precision and recall [45, 75] are two metrics that follow the same motivation as IS and FID, measuring the quality and the diversity of the generated samples while overcoming issues such as mode dropping. They provide a two-dimensional score in which precision measures the quality of the images produced, while recall measures the diversity coverage of the generative model.

4. Applications in Vision

Generative models find applications in a variety of tasks such as image denoising, inpainting, super-resolution, text-to-video synthesis, image-to-image translation, image search, and reverse image search. These applications can be divided into two broad categories.

4.1. Unconditional generation

As the name suggests, unconditional generative models are trained to learn a target distribution and synthesize new samples without being conditioned on any other input. All models described in Sec. 2 can be considered base unconditional models whose only focus is on learning the target distribution [15, 100]. Unconditional image generation models usually start from a seed that generates a random noise vector; the model then uses this vector to create output images that resemble the training data distribution.

4.2. Conditional generation

By contrast, conditional diffusion models take a prompt together with random initial noise and iteratively remove the noise to construct an image. The prompt guides the denoising process, and once denoising ends after a predetermined number of time steps, the image representation is decoded into an image. There are several forms of conditional models.
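The following is a minimal, assumption-heavy sketch of the prompt-guided denoising loop described above: a text encoder produces an embedding that is fed to the noise-prediction network at every denoising step. The interfaces eps_model(x, t, cond) and text_encoder(prompt) and the simple update rule are illustrative; they are not the exact samplers of the systems surveyed below.

import torch

@torch.no_grad()
def conditional_sample(eps_model, text_encoder, prompt, shape, betas):
    """Prompt-guided ancestral DDPM sampling: the text embedding conditions every step."""
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    cond = text_encoder(prompt)                       # text embedding used at every step
    x = torch.randn(shape)                            # start from pure Gaussian noise
    for t in reversed(range(len(betas))):
        eps = eps_model(x, torch.full((shape[0],), t), cond)
        # posterior mean of p_theta(x_{t-1} | x_t) under the epsilon parameterization
        mean = (x - betas[t] / (1 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + betas[t].sqrt() * noise            # fixed variance beta_t, as in Sec. 2.1.1
    return x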
4.2.1 Text-to-Image Generation

Text-to-image generation has been the most prominent use case for generative models. Providing conditioning textual information for image generation has improved models' generative capabilities [63, 108]. Over the years, the architectures dominating the task have used recurrent layers [53], GANs [39, 49, 67, 78, 110], autoregressive models [18, 64, 106], and diffusion-based models [10, 28, 66, 70, 74].
Figure 3. An example of a conditional generative model: the latent diffusion model [70]

Below, we present an overview of how these models work and report their metrics in Tab. 1.

Table 1. Comparison of different text-to-image models
Model | Architecture | Zero-shot FID↓ (best reported) | IS↑ | Params | Dataset
StackGAN[110] | GAN | - | 8.45 | - | COCO
GigaGAN[39] | GAN | 9.09 | - | 1.0B | COCO
DALL-E[64] | Autoregressive | 27.50 | 18 | 12B | COCO
GLIDE[58] | Diffusion | 12.24 | 23.7 | 5B | COCO
Stable Diffusion[70] | Latent Diffusion | 12.63 | 30.29 | 1.45B | COCO
DALL-E2[66] | Diffusion | 10.39 | - | 5.5B | COCO
Imagen[74] | Diffusion | 7.27 | - | 3B | COCO
Parti-20B[105] | Autoregressive | 3.22 | - | 20B | COCO
Re-Imagen[10] | Diffusion | 6.88 | - | 3.6B | COCO

Text-conditioned GANs: Following cGANs [54], [67] introduced embedded textual information to achieve text-to-image generation. The model is trained jointly on images and text captions. During sampling, the text prompt is converted into a text encoding by an encoder and compressed with a 128-dimensional fully connected layer; this compressed encoding is concatenated with the latent vector z and passed to the generator to produce the image. StackGAN [110] introduced a cascade of two GANs, in which the text encodings are mapped to a Gaussian distribution with random noise and concatenated with the latent vector of the stage-I GAN. The stage-I GAN generates a low-resolution image, which is then embedded into the latent vector of the stage-II GAN to generate a high-resolution image; this image compression is performed with the trained stage-I discriminator.

Text-conditioned autoregressive models: DALL-E [64] is a two-stage autoregressive transformer in which the first stage uses a discrete VAE to tokenize images into a 32×32 grid and the second stage concatenates text tokens produced by a BPE encoder with the image tokens. These concatenated image-text tokens are trained jointly to maximize the ELBO [42] and obtain a joint image-text distribution. Similarly, CogView [18] employs a VQ-VAE[94] for image tokenization and SentencePiece[44] for generating text tokens, and Parti[106] uses the ViT-VQGAN[104] as the image tokenizer and a pre-trained BERT[16] for text encodings, with its best fine-tuned model achieving a SOTA FID score of 3.22 on the MS-COCO dataset.

Text-conditioned diffusion models: GLIDE[58] uses an ADM model [17] with a transformer-based text encoder, conditioning image synthesis on encoded prompts in place of class labels. DALL-E2 (a.k.a. unCLIP)[66] is a two-stage model: the first stage uses the CLIP[63] model to generate an image embedding from the text caption, and the second stage uses this image embedding as a prior to generate samples via a diffusion decoder. The authors also experimented with an autoregressive decoder in place of the diffusion decoder, but the diffusion variant yielded better results. Stable Diffusion[70] is a latent diffusion model (LDM). Instead of operating directly on the complex pixel representation, LDMs apply the DDPM forward kernel to the latent representation produced by an encoder (Fig. 3). A trained denoising U-Net is then applied over a series of steps to denoise this corrupted latent, and the text prompts are injected into the denoising steps through cross-attention layers. The denoised latent is finally passed through a decoder to generate the sample. Because LDMs run the diffusion on the latent representation, their training and sampling have proven computationally inexpensive compared to other models. Imagen[74] encodes its textual prompts with a T5-XXL language model, playing a role similar to CLIP's text encoder. It generates low-resolution images (64×64) with a denoising U-Net and then up-samples them with a cascade of two super-resolution U-Net diffusion models to produce images of size 256×256 and 1024×1024, respectively. Re-Imagen[10] retrieves the k nearest-neighboring images from a dataset based on the text prompt and uses them as references for generating new samples. DALL-E3[6] focuses on improving image quality by re-captioning text prompts into more descriptive prompts, which has proven to generate higher-quality images.

4.2.2 Image super resolution

ViT-based models have been shown to achieve SOTA results for the task of image super-resolution [11–13, 109, 112]. Even so, the generalization capabilities of generative models may soon catch up on the leaderboards [24, 99]. SRGAN[48] first introduced a generative framework for this task using an adversarial objective. ESRGAN[96] and RFB-ESRGAN[81] further improve on the SRGAN implementation through architectural modifications such as a relativistic discriminator, dense residual blocks, and an upgraded perceptual loss. GLEAN [9] introduced a novel encoder-bank-decoder approach, in which the encoder's latent vectors and multi-layer convolutional features are passed to a StyleGAN[40]-based latent bank. This generative bank combines features from the encoder at various scales and
generates new latent feature representations that are passed to the decoder to produce the super-resolution image. SR3[72] and SRDiff[50] were the first diffusion (DDPM) based SR models. SR3 uses a conditional DDPM U-Net architecture with adaptations to the residual layers and conditions the denoising process directly on the LR image at each iteration. SRDiff instead uses an encoder-generated embedding of the LR image and conditions each denoising step by concatenating this embedding. IDM[24] adds an implicit neural representation to the conditioned DDPM to achieve continuous restoration over multiple resolutions using the current iteration's features. EDiffSR[99] uses the SDE diffusion process, in which the isotropic noise is conditioned on the LR image during sampling to generate the high-resolution image.
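As a hedged sketch of the SR3-style conditioning described above, the snippet below shows the only change relative to unconditional DDPM sampling: the low-resolution image is upsampled and concatenated with the current noisy estimate at every denoising step. The eps_model interface and the interpolation choice are assumptions, not the exact SR3 implementation.

import torch
import torch.nn.functional as F

@torch.no_grad()
def sr_sample(eps_model, lr_image, scale, betas):
    """DDPM sampling for super-resolution, conditioned on the LR image at each step."""
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    lr_up = F.interpolate(lr_image, scale_factor=scale, mode="bicubic")  # naive upsampled condition
    x = torch.randn_like(lr_up)
    for t in reversed(range(len(betas))):
        inp = torch.cat([x, lr_up], dim=1)            # condition by channel-wise concatenation
        eps = eps_model(inp, torch.full((x.shape[0],), t))
        mean = (x - betas[t] / (1 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + betas[t].sqrt() * noise
    return x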
conditioned latent vector input to the generator, which en-
AnoGAN[79] and f-AnoGAN[80] are unsupervised adver- abled them for large region image inpainting. Pallete[73] is
sarial anomaly detection networks that were trained on a DDPM[33] based image-to-image translation model ca-
healthy data and using the proposed anomaly score along pable of image inpainting. It fills the masked region with
with the residual score; the model predicts anomalies on standard Gaussian noise and performs the denoising train-
unseen data depending on the variation in the learned latent ing only on the masked region. RePaint[51] salvages the
space. DifferNet[71] uses the normalizing flow model to pre-trained unconditional DDPM model instead of training
map the density of healthy image features extracted from a model for the inpainting task. Since DDPM follows a
a feature extraction network. By training on healthy data, Markov chain for data perturbation, the masked input im-
anomalies will have a lower likelihood and will be out of age’s noise-perturbed data is known for every iteration in re-
distribution in the density space. FastFlow[103] uses a sim- verse. Using this knowledge, the denoising process is con-
ilar approach but extends the normalizing flow into a 2D ditioned to add the noise-perturbed image at each reverse
space, which allows to directly output location results of step to predict the masked region. Given the stochastic na-
anomalies. CFLOW-AD[29] uses an encoder feature ex- ture, this process can generate multiple candidates for the
tractor where features from every scale are pooled to form inpainting task.
multi-scale feature vectors, which are passed to the spe-
cific normalizing flow decoder along with the positional
4.2.5 Other tasks
encoding for localization of anomalies. The outputs from
each decoder are aggregated to generate an anomaly map. In addition to the above tasks, these models can be
AnoDDPM[98] approaches the problem using a DDPM used for various other generative tasks like image-to-
model with simplex noise for data perturbation. The dif- image translation[38, 89], image colorization[73], video
fusion model is trained to generate healthy images. During generation[34], point cloud generation[52], restoration,
inference, the simplex noise perturbs the test sample for a etc.
certain number of pre-set steps, and the denoising diffusion
model then generates an anomaly-free image using the per- 5. Future directions
turbed sample as the prior. Comparing the generated and
input samples using reconstruction error, an anomaly seg- Some of the unsolved but highly sought-after directions that
mentation map is generated. researchers can take are:
• Exploring Time-Series Forecasting Applications: Future
research could delve deeper into leveraging diffusion
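The reconstruction-error scoring shared by several of the methods above can be sketched in a few lines. This is a generic, hedged illustration (per-pixel error plus a simple threshold), not the specific scoring rule of AnoDDPM or of the other cited works.

import torch

def anomaly_map(x, x_reconstructed, threshold=0.1):
    """Per-pixel reconstruction error between the input and its 'healthy' reconstruction.
    Returns the error map, a binary segmentation from thresholding it, and a per-image score."""
    err = (x - x_reconstructed).abs().mean(dim=1, keepdim=True)   # average over channels
    segmentation = (err > threshold).float()
    score = err.flatten(1).mean(dim=1)                            # one scalar score per image
    return err, segmentation, score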
4.2.4 Image inpainting

Image inpainting tasks include restoration, texture synthesis, and mask filling [4]. Context encoders [61] applied an adversarial loss alongside the reconstruction loss to achieve sharp and coherent mask filling. [37] advances the single discriminator by introducing a mixture of local and global context discriminators: the local discriminator takes the filled mask region as input and the global discriminator takes the whole image, and together they form the discriminator's objective. Following this, [102] uses a two-step generative approach in which a first generator, trained on a reconstruction loss, produces a coarse prediction, and a second refinement generator, with a WGAN-GP[30]-based objective, is trained on local and global adversarial losses together with the reconstruction loss. StructureFlow[68] splits the GAN generation into two steps, a structure generator and a texture generator: the structure generator creates a smoothed, edge-preserving image, and the texture generator fills in texture on this smooth reconstruction. The texture generator receives an additional appearance-flow input in the latent space, which predicts the texture of the masked regions from the texture of the source regions. CoModGAN[114] further generalizes the inpainting task by feeding the generator both the masked input image and a stochastic noise-conditioned latent vector, which enables large-region image inpainting. Palette[73] is a DDPM[33]-based image-to-image translation model capable of image inpainting: it fills the masked region with standard Gaussian noise and performs the denoising training only on the masked region. RePaint[51] reuses a pre-trained unconditional DDPM instead of training a dedicated inpainting model. Since DDPM follows a Markov chain for data perturbation, the noise-perturbed version of the known (unmasked) part of the input image is available at every reverse iteration; using this, the denoising process is conditioned by re-inserting the noise-perturbed known pixels at each reverse step while the masked region is predicted. Given its stochastic nature, this process can generate multiple candidates for the inpainting task.

4.2.5 Other tasks

In addition to the above tasks, these models can be used for various other generative tasks, such as image-to-image translation [38, 89], image colorization [73], video generation [34], point cloud generation [52], restoration, and more.

5. Future directions

Some of the open but highly sought-after directions that researchers can pursue are:
• Exploring Time-Series Forecasting Applications: Future research could delve deeper into leveraging diffusion models for improved forecasting accuracy and efficiency.
• Physics-Inspired Generative Models: Future research could focus on advancing physics-inspired generative models to achieve unprecedented speed and quality in content creation.
• Ethical Considerations: Future research could address issues related to bias, fairness, and the societal impact of generative diffusion models.
6. Conclusion

This survey paper has provided a detailed examination of generative AI diffusion models, shedding light on their techniques, applications, and challenges. Despite their promising capabilities, generative AI diffusion models still face significant challenges such as training stability, scalability issues, and interpretability concerns. Addressing these challenges will be crucial for advancing the field and unlocking the full potential of diffusion models in generating realistic and diverse data samples. By synthesizing current research findings and identifying key areas for future research, this survey aims to guide researchers and practitioners toward further advancements in generative AI diffusion models, paving the way for innovative applications and breakthroughs in artificial intelligence.

References
[1] Brian D.O. Anderson. Reverse-time diffusion equation models. Stochastic Processes and their Applications, 12(3):313–326, 1982.
[2] Martin Arjovsky and Léon Bottou. Towards principled methods for training generative adversarial networks, 2017.
[3] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN, 2017.
[4] Marcelo Bertalmio, Guillermo Sapiro, Vincent Caselles, and Coloma Ballester. Image inpainting. In Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, pages 417–424, USA, 2000. ACM Press/Addison-Wesley Publishing Co.
[5] David Berthelot, Thomas Schumm, and Luke Metz. BEGAN: Boundary equilibrium generative adversarial networks, 2017.
[6] James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. https://cdn.openai.com/papers/dall-e-3.pdf, 2023.
[7] Mikołaj Bińkowski, Danica J. Sutherland, Michael Arbel, and Arthur Gretton. Demystifying MMD GANs, 2021.
[8] Sam Bond-Taylor, Adam Leach, Yang Long, and Chris G. Willcocks. Deep generative modelling: A comparative review of VAEs, GANs, normalizing flows, energy-based and autoregressive models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(11):7327–7347, 2022.
[9] Kelvin C.K. Chan, Xintao Wang, Xiangyu Xu, Jinwei Gu, and Chen Change Loy. GLEAN: Generative latent bank for large-factor image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14245–14254, 2021.
[10] Wenhu Chen, Hexiang Hu, Chitwan Saharia, and William W. Cohen. Re-Imagen: Retrieval-augmented text-to-image generator, 2022.
[11] Xiangyu Chen, Xintao Wang, Jiantao Zhou, Yu Qiao, and Chao Dong. Activating more pixels in image super-resolution transformer, 2023.
[12] Zheng Chen, Yulun Zhang, Jinjin Gu, Linghe Kong, and Xiaokang Yang. Recursive generalization transformer for image super-resolution, 2023.
[13] Zheng Chen, Yulun Zhang, Jinjin Gu, Linghe Kong, Xiaokang Yang, and Fisher Yu. Dual aggregation transformer for image super-resolution, 2023.
[14] Antonia Creswell, Tom White, Vincent Dumoulin, Kai Arulkumaran, Biswa Sengupta, and Anil A. Bharath. Generative adversarial networks: An overview. IEEE Signal Processing Magazine, 35(1):53–65, 2018.
[15] Florinel-Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu, and Mubarak Shah. Diffusion models in vision: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(9):10850–10869, 2023.
[16] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding, 2019.
[17] Prafulla Dhariwal and Alex Nichol. Diffusion models beat GANs on image synthesis, 2021.
[18] Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, and Jie Tang. CogView: Mastering text-to-image generation via transformers, 2021.
[19] Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear independent components estimation, 2015.
[20] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP, 2017.
[21] Carl Doersch. Tutorial on variational autoencoders, 2021.
[22] D.C. Dowson and B.V. Landau. The Fréchet distance between multivariate normal distributions. Journal of Multivariate Analysis, 12(3):450–455, 1982.
[23] Yilun Du and Igor Mordatch. Implicit generation and modeling with energy based models. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2019.
[24] Sicheng Gao, Xuhui Liu, Bohan Zeng, Sheng Xu, Yanjing Li, Xiaoyan Luo, Jianzhuang Liu, Xiantong Zhen, and Baochang Zhang. Implicit diffusion models for continuous super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10021–10030, 2023.
[25] Ian Goodfellow. NIPS 2016 tutorial: Generative adversarial networks, 2017.
[26] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks, 2014.
[27] Alex Graves. Generating sequences with recurrent neural networks, 2014.
[28] Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. Vector quantized diffusion model for text-to-image synthesis, 2022.
[29] Denis Gudovskiy, Shun Ishizaka, and Kazuki Kozuka. CFLOW-AD: Real-time unsupervised anomaly detection with localization via conditional normalizing flows. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 98–107, 2022.
[30] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. Improved training of Wasserstein GANs, 2017.
[31] Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. CLIPScore: A reference-free evaluation metric for image captioning, 2022.
[32] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium, 2018.
[33] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models, 2020.
[34] Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P. Kingma, Ben Poole, Mohammad Norouzi, David J. Fleet, et al. Imagen Video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022.
[35] Aapo Hyvärinen and Peter Dayan. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(4), 2005.
[36] Guillermo Iglesias, Edgar Talavera, and Alberto Díaz-Álvarez. A survey on GANs for computer vision: Recent research, analysis and taxonomy. Computer Science Review, 48:100553, 2023.
[37] Satoshi Iizuka, Edgar Simo-Serra, and Hiroshi Ishikawa. Globally and locally consistent image completion. ACM Trans. Graph., 36(4), 2017.
[38] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks, 2018.
[39] Minguk Kang, Jun-Yan Zhu, Richard Zhang, Jaesik Park, Eli Shechtman, Sylvain Paris, and Taesung Park. Scaling up GANs for text-to-image synthesis, 2023.
[40] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks, 2019.
[41] Diederik P. Kingma and Max Welling. An introduction to variational autoencoders. Foundations and Trends in Machine Learning, 12(4):307–392, 2019.
[42] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes, 2022.
[43] Ivan Kobyzev, Simon J.D. Prince, and Marcus A. Brubaker. Normalizing flows: An introduction and review of current methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(11):3964–3979, 2021.
[44] Taku Kudo and John Richardson. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing, 2018.
[45] Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models, 2019.
[46] Hugo Larochelle and Iain Murray. The neural autoregressive distribution estimator. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 29–37, Fort Lauderdale, FL, USA, 2011. PMLR.
[47] Yann LeCun, Sumit Chopra, and Raia Hadsell. A tutorial on energy-based learning. 2006.
[48] Christian Ledig, Lucas Theis, Ferenc Huszar, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, and Wenzhe Shi. Photo-realistic single image super-resolution using a generative adversarial network, 2017.
[49] Bowen Li, Xiaojuan Qi, Thomas Lukasiewicz, and Philip H. S. Torr. Controllable text-to-image generation, 2019.
[50] Haoying Li, Yifan Yang, Meng Chang, Shiqi Chen, Huajun Feng, Zhihai Xu, Qi Li, and Yueting Chen. SRDiff: Single image super-resolution with diffusion probabilistic models. Neurocomputing, 479:47–59, 2022.
[51] Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. RePaint: Inpainting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11461–11471, 2022.
[52] Shitong Luo and Wei Hu. Diffusion probabilistic models for 3D point cloud generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2837–2845, 2021.
[53] Elman Mansimov, Emilio Parisotto, Jimmy Lei Ba, and Ruslan Salakhutdinov. Generating images from captions with attention, 2016.
[54] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets, 2014.
[55] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks, 2018.
[56] Jiquan Ngiam, Zhenghao Chen, Pang Koh, and Andrew Ng. Learning deep energy models. pages 1105–1112, 2011.
[57] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models, 2022.
[58] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models, 2022.
[59] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization, 2016.
[60] George Papamakarios, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mohamed, and Balaji Lakshminarayanan. Normalizing flows for probabilistic modeling and inference, 2021.
[61] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A. Efros. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2536–2544, 2016.
[62] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks, 2016.
[63] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021.
[64] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation, 2021.
[65] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents, 2022.
[66] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents, 2022.
[67] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to image synthesis, 2016.
[68] Yurui Ren, Xiaoming Yu, Ruonan Zhang, Thomas H. Li, Shan Liu, and Ge Li. StructureFlow: Image inpainting via structure-aware appearance flow. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 181–190, 2019.
[69] Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows, 2016.
[70] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2022.
[71] Marco Rudolph, Bastian Wandt, and Bodo Rosenhahn. Same same but DifferNet: Semi-supervised defect detection with normalizing flows, 2020.
[72] Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J. Fleet, and Mohammad Norouzi. Image super-resolution via iterative refinement, 2021.
[73] Chitwan Saharia, William Chan, Huiwen Chang, Chris A. Lee, Jonathan Ho, Tim Salimans, David J. Fleet, and Mohammad Norouzi. Palette: Image-to-image diffusion models, 2022.
[74] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J. Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding, 2022.
[75] Mehdi S. M. Sajjadi, Olivier Bachem, Mario Lucic, Olivier Bousquet, and Sylvain Gelly. Assessing generative models via precision and recall, 2018.
[76] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs, 2016.
[77] Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P. Kingma. PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications, 2017.
[78] Axel Sauer, Tero Karras, Samuli Laine, Andreas Geiger, and Timo Aila. StyleGAN-T: Unlocking the power of GANs for fast large-scale text-to-image synthesis, 2023.
[79] Thomas Schlegl, Philipp Seeböck, Sebastian M. Waldstein, Ursula Schmidt-Erfurth, and Georg Langs. Unsupervised anomaly detection with generative adversarial networks to guide marker discovery, 2017.
[80] Thomas Schlegl, Philipp Seeböck, Sebastian M. Waldstein, Georg Langs, and Ursula Schmidt-Erfurth. f-AnoGAN: Fast unsupervised anomaly detection with generative adversarial networks. Medical Image Analysis, 54:30–44, 2019.
[81] Taizhang Shang, Qiuju Dai, Shengchen Zhu, Tong Yang, and Yandong Guo. Perceptual extreme super resolution network with receptive field block, 2020.
[82] Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics, 2015.
[83] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution, 2020.
[84] Yang Song and Stefano Ermon. Improved techniques for training score-based generative models, 2020.
[85] Yang Song and Diederik P. Kingma. How to train your energy-based models, 2021.
[86] Yang Song, Sahaj Garg, Jiaxin Shi, and Stefano Ermon. Sliced score matching: A scalable approach to density and score estimation, 2019.
[87] Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations, 2021.
[88] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions, 2014.
[89] Narek Tumanyan, Michal Geyer, Shai Bagon, and Tali Dekel. Plug-and-play diffusion features for text-driven image-to-image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1921–1930, 2023.
[90] Benigno Uria, Marc-Alexandre Côté, Karol Gregor, Iain Murray, and Hugo Larochelle. Neural autoregressive distribution estimation, 2016.
[91] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio, 2016.
[92] Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks, 2016.
[93] Aaron van den Oord, Nal Kalchbrenner, Oriol Vinyals, Lasse Espeholt, Alex Graves, and Koray Kavukcuoglu. Conditional image generation with PixelCNN decoders, 2016.
[94] Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning, 2018.
[95] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need, 2023.
[96] Xintao Wang, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu, Chao Dong, Chen Change Loy, Yu Qiao, and Xiaoou Tang. ESRGAN: Enhanced super-resolution generative adversarial networks, 2018.
[97] Max Welling and Yee W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681–688, 2011.
[98] Julian Wyatt, Adam Leach, Sebastian M. Schmon, and Chris G. Willcocks. AnoDDPM: Anomaly detection with denoising diffusion probabilistic models using simplex noise. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 649–655, 2022.
[99] Yi Xiao, Qiangqiang Yuan, Kui Jiang, Jiang He, Xianyu Jin, and Liangpei Zhang. EDiffSR: An efficient diffusion probabilistic model for remote sensing image super-resolution. IEEE Transactions on Geoscience and Remote Sensing, 62:1–14, 2024.
[100] Ling Yang, Zhilong Zhang, Yang Song, Shenda Hong, Runsheng Xu, Yue Zhao, Wentao Zhang, Bin Cui, and Ming-Hsuan Yang. Diffusion models: A comprehensive survey of methods and applications, 2023.
[101] Xin Yi, Ekta Walia, and Paul Babyn. Generative adversarial network in medical imaging: A review. Medical Image Analysis, 58:101552, 2019.
[102] Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S. Huang. Generative image inpainting with contextual attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[103] Jiawei Yu, Ye Zheng, Xiang Wang, Wei Li, Yushuang Wu, Rui Zhao, and Liwei Wu. FastFlow: Unsupervised anomaly detection and localization via 2D normalizing flows, 2021.
[104] Jiahui Yu, Xin Li, Jing Yu Koh, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu, Jason Baldridge, and Yonghui Wu. Vector-quantized image modeling with improved VQGAN, 2022.
[105] Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, and Yonghui Wu. Scaling autoregressive models for content-rich text-to-image generation, 2022.
[106] Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, and Yonghui Wu. Scaling autoregressive models for content-rich text-to-image generation, 2022.
[107] Shuangfei Zhai, Yu Cheng, Weining Lu, and Zhongfei Zhang. Deep structured energy based models for anomaly detection. In International Conference on Machine Learning, pages 1100–1109. PMLR, 2016.
[108] Chenshuang Zhang, Chaoning Zhang, Mengchun Zhang, and In So Kweon. Text-to-image diffusion models in generative AI: A survey, 2023.
[109] Dafeng Zhang, Feiyu Huang, Shizhuo Liu, Xiaobing Wang, and Zhezhu Jin. SwinFIR: Revisiting the SwinIR with fast Fourier convolution and improved training for image super-resolution, 2023.
[110] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris Metaxas. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks, 2017.
[111] Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. Self-attention generative adversarial networks, 2019.
[112] Leheng Zhang, Yawei Li, Xingyu Zhou, Xiaorui Zhao, and Shuhang Gu. Transcending the limit of local window: Advanced super-resolution transformer with adaptive token dictionary, 2024.
[113] Junbo Zhao, Michael Mathieu, and Yann LeCun. Energy-based generative adversarial network, 2017.
[114] Shengyu Zhao, Jonathan Cui, Yilun Sheng, Yue Dong, Xiao Liang, Eric I Chang, and Yan Xu. Large scale image completion via co-modulated generative adversarial networks, 2021.