
Text-to-Image Synthesis for Heritage Monuments
Using Generative Adversarial Networks: A Survey

Anjana Bharamnaikar, KLE Technological University, Hubballi, India (bharamnaikaranjana@[Link])
Pooja Gani, KLE Technological University, Hubballi, India (poojagani4@[Link])
Somashekhar M. Kinagi, KLE Technological University, Hubballi, India (kinagimsomashekhar@[Link])
Saarthak Sooji, KLE Technological University, Hubballi, India (saarthaksooji02@[Link])
Uday Kulkarni, KLE Technological University, Hubballi, India (uday_kulkarni@[Link])
Prathit Kulkarni, KLE Technological University, Hubballi, India (prathitk2003@[Link])
Vijaykumar Muttagi, KLE Technological University, Hubballi, India (vijaymuttagi2002@[Link])

Abstract—India's architectural heritage, including iconic structures like the temples of the Vijayanagara Empire, faces significant preservation challenges from factors such as wars, neglect, and environmental degradation. Artificial Intelligence (AI) offers innovative solutions for heritage conservation, providing advanced tools for documenting, analyzing, and digitally restoring these cultural assets. Among AI technologies, Generative Adversarial Networks (GANs) stand out for their ability to reconstruct and visualize heritage sites from textual descriptions, enabling digital restoration and preservation. GANs can realistically recreate lost or damaged elements, offering a virtual glimpse into the past that supports both education and awareness. By providing immersive experiences through virtual tours and digital reconstructions, GANs can help raise global awareness of India's rich architectural legacy, while also engaging local communities in the importance of preservation. This technology allows for interactive exploration, empowering conservationists to make informed decisions that balance historical accuracy with modern conservation needs. Ultimately, GANs can play a crucial role in safeguarding India's architectural heritage for future generations, fostering deeper public understanding and appreciation both locally and globally.

Index Terms—Generative Adversarial Networks (GANs), Heritage Conservation, Indian Digital Heritage Space (IHDS), Text-to-Image Synthesis.

I. INTRODUCTION

India's diverse cultures and traditions are vividly reflected in its historical monuments, such as the ornate temples of the Vijayanagara Empire, known for their stunning architecture and intricate carvings. These temples, like those found in Hampi and Pattadakal, showcase the empire's rich heritage and artistic excellence. The rock-cut temples of Ajanta and Ellora also represent significant cultural contributions, with their intricate sculptures and murals that tell stories of spirituality and craftsmanship. These sites are not merely artistic treasures; they are vital elements of spiritual, social, and historical development. Archiving and sharing knowledge about these cultural assets pose challenges due to their scale and reliability issues. Many sites face deterioration from wear, conflict, and neglect, resulting in significant loss of historical information [1].

Conserving and documenting India's vast history requires collaboration among government agencies, non-profits, and citizens. The sheer number of architectural and archaeological sites poses significant challenges, compounded by low public engagement, which affects many treasures. This deterioration impacts the physical sites and diminishes public appreciation for India's rich history and its role in global civilization.

Several digital heritage projects [2]–[5] worldwide aim to integrate technology with cultural preservation. Such initiatives safeguard cultural heritage, which serves to bridge generations across time [6].

Artificial Intelligence (AI) [7] offers a transformative solution for heritage protection and restoration. Technologies like Machine Learning (ML) [8] are effective in documenting, analyzing, and assisting with the physical restoration of historical monuments [9]. AI improves preservation by assessing structural stability and supporting reconstruction efforts. Computer vision facilitates the digitization and classification of artifacts [10], while natural language processing helps interpret ancient scripts [11]. AI-driven solutions create immersive virtual tours and interactive experiences, providing global audiences with a new perspective on India's rich heritage [12].

AI, particularly ML and its subfield, Deep Learning (DL) [13], has the potential to revolutionize industries, including heritage conservation [14]. ML excels at data analysis and prediction [8], while DL employs multiple neural network layers to detect complex patterns, making it effective for image recognition and for analyzing intricate designs [15]. In heritage conservation, DL can identify subtle wear on artifacts and predict future deterioration, offering innovative preservation methods [16]. Generative AI (GenAI) generates original content from data, transforming creative industries through automation and new product creation [17]. A few applications of GenAI are shown in Fig. 1. A notable application of GenAI is text-to-image synthesis, which creates realistic visuals from textual descriptions [18].

The proposed study is structured into four main sections. Section II provides a concise review of the literature on text-to-image synthesis, with a focus on the superiority of GANs over alternative methods. Section III details the research methodology, featuring a comparative analysis of various GAN architectures applied to text-to-image generation. Section IV discusses the experimental results, highlighting the performance of different GAN models. Lastly, Section V concludes the study by summarizing the key findings, emphasizing the advantages of GANs, and offering suggestions for future research.

II. TEXT TO IMAGE SYNTHESIS

Text-to-image synthesis involves creating images from textual descriptions. In this process, GenAI models [19] serve as a bridge between language and visual content. These models are trained to capture the relationship between text and corresponding visual features, allowing them to produce realistic images that align with the provided descriptions. The primary challenges in text-to-image synthesis include accurately interpreting complex textual descriptions and mapping them to corresponding visual elements, which involves recognizing objects and attributes and understanding relationships between entities [20]. An additional challenge is producing images that align with the input text while also being aesthetically pleasing and coherent. Common issues include a lack of sharpness or fine detail in the generated images [21]. Large, high-quality datasets that pair text with images are essential but resource-intensive to collect, limiting model training [22]. Lastly, models must generalize to unseen text descriptions and generate diverse images, avoiding overfitting while producing varied outputs [23].

Fig. 1. Different applications of GANs.

GANs are effective for text-to-image synthesis because they learn complex data distributions and generate realistic images from textual descriptions [24]. They utilize two networks: a generator that produces data from random noise and a discriminator that distinguishes real data from generated outputs. This process helps the generator improve its output until the discriminator can no longer easily tell them apart [25]. Both networks are typically multilayer perceptrons, and training alternates between improving the discriminator's accuracy and the generator's realism [22].

The basic GAN architecture consists of a Generator creating synthetic data from random noise and a Discriminator evaluating these samples against real data. The adversarial process drives the Generator to produce more realistic data while, at the same time, the Discriminator becomes more adept at differentiating between real and generated samples. This ongoing feedback loop is fundamental to the training of GANs.

In GANs, the loss function of the discriminator improves its ability to differentiate real from generated data, while the loss function of the generator is designed to create samples the discriminator cannot classify as fake [26]. Although adversarial and perceptual losses [20] improve image quality, aligning images with text remains challenging and often requires optimizing multiple loss functions for accurate correspondence.

Encoded text enters the GAN training process through a text-conditioned loss: images are generated from noise conditioned on the text embedding, and the discriminator is updated to distinguish real from fake images. The generator iteratively improves at generating realistic images, as shown in Algorithm 1 [27].

Algorithm 1: Training algorithm for GAN-CLS with step size β, utilizing minibatch stochastic gradient descent (SGD) for simplicity.
Require: A batch of images I, corresponding textual descriptions T, non-matching textual descriptions T̃, and the total number of training iterations N.
for k = 1 to N do
  1. h_t ← ϕ(T)  {Encode the corresponding textual description}
  2. h̃_t ← ϕ(T̃)  {Encode the non-matching textual description}
  3. z ∼ N(0, 1)^Z  {Sample random noise}
  4. Ĩ ← G(z, h_t)  {Generate an image using the generator}
  5. s_real ← D(I, h_t)  {Evaluate real image with matching text}
  6. s_wrong ← D(I, h̃_t)  {Evaluate real image with non-matching text}
  7. s_fake ← D(Ĩ, h_t)  {Evaluate fake image with matching text}
  8. L_D ← log(s_real) + (log(1 − s_wrong) + log(1 − s_fake)) / 2
  9. D ← D − β ∂L_D/∂D  {Update the discriminator}
  10. L_G ← log(s_fake)
  11. G ← G − β ∂L_G/∂G  {Update the generator}
end for
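A minimal PyTorch-style sketch of one GAN-CLS update from Algorithm 1 is given below. The text encoder phi, generator G, and discriminator D are placeholder modules, and the noise dimension and optimizers are illustrative assumptions rather than the settings used in the surveyed papers.

import torch

def gan_cls_step(G, D, phi, images, captions, wrong_captions,
                 opt_G, opt_D, z_dim=100, eps=1e-8):
    """One GAN-CLS update: the discriminator scores (real image, matching text),
    (real image, mismatching text) and (fake image, matching text); the generator
    is then updated so that its fakes score as real (steps 1-11 of Algorithm 1)."""
    h_t = phi(captions)              # encode matching descriptions
    h_wrong = phi(wrong_captions)    # encode non-matching descriptions
    z = torch.randn(images.size(0), z_dim, device=images.device)

    # Discriminator update: maximize log s_real + (log(1 - s_wrong) + log(1 - s_fake)) / 2
    fake = G(z, h_t).detach()
    s_real, s_wrong, s_fake = D(images, h_t), D(images, h_wrong), D(fake, h_t)
    loss_D = -(torch.log(s_real + eps)
               + 0.5 * (torch.log(1 - s_wrong + eps)
                        + torch.log(1 - s_fake + eps))).mean()
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Generator update: maximize log s_fake for freshly generated images
    s_fake = D(G(z, h_t), h_t)
    loss_G = -torch.log(s_fake + eps).mean()
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()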
III. COMPARATIVE STUDY ON TYPES OF GANS

This section offers a comparative analysis of different GAN models [28] used for text-to-image synthesis. Fig. 2 illustrates the flowchart of various types of GANs, highlighting their distinct architectures and functionalities. This comprehensive overview categorizes GAN variants based on their unique approaches and applications, providing a clear understanding of their roles in the field of generative modeling.

Fig. 2. Different Types of GANs Applied in Text-to-Image Synthesis.

A. Vanilla GAN

Vanilla GANs (VGANs) differ from other text-to-image models, such as diffusion or autoregressive models, in their training approach, as they employ an adversarial process involving a generator G0 and a discriminator D0. As shown in Fig. 3, the generator synthesizes new data, while the discriminator evaluates this data by comparing it against real samples. This adversarial mechanism forces the model to learn and generate realistic outputs by mimicking the distribution of the training data.

The generator G0 takes a random noise vector z0 sampled from a prior distribution (e.g., a normal or uniform distribution) and generates a fake image (or data sample) G0(z0), as shown in equation 1.

G_0(z_0) = \mathrm{Generator}(z_0)   (1)

where z_0 \sim p_{z_0}(z_0) is the random noise input. The generator aims to produce data that resembles the real data distribution.

The discriminator D0 takes an image (or data sample) and outputs a probability D0(x0), which represents the likelihood that the input image is real (from the dataset) rather than generated by G0, as shown in equation 2.

D_0(x_0) = \mathrm{Discriminator}(x_0)   (2)

where x_0 is either a real image or a generated image G_0(z_0). The discriminator is trained to distinguish between real and fake images.

The discriminator's loss function, which it seeks to minimize, is provided in equation 3:

L_{D_0} = -\mathbb{E}_{x_0 \sim p_{\mathrm{data}}(x_0)}[\log D_0(x_0)] - \mathbb{E}_{z_0 \sim p_{z_0}(z_0)}[\log(1 - D_0(G_0(z_0)))]   (3)

where D_0(x_0) is the discriminator's output for a real image, D_0(G_0(z_0)) is the discriminator's output for a generated (fake) image, p_{\mathrm{data}}(x_0) is the real data distribution, and p_{z_0}(z_0) is the prior noise distribution.

The generator's objective is to fool the discriminator by generating images that the discriminator classifies as real. The generator's loss function is defined in equation 4.

L_{G_0} = -\mathbb{E}_{z_0 \sim p_{z_0}(z_0)}[\log D_0(G_0(z_0))]   (4)

The two-player minimax game between the generator and discriminator is represented by the value function V(G_0, D_0), where the generator tries to minimize this objective and the discriminator tries to maximize it, as shown in equation 5.

\min_{G_0} \max_{D_0} V(G_0, D_0) = \mathbb{E}_{x_0 \sim p_{\mathrm{data}}(x_0)}[\log D_0(x_0)] + \mathbb{E}_{z_0 \sim p_{z_0}(z_0)}[\log(1 - D_0(G_0(z_0)))]   (5)

The training process involves updating both the generator and the discriminator. The discriminator is trained to maximize the following objective, as shown in equation 6.

\max_{D_0} L_{D_0} = \mathbb{E}_{x_0 \sim p_{\mathrm{data}}(x_0)}[\log D_0(x_0)] + \mathbb{E}_{z_0 \sim p_{z_0}(z_0)}[\log(1 - D_0(G_0(z_0)))]   (6)

This encourages the discriminator to correctly classify real images as real and generated images as fake.

Similarly, the generator is trained to minimize the following objective, shown in equation 7.

\min_{G_0} L_{G_0} = -\mathbb{E}_{z_0 \sim p_{z_0}(z_0)}[\log D_0(G_0(z_0))]   (7)

Alternatively, the generator's objective can be formulated as equation 8, although this form can lead to vanishing gradients, and the negative log formulation is often preferred for stable training.

\min_{G_0} L_{G_0} = \mathbb{E}_{z_0 \sim p_{z_0}(z_0)}[\log(1 - D_0(G_0(z_0)))]   (8)

Fig. 3. Basic architecture for Vanilla GAN [29].
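As an illustration of equations 3 through 8, the following PyTorch-style sketch computes the discriminator loss and the non-saturating generator loss for a vanilla GAN; the generator and discriminator modules and the noise dimension are assumptions made for the example.

import torch

def vanilla_gan_losses(G, D, real_images, z_dim=100):
    """Discriminator loss of equation 3 and non-saturating generator loss of
    equation 4/7 (the log(1 - D(G(z))) form of equation 8 is avoided because
    it can lead to vanishing gradients)."""
    z = torch.randn(real_images.size(0), z_dim, device=real_images.device)
    fake_images = G(z)

    # Discriminator: minimize -[log D(x) + log(1 - D(G(z)))]  (eq. 3)
    d_real = D(real_images)
    d_fake = D(fake_images.detach())
    loss_D = -(torch.log(d_real).mean() + torch.log(1 - d_fake).mean())

    # Generator: minimize -log D(G(z))  (eq. 4 / eq. 7)
    loss_G = -torch.log(D(fake_images)).mean()
    return loss_D, loss_G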
B. Deep Convolutional GAN

Deep Convolutional GANs (DCGANs) [30] represent an advancement of the conventional GAN framework. They utilize convolutional layers in both the generator and discriminator networks, replacing the fully connected layers found in standard GANs. This incorporation of convolutional layers enables the networks to learn hierarchical feature representations, making DCGANs especially effective for image-related applications [31].

In Fig. 4, the DCGAN architecture for text-to-image synthesis begins with an XLNet-based text encoder, which converts text descriptions into 768-dimensional vectors that encapsulate the content and context of the input. These vectors are subsequently processed through conditional augmentation, which introduces variability and allows the model to produce diverse images from the same description. The augmented text embeddings are merged with random noise, typically drawn from a Gaussian distribution, serving as the input for the generator. The generator, designed with transposed convolutional layers, upscales this input from a low-dimensional latent space to high-dimensional, realistic images.

Simultaneously, the discriminator, structured as a Convolutional Neural Network (CNN), assesses both the real images from the dataset and those generated by the generator [32]. Its role is to differentiate between authentic and synthetic images by extracting features through convolutional layers and outputting a probability score. The generator and discriminator are trained in an adversarial manner, with the generator attempting to create images that can deceive the discriminator, while the discriminator aims to accurately classify images as real or fake. Over time, this iterative process enhances the generator's capability to produce realistic images that correspond to the provided text descriptions.

Fig. 4. Architecture of the Deep Convolutional GAN for text-to-image synthesis [30].
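The following sketch shows the flavor of a text-conditioned DCGAN generator built from transposed convolutions, as described above; the layer widths, the 768-dimensional text embedding, and the 64x64 output resolution are illustrative assumptions, not the exact configuration of the surveyed architecture.

import torch
import torch.nn as nn

class TextConditionedDCGANGenerator(nn.Module):
    """Upsamples a (noise + projected text embedding) vector to a 64x64 RGB image."""
    def __init__(self, z_dim=100, text_dim=768, cond_dim=128, ngf=64):
        super().__init__()
        self.project_text = nn.Linear(text_dim, cond_dim)  # compress the text embedding
        self.net = nn.Sequential(
            nn.ConvTranspose2d(z_dim + cond_dim, ngf * 8, 4, 1, 0, bias=False),
            nn.BatchNorm2d(ngf * 8), nn.ReLU(True),          # 4x4
            nn.ConvTranspose2d(ngf * 8, ngf * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf * 4), nn.ReLU(True),          # 8x8
            nn.ConvTranspose2d(ngf * 4, ngf * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf * 2), nn.ReLU(True),          # 16x16
            nn.ConvTranspose2d(ngf * 2, ngf, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf), nn.ReLU(True),              # 32x32
            nn.ConvTranspose2d(ngf, 3, 4, 2, 1, bias=False),
            nn.Tanh(),                                       # 64x64 RGB image in [-1, 1]
        )

    def forward(self, z, text_embedding):
        cond = torch.relu(self.project_text(text_embedding))
        x = torch.cat([z, cond], dim=1).unsqueeze(-1).unsqueeze(-1)  # N x (z+c) x 1 x 1
        return self.net(x)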

C. Stack GAN

Stacked GANs (StackGANs) [21] are sophisticated GAN architectures specifically developed for text-to-image synthesis. They tackle the challenge of producing high-resolution images from textual descriptions through a two-stage process, as illustrated in Fig. 5.

In Stage I, the discriminator D1 assesses real image-text pairs, assigning high probabilities to genuine images Ir paired with the text embedding ϕt. It aims to reduce the misclassification of images generated by the generator G1, which takes a random noise vector n and a conditional vector c1 derived from the text embedding. The loss functions for D1 and G1 are defined by equations 9 and 10.

L_{D_1} = \mathbb{E}_{(I_r, t) \sim p_{\mathrm{real}}}[\log D_1(I_r, \phi_t)] + \mathbb{E}_{n \sim p_n, t \sim p_{\mathrm{real}}}[\log(1 - D_1(G_1(n, c_1), \phi_t))]   (9)

L_{G_1} = \mathbb{E}_{n \sim p_n, t \sim p_{\mathrm{real}}}[\log(1 - D_1(G_1(n, c_1), \phi_t))] + \lambda D_{KL}(\mathcal{N}(\mu_1(\phi_t), \Sigma_1(\phi_t)) \| \mathcal{N}(0, I))   (10)

In Stage II, the enhanced generator G improves the initial image s1 from Stage I using the text embedding c. The discriminator D assesses real images in conjunction with text embeddings ϕt and differentiates them from the refined images produced by G. The loss functions for D and G are provided by equations 11 and 12 [33].

L_D = \mathbb{E}_{(I, t) \sim p_{\mathrm{real}}}[\log D(I, \phi_t)] + \mathbb{E}_{s_1 \sim p_{G_1}, t \sim p_{\mathrm{real}}}[\log(1 - D(G(s_1, c), \phi_t))]   (11)

L_G = \mathbb{E}_{s_1 \sim p_{G_1}, t \sim p_{\mathrm{real}}}[\log(1 - D(G(s_1, c), \phi_t))] + \lambda D_{KL}(\mathcal{N}(\mu(\phi_t), \Sigma(\phi_t)) \| \mathcal{N}(0, I))   (12)

Fig. 5. The architecture of the proposed StackGAN [21].
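A small sketch of the conditioning-augmentation step that produces the conditioning vector and the KL regularizer appearing in equations 10 and 12 is shown below; the text-embedding and conditioning dimensions are illustrative assumptions.

import torch
import torch.nn as nn

class ConditioningAugmentation(nn.Module):
    """Maps a text embedding phi_t to a Gaussian N(mu, Sigma) and samples a
    conditioning vector c; the returned KL term pushes it toward N(0, I)."""
    def __init__(self, text_dim=1024, cond_dim=128):
        super().__init__()
        self.fc = nn.Linear(text_dim, cond_dim * 2)  # predicts mu and log-variance

    def forward(self, phi_t):
        mu, logvar = self.fc(phi_t).chunk(2, dim=1)
        c = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterized sample
        # D_KL( N(mu, Sigma) || N(0, I) ) for a diagonal Gaussian
        kl = 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1, dim=1).mean()
        return c, kl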
D. Feature-Aware GAN

Feature-Aware Generative Adversarial Networks (FA-GANs) enhance image generation by incorporating feature awareness into the GAN framework. This model utilizes feature information to guide the generation process, resulting in images that better align with specified attributes. The architecture of FA-GAN integrates feature maps into both the generator and discriminator, ensuring that the generated images exhibit rich details and conformity with defined features. As depicted in Figs. 6 and 7, the FA-GAN architecture includes a self-supervised discriminator that evaluates the compatibility between input sentences and images, along with an auxiliary decoder trained to reconstruct real images. The generator creates images based on a noise vector n and a sentence vector s, while a feature-aware loss function maximizes the similarity of features between real and generated images.

Fig. 6. Basic proposed architecture of FA-GAN, part 1.

Fig. 7. Basic proposed architecture of FA-GAN, part 2.

The loss functions for the discriminator (L_{adv-D_0}) and the generator (L_{adv-G_0}) are designed to ensure that the discriminator accurately classifies real and generated images, while the generator attempts to deceive the discriminator. The adversarial loss for the discriminator is defined by equations 13 and 14.

L_{adv\text{-}D_0} = -\mathbb{E}_{(y, \tilde{s}) \sim P_{\mathrm{data}}}[\min(0, -1 + D_0(y, \tilde{s}))] - \frac{1}{2}\mathbb{E}_{(\hat{y}, \tilde{s}) \sim P_{\mathrm{mis}}}[\min(0, -1 - D_0(\hat{y}, \tilde{s}))] - \frac{1}{2}\mathbb{E}_{n \sim P_n, \tilde{s} \sim P_{\mathrm{data}}}[\min(0, -1 - D_0(G_0(n, \tilde{s}), \tilde{s}))]   (13)
        + L_{MA\text{-}GP}   (14)

This loss function penalizes the discriminator for misclassifying real, mislabeled, and generated samples. The term L_{MA-GP} is related to the performance of the model.

The adversarial loss for the generator is expressed in equation 15.

L_{adv\text{-}G_0} = -\mathbb{E}_{n \sim P_n, \tilde{s} \sim P_{\mathrm{data}}}[D_0(G_0(n, \tilde{s}), \tilde{s})]   (15)

This loss function aims to maximize the discriminator's output for generated samples, thereby promoting the creation of realistic images.

The reconstruction loss is defined by equation 16.

L_{rec} = \mathbb{E}_{(y, \tilde{s}) \sim P_{\mathrm{data}}}[\, pl(\mathrm{Dec}(\mathrm{Enc}(y, \tilde{s})), y) \,]   (16)

This loss measures the similarity between encoded and original images, ensuring that important features are preserved.

The total losses combine the adversarial terms with the reconstruction and feature-aware terms, as shown in equations 17 and 18.

L_{D_0} = L_{adv\text{-}D_0} + L_{rec}   (17)

L_{G_0} = L_{adv\text{-}G_0} + L_{fa}   (18)

For the generator, L_{G_0} combines the adversarial loss with an additional term L_{fa}, which is related to the generator's performance in the FA-GAN model.
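A compact sketch of the hinge-style adversarial terms in equations 13 through 15 follows; the MA-GP, reconstruction, and feature-aware terms are omitted, and the discriminator D0 is assumed to score an (image, sentence) pair.

import torch
import torch.nn.functional as F

def fa_gan_adv_losses(D0, G0, real_imgs, mismatched_imgs, noise, sent):
    """Hinge adversarial terms of equations 13-15: relu(1 - x) equals -min(0, -1 + x),
    so the three expectations of equation 13 map onto the three relu terms below."""
    fake_imgs = G0(noise, sent)
    loss_D = (F.relu(1.0 - D0(real_imgs, sent)).mean()
              + 0.5 * F.relu(1.0 + D0(mismatched_imgs, sent)).mean()
              + 0.5 * F.relu(1.0 + D0(fake_imgs.detach(), sent)).mean())
    loss_G = -D0(fake_imgs, sent).mean()   # equation 15
    return loss_D, loss_G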
E. AttentionGAN

AttentionGAN (AttnGAN) [34] is a model that generates images from written descriptions of an object, enhancing the realism and relevance of the images it produces. It focuses on specific words in the text and establishes connections by directing attention to the corresponding regions of the image, operating simultaneously on words and sentences. The AttnGAN architecture in Fig. 8 involves training a CNN on text features encoded at both levels. The generator and discriminator networks perform feed-forward inference on these text features to produce photo-realistic images.

The architecture consists of two key components: the Attentional Generative Network, which focuses on relevant words for generating image sub-regions, and the Deep Attentional Multimodal Similarity Model (DAMSM), which guarantees a correspondence between the text descriptions and the produced images, enabling fine-grained synthesis. In the AttnGAN framework, the loss functions are designed to ensure the model effectively learns to generate high-quality images and accurately attends to relevant features; the attention mechanism is given by equation 19.

z_j = \sum_{k=0}^{N-1} \alpha_{j,k} f'_k, \quad \text{where } \alpha_{j,k} = \frac{\exp(t'_{j,k})}{\sum_{l=0}^{N-1} \exp(t'_{j,l})}   (19)

This equation computes the attention-weighted sum of features, z_j, where α_{j,k} denotes the attention weights obtained from the normalized exponentiated scores t'_{j,k}. These weights emphasize the most pertinent features for image generation.

The overall loss function is expressed in equation 20.

L = L_{G_0} + \lambda L_{DAMSM}   (20)

Here, L_{G_0} represents the loss associated with the generator, while L_{DAMSM} is an additional loss component that integrates various regularization or alignment losses. The parameter λ adjusts the influence of the L_{DAMSM} term on the total loss.
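The word-level attention of equation 19 can be sketched as follows; the feature dimensions are illustrative, and the score t'_{j,k} is assumed here to be a dot product between sub-region and word features.

import torch

def word_attention(region_feats, word_feats):
    """Attention-weighted word context for each image sub-region (equation 19).

    region_feats: (B, R, D) features for R image sub-regions
    word_feats:   (B, N, D) features for N words
    returns:      (B, R, D) context vectors z_j
    """
    scores = torch.bmm(region_feats, word_feats.transpose(1, 2))  # t'_{j,k}, shape (B, R, N)
    alpha = torch.softmax(scores, dim=-1)                          # normalize over the words
    return torch.bmm(alpha, word_feats)                            # sum_k alpha_{j,k} f'_k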
The generator loss function is described by equation 21.

L_{G_0} = \sum_{k=0}^{p-1} L_{G_k}   (21)

The generator loss, L_{G_0}, is the aggregate of individual losses L_{G_k} across the different components or stages of the GAN framework.

The individual generator loss (L_{G_k}) is specified in equation 22.

L_{G_k} = -\frac{1}{2}\mathbb{E}_{\tilde{x}_k \sim P_{G_k}}[\log D_k(\tilde{x}_k)] - \frac{1}{2}\mathbb{E}_{\tilde{x}_k \sim P_{G_k}}[\log D_k(\tilde{x}_k, \bar{f})]   (22)

This loss function comprises two segments: the unconditional loss, which motivates the generator to produce samples that the discriminator identifies as real, and the conditional loss, which ensures that the generated samples conform to specified conditions or embeddings.

The discriminator loss function (L_{D_k}) is provided in equation 23.

L_{D_k} = -\frac{1}{2}\mathbb{E}_{y_k \sim P_{\mathrm{data}_k}}[\log D_k(y_k)] - \frac{1}{2}\mathbb{E}_{\tilde{x}_k \sim P_{G_k}}[\log(1 - D_k(\tilde{x}_k))] - \frac{1}{2}\mathbb{E}_{y_k \sim P_{\mathrm{data}_k}}[\log D_k(y_k, \bar{f})] - \frac{1}{2}\mathbb{E}_{\tilde{x}_k \sim P_{G_k}}[\log(1 - D_k(\tilde{x}_k, \bar{f}))]   (23)

The discriminator's loss function is divided into unconditional and conditional components. The unconditional loss assesses the discriminator's capacity to differentiate between real and generated samples without any conditions, while the conditional loss evaluates its ability to classify samples based on particular conditions or embeddings.

Fig. 8. Architecture of AttnGAN [34].

F. Scaling-Up GAN

Scaling-up GANs [23], such as the GigaGAN architecture in Fig. 9, offer significant performance improvements over traditional GANs and other generative models like diffusion models. GigaGAN has much better inference speed: it can generate a 512 px image in 0.13 seconds, orders of magnitude faster than diffusion and autoregressive models that rely on iterative processes. In addition, GigaGAN is capable of synthesizing images with ultra-high resolutions, up to 16 megapixels, very efficiently, making it particularly beneficial for large-scale text-to-image synthesis tasks [35].

The advantages of scaling up GANs include their ability to handle complex datasets and perform well in open-world settings, unlike traditional GANs limited to single or few object categories. Despite these advancements, scaling up GANs presents several challenges. These include stability issues during the training process and a noticeable performance gap in terms of realism and compositionality compared to top models like DALL·E 2. This gap is particularly evident in aspects such as photorealism and the alignment between images and text descriptions. Scaling up GANs also requires significant computational resources, including memory and power, which can be prohibitive for smaller-scale applications. Moreover, scaling often introduces a higher risk of mode collapse, making the design of robust architectures and optimization strategies critical for stable training over long periods [36].

Fig. 9. Architecture of Scaling-up GAN [23].

IV. RESULTS

This section briefly discusses the experimental results for text-to-image synthesis using the different types of GANs.

A. Dataset Description

The Indian Digital Heritage Space (IHDS) dataset comprises 50 distinct classes of images depicting temples and ancient monuments across India. This dataset includes 37 UNESCO-listed sites, making it a valuable resource for architects, historians, and tourists seeking insights into India's rich cultural heritage. Sample images from the dataset are shown in Fig. 10, showcasing the diversity and significance of these heritage sites.

B. Experimental Setup for Training

The experiments were performed on an NVIDIA DGX-1 system, which is equipped with NVIDIA Tesla V100 Tensor Core GPUs. The DGX-1 is optimized for high-performance AI and deep learning applications, featuring eight V100 GPUs that provide exceptional computational power and efficiency.

C. Training and Validation

The IHDS dataset has been used to train a variety of GAN models, including FA-GAN, DCGAN, StackGAN, VGAN, Scaling-up GAN, and AttnGAN. For all these models, the proposed work has utilized Word2Vec as the text encoder.
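As an illustration of this encoding step, the sketch below builds Word2Vec vectors with the gensim library and averages them into a caption embedding suitable for conditioning a generator; the corpus, vector size, and tokenization are assumptions for illustration, not the exact settings of the surveyed experiments.

import numpy as np
from gensim.models import Word2Vec

captions = [
    "stone chariot at the vittala temple in hampi",
    "rock cut cave temple with carved pillars at ellora",
]
tokenized = [c.split() for c in captions]

# Train a small Word2Vec model on the caption corpus (illustrative settings).
w2v = Word2Vec(sentences=tokenized, vector_size=100, window=5, min_count=1, epochs=50)

def encode_caption(tokens, model):
    """Average the word vectors of a caption into one conditioning vector."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

text_embedding = encode_caption(tokenized[0], w2v)  # 100-dimensional vector fed to the generator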
Fig. 10. Sample images of the IHDS dataset [1].

Fig. 11. Graph for FID and Inception Score.

Fig. 12. Graph for R-Precision, SSIM, and PSI.

The GANs generate images based on the text encodings by mapping these high-dimensional vectors to the image space, allowing them to learn the relationship between textual descriptions and visual features. These models have been assessed using several standard metrics: Fréchet Inception Distance (FID), Inception Score (IS), R-precision, Structural Similarity Index Measure (SSIM), and Perceptual Similarity Index (PSI). FID quantifies the similarity between generated images and real images, while IS evaluates the diversity and clarity of the generated samples; R-precision assesses the alignment between generated images and their corresponding text or labels; SSIM evaluates the structural similarity between images; and PSI quantifies perceptual similarity based on human visual perception. Figs. 11 and 12, along with Tables I and II, provide the statistical details of the experiments performed on the dataset.
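A hedged sketch of how two of these metrics can be computed with the torchmetrics package is given below; the image tensors, value ranges, and batch sizes are placeholder assumptions for illustration.

import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.inception import InceptionScore

# Dummy batches standing in for real and generated images, scaled to [0, 1].
real_images = torch.rand(16, 3, 299, 299)
fake_images = torch.rand(16, 3, 299, 299)

fid = FrechetInceptionDistance(feature=2048, normalize=True)
fid.update(real_images, real=True)
fid.update(fake_images, real=False)
print("FID:", fid.compute().item())          # lower is better

inception = InceptionScore(normalize=True)
inception.update(fake_images)
mean_is, std_is = inception.compute()
print("Inception Score:", mean_is.item())    # higher is better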
TABLE I
COMPARISON OF DIFFERENT GANS - FID AND INCEPTION SCORE

Types of GANs      FID     Inception Score
VGAN               0.35    1.65 (30)
DCGAN              0.29    1.98 (26)
StackGAN           0.22    3.45 (20)
AttnGAN            0.24    3.05 (25)
Scaling-up GAN     0.18    2.60 (18)
FA-GAN             0.37    2.35 (14)

TABLE II
COMPARISON OF DIFFERENT GANS - R-PRECISION, SSIM, AND PSI

Types of GANs      R-Precision   SSIM    PSI
VGAN               0.65          0.12    0.44
DCGAN              0.72          0.14    0.41
StackGAN           0.83          0.24    0.36
AttnGAN            0.81          0.21    0.39
Scaling-up GAN     0.79          0.20    0.37
FA-GAN             0.78          0.19    0.42

V. CONCLUSION

In this proposed work, we surveyed several architectures for text-to-image synthesis using GANs on the Indian Digital Heritage Space (IHDS) dataset. Among all models tested, StackGAN was the best performing in balancing image quality, diversity, and structural accuracy. This model captures both global layouts and fine details through its two-stage process, in which low-resolution images are generated in the first stage and refined into high-resolution versions in the second stage. Other models, such as AttnGAN and Scaling-up GAN, also show promise but are marginally less precise and structurally degraded. The generation of detailed images from textual descriptions makes StackGAN particularly useful for applications such as heritage site classification.

VI. FUTURE WORK

Future work could focus on incorporating a more advanced text encoder to enhance the model's understanding of the textual descriptions. Utilizing state-of-the-art language models, such as transformer-based encoders like BERT or GPT, could significantly improve the semantic comprehension of the texts, resulting in more accurate and contextually rich visual representations. This could also help in effectively capturing nuanced historical and cultural descriptions, leading to synthesized images that better reflect the intricate details of heritage monuments. Additionally, exploring multi-modal encoders that jointly learn from text and visual cues could further elevate the quality and coherence of the generated outputs.
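As a pointer toward this direction, the following sketch obtains a caption embedding from a pretrained BERT encoder using the Hugging Face transformers library; the model name and the mean-pooling strategy are illustrative assumptions, not a prescribed design.

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

caption = "ornate stone carvings on a Vijayanagara-era temple gopuram"
inputs = tokenizer(caption, return_tensors="pt", truncation=True)

with torch.no_grad():
    outputs = encoder(**inputs)

# Mean-pool the token embeddings into a single 768-dimensional caption vector
# that could replace Word2Vec as the conditioning input of a GAN generator.
text_embedding = outputs.last_hidden_state.mean(dim=1)
print(text_embedding.shape)  # torch.Size([1, 768])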
REFERENCES

[1] Uday Kulkarni, SM Meena, Sunil V Gurlahosur, and Uma Mudengudi. Classification of cultural heritage sites using transfer learning. In 2019 IEEE Fifth International Conference on Multimedia Big Data (BigMM), pages 391–397. IEEE, 2019.
[2] Anupama Mallik, Santanu Chaudhury, Vijay Chandru, and Sharada Srinivasan. Digital Hampi: Preserving Indian Cultural Heritage. Springer, 2017.
[3] Lora Hristozova, Bilyana Popova, and Sofiya Kovacheva. Project "Digital cultural and historical heritage of Plovdiv municipality": problems and solutions? Cultural and Historical Heritage: Preservation, Presentation, Digitalization (KIN Journal), 7(1):53–68, 2021.
[4] Marinos Ioannides, Pavlos Chatzigrigoriou, Vasiliki Nikolakopoulou, Georgios Leventis, Eirini Papageorgiou, Vasilis Athanasiou, and Christian Sovis. Parian marble: A virtual multimodal museum project. In Digital Heritage. Progress in Cultural Heritage: Documentation, Preservation, and Protection: 6th International Conference, EuroMed 2016, Nicosia, Cyprus, October 31–November 5, 2016, Proceedings, Part II 6, pages 256–264. Springer, 2016.
[5] Pratiksha Benagi, SM Meena, Uday Kulkarni, and Sachin Shetty. Feature extraction and classification of heritage image from crowd source. In 2018 International Conference on Current Trends towards Converging Technologies (ICCTCT), pages 1–5. IEEE, 2018.
[6] Thomas G Weiss, Irina Bokova, Simon Adams, Marwa Al-Sabouni, Kwame Anthony Appiah, Lazare Eloundou Assomo, and Francesco Bandarin. Cultural Heritage and Mass Atrocities. Getty Publications, 2022.
[7] James H Fetzer. What is Artificial Intelligence? Springer, 1990.
[8] Issam El Naqa and Martin J Murphy. What is Machine Learning? Springer, 2015.
[9] Jomana Ahmed Gaber, Sherin Moustafa Youssef, and Karma Mohamed Fathalla. The role of artificial intelligence and machine learning in preserving cultural heritage and art works via virtual restoration. ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 10:185–190, 2023.
[10] Luis Angel Ruiz, José Luiz Lerma, and Josep Gimeno. Application of computer vision techniques to support in the restoration of historical buildings. International Archives of Photogrammetry, Remote Sensing and Spatial Information Sciences, 34(3/B):227–230, 2002.
[11] Michael Piotrowski. Natural Language Processing for Historical Texts, volume 17. Morgan & Claypool Publishers, 2012.
[12] Valeriia Boiko and Larysa Kornytska. Using AI to create summarized virtual art tours. Page 70, 2024.
[13] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
[14] Nancy A Angel, Dakshanamoorthy Ravindran, PM Durai Raj Vincent, Kathiravan Srinivasan, and Yuh-Chung Hu. Recent advances in evolving computing paradigms: Cloud, edge, and fog technologies. Sensors, 22(1):196, 2021.
[15] Molla Hafizur Rahman, Charles Xie, and Zhenghui Sha. A deep learning based approach to predict sequential design decisions. In International Design Engineering Technical Conferences and Computers and Information in Engineering Conference, volume 59179, page V001T02A029. American Society of Mechanical Engineers, 2019.
[16] Niannian Wang, Xuefeng Zhao, Linan Wang, and Zheng Zou. Novel system for rapid investigation and damage detection in cultural heritage conservation based on deep learning. Journal of Infrastructure Systems, 25(3):04019020, 2019.
[17] Orhan Can Yilmazdogan. Elevating workforce stability: augmented reality and artificial intelligence solutions for overcoming employee turnover challenges in hospitality and tourism. Worldwide Hospitality and Tourism Themes, 2024.
[18] Amon Rapp, Chiara Di Lodovico, Federico Torrielli, and Luigi Di Caro. How do people experience the images created by generative artificial intelligence? An exploration of people's perceptions, appraisals, and emotions related to a Gen-AI text-to-image model and its creations. International Journal of Human-Computer Studies, page 103375, 2024.
[19] Chenshuang Zhang, Chaoning Zhang, Mengchun Zhang, and In So Kweon. Text-to-image diffusion models in generative AI: A survey. arXiv preprint arXiv:2303.07909, 2023.
[20] Stanislav Frolov, Tobias Hinz, Federico Raue, Jörn Hees, and Andreas Dengel. Adversarial text-to-image synthesis: A review. Neural Networks, 144:187–209, 2021.
[21] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N Metaxas. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 5907–5915, 2017.
[22] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in Neural Information Processing Systems, 27, 2014.
[23] Minguk Kang, Jun-Yan Zhu, Richard Zhang, Jaesik Park, Eli Shechtman, Sylvain Paris, and Taesung Park. Scaling up GANs for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10124–10134, 2023.
[24] He Huang, Philip S Yu, and Changhu Wang. An introduction to image synthesis with generative adversarial nets. arXiv preprint arXiv:1803.04469, 2018.
[25] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Communications of the ACM, 63(11):139–144, 2020.
[26] Antonia Creswell, Tom White, Vincent Dumoulin, Kai Arulkumaran, Biswa Sengupta, and Anil A Bharath. Generative adversarial networks: An overview. IEEE Signal Processing Magazine, 35(1):53–65, 2018.
[27] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to image synthesis. In International Conference on Machine Learning, pages 1060–1069. PMLR, 2016.
[28] B Thamotharan, AL Sriram, and B Sundaravadivazhagan. A comparative study of GANs (text to image GANs). In International Conference on Advances in Communication Technology and Computer Engineering, pages 229–241. Springer, 2023.
[29] M Durgadevi et al. Generative adversarial network (GAN): A general review on different variants of GAN and applications. In 2021 6th International Conference on Communication and Electronics Systems (ICCES), pages 1–8. IEEE, 2021.
[30] Alec Radford. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
[31] Ronit Sawant, Asadullah Shaikh, Sunil Sabat, and Varsha Bhole. Text to image generation using GAN. In Proceedings of the International Conference on IoT Based Control Networks & Intelligent Systems (ICICNIS), 2021.
[32] Abhishek Pawar, Riya Hiwanj, Ashwini Jadhav, Asrar Shaikh, and Sharmila Kharat. Text-to-face generation using DCGAN with BERT-embedding vectors. In International Conference on Multi-Strategy Learning Environment, pages 383–398. Springer, 2024.
[33] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N Metaxas. StackGAN++: Realistic image synthesis with stacked generative adversarial networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(8):1947–1962, 2018.
[34] Hao Tang, Hong Liu, Dan Xu, Philip HS Torr, and Nicu Sebe. AttentionGAN: Unpaired image-to-image translation using attention-guided generative adversarial networks. IEEE Transactions on Neural Networks and Learning Systems, 34(4):1972–1987, 2021.
[35] Axel Sauer, Tero Karras, Samuli Laine, Andreas Geiger, and Timo Aila. StyleGAN-T: Unlocking the power of GANs for fast large-scale text-to-image synthesis. In International Conference on Machine Learning, pages 30105–30118. PMLR, 2023.
[36] Keisuke Shinohara, Dean C Regan, Yan Tang, Andrea L Corrion, David F Brown, Joel C Wong, John F Robinson, Helen H Fung, Adele Schmitz, Thomas C Oh, et al. Scaling of GaN HEMTs and Schottky diodes for submillimeter-wave MMIC applications. IEEE Transactions on Electron Devices, 60(10):2982–2996, 2013.
