Project Report Format 2023
OBJECTIVE
A Project Report is the documentation of a graduate student's project work: a record of the
original work done by the student. It provides information on the student's research work to
future researchers. The Department is committed to preserving a proper copy of the student's
report, archiving and cataloguing it in the Departmental Library, and making it available to
others for academic purposes.
Standardization, readability, conformance to ethical norms, and durability are the four overriding
criteria for an acceptable form of a report.
The objective of this document is to provide a set of guidelines that help a research student to
prepare the report to satisfy the above-mentioned criteria.
PRODUCTION
Report Size
1. The Report should contain a minimum of 50 pages.
Paper Size
2. The standard paper size for a Report is 21.5 cm (8½ inches) wide and 28 cm (11 inches) long.
3. Oversized figures and tables, if any, should be reduced to fit the size of the report, but the
reduction should not be so drastic as to impair the clarity of their contents. Such pages may also
be folded to fit the report size.
Single-Sided printing
4. It is suggested that the report be printed on one side of the paper. Double-sided printing may
be used only if the paper is opaque enough that print on one side does not impair readability of
the other side under normal lighting conditions.
Non-Paper Material
5. Digital or magnetic materials, such as CDs and DVDs, may be included in the report. They
must be placed in a closed pocket on the inside of the back cover of the report. It should be
borne in mind that their formats may become obsolete due to rapid changes in technology,
making it impossible for the Library to guarantee their preservation and use.
6. Each non-paper item, as above, must carry a label indicating the name of the student, the date
of submission, and the copyright notice.
Page Numbering
7. Page numbers for the prefacing materials of the report shall be in small Roman numerals and
should be centered at the bottom of the pages.
8. Page numbers for the body of the report should be in Arabic numerals and should be centered
at the bottom of the pages. The pagination should start with the first page of Chapter 1 and
should continue throughout the text (including tables, figures, and appendices).
Order of Report
1. Front Page (refer to the sample page)
2. Certificate (refer to the sample page)
3. Acknowledgement
4. Abstract
5. Table of Contents
6. List of Tables
7. List of Figures
8. List of Abbreviations
9. Chapters (all chapters, from the introduction to the conclusion and future work)
10. Appendix
11. References
Acknowledgements
It should be limited, preferably, to one page.
Contents
Chapter numbers, chapter names, section numbers, section headings, subsection numbers, and
subsection headings, along with the corresponding page numbers, should be given in the
contents.
See Sample Page
List of Symbols
All the symbols used in the report are to be given here along with their explanations and units of
measurement (if applicable).
Abstract
1. The abstract of the report should be limited to 200-300 words, typed with double line spacing.
2. A list of keywords should follow the abstract.
BODY OF THE REPORT
1. The report should be written in either British or American English, not a mix of the two.
However, given the increasing acceptance of both styles and the blurring of the distinction
between them, what matters most is that consistency be maintained throughout the text.
2. Chapters should be numbered in Arabic numerals and written as 1. INTRODUCTION,
2. LITERATURE REVIEW, etc., i.e., the chapter number followed by the chapter title. Chapter
titles should be set in 14-point bold font and centered.
CERTIFICATE
This is to certify that the thesis entitled TITLE submitted by Names (Regd. Nos.) in partial
fulfilment of the requirements for the award of the degree of Bachelor of Technology to
JNTUGV, Vizianagaram, is a record of bonafide work carried out by them under my guidance
and supervision. The results embodied in this report have not been submitted to any other
university or institute for the award of any degree.
ACKNOWLEDGEMENT
We would like to sincerely thank our Head of the Department, Dr. A. V. Ramana, for
providing all the necessary facilities that led to the successful completion of our project work.
We would like to take this opportunity to thank our beloved Principal Dr. C. L. V. R. S.
V. Prasad for providing all the necessary facilities and great support to us in completing the
project work.
We would like to thank all the faculty members and the non-teaching staff of the
Department of Electronics and Communication Engineering for their direct or indirect support
in helping us complete this project work.
Finally, we would like to thank all of our friends and family members for their
continuous help and encouragement.
ABSTRACT
Current trends in the wireless industry are based on multi-carrier transmission techniques, which
offer higher data rates and better immunity to frequency-selective fading. Wireless standards like
IEEE 802.11a/g/n, IEEE 802.16e, and many others use one or another variation of OFDM, such as
OFDMA and MIMO-OFDM. However, OFDM is handicapped by a major problem, a high
peak-to-average power ratio (PAPR), which is a trait inherent to any multi-carrier transmission system.
High PAPR causes non-linear distortion in the signal and hence results in inter-carrier
interference and out-of-band radiation. To combat the effect of high PAPR, several PAPR
reduction techniques have been devised over the last few decades. All these techniques have
Keywords:
TABLE OF CONTENTS
ACKNOWLEDGEMENT (Bold and caps) iii
ABSTRACT (Bold and caps) iv
LIST OF TABLES (Bold and caps) v
LIST OF FIGURES (Bold and caps) vi
LIST OF SYMBOLS & ABBREVIATIONS (Bold and caps) vii
1. INTRODUCTION (Bold and caps) 1
2. LITERATURE REVIEW (Bold and caps)
This chapter should cover the complete literature related to the proposed work and the
organization of the report.
3. EXPERIMENTAL STUDY (If applicable) (Bold and caps)
4. RESULTS AND DISCUSSIONS (Bold and caps)
5. CONCLUSIONS AND FUTURE SCOPE (Bold and caps)
APPENDIX 1
APPENDIX 2
REFERENCES 60
LIST OF PUBLICATIONS (in the reference format)
Add the Publication paper (Conference paper/Conference certificate/Journal paper or both)
LIST OF TABLES
LIST OF FIGURES
Introduction:
Diffusion models have revolutionized image generation, enabling the production of high-quality,
photorealistic visuals from textual descriptions. These models have demonstrated remarkable
success in text-to-image synthesis, where users provide prompts to generate diverse and realistic
images. However, when applied to image editing, these models face significant challenges due to
the inherent ambiguity of text prompts, which often fail to specify precise spatial modifications.
As a result, text-guided editing struggles with localizing changes, maintaining structural
coherence, and ensuring high-fidelity transformations, particularly in cases that demand fine-
grained adjustments.
To overcome these limitations, researchers have explored interactive image editing, which
allows users to modify images using intuitive visual inputs such as sketches, clicks, and drags.
This approach provides greater spatial control compared to text-based methods, enabling users to
directly indicate which regions of an image should be changed and how. Despite these
advantages, existing interactive editing techniques are still constrained by the fundamental
limitations of image-to-image generation models, which rely on diffusion-based text-to-image
pipelines. These methods typically require vast training datasets, employ additional reference
encoders to enforce consistency, and suffer from computational inefficiencies. Furthermore,
maintaining semantic coherence between the original and edited images is challenging,
especially when dealing with complex modifications such as object deformations, appearance
changes, and structural transformations.
In this work, we introduce a novel image editing framework that redefines interactive image
editing as an image-to-video generation problem. Our key insight is that image editing can be
framed as a temporal transition—where the source image represents the first frame and the edited
image acts as the second frame of a short video. By leveraging video diffusion priors, our
method enhances realism, preserves spatial consistency, and significantly reduces training costs.
Unlike traditional image-editing models that require large-scale paired datasets, our approach
benefits from inherent motion priors present in video data, making it more data-efficient while
ensuring high-quality transformations.
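To make this two-frame formulation concrete, the sketch below illustrates the idea under stated assumptions: `denoiser` and `encode_image` are placeholder callables standing in for a video diffusion backbone and an image encoder, and the update rule is a simplified ancestral step rather than the actual implementation.

```python
import torch

def edit_as_two_frame_video(denoiser, encode_image, source_img, edit_cond,
                            num_steps=25, sigma_max=1.0):
    """Illustrative sketch: treat (source, edited) images as a 2-frame clip.

    denoiser(latents, sigma, cond) -> predicted clean latents, shape (B, T, C, H, W)
    encode_image(img)              -> latent of shape (B, C, H, W)
    """
    z_src = encode_image(source_img)          # frame 0: fixed source latent
    z_edit = torch.randn_like(z_src)          # frame 1: starts as pure noise
    sigmas = torch.linspace(sigma_max, 0.0, num_steps + 1)

    for i in range(num_steps):
        # Stack the two frames along a time axis: (B, T=2, C, H, W).
        latents = torch.stack([z_src, z_edit], dim=1)
        # One denoising step conditioned on the user-provided edit signal.
        pred = denoiser(latents, sigmas[i], edit_cond)
        # Frame 0 stays anchored to the source; only frame 1 is re-noised and updated.
        z_edit = pred[:, 1] + sigmas[i + 1] * torch.randn_like(z_edit)

    return z_edit  # decode with the VAE decoder to obtain the edited image
```

Because the source frame is re-injected at every step, the motion priors of the video model only have to explain the transition from frame 0 to frame 1, which is what makes the formulation data-efficient.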
Built upon Stable Video Diffusion, our approach integrates a lightweight sparse control encoder,
which injects user-provided editing signals into the diffusion process while preserving key
structural features of the original image. To further enhance consistency and realism, we
introduce a novel matching attention mechanism, which establishes dense correspondences
between the source and edited images. This component mitigates artifacts, aligns key object
regions, and improves spatial consistency, addressing the limitations of traditional temporal
attention in handling large-scale deformations. By combining spatial, temporal, and cross-
attention mechanisms, the framework ensures high-fidelity edits while maintaining the natural
structure and texture of the original image.
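As a rough illustration of this idea (a sketch of one plausible design, not the exact formulation used here), the block below lets tokens of the edited frame attend densely to tokens of the source frame, so that each edited location can softly match and borrow features from the source; all module names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MatchingAttention(nn.Module):
    """Cross-frame attention sketch: edited-frame tokens query source-frame tokens."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.to_q = nn.Linear(dim, dim)       # queries from the edited frame
        self.to_kv = nn.Linear(dim, dim * 2)  # keys/values from the source frame
        self.proj = nn.Linear(dim, dim)

    def forward(self, edit_tokens, src_tokens):
        # edit_tokens, src_tokens: (B, N, dim) flattened spatial features of each frame
        B, N, _ = edit_tokens.shape
        q = self.to_q(edit_tokens)
        k, v = self.to_kv(src_tokens).chunk(2, dim=-1)

        def split_heads(x):
            return x.view(B, -1, self.num_heads, self.head_dim).transpose(1, 2)

        q, k, v = map(split_heads, (q, k, v))
        # Soft dense correspondence: each edited-frame token attends over all source tokens.
        out = F.scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).reshape(B, N, -1)
        return self.proj(out)
```

In a full model, such a block would sit alongside the usual spatial and temporal attention layers of the video diffusion backbone rather than replace them.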
Through extensive experimentation, we demonstrate that our framework achieves state-of-the-art performance
across a wide range of interactive editing tasks, including shape deformations, object
modifications, and fine-grained detail adjustments. Additionally, our method exhibits remarkable
generalization capabilities, handling out-of-domain edits such as transforming a clownfish into a
shark-like shape, modifying reflections, and generating complex structural changes with minimal
supervision. These results highlight the effectiveness, flexibility, and efficiency of our approach,
establishing a new paradigm for interactive image editing that leverages the power of video
diffusion models to redefine user-controlled image manipulation.
By framing image editing as a temporal progression rather than an isolated transformation, our
work paves the way for more natural, coherent, and efficient editing techniques, offering a
scalable solution for high-quality, realistic image modifications with minimal data requirements.
Literature Survey:
1. Sun, J., Wang, X., Shi, Y., Wang, L., Wang, J., & Liu, Y. (2022). Ide-3d: Interactive disentangled editing
for high-resolution 3d-aware portrait synthesis. ACM Transactions on Graphics (ToG), 41(6), 1-10.
The paper addresses the challenge faced by existing 3D-aware facial generation methods, which often
compromise between quality and editability. The proposed approach aims to provide both high-
resolution outputs and flexible editing capabilities.
The IDE-3D method primarily utilizes the FFHQ (Flickr-Faces-HQ) dataset for training and
evaluation. This dataset is known for its high-quality images of human faces, which helps in
achieving photorealistic results in 3D-aware portrait synthesis. The paper mentions that models are
trained with FFHQ at a resolution of 512x512, except for FENeRF, which is trained at a lower
resolution of 128x128.
The method relies on a single image to reconstruct a 3D facial volume, which is inherently an ill-
posed problem. This can lead to implausible facial geometry in some cases, indicating a limitation
in the method's ability to accurately capture complex facial structures from limited data.
2. Cheng, Y., Gan, Z., Li, Y., Liu, J., & Gao, J. (2020, October). Sequential attention GAN for interactive
image editing. In Proceedings of the 28th ACM international conference on multimedia (pp. 4383-4391).
The paper introduces Interactive Image Editing using SeqAttnGAN, enabling users to modify images
through multi-turn commands while maintaining contextual consistency and image quality.
The paper introduces two new datasets for the interactive image editing task: Zap-Seq, which contains
8,734 image sequences derived from 50,025 shoe images, and DeepFashion-Seq, which includes
4,820 sequences from around 290,000 clothing images, both paired with natural language
descriptions of the differences between consecutive images.
A noted limitation is the model's potential difficulty in handling complex modifications that require a
deeper understanding of context and semantics, as well as the need for further exploration of how well
SeqAttnGAN generalizes to diverse image editing tasks beyond the fashion domain.
3. Jiang, Y., Huang, Z., Pan, X., Loy, C. C., & Liu, Z. (2021). Talk-to-edit: Fine-grained facial editing via
dialog. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 13799-13808).
The objective of this paper is to present "Talk-to-Edit," an interactive facial editing system that
enables users to modify facial attributes in images through natural language requests while
preserving identity and enhancing editing realism.
The dataset used in the Talk-to-Edit system is a large-scale visual-language facial attribute dataset
named CelebA-Dialog, which is designed to support fine-grained and language-driven facial editing
tasks.
A limitation is its reliance on user input, which may lead to challenges in accurately interpreting
ambiguous or vague language requests, potentially resulting in unsatisfactory editing outcomes.
4. Liu, Z., Yu, Y., Ouyang, H., Wang, Q., Cheng, K. L., Wang, W., ... & Shen, Y. (2024). Magicquill: An
intelligent interactive image editing system. arXiv preprint arXiv:2411.09703.
The paper presents and evaluates MagicQuill, an advanced image editing system that utilizes
diffusion models and AI to enhance user experience and precision in fine-grained image
manipulation.
The dataset used in the "Talk-to-Edit" system is a large-scale visual-language facial attribute dataset
named CelebA-Dialog, which is designed to support fine-grained and language-driven facial
editing tasks.
A remaining gap lies in expanding editing capabilities, such as the lack of reference-based editing
and insufficient support for typography manipulation within images.
5. Ling, H., Kreis, K., Li, D., Kim, S. W., Torralba, A., & Fidler, S. (2021). Editgan: High-precision semantic
image editing. Advances in Neural Information Processing Systems, 34, 16331-16345.
The primary objective of the paper is to propose EditGAN, a novel GAN-based image editing
framework that enables high-precision semantic image editing. It allows users to modify detailed
object part segmentations with minimal labeled examples, making it scalable for various object
classes and part labels
EditGAN requires only a handful of labeled examples for training, making it a scalable tool for high-
quality, high-precision semantic image editing. It builds on a GAN framework that jointly models
images and their semantic segmentations, allowing users to edit images by modifying detailed part
segmentation masks.
EditGAN still faces challenges with certain complex edits that require more extensive optimization,
indicating a gap in efficiency for specific use cases.
6. Brooks, T., Holynski, A., & Efros, A. A. (2023). Instructpix2pix: Learning to follow image editing
instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition (pp. 18392-18402).
The primary objective of the paper is to develop a model that can perform image edits based on
human-written instructions without requiring full descriptions of the input or output images. The
model aims to generate edited images directly in the forward pass, enhancing the efficiency of
image editing tasks.
The dataset used in the study consists of over 450,000 training examples generated through a two-part
method: first, using a finetuned GPT-3 to create instructions and edited captions, and second,
employing StableDiffusion in combination with Prompt-to-Prompt to generate pairs of images from
those captions.
The paper discusses the potential for incorporating human feedback, such as reinforcement learning,
to improve alignment between the model's outputs and human intentions, indicating a gap in
current capabilities.
7. Ceylan, D., Huang, C. H. P., & Mitra, N. J. (2023). Pix2video: Video editing using image diffusion.
In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 23206-23217).
The primary objective of the paper "Pix2Video, Video Editing using Image Diffusion" is to explore
the feasibility of editing video clips using a pre-trained image diffusion model guided by text
instructions, without requiring additional training.
The dataset used for evaluation in the study is obtained from the DAVIS dataset, which is commonly
referenced in video object segmentation tasks.
The paper acknowledges that there is still room for improvement in terms of temporal coherency and
suggests exploring additional energy terms, such as patch-based similarity and CLIP similarity,
during the latent update stage.
8. Wang, S., Saharia, C., Montgomery, C., Pont-Tuset, J., Noy, S., Pellegrini, S., ... & Chan, W. (2023).
Imagen editor and editbench: Advancing and evaluating text-guided image inpainting. In Proceedings of
the IEEE/CVF conference on computer vision and pattern recognition (pp. 18359-18369).
The primary objective of the paper is to present the Imagen Editor, a model designed for text-guided
image inpainting, which allows users to make localized edits to images based on user-defined
masks and text prompts.
The dataset proposed for evaluating text-guided image inpainting is called EditBench. It consists of
three components for each example: a masked input image, an input text prompt, and a high-quality
output image for reference. EditBench captures a wide variety of language, types of images, and
levels of difficulty, with prompts categorized along attributes, objects, and scenes.
The paper identifies gaps in the model's performance with abstract attributes and complex prompts,
suggesting that future work should focus on improving these areas.
9. Alaluf, Y., Tov, O., Mokady, R., Gal, R., & Bermano, A. (2022). Hyperstyle: Stylegan inversion with
hypernetworks for real image editing. In Proceedings of the IEEE/CVF conference on computer Vision
and pattern recognition (pp. 18511-18521).
The primary objective of the paper is to present HyperStyle, a method for image inversion that
achieves high-quality reconstructions and editability in latent space while being computationally
efficient compared to traditional optimization techniques.
The datasets used in the experiments include FFHQ for training and the CelebA-HQ test set for
quantitative evaluations in the human facial domain. For the cars domain, the Stanford Cars dataset
is utilized.
The paper notes challenges in comparing editability across different inversion methods due to varying
editing strengths, which could introduce bias.
Further research is needed to enhance robustness to diverse input conditions, particularly for images
outside the training domain.
10. Meng, C., He, Y., Song, Y., Song, J., Wu, J., Zhu, J. Y., & Ermon, S. (2021). Sdedit: Guided image
synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073.
The primary objective of the paper is to introduce SDEdit, a framework for guided image synthesis
and editing that balances realism and faithfulness to user inputs, enabling the generation of photo-
realistic images from various levels of detail without the need for extensive data collection or
model retraining.
The dataset used in the experiments includes ImageNet, LSUN (cat and horse), CelebA-HQ, and
FFHQ. These datasets are utilized for stroke-based image synthesis and image compositing tasks
with SDEdit.
The paper primarily focuses on specific datasets (e.g., LSUN and CelebA-HQ) and may not fully
address the performance of SDEdit across a broader range of image types and editing tasks.
While SDEdit shows significant improvements, further exploration is needed to understand its
limitations in real-world applications and its adaptability to various user inputs and styles.
11. Hertz, A., Mokady, R., Tenenbaum, J., Aberman, K., Pritch, Y., & Cohen-Or, D. (2022). Prompt-to-
prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626.
The primary goal of the paper is to develop a prompt-to-prompt image editing framework that allows
users to modify images using only textual prompts, without the need for spatial masks. This aims to
preserve the original image's structure and content while enabling intuitive editing.
The authors acknowledge that the challenge of inversion for text-guided diffusion models is an area
for future research, indicating a gap in the current understanding and implementation. There is also a
suggestion to incorporate cross-attention in higher-resolution layers to improve localized editing,
which remains unaddressed in the current work.
12. Chen, X., Zhao, Z., Zhang, Y., Duan, M., Qi, D., & Zhao, H. (2022). Focalclick: Towards practical
interactive image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition (pp. 1300-1309).
The primary goal of FocalClick is to develop a practical interactive image segmentation method that
efficiently produces fine masks with quick responses, particularly on low-power devices. It
addresses the gap between academic approaches and industrial needs by improving both efficiency
and accuracy in mask annotation.
The dataset used in the study is based on the DAVIS dataset, specifically a new benchmark called
DAVIS-585. This dataset was created by annotating each object or accessory separately and
filtering out masks under 300 pixels, resulting in 585 test samples. The authors also simulated
defects on ground truth masks using super-pixels to generate flawed initial masks for their
experiments.
Although FocalClick improves efficiency, there remains a challenge in maintaining accuracy,
especially when reducing input sizes for faster processing.
Need for larger datasets: the performance gap compared to SOTA methods highlights the need for
larger and more diverse training datasets to fully leverage FocalClick's capabilities.
13. Hyunsu Kim, Yunjey Choi, Junho Kim, Sungjoo Yoo, Youngjung Uh; Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 852-861
The primary objective of the paper titled "Exploiting Spatial Dimensions of Latent in GAN for
Real-time Image Editing" is to enhance the capabilities of Generative Adversarial Networks
(GANs) for real-time image editing. The authors introduce a novel approach called
StyleMapGAN, which aims to address several limitations associated with traditional GANs.
The paper presents several notable advantages that enhance the capabilities of image editing
using GANs: real-time image editing, improved fidelity and accuracy, and high-quality output.
The paper has gaps such as low fidelity in encoder projections, limited exploration of spatial
dimensions, and narrow performance comparison. It also lacks real-world adaptability insights
and a clear roadmap for integrating its methods into other architectures.
14. Omer Bar-Tal, Dolev Ofri-Amar, Rafail Fridman, Yoni Kasten, and Tali Dekel. Text2LIVE:
Text-driven layered image and video editing. arXiv preprint, arXiv:2204.02491, 2022.
The main goal of DIFFEDIT is to enable semantic image editing by automatically identifying
regions of an image that need to be edited based on a text query, enhancing the editing process
without requiring user-generated masks.
DIFFEDIT leverages a diffusion model to produce more natural and subtle edits by integrating
the edited regions into the background effectively, outperforming previous methods.
The paper identifies a gap in existing methods, which require user-generated masks; DIFFEDIT
addresses this by automatically generating masks, but it still faces challenges in ensuring the text
query aligns well with the image content.
15. Omri Avrahami, Dani Lischinski, Ohad Fried; Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR), 2022, pp. 18208-18218
The paper aims to introduce a novel solution for performing local edits in natural images
using natural language descriptions and region-of-interest (ROI) masks. This is achieved by
combining CLIP with a DDPM to generate realistic image edits based on user prompts.
The paper offers advantages such as an intuitive interface, making it easier for users to specify
desired changes, and high realism: the method outperforms previous solutions in terms of overall
realism.
Improving the ranking system to consider the entire image context could enhance results. Future
research could extend the method to 3D or videos and train CLIP to be noise-agnostic for
better robustness.
16. Jacob Austin, Daniel Johnson, Jonathan Ho, Daniel Tarlow, and Rianne van den Berg. Structured
denoising diffusion models in discrete state-spaces. In NeurIPS, volume 34, 2021.
The paper aims to develop a framework for semantic image synthesis that generates
photorealistic images from semantic layouts, addressing the limitations of GAN-based
methods in handling complex scenes.
The framework outperforms previous methods in generating high-fidelity, diverse images,
achieving state-of-the-art results on benchmark datasets. It improves image quality and balances
the trade-off between quality and diversity.
The paper highlights a gap in GAN-based methods' inability to generate high-fidelity and diverse
results for complex scenes, which the proposed framework addresses by using diffusion
models instead of adversarial learning.
17. Nguyen, T., Ojha, U., Li, Y., Liu, H., & Lee, Y. J. (2024). Edit One for All: Interactive Batch Image
Editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition (pp. 8271-8280).
The objective of the research is to develop a method for interactive batch image editing that
allows users to apply a specified edit from one image to a large batch of images while
maintaining a consistent final state across all edited images.
The method for interactive batch image editing minimizes human supervision in the editing
process, allowing for significant time savings.
The research identifies a gap in the existing methods that primarily focus on single image
editing, leaving batch image editing underexplored.
18. Liang, Y., Gan, Y., Chen, M., Gutierrez, D., & Muñoz, A. (2019, October). Generic interactive
pixel‐level image editing. In Computer Graphics Forum (Vol. 38, No. 7, pp. 23-34).
The objective of the research is to develop a generic interactive pixel-level image editing
paradigm that generates continuous additional per-pixel values from user inputs, specifically
RGB color values and user scribbles.
The paradigm allows for interactive refinement of image edits, enhancing user control and
satisfaction
It produces results that are on-par with state-of-the-art methods across various applications such
as depth of field blurring and dehazing.
The research paper identifies gaps in previous superpixel-based image editing methods,
particularly regarding the propagation of additional values, which often results in artificial
discontinuities at superpixel boundaries.
19. Shi, Y., Xue, C., Liew, J. H., Pan, J., Yan, H., Zhang, W., ... & Bai, S. (2024). Dragdiffusion:
Harnessing diffusion models for interactive point-based image editing. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 8839-8849).
The objective of the research is to introduce DRAGDIFFUSION, a novel method that extends
interactive point-based editing to large-scale diffusion models. This method aims to enhance
the applicability of interactive editing by allowing users to manipulate images at a fine-grained
level through point-based interactions.
DRAGDIFFUSION enhances the applicability of interactive point-based editing by utilizing
large-scale diffusion models, which improves generalizability compared to previous methods
reliant on GANs
20. Shin, J., Choi, D., & Park, J. (2024, December). InstantDrag: Improving Interactivity in Drag-based
Image Editing. In SIGGRAPH Asia 2024 Conference Papers (pp. 1-10).
The objective of the research paper is to improve interactivity in drag-based image editing
through the development of a dedicated pipeline called InstantDrag.
InstantDrag excels at preserving consistency, particularly high-frequency features, even without
the use of a mask.
The method generates plausible images with realistic motions, enhancing the quality of the edits.
The model occasionally struggles with preserving identity or creating accurate motions for non-
facial scenes without fine-tuning, as it has been primarily trained on facial videos. This
indicates a gap in generalizability across diverse motion types and scenes.
21. Shinagawa, S., Yoshino, K., Alavi, S. H., Georgila, K., Traum, D., Sakti, S., & Nakamura, S. (2020). An
Interactive Image Editing System Using an Uncertainty-Based Confirmation Strategy. IEEE Access, 8,
98471-98480.
The objective of this paper is to develop an interactive image editing framework using a modified
Deep Convolutional Generative Adversarial Network (DCGAN) with a Source Image Masking
module and an entropy-based confirmation strategy to enhance user control, dialogue efficiency,
and image quality in response to natural language editing requests.
The dataset used in the paper is the Avatar Image Manipulation with an Instruction dataset, which
consists of 22 types of editing tasks, such as changing a beard, eyebrows, and hair. The dataset is
structured as triplets of {source image, target image, instruction (editing request)} and was split
into training, validation, and test sets in the ratio of 4:1:1, totaling 230 samples for validation and
testing each.
It may struggle with ambiguous natural language requests, which can hinder the image editing
process and limit the effectiveness of the masking mechanism, potentially restricting the range
of changes that can be made to the images
22. Lin, J., Zhang, R., Ganz, F., Han, S., & Zhu, J. Y. (2021). Anycost gans for interactive image synthesis and
editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 14986-
14996).
The primary objective of the paper is to propose "Anycost" GANs, which are designed for interactive
image synthesis and editing. The goal is to create a generator that can operate at various computational
costs while maintaining visually consistent outputs. This allows for quick previews during editing and
high-quality final outputs when needed.
The paper does not provide quantitative results for latent space-based editing, which could limit the
understanding of the model's performance in practical applications.
Spatially-Varying Trade-offs: There is a gap in the model's ability to support spatially-varying trade-
offs between fidelity and latency, which could enhance its adaptability to different editing scenarios.
23. Cui, X., Li, Z., Li, P., Hu, Y., Shi, H., & He, Z. (2023). Chatedit: Towards multi-turn interactive facial
image editing via dialogue. arXiv preprint arXiv:2303.11108.
The primary objective is to develop a multi-turn interactive facial image editing system via
dialogue. It introduces the ChatEdit benchmark dataset, which facilitates research in this field.
The ChatEdit dataset comprises 12,000 examples, each consisting of a facial image, a
corresponding caption, and an annotated multi-turn dialogue. The dataset is divided into training
(10,000 examples), validation (1,000 examples), and testing sets (1,000 examples). It includes
approximately 96,174 utterances across these dialogues, with an average of 4 turns per dialogue and
8 utterances per dialogue. The dataset is constructed using images from the CelebA-HQ dataset,
focusing on 21 editable facial attributes.
24. Ivan Anokhin, Kirill V. Demochkin, Taras Khakhulin, Gleb Sterkin, Victor S. Lempitsky, and Denis
Korzhenkov. Image generators with conditionally-independent pixel synthesis. 2021 IEEE/CVF Conference
on Computer Vision and Pattern Recognition (CVPR), pages 14273–14282, 2021.
The paper presents INVE (Interactive Neural Video Editing), a real-time video editing solution that
propagates edits made on a single frame across the entire video, improving speed and editability
compared to previous methods like the Layered Neural Atlas (LNA)
INVE speeds up training and inference, being 5 times faster than existing methods, and allows
users to make edits on one frame that automatically apply to the entire video. It simplifies video
editing for novices by reducing the need for frame-by-frame adjustments
The paper identifies gaps in INVE's support for certain editing use cases, like direct frame editing
and rigid texture tracking, which LNA also struggled with. It also notes that despite improved
speed, the mapping process may still limit fully intuitive editing
25. Mirabet-Herranz, N. (2024). Advancing Beyond People Recognition in Facial Image Processing (Doctoral
dissertation, Sorbonne Université).
The objective of the research is to develop a method for generating associative skeleton guidance maps
that facilitate human-object interaction in image editing. This involves creating an object-interactive
skeleton that can be synthesized naturally with a human figure interacting with an object.
The framework demonstrates superior performance in image editing tasks compared to existing
models, as indicated by quantitative results on metrics such as FID, KID, and CS.
The research identifies a critical limitation in existing image generation models, particularly their
inability to autonomously generate additional condition maps necessary for accurately rendering
human figures, which often requires manual input of supplementary details.
Table: Summary of the literature survey. Each entry lists the title, year, objective, limitations, advantages, performance metrics, and gaps.

1. Ide-3D: Interactive disentangled editing for high-resolution 3D-aware portrait synthesis (2022)
Objective: The paper addresses the challenge faced by existing 3D-aware facial generation methods, which often compromise between quality and editability. The proposed approach aims to provide both high-resolution outputs and flexible editing capabilities.
Limitations: The limitations include potential distortions in facial shapes, challenges in maintaining consistency across different poses, and the requirement for standard front-facing images for accurate editing.
Advantages: It allows users to perform interactive global and local editing on facial features while maintaining view consistency, enabling adjustments to elements like glasses, hair, and expressions.
Performance Metrics: -
Gaps: The method relies on a single image to reconstruct a 3D facial volume, which is inherently an ill-posed problem. This can lead to implausible facial geometry in some cases, indicating a limitation in the method's ability to accurately capture complex facial structures from limited data.

2. Sequential Attention GAN for Interactive Image Editing (2020)
Objective: The paper introduces interactive image editing using SeqAttnGAN, enabling users to modify images through multi-turn commands while maintaining contextual consistency and image quality.
Limitations: While SeqAttnGAN performs well, it may still struggle with complex modifications that require a deeper understanding of context and semantics beyond the provided textual commands.
Advantages: It effectively enables interactive image editing through multi-turn commands, ensuring high visual quality and contextual consistency in generated images.
Performance Metrics: Zap-Seq: IS 9.58, FID 50.31, SSIM 0.651
Gaps: The model's potential difficulty in handling complex modifications that require a deeper understanding of context and semantics, as well as the need for further exploration of how well SeqAttnGAN generalizes to diverse image editing tasks beyond the fashion domain.

3. Talk-to-Edit: Fine-Grained Facial Editing via Dialog (2021)
Objective: The objective of this paper is to present "Talk-to-Edit," an interactive facial editing system that enables users to modify facial attributes in images through natural language requests while preserving identity and enhancing editing realism.
Limitations: The paper does not address the potential limitations in handling highly complex or ambiguous user requests, which may affect the system's ability to provide satisfactory edits in all scenarios.
Advantages: Its ability to facilitate fine-grained facial attribute manipulation through natural language interactions, allowing for a more intuitive and flexible editing experience compared to traditional methods that require fixed control patterns.
Performance Metrics: Bangs: IS 0.6047, AS 0.3660; Eyeglasses: IS 0.6229, AS 0.7720; Beard: IS 0.8324, AS 0.6891; Smiling: IS 0.6434, AS 0.5028
Gaps: Its reliance on user input, which may lead to challenges in accurately interpreting ambiguous or vague language requests, potentially resulting in unsatisfactory editing outcomes.

4. MagicQuill: An Intelligent Interactive Image Editing System (2024)
Objective: The paper presents and evaluates MagicQuill, an advanced image editing system that utilizes diffusion models and AI to enhance user experience and precision in fine-grained image manipulation.
Limitations: Although MagicQuill significantly improves image editing efficiency and precision, it still faces limitations in expanding editing capabilities, such as incorporating reference-based editing and enhancing typography support for textual elements.
Advantages: Its ability to enhance user experience and precision in image editing through an intuitive interface and real-time prediction of user intentions using a multimodal large language model.
Performance Metrics: Overall satisfaction: 80%
Gaps: Its current limitations in expanding editing capabilities, such as the lack of reference-based editing and insufficient support for typography manipulation within images.

5. EditGAN: High-Precision Semantic Image Editing (2021)
Objective: The primary objective of the paper is to propose EditGAN, a novel GAN-based image editing framework that enables high-precision semantic image editing. It allows users to modify detailed object part segmentations with minimal labeled examples, making it scalable for various object classes and part labels.
Limitations: EditGAN, like other GAN-based methods, is limited to images that can be effectively modeled by the GAN. This poses challenges when applying it to complex scenes, such as vivid cityscapes.
Advantages: EditGAN requires significantly less annotated training data compared to other methods, needing as few as 16 labeled examples.
Performance Metrics: Measured using a pretrained ArcFace feature extraction network to ensure that the subject's identity remains intact after editing.
Gaps: Despite its advantages, EditGAN still faces challenges with certain complex edits that require more extensive optimization, indicating a gap in efficiency for specific use cases.

6. InstructPix2Pix: Learning to Follow Image Editing Instructions (2023)
Objective: The primary objective of the paper is to develop a model that can perform image edits based on human-written instructions without requiring full descriptions of the input or output images. The model aims to generate edited images directly in the forward pass, enhancing the efficiency of image editing tasks.
Limitations: The performance of the model is limited by the visual quality of the generated dataset and the diffusion model used, which in this case is Stable Diffusion.
Advantages: The model allows for intuitive image editing by following diverse human instructions, such as changing styles, replacing objects, or altering settings.
Performance Metrics: The metrics used in the paper include the degree to which the altered image matches the original image, known as image consistency.
Gaps: The paper discusses the potential for incorporating human feedback, such as reinforcement learning, to improve alignment between the model's outputs and human intentions, indicating a gap in current capabilities.

7. Pix2Video: Video Editing using Image Diffusion (2023)
Objective: The primary objective of the paper is to explore the feasibility of editing video clips using a pre-trained image diffusion model guided by text instructions, without requiring additional training.
Limitations: The key limitation identified in the paper is the challenge of maintaining temporal coherency, especially when the distance from the anchor frame increases in longer videos, which can lead to quality degradation.
Advantages: The approach is training-free, allowing for generalization to a wide range of edits without the need for video-specific fine-tuning or extensive pre-processing.
Performance Metrics: The paper evaluates the method using metrics such as CLIP similarity between image embeddings of consecutive frames (CLIP-Image) and the mean-squared pixel error between warped frames.
Gaps: The paper acknowledges that there is still room for improvement in terms of temporal coherency and suggests exploring additional energy terms, such as patch-based similarity and CLIP similarity, during the latent update stage.

8. Imagen Editor and EditBench: Advancing and Evaluating Text-Guided Image Inpainting (2023)
Objective: The primary objective of the paper is to present the Imagen Editor, a model designed for text-guided image inpainting, which allows users to make localized edits to images based on user-defined masks and text prompts.
Limitations: The performance of the Imagen Editor drops significantly with complex prompts (Mask-Rich), indicating that it excels mainly in simpler scenarios.
Advantages: The Imagen Editor is preferred by human annotators over other models like SD and DL2, with preference rates of 78% and 77%.
Performance Metrics: A ranking-based approach that measures how well the generated image retrieves the text prompt among distractors.
Gaps: The paper identifies gaps in the model's performance with abstract attributes and complex prompts, suggesting that future work should focus on improving these areas.

9. HyperStyle: StyleGAN Inversion with HyperNetworks for Real Image Editing (2022)
Objective: The primary objective of the paper is to present HyperStyle, a method for image inversion that achieves high-quality reconstructions and editability in latent space while being computationally efficient compared to traditional optimization techniques.
Limitations: While HyperStyle generalizes well, there is still a need for improvement in handling unaligned images and unstructured domains. Although it performs well, there may still be cases where identity preservation is not perfect compared to some optimization methods.
Advantages: HyperStyle operates nearly 200 times faster than StyleGAN2 optimization, making it practical for real-time applications. It provides highly editable latent codes, allowing for meaningful modifications while preserving identity, and achieves visually comparable results to optimization techniques at significantly lower computational cost.
Performance Metrics: Identity preservation measured with the CurricularFace method, assessing how well the original identity is preserved during edits; evaluation through trait-specific classifiers to determine the extent of modifications supported by the latent codes [8]; qualitative and quantitative comparison against state-of-the-art methods using metrics like LPIPS and L2 loss.
Gaps: The paper notes challenges in comparing editability across different inversion methods due to varying editing strengths, which could introduce bias. Further research is needed to enhance robustness to diverse input conditions, particularly for images outside the training domain.

11. Prompt-to-Prompt Image Editing with Cross Attention Control (2022)
Objective: The primary goal of the paper is to develop a prompt-to-prompt image editing framework that allows users to modify images using only textual prompts, without the need for spatial masks. This aims to preserve the original image's structure and content while enabling intuitive editing.
Limitations: The current inversion process can lead to visible distortions in some test images, which affects the quality of the output. The attention maps used in the model are of low resolution, limiting the precision of localized editing.
Advantages: The framework allows for text-driven editing, making it easier for users to express their intent without needing to provide detailed masks. The method retains the spatial layout and semantics of the original image when making edits, which is a significant improvement over traditional methods that often lead to complete alterations.
Performance Metrics: The paper evaluates the performance based on the fidelity of the generated images to the original prompts and the ability to preserve the original composition while making edits. The effectiveness of the attention injection across different diffusion steps is also a key metric.
Gaps: The authors acknowledge that the challenge of inversion for text-guided diffusion models is an area for future research, indicating a gap in the current understanding and implementation. There is a suggestion to incorporate cross-attention in higher-resolution layers to improve localized editing, which remains unaddressed in the current work.

13. Exploiting Spatial Dimensions of Latent in GAN for Real-time Image Editing (2021)
Objective: The primary objective of the paper is to enhance the capabilities of Generative Adversarial Networks (GANs) for real-time image editing. The authors introduce a novel approach called StyleMapGAN, which aims to address several limitations associated with traditional GANs.
Limitations: The paper presents significant advancements in real-time image editing using GANs but has limitations such as dependency on segmentation masks, potential artifacts, and high requirements.
Advantages: The paper presents several notable advantages that enhance the capabilities of image editing using GANs: real-time image editing, improved fidelity and accuracy, and high-quality output.
Performance Metrics: The paper evaluates StyleMapGAN using metrics like FID (image quality and diversity), MSE (pixel-level accuracy), and MSE src/ref (local editing accuracy). These cover both pixel-level and perceptual quality assessment.
Gaps: The paper has gaps like low fidelity in encoder projections, limited exploration of spatial dimensions, and narrow performance comparison. It also lacks real-world adaptability insights and a clear roadmap for integrating its methods into other architectures.

14. DIFFEDIT: Diffusion-based Semantic Image Editing with Mask Guidance (2022)
Objective: The main goal of DIFFEDIT is to enable semantic image editing by automatically identifying regions of an image that need to be edited based on a text query, enhancing the editing process without requiring user-generated masks.
Limitations: One limitation noted is that the alignment between the text query and the image caption is often not perfect, which can affect the quality of the edit.
Advantages: DIFFEDIT leverages a diffusion model to produce more natural and subtle edits by integrating the edited regions into the background effectively, outperforming previous methods.
Performance Metrics: The performance of DIFFEDIT is evaluated using metrics such as LPIPS and CSFID, demonstrating its effectiveness on datasets like ImageNet and COCO.
Gaps: The paper identifies a gap in existing methods, which require user-generated masks; DIFFEDIT addresses this by automatically generating masks, but it still faces challenges in ensuring the text query aligns well with the image content.

15. Blended Diffusion for Text-driven Editing of Natural Images (2022)
Objective: The paper aims to introduce a novel solution for performing local edits in natural images using natural language descriptions and region-of-interest (ROI) masks. This is achieved by combining CLIP with a DDPM to generate realistic image edits based on user prompts.
Limitations: The model has a high inference time (30 s per image) and a ranking system that overlooks overall image quality. It also inherits biases from CLIP and is unsuitable for real-time or low-power devices, requiring faster diffusion sampling.
Advantages: The paper offers an intuitive interface, making it easier for users to specify desired changes, and high realism: the method outperforms previous solutions in terms of overall realism while maintaining background integrity.
Performance Metrics: The paper compares the proposed method against several baselines both qualitatively and quantitatively, demonstrating its superior performance in generating realistic images.
Gaps: Improving the ranking system to consider the entire image context could enhance results. Future research could extend the method to 3D or videos and train CLIP to be noise-agnostic for better robustness.

16. Semantic Image Synthesis via Diffusion Models (2023)
Objective: The paper aims to develop a framework for semantic image synthesis that generates photorealistic images from semantic layouts, addressing the limitations of GAN-based methods in handling complex scenes.
Limitations: Despite improvements, the framework still struggles with high-fidelity generation in complex scenes, much like GANs. Performance metrics like mIoU are resolution-sensitive, impacting the evaluation of semantic interpretability.
Advantages: The framework outperforms previous methods in generating high-fidelity, diverse images, achieving state-of-the-art results on benchmark datasets. It improves image quality and balances the trade-off between quality and diversity.
Performance Metrics: The paper evaluates its framework using FID for image quality, LPIPS for diversity, and mIoU for semantic interpretability, ensuring a comprehensive performance assessment.
Gaps: The paper highlights a gap in GAN-based methods' inability to generate high-fidelity and diverse results for complex scenes, which the proposed framework addresses by using diffusion models instead of adversarial learning.

21. An Interactive Image Editing System Using an Uncertainty-Based Confirmation Strategy (2020)
Objective: The objective of this paper is to develop an interactive image editing framework using a modified Deep Convolutional Generative Adversarial Network (DCGAN) with a Source Image Masking module and an entropy-based confirmation strategy to enhance user control, dialogue efficiency, and image quality in response to natural language editing requests.
Limitations: The potential for ambiguity in natural language requests, which can complicate the image editing process, and the trade-off between image quality and the constraints imposed by the masking mechanism, which may restrict the extent of changes that can be made to the images.
Advantages: Its innovative use of an entropy-based confirmation strategy within an interactive image editing framework, which enhances user control and reduces redundant dialogues while maintaining high image quality in response to natural language editing requests.
Performance Metrics: SSIM, with p-value < 0.001
Gaps: It may struggle with ambiguous natural language requests, which can hinder the image editing process and limit the effectiveness of the masking mechanism, potentially restricting the range of changes that can be made to the images.

22. Anycost GANs for Interactive Image Synthesis and Editing (2021)
Objective: The primary objective of the paper is to propose "Anycost" GANs, which are designed for interactive image synthesis and editing. The goal is to create a generator that can operate at various computational costs while maintaining visually consistent outputs. This allows for quick previews during editing and high-quality final outputs when needed.
Limitations: The control over channel numbers may be challenging for users unfamiliar with neural networks; future improvements aim to provide more intuitive controls. The current model approximates every output pixel equally, which may not prioritize important objects (like faces) over less critical background elements and could lead to suboptimal results in certain scenarios.
Advantages: Anycost GANs can be executed at different computational budgets, allowing users to choose between quality and efficiency. This is achieved by using subsets of weights from the full generator without requiring fine-tuning. The model supports multi-resolution outputs and adaptive-channel inference, which enhances its usability across different hardware configurations.
Performance Metrics: The paper discusses the use of reconstruction loss to evaluate the performance of the full generator and its sub-generators. It also mentions measuring the average reconstruction performance across different architectures found through an evolutionary algorithm. The LPIPS loss is used for comparing image quality, indicating a focus on perceptual similarity in outputs.
Gaps: The paper does not provide quantitative results for latent space-based editing, which could limit the understanding of the model's performance in practical applications. There is also a gap in the model's ability to support spatially-varying trade-offs between fidelity and latency, which could enhance its adaptability to different editing scenarios.
RESULTS:
Samples
Baseline comparison (partial row): MasaCtrl + ControlNet | guidance: Sketch | mask: None | 17.933 | 0.302 | 0.655
1. Visual Consistency (%): Measures how well the edited images maintain structural and spatial
consistency with the original content. Higher percentages indicate better preservation of the image
layout and coherence after editing.
2. Edit Accuracy (%): Evaluates how accurately the edits reflect the user-provided editing signals
(e.g., sketches or coarse edits). Higher scores demonstrate better alignment between the intended
and generated modifications.
3. Image Quality (%): Assesses the overall perceptual quality of the edited images, considering
aspects like sharpness, realism, and absence of artifacts. Higher percentages signify superior
visual fidelity and fewer distortions.
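As one possible way to operationalize these percentages (an assumption on our part, since the exact evaluation protocol is not spelled out here), the sketch below scores visual consistency with SSIM between the original and edited images, and edit accuracy as pixel agreement with the intended result inside a hypothetical binary `edit_mask` of the edited region.

```python
import numpy as np
from skimage.metrics import structural_similarity

def report_metrics(original, edited, target, edit_mask):
    """Rough percentage proxies; not the exact evaluation protocol.

    original, edited, target: float RGB images in [0, 1], shape (H, W, 3)
    edit_mask: boolean array of shape (H, W), True where the user requested changes
    """
    # Visual consistency: structural similarity between original and edited image.
    visual_consistency = 100.0 * structural_similarity(
        original, edited, channel_axis=-1, data_range=1.0)

    # Edit accuracy: agreement with the intended result inside the edited region.
    err = np.abs(edited - target).mean(axis=-1)
    edit_accuracy = 100.0 * (1.0 - err[edit_mask].mean())

    return visual_consistency, edit_accuracy
```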
Table 2
This table presents a comparison of training and testing accuracy for three different image segmentation
methods: GAN-based, U-Net-based, and our proposed Spatio-Temporal UNet-Editor
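For reference, the accuracy compared in this table is, in the usual sense for segmentation, the fraction of correctly classified pixels; a generic way to compute it (not the project's actual evaluation code) is sketched below.

```python
import torch

def pixel_accuracy(pred_logits, target_masks):
    """Percentage of pixels whose predicted class matches the ground-truth mask.

    pred_logits:  (B, num_classes, H, W) raw network outputs
    target_masks: (B, H, W) integer class labels
    """
    pred_classes = pred_logits.argmax(dim=1)          # (B, H, W)
    correct = (pred_classes == target_masks).float()
    return 100.0 * correct.mean().item()              # accuracy in percent
```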
Table 3
CONCLUSION:
REFERENCES:
[1] Sun, J., Wang, X., Shi, Y., Wang, L., Wang, J., & Liu, Y. (2022). Ide-3d: Interactive disentangled
editing for high-resolution 3d-aware portrait synthesis. ACM Transactions on Graphics (ToG), 41(6), 1-
10.
[2] Cheng, Y., Gan, Z., Li, Y., Liu, J., & Gao, J. (2020, October). Sequential attention GAN for
interactive image editing. In Proceedings of the 28th ACM international conference on multimedia (pp.
4383-4391).
[3] Jiang, Y., Huang, Z., Pan, X., Loy, C. C., & Liu, Z. (2021). Talk-to-edit: Fine-grained facial editing
via dialog. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 13799-
13808).
[4] Liu, Z., Yu, Y., Ouyang, H., Wang, Q., Cheng, K. L., Wang, W., ... & Shen, Y. (2024). Magicquill:
An intelligent interactive image editing system. arXiv preprint arXiv:2411.09703.
[5] Ling, H., Kreis, K., Li, D., Kim, S. W., Torralba, A., & Fidler, S. (2021). Editgan: High-precision
semantic image editing. Advances in Neural Information Processing Systems, 34, 16331-16345.
[6] Brooks, T., Holynski, A., & Efros, A. A. (2023). Instructpix2pix: Learning to follow image editing
instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
(pp. 18392-18402).
[7] Ceylan, D., Huang, C. H. P., & Mitra, N. J. (2023). Pix2video: Video editing using image diffusion.
In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 23206-23217).
[8] Wang, S., Saharia, C., Montgomery, C., Pont-Tuset, J., Noy, S., Pellegrini, S., ... & Chan, W.
(2023). Imagen editor and editbench: Advancing and evaluating text-guided image inpainting. In
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 18359-
18369).
[9] Alaluf, Y., Tov, O., Mokady, R., Gal, R., & Bermano, A. (2022). Hyperstyle: Stylegan inversion
with hypernetworks for real image editing. In Proceedings of the IEEE/CVF conference on computer
Vision and pattern recognition (pp. 18511-18521).
[10] Meng, C., He, Y., Song, Y., Song, J., Wu, J., Zhu, J. Y., & Ermon, S. (2021). Sdedit: Guided image
synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073.
[11] Hertz, A., Mokady, R., Tenenbaum, J., Aberman, K., Pritch, Y., & Cohen-Or, D. (2022). Prompt-
to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626
[12] Chen, X., Zhao, Z., Zhang, Y., Duan, M., Qi, D., & Zhao, H. (2022). Focalclick: Towards practical
interactive image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition (pp. 1300-1309).
[13] Kim, H., Choi, Y., Kim, J., Yoo, S., & Uh, Y. (2021). Exploiting spatial dimensions of latent in
GAN for real-time image editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition (pp. 852-861).
[14] Bar-Tal, O., Ofri-Amar, D., Fridman, R., Kasten, Y., & Dekel, T. (2022). Text2LIVE: Text-driven
layered image and video editing. arXiv preprint arXiv:2204.02491.
[15] Avrahami, O., Lischinski, D., & Fried, O. (2022). Blended diffusion for text-driven editing of
natural images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition (pp. 18208-18218).
[16] Austin, J., Johnson, D., Ho, J., Tarlow, D., & van den Berg, R. (2021). Structured denoising
diffusion models in discrete state-spaces. In Advances in Neural Information Processing Systems, 34.
[17] Nguyen, T., Ojha, U., Li, Y., Liu, H., & Lee, Y. J. (2024). Edit One for All: Interactive Batch
Image Editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition (pp. 8271-8280).
[18] Liang, Y., Gan, Y., Chen, M., Gutierrez, D., & Muñoz, A. (2019, October). Generic interactive
pixel‐level image editing. In Computer Graphics Forum (Vol. 38, No. 7, pp. 23-34).
[19] Shi, Y., Xue, C., Liew, J. H., Pan, J., Yan, H., Zhang, W., ... & Bai, S. (2024). Dragdiffusion:
Harnessing diffusion models for interactive point-based image editing. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition (pp. 8839-8849).
[20] Shin, J., Choi, D., & Park, J. (2024, December). InstantDrag: Improving Interactivity in Drag-based
Image Editing. In SIGGRAPH Asia 2024 Conference Papers (pp. 1-10).
[21] Shinagawa, S., Yoshino, K., Alavi, S. H., Georgila, K., Traum, D., Sakti, S., & Nakamura, S.
(2020). An Interactive Image Editing System Using an Uncertainty-Based Confirmation Strategy. IEEE
Access, 8, 98471-98480.
[22] Lin, J., Zhang, R., Ganz, F., Han, S., & Zhu, J. Y. (2021). Anycost gans for interactive image
synthesis and editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition (pp. 14986-14996).
[23] Cui, X., Li, Z., Li, P., Hu, Y., Shi, H., & He, Z. (2023). Chatedit: Towards multiturn interactive
facial image editing via dialogue. arXiv preprint arXiv:2303.11108.
[24] Anokhin, I., Demochkin, K. V., Khakhulin, T., Sterkin, G., Lempitsky, V. S., & Korzhenkov, D.
(2021). Image generators with conditionally-independent pixel synthesis. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 14273-14282).
[25] Mirabet-Herranz, N. (2024). Advancing Beyond People Recognition in Facial Image Processing
(Doctoral dissertation, Sorbonne Université).