ImageDoctor: Rich Feedback for Text-to-Image Generation through Grounded Image Reasoning

ImageDoctor is a unified evaluation framework for text-to-image (T2I) generation.
It produces both multi-aspect scalar scores (semantic alignment, aesthetics, plausibility, overall) and spatially grounded heatmaps, following a novel “look–think–predict” paradigm inspired by human diagnosis.

📘 Overview

Recent advances in text-to-image (T2I) generation have yielded increasingly realistic and instruction-following images.
However, evaluating such results remains challenging: most existing evaluators output a single scalar score, which fails to capture localized flaws or provide interpretable feedback.

ImageDoctor fills this gap by introducing dense, grounded evaluation:

- Scores each image across multiple dimensions
- Localizes artifacts and misalignments using heatmaps
- Explains its judgments step by step through grounded image reasoning

🚀 Key Features

🎯 Multi-Aspect Evaluation
Predicts four interpretable quality dimensions:
Semantic Alignment · Aesthetics · Plausibility · Overall Quality
🗺️ Spatially Grounded Feedback
Generates heatmaps highlighting artifact and misalignment regions, providing fine-grained supervision and interpretability.
🧠 Grounded Image Reasoning
Follows a look–think–predict paradigm:
- Look: Identify potential flaw regions
- Think: Analyze and reason about these regions
- Predict: Produce final scores and diagnostic heatmaps
  The model can zoom in on localized regions when reasoning, mimicking human evaluators.
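To make the "zoom in" step concrete, the sketch below thresholds a heatmap and returns the bounding box of the flagged cells, which a caller could then crop for closer inspection. It is an illustration only: the function name `flaw_bbox`, the 0.5 threshold, and the list-of-lists heatmap format are assumptions, not ImageDoctor's actual implementation.

```python
from typing import List, Optional, Tuple


def flaw_bbox(
    heatmap: List[List[float]], threshold: float = 0.5
) -> Optional[Tuple[int, int, int, int]]:
    """Return (x0, y0, x1, y1) covering every cell >= threshold, or None.

    Hypothetical helper: x1 and y1 are exclusive, so the box can be used
    directly as a crop region.
    """
    xs, ys = [], []
    for y, row in enumerate(heatmap):
        for x, value in enumerate(row):
            if value >= threshold:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None  # nothing exceeded the threshold
    return min(xs), min(ys), max(xs) + 1, max(ys) + 1


# Toy 3x4 heatmap (made-up values) with one flagged region.
heatmap = [
    [0.0, 0.1, 0.0, 0.0],
    [0.0, 0.7, 0.9, 0.0],
    [0.0, 0.6, 0.2, 0.0],
]
print(flaw_bbox(heatmap))  # (1, 1, 3, 3)
```

A single box is the simplest choice here; separating several disjoint flaw regions would need connected-component labelling instead.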
⚙️ GRPO Fine-Tuning
ImageDoctor is refined through Group Relative Policy Optimization (GRPO) with a grounding reward, improving spatial awareness and preference alignment.
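As a rough illustration of the "group relative" part of GRPO, the sketch below normalizes the rewards of several responses sampled for the same input against the group's own mean and standard deviation. The reward values are made up, the function name is hypothetical, and how ImageDoctor combines its grounding reward with other reward terms is not shown here. Population standard deviation is used; implementations differ on this detail.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one sampled group.

    Each advantage is (reward - group mean) / (group std + eps); eps avoids
    division by zero when all rewards in the group are identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up rewards for four responses to the same (image, prompt) pair.
rewards = [0.2, 0.5, 0.8, 0.5]
advantages = group_relative_advantages(rewards)
for r, a in zip(rewards, advantages):
    print(f"reward={r:.2f} advantage={a:+.3f}")
```

Responses that beat the group average receive a positive advantage and are reinforced; below-average responses receive a negative one.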
🧩 Versatile Applications
- ✅ Evaluation metric
- ✅ Reward function in RL for T2I models (DenseFlow-GRPO)
- ✅ Verifier for test-time scaling and re-ranking
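The verifier use case can be sketched as best-of-N re-ranking: generate several candidate images for one prompt, score each, and keep the highest-scoring one. In the sketch below, `score_fn` is a placeholder for whatever produces an overall score; the `toy_score` function and its values are invented for illustration and do not call ImageDoctor.

```python
from typing import Callable, List, Sequence, Tuple


def rerank(
    candidates: Sequence[str],
    prompt: str,
    score_fn: Callable[[str, str], float],
) -> List[Tuple[str, float]]:
    """Score every candidate image for the prompt and sort best-first."""
    scored = [(path, score_fn(path, prompt)) for path in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Stand-in scorer with made-up values (assumption: higher is better).
fake_scores = {"a.png": 0.61, "b.png": 0.87, "c.png": 0.43}


def toy_score(image_path: str, prompt: str) -> float:
    return fake_scores[image_path]


ranked = rerank(list(fake_scores), "a fluffy orange cat", toy_score)
best_image, best_score = ranked[0]
print(best_image, best_score)  # b.png 0.87
```

Passing the scorer in as a callable keeps the re-ranking logic independent of how scores are actually computed.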

🧱 Environments

git clone https://github.com/EthanG97/ImageDoctor.git
cd ImageDoctor
# Create a new conda environment from environment.yaml
conda env create -f environment.yaml

# Activate it
conda activate imagedoctor

🧠 Inference

python inference.py \
  --checkpoint GYX97/ImageDoctor \
  --image_path ./examples/cat.png \
  --prompt "a close-up photo of a fluffy orange cat with green eyes" \
  --output_dir ./outputs
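To evaluate several image/prompt pairs, the same command can be assembled from Python. The sketch below only builds and prints the command lines, using the flags shown above; the helper name `build_inference_cmd` is hypothetical, and the sketch makes no assumption about what `inference.py` writes to the output directory.

```python
import shlex
from typing import List


def build_inference_cmd(
    image_path: str,
    prompt: str,
    checkpoint: str = "GYX97/ImageDoctor",
    output_dir: str = "./outputs",
) -> List[str]:
    """Build the argument list for one inference.py call (hypothetical helper)."""
    return [
        "python", "inference.py",
        "--checkpoint", checkpoint,
        "--image_path", image_path,
        "--prompt", prompt,
        "--output_dir", output_dir,
    ]


jobs = [
    ("./examples/cat.png", "a close-up photo of a fluffy orange cat with green eyes"),
]
for image_path, prompt in jobs:
    cmd = build_inference_cmd(image_path, prompt)
    print(shlex.join(cmd))  # shell-quoted form, safe to copy and paste
    # To execute from the repository root: subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell-quoting problems with prompts that contain spaces or quotes.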

🧾 Citation

If you use ImageDoctor in your research, please cite:

@misc{guo2025imagedoctordiagnosingtexttoimagegeneration,
  author    = {Yuxiang Guo and Jiang Liu and Ze Wang and Hao Chen and Ximeng Sun and Yang Zhao and Jialian Wu and Xiaodong Yu and Zicheng Liu and Emad Barsoum},
  title     = {ImageDoctor: Diagnosing Text-to-Image Generation via Grounded Image Reasoning},
  eprint    = {2510.01010},
  archivePrefix = {arXiv},
  year      = {2025},
  url       = {https://arxiv.org/abs/2510.01010},
}

🙏 Acknowledgement

ImageDoctor builds upon:

- Qwen2.5-VL – Vision-language foundation model
- RichHF-18K – Multi-aspect human preference dataset
- Flow-GRPO – Reinforcement learning base framework

📄 License

Released under the Apache 2.0 License for research and non-commercial use.
