The official implementation for "ImageDoctor: Diagnosing Text-to-Image Generation via Grounded Image Reasoning"

EthanG97/ImageDoctor

ImageDoctor: Rich Feedback for Text-to-Image Generation through Grounded Image Reasoning

Arxiv Model Website

ImageDoctor is a unified evaluation framework for text-to-image (T2I) generation.
It produces both multi-aspect scalar scores (semantic alignment, aesthetics, plausibility, overall) and spatially grounded heatmaps, following a novel “look–think–predict” paradigm inspired by human diagnosis.

ImageDoctor Teaser


📘 Overview

Recent advances in text-to-image (T2I) generation have yielded increasingly realistic and instruction-following images.
However, evaluating such results remains challenging: most existing evaluators output a single scalar score, which cannot capture localized flaws or provide interpretable feedback.

ImageDoctor fills this gap with dense, grounded evaluation. It:

  • scores each image across multiple dimensions,
  • localizes artifacts and misalignments using heatmaps, and
  • explains its reasoning step by step through grounded image reasoning.
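
To illustrate how a spatial heatmap can be consumed downstream, here is a minimal sketch that turns a heatmap into a bounding box around the flagged region. It assumes the heatmap is a 2D grid of values in [0, 1]. That format, the `heatmap_to_box` helper, and the 0.5 threshold are illustrative assumptions, not ImageDoctor's actual output format or API.

```python
def heatmap_to_box(heatmap, threshold=0.5):
    """Return (row_min, col_min, row_max, col_max) covering every cell whose
    value is >= threshold, or None if no cell passes.

    Hypothetical helper: the 2D-list heatmap format is an assumption.
    """
    hits = [
        (r, c)
        for r, row in enumerate(heatmap)
        for c, value in enumerate(row)
        if value >= threshold
    ]
    if not hits:
        return None
    rows = [r for r, _ in hits]
    cols = [c for _, c in hits]
    return (min(rows), min(cols), max(rows), max(cols))


# Toy 3x3 heatmap with a flagged region in the middle row.
toy_heatmap = [
    [0.0, 0.0, 0.0],
    [0.0, 0.9, 0.7],
    [0.0, 0.0, 0.2],
]
box = heatmap_to_box(toy_heatmap)
```

A box like this could be used to crop the suspect region for closer inspection.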

🚀 Key Features

  • 🎯 Multi-Aspect Evaluation
    Predicts four interpretable quality dimensions:
    Semantic Alignment · Aesthetics · Plausibility · Overall Quality

  • 🗺️ Spatially Grounded Feedback
    Generates heatmaps highlighting artifact and misalignment regions, providing fine-grained supervision and interpretability.

  • 🧠 Grounded Image Reasoning
    Follows a look–think–predict paradigm:

    • Look: Identify potential flaw regions
    • Think: Analyze and reason about these regions
    • Predict: Produce final scores and diagnostic heatmaps

    The model can zoom in on localized regions when reasoning, mimicking human evaluators.

  • ⚙️ GRPO Fine-Tuning
    ImageDoctor is refined through Group Relative Policy Optimization (GRPO) with a grounding reward, improving spatial awareness and preference alignment.

  • 🧩 Versatile Applications

    • ✅ Evaluation metric
    • ✅ Reward function in RL for T2I models (DenseFlow-GRPO)
    • ✅ Verifier for test-time scaling and re-ranking
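
Two of the ideas above can be sketched in a few lines: the group-relative advantage commonly used in GRPO, and best-of-N re-ranking by a predicted score. This is a sketch under assumptions, not the repository's training or evaluation code. The mean/standard-deviation normalization is the commonly used GRPO formulation, and ImageDoctor's grounding reward is not modeled here. The function names and the `"overall"` dictionary key are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so samples
    are compared with their siblings rather than on an absolute scale.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def rerank_by_overall(candidates):
    """Sort candidate images best-first by an assumed 'overall' score field."""
    return sorted(candidates, key=lambda c: c["overall"], reverse=True)


# Toy rewards for three images generated from one prompt.
advantages = group_relative_advantages([1.0, 2.0, 3.0])

# Toy candidates for best-of-N selection; the scores are made up.
ranked = rerank_by_overall([
    {"image": "a.png", "overall": 0.4},
    {"image": "b.png", "overall": 0.9},
    {"image": "c.png", "overall": 0.6},
])
best = ranked[0]["image"]
```

When every reward in a group is equal, the advantages are all zero, so that group contributes no preference signal.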

🧱 Environments

git clone https://github.com/EthanG97/ImageDoctor.git
cd ImageDoctor
# Create a new conda environment from environment.yaml
conda env create -f environment.yaml

# Activate it
conda activate imagedoctor

🧠 Inference

python inference.py \
  --checkpoint GYX97/ImageDoctor \
  --image_path ./examples/cat.png \
  --prompt "a close-up photo of a fluffy orange cat with green eyes" \
  --output_dir ./outputs
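
To evaluate several image and prompt pairs, the command above can be wrapped in a small script. The sketch below uses only the flags shown in this README. The helpers `build_inference_command` and `run_batch` are hypothetical and not part of the repository, and the sketch assumes it is run from the repository root, where `inference.py` lives.

```python
import subprocess
import sys


def build_inference_command(image_path, prompt,
                            output_dir="./outputs",
                            checkpoint="GYX97/ImageDoctor"):
    """Build the argument list for one inference.py call (README flags only)."""
    return [
        sys.executable, "inference.py",
        "--checkpoint", checkpoint,
        "--image_path", image_path,
        "--prompt", prompt,
        "--output_dir", output_dir,
    ]


def run_batch(pairs, dry_run=True):
    """Build one command per (image_path, prompt) pair.

    With dry_run=True the commands are only returned, not executed.
    """
    commands = [build_inference_command(image, prompt) for image, prompt in pairs]
    if not dry_run:
        for command in commands:
            # check=True raises CalledProcessError if inference.py fails.
            subprocess.run(command, check=True)
    return commands


commands = run_batch([
    ("./examples/cat.png",
     "a close-up photo of a fluffy orange cat with green eyes"),
])
```

Passing arguments as a list avoids shell quoting problems with prompts that contain spaces or quotes. Set `dry_run=False` to actually launch the runs.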

🧾 Citation

If you use ImageDoctor in your research, please cite:

@misc{guo2025imagedoctordiagnosingtexttoimagegeneration,
  author        = {Yuxiang Guo and Jiang Liu and Ze Wang and Hao Chen and Ximeng Sun and Yang Zhao and Jialian Wu and Xiaodong Yu and Zicheng Liu and Emad Barsoum},
  title         = {ImageDoctor: Diagnosing Text-to-Image Generation via Grounded Image Reasoning},
  eprint        = {2510.01010},
  archivePrefix = {arXiv},
  year          = {2025},
  url           = {https://arxiv.org/abs/2510.01010}
}

🙏 Acknowledgement

ImageDoctor builds upon:

  • Qwen2.5-VL – Vision-Language foundation
  • RichHF-18K – Multi-aspect human preference dataset
  • Flow-GRPO – Reinforcement Learning base framework

📄 License

Released under the Apache 2.0 License for research and non-commercial use.
