ImageDoctor is a unified evaluation framework for text-to-image (T2I) generation.
It produces both multi-aspect scalar scores (semantic alignment, aesthetics, plausibility, overall) and spatially grounded heatmaps, following a novel “look–think–predict” paradigm inspired by human diagnosis.
Recent advances in text-to-image (T2I) generation have yielded increasingly realistic and instruction-following images.
However, evaluating such results remains challenging: most existing evaluators output a single scalar score, which fails to capture localized flaws or provide interpretable feedback.
ImageDoctor fills this gap by introducing dense, grounded evaluation:
- Scores each image across multiple dimensions
- Localizes artifacts and misalignments using heatmaps
- Explains its reasoning step by step using grounded image reasoning
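To make "dense, grounded evaluation" concrete, here is a minimal sketch of what one evaluation result could look like as a data structure. The class name, field names, the 2D-list heatmap layout, and the 0.5 threshold are all hypothetical choices for illustration; they are not the actual output format of `inference.py`.

```python
from dataclasses import dataclass
from typing import List, Tuple

Heatmap = List[List[float]]  # hypothetical layout: rows of per-cell flaw intensities


@dataclass
class DenseEvaluation:
    """Hypothetical container for the four scores and the two heatmaps."""
    semantic_alignment: float
    aesthetics: float
    plausibility: float
    overall: float
    artifact_heatmap: Heatmap
    misalignment_heatmap: Heatmap


def flagged_cells(heatmap: Heatmap, threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Return (row, col) positions whose intensity is at or above the threshold."""
    return [
        (r, c)
        for r, row in enumerate(heatmap)
        for c, value in enumerate(row)
        if value >= threshold
    ]
```

With this sketch, `flagged_cells([[0.0, 0.2], [0.7, 0.9]])` returns `[(1, 0), (1, 1)]`, the two cells in the bottom row. A single scalar score cannot express that kind of localization.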
- 🎯 Multi-Aspect Evaluation
  Predicts four interpretable quality dimensions:
  Semantic Alignment · Aesthetics · Plausibility · Overall Quality
- 🗺️ Spatially Grounded Feedback
  Generates heatmaps highlighting artifact and misalignment regions, providing fine-grained supervision and interpretability.
- 🧠 Grounded Image Reasoning
  Follows a look–think–predict paradigm:
  - Look: Identify potential flaw regions
  - Think: Analyze and reason about these regions
  - Predict: Produce final scores and diagnostic heatmaps

  The model can zoom in on localized regions when reasoning, mimicking human evaluators.
- ⚙️ GRPO Fine-Tuning
  ImageDoctor is refined through Group Relative Policy Optimization (GRPO) with a grounding reward, improving spatial awareness and preference alignment.
- 🧩 Versatile Applications
  - ✅ Evaluation metric
  - ✅ Reward function in RL for T2I models (DenseFlow-GRPO)
  - ✅ Verifier for test-time scaling and re-ranking
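Two of the uses above, GRPO-style rewards and test-time re-ranking, reduce to simple operations on a list of per-image scores. The sketch below is illustrative only: the function names are hypothetical, the scores are assumed to come from an evaluator such as ImageDoctor, and the population standard deviation with a small `eps` is one common choice rather than the exact recipe used in this repository.

```python
from statistics import mean, pstdev
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of samples generated for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), the
    group-relative normalization that gives GRPO its name.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


def best_of_n(candidates: Sequence[T], scores: Sequence[float]) -> T:
    """Return the candidate with the highest score (test-time re-ranking)."""
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return candidates[best_index]
```

For a group with rewards `[1.0, 2.0, 3.0]`, the middle sample gets an advantage of zero, the first a negative one, and the last a positive one. Only the ranking within the group matters, not the absolute scale of the evaluator's scores.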
git clone https://github.com/EthanG97/ImageDoctor.git
cd ImageDoctor
# Create a new conda environment from environment.yaml
conda env create -f environment.yaml
# Activate it
conda activate imagedoctor
python inference.py \
--checkpoint GYX97/ImageDoctor \
--image_path ./examples/cat.png \
--prompt "a close-up photo of a fluffy orange cat with green eyes" \
--output_dir ./outputs

If you use ImageDoctor in your research, please cite:
@misc{guo2025imagedoctordiagnosingtexttoimagegeneration,
author = {Yuxiang Guo and Jiang Liu and Ze Wang and Hao Chen and Ximeng Sun and Yang Zhao and Jialian Wu and Xiaodong Yu and Zicheng Liu and Emad Barsoum},
title = {ImageDoctor: Diagnosing Text-to-Image Generation via Grounded Image Reasoning},
eprint = {2510.01010},
archivePrefix={arXiv},
year = {2025},
url = {https://arxiv.org/abs/2510.01010},
}

ImageDoctor builds upon:
- Qwen2.5-VL – Vision-Language foundation
- RichHF-18K – Multi-aspect human preference dataset
- Flow-GRPO – Reinforcement Learning base framework
Released under the Apache 2.0 License for research and non-commercial use.
