StructBench evaluates image generation and editing models on structured visuals like charts, diagrams, math figures, tables, graphs, and puzzles:
- StructVisuals: 1.34 million training examples with Chain-of-Thought annotations
- StructEditBench: 1,714 editing evaluation examples with 32,031 Q&A pairs
- StructT2IBench: 1,714 T2I evaluation examples with 37,941 Q&A pairs
- StructScore: Multi-round Q&A evaluation metric using VLMs (GPT-5 or Qwen2.5-VL-72B)
Categories: Math, Graph, Chart, Puzzle, Science, Table
We recommend using a Python 3.10+ virtual environment:

```bash
conda create -n structbench python=3.10
conda activate structbench
```

Install dependencies:

```bash
# For GPT-5 evaluation
pip install openai datasets Pillow tqdm huggingface_hub

# For Qwen evaluation (with vLLM acceleration)
pip install vllm transformers
```

Your evaluation dataset should be hosted on the Hugging Face Hub with the following structure:
Required columns:
- `qa_list`: List of Q&A dictionaries, each containing:
  - `question` (str): The question to ask about the image
  - `answer` or `ground_truth_answer` (str): The correct answer
  - `label` (str): Either `"editing"` (modified regions) or `"maintain"` (unchanged regions)
- `category` (str): Category label (e.g., `"chart"`, `"math"`, `"table"`, `"graph"`, `"puzzle"`, `"science"`)
- `{prefix}{model_name}` (PIL.Image): Your model's generated images
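Before uploading, it can help to sanity-check that each row follows this schema. Below is a minimal sketch; the column and key names come from the list above, while `validate_row` and the `output_image_mymodel` column name are only illustrative, not part of StructBench:

```python
REQUIRED_QA_KEYS = {"question", "label"}

def validate_row(row, image_column="output_image_mymodel"):
    """Check one dataset row against the required StructBench columns."""
    assert image_column in row, f"missing model image column: {image_column}"
    assert "category" in row, "missing 'category' column"
    for qa in row["qa_list"]:
        assert REQUIRED_QA_KEYS <= qa.keys(), f"Q&A entry missing keys: {qa}"
        assert "answer" in qa or "ground_truth_answer" in qa, f"missing reference answer: {qa}"
        assert qa["label"] in ("editing", "maintain"), f"unexpected label: {qa['label']}"
```

Running it over a handful of rows (e.g., `for row in dataset.select(range(10)): validate_row(row)`) before pushing to the Hub catches most schema mistakes.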
Example dataset structure:
```python
{
    "qa_list": [
        {
            "question": "What is the title of the chart?",
            "answer": "Sales Report",
            "label": "editing"
        },
        {
            "question": "What is the background color?",
            "answer": "white",
            "label": "maintain"
        }
    ],
    "category": "chart",
    "output_image_mymodel": <PIL.Image>,  # Your model's output

    # Optional for context:
    "source_image": <PIL.Image>           # Original image for editing tasks
}
```

Note: The `label` field determines how accuracy is weighted:
- Final accuracy = 0.9 × editing_accuracy + 0.1 × maintain_accuracy
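To make the weighting concrete, here is a conceptual sketch of a StructScore-style check with a VLM judge, using the OpenAI Python client. It is only an illustration, not the official `gpt_scoring.py` logic: the yes/no judging prompt and the `judge_question` helper are assumptions, and the exact prompts and answer-matching used by StructScore may differ.

```python
import base64
from io import BytesIO

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_question(image, question, reference_answer, model="gpt-5"):
    """Ask the VLM judge one question about the image and check it against the reference."""
    buf = BytesIO()
    image.save(buf, format="PNG")
    b64 = base64.b64encode(buf.getvalue()).decode()
    prompt = (
        f"Question about the image: {question}\n"
        f"Reference answer: {reference_answer}\n"
        "Does the image match the reference answer? Reply with exactly 'yes' or 'no'."
    )
    resp = client.chat.completions.create(
        model=model,  # judge model named in this README; any vision-capable model works
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")

def weighted_accuracy(image, qa_list):
    """0.9 × editing accuracy + 0.1 × maintain accuracy over one image's Q&A list."""
    results = {"editing": [], "maintain": []}
    for qa in qa_list:
        answer = qa.get("answer") or qa.get("ground_truth_answer")
        results[qa["label"]].append(judge_question(image, qa["question"], answer))
    acc = {k: 100.0 * sum(v) / len(v) if v else 0.0 for k, v in results.items()}
    return 0.9 * acc["editing"] + 0.1 * acc["maintain"]
```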
```python
from datasets import load_dataset

# Load official benchmark
dataset = load_dataset("hshjerry0315/StructEditBench")
# or
dataset = load_dataset("hshjerry0315/StructT2IBench")
```

Add your model's generated images to the dataset and push it to the Hub for evaluation:

```python
from PIL import Image
from datasets import Dataset
def add_model_outputs(dataset, model_fn, prefix="output_image_mymodel"):
    """Add your model's generated images to the dataset."""
    results = []
    for item in dataset:
        # Generate image with your model
        generated_image = model_fn(item)  # Returns PIL.Image
        # Add to item
        item[prefix] = generated_image
        results.append(item)
    return Dataset.from_list(results)
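
# Example of what `model_fn` might look like for an editing model. This is a
# hypothetical placeholder, not part of StructBench: `my_editing_pipeline` and
# the `instruction` field are assumptions -- adapt them to your own model/data.
def your_model_function(item):
    source = item.get("source_image")     # original image (editing tasks)
    prompt = item.get("instruction", "")  # hypothetical instruction/prompt field
    edited = my_editing_pipeline.edit(image=source, prompt=prompt)  # your model call
    return edited  # must return a PIL.Image
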
# Add your model outputs
dataset_with_outputs = add_model_outputs(dataset["train"], your_model_function)
# Push to HuggingFace for evaluation
dataset_with_outputs.push_to_hub("your-username/your-eval-dataset")
```

```bash
# Evaluate with Qwen2.5-VL
python qwen_scoring.py \
--model_path Qwen/Qwen2.5-VL-72B-Instruct \
--dataset_path your-username/your-eval-dataset \
--output_dir results/mymodel \
--tensor_parallel_size 8 \
--prefix output_image_
```

For GPT-5 evaluation, set your OpenAI API key and run `gpt_scoring.py`:

```bash
export OPENAI_API_KEY="your-api-key-here"
python gpt_scoring.py \
--dataset_path hshjerry0315/StructEditBench \
--output_dir results/gpt_eval \
--api_key $OPENAI_API_KEY \
--num_workers 100 \
--prefix output_image_
```

Arguments for `gpt_scoring.py`:
| Argument | Type | Required | Default | Description |
|---|---|---|---|---|
| `--dataset_path` | str | ✓ | - | HuggingFace dataset path (e.g., `hshjerry0315/StructEditBench`) |
| `--output_dir` | str | ✓ | - | Output directory for results |
| `--api_key` | str | ✓ | - | OpenAI API key |
| `--num_workers` | int | | 100 | Number of parallel threads |
| `--prefix` | str | | `output_image_` | Prefix for model image columns (e.g., `output_image_mymodel`) |
| `--split` | str | | `train` | Dataset split to evaluate |
| `--debug` | flag | | False | Process only 20 samples for testing |
| `--output_repo_name` | str | | None | Optional: Upload results to HuggingFace Hub |
```bash
python qwen_scoring.py \
--model_path Qwen/Qwen2.5-VL-72B-Instruct \
--dataset_path hshjerry0315/StructEditBench \
--output_dir results/qwen_eval \
--tensor_parallel_size 8 \
--dtype bfloat16 \
--gpu_mem_util 0.9
```

Arguments for `qwen_scoring.py`:
| Argument | Type | Required | Default | Description |
|---|---|---|---|---|
| `--dataset_path` | str | ✓ | - | HuggingFace dataset path (e.g., `hshjerry0315/StructT2IBench`) |
| `--output_dir` | str | ✓ | - | Output directory for results |
| `--model_path` | str | ✓ | - | Qwen model path or HF repo |
| `--tensor_parallel_size` | int | | 4 | Number of GPUs for tensor parallelism |
| `--dtype` | str | | `bfloat16` | Model dtype (bfloat16 or float16) |
| `--gpu_mem_util` | float | | 0.9 | GPU memory utilization (0-1) |
| `--max_model_len` | int | | 5120 | Maximum model sequence length |
| `--max_new_tokens` | int | | 256 | Max tokens to generate per response |
| `--img_size` | int | | 1024 | Image preprocessing size (512 or 1024) |
| `--prefix` | str | | `output_image_` | Prefix for model image columns (e.g., `output_image_mymodel`) |
| `--split` | str | | `train` | Dataset split to evaluate |
| `--debug` | flag | | False | Process only 20 samples for testing |
| `--output_repo_name` | str | | None | Optional: Upload results to HuggingFace Hub |
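Both scoring scripts take a `--prefix` for the model image columns. A quick way to check which columns in your dataset match a given prefix before launching a run (a sketch, assuming the upload layout described earlier):

```python
from datasets import load_dataset

ds = load_dataset("your-username/your-eval-dataset", split="train")
prefix = "output_image_"  # should match the --prefix passed to the scoring script
model_columns = [c for c in ds.column_names if c.startswith(prefix)]
print("Model image columns:", model_columns)  # e.g., ['output_image_mymodel']
```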
After evaluation, results are saved in `output_dir/`:

```text
results/
├── processed_dataset/                          # Full dataset with results
├── StructEditBench_mymodel_analysis.json       # Summary for GPT eval
└── StructEditBench_mymodel_qwen_analysis.json  # Summary for Qwen eval
```
The evaluated dataset contains your original data plus new columns for each model:
GPT-5 output columns:
- `{model_name}_list`: List of Q&A results with answers, corrections, and labels
- `{model_name}_accuracy`: Weighted accuracy (0.9 × editing + 0.1 × maintain)
- `{model_name}_editing_accuracy`: Accuracy on editing questions
- `{model_name}_maintain_accuracy`: Accuracy on maintain questions
Qwen output columns:
- `{model_name}_qwen_list`: List of Q&A results
- `{model_name}_qwen_accuracy`: Weighted accuracy
- `{model_name}_qwen_editing_accuracy`: Accuracy on editing questions
- `{model_name}_qwen_maintain_accuracy`: Accuracy on maintain questions
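These per-sample results can be inspected directly. A minimal sketch, assuming `processed_dataset/` was written with the `datasets` library's `save_to_disk` (the exact on-disk format may differ) and that the model was evaluated under the placeholder name `mymodel`:

```python
from datasets import DatasetDict, load_from_disk

ds = load_from_disk("results/processed_dataset")
if isinstance(ds, DatasetDict):  # pick the evaluated split if a DatasetDict was saved
    ds = ds["train"]

model = "mymodel"  # placeholder model name
for row in ds.select(range(3)):
    print(row["category"],
          row[f"{model}_accuracy"],
          row[f"{model}_editing_accuracy"],
          row[f"{model}_maintain_accuracy"])
```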
The `*_analysis.json` summary file has the following structure:

```json
{
    "model_name": "mymodel",
    "global_weighted_accuracy": 45.23,
    "global_editing_accuracy": 48.50,
    "global_maintain_accuracy": 42.15,
    "group_accuracies": {
        "chart": {
            "accuracy": 50.58,
            "editing_accuracy": 52.30,
            "maintain_accuracy": 48.90,
            "num_samples": 285
        },
        "math": {...},
        ...
    },
    "total_samples": 1714,
    "total_evaluations": 32031
}
```

To submit your results to the StructBench leaderboard:
Required Metrics:
- StructEditBench:
- Accuracy (%) for each category (Math, Chart, Graph, Puzzle, Science, Table)
- Overall Accuracy (%)
- PSNR for each category and overall
- StructT2IBench:
- Accuracy (%) for each category (Math, Chart, Graph, Puzzle, Science, Table)
- Overall Accuracy (%)
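The accuracy numbers above can be pulled straight out of the `*_analysis.json` summary. A sketch based on the JSON structure shown earlier (PSNR is not shown in that example structure and is not covered here):

```python
import json

with open("results/StructEditBench_mymodel_analysis.json") as f:
    summary = json.load(f)

print("Overall Accuracy (%):", summary["global_weighted_accuracy"])
for category, stats in summary["group_accuracies"].items():
    print(f"{category}: {stats['accuracy']:.2f}% ({stats['num_samples']} samples)")
```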
Submission:
Email your *_analysis.json files and model details to:
If you use StructBench in your research, please cite:
```bibtex
@article{zhuo2025structbench,
  title={Factuality Matters: When Image Generation and Editing Meet Structured Visuals},
  author={Zhuo, Le and Han, Songhao and Pu, Yuandong and Qiu, Boxiang and Paul, Sayak and Liao, Yue and Liu, Yihao and Shao, Jie and Chen, Xi and Liu, Si and Li, Hongsheng},
  journal={arXiv preprint arXiv:2510.05091},
  year={2025}
}
```

This project is released under the Apache License 2.0.
