- [1/27/2025] β Our paper is accpeted by ICLR 2025 and selected as Spotlight!
- [12/09/2024] β We release code for compositional framework (Gemini/Claude + SD3/SD2.1/Flux, ISG-Agent) today!
- [11/27/2024] π We release our paper and dataset today!
- Updates & News
- Contents
- Interleaved Scene Graph
- Evaluating Your Own Model
- ISG-Agent: Exploring the Upper Bound for Interleaved Generation
- Acknowledgments
- Citation
This evaluation method and benchmark is designed for evaluating interleaved generation in four levels: Structure, Block, Image, and Holistic. It is an well established testbed for model can perform both multimodal understanding and generation such as Show-o and Anole.
Currently, we switch to Openrouter for api calling. Given that we mainly use openai/gpt-4.1 for VQA in Image and Block level as well as MLLM-as-a-Judge in Holistic level, you can simply setup by: pip install openai.
You should also set up your own config in ISG_eval/config.yaml. You should input your openrouter api, default model for judging here.
/ISG_eval
βββ images (You should download it from huggingface and place here)
βββ ISG-Bench.jsonl
βββ ...-
images: Contains images in queries and golden answer. You can download it from here and place them under ISG_eval.
-
ISG-Bench.jsonl: Contains ground truth compiled previously by ISG. One data sample is as follows. It contains
Queryfor question andGoldenfor human-annotated golden answer.
{
"id": "0000",
"Category": "Prediction",
"Query": [
{
"type": "text",
"content": "I will give you a picture of a person washing their hands. Please use a combination of 4 images and text to show what will happen next. Please generate an overall description first, then directly generate adjacent image blocks. For example, [whole description] <object1 image> <object2 image> <object3 image> <object4 image>."
},
{
"type": "image",
"content": "images/0000_q1.png"
}
],
"Golden": [
{
"type": "text",
"content": "The person continues to scrub their hands thoroughly, with the soap lathering up. The hands are cleaned under running water, and the lather is rinsed away."
},
{
"type": "image",
"content": "images/0000_g1.png"
},
{
"type": "image",
"content": "images/0000_g2.png"
},
{
"type": "image",
"content": "images/0000_g3.png"
},
{
"type": "image",
"content": "images/0000_g4.png"
}
],
"predict": {
"structure": {
"Query": [
"<query_text1>",
"<query_img1>"
],
"Answer": [
"<gen_text1>",
"<gen_img1>",
"<gen_img2>",
"<gen_img3>",
"<gen_img4>"
]
},
"block_tuple": {
"relation": [
[
"<gen_text1>",
"<query_img1>",
"is an overall description of"
],
...
]
},
"block_qa": {
"questions": [
{
"subject": "<gen_text1>",
"object": "<query_img1>",
"relation": "is an overall description of",
"Question": "Does <gen_text1> describe this image?"
},
...
]
},
"image_tuple": [
[
"entity",
"hands",
"<gen_img1>"
],
...
],
"image_qa": {
"questions": [
{
"image": "<gen_img1>",
"Question": "Are there hands in this image?",
"id": 0,
"Preliminary": []
},
...
]
}
}
}{
"id": "0000",
"Category": "Prediction",
"output": [
{
"type": "text",
"content": "<text-content>"
},
{
"type": "image",
"content": "<path_of_the_input_image>"
}
]
}Then, run the following script:
python ISG-eval.py --input_file <your file>
python calculate_performance.py --input_file <output of ISG-eval.py>We provide Gemini/Claude + SD3/SD2.1/Flux for compositional framework. You can run the following script to generate interleaved content.
python compositional_inference.py \
--text_generator <gemini/claude> \
--image_generator <sd3/sd2.1/flux> \
--input_file ./ISG_eval/ISG-Bench.jsonlPlease See ISG_agent/README.md for enviroment setup and how to use. You can also reproduct the experiment result by comparing to the chart.
| Category | Model | Avg. | Style | Prog. | 3D | Dec. | I-T C. | Temp. | VST | VQA |
|---|---|---|---|---|---|---|---|---|---|---|
| Block | ISG-AGENT | 5.515 | 5.391 | 6.181 | 6.081 | 4.243 | 6.408 | 6.816 | 5.678 | 3.321 |
| Image | ISG-AGENT | 0.574 | 0.538 | 0.752 | 0.359 | 0.617 | 0.368 | 0.670 | 0.713 | - |
| Structure | ISG-AGENT | 0.871 | 0.944 | 0.967 | 0.788 | 0.902 | 0.800 | 1.000 | 0.987 | 0.577 |
| Holistic | ISG-AGENT | 6.262 | 5.873 | 6.459 | 4.887 | 7.582 | 6.932 | 4.540 | 7.030 | 6.795 |
This project is a follow-up of MLLM-as-a-Judge. This work is partially funded by Toyota Motor Corporation. Weβd also like to extend a thank you to Jieyu Zhang, Weikai Huang, and Zixian Ma for their insightful feedback and support.
@inproceedings{
chen2025interleaved,
title={Interleaved Scene Graphs for Interleaved Text-and-Image Generation Assessment},
author={Dongping Chen and Ruoxi Chen and Shu Pu and Zhaoyi Liu and Yanru Wu and Caixi Chen and Benlin Liu and Yue Huang and Yao Wan and Pan Zhou and Ranjay Krishna},
booktitle={The Thirteenth International Conference on Learning Representations},
year={2025},
url={https://openreview.net/forum?id=rDLgnYLM5b}
}




