
Interleaved Scene Graph for Interleaved Text-and-Image Generation Assessment

Website Paper Dataset


Updates & News

  • [1/27/2025] ⭐ Our paper is accepted by ICLR 2025 and selected as a Spotlight!
  • [12/09/2024] ⭐ We release the code for our compositional frameworks (Gemini/Claude + SD3/SD2.1/Flux, ISG-Agent) today!
  • [11/27/2024] πŸ“„ We release our paper and dataset today!


Interleaved Scene Graph

This evaluation method and benchmark assesses interleaved generation at four levels: Structure, Block, Image, and Holistic. It is a well-established testbed for models that can perform both multimodal understanding and generation, such as Show-o and Anole.

Environment Setup

We currently use OpenRouter for API calls. Since we mainly use openai/gpt-4.1 for VQA at the Image and Block levels, as well as for MLLM-as-a-Judge at the Holistic level, you can simply set up with: pip install openai.

You should also set up your own config in ISG_eval/config.yaml, where you enter your OpenRouter API key and the default model for judging.
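For reference, a minimal config.yaml might look like the sketch below; the field names are our assumption, so check the shipped ISG_eval/config.yaml for the authoritative keys:

```yaml
# Hypothetical keys -- consult the bundled ISG_eval/config.yaml for the real schema.
api_key: "sk-or-..."        # your OpenRouter API key
model: "openai/gpt-4.1"     # default model for VQA and MLLM-as-a-Judge
```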

Repository Management

/ISG_eval
├── images (download from Hugging Face and place here)
├── ISG-Bench.jsonl
├── ...
  1. images: Contains the images used in queries and golden answers. You can download them from here and place them under ISG_eval.

  2. ISG-Bench.jsonl: Contains the ground truth pre-compiled by ISG. One data sample is shown below; Query holds the question and Golden holds the human-annotated golden answer.

{
    "id": "0000",
    "Category": "Prediction",
    "Query": [
        {
            "type": "text",
            "content": "I will give you a picture of a person washing their hands. Please use a combination of 4 images and text to show what will happen next. Please generate an overall description first, then directly generate adjacent image blocks. For example, [whole description] <object1 image> <object2 image> <object3 image> <object4 image>."
        },
        {
            "type": "image",
            "content": "images/0000_q1.png"
        }
    ],
    "Golden": [
        {
            "type": "text",
            "content": "The person continues to scrub their hands thoroughly, with the soap lathering up. The hands are cleaned under running water, and the lather is rinsed away."
        },
        {
            "type": "image",
            "content": "images/0000_g1.png"
        },
        {
            "type": "image",
            "content": "images/0000_g2.png"
        },
        {
            "type": "image",
            "content": "images/0000_g3.png"
        },
        {
            "type": "image",
            "content": "images/0000_g4.png"
        }
    ],
    "predict": {
        "structure": {
            "Query": [
                "<query_text1>",
                "<query_img1>"
            ],
            "Answer": [
                "<gen_text1>",
                "<gen_img1>",
                "<gen_img2>",
                "<gen_img3>",
                "<gen_img4>"
            ]
        },
        "block_tuple": {
            "relation": [
                [
                    "<gen_text1>",
                    "<query_img1>",
                    "is an overall description of"
                ],
                ...
            ]
        },
        "block_qa": {
            "questions": [
                {
                    "subject": "<gen_text1>",
                    "object": "<query_img1>",
                    "relation": "is an overall description of",
                    "Question": "Does <gen_text1> describe this image?"
                },
                ...
            ]
        },
        "image_tuple": [
            [
                "entity",
                "hands",
                "<gen_img1>"
            ],
            ...
        ],
        "image_qa": {
            "questions": [
                {
                    "image": "<gen_img1>",
                    "Question": "Are there hands in this image?",
                    "id": 0,
                    "Preliminary": []
                },
                ...
            ]
        }
    }
}
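Reading the benchmark is a line-by-line JSON parse; the sketch below loads ISG-Bench.jsonl and splits a Query/Golden list into its text and image parts (the helper names are ours, not part of the repo):

```python
import json

def load_isg_bench(path):
    """Yield one benchmark sample per line of the JsonLine file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

def split_blocks(blocks):
    """Separate a Query/Golden list into text contents and image paths."""
    texts = [b["content"] for b in blocks if b["type"] == "text"]
    images = [b["content"] for b in blocks if b["type"] == "image"]
    return texts, images
```

For the sample above, split_blocks(sample["Golden"]) would return one text description and four image paths.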

Evaluating Your Own Model

Once you have your model's output, organize it as a JsonLine file, where the answer for each id is stored under the key `output`:
{
    "id": "0000",
    "Category": "Prediction",
    "output": [
        {
            "type": "text",
            "content": "<text-content>"
        },
        {
            "type": "image",
            "content": "<path_of_the_input_image>"
        }
    ]
}
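Assembling your model's answers into this format is straightforward; a minimal sketch (the helper name is ours, not part of the repo):

```python
import json

def write_outputs(records, path):
    """Write model answers as a JsonLine file in the expected format.

    records: iterable of (sample_id, category, blocks) tuples, where each
    block is {"type": "text"|"image", "content": ...} as shown above.
    """
    with open(path, "w", encoding="utf-8") as f:
        for sample_id, category, blocks in records:
            row = {"id": sample_id, "Category": category, "output": blocks}
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
```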

Then, run the following script:

python ISG-eval.py --input_file <your file>
python calculate_performance.py --input_file <output of ISG-eval.py>

Compositional Framework

Gemini/Claude + SD3/SD2.1/Flux

We provide Gemini/Claude + SD3/SD2.1/Flux as compositional frameworks. Run the following script to generate interleaved content.

python compositional_inference.py \
    --text_generator <gemini/claude> \
    --image_generator <sd3/sd2.1/flux> \
    --input_file ./ISG_eval/ISG-Bench.jsonl
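Conceptually, the compositional pipeline asks the text model to plan the interleaved answer and then fills each image slot with the diffusion model. A schematic sketch with stubbed-out generators (generate_text/generate_image and the "caption" field are placeholders, not the repo's actual API):

```python
def compose(query, generate_text, generate_image):
    """Plan interleaved output with an LLM, then render each image slot.

    generate_text(query) -> list of blocks; image blocks carry a caption
    to render. generate_image(caption) -> a generated image file path.
    Both callables stand in for Gemini/Claude and SD3/SD2.1/Flux.
    """
    plan = generate_text(query)
    output = []
    for block in plan:
        if block["type"] == "image":
            output.append({"type": "image",
                           "content": generate_image(block["caption"])})
        else:
            output.append(block)
    return output
```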

ISG-Agent: Exploring the Upper Bound for Interleaved Generation

ISG-Agent is a compositional framework that leverages tools to generate high-quality interleaved content while strictly following the user's query.

ISG-Agent outputs interleaved text-and-image results, which you can evaluate with the pipeline above.

Please see ISG_agent/README.md for environment setup and usage. You can also reproduce the experimental results and compare them against the table below.

| Category  | Model     | Avg.  | Style | Prog. | 3D    | Dec.  | I-T C. | Temp. | VST   | VQA   |
|-----------|-----------|-------|-------|-------|-------|-------|--------|-------|-------|-------|
| Block     | ISG-AGENT | 5.515 | 5.391 | 6.181 | 6.081 | 4.243 | 6.408  | 6.816 | 5.678 | 3.321 |
| Image     | ISG-AGENT | 0.574 | 0.538 | 0.752 | 0.359 | 0.617 | 0.368  | 0.670 | 0.713 | -     |
| Structure | ISG-AGENT | 0.871 | 0.944 | 0.967 | 0.788 | 0.902 | 0.800  | 1.000 | 0.987 | 0.577 |
| Holistic  | ISG-AGENT | 6.262 | 5.873 | 6.459 | 4.887 | 7.582 | 6.932  | 4.540 | 7.030 | 6.795 |
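As a sanity check, the Avg. column is the mean of the available per-category scores in each row (the "-" entry in the Image row is excluded):

```python
from statistics import mean

# Per-category scores from the table above; the "-" (Image/VQA) is excluded.
block_scores = [5.391, 6.181, 6.081, 4.243, 6.408, 6.816, 5.678, 3.321]
image_scores = [0.538, 0.752, 0.359, 0.617, 0.368, 0.670, 0.713]

print(round(mean(block_scores), 3))  # -> 5.515
print(round(mean(image_scores), 3))  # -> 0.574
```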

Acknowledgments

This project is a follow-up of MLLM-as-a-Judge. This work is partially funded by Toyota Motor Corporation. We’d also like to extend a thank you to Jieyu Zhang, Weikai Huang, and Zixian Ma for their insightful feedback and support.

Citation

@inproceedings{
    chen2025interleaved,
    title={Interleaved Scene Graphs for Interleaved Text-and-Image Generation Assessment},
    author={Dongping Chen and Ruoxi Chen and Shu Pu and Zhaoyi Liu and Yanru Wu and Caixi Chen and Benlin Liu and Yue Huang and Yao Wan and Pan Zhou and Ranjay Krishna},
    booktitle={The Thirteenth International Conference on Learning Representations},
    year={2025},
    url={https://openreview.net/forum?id=rDLgnYLM5b}
}
