🏆 Leaderboard | 🤗 MMMG | 📖 Paper
This repo contains the evaluation pipeline for the paper "MMMG: a Comprehensive and Reliable Evaluation Suite for Multitask Multimodal Generation".
Automatically evaluating multimodal generation presents a significant challenge, as automated metrics often struggle to align reliably with human evaluation, especially for complex tasks that involve multiple modalities. To address this, we present MMMG, a comprehensive and human-aligned benchmark for multimodal generation across 4 modality combinations (image, audio, interleaved text and image, interleaved text and audio), with a focus on tasks that present significant challenges for generation models, while still enabling reliable automatic evaluation through a combination of models and programs. MMMG encompasses 49 tasks (including 29 newly developed ones), each with a carefully designed evaluation pipeline, and 937 instructions to systematically assess reasoning, controllability, and other key capabilities of multimodal generation models. Extensive validation demonstrates that MMMG is highly aligned with human evaluation, achieving an average agreement of 94.3%. Benchmarking results on 24 multimodal generation models reveal that even though the state-of-the-art model, GPT Image, achieves 78.3% accuracy for image generation, it falls short on multimodal reasoning and interleaved generation. Furthermore, results suggest considerable headroom for improvement in audio generation, highlighting an important direction for future research.
In this codebase, we provide two options for evaluating your multimodal generation model on MMMG:
- Evaluate with pre-generated responses: If you already have the responses generated by your model, you can directly evaluate them using our evaluation pipeline.
- Generate responses and evaluate: If you don't have the responses yet, you can generate them using our generation pipeline and then evaluate them.
First, install the required packages. You need Python >= 3.9 plus OpenAI and Gemini API keys (you can apply for them at OpenAI API key and Gemini API key). The remaining requirements are usually compatible with an existing working environment.
```bash
pip install -r requirements.txt --upgrade-strategy only-if-needed
export OPENAI_KEY=openai_key  # change to your OpenAI API key
export GEMINI_KEY=gemini_key  # change to your Gemini API key
```
You can also manually add your API keys at Lines 22-23 in utils.py to store them permanently.
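For orientation, hard-coding the keys in utils.py might look roughly like the sketch below; the exact variable names around Lines 22-23 are an assumption, so check the file itself.

```python
# utils.py -- illustrative only; confirm the actual variable names in the file.
OPENAI_KEY = "your-openai-api-key"
GEMINI_KEY = "your-gemini-api-key"
```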
First, put your generated responses in the required format. For reference, you can find the outputs we obtained for all models at the Google Drive link.
For demonstration purposes, we include a subset of Gemini2 responses on the object-addition task in ./output/Gemini2/. The responses are stored in ./output/Gemini2/i_edit_add.json and the generated images in ./output/Gemini2/image. You can evaluate them with:

```bash
python eval_pipeline.py --model_name Gemini2 --category quick_test --job evaluate
```
You will see a ./output/Gemini2/it_eval.csv file, which stores the score of Gemini2 on object addition. The autoeval score for each task instance is stored in ./output/Gemini2/i_edit_add.json.
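After this quick test, the Gemini2 output folder contains roughly the following (layout reconstructed from the paths above):

```
output/Gemini2/
├── i_edit_add.json   # responses, plus per-instance autoeval scores after evaluation
├── image/            # generated images
└── it_eval.csv       # aggregate score written by the evaluate job
```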
To get MMMG scores, you need pre-generated responses for all tasks in a category (i, it, a, at), placed in the ./output/{model_name}/ folder. For example, to get the interleaved text-and-image (it) score of Gemini2, generate responses for all it subtasks and run:

```bash
python eval_pipeline.py --model_name Gemini2 --category it --job evaluate
```
If you don't have the responses yet, you can generate them with our generation pipeline and then evaluate them. The generation pipeline is implemented in eval_pipeline.py, and the model-specific implementation goes in model_customized.py.
After installing the environment, implement the generate function for your model in model_customized.py (a minimal sketch follows below). Make sure you strictly follow the format requirements specified in model_customized.py. Then run the following command to generate responses for all tasks:

```bash
python eval_pipeline.py --model_name model_name --category category --job generate
# model_name is the name of the model class you implemented in model_customized.py
# category is one of i, it, a, at, representing image, interleaved image-text,
# sound + music, and speech + interleaved speech-text generation
```
You should see a ./output/{model_name}/ folder under the root dir, which stores the generated responses.
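For orientation, here is a hypothetical sketch of what a custom model class might look like. The class name, method signature, and return format below are assumptions for illustration only; follow the exact requirements documented in model_customized.py.

```python
# Hypothetical sketch only -- the real interface is defined in model_customized.py.
class MyModel:
    def __init__(self):
        # Load your model weights or set up API clients here.
        self.client = ...  # placeholder

    def generate(self, instruction):
        # Produce a response for one MMMG instruction and return it in the
        # format required by model_customized.py (e.g., text plus paths to
        # any generated image/audio files -- check the file for specifics).
        raise NotImplementedError
```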
Similar to evaluation with pre-generated responses, you can evaluate the generated responses by running:
```bash
python eval_pipeline.py --model_name model_name --category category --job evaluate
```
You should see a ./output/{model_name}/{category}.csv file, which stores the evaluation scores of your model. To submit your model's scores to the leaderboard, please refer to leaderboard.
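To cover the full benchmark, you can run generation and evaluation for each category in turn. A simple shell loop, using a placeholder model name, looks like:

```bash
# Run generation and then evaluation for every MMMG category for a hypothetical model "MyModel".
for cat in i it a at; do
    python eval_pipeline.py --model_name MyModel --category $cat --job generate
    python eval_pipeline.py --model_name MyModel --category $cat --job evaluate
done
```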
We provide the implementations of all baselines in model.py, model_image.py, model_audio.py, and model_interleaved.py. You can use an implemented model class name for evaluation directly. To run these baseline models, first download all the model files from the Google Drive link and place them under the root dir; your file structure should look like this:
```
root/
├── models/
│   ├── Anole/
│   ├── Seed/
│   └── ...
```
Then set up each model-specific environment with the setup.sh file under the corresponding model folder (see the example below). Environment configs for models without a dedicated model folder are in ./models/Others/setup.sh; make sure you pass the correct API keys. To access our evaluation results for the baseline models, please download them from the Google Drive link.
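As an illustration, setting up a single baseline might look like the following; the folder name is just one of the baselines shown in the tree above.

```bash
# Set up the environment for one baseline model, e.g. Anole.
cd models/Anole
bash setup.sh
cd ../..
```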
To replicate the human evaluation pipeline reported in the paper, please run:
```bash
pip install gradio
python eval_pipeline.py --model_name model_name --category category --job human
```
Contact:
- Jihan Yao: [email protected]
- Yushi Hu: [email protected]
BibTeX:
```bibtex
@misc{yao2025mmmgcomprehensivereliableevaluation,
      title={MMMG: a Comprehensive and Reliable Evaluation Suite for Multitask Multimodal Generation},
      author={Jihan Yao and Yushi Hu and Yujie Yi and Bin Han and Shangbin Feng and Guang Yang and Bingbing Wen and Ranjay Krishna and Lucy Lu Wang and Yulia Tsvetkov and Noah A. Smith and Banghua Zhu},
      year={2025},
      eprint={2505.17613},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2505.17613},
}
```