🏆 Leaderboard | 🤗 MMMG | 📖 Paper
This repo contains the evaluation pipeline for the paper "MMMG: a Comprehensive and Reliable Evaluation Suite for Multitask Multimodal Generation".
Automatically evaluating multimodal generation presents a significant challenge, as automated metrics often struggle to align reliably with human evaluation, especially for complex tasks that involve multiple modalities. To address this, we present MMMG, a comprehensive and human-aligned benchmark for multimodal generation across 4 modality combinations (image, audio, interleaved text and image, interleaved text and audio), with a focus on tasks that present significant challenges for generation models, while still enabling reliable automatic evaluation through a combination of models and programs. MMMG encompasses 49 tasks (including 29 newly developed ones), each with a carefully designed evaluation pipeline, and 937 instructions to systematically assess reasoning, controllability, and other key capabilities of multimodal generation models. Extensive validation demonstrates that MMMG is highly aligned with human evaluation, achieving an average agreement of 94.3%. Benchmarking results on 24 multimodal generation models reveal that even though the state-of-the-art model, GPT Image, achieves 78.3% accuracy for image generation, it falls short on multimodal reasoning and interleaved generation. Furthermore, results suggest considerable headroom for improvement in audio generation, highlighting an important direction for future research.
In this codebase, we provide two options for evaluating your multimodal generation model on MMMG:
- Evaluate with pre-generated responses: If you already have the responses generated by your model, you can directly evaluate them using our evaluation pipeline.
- Generate responses and evaluate: If you don't have the responses yet, you can generate them using our generation pipeline and then evaluate them.
First, install the required packages. You need Python >= 3.9 plus OpenAI and Gemini API keys (you can apply for them at OpenAI API key and Gemini API key). The remaining requirements are usually compatible with an existing working environment.
```bash
pip install -r requirements.txt --upgrade-strategy only-if-needed
export OPENAI_KEY=openai_key  # change to your OpenAI API key
export GEMINI_KEY=gemini_key  # change to your Gemini API key
```
You can also manually add your API keys at Lines 22-23 in utils.py to store them permanently.
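For orientation, hard-coding the keys in utils.py might look roughly like the sketch below; the exact variable names around Lines 22-23 are an assumption, so check the file itself.

```python
# utils.py -- illustrative only; confirm the actual variable names in the file.
OPENAI_KEY = "your-openai-api-key"
GEMINI_KEY = "your-gemini-api-key"
```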
First, put your generated responses in the required format. For reference, you can find the outputs we obtained for all models at the Google Drive link.
For demonstration purposes, we include a subset of Gemini2 responses on the object-addition task in ./output/Gemini2/. The responses are stored in ./output/Gemini2/i_edit_add.json and the generated images in ./output/Gemini2/image. You can evaluate them with:

```bash
python eval_pipeline.py --model_name Gemini2 --category quick_test --job evaluate
```
You will see a ./output/Gemini2/it_eval.csv file, which stores the score of Gemini2 on object addition. The autoeval score for each task instance is stored in ./output/Gemini2/i_edit_add.json.
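After this quick test, the Gemini2 output folder contains roughly the following (layout reconstructed from the paths above):

```
output/Gemini2/
├── i_edit_add.json   # responses, plus per-instance autoeval scores after evaluation
├── image/            # generated images
└── it_eval.csv       # aggregate score written by the evaluate job
```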
To get MMMG scores, you need pre-generated responses for all tasks in a category (i, it, a, at), placed in the ./output/{model_name}/ folder. For example, to get the interleaved text-and-image (it) score of Gemini2, generate responses for all it subtasks and run:

```bash
python eval_pipeline.py --model_name Gemini2 --category it --job evaluate
```
If you don't have the responses yet, you can generate them with our generation pipeline and then evaluate them. The generation pipeline is implemented in eval_pipeline.py, and the model-specific implementation goes in model_customized.py.
After installing the environment, implement the generate function for your model in model_customized.py (a minimal sketch follows below). Make sure you strictly follow the format requirements specified in model_customized.py. Then run the following command to generate responses for all tasks:

```bash
python eval_pipeline.py --model_name model_name --category category --job generate
# model_name is the name of the model class you implemented in model_customized.py
# category is one of i, it, a, at, representing image, interleaved image-text,
# sound + music, and speech + interleaved speech-text generation
```
You should see a ./output/{model_name}/ folder under the root dir, which stores the generated responses.
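For orientation, here is a hypothetical sketch of what a custom model class might look like. The class name, method signature, and return format below are assumptions for illustration only; follow the exact requirements documented in model_customized.py.

```python
# Hypothetical sketch only -- the real interface is defined in model_customized.py.
class MyModel:
    def __init__(self):
        # Load your model weights or set up API clients here.
        self.client = ...  # placeholder

    def generate(self, instruction):
        # Produce a response for one MMMG instruction and return it in the
        # format required by model_customized.py (e.g., text plus paths to
        # any generated image/audio files -- check the file for specifics).
        raise NotImplementedError
```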
Similar to evaluation with pre-generated responses, you can evaluate the generated responses by running:
```bash
python eval_pipeline.py --model_name model_name --category category --job evaluate
```
You should see a ./output/{model_name}/{category}.csv file, which stores the evaluation scores of your model. To submit your model's scores to the leaderboard, please refer to leaderboard.
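To cover the full benchmark, you can run generation and evaluation for each category in turn. A simple shell loop, using a placeholder model name, looks like:

```bash
# Run generation and then evaluation for every MMMG category for a hypothetical model "MyModel".
for cat in i it a at; do
    python eval_pipeline.py --model_name MyModel --category $cat --job generate
    python eval_pipeline.py --model_name MyModel --category $cat --job evaluate
done
```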
We provide the implementations of all baselines in model.py, model_image.py, model_audio.py, and model_interleaved.py. You can use an implemented model class name for evaluation directly. To run these baseline models, first download all the model files from the Google Drive link and place them under the root dir; your file structure should look like this:
```
root/
├── models/
│   ├── Anole/
│   ├── Seed/
│   └── ...
```
Then set up each model-specific environment with the setup.sh file under the corresponding model folder (see the example below). Environment configs for models without a dedicated model folder are in ./models/Others/setup.sh; make sure you pass the correct API keys. To access our evaluation results for the baseline models, please download them from the Google Drive link.
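As an illustration, setting up a single baseline might look like the following; the folder name is just one of the baselines shown in the tree above.

```bash
# Set up the environment for one baseline model, e.g. Anole.
cd models/Anole
bash setup.sh
cd ../..
```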
To replicate the human evaluation pipeline reported in the paper, please run:
```bash
pip install gradio
python eval_pipeline.py --model_name model_name --category category --job human
```
Contact:
- Jihan Yao: [email protected]
- Yushi Hu: [email protected]
BibTeX:
```bibtex
@misc{yao2025mmmgcomprehensivereliableevaluation,
      title={MMMG: a Comprehensive and Reliable Evaluation Suite for Multitask Multimodal Generation},
      author={Jihan Yao and Yushi Hu and Yujie Yi and Bin Han and Shangbin Feng and Guang Yang and Bingbing Wen and Ranjay Krishna and Lucy Lu Wang and Yulia Tsvetkov and Noah A. Smith and Banghua Zhu},
      year={2025},
      eprint={2505.17613},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2505.17613},
}
```