[📖 arXiv Paper] [📊 MM-RLHF Data] [📝 Homepage]
[🏆 Reward Model] [🔮 MM-RewardBench] [🔮 MM-SafetyBench] [📈 Evaluation Suite]
Welcome to the docs for mmrlhf-eval: the evaluation suite for the MM-RLHF project.
- [2025-03] 📝📝 This project is built on the lmms_eval framework. We have established a dedicated "Hallucination and Safety Tasks" category incorporating three key benchmarks: AMBER, MMHal-Bench, and ObjectHallusion. In addition, we introduce our novel MM-RLHF-SafetyBench task, a comprehensive safety evaluation protocol designed specifically for MLLMs. Detailed specifications of MM-RLHF-SafetyBench are documented in current_tasks.
For development, you can install the package by cloning the repository and running the following commands:

```bash
git clone https://github.com/EvolvingLMMs-Lab/lmms-eval
cd lmms-eval
pip install -e .
```

If you want to test LLaVA, you will have to clone their repo from LLaVA and run:
```bash
# for llava 1.5
# git clone https://github.com/haotian-liu/LLaVA
# cd LLaVA
# pip install -e .

# for llava-next (1.6)
git clone https://github.com/LLaVA-VL/LLaVA-NeXT
cd LLaVA-NeXT
pip install -e .
```

To run evaluations for the AMBER dataset, you need to download the image data from the following link and place it in the `lmms_eval/tasks/amber` folder:
Once the image data is downloaded and placed in the correct folder, you can proceed with evaluating AMBER-based tasks.
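The hallucination benchmarks above (AMBER, ObjectHallusion) report the CHAIR metric, which scores how often generated captions mention objects that are not actually in the image. As a rough illustration of the definition only (not the suite's internal scoring code), CHAIR can be sketched as:

```python
# Illustrative sketch of the CHAIR metric (an assumption about the standard
# definition, not this repository's actual implementation).

def chair_scores(captions_objects, gt_objects):
    """Return (CHAIR_i, CHAIR_s) over a batch of captions.

    CHAIR_i: hallucinated object mentions / total object mentions.
    CHAIR_s: captions with at least one hallucinated object / total captions.
    """
    total_mentions = 0
    hallucinated_mentions = 0
    hallucinated_captions = 0
    for mentioned, truth in zip(captions_objects, gt_objects):
        bad = [obj for obj in mentioned if obj not in truth]
        total_mentions += len(mentioned)
        hallucinated_mentions += len(bad)
        if bad:
            hallucinated_captions += 1
    chair_i = hallucinated_mentions / max(total_mentions, 1)
    chair_s = hallucinated_captions / max(len(captions_objects), 1)
    return chair_i, chair_s

# Example: the second caption hallucinates "dog".
caps = [["cat", "table"], ["dog", "chair"]]
gts = [{"cat", "table"}, {"chair"}]
print(chair_scores(caps, gts))  # (0.25, 0.5)
```

The actual benchmarks extract object mentions from free-form captions (which is why the NLTK resources below are needed); this sketch assumes the objects have already been extracted.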
For benchmarks that require computing the CHAIR metric (such as ObjectHallusion and AMBER), you'll need to install and configure the required Natural Language Toolkit (NLTK) resources. Run the following commands to download the necessary NLTK data:
```bash
python3 - <<EOF
import nltk
nltk.download('punkt_tab')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger_eng')
EOF
```

For Safety-related benchmarks (e.g., MM-RLHF-SafetyBench), here is an example of how to run the evaluations using the Qwen2VL model. Follow the sequence of commands below to evaluate the model on the various safety tasks:
```bash
python3 -m accelerate.commands.launch \
    --num_processes=4 \
    --main_process_port 12346 \
    -m lmms_eval \
    --model qwen2_vl \
    --model_args pretrained="Qwen/Qwen2-VL-7B-Instruct" \
    --tasks Safe_unsafes,Unsafes,Risk_identification,Adv_target,Adv_untarget,Typographic_ASR,Typographic_RtA \
    --batch_size 1
```
```bash
python3 -m accelerate.commands.launch \
    --num_processes=4 \
    --main_process_port 12346 \
    -m lmms_eval \
    --model qwen2_vl \
    --model_args pretrained="Qwen/Qwen2-VL-7B-Instruct" \
    --tasks Multimodel_ASR,Multimodel_RtA \
    --batch_size 1
```
```bash
python3 -m accelerate.commands.launch \
    --num_processes=4 \
    --main_process_port 12346 \
    -m lmms_eval \
    --model qwen2_vl \
    --model_args pretrained="Qwen/Qwen2-VL-7B-Instruct" \
    --tasks Crossmodel_ASR,Crossmodel_RtA \
    --batch_size 1
```

Please refer to our documentation for more details.
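The `_ASR` and `_RtA` suffixes in the task names above refer to the attack success rate and refusal-to-answer rate, respectively. As a hedged illustration of how such rates are generally defined (not this suite's actual scoring code), the aggregation can be sketched as:

```python
# Illustrative ASR / RtA aggregation (an assumption about the generic
# definitions, not mmrlhf-eval's internal scoring code).

def attack_success_rate(outcomes):
    """ASR: fraction of adversarial prompts the model complied with (1 = attack succeeded)."""
    return sum(outcomes) / max(len(outcomes), 1)

def refusal_rate(refusals):
    """RtA: fraction of unsafe prompts the model refused to answer (1 = refused)."""
    return sum(refusals) / max(len(refusals), 1)

# Example: 1 of 4 attacks succeeded; 3 of 4 unsafe prompts were refused.
print(attack_success_rate([0, 1, 0, 0]))  # 0.25
print(refusal_rate([1, 1, 0, 1]))         # 0.75
```

A lower ASR and a higher RtA indicate a safer model.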
If you find this work useful, please cite:

```bibtex
@article{zhang2025mm,
  title={MM-RLHF: The Next Step Forward in Multimodal LLM Alignment},
  author={Zhang, Yi-Fan and Yu, Tao and Tian, Haochen and Fu, Chaoyou and Li, Peiyan and Zeng, Jianshu and Xie, Wulin and Shi, Yang and Zhang, Huanyu and Wu, Junkang and others},
  journal={arXiv preprint arXiv:2502.10391},
  year={2025}
}
```