[📖 arXiv Paper] [📊 MM-RLHF Data] [📝 Homepage]
[🏆 Reward Model] [🔮 MM-RewardBench] [🔮 MM-SafetyBench] [📈 Evaluation Suite]
Welcome to the docs for mmrlhf-eval: the evaluation suite for the MM-RLHF project.
- [2025-03] 📝📝 This project is built on the lmms_eval framework. We have established a dedicated "Hallucination and Safety Tasks" category incorporating three key benchmarks: AMBER, MMHal-Bench, and ObjectHallusion. In addition, we introduce our novel MM-RLHF-SafetyBench task, a comprehensive safety evaluation protocol designed specifically for MLLMs. Detailed specifications of MM-RLHF-SafetyBench are documented in current_tasks.
For development, you can install the package by cloning the repository and running the following commands:

```bash
git clone https://github.com/EvolvingLMMs-Lab/lmms-eval
cd lmms-eval
pip install -e .
```

If you want to test LLaVA, you will have to clone their repo from LLaVA and run:
```bash
# for llava 1.5
# git clone https://github.com/haotian-liu/LLaVA
# cd LLaVA
# pip install -e .

# for llava-next (1.6)
git clone https://github.com/LLaVA-VL/LLaVA-NeXT
cd LLaVA-NeXT
pip install -e .
```

To run evaluations for the AMBER dataset, you need to download the image data from the following link and place it in the `lmms_eval/tasks/amber` folder:
Once the image data is downloaded and placed in the correct folder, you can proceed with evaluating AMBER-based tasks.
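The hallucination benchmarks above (AMBER, ObjectHallusion) report the CHAIR metric, which scores how often generated captions mention objects that are not actually in the image. As a rough illustration of the definition only (not the suite's internal scoring code), CHAIR can be sketched as:

```python
# Illustrative sketch of the CHAIR metric (an assumption about the standard
# definition, not this repository's actual implementation).

def chair_scores(captions_objects, gt_objects):
    """Return (CHAIR_i, CHAIR_s) over a batch of captions.

    CHAIR_i: hallucinated object mentions / total object mentions.
    CHAIR_s: captions with at least one hallucinated object / total captions.
    """
    total_mentions = 0
    hallucinated_mentions = 0
    hallucinated_captions = 0
    for mentioned, truth in zip(captions_objects, gt_objects):
        bad = [obj for obj in mentioned if obj not in truth]
        total_mentions += len(mentioned)
        hallucinated_mentions += len(bad)
        if bad:
            hallucinated_captions += 1
    chair_i = hallucinated_mentions / max(total_mentions, 1)
    chair_s = hallucinated_captions / max(len(captions_objects), 1)
    return chair_i, chair_s

# Example: the second caption hallucinates "dog".
caps = [["cat", "table"], ["dog", "chair"]]
gts = [{"cat", "table"}, {"chair"}]
print(chair_scores(caps, gts))  # (0.25, 0.5)
```

The actual benchmarks extract object mentions from free-form captions (which is why the NLTK resources below are needed); this sketch assumes the objects have already been extracted.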
For benchmarks that require computing the CHAIR metric (such as ObjectHallusion and AMBER), you'll need to install and configure the required Natural Language Toolkit (NLTK) resources. Run the following commands to download the necessary NLTK data:
```bash
python3 - <<EOF
import nltk
nltk.download('punkt_tab')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger_eng')
EOF
```

For Safety-related benchmarks (e.g., MM-RLHF-SafetyBench), here is an example of how to run the evaluations using the Qwen2VL model. Follow the sequence of commands below to evaluate the model on the various safety tasks:
```bash
python3 -m accelerate.commands.launch \
    --num_processes=4 \
    --main_process_port 12346 \
    -m lmms_eval \
    --model qwen2_vl \
    --model_args pretrained="Qwen/Qwen2-VL-7B-Instruct" \
    --tasks Safe_unsafes,Unsafes,Risk_identification,Adv_target,Adv_untarget,Typographic_ASR,Typographic_RtA \
    --batch_size 1
```
```bash
python3 -m accelerate.commands.launch \
    --num_processes=4 \
    --main_process_port 12346 \
    -m lmms_eval \
    --model qwen2_vl \
    --model_args pretrained="Qwen/Qwen2-VL-7B-Instruct" \
    --tasks Multimodel_ASR,Multimodel_RtA \
    --batch_size 1
```
```bash
python3 -m accelerate.commands.launch \
    --num_processes=4 \
    --main_process_port 12346 \
    -m lmms_eval \
    --model qwen2_vl \
    --model_args pretrained="Qwen/Qwen2-VL-7B-Instruct" \
    --tasks Crossmodel_ASR,Crossmodel_RtA \
    --batch_size 1
```

Please refer to our documentation for more details.
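The `_ASR` and `_RtA` suffixes in the task names above refer to the attack success rate and refusal-to-answer rate, respectively. As a hedged illustration of how such rates are generally defined (not this suite's actual scoring code), the aggregation can be sketched as:

```python
# Illustrative ASR / RtA aggregation (an assumption about the generic
# definitions, not mmrlhf-eval's internal scoring code).

def attack_success_rate(outcomes):
    """ASR: fraction of adversarial prompts the model complied with (1 = attack succeeded)."""
    return sum(outcomes) / max(len(outcomes), 1)

def refusal_rate(refusals):
    """RtA: fraction of unsafe prompts the model refused to answer (1 = refused)."""
    return sum(refusals) / max(len(refusals), 1)

# Example: 1 of 4 attacks succeeded; 3 of 4 unsafe prompts were refused.
print(attack_success_rate([0, 1, 0, 0]))  # 0.25
print(refusal_rate([1, 1, 0, 1]))         # 0.75
```

A lower ASR and a higher RtA indicate a safer model.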
If you find this work useful, please cite:

```bibtex
@article{zhang2025mm,
  title={MM-RLHF: The Next Step Forward in Multimodal LLM Alignment},
  author={Zhang, Yi-Fan and Yu, Tao and Tian, Haochen and Fu, Chaoyou and Li, Peiyan and Zeng, Jianshu and Xie, Wulin and Shi, Yang and Zhang, Huanyu and Wu, Junkang and others},
  journal={arXiv preprint arXiv:2502.10391},
  year={2025}
}
```