
MoDoMoDo: Multi-Domain Data Mixtures for Multimodal LLM Reinforcement Learning

🏠Home | 📄Paper | Current Version: v1.0

This repository is the official PyTorch implementation of the paper: MoDoMoDo: Multi-Domain Data Mixtures for Multimodal LLM Reinforcement Learning.

Reinforcement Learning with Verifiable Rewards (RLVR) has pushed language-only models to state-of-the-art results on reasoning tasks, yet extending it to multimodal LLMs is non-trivial: verifiable vision-language datasets are scarce and highly heterogeneous, and existing efforts usually fine-tune on a single task domain, which limits the generalization and comprehensive reasoning capabilities we want from MLLMs. Pooling several diverse datasets could cover a broader range of vision-language skills, but training on multiple datasets introduces its own challenges, including potentially conflicting objectives arising from interactions among the datasets and correspondingly unstable training behavior. This tension makes the dataset mixture itself a core design question: how should diverse datasets be mixed in RLVR to achieve a wide range of multimodal capabilities?

Release Notes

[06/2025] 🚀 First-Time Release of the Training and Evaluation Code of MoDoMoDo!

Installation

MoDoMoDo has been tested on A100s and H100s.

First, clone this repo:

git clone https://github.com/lynl7130/MoDoMoDo

# Prepare result folders:
mkdir -p <repo>/MoDoMoDo/lmms-eval/results
mkdir -p <repo>/MoDoMoDo/outputs
mkdir -p <repo>/MoDoMoDo/output_figures

# Prepare Environment Variables
export OPENAI_API_KEY=?
export HF_TOKEN=?
export WANDB_API_KEY=?

Note: OPENAI_API_KEY requires paid API access. Feel free to skip it if you don't want to evaluate on MathVista.

Next, there are two options for installing the environment.

Option 1: Install with Conda and Pip

# create conda environment
conda create -n modomodo python=3.10
conda activate modomodo

# install pytorch based on cuda version
# for example, for cuda 12.1:
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121

# install packages with special condition
pip install vllm==0.7.2 --no-deps
pip install flash-attn==2.7.3 --no-build-isolation

# install all other packages
# enter cloned repo
cd <repo>/MoDoMoDo
pip install -r requirements.txt

Option 2: Docker Installation

cd <repo>/MoDoMoDo/docker
sudo docker build -t modomodo-image .

# Run Docker Container with mounted volumes and host networking
sudo docker run --gpus all -it \
  --shm-size=1024m \
  --network host \
  -e WANDB_API_KEY=$WANDB_API_KEY \
  -e HF_TOKEN=$HF_TOKEN \
  -e OPENAI_API_KEY=$OPENAI_API_KEY \
  -v <repo>/MoDoMoDo:/app/MoDoMoDo \
  modomodo-image

Note: --gpus all assumes Docker 19.03+ with the nvidia-container-toolkit installed. If you're on an older setup, add --runtime=nvidia.

Download datasets and base models

Prepare the 5 verifiable datasets MoDoMoDo uses:

python slurms/prepare_data.py

This script saves all datasets under <repo>/MoDoMoDo/share_data/:

📚 Dataset download summary

| Dataset | Repo | Split | Storage† | # Items |
|---|---|---|---|---|
| GeoQAV Problems | yiqingliang/geoqav-problems-dataset | train | 42 MB | 1,969 |
| ScienceQA Problems | yiqingliang/scienceqa-problems-dataset | train | 398 MB | 6,218 |
| ScienceQA (test) | yiqingliang/scienceqa-problems-dataset-test | test | 129 MB | 2,017 |
| LISA Problems | yiqingliang/lisa-problems-dataset | train | 572 MB | 1,326 |
| LISA (test) | yiqingliang/lisa-problems-dataset-test | test | 1.27 GB | 3,397 |
| SAT Problems | yiqingliang/sat-problems-dataset | train | 3 GB | 15,000 |
| SAT (test) | yiqingliang/sat-problems-dataset-test | test | 337 MB | 1,928 |
| SAT (mini) | yiqingliang/sat-problems-dataset-mini | train | 31.2 MB | 64 |
| ViRFT-COCO | laolao77/ViRFT_COCO | train | 1.15 GB | 5,997 |

† Approximate; may not match the exact values on your machine.

LISA & COCO: all bounding-box coordinates are normalized to the range [0, 1000] relative to the image width and height, with the origin at the top-left corner; (x1, y1) is the top-left corner of the box and (x2, y2) is the bottom-right corner.
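For illustration, here is a minimal Python sketch (a hypothetical helper, not part of the codebase) that maps a pixel-space box to this normalized [0, 1000] convention:

def normalize_box(x1, y1, x2, y2, width, height):
    """Map a pixel-space box (top-left origin) to the 0-1000 normalized range."""
    return (
        round(x1 / width * 1000),
        round(y1 / height * 1000),
        round(x2 / width * 1000),
        round(y2 / height * 1000),
    )

# Example: a 320x240 box with top-left corner (64, 48) inside a 640x480 image
print(normalize_box(64, 48, 384, 288, 640, 480))  # (100, 100, 600, 600)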

If you don't want to download all of them, comment out the corresponding items in slurms/prepare_data.py:

data_pairs = [
    ...
]
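For reference, here is a minimal sketch of how each (repo, local path, token) entry could be downloaded with the Hugging Face datasets library; this is an assumption for illustration only, and the actual slurms/prepare_data.py may differ:

from datasets import load_dataset

def download_pair(repo_id, local_path, token=None, split="train"):
    """Download one dataset split from the Hub and save it to disk as arrow files."""
    ds = load_dataset(repo_id, split=split, token=token)  # pass token=... for private repos
    ds.save_to_disk(local_path)

download_pair("yiqingliang/geoqav-problems-dataset", "share_data/geoqav-problems-dataset")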

Train MoDoMoDo

First, select a configuration $config following the naming convention ${date}_${exp}_Instruct_fv. This name corresponds to a YAML file configs/${config}.yaml.

  • An example for $config: 250509_Norm_Instruct_fv
  • This naming convention ensures that the later visualization code can find the checkpoint results.

Then, run training on 4 GPUs (we recommend reading the notes below before running!):

bash slurms/train_by_config.sh "$config" 4 12346

Training is logged to wandb; run wandb init if prompted before your first training run. Checkpoints are saved to share_models/${config}.

Note: use different ports if you want to run multiple training jobs at the same time.

  • vLLM port: YAML port, default: 8000
  • DDP port: slurms/train_by_config.sh argument controlled --master_port, default: 12346

Data Mixture Control

reward_weights and reward_funcs must have the same length. They control how each reward function is weighted, independent of the dataset.

interleave_probs and dataset_names must have the same length. They control how likely each dataset is to be sampled each time a training example is drawn.

By default, mix_strategy: "interleave_under" is used, so training ends as soon as one of the datasets is exhausted (see the sketch below).
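To make this concrete, here is a minimal Python sketch of probability-weighted interleaved sampling with the stop-on-exhaustion behavior described above; the function, dataset names, and probabilities are illustrative placeholders, not the repository's actual implementation:

import random

def interleave_under(datasets, probs, seed=0):
    """Yield (name, example) pairs, picking a dataset per step with the given
    probabilities and stopping as soon as any dataset runs out of examples."""
    rng = random.Random(seed)
    iters = {name: iter(ds) for name, ds in datasets.items()}
    names = list(datasets)
    while True:
        name = rng.choices(names, weights=probs, k=1)[0]
        try:
            yield name, next(iters[name])
        except StopIteration:
            return  # one dataset is exhausted -> training ends

# Placeholder datasets of different sizes, with their sampling probabilities
datasets = {"sat": range(15000), "lisa": range(1326), "geoqav": range(1969)}
for name, example in interleave_under(datasets, probs=[0.5, 0.25, 0.25]):
    pass  # one training example per step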

GPU Usage and vLLM support

slurms/train_by_config.sh assumes you have NUM_DEVICES GPUs: the first NUM_DEVICES-1 GPUs are used for training and the last GPU hosts vLLM for generation acceleration.

This script is compatible with configuration YAMLs that set use_vllm: true.

To change the number of GPUs, pass a different GPU count as a script argument instead of the default NUM_DEVICES=4 in slurms/train_by_config.sh, and adjust the num_generations hyperparameter in the YAML config.
An example on 2 GPUs:

CUDA_VISIBLE_DEVICES=0,1 bash slurms/train_by_config.sh 250505_Norm_2gpu_Instruct_fv 2 12345

Be aware that the num_generations hyperparameter must be at least per_device_eval_batch_size and must divide per_device_eval_batch_size x (NUM_DEVICES-1); see the check below.
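For a quick sanity check of this constraint before launching, here is a small sketch with placeholder values (the helper is hypothetical, not part of the repo):

def check_num_generations(num_generations, per_device_eval_batch_size, num_devices, vllm=True):
    """With vLLM, one GPU is reserved for generation, so only num_devices-1 GPUs train."""
    train_gpus = num_devices - 1 if vllm else num_devices
    total = per_device_eval_batch_size * train_gpus
    assert num_generations >= per_device_eval_batch_size, "num_generations too small"
    assert total % num_generations == 0, f"num_generations must divide {total}"

# e.g. with per_device_eval_batch_size=4 and 4 GPUs (3 training + 1 vLLM),
# valid num_generations values are 4, 6, and 12
check_num_generations(4, 4, 4)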

If you don't want vLLM

Use slurms/train_by_config_novllm.sh instead of slurms/train_by_config.sh for training. An example:

CUDA_VISIBLE_DEVICES=0,1 bash slurms/train_by_config_novllm.sh 250505_Norm_2gpu_novllm_Instruct_fv 2 12347

Make sure in YAML:

  • max_prompt_length is set to null.
  • use_vllm is set to false.
  • Be aware that the num_generations hyperparameter must be at least per_device_eval_batch_size and must divide per_device_eval_batch_size x NUM_DEVICES (all GPUs train when vLLM is off).

If OOM...

If you run into out-of-memory (OOM) errors, consider turning off vLLM or tuning:

  • per_device_train_batch_size
  • gradient_accumulation_steps
  • num_generations

Publish Trained Checkpoints

To push trained checkpoints (assuming checkpoints are saved every 500 steps plus the last step) from the above configuration $config to Hugging Face repos $organization/$save-500, ...:

python slurms/push_ckpt_to_hub.py --repo_name "$config" --save_name "$save" --token "$token" --organization "$organization"

Evaluate HF Hub Models (Qwen2-VL style)

Note: each job occupies a port, so remember to select different ports when evaluating multiple experiments.

To evaluate the $organization/$save-500 checkpoint with 4 GPUs:

# on scienceqa_test, lisa_test, sat_test 
CUDA_VISIBLE_DEVICES=0,1,2,3 source slurms/test_by_ckpt_lmms_reason_final.sh $organization/$save-500 4 29500

# on mmmu,mathvista,chartqa,infovqa
CUDA_VISIBLE_DEVICES=4,5,6,7 source slurms/test_by_ckpt_lmms_reason.sh $organization/$save-500 4 29501

These scripts save results to the <repo>/MoDoMoDo/outputs folder. It is normal for evaluation to take hours, and feel free to use fewer GPUs for evaluation.

If you want to evaluate checkpoints of other model styles, change --model qwen2_vl_reason in test_by_ckpt_lmms_reason.sh and test_by_ckpt_lmms_reason_final.sh. We additionally support evaluation of:

  • qwen2_5_vl_reason: Qwen2.5-VL
  • internvl2_reason: InternVL2

Grab Logs with Regex and Create markdown Results

Assuming that, for each checkpoint, you have finished both evaluation scripts above:

python extract_metrics.py
python generate_markdown.py --row-avg last # use last-row mode to aggregate checkpoint scores
python generate_markdown.py # use step-averaged mode to aggregate checkpoint scores

This saves xxx.md files that can be used for Data Mixture Prediction and Visualization.

Check the arguments of generate_markdown.py for fancier markdown creation.

Data Mixture Prediction Based on markdown Results

You need to specify which markdown file to use for each script you run below.

  1. Heuristic: check compute_weights/*.py or compute_weights_no1/*.py. To reproduce our weights, check latex/250430_gold.md.
  2. Model-based: check check_linear/*.py. To reproduce our weights, check latex/250515_gold.md. An illustrative sketch of the model-based idea follows the notes below.

Note:

  • Seed series do not need Data Mixture Prediction.
  • Be very careful about which xxx.md you are using!
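To illustrate the model-based idea only (the actual scripts live in check_linear/ and may differ), here is a hedged sketch that fits a linear model from mixture-weight vectors to an aggregate benchmark score with NumPy; all numbers below are placeholders:

import numpy as np

# Placeholder data: each row is a 5-dataset mixture weight vector,
# each y the aggregate benchmark score of the model trained on that mixture
X = np.array([[0.20, 0.20, 0.20, 0.20, 0.20],
              [0.50, 0.20, 0.10, 0.10, 0.10],
              [0.10, 0.40, 0.20, 0.20, 0.10],
              [0.30, 0.10, 0.30, 0.20, 0.10],
              [0.10, 0.10, 0.50, 0.20, 0.10],
              [0.25, 0.25, 0.20, 0.20, 0.10]])
y = np.array([41.0, 43.5, 40.2, 42.8, 39.9, 41.7])

# Fit score ~ w.x + b by least squares, then score a candidate mixture
A = np.hstack([X, np.ones((len(X), 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
candidate = np.array([0.40, 0.20, 0.20, 0.10, 0.10, 1.0])
print(candidate @ coef)  # predicted aggregate score for the candidate mixture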

Visualize Results as Images Based on markdown Results

Refer to latex/create_*.py. These files also depend strongly on the markdown selection.

Add a New Dataset (using the SAT dataset as an example)

  1. Make sure your dataset strictly follows the verifiable format.

  2. In slurms/prepare_data.py, add an entry to data_pairs:

data_pairs = [
    ["yiqingliang/sat-problems-dataset", "share_data/sat-problems-dataset", token], #token is required for private dataset
    ...
]

Then, run:

python slurms/prepare_data.py

  3. Edit src/open_r1/dataset_info.json and add an entry:
"share_data/sat-problems-dataset":{
        "file_name": "share_data/sat-problems-dataset",
        "formatting": "SAT",
        "load_from": "disk",
        "file_ext": "arrow"
    }

  4. Edit src/open_r1/dataset_utils/converter.py:
  • Add a "SAT" option to the DatasetAttr.formatting literals (this corresponds to the "formatting" field).

  • Add an entry to SYSTEM_PROMPT:

"SAT": ("A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant "
    "first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning "
    "process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., "
    "<think> reasoning process here </think><answer> answer here </answer>"
),
  • (Optional) Add a SATDatasetConverter(DatasetConverter) class with the proper arguments if the existing DatasetConverters cannot serve the new dataset well.

  • Add a "SAT": SATDatasetConverter entry to DATASET_CONVERTERS.

  5. Edit src/open_r1/dataset_utils/processor.py:
  • (Optional) Add a preparation function:
def prepare_images_SAT(x):
    return x["image"]

  • Add a "SAT": prepare_images_SAT entry to Image_Prepare_Funcs.

  6. (Optional) Add src/open_r1/rewards/sat.py with a dataset-specific reward (see the sketch after this list).

  7. (Optional) Add the corresponding entries in src/open_r1/rewards/__init__.py.
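As a hedged illustration of step 6, here is what a verifiable reward in a hypothetical src/open_r1/rewards/sat.py could look like, assuming completions carry answers inside <answer> tags as in the SAT system prompt above; the function name and signature are assumptions, so adapt them to the actual reward interface:

import re

# Hypothetical exact-match accuracy reward: parse the <answer> ... </answer> tag
# produced under the SAT system prompt and compare it to the ground truth.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def sat_accuracy_reward(completions, solutions, **kwargs):
    """Return 1.0 when the extracted answer matches the ground truth, else 0.0."""
    rewards = []
    for completion, solution in zip(completions, solutions):
        match = ANSWER_RE.search(completion)
        predicted = match.group(1).strip().lower() if match else ""
        rewards.append(1.0 if predicted == str(solution).strip().lower() else 0.0)
    return rewards

# Example:
print(sat_accuracy_reward(["<think>...</think><answer> B </answer>"], ["B"]))  # [1.0]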

BibTeX

If you find our repository useful, please consider giving it a star ⭐ and citing our paper:

@misc{liang2025modomodomultidomaindatamixtures,
      title={MoDoMoDo: Multi-Domain Data Mixtures for Multimodal LLM Reinforcement Learning}, 
      author={Yiqing Liang and Jielin Qiu and Wenhao Ding and Zuxin Liu and James Tompkin and Mengdi Xu and Mengzhou Xia and Zhengzhong Tu and Laixi Shi and Jiacheng Zhu},
      year={2025},
      eprint={2505.24871},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.24871}, 
}

Contributors and Acknowledgement

MoDoMoDo's Amazing Core Contributors:

Yiqing Liang, Jielin Qiu, Wenhao Ding, Zuxin Liu, James Tompkin, Mengdi Xu, Mengzhou Xia, Zhengzhong Tu, Laixi Shi, Jiacheng Zhu

They are from (in no particular order):

  • Brown University
  • Massachusetts Institute of Technology
  • NVIDIA Research
  • Salesforce Research
  • Carnegie Mellon University
  • Princeton University
  • Texas A&M University
  • California Institute of Technology

We thank open-r1, trl, PhysBench, lmms-eval, LLaMA-Factory, Visual-RFT, VLM-R1, and R1-V for code references.
