Understanding VS. Generation: Navigating Optimization Dilemma in Multimodal Models

This repository contains the code for the paper "Understanding VS. Generation: Navigating Optimization Dilemma in Multimodal Models".

Sen Ye, Mengde Xu, Shuyang Gu, Di He, Liwei Wang, Han Hu
PKU, Tencent

Overview

We propose the Reason-Reflect-Refine (R3) framework. R3 reframes single-step generation as a multi-step "generate-understand-regenerate" process. We optimize R3 with a Tree-RL strategy, which simultaneously optimizes the model's generation and understanding capabilities.
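As a rough illustration of the "generate-understand-regenerate" loop, here is a minimal sketch. It is not the code in this repository: `r3_generate`, `R3Step`, the three callables, and the toy stubs are all hypothetical names, and strings stand in for images.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class R3Step:
    plan: str       # text reasoning used for this round (hypothetical field)
    image: str      # stand-in for the generated image
    feedback: str   # what the understanding pass said about the image


def r3_generate(
    prompt: str,
    reason: Callable[[str], str],
    generate: Callable[[str], str],
    reflect: Callable[[str, str], Tuple[bool, str]],
    max_rounds: int = 2,
) -> Tuple[str, List[R3Step]]:
    """Reason once, generate, then reflect and refine until accepted."""
    plan = reason(prompt)        # Reason: expand the prompt into a plan
    image = generate(plan)       # first generation
    history: List[R3Step] = []
    for _ in range(max_rounds):
        ok, feedback = reflect(prompt, image)   # Reflect: understand the result
        history.append(R3Step(plan, image, feedback))
        if ok:
            break
        plan = f"{plan} | fix: {feedback}"      # Refine: fold feedback into the plan
        image = generate(plan)                  # regenerate
    return image, history


# Toy stubs (assumptions for illustration only).
def toy_reason(prompt: str) -> str:
    return f"plan: {prompt}"


def toy_generate(plan: str) -> str:
    return f"img[{plan}]"


def toy_reflect(prompt: str, image: str) -> Tuple[bool, str]:
    # Accept only images produced from a plan that already contains a fix.
    return ("fix" in image, "" if "fix" in image else "cube is not red")


final_image, steps = r3_generate("a red cube", toy_reason, toy_generate, toy_reflect)
```

With these stubs, the first image is rejected, the feedback is folded into the plan, and the second image is accepted, so `steps` holds two rounds.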

This repository contains the code for training the Bagel model with the R3 framework. We use Tree-RL to split the rollout process into multiple stages. We adopt GRPO to optimize the text reasoning process and Mix-GRPO to optimize the diffusion process.
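For readers unfamiliar with GRPO, the sketch below shows the group-relative advantage it is built on: each rollout's reward is normalized against the other rollouts sampled from the same starting point. The function names, the use of the population standard deviation, and the idea of forming one group per parent node at each stage are assumptions for illustration; see the `train` directory for how this repository actually does it.

```python
from statistics import mean, pstdev
from typing import Dict, List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of sibling rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, other variants exist
    return [(r - mu) / (sigma + eps) for r in rewards]


def stage_advantages(groups: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Apply the normalization separately to each parent's children.

    `groups` maps a hypothetical parent-node id to the rewards of the
    rollouts branched from it at one stage of the tree.
    """
    return {parent: group_relative_advantages(r) for parent, r in groups.items()}


# Two parents at one stage, four branched rollouts each (made-up rewards).
example = stage_advantages({
    "parent_a": [1.0, 0.0, 1.0, 0.0],
    "parent_b": [2.0, 2.0, 2.0, 2.0],
})
```

Because the normalization is done per group, a group whose rollouts all receive the same reward contributes zero advantage, which is why `parent_b` above yields all zeros.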

Installation

conda create -n bagel_r3 python=3.10 -y
conda activate bagel_r3
pip install -r requirements.txt
pip install flash_attn==2.5.8 --no-build-isolation

Download the model checkpoint from Hugging Face and place it in the checkpoints directory.

Usage

See tutorial.md for details on how to train R3 on various datasets.

Citation

If you find this work useful, please consider citing:

@misc{ye2026understandingvsgenerationnavigating,
      title={Understanding vs. Generation: Navigating Optimization Dilemma in Multimodal Models}, 
      author={Sen Ye and Mengde Xu and Shuyang Gu and Di He and Liwei Wang and Han Hu},
      year={2026},
      eprint={2602.15772},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2602.15772}, 
}
