- Paper: arXiv
- Model: 🤗 Hugging Face (coming soon)
- Data: 🤗 Data (coming soon)
Unified Vision-Language Models (UVLMs) must perform both understanding and generation within a single architecture, but these tasks rely on heterogeneous data and supervision, making them difficult to balance during reinforcement learning (RL). We propose PairUni, a unified framework that reorganizes data into understanding–generation (UG) pairs and aligns optimization accordingly. We first use GPT-o3 to augment single-task data, generating captions for understanding samples and question–answer (QA) pairs for generation samples, forming aligned pairs from the same instance. In addition, for each generation sample we retrieve a semantically related understanding example to form a retrieved pair, linking different but related data points. These paired structures expose cross-task semantic correspondences and support consistent policy learning. To leverage this structure, we present Pair-GRPO, a pair-aware variant of Group Relative Policy Optimization. It assigns a similarity score to each pair to modulate the advantage, strengthening learning from well-aligned examples and reducing task interference. We curate a high-quality dataset of 16K UG pairs, named PairUG, for RL fine-tuning, and evaluate PairUni on the powerful Janus-Pro UVLMs. Our approach achieves balanced improvements across tasks, outperforming strong UVLM RL baselines.
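To make the pair-aware advantage modulation concrete, here is a rough illustrative sketch (not the repository's implementation; see the paper for the exact formulation). It computes standard group-relative advantages for one group of rollouts and then scales them by the pair's similarity score, so well-aligned UG pairs contribute more strongly to the update:

```python
import statistics

def group_relative_advantages(rewards, similarity, eps=1e-6):
    """Similarity-modulated group-relative advantages (illustrative sketch).

    `rewards` are the scalar rewards of one group of rollouts for the same
    prompt; `similarity` is the UG pair's alignment score from the data.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # Standard GRPO step: normalize each rollout's reward within its group.
    advantages = [(r - mean) / (std + eps) for r in rewards]
    # Pair-aware modulation: scale by the pair's similarity score, so
    # well-aligned pairs drive larger policy updates.
    return [similarity * a for a in advantages]
```

The modulated advantages stay mean-centered within each group; the similarity score only rescales their magnitude.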
- Install Python dependencies:

```bash
pip install -r requirements.txt
```

- Install system dependencies:

```bash
sudo apt-get install -y python3-tk
sudo apt-get install -y libgl1-mesa-glx
```

- Download reward model weights:

```bash
mkdir -p reward_weight
cd reward_weight
wget https://huggingface.co/xswu/HPSv2/resolve/main/HPS_v2.1_compressed.pt
cd ..
```

- Install the HPSv2 reward package:

```bash
cd rewards/HPSv2
pip install -e .
cd ../..
```

Run the training script:

```bash
bash train.sh
```

Or customize your training with:
```bash
torchrun --nproc_per_node=8 \
    open_r1/grpo.py \
    --deepspeed "configs/zero3.json" \
    --output_dir ./checkpoints/your_run_name \
    --model_name_or_path deepseek-ai/Janus-Pro-1B \
    --pair_data_path data/your_data.jsonl \
    --max_prompt_length 512 \
    --num_generations_text 8 \
    --num_generations_image 8 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 2 \
    --bf16 true \
    --max_steps 8000 \
    --learning_rate 1e-6
```

```
PairUni/
├── janus/                    # Janus model implementations
│   ├── models/               # Core model architectures
│   └── janusflow/            # Flow-based generation
├── open_r1/                  # Pair-GRPO training framework
│   ├── grpo.py               # Main training script
│   ├── dataset.py            # Dataset loader
│   └── trainer/              # Custom trainer implementation
├── rewards/                  # Reward models
│   ├── HPSv2/                # Image-quality reward
│   ├── reward_understand.py  # Understanding-task reward
│   └── reward_generate.py    # Generation-task reward
├── configs/                  # Training configurations
└── train.sh                  # Training launch script
```
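The reward files above split supervision by task: HPSv2 scores image quality for generation, while the understanding side is graded against the multiple-choice answers in the data. As a hypothetical sketch (the actual logic lives in `rewards/reward_understand.py`, which may differ), a rule-based understanding reward for this QA format could simply check the predicted option letter:

```python
import re

def mcq_reward(completion: str, answer: str) -> float:
    """Rule-based understanding reward (hypothetical sketch, not the
    repository's implementation): 1.0 if the model's reply contains the
    correct option letter, else 0.0.

    The dataset's questions ask the model to answer with the option's
    letter directly, so we look for a standalone A-D in the reply.
    """
    match = re.search(r"\b([A-D])\b", completion.strip())
    return 1.0 if match and match.group(1) == answer else 0.0
```

For example, `mcq_reward("A", "A")` returns 1.0, while a reply naming the wrong letter returns 0.0.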
The training data should be in JSONL format, one paired example per line (pretty-printed here for readability):

```json
{
  "similarity": 0.88,
  "generate_ann": {
    "image_path": "data/images/geneval_train_e52c9d7d6c674fd8b2c8b5d2ec43efac.png",
    "prompt": "a photo of a towel and a zebra",
    "question": "Which statement best describes the contrast between the material draped on the animal and the animal’s own surface pattern?\nA. The fabric is smooth and plain, whereas the coat shows bold stripes.\nB. Both the fabric and the coat display identical striping.\nC. The fabric is covered with polka dots, while the coat is entirely plain.\nD. The fabric appears coarse and burlap-like, while the coat looks scaly.\n\nAnswer with the option's letter from the given choices directly.",
    "answer": "A",
    "tag": "geneval_train"
  },
  "understand_ann": {
    "image_path": "data/images/detection_f2436089737d4f0181f246926c8a2558.png",
    "prompt": "In open savanna grassland, a small cluster of five plains zebras stands closely together, black-and-white striped bodies angling different directions amid tall yellowish grass under daylight, with erect manes and ears.",
    "question": "What type of pattern dominates the animals’ coats?\nA. Stripes\nB. Polka dots\nC. Solid gray\nD. Checkered\n\nAnswer with the option's letter from the given choices directly.",
    "answer": "A",
    "tag": "detection"
  }
}
```

If you find this work useful, please cite:
```bibtex
@article{pairuni2024,
  title={PairUni: Unified Multimodal Training with GRPO},
  author={Your Name},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2024}
}
```

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

