Difficulty Adaptive Rollout Sampling (DARS) with Breadth Scaling to unlock simultaneous Pass@1 and Pass@K gains in RLVR.
We treat DARS as the focal loss in RLVR.
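For reference, focal loss (Lin et al., 2017) downweights easy, well-classified examples so that training signal concentrates on hard ones:

$$\mathrm{FL}(p_t) = -(1 - p_t)^{\gamma}\,\log(p_t)$$

DARS plays the analogous role for the rollout budget: prompts the policy already solves reliably receive few rollouts, while hard prompts receive many.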
- openr1.parquet: training data
- capacity_val.parquet: validation data
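To sanity-check the parquet files before training, a minimal inspection sketch (assumes pandas with a parquet engine such as pyarrow; the column names printed are whatever the repo ships):

```python
import pandas as pd

# Inspect the training split.
df = pd.read_parquet("openr1.parquet")
print(len(df), "rows")
print(df.columns.tolist())  # prompt/answer fields as shipped in the repo
```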
Then set up the environment:
mkdir -p /opt/envs/dars
tar xzf dars.tar.gz -C /opt/envs/dars
source /opt/envs/dars/bin/activate
# or
conda activate /opt/envs/dars
cd dars
pip install -e ./verl
pip install -e .
The packaged environment above only works with CUDA 12.4. For other CUDA versions, install the dependencies manually:
pip install -e ./verl
pip install packaging
pip install ninja
pip install flash-attn --no-build-isolation
pip install -e .
If the flash-attn build fails, you may need to install a prebuilt wheel from:
https://github.com/Dao-AILab/flash-attention/releases/tag/v2.7.4.post1
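If the source build is too slow or fails, download the prebuilt wheel matching your Python, torch, and CUDA ABI from that release page and install it directly (the filename below is hypothetical; substitute the asset for your environment):

pip install flash_attn-2.7.4.post1+cu12torch2.5cxx11abiFALSE-cp310-cp310-linux_x86_64.whl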
For more experiment scripts, see ./experiments/**.sh.
- resampling_func 1: equal-treatment (ET) schedule; we set n_max = 32 when training DARS-1.5B/7B.
- resampling_func 2: hardness-weighted (HW) schedule; we set n_max = 64 when training DARS-1.5B/7B. (Both schedules are sketched below.)
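A minimal sketch of the two schedules as we read them; the function and parameter names (rollout_budget, n_base) are ours for illustration, and the exact formulas live in the repo's resampling code:

```python
import numpy as np

def rollout_budget(pass_rates, n_base=8, n_max=32, schedule=1):
    """Hypothetical per-prompt rollout budget under the two DARS schedules.

    pass_rates: empirical accuracy of each prompt over the first n_base rollouts.
    schedule 1 (equal treatment): every unsolved prompt is resampled up to the
        same cap n_max.
    schedule 2 (hardness weighted): the extra budget grows with difficulty
        (1 - pass_rate), capped at n_max.
    """
    p = np.asarray(pass_rates, dtype=float)
    unsolved = p < 1.0
    if schedule == 1:
        n = np.where(unsolved, n_max, n_base)
    else:
        n = np.where(unsolved, n_base + np.ceil((1.0 - p) * (n_max - n_base)), n_base)
    return n.astype(int)

print(rollout_budget([0.0, 0.25, 1.0], schedule=1))  # -> [32 32  8]
print(rollout_budget([0.0, 0.25, 1.0], schedule=2))  # -> [32 26  8]
```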
#!/bin/bash
set -x
# Warning: export VLLM_ATTENTION_BACKEND on every machine before starting the Ray cluster.
# Running vLLM without XFORMERS will result in CUDA errors.
export WANDB_API_KEY="your key here"
export VLLM_ATTENTION_BACKEND=XFORMERS
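# For multi-node runs, the same variable must be set before `ray start` on each
# machine (hypothetical commands; adjust the head IP and port to your cluster):
#   head node:    VLLM_ATTENTION_BACKEND=XFORMERS ray start --head --port=6379
#   worker nodes: VLLM_ATTENTION_BACKEND=XFORMERS ray start --address=<head-ip>:6379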
export MODEL_PATH="$MODEL_PATH/Qwen2.5-Math-1.5B"
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
# Train on a single node
python3 -m verl.trainer.main_ppo_dars \
algorithm.adv_estimator=grpo \
data.train_files=$DATA_PATH/openr1.parquet \
data.val_files=$DATA_PATH/capacity_val.parquet \
data.train_batch_size=3072 \
data.val_batch_size=512 \
data.max_prompt_length=1024 \
data.max_response_length=3072 \
data.shuffle=False \
+data.resampling_func=1 \
+data.use_template=True \
+data.reward_impl_version=2 \
+actor_rollout_ref.ref.use_ref=False \
actor_rollout_ref.actor.ppo_epochs=2 \
actor_rollout_ref.model.path=$MODEL_PATH \
actor_rollout_ref.actor.optim.lr=5e-6 \
actor_rollout_ref.model.use_remove_padding=True \
actor_rollout_ref.actor.ppo_mini_batch_size=3072 \
actor_rollout_ref.actor.use_dynamic_bsz=True \
actor_rollout_ref.actor.ppo_max_token_len_per_gpu=24576 \
actor_rollout_ref.actor.use_kl_loss=False \
actor_rollout_ref.actor.kl_loss_coef=0.000 \
actor_rollout_ref.actor.kl_loss_type=low_var_kl \
actor_rollout_ref.actor.ulysses_sequence_parallel_size=1 \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.fsdp_config.param_offload=False \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.temperature=1.0 \
actor_rollout_ref.rollout.val_temperature=1.0 \
actor_rollout_ref.rollout.gpu_memory_utilization=0.7 \
actor_rollout_ref.rollout.n=8 \
actor_rollout_ref.rollout.n_val=128 \
actor_rollout_ref.ref.fsdp_config.param_offload=True \
algorithm.kl_ctrl.kl_coef=0.000 \
+algorithm.n_max=32 \
+algorithm.grpo_use_std=False \
trainer.critic_warmup=0 \
+trainer.del_last_ckpt=False \
+trainer.log_train=True \
trainer.logger=['console','wandb'] \
trainer.project_name='DARS' \
trainer.experiment_name='Qwen2.5-Math-1.5B-openr1-nothink-3k-func1-bs3k-pp2' \
+trainer.val_before_train=False \
trainer.n_gpus_per_node=8 \
trainer.nnodes=1 \
trainer.save_freq=14 \
trainer.test_freq=14 \
trainer.default_hdfs_dir=null \
trainer.total_training_steps=86 \
trainer.total_epochs=30 "${@:1}"
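For reference, the standard GRPO advantage normalizes each rollout's reward within its group of n samples drawn from the same prompt:

$$A_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_n)}{\operatorname{std}(r_1, \dots, r_n)}$$

Our reading of `+algorithm.grpo_use_std=False` (an assumption, not confirmed by the config docs) is that it drops the std denominator, leaving mean-centering only.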
To reproduce our RLVR baseline results, refer to the dars-baseline branch. To compute metrics from the saved validation generations, run:
python ./analysis_results.py --data_path [your validation generation JSON path]
| Model | AIME24 | MATH-500 | Olympiad | AMC | Minerva | Avg@128 | Pass@128 |
|---|---|---|---|---|---|---|---|
| Qwen-Math-1.5B-Base | 4.0 | 35.1 | 16.2 | 20.8 | 9.5 | 21.1 | 77.9 |
| Qwen-Math-1.5B-Ins | 10.2 | 67.9 | 34.7 | 42.6 | 24.4 | 43.5 | 79.9 |
| Oat-Zero-1.5B | 16.4 | 73.0 | 35.5 | 47.4 | 26.8 | 46.3 | 80.3 |
| Qwen-Math-1.5B-RLVR | 14.7 | 75.9 | 39.4 | 47.5 | 31.2 | 49.6 | 79.6 |
| DARS-1.5B-ET | 15.8 | 76.0 | 40.9 | 47.2 | 30.0 | 50.0 | 81.2 |
| DARS-1.5B-ET-Breadth | 18.6 | 79.4 | 42.9 | 50.6 | 31.7 | 52.5 | 80.8 |
| DARS-1.5B-HW | 17.7 | 76.4 | 40.0 | 48.4 | 30.8 | 50.0 | 82.1 |
| DARS-1.5B-HW-Breadth | 19.3 | 79.0 | 42.7 | 51.9 | 31.6 | 52.4 | 82.2 |
| Qwen-Math-7B-Base | 11.6 | 52.3 | 19.7 | 35.2 | 15.3 | 30.1 | 82.1 |
| Qwen2.5-Math-7B-Ins | 12.9 | 81.5 | 39.9 | 47.0 | 34.1 | 52.0 | 82.3 |
| SimpleRL-Zero-7B | 23.3 | 72.8 | 36.1 | 52.8 | 26.8 | 46.9 | 82.5 |
| Oat-Zero-7B | 31.3 | 79.2 | 42.5 | 59.4 | 33.7 | 53.4 | 79.7 |
| Qwen-Math-7B-RLVR | 26.8 | 82.2 | 44.3 | 57.2 | 35.7 | 55.3 | 81.4 |
| DARS-7B-ET | 26.9 | 83.2 | 46.6 | 57.3 | 38.5 | 57.0 | 81.7 |
| DARS-7B-ET-Breadth | 33.3 | 83.8 | 47.8 | 61.3 | 38.4 | 58.1 | 82.1 |
| DARS-7B-HW | 30.1 | 83.5 | 47.1 | 59.4 | 37.2 | 57.3 | 83.5 |
| DARS-7B-HW-Breadth | 33.0 | 84.5 | 48.4 | 63.0 | 36.9 | 58.4 | 83.4 |
| Llama-3.1-8B-Base | 0.23 | 6.13 | 1.54 | 2.76 | 2.72 | 3.25 | 52.7 |
| DARS-Llama-ET-Breadth | 1.46 | 39.4 | 12.0 | 13.2 | 20.1 | 22.0 | 67.2 |
| DARS-Llama-HW-Breadth | 1.11 | 39.0 | 12.0 | 13.3 | 19.8 | 21.8 | 68.7 |
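In the table, Avg@128 averages accuracy over 128 samples per problem, and Pass@128 counts a problem as solved if any of the 128 samples is correct. For reference, a minimal sketch of the standard unbiased Pass@k estimator (Chen et al., 2021); `analysis_results.py` may implement it differently:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k: 1 - C(n-c, k) / C(n, k),
    with n samples drawn and c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=128, c=3, k=128))  # -> 1.0 (any correct sample solves Pass@128)
```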
This repo builds upon veRL and deepscaler, uses vLLM for inference, and uses Math-Verify for math-reasoning evaluation. We thank the open-source community for the datasets and backbones: OpenR1-Math-220k, Qwen2.5-Math, and DeepSeek-R1.
For questions, feedback, or collaboration opportunities, feel free to reach out:
- Zhicheng Yang: [email protected]
If you find our model or code useful, please cite our paper:
@misc{yang2025depthbreadthsynergyrlvrunlocking,
  title={Depth-Breadth Synergy in RLVR: Unlocking LLM Reasoning Gains with Adaptive Exploration},
  author={Zhicheng Yang and Zhijiang Guo and Yinya Huang and Yongxin Wang and Dongchun Xie and Yiwei Wang and Xiaodan Liang and Jing Tang},
  year={2025},
  eprint={2508.13755},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2508.13755},
}