Difficulty Adaptive Rollout Sampling (DARS) with Breadth Scaling to unlock simultaneous Pass@1 and Pass@K gains in RLVR.
We treat DARS as the focal loss in RLVR.
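For reference, focal loss (Lin et al., 2017) downweights easy, well-classified examples so that training signal concentrates on hard ones:

$$\mathrm{FL}(p_t) = -(1 - p_t)^{\gamma}\,\log(p_t)$$

DARS plays the analogous role for the rollout budget: prompts the policy already solves reliably receive few rollouts, while hard prompts receive many.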
- openr1.parquet: training data
- capacity_val.parquet: validation data
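To sanity-check the parquet files before training, a minimal inspection sketch (assumes pandas with a parquet engine such as pyarrow; the column names printed are whatever the repo ships):

```python
import pandas as pd

# Inspect the training split.
df = pd.read_parquet("openr1.parquet")
print(len(df), "rows")
print(df.columns.tolist())  # prompt/answer fields as shipped in the repo
```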
Then set up the environment:
mkdir -p /opt/envs/dars
tar xzf dars.tar.gz -C /opt/envs/dars
source /opt/envs/dars/bin/activate
# or
conda activate /opt/envs/dars
cd dars
pip install -e ./verl
pip install -e .
The packaged environment above only works with CUDA 12.4. For other CUDA versions, install the dependencies manually:
pip install -e ./verl
pip install packaging
pip install ninja
pip install flash-attn --no-build-isolation
pip install -e .
If the flash-attn build fails, you may need to install a prebuilt wheel from:
https://github.com/Dao-AILab/flash-attention/releases/tag/v2.7.4.post1
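If the source build is too slow or fails, download the prebuilt wheel matching your Python, torch, and CUDA ABI from that release page and install it directly (the filename below is hypothetical; substitute the asset for your environment):

pip install flash_attn-2.7.4.post1+cu12torch2.5cxx11abiFALSE-cp310-cp310-linux_x86_64.whl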
For more experiment scripts, see ./experiments/**.sh.
- resampling_func 1: equal-treatment (ET) schedule; we set n_max = 32 when training DARS-1.5B/7B.
- resampling_func 2: hardness-weighted (HW) schedule; we set n_max = 64 when training DARS-1.5B/7B. (Both schedules are sketched below.)
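A minimal sketch of the two schedules as we read them; the function and parameter names (rollout_budget, n_base) are ours for illustration, and the exact formulas live in the repo's resampling code:

```python
import numpy as np

def rollout_budget(pass_rates, n_base=8, n_max=32, schedule=1):
    """Hypothetical per-prompt rollout budget under the two DARS schedules.

    pass_rates: empirical accuracy of each prompt over the first n_base rollouts.
    schedule 1 (equal treatment): every unsolved prompt is resampled up to the
        same cap n_max.
    schedule 2 (hardness weighted): the extra budget grows with difficulty
        (1 - pass_rate), capped at n_max.
    """
    p = np.asarray(pass_rates, dtype=float)
    unsolved = p < 1.0
    if schedule == 1:
        n = np.where(unsolved, n_max, n_base)
    else:
        n = np.where(unsolved, n_base + np.ceil((1.0 - p) * (n_max - n_base)), n_base)
    return n.astype(int)

print(rollout_budget([0.0, 0.25, 1.0], schedule=1))  # -> [32 32  8]
print(rollout_budget([0.0, 0.25, 1.0], schedule=2))  # -> [32 26  8]
```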
#!/bin/bash
set -x
# Warning: export VLLM_ATTENTION_BACKEND on every machine before starting the Ray cluster.
# Running vLLM without XFORMERS will result in CUDA errors.
export WANDB_API_KEY="your key here"
export VLLM_ATTENTION_BACKEND=XFORMERS
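# For multi-node runs, the same variable must be set before `ray start` on each
# machine (hypothetical commands; adjust the head IP and port to your cluster):
#   head node:    VLLM_ATTENTION_BACKEND=XFORMERS ray start --head --port=6379
#   worker nodes: VLLM_ATTENTION_BACKEND=XFORMERS ray start --address=<head-ip>:6379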
export MODEL_PATH="$MODEL_PATH/Qwen2.5-Math-1.5B"
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
# Train on a single node
python3 -m verl.trainer.main_ppo_dars \
algorithm.adv_estimator=grpo \
data.train_files=$DATA_PATH/openr1.parquet \
data.val_files=$DATA_PATH/capacity_val.parquet \
data.train_batch_size=3072 \
data.val_batch_size=512 \
data.max_prompt_length=1024 \
data.max_response_length=3072 \
data.shuffle=False \
+data.resampling_func=1 \
+data.use_template=True \
+data.reward_impl_version=2 \
+actor_rollout_ref.ref.use_ref=False \
actor_rollout_ref.actor.ppo_epochs=2 \
actor_rollout_ref.model.path=$MODEL_PATH \
actor_rollout_ref.actor.optim.lr=5e-6 \
actor_rollout_ref.model.use_remove_padding=True \
actor_rollout_ref.actor.ppo_mini_batch_size=3072 \
actor_rollout_ref.actor.use_dynamic_bsz=True \
actor_rollout_ref.actor.ppo_max_token_len_per_gpu=24576 \
actor_rollout_ref.actor.use_kl_loss=False \
actor_rollout_ref.actor.kl_loss_coef=0.000 \
actor_rollout_ref.actor.kl_loss_type=low_var_kl \
actor_rollout_ref.actor.ulysses_sequence_parallel_size=1 \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.fsdp_config.param_offload=False \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.temperature=1.0 \
actor_rollout_ref.rollout.val_temperature=1.0 \
actor_rollout_ref.rollout.gpu_memory_utilization=0.7 \
actor_rollout_ref.rollout.n=8 \
actor_rollout_ref.rollout.n_val=128 \
actor_rollout_ref.ref.fsdp_config.param_offload=True \
algorithm.kl_ctrl.kl_coef=0.000 \
+algorithm.n_max=32 \
+algorithm.grpo_use_std=False \
trainer.critic_warmup=0 \
+trainer.del_last_ckpt=False \
+trainer.log_train=True \
trainer.logger=['console','wandb'] \
trainer.project_name='DARS' \
trainer.experiment_name='Qwen2.5-Math-1.5B-openr1-nothink-3k-func1-bs3k-pp2' \
+trainer.val_before_train=False \
trainer.n_gpus_per_node=8 \
trainer.nnodes=1 \
trainer.save_freq=14 \
trainer.test_freq=14 \
trainer.default_hdfs_dir=null \
trainer.total_training_steps=86 \
trainer.total_epochs=30 "${@:1}"
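For reference, the standard GRPO advantage normalizes each rollout's reward within its group of n samples drawn from the same prompt:

$$A_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_n)}{\operatorname{std}(r_1, \dots, r_n)}$$

Our reading of `+algorithm.grpo_use_std=False` (an assumption, not confirmed by the config docs) is that it drops the std denominator, leaving mean-centering only.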
To reproduce our RLVR baseline results, refer to the dars-baseline branch. To compute metrics from the saved validation generations, run:
python ./analysis_results.py --data_path [your validation generation JSON path]
| Model | AIME24 | MATH-500 | Olympiad | AMC | Minerva | Avg@128 | Pass@128 |
|---|---|---|---|---|---|---|---|
| Qwen-Math-1.5B-Base | 4.0 | 35.1 | 16.2 | 20.8 | 9.5 | 21.1 | 77.9 |
| Qwen-Math-1.5B-Ins | 10.2 | 67.9 | 34.7 | 42.6 | 24.4 | 43.5 | 79.9 |
| Oat-Zero-1.5B | 16.4 | 73.0 | 35.5 | 47.4 | 26.8 | 46.3 | 80.3 |
| Qwen-Math-1.5B-RLVR | 14.7 | 75.9 | 39.4 | 47.5 | 31.2 | 49.6 | 79.6 |
| DARS-1.5B-ET | 15.8 | 76.0 | 40.9 | 47.2 | 30.0 | 50.0 | 81.2 |
| DARS-1.5B-ET-Breadth | 18.6 | 79.4 | 42.9 | 50.6 | 31.7 | 52.5 | 80.8 |
| DARS-1.5B-HW | 17.7 | 76.4 | 40.0 | 48.4 | 30.8 | 50.0 | 82.1 |
| DARS-1.5B-HW-Breadth | 19.3 | 79.0 | 42.7 | 51.9 | 31.6 | 52.4 | 82.2 |
| Qwen-Math-7B-Base | 11.6 | 52.3 | 19.7 | 35.2 | 15.3 | 30.1 | 82.1 |
| Qwen2.5-Math-7B-Ins | 12.9 | 81.5 | 39.9 | 47.0 | 34.1 | 52.0 | 82.3 |
| SimpleRL-Zero-7B | 23.3 | 72.8 | 36.1 | 52.8 | 26.8 | 46.9 | 82.5 |
| Oat-Zero-7B | 31.3 | 79.2 | 42.5 | 59.4 | 33.7 | 53.4 | 79.7 |
| Qwen-Math-7B-RLVR | 26.8 | 82.2 | 44.3 | 57.2 | 35.7 | 55.3 | 81.4 |
| DARS-7B-ET | 26.9 | 83.2 | 46.6 | 57.3 | 38.5 | 57.0 | 81.7 |
| DARS-7B-ET-Breadth | 33.3 | 83.8 | 47.8 | 61.3 | 38.4 | 58.1 | 82.1 |
| DARS-7B-HW | 30.1 | 83.5 | 47.1 | 59.4 | 37.2 | 57.3 | 83.5 |
| DARS-7B-HW-Breadth | 33.0 | 84.5 | 48.4 | 63.0 | 36.9 | 58.4 | 83.4 |
| Llama-3.1-8B-Base | 0.23 | 6.13 | 1.54 | 2.76 | 2.72 | 3.25 | 52.7 |
| DARS-Llama-ET-Breadth | 1.46 | 39.4 | 12.0 | 13.2 | 20.1 | 22.0 | 67.2 |
| DARS-Llama-HW-Breadth | 1.11 | 39.0 | 12.0 | 13.3 | 19.8 | 21.8 | 68.7 |
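In the table, Avg@128 averages accuracy over 128 samples per problem, and Pass@128 counts a problem as solved if any of the 128 samples is correct. For reference, a minimal sketch of the standard unbiased Pass@k estimator (Chen et al., 2021); `analysis_results.py` may implement it differently:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k: 1 - C(n-c, k) / C(n, k),
    with n samples drawn and c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=128, c=3, k=128))  # -> 1.0 (any correct sample solves Pass@128)
```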
This repo builds upon veRL and deepscaler, uses vLLM for inference, and uses Math-Verify for math-reasoning evaluation. We thank the open-source community for the datasets and backbones: OpenR1-Math-220k, Qwen2.5-Math, and DeepSeek-R1.
For questions, feedback, or collaboration opportunities, feel free to reach out:
- Zhicheng Yang: [email protected]
If you find our model or code useful, please cite our paper:
@misc{yang2025depthbreadthsynergyrlvrunlocking,
  title={Depth-Breadth Synergy in RLVR: Unlocking LLM Reasoning Gains with Adaptive Exploration},
  author={Zhicheng Yang and Zhijiang Guo and Yinya Huang and Yongxin Wang and Dongchun Xie and Yiwei Wang and Xiaodan Liang and Jing Tang},
  year={2025},
  eprint={2508.13755},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2508.13755},
}