[🌐 Website] • [🤗 Open Source] • [📜 Paper] • [🐱 GitHub] • [🐦 Twitter] • [📕 Rednote]
Repo for "Beyond Pass@1: Self-Play with Variational Problem Synthesis Sustains RLVR"
Figure 1: We train Qwen2.5-32B-Instruct on the DAPO-17k dataset with our SvS strategy and with standard RLVR. SvS achieves significant improvements in both Pass@32 and Pass@1 (averaged over 32 samples) on the AIME benchmarks.
- [2025/12/13] 🔥🔥🔥 We open-sourced three SvS model checkpoints at different scales, along with an additional 7B checkpoint for coding tasks, available at [Models]. The training parquet data is attached in the respective repos.
- [2025/08/25] We release the full training and evaluation code for SvS.
- [2025/08/19] Our full code and datasets are under review by Microsoft and will be released upon approval.
- [2025/08/19] SvS paper, repo, website, and datasets (variational DAPO-17k) released.
Figure 2: The data workflow of our SvS in a training iteration, comprising original problem solving, variational problem synthesis, synthetic problem solving, and policy-update data filtering.
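Under this workflow, one training iteration can be sketched as the Python pseudocode below. This is a minimal illustration only; all helper names (`sample_solutions`, `synthesize_variations`, `verify`, `filter_uninformative_prompts`, `policy_update`) are hypothetical placeholders and do not correspond to functions in this repository.

```python
# Minimal sketch of one SvS training iteration (see Figure 2).
# All helpers below are hypothetical placeholders, not functions from this repo.
def svs_iteration(policy, problems, n_samples=8):
    rollouts = []
    for prob in problems:
        # (1) Original problem solving: sample solutions and verify them against the gold answer.
        solutions = sample_solutions(policy, prob.question, n=n_samples)
        rollouts += [(prob.question, s, verify(s, prob.answer)) for s in solutions]

        # (2) Variational problem synthesis: the same policy rewrites each correct
        #     solution into variational problems that preserve the original answer.
        correct = [s for s in solutions if verify(s, prob.answer)]
        for sol in correct:
            for synth_q in synthesize_variations(policy, prob.question, sol):
                # (3) Synthetic problem solving: the policy solves its own synthetic
                #     problems, graded against the original (reference) answer.
                synth_sols = sample_solutions(policy, synth_q, n=n_samples)
                rollouts += [(synth_q, s, verify(s, prob.answer)) for s in synth_sols]

    # (4) Policy-update data filtering: drop prompts that provide no learning signal
    #     (e.g., all samples correct or all wrong), then update the policy.
    policy_update(policy, filter_uninformative_prompts(rollouts))
    return policy
```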
We present an example of variational problem synthesis and the reward-shaping strategy in the following figure. If a synthetic problem is either trivially solvable (too simple) or unsolvable (no sampled solution matches the original answer), it receives a negative reward.
Figure 3: Illustrations of a challenging problem, its correct solution from the policy, the variational problems synthesized from that solution, and the reward-shaping strategy for the synthetic problems.
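As a concrete illustration of this reward rule, here is a self-contained Python sketch; the reward magnitudes and solve-rate thresholds are illustrative assumptions, not the exact values used in the paper.

```python
# Hedged sketch of the reward-shaping rule for synthetic problems described above.
# The reward values and solve-rate thresholds are illustrative assumptions.
def synthetic_problem_reward(num_correct: int, num_samples: int) -> float:
    """Reward a synthetic problem by how often sampled solutions reproduce the original answer."""
    solve_rate = num_correct / num_samples
    if solve_rate == 0.0:   # unsolvable: no sampled solution matches the original answer
        return -1.0
    if solve_rate == 1.0:   # trivially solvable: every sample is correct, so the problem is too simple
        return -1.0
    return 1.0              # otherwise the synthetic problem is usefully difficult

print(synthetic_problem_reward(num_correct=3, num_samples=16))   # 1.0
print(synthetic_problem_reward(num_correct=0, num_samples=16))   # -1.0
```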
In the table below, the first seven score columns report Pass@1 and the last seven report Pass@32, each on AIME24, AIME25, BAIME, Math24o, OlymE, OlymH, and their average.

| Model | AIME24 | AIME25 | BAIME | Math24o | OlymE | OlymH | Avg. | AIME24 | AIME25 | BAIME | Math24o | OlymE | OlymH | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Open-Source Models** | | | | | | | | | | | | | | |
| Qwen2.5-32B | 4.3 | 1.2 | 2.4 | 8.0 | 3.7 | 1.6 | 3.5 | 38.9 | 15.6 | 18.7 | 34.0 | 24.6 | 15.2 | 24.5 |
| Qwen2.5-32B-IT | 10.0 | 13.0 | 7.4 | 26.0 | 8.6 | 2.0 | 11.2 | 40.2 | 34.6 | 24.0 | 67.8 | 35.2 | 9.5 | 35.2 |
| SimpleRL-32B | 22.1 | 13.9 | 8.3 | 25.5 | 9.4 | 3.7 | 13.8 | 62.0 | 38.5 | 27.4 | 69.9 | 42.5 | 19.4 | 43.3 |
| ORZ-32B | 24.2 | 26.3 | 10.9 | 16.1 | 12.2 | 1.1 | 15.1 | 55.7 | 47.0 | 29.4 | 58.0 | 45.9 | 12.3 | 41.4 |
| **MATH-12k** | | | | | | | | | | | | | | |
| → RLVR | 22.2 | 15.8 | 11.5 | 34.5 | 11.7 | 4.1 | 16.6 | 47.4 | 36.4 | 29.2 | 66.0 | 36.2 | 16.4 | 38.6 |
| → SvS | 30.3 | 21.7 | 13.8 | 42.7 | 20.1 | 3.3 | 22.0 | 63.6 | 55.1 | 41.5 | 79.2 | 63.6 | 24.8 | 54.6 |
| Δ | +8.1 | +5.9 | +2.3 | +8.2 | +8.4 | -0.8 | +5.4 | +16.2 | +18.7 | +12.3 | +13.2 | +27.4 | +8.4 | +16.0 |
| **DAPO-17k** | | | | | | | | | | | | | | |
| → RLVR | 28.8 | 30.0 | 14.0 | 39.6 | 17.9 | 4.8 | 22.5 | 52.5 | 42.4 | 35.9 | 71.2 | 47.1 | 18.3 | 44.6 |
| → SvS | 39.3 | 40.5 | 19.2 | 44.1 | 21.8 | 2.7 | 27.9 | 70.8 | 65.2 | 45.9 | 76.5 | 43.4 | 16.7 | 53.1 |
| Δ | +10.5 | +10.5 | +5.2 | +4.5 | +3.9 | -2.1 | +5.4 | +18.3 | +22.8 | +10.0 | +5.3 | -3.7 | -1.6 | +8.5 |
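For reference, Pass@k in tables like this is commonly computed with the unbiased estimator of Chen et al. (2021); the sketch below shows that formula and is not necessarily identical to the evaluation code in `scripts/evaluation.sh`.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator: n samples per problem, c of them correct, budget k."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples exist, so every k-subset contains a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)

# With n = 32 samples per problem, Pass@1 reduces to the mean accuracy c / n.
print(pass_at_k(n=32, c=3, k=1))   # 0.09375 == 3/32
print(pass_at_k(n=32, c=3, k=32))  # 1.0
```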
We recommend using Conda to manage your environment. We use vLLM (0.10.0) to accelerate inference. Run the following commands to set up your environment:

```bash
git clone git@github.com:MasterVito/SvS.git && cd SvS
conda create -n svs python=3.10.16
conda activate svs
pip install torch==2.7.1 --index-url https://download.pytorch.org/whl/cu126  # CUDA 12.6 for example
pip install -r requirements.txt
```

We provide a script for inference. Simply configure `model_name_or_path` and `data_path` (by default, MATH-500, AIME24, and AIME25 are used for evaluation) in `scripts/evaluation.sh` and run the following command:
```bash
bash scripts/evaluation.sh
```

We also open-source our complete training scripts for the community. The training data used in our paper is provided in `data`. For example, to train the Qwen2.5-32B-Instruct model, run the following command:

```bash
bash scripts/run_svs_qwen2.5_32b.sh
```

You can also train the Qwen2.5-3B-Instruct and Llama-3.1-8B-Instruct models using the scripts provided in `scripts`.
If you find this repository helpful, please consider citing our paper:
```bibtex
@misc{liang2025svs,
      title={Beyond Pass@1: Self-Play with Variational Problem Synthesis Sustains RLVR},
      author={Xiao Liang and Zhongzhi Li and Yeyun Gong and Yelong Shen and Ying Nian Wu and Zhijiang Guo and Weizhu Chen},
      year={2025},
      eprint={2508.14029},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2508.14029},
}
```
We sincerely appreciate the outstanding work of veRL and SwS. The challenging problem augmentation strategy is inspired by SwS, and the training code is adapted from the veRL repository.