SvS: Self-play with Variational Problem Synthesis

[🌐 Website][🤗 Open Source][📜 Paper][🐱 GitHub][🐦 Twitter][📕 Rednote]

Repo for "Beyond Pass@1: Self-Play with Variational Problem Synthesis Sustains RLVR"



Figure 1: We train Qwen2.5-32B-Instruct on the DAPO-17k dataset using our SvS strategy and standard RLVR. SvS achieves significant improvements in Pass@32 and Pass@1 (averaged over 32 samples) on the AIME benchmarks.

🔥 News

  • [2025/12/13] 🔥🔥🔥 We open-sourced three SvS model checkpoints at different scales, along with an additional 7B checkpoint for coding tasks, available at [Models]. Training parquet data are attached in the respective repos.
  • [2025/08/25] We provide the full code for training and evaluation for SvS.
  • [2025/08/19] Our full code and datasets are under review by Microsoft and will be released upon approval.
  • [2025/08/19] SvS paper, repo, website, and datasets (variational DAPO-17k) released.

💡 Introduction

The SvS framework leverages the policy itself to augment training problems online through self-play. Specifically, the policy synthesizes variational problems from its correct solutions to under-performing training-set problems and then attempts to solve these synthetic problems. The variational problems preserve the semantics and, crucially, the ground-truth answers of the originals, while their structures and descriptions may differ significantly, thereby eliciting novel and diverse reasoning strategies from the policy. Finally, original problem solving, variational problem synthesis, and synthetic problem solving are integrated for the policy update, enabling the model to jointly learn problem solving and problem synthesis. Our SvS framework continuously keeps policy entropy within a narrow range and substantially improves Pass@32 on AIME24 (+18.3%) and AIME25 (+22.8%).
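
To make the loop concrete, here is a minimal Python sketch of one SvS training iteration. All helper callables (sample_solutions, synthesize_variants, verify_answer, policy_update) and the threshold values are hypothetical interfaces we assume for illustration, not this repo's actual API; the real implementation lives in the veRL-based training code.

```python
# A minimal sketch of one SvS training iteration. The helper interfaces and
# constants below are assumptions for illustration, NOT the repo's actual API.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Problem:
    statement: str
    answer: str  # ground-truth answer; variational synthesis preserves it

def svs_iteration(
    problems: list[Problem],
    sample_solutions: Callable[[str, int], list[str]],     # k policy rollouts for a problem
    synthesize_variants: Callable[[str, str], list[str]],  # variants from (problem, correct solution)
    verify_answer: Callable[[str, str], bool],             # does a solution match the answer?
    policy_update: Callable[[list[tuple[str, str, float]]], None],
    k: int = 8,               # rollouts per problem (assumed value)
    hard_rate: float = 0.25,  # "under-performing" solve-rate threshold (assumed value)
) -> None:
    batch: list[tuple[str, str, float]] = []  # (problem, response, reward) triples
    for p in problems:
        # 1) Original problem solving: sample k rollouts, reward 1/0 by answer match.
        sols = sample_solutions(p.statement, k)
        rewards = [1.0 if verify_answer(s, p.answer) else 0.0 for s in sols]
        batch += list(zip([p.statement] * k, sols, rewards))
        # 2) Variational problem synthesis: an under-performing problem with at
        #    least one correct rollout is rewritten into variants that keep the
        #    semantics and, crucially, the ground-truth answer of the original.
        correct = [s for s, r in zip(sols, rewards) if r == 1.0]
        if correct and sum(rewards) / k < hard_rate:
            for variant in synthesize_variants(p.statement, correct[0]):
                # 3) Synthetic problem solving: grade against the ORIGINAL answer.
                v_sols = sample_solutions(variant, k)
                batch += [(variant, s, 1.0 if verify_answer(s, p.answer) else 0.0)
                          for s in v_sols]
    # 4) Joint policy update on solving + synthesis data (after filtering).
    policy_update(batch)
```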


Figure 2: The data workflow of our SvS in a training iteration, comprising original problem solving, variational problem synthesis, synthetic problem solving, and policy-update data filtering.


We present an example of variational problem synthesis and the reward-shaping strategy in the following figure. If a synthetic problem is either trivially solvable (too simple) or unsolvable (no sampled solution matches the original answer), it receives a negative reward.


Figure 3: Illustrations of a challenging problem, its correct solution from policy, the synthetic variational problems from the solution, and the reward-shaping strategy for the synthetic problems.
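
As a rough illustration, this rule can be written as a function of the group solve rate over the sampled attempts at a synthetic problem; the specific reward values below are our assumption, not necessarily the paper's exact constants.

```python
# A rough sketch of the reward-shaping rule for synthesis actions: a synthetic
# problem earns a positive reward only if the policy can solve it sometimes but
# not always. The specific reward values (+1 / -1) are assumptions.
def synthesis_reward(num_correct: int, num_samples: int) -> float:
    solve_rate = num_correct / num_samples
    if solve_rate == 0.0:  # unsolvable: no sample matches the original answer
        return -1.0
    if solve_rate == 1.0:  # trivially solvable: every sample already matches
        return -1.0
    return 1.0             # informative difficulty: reward this synthesis
```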


📊 Experiments on Qwen2.5-32B-Instruct

**Pass@1**

| Model | AIME24 | AIME25 | BAIME | Math24o | OlymE | OlymH | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| **Open-Source Models** |  |  |  |  |  |  |  |
| Qwen2.5-32B | 4.3 | 1.2 | 2.4 | 8.0 | 3.7 | 1.6 | 3.5 |
| Qwen2.5-32B-IT | 10.0 | 13.0 | 7.4 | 26.0 | 8.6 | 2.0 | 11.2 |
| SimpleRL-32B | 22.1 | 13.9 | 8.3 | 25.5 | 9.4 | 3.7 | 13.8 |
| ORZ-32B | 24.2 | 26.3 | 10.9 | 16.1 | 12.2 | 1.1 | 15.1 |
| **MATH-12k** |  |  |  |  |  |  |  |
| → RLVR | 22.2 | 15.8 | 11.5 | 34.5 | 11.7 | 4.1 | 16.6 |
| → SvS | 30.3 | 21.7 | 13.8 | 42.7 | 20.1 | 3.3 | 22.0 |
| Δ | +8.1 | +5.9 | +2.3 | +8.2 | +8.4 | -0.8 | +5.4 |
| **DAPO-17k** |  |  |  |  |  |  |  |
| → RLVR | 28.8 | 30.0 | 14.0 | 39.6 | 17.9 | 4.8 | 22.5 |
| → SvS | 39.3 | 40.5 | 19.2 | 44.1 | 21.8 | 2.7 | 27.9 |
| Δ | +10.5 | +10.5 | +5.2 | +4.5 | +3.9 | -2.1 | +5.4 |

**Pass@32**

| Model | AIME24 | AIME25 | BAIME | Math24o | OlymE | OlymH | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| **Open-Source Models** |  |  |  |  |  |  |  |
| Qwen2.5-32B | 38.9 | 15.6 | 18.7 | 34.0 | 24.6 | 15.2 | 24.5 |
| Qwen2.5-32B-IT | 40.2 | 34.6 | 24.0 | 67.8 | 35.2 | 9.5 | 35.2 |
| SimpleRL-32B | 62.0 | 38.5 | 27.4 | 69.9 | 42.5 | 19.4 | 43.3 |
| ORZ-32B | 55.7 | 47.0 | 29.4 | 58.0 | 45.9 | 12.3 | 41.4 |
| **MATH-12k** |  |  |  |  |  |  |  |
| → RLVR | 47.4 | 36.4 | 29.2 | 66.0 | 36.2 | 16.4 | 38.6 |
| → SvS | 63.6 | 55.1 | 41.5 | 79.2 | 63.6 | 24.8 | 54.6 |
| Δ | +16.2 | +18.7 | +12.3 | +13.2 | +27.4 | +8.4 | +16.0 |
| **DAPO-17k** |  |  |  |  |  |  |  |
| → RLVR | 52.5 | 42.4 | 35.9 | 71.2 | 47.1 | 18.3 | 44.6 |
| → SvS | 70.8 | 65.2 | 45.9 | 76.5 | 43.4 | 16.7 | 53.1 |
| Δ | +18.3 | +22.8 | +10.0 | +5.3 | -3.7 | -1.6 | +8.5 |
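
For reference, Pass@k from n sampled completions per problem is commonly computed with the unbiased estimator of Chen et al. (2021); a minimal sketch is below. Whether this repo uses exactly this estimator is an assumption; note that Pass@1 averaged over 32 samples coincides with this estimator at k = 1.

```python
# Unbiased pass@k estimator from n samples with c correct (Chen et al., 2021):
# pass@k = 1 - C(n - c, k) / C(n, k). Its use in this repo is an assumption.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:  # every size-k subset must contain a correct sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```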

🚀 Quick Start

⚙️ Setup

We recommend using Conda to manage your environment. We use vLLM (0.10.0) to accelerate inference. Run the following commands to set up your environment:

git clone git@github.com:MasterVito/SvS.git && cd SvS
conda create -n svs python=3.10.16
conda activate svs
pip install torch==2.7.1 --index-url https://download.pytorch.org/whl/cu126 # CUDA 12.6 for example
pip install -r requirements.txt

🪁 Evaluation

We provide a script for inference. Simply configure model_name_or_path and data_path (by default, MATH-500, AIME24, and AIME25 are used for evaluation) in scripts/evaluation.sh, then run the following command:

bash scripts/evaluation.sh
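
If you would rather sample completions directly (e.g., to compute Pass@32 yourself with the estimator above), a minimal vLLM sketch follows; the model path, prompt, and sampling parameters are placeholders, not necessarily the defaults of scripts/evaluation.sh.

```python
# Minimal direct-sampling sketch with vLLM; model path and sampling settings
# are placeholders, not necessarily what scripts/evaluation.sh uses.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-32B-Instruct")  # or a local SvS checkpoint path
params = SamplingParams(n=32, temperature=1.0, top_p=0.95, max_tokens=8192)

outputs = llm.generate(["Solve: What is 17 * 24?"], params)
for out in outputs[0].outputs:  # 32 sampled completions for Pass@32-style eval
    print(out.text)
```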

⚡️ Training

We also open-source our complete training scripts for the community, and the training data used in our paper is provided in data. For example, to train the Qwen2.5-32B-Instruct model, run the following command:

bash scripts/run_svs_qwen2.5_32b.sh

You can also train the Qwen2.5-3B-Instruct and Llama-3.1-8B-Instruct models using the scripts provided in scripts.


☕️ Citation

If you find this repository helpful, please consider citing our paper:

@misc{liang2025svs,
      title={Beyond Pass@1: Self-Play with Variational Problem Synthesis Sustains RLVR}, 
      author={Xiao Liang and Zhongzhi Li and Yeyun Gong and Yelong Shen and Ying Nian Wu and Zhijiang Guo and Weizhu Chen},
      year={2025},
      eprint={2508.14029},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2508.14029}, 
}

🙏 Acknowledgement

We sincerely appreciate the outstanding work of veRL and SwS. The challenging problem augmentation strategy is inspired by SwS, and the training code is adapted from the veRL repository.

🌟 Star History

Star History Chart
