SvS: Self-play with Variational Problem Synthesis

[🌐 Website][🤗 Open Source][📜 Paper][🐱 GitHub][🐦 Twitter][📕 Rednote]

Repo for "Beyond Pass@1: Self-Play with Variational Problem Synthesis Sustains RLVR"



Figure 1: We train Qwen2.5-32B-Instruct on the DAPO-17k dataset using our SvS strategy and standard RLVR. SvS achieves significant improvements in Pass@32 and Pass@1 (averaged over 32 samples) on the AIME benchmarks.

🔥 News

  • [2025/12/13] 🔥🔥🔥 We open-sourced three SvS model checkpoints at different scales, along with an additional 7B checkpoint for coding tasks, available at [Models]. Training parquet data are attached in the respective repos.
  • [2025/08/25] We provide the full code for training and evaluation for SvS.
  • [2025/08/19] Our full code and datasets are under review by Microsoft and will be released upon approval.
  • [2025/08/19] SvS paper, repo, website, and datasets (variational DAPO-17k) released.

💡 Introduction

The SvS framework leverages the policy itself to augment training problems online through self-play. Specifically, the policy synthesizes variational problems from its correct solutions to under-performing training-set problems and then attempts to solve these synthetic problems. The variational problems preserve the semantics and, crucially, the ground-truth answers of the originals, while their structures and descriptions may differ significantly, thereby eliciting novel and diverse reasoning strategies from the policy. Finally, original problem solving, variational problem synthesis, and synthetic problem solving are integrated for the policy update, enabling the model to jointly learn problem solving and problem synthesis. Our SvS framework continuously keeps policy entropy within a narrow range and substantially improves Pass@32 on AIME24 (+18.3%) and AIME25 (+22.8%).
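
To make the loop concrete, here is a minimal Python sketch of one SvS training iteration. All helper callables (sample_solutions, synthesize_variants, verify_answer, policy_update) and the threshold values are hypothetical interfaces we assume for illustration, not this repo's actual API; the real implementation lives in the veRL-based training code.

```python
# A minimal sketch of one SvS training iteration. The helper interfaces and
# constants below are assumptions for illustration, NOT the repo's actual API.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Problem:
    statement: str
    answer: str  # ground-truth answer; variational synthesis preserves it

def svs_iteration(
    problems: list[Problem],
    sample_solutions: Callable[[str, int], list[str]],     # k policy rollouts for a problem
    synthesize_variants: Callable[[str, str], list[str]],  # variants from (problem, correct solution)
    verify_answer: Callable[[str, str], bool],             # does a solution match the answer?
    policy_update: Callable[[list[tuple[str, str, float]]], None],
    k: int = 8,               # rollouts per problem (assumed value)
    hard_rate: float = 0.25,  # "under-performing" solve-rate threshold (assumed value)
) -> None:
    batch: list[tuple[str, str, float]] = []  # (problem, response, reward) triples
    for p in problems:
        # 1) Original problem solving: sample k rollouts, reward 1/0 by answer match.
        sols = sample_solutions(p.statement, k)
        rewards = [1.0 if verify_answer(s, p.answer) else 0.0 for s in sols]
        batch += list(zip([p.statement] * k, sols, rewards))
        # 2) Variational problem synthesis: an under-performing problem with at
        #    least one correct rollout is rewritten into variants that keep the
        #    semantics and, crucially, the ground-truth answer of the original.
        correct = [s for s, r in zip(sols, rewards) if r == 1.0]
        if correct and sum(rewards) / k < hard_rate:
            for variant in synthesize_variants(p.statement, correct[0]):
                # 3) Synthetic problem solving: grade against the ORIGINAL answer.
                v_sols = sample_solutions(variant, k)
                batch += [(variant, s, 1.0 if verify_answer(s, p.answer) else 0.0)
                          for s in v_sols]
    # 4) Joint policy update on solving + synthesis data (after filtering).
    policy_update(batch)
```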


Figure 2: The data workflow of our SvS in a training iteration, comprising original problem solving, variational problem synthesis, synthetic problem solving, and policy-update data filtering.


We present an example of variational problem synthesis and the reward-shaping strategy in the following figure. If a synthetic problem is either trivially solvable (too simple) or unsolvable (no sampled solution matches the original answer), it receives a negative reward.


Figure 3: Illustrations of a challenging problem, its correct solution from policy, the synthetic variational problems from the solution, and the reward-shaping strategy for the synthetic problems.
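
As a rough illustration, this rule can be written as a function of the group solve rate over the sampled attempts at a synthetic problem; the specific reward values below are our assumption, not necessarily the paper's exact constants.

```python
# A rough sketch of the reward-shaping rule for synthesis actions: a synthetic
# problem earns a positive reward only if the policy can solve it sometimes but
# not always. The specific reward values (+1 / -1) are assumptions.
def synthesis_reward(num_correct: int, num_samples: int) -> float:
    solve_rate = num_correct / num_samples
    if solve_rate == 0.0:  # unsolvable: no sample matches the original answer
        return -1.0
    if solve_rate == 1.0:  # trivially solvable: every sample already matches
        return -1.0
    return 1.0             # informative difficulty: reward this synthesis
```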


📊 Experiments on Qwen2.5-32B-Instruct

**Pass@1**

| Model | AIME24 | AIME25 | BAIME | Math24o | OlymE | OlymH | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| **Open-Source Models** |  |  |  |  |  |  |  |
| Qwen2.5-32B | 4.3 | 1.2 | 2.4 | 8.0 | 3.7 | 1.6 | 3.5 |
| Qwen2.5-32B-IT | 10.0 | 13.0 | 7.4 | 26.0 | 8.6 | 2.0 | 11.2 |
| SimpleRL-32B | 22.1 | 13.9 | 8.3 | 25.5 | 9.4 | 3.7 | 13.8 |
| ORZ-32B | 24.2 | 26.3 | 10.9 | 16.1 | 12.2 | 1.1 | 15.1 |
| **MATH-12k** |  |  |  |  |  |  |  |
| → RLVR | 22.2 | 15.8 | 11.5 | 34.5 | 11.7 | 4.1 | 16.6 |
| → SvS | 30.3 | 21.7 | 13.8 | 42.7 | 20.1 | 3.3 | 22.0 |
| Δ | +8.1 | +5.9 | +2.3 | +8.2 | +8.4 | -0.8 | +5.4 |
| **DAPO-17k** |  |  |  |  |  |  |  |
| → RLVR | 28.8 | 30.0 | 14.0 | 39.6 | 17.9 | 4.8 | 22.5 |
| → SvS | 39.3 | 40.5 | 19.2 | 44.1 | 21.8 | 2.7 | 27.9 |
| Δ | +10.5 | +10.5 | +5.2 | +4.5 | +3.9 | -2.1 | +5.4 |

**Pass@32**

| Model | AIME24 | AIME25 | BAIME | Math24o | OlymE | OlymH | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| **Open-Source Models** |  |  |  |  |  |  |  |
| Qwen2.5-32B | 38.9 | 15.6 | 18.7 | 34.0 | 24.6 | 15.2 | 24.5 |
| Qwen2.5-32B-IT | 40.2 | 34.6 | 24.0 | 67.8 | 35.2 | 9.5 | 35.2 |
| SimpleRL-32B | 62.0 | 38.5 | 27.4 | 69.9 | 42.5 | 19.4 | 43.3 |
| ORZ-32B | 55.7 | 47.0 | 29.4 | 58.0 | 45.9 | 12.3 | 41.4 |
| **MATH-12k** |  |  |  |  |  |  |  |
| → RLVR | 47.4 | 36.4 | 29.2 | 66.0 | 36.2 | 16.4 | 38.6 |
| → SvS | 63.6 | 55.1 | 41.5 | 79.2 | 63.6 | 24.8 | 54.6 |
| Δ | +16.2 | +18.7 | +12.3 | +13.2 | +27.4 | +8.4 | +16.0 |
| **DAPO-17k** |  |  |  |  |  |  |  |
| → RLVR | 52.5 | 42.4 | 35.9 | 71.2 | 47.1 | 18.3 | 44.6 |
| → SvS | 70.8 | 65.2 | 45.9 | 76.5 | 43.4 | 16.7 | 53.1 |
| Δ | +18.3 | +22.8 | +10.0 | +5.3 | -3.7 | -1.6 | +8.5 |
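
For reference, Pass@k from n sampled completions per problem is commonly computed with the unbiased estimator of Chen et al. (2021); a minimal sketch is below. Whether this repo uses exactly this estimator is an assumption; note that Pass@1 averaged over 32 samples coincides with this estimator at k = 1.

```python
# Unbiased pass@k estimator from n samples with c correct (Chen et al., 2021):
# pass@k = 1 - C(n - c, k) / C(n, k). Its use in this repo is an assumption.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:  # every size-k subset must contain a correct sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```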

🚀 Quick Start

⚙️ Setup

We recommend using Conda to manage your environment. We use vLLM (0.10.0) to accelerate inference. Run the following commands to set up your environment:

git clone git@github.com:MasterVito/SvS.git && cd SvS
conda create -n svs python=3.10.16
conda activate svs
pip install torch==2.7.1 --index-url https://download.pytorch.org/whl/cu126 # CUDA 12.6 for example
pip install -r requirements.txt

🪁 Evaluation

We provide a script for inference. Simply configure model_name_or_path and data_path (by default, MATH-500, AIME24, and AIME25 are used for evaluation) in scripts/evaluation.sh, then run the following command:

bash scripts/evaluation.sh
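
If you would rather sample completions directly (e.g., to compute Pass@32 yourself with the estimator above), a minimal vLLM sketch follows; the model path, prompt, and sampling parameters are placeholders, not necessarily the defaults of scripts/evaluation.sh.

```python
# Minimal direct-sampling sketch with vLLM; model path and sampling settings
# are placeholders, not necessarily what scripts/evaluation.sh uses.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-32B-Instruct")  # or a local SvS checkpoint path
params = SamplingParams(n=32, temperature=1.0, top_p=0.95, max_tokens=8192)

outputs = llm.generate(["Solve: What is 17 * 24?"], params)
for out in outputs[0].outputs:  # 32 sampled completions for Pass@32-style eval
    print(out.text)
```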

⚡️ Training

We also open-source our complete training scripts for the community, and the training data used in our paper is provided in data. For example, to train the Qwen2.5-32B-Instruct model, run the following command:

bash scripts/run_svs_qwen2.5_32b.sh

You can also train the Qwen2.5-3B-Instruct and Llama-3.1-8B-Instruct models using the scripts provided in scripts.


☕️ Citation

If you find this repository helpful, please consider citing our paper:

@misc{liang2025svs,
      title={Beyond Pass@1: Self-Play with Variational Problem Synthesis Sustains RLVR}, 
      author={Xiao Liang and Zhongzhi Li and Yeyun Gong and Yelong Shen and Ying Nian Wu and Zhijiang Guo and Weizhu Chen},
      year={2025},
      eprint={2508.14029},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2508.14029}, 
}

🙏 Acknowledgement

We sincerely appreciate the outstanding work of veRL and SwS. The challenging problem augmentation strategy is inspired by SwS, and the training code is adapted from the veRL repository.

🌟 Star History

Star History Chart
