SPA (Self-Play Agent) is a reinforcement learning recipe for training Large Language Model (LLM) agents in out-of-distribution (OOD) environments. By equipping agents with an internal world model through self-play supervised finetuning (SFT), SPA enables better grounding, broader exploration, and more reliable generalization.
LLM agents often struggle when deployed in environments that differ from their pre-training distribution. Standard reinforcement learning tends to overfit to narrow solution paths, improving Pass@1 slightly but causing Pass@k to degrade. This reflects brittle exploration and weak generalization.
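For reference, Pass@k is usually estimated with the standard unbiased estimator from n sampled rollouts per task; the sketch below is a generic implementation of that formula, not necessarily the exact evaluation code used by SPA.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator: probability that at least one of k attempts,
    drawn without replacement from n rollouts containing c successes, solves the task."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one success
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 8 rollouts per task with 2 successes -> Pass@1 = 0.25, Pass@4 ≈ 0.79
print(pass_at_k(8, 2, 1), pass_at_k(8, 2, 4))
```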
SPA addresses this by introducing a world model with two key components:
- State Representation: structured abstractions (e.g., symbolic coordinates in Sokoban) that lower perplexity and make spatial relations explicit.
- Transition Modeling: predicting next states during self-play, enabling the agent to internalize environment dynamics before policy optimization.
This initialization makes subsequent PPO training more stable and effective.
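To make the transition-modeling objective concrete, here is a minimal, hypothetical sketch of how one self-play step could be turned into an SFT example. The tag layout mirrors the Sokoban trace shown later in this README; the function and field names (`build_sft_example`, `prompt`, `completion`) are illustrative, not SPA's actual data schema.

```python
def build_sft_example(observation: str, predicted_next_state: str, action: str) -> dict:
    """Pack one self-play step into a prompt/completion pair: given the current
    observation, the model learns to emit the predicted next state and the action."""
    prompt = "<think>\n<observation>\n" + observation + "\n</observation>\n"
    completion = (
        "<prediction>\n" + predicted_next_state + "\n</prediction>\n</think>\n"
        "<answer>" + action + "</answer>"
    )
    return {"prompt": prompt, "completion": completion}
```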
SPA significantly improves performance across challenging environments:
- Sokoban: Pass@1 success rate from 25.6% → 59.8%
- FrozenLake: Pass@1 success rate from 22.1% → 70.9%
- Sudoku: Pass@1 success rate from 0.0% → 59.6%
These improvements are consistent across different LLM families, including Qwen and LLaMA models.
SPA training consists of three stages:
- Data Generation: Collect self-play trajectories with `<observation>` and `<prediction>` states.
- Supervised Finetuning (SFT): Train the agent to predict next states and actions.
- PPO Optimization: Reinforce policies initialized with the learned world model.
This exploration-before-exploitation process allows agents to first learn environment rules, then optimize for rewards.
Clone RAGEN and place SPA inside:
git clone [email protected]:RAGEN-AI/RAGEN.git
cd RAGEN
git clone [email protected]:shiqichen17/SPA.gitFrom the RAGEN root directory:
bash scripts/setup_ragen.sh
pip uninstall -y torch torchvision torchaudio && pip install torch==2.6.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
pip uninstall -y vllm flash-attn flash_attn
pip install vllm==0.8.5.post1
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.3/flash_attn-2.7.3+cu12torch2.6cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
python -c "import torch; import flash_attn; import vllm; print('✅ All modules loaded successfully.')"Note: Use the versions above exactly to avoid runtime errors.
From the SPA directory:
cd SPA
bash run_spa.sh <CONFIG_NAME> [CKPT] [GENERATE_DATA]
Arguments:
- `CONFIG_NAME` (required): environment config (`_2_sokoban`, `_10_sudoku`, or `_3_frozen_lake`)
- `CKPT` (optional, default: `last`): checkpoint to use (`last` for the latest, or a step number such as `1000`)
- `GENERATE_DATA` (optional, default: `False`): set to `True` to run the full pipeline, `False` for PPO-only training
Examples:
# Full pipeline (generate data → SFT → PPO)
bash run_spa.sh _2_sokoban last True
# PPO training only with existing checkpoint
bash run_spa.sh _2_sokoban last False
# Use specific checkpoint step
bash run_spa.sh _10_sudoku 2000 False
This script runs the full pipeline (when GENERATE_DATA=True):
- Generate self-play training data
- Perform SFT world-model training
- Run PPO policy optimization
We provide pretrained models and training datasets for all three environments on Hugging Face:
| Environment | 📊 SFT Training Data | 🤖 Model (after self-play finetuning) |
|---|---|---|
| Sokoban | SPA-sokoban-data | SPA-sokoban-qwen2.5-1.5b-instruct |
| FrozenLake | SPA-frozenlake-data | SPA-frozenlake-qwen2.5-1.5b-instruct |
| Sudoku | SPA-sudoku-data | SPA-sudoku-qwen2.5-1.5b-instruct |
These resources allow you to:
- Use the pretrained models directly for inference or further finetuning
- Reproduce the SFT stage using the provided training data
- Skip data generation and start from the SFT or PPO stages
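As a minimal sketch of the first two points, the released artifacts can be pulled with the standard `transformers` and `datasets` APIs; the `<hf-namespace>` placeholder stands for whichever organization the table's Hub links point to and must be replaced.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

# Replace the placeholder namespace with the organization shown in the Hub links above.
model_id = "<hf-namespace>/SPA-sokoban-qwen2.5-1.5b-instruct"
data_id = "<hf-namespace>/SPA-sokoban-data"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)   # self-play finetuned checkpoint
sft_data = load_dataset(data_id)                         # SFT training trajectories
```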
Note: The FrozenLake and Sudoku datasets were filtered to remove trajectories that do not follow the expected format, while the Sokoban dataset contains unfiltered raw trajectories from self-play data generation.
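The sketch below illustrates what such format filtering could look like: keep only trajectories whose turns contain well-formed observation/prediction/answer tags. It is an assumption about the screening criterion, not the actual filtering script.

```python
import re

# A turn is "well-formed" if it contains the full observation → prediction → answer structure.
TURN_PATTERN = re.compile(
    r"<think>\s*<observation>.*?</observation>\s*"
    r"<prediction>.*?</prediction>\s*</think>\s*"
    r"<answer>.*?</answer>",
    re.DOTALL,
)

def follows_format(turn_text: str) -> bool:
    return TURN_PATTERN.search(turn_text) is not None

def filter_trajectories(trajectories: list[list[str]]) -> list[list[str]]:
    """Drop any trajectory that has at least one malformed turn."""
    return [traj for traj in trajectories if all(follows_format(t) for t in traj)]
```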
SPA supports a variety of environments integrated through RAGEN:
- Sokoban (grid-based spatial puzzles)
- FrozenLake (navigation under stochastic transitions)
- Sudoku (4×4 logical puzzles)
For Sokoban, SPA generates structured reasoning traces:
<think>
<observation>
######
#___O#
#__X_#
###P_#
###__#
######
Player (P) at (3,3); box (X) at (2,3); goal at (1,4).
</observation>
<prediction>
######
#___O#
#____#
###X_#
###P_#
######
</prediction>
</think>
<answer>Up</answer>
This explicit observation → prediction → action format grounds decision-making in environment dynamics.
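An illustrative helper (not part of the SPA codebase) shows how the predicted state and action can be parsed back out of such a response and checked against the environment's actual next observation:

```python
import re

def parse_prediction_and_action(response: str) -> tuple[str | None, str | None]:
    """Extract the predicted next grid and the chosen action from a formatted response."""
    pred = re.search(r"<prediction>\s*(.*?)\s*</prediction>", response, re.DOTALL)
    ans = re.search(r"<answer>\s*(.*?)\s*</answer>", response, re.DOTALL)
    return (pred.group(1) if pred else None, ans.group(1) if ans else None)

def prediction_is_grounded(predicted_state: str, actual_next_obs: str) -> bool:
    """True when the model's imagined next state matches what the environment returned."""
    return predicted_state.strip() == actual_next_obs.strip()
```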
Key configuration files are located in config/:
- `base.yaml`: core training settings
- `_2_sokoban.yaml`, `_3_frozen_lake.yaml`, etc.: environment-specific configs
- `envs.yaml`: environment registry
Important parameters:
- `model_path`: base model (e.g., `Qwen/Qwen2.5-1.5B-Instruct`)
- `trainer.total_training_steps`: number of PPO steps
- `agent_proxy.max_turn`: maximum turns per episode
- `es_manager.train.env_groups`: number of environment groups
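A hedged sketch of inspecting these values programmatically; the nesting below is inferred from the dotted parameter names and may not match the actual layout of `base.yaml`.

```python
import yaml

with open("config/base.yaml") as f:
    cfg = yaml.safe_load(f)

# Dotted names above are assumed to map to nested YAML keys.
print(cfg.get("model_path"))                                         # e.g. Qwen/Qwen2.5-1.5B-Instruct
print(cfg.get("trainer", {}).get("total_training_steps"))            # PPO steps
print(cfg.get("agent_proxy", {}).get("max_turn"))                    # max turns per episode
print(cfg.get("es_manager", {}).get("train", {}).get("env_groups"))  # environment groups
```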
If you use SPA in your work, please cite:
@misc{chen2025spa,
title={Internalizing World Models via Self-Play Finetuning for Agentic RL},
author={Shiqi Chen and Tongyao Zhu and Zian Wang and Jinghan Zhang and Kangrui Wang and Siyang Gao and Teng Xiao and Yee Whye Teh and Junxian He and Manling Li},
year={2025},
eprint={2510.15047},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2510.15047},
}
This project is licensed under the Apache 2.0 License. See the LICENSE file for details.
SPA is built on top of the RAGEN framework, extending it with explicit world-model pretraining for improved RL scalability.