Agentic reinforcement learning (ARL) has rapidly gained attention as a promising paradigm for training agents to solve complex, multi-step interactive tasks. Despite encouraging early results, ARL remains highly unstable, often leading to training collapse. This instability limits scalability to larger environments and longer interaction horizons, and constrains systematic exploration of algorithmic design choices. In this paper, we propose ARLArena, a stable training recipe and systematic analysis framework that examines training stability in a controlled and reproducible setting. ARLArena first constructs a clean and standardized testbed. We then decompose the policy gradient into four core design dimensions and assess the performance and stability of each dimension. Through this fine-grained analysis, we distill a unified perspective on ARL and propose SAMPO, a stable agentic policy optimization method designed to mitigate the dominant sources of instability in ARL. Empirically, SAMPO achieves consistently stable training and strong performance across diverse agentic tasks. Overall, this study provides a unifying policy gradient perspective for ARL and offers practical guidance for building stable and reproducible LLM-based agent training pipelines.
Figure 1: Overview of our framework
- ARLArena analyzes in depth, from a policy gradient perspective, the instabilities that current Agentic RL suffers from.
- ARLArena comprehensively compares existing Agentic RL algorithms and provides a systematic discussion and analysis across multiple dimensions.
- ARLArena provides experimental results and findings on multiple agentic tasks.
- ✅ Support Training Multi-turn Math+Code Interpreter Agents
- ✅ Support Training Multi-turn Embodied Agents
- ✅ Support Training Multi-turn Multi-modal Game Agents
- ✅ Support Training Multi-turn Web Agents
- ✅ Support Training Multi-turn Search Agents
- ➡️ Support Software Engineering Agents
- Cross-domain agentic reasoning
- Multiple tool integration reasoning
ARLArena is based on the following main dependencies:
Python=3.11, VeRL=0.4.0, PyTorch=2.6.0, and vLLM=0.8.5

# 1. Build the webshop environments
bash prepare_all_web.sh
# 2. Run the demo code with:
conda activate agentrl_web.sh
bash examples/shop_agent_trainer/train_xxxx.sh

# 1. Build the environments
bash prepare_all_embody.sh
# 2. Run the demo code with:
bash examples/world_agent_trainer/train_xxx.sh
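
Before launching any of the trainers, it may help to confirm that the installed packages match the versions pinned in the dependency list. The following is a minimal sketch and not part of ARLArena: it assumes the installed distribution names are `verl`, `torch`, and `vllm`, and `check_pins` is a hypothetical helper introduced only for illustration.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions taken from the dependency list; the distribution names
# ("verl", "torch", "vllm") are assumptions about how they are installed.
PINNED = {"verl": "0.4.0", "torch": "2.6.0", "vllm": "0.8.5"}


def check_pins(pins, lookup=version):
    """Return {package: (expected, installed_or_None, ok)} for each pin."""
    report = {}
    for name, expected in pins.items():
        try:
            installed = lookup(name)
        except PackageNotFoundError:
            installed = None
        # Drop a local build tag such as "+cu124" before comparing.
        ok = installed is not None and installed.split("+")[0] == expected
        report[name] = (expected, installed, ok)
    return report


if __name__ == "__main__":
    for name, (expected, installed, ok) in check_pins(PINNED).items():
        status = "ok" if ok else "MISMATCH"
        print(f"{name}: expected {expected}, found {installed} [{status}]")
```

Run it inside the conda environment prepared for the task you plan to train. A missing package is reported as `found None` rather than raising an error.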
- We use Sandbox Fusion as an asynchronous code interpreter. You can follow the Guidance to run the CI.
- The training datasets are Math3-5 from SimpleRL in datasets.
# 3. Install the requirements
bash prepare_all_science.sh
# 4. Run the demo code with:
bash examples/simpletir_trainer/train_xxx.sh

# 1. Install the requirements
bash prepare_all_game.sh
# 2. Run the demo code with:
bash examples/game_agent_trainer/train_xxx.sh

# 1. Build the RAG server environments
bash prepare_all_search.sh
# 2. Run the demo code with:
bash examples/search_agent_trainer/train_xxx.sh

🔹 All of the methods used are in recipe; you can wrap the verl worker for your code to join our codebase. A folder under recipe can represent either one method applied to different tasks or a series of methods for one task. You can refer to Easy Extension for examples.
🔹 All of the environments used are in agent_system; you can wrap the env for your code to join our codebase.
🔹 Add specific dependencies to requirements_xxx.txt
🔹 Feel free to add the folder of the third-party tools, e.g., sandbox for code implementation.
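
As a rough illustration of what "wrapping the env" can look like, here is a minimal sketch in plain Python. It does not use the actual agent_system interface, which is not shown in this README: `ToyCounterEnv`, `TextEnvWrapper`, and the `reset()`/`step()` signatures are all hypothetical names chosen for the example. The idea is only that an existing environment with its own method names gets adapted to a uniform multi-turn text interface with a turn limit.

```python
from dataclasses import dataclass, field


class ToyCounterEnv:
    """Stand-in for a user's existing environment (hypothetical)."""

    def __init__(self, target=3):
        self.target = target
        self.count = 0

    def restart(self):
        self.count = 0
        return f"count={self.count}, target={self.target}"

    def act(self, command):
        # Only the "inc" command changes the state; anything else is a no-op.
        if command == "inc":
            self.count += 1
        solved = self.count >= self.target
        return f"count={self.count}, target={self.target}", solved


@dataclass
class TextEnvWrapper:
    """Adapts ToyCounterEnv to an assumed reset()/step() text interface."""

    env: ToyCounterEnv
    max_turns: int = 10
    turns: int = field(default=0, init=False)

    def reset(self):
        self.turns = 0
        return self.env.restart()

    def step(self, action_text):
        # Normalize the raw model output before passing it to the env.
        self.turns += 1
        obs, solved = self.env.act(action_text.strip().lower())
        reward = 1.0 if solved else 0.0
        # The episode ends on success or when the turn budget is spent.
        done = solved or self.turns >= self.max_turns
        return obs, reward, done, {"turns": self.turns}


env = TextEnvWrapper(ToyCounterEnv(target=2), max_turns=5)
obs = env.reset()
for action in ["inc", "noop", "INC "]:
    obs, reward, done, info = env.step(action)
```

Keeping the adapter separate from the underlying environment means the original environment code stays untouched; only the thin wrapper would need to change to match whatever interface the codebase expects.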
Figure 2: A summary of policy optimization methods studied in ARLArena.
Figure 3: Performance comparison of policy optimization methods across four agentic tasks, evaluated on the SFT version of Qwen3-4B.
Figure 4: Training curves on ALFWorld (left) and Sokoban (right).
@article{wang2026arlarena,
title={ARLArena: A Unified Framework for Stable Agentic Reinforcement Learning},
author={Wang, Xiaoxuan and Zhang, Han and Wang, Haixin and Shi, Yidan and Li, Ruoyan and Han, Kaiqiao and Tong, Chenyi and Deng, Haoran and Sun, Renliang and Taylor, Alexander and others},
journal={arXiv preprint arXiv:2602.21534},
year={2026}
}

- TinyZero: a reproduction of the DeepSeek R1 Zero recipe for reasoning tasks
- SkyThought: RL training for Sky-T1-7B by NovaSky AI team.
- simpleRL-reason: SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild
- Easy-R1: Multi-modal RL training framework
- OpenManus-RL: LLM Agents RL tuning framework for multiple agent environments.
- rllm: async RL training with verl-pipeline
- PRIME: Process reinforcement through implicit rewards
- RAGEN: a general-purpose reasoning agent training framework
- Logic-RL: a reproduction of DeepSeek R1 Zero on 2K Tiny Logic Puzzle Dataset.
- Search-R1: RL with reasoning and searching (tool-call) interleaved LLMs
- DeepRetrieval: RL Training of Search Agent with Search/Retrieval Outcome
- ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning
- Code-R1: Reproducing R1 for Code with Reliable Rewards
- Skywork-OR1: Skywork open reasoner series
- ToRL: Scaling tool-integrated RL
- verl-agent: A scalable training framework for long-horizon LLM/VLM agents, along with a new algorithm GiGPO
- PF-PPO: Policy Filtration for PPO based on the reliability of reward signals for more efficient and robust RLHF.
- GUI-R1: A Generalist R1-style Vision-Language Action Model For GUI Agents
- DeepResearcher: Scaling deep research via reinforcement learning in real-world environments
- VAGEN: Training VLM agents with multi-turn reinforcement learning
- ReTool: reinforcement learning for strategic tool use in LLMs
- Seed-Coder: RL training of Seed-Coder boosts performance on competitive programming
- all-hands/openhands-lm-32b-v0.1: A strong, open coding agent model, trained with multi-turn fine-tuning
- RM-R1: RL training of reasoning reward models
- Absolute Zero Reasoner: A self-play framework for reasoning that uses no human-curated data
- LUFFY: Learning to Reason under Off-Policy Guidance
- verl-tool: A unified and easy-to-extend tool-agent training framework based on verl
- DeepMath: DeepMath-103K data and series models for math reasoning



