🤖 ARLArena: A Unified Framework for Stable Agentic Reinforcement Learning

Abstract

Agentic reinforcement learning (ARL) has rapidly gained attention as a promising paradigm for training agents to solve complex, multi-step interactive tasks. Despite encouraging early results, ARL remains highly unstable, often leading to training collapse. This instability limits scalability to larger environments and longer interaction horizons, and constrains systematic exploration of algorithmic design choices. In this paper, we propose ARLArena, a stable training recipe and systematic analysis framework that examines training stability in a controlled and reproducible setting. ARLArena first constructs a clean and standardized testbed. Then, we decompose policy gradient into four core design dimensions and assess the performance and stability of each dimension. Through this fine-grained analysis, we distill a unified perspective on ARL and propose SAMPO, a stable agentic policy optimization method designed to mitigate the dominant sources of instability in ARL. Empirically, SAMPO achieves consistently stable training and strong performance across diverse agentic tasks. Overall, this study provides a unifying policy gradient perspective for ARL and offers practical guidance for building stable and reproducible LLM-based agent training pipelines.

Figure 1: Overview of our framework

Our Framework Design

ARLArena analyzes the sources of instability in agentic RL from a policy-gradient perspective.
ARLArena compares existing agentic RL algorithms and provides a systematic analysis across multiple design dimensions.
ARLArena reports experimental results and findings on multiple agentic tasks.

🔥 Key Features

✅ Support Training Multi-turn Math+Code Interpreter Agents
✅ Support Training Multi-turn Embodied Agents
✅ Support Training Multi-turn Multi-modal Game Agents
✅ Support Training Multi-turn Web Agents
✅ Support Training Multi-turn Search Agents

🔧 Upcoming Features and Changes

➡️ Support Software Engineering Agents

📅 TODO

Cross-domain agentic reasoning
Multiple tool integration reasoning

💡 Getting Started

ARLArena is based on the following main dependencies:

Python=3.11, VeRL=0.4.0, PyTorch=2.6.0, and vLLM=0.8.5
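Before building any environment, it can help to confirm that the pinned versions are the ones actually installed. The following is a minimal sketch, not part of ARLArena: it assumes the packages are installed under the distribution names `verl`, `torch`, and `vllm`, and the helper `check_versions` is a hypothetical name introduced here for illustration.

```python
from importlib import metadata
import sys

# Pins copied from the line above; the distribution names are assumptions.
EXPECTED = {"verl": "0.4.0", "torch": "2.6.0", "vllm": "0.8.5"}
EXPECTED_PYTHON = (3, 11)


def check_versions(expected=EXPECTED, lookup=metadata.version):
    """Return {name: (expected, found)} for every mismatched or missing package."""
    problems = {}
    for name, want in expected.items():
        try:
            found = lookup(name)
        except metadata.PackageNotFoundError:
            found = None
        # A local build tag such as "2.6.0+cu124" still counts as a match.
        if found is None or found.split("+")[0] != want:
            problems[name] = (want, found)
    return problems


if __name__ == "__main__":
    if sys.version_info[:2] != EXPECTED_PYTHON:
        print(f"Python {sys.version_info.major}.{sys.version_info.minor} found, 3.11 expected")
    for name, (want, found) in check_versions().items():
        print(f"{name}: expected {want}, found {found or 'not installed'}")
```

Running the script prints nothing when every pin matches, and one line per mismatch otherwise.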

🚀 Existing Support

🛒 Web Agents

# 1. Build the webshop environments
bash prepare_all_web.sh

# 2. Run the demo code with:
conda activate agentrl_web.sh
bash examples/shop_agent_trainer/train_xxxx.sh

🤖 Embodied Agents

# 1. Build the environments
bash prepare_all_embody.sh

# 2. Run the demo code with:
bash examples/world_agent_trainer/train_xxx.sh

🧮 Math+CI

We use Sandbox Fusion as an asynchronous code interpreter (CI). You can follow the Guidance to run the CI.
The training datasets are Math3-5 from SimpleRL, provided in datasets.

# 3. Install the requirements
bash prepare_all_science.sh

# 4. Run the demo code with:
bash examples/simpletir_trainer/train_xxx.sh

🎮 OpenAI Game Agents

# 1. Install the requirements
bash prepare_all_game.sh

# 2. Run the demo code with:
bash examples/game_agent_trainer/train_xxx.sh

🕸️ Search Agents

# 1. Build the RAG server environments
bash prepare_all_search.sh


# 2. Run the demo code with:
bash examples/search_agent_trainer/train_xxx.sh
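The commands above use `train_xxx.sh` as a placeholder for a concrete script name. As a rough sketch, the snippet below lists the training scripts that actually exist in a checkout. It assumes only the `examples/<trainer>/train_*.sh` layout shown in the commands; `list_train_scripts` is a hypothetical helper, not something the repository ships.

```python
from pathlib import Path


def list_train_scripts(repo_root="."):
    """Return sorted relative paths matching examples/<trainer>/train_*.sh."""
    root = Path(repo_root)
    return sorted(
        path.relative_to(root).as_posix()
        for path in root.glob("examples/*/train_*.sh")
    )


if __name__ == "__main__":
    # Print ready-to-run commands for every training script found.
    for script in list_train_scripts():
        print(f"bash {script}")
```

Run it from the repository root and substitute one of the printed paths for the `train_xxx.sh` placeholder.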

🌊 Easy Extension

🔹 All of the methods used are in recipe; you can wrap the verl worker for your code to add it to our codebase. A folder under recipe can hold either one method applied to different tasks or a series of methods for one task. You can refer to Easy Extension for examples.

🔹 All of the environments used are in agent_system; you can wrap the env for your code to add it to our codebase.

🔹 Add specific dependencies to requirements_xxx.txt

🔹 Feel free to add a folder for third-party tools, e.g., sandbox for code implementation.
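To illustrate what wrapping an environment can look like, here is a minimal, self-contained sketch. The real base classes and method signatures in agent_system are not shown in this README, so everything below (`GuessNumberEnv`, `MultiTurnEnvWrapper`, the `reset`/`step` return shapes) is hypothetical and only demonstrates the general pattern: a wrapper that adds a turn budget and tolerates malformed text actions around a raw task environment.

```python
from dataclasses import dataclass, field


class GuessNumberEnv:
    """Toy environment standing in for your own task (hypothetical)."""

    def __init__(self, target=7):
        self.target = target

    def reset(self):
        return "Guess an integer between 1 and 10."

    def step(self, action):
        guess = int(action)  # raises ValueError on non-integer text
        if guess == self.target:
            return "Correct.", 1.0, True
        hint = "higher" if guess < self.target else "lower"
        return f"Try {hint}.", 0.0, False


@dataclass
class MultiTurnEnvWrapper:
    """Adds a turn budget and invalid-action handling around a raw env."""

    env: object
    max_turns: int = 5
    turns: int = field(default=0, init=False)

    def reset(self):
        self.turns = 0
        return self.env.reset()

    def step(self, text_action):
        self.turns += 1
        try:
            obs, reward, done = self.env.step(text_action.strip())
        except ValueError:
            # Malformed model output: no reward, but the turn is still spent.
            obs, reward, done = "Invalid action, reply with an integer.", 0.0, False
        # End the episode once the turn budget is exhausted.
        done = done or self.turns >= self.max_turns
        return obs, reward, done
```

A rollout loop would call `reset()` once, then feed each model response to `step()` until `done` is true. Adapting this to ARLArena means replacing the hypothetical interface with whatever the existing environments in agent_system expose.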

📊 Further Details

Figure 2: A summary of policy optimization methods studied in ARLArena.

Figure 3: Performance comparison of policy optimization methods across four agentic tasks, evaluated on the SFT version of Qwen3-4B.

Figure 4: Training curves on ALFWorld (left) and Sokoban (right).

✍️ Citation

@article{wang2026arlarena,
  title={ARLArena: A Unified Framework for Stable Agentic Reinforcement Learning},
  author={Wang, Xiaoxuan and Zhang, Han and Wang, Haixin and Shi, Yidan and Li, Ruoyan and Han, Kaiqiao and Tong, Chenyi and Deng, Haoran and Sun, Renliang and Taylor, Alexander and others},
  journal={arXiv preprint arXiv:2602.21534},
  year={2026}
}

🌟 Star History

🎆 Awesome work for reference

TinyZero: a reproduction of DeepSeek R1 Zero recipe for reasoning tasks
SkyThought: RL training for Sky-T1-7B by NovaSky AI team.
simpleRL-reason: SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild
Easy-R1: Multi-modal RL training framework
OpenManus-RL: LLM Agents RL tuning framework for multiple agent environments.
rllm: async RL training with verl-pipeline
PRIME: Process reinforcement through implicit rewards
RAGEN: a general-purpose reasoning agent training framework
Logic-RL: a reproduction of DeepSeek R1 Zero on 2K Tiny Logic Puzzle Dataset.
Search-R1: RL with reasoning and searching (tool-call) interleaved LLMs
DeepRetrieval: RL Training of Search Agent with Search/Retrieval Outcome
ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning
Code-R1: Reproducing R1 for Code with Reliable Rewards
Skywork-OR1: Skywork open reasoner series
ToRL: Scaling tool-integrated RL
verl-agent: A scalable training framework for long-horizon LLM/VLM agents, along with a new algorithm GiGPO
PF-PPO: Policy Filtration for PPO based on the reliability of reward signals for more efficient and robust RLHF.
GUI-R1: GUI-R1: A Generalist R1-style Vision-Language Action Model For GUI Agents
DeepResearcher: Scaling deep research via reinforcement learning in real-world environments
VAGEN: Training VLM agents with multi-turn reinforcement learning
ReTool: ReTool: reinforcement learning for strategic tool use in LLMs
Seed-Coder: RL training of Seed-Coder boosts performance on competitive programming
all-hands/openhands-lm-32b-v0.1: A strong, open coding agent model, trained with multi-turn fine-tuning
RM-R1: RL training of reasoning reward models
Absolute Zero Reasoner: A no human curated data self-play framework for reasoning
LUFFY: Learning to Reason under Off-Policy Guidance
verl-tool: A unified and easy-to-extend tool-agent training framework based on verl
DeepMath: DeepMath-103K data and series models for math reasoning
