
WillDreamer/ARL-Arena


🤖 ARLArena: A Unified Framework for Stable Agentic Reinforcement Learning


Abstract

Agentic reinforcement learning (ARL) has rapidly gained attention as a promising paradigm for training agents to solve complex, multi-step interactive tasks. Despite encouraging early results, ARL remains highly unstable, often leading to training collapse. This instability limits scalability to larger environments and longer interaction horizons, and constrains systematic exploration of algorithmic design choices. In this paper, we first propose ARLArena, a stable training recipe and systematic analysis framework that examines training stability in a controlled and reproducible setting. ARLArena first constructs a clean and standardized testbed. Then, we decompose policy gradient into four core design dimensions and assess the performance and stability of each dimension. Through this fine-grained analysis, we distill a unified perspective on ARL and propose SAMPO, a stable agentic policy optimization method designed to mitigate the dominant sources of instability in ARL. Empirically, SAMPO achieves consistently stable training and strong performance across diverse agentic tasks. Overall, this study provides a unifying policy gradient perspective for ARL and offers practical guidance for building stable and reproducible LLM-based agent training pipelines.

ARLArena Framework

Figure 1: Overview of our framework

Our Framework Design

  • ARLArena analyzes in depth the problems that existing agentic RL suffers from, viewed from the perspective of policy gradient.
  • ARLArena compares existing agentic RL algorithms and provides a systematic discussion and analysis across multiple dimensions.
  • ARLArena provides experimental results and findings on multiple agentic tasks.

🔥 Key Features

  • ✅ Support Training Multi-turn Math+Code Interpreter Agents
  • ✅ Support Training Multi-turn Embodied Agents
  • ✅ Support Training Multi-turn Multi-modal Game Agents
  • ✅ Support Training Multi-turn Web Agents
  • ✅ Support Training Multi-turn Search Agents

🔧 Upcoming Features and Changes

  • ➡️ Support Software Engineering Agents

📅 TODO

  • Cross-domain agentic reasoning
  • Multiple tool integration reasoning

💡 Getting Started

ARLArena is based on the following main dependencies:

Python=3.11, VeRL=0.4.0, PyTorch=2.6.0, and vLLM=0.8.5
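
Before launching a training script, it can help to confirm that the installed packages match the versions listed above. The following is a minimal sketch, not part of the repository: the distribution names `verl`, `torch`, and `vllm` are assumptions about how these packages are published, and `check_versions` is a hypothetical helper.

```python
from importlib import metadata
import sys

# Versions taken from this README. The distribution names are assumptions.
EXPECTED = {"verl": "0.4.0", "torch": "2.6.0", "vllm": "0.8.5"}


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def check_versions(expected=EXPECTED, lookup=installed_version):
    """Map each distribution name to (expected, found, ok)."""
    report = {}
    for name, want in expected.items():
        found = lookup(name)
        # Local build tags such as "2.6.0+cu124" still count as a match.
        ok = found is not None and found.split("+")[0] == want
        report[name] = (want, found, ok)
    return report


if __name__ == "__main__":
    major, minor = sys.version_info[:2]
    print(f"python: expected 3.11, found {major}.{minor}")
    for name, (want, found, ok) in check_versions().items():
        status = "ok" if ok else "MISMATCH"
        print(f"{name}: expected {want}, found {found} [{status}]")
```

Run it inside the environment created by the relevant `prepare_all_*.sh` script; a `MISMATCH` line points to a package worth reinstalling before training.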

🚀 Existing Support

🛒 Web Agents

# 1. Build the webshop environments
bash prepare_all_web.sh

# 2. Run the demo code with:
conda activate agentrl_web.sh
bash examples/shop_agent_trainer/train_xxx.sh

🤖 Embodied Agents

# 1. Build the environments
bash prepare_all_embody.sh

# 2. Run the demo code with:
bash examples/world_agent_trainer/train_xxx.sh

🧮 Math+CI

  1. We use Sandbox Fusion as an asynchronous code interpreter (CI). Follow its Guidance to set up and run the CI.

  2. The training datasets are Math3-5 from SimpleRL in datasets.

# 3. Install the requirements
bash prepare_all_science.sh

# 4. Run the demo code with:
bash examples/simpletir_trainer/train_xxx.sh

🎮 OpenAI Game Agents

# 1. Install the requirements
bash prepare_all_game.sh

# 2. Run the demo code with:
bash examples/game_agent_trainer/train_xxx.sh

🕸️ Search Agents

# 1. Build the RAG server environments
bash prepare_all_search.sh


# 2. Run the demo code with:
bash examples/search_agent_trainer/train_xxx.sh

🌊 Easy Extension

🔹 All methods used are in recipe; you can wrap the verl worker to add your own code to our codebase. A folder under recipe can represent either one method applied to different tasks or a series of methods for one task. You can refer to Easy Extension for examples.

🔹 All environments used are in agent_system; you can wrap your env to add it to our codebase.

🔹 Add specific dependencies to requirements_xxx.txt

🔹 Feel free to add folders for third-party tools, e.g., a sandbox for code execution.
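
As an illustration of what wrapping an env can look like, here is a minimal sketch. It does not reproduce the actual agent_system interface: `CountdownEnv`, `TextEnvWrapper`, and the `reset`/`step` signatures are all hypothetical, chosen only to show the pattern of turning environment observations into text prompts and parsing text actions back, with a turn limit for multi-turn rollouts.

```python
class CountdownEnv:
    """Toy stand-in environment (hypothetical): reach exactly zero by subtracting."""

    def __init__(self, start=3):
        self.start = start
        self.value = start

    def reset(self):
        self.value = self.start
        return self.value

    def step(self, action):
        self.value -= action
        done = self.value <= 0
        reward = 1.0 if self.value == 0 else 0.0
        return self.value, reward, done


class TextEnvWrapper:
    """Hypothetical wrapper: exposes an env through text prompts and text actions."""

    def __init__(self, env, max_turns=5):
        self.env = env
        self.max_turns = max_turns
        self.turn = 0
        self.last_obs = None

    def _prompt(self, obs):
        return (f"Turn {self.turn}/{self.max_turns}. Current value: {obs}. "
                "Reply with an integer to subtract.")

    def reset(self):
        self.turn = 0
        self.last_obs = self.env.reset()
        return self._prompt(self.last_obs)

    def step(self, text_action):
        self.turn += 1
        try:
            action = int(text_action.strip())
        except ValueError:
            # Unparseable model output: the state is unchanged and no reward is given.
            reward, done = 0.0, False
        else:
            self.last_obs, reward, done = self.env.step(action)
        # End the episode when the env finishes or the turn budget is used up.
        done = done or self.turn >= self.max_turns
        return self._prompt(self.last_obs), reward, done
```

A rollout loop would call `reset()` once, then feed each model reply to `step()` until `done` is true. The real base classes and registration steps are defined by the code in agent_system and should be checked there.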

📊 Further Details


Figure 2: A summary of policy optimization methods studied in ARLArena.


Figure 3: Performance comparison of policy optimization methods across four agentic tasks, evaluated on the SFT version of Qwen3-4B.


Figure 4: Training curves on ALFWorld (left) and Sokoban (right).

✍️ Citation

@article{wang2026arlarena,
  title={ARLArena: A Unified Framework for Stable Agentic Reinforcement Learning},
  author={Wang, Xiaoxuan and Zhang, Han and Wang, Haixin and Shi, Yidan and Li, Ruoyan and Han, Kaiqiao and Tong, Chenyi and Deng, Haoran and Sun, Renliang and Taylor, Alexander and others},
  journal={arXiv preprint arXiv:2602.21534},
  year={2026}
}


🎆 Awesome work for reference

  • TinyZero: a reproduction of the DeepSeek R1 Zero recipe for reasoning tasks
  • SkyThought: RL training for Sky-T1-7B by the NovaSky AI team
  • simpleRL-reason: SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild
  • Easy-R1: Multi-modal RL training framework
  • OpenManus-RL: LLM agent RL tuning framework for multiple agent environments
  • rllm: async RL training with verl-pipeline
  • PRIME: Process reinforcement through implicit rewards
  • RAGEN: a general-purpose reasoning agent training framework
  • Logic-RL: a reproduction of DeepSeek R1 Zero on the 2K Tiny Logic Puzzle Dataset
  • Search-R1: RL with reasoning and searching (tool-call) interleaved LLMs
  • DeepRetrieval: RL Training of Search Agent with Search/Retrieval Outcome
  • ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning
  • Code-R1: Reproducing R1 for Code with Reliable Rewards
  • Skywork-OR1: Skywork open reasoner series
  • ToRL: Scaling tool-integrated RL
  • verl-agent: A scalable training framework for long-horizon LLM/VLM agents, along with a new algorithm, GiGPO
  • PF-PPO: Policy Filtration for PPO based on the reliability of reward signals for more efficient and robust RLHF
  • GUI-R1: A Generalist R1-style Vision-Language Action Model for GUI Agents
  • DeepResearcher: Scaling deep research via reinforcement learning in real-world environments
  • VAGEN: Training VLM agents with multi-turn reinforcement learning
  • ReTool: reinforcement learning for strategic tool use in LLMs
  • Seed-Coder: RL training of Seed-Coder boosts performance on competitive programming
  • all-hands/openhands-lm-32b-v0.1: A strong, open coding agent model, trained with multi-turn fine-tuning
  • RM-R1: RL training of reasoning reward models
  • Absolute Zero Reasoner: A self-play framework for reasoning that uses no human-curated data
  • LUFFY: Learning to Reason under Off-Policy Guidance
  • verl-tool: A unified and easy-to-extend tool-agent training framework based on verl
  • DeepMath: DeepMath-103K data and series models for math reasoning

About

[ICML2026] ARLArena
