Problem: Hesitation is defeat! Multi-turn RL for LLM agents is powerful, but critically limited by poor exploration.
Key idea: Training fails mostly when agents repeat low-value actions or ignore task-level uncertainty.
Our method: T2PO directly controls exploration at both the token and turn level using uncertainty signals, greatly improving stability and sample efficiency.
| 🔹 Token-level: T2PO tracks marginal uncertainty and triggers interventions when it dips below a threshold. |
| 🔸 Turn-level: T2PO resamples turns with negligible exploration progress, preventing wasted updates. |
| 📊 Benchmarks: Substantial gains on WebShop, ALFWorld, SearchQA and more—significantly better stability and learning efficiency. |
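
The two control levels above can be illustrated with a small sketch. This is not the T2PO implementation: per-step Shannon entropy stands in for the "marginal uncertainty" signal, the change in a per-turn uncertainty score stands in for "exploration progress", and every function name and threshold here is hypothetical. The sketch only flags where an intervention or resample would be triggered; what the intervention does is left open.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def low_uncertainty_steps(step_probs, threshold):
    """Indices of decoding steps whose entropy dips below `threshold`.

    Assumption: per-step entropy is a stand-in for the token-level
    uncertainty signal; the real signal may be defined differently.
    """
    return [t for t, probs in enumerate(step_probs)
            if token_entropy(probs) < threshold]


def turns_to_resample(turn_uncertainty, min_progress):
    """Indices of turns whose uncertainty moved by less than `min_progress`
    relative to the previous turn (a stand-in for negligible progress)."""
    return [i for i in range(1, len(turn_uncertainty))
            if abs(turn_uncertainty[i] - turn_uncertainty[i - 1]) < min_progress]


# Token level: the second step is nearly deterministic, so it is flagged.
step_probs = [[0.25, 0.25, 0.25, 0.25],
              [0.97, 0.01, 0.01, 0.01],
              [0.5, 0.5, 0.0, 0.0]]
flagged_steps = low_uncertainty_steps(step_probs, threshold=0.5)

# Turn level: turn 2 barely changes the uncertainty score, so it is flagged.
flagged_turns = turns_to_resample([1.2, 0.9, 0.895, 0.4], min_progress=0.05)
```

With these toy inputs, `flagged_steps` contains only step 1 and `flagged_turns` contains only turn 2.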
- ✅ Support Training Multi-turn Embodied Agents
- ✅ Support Training Multi-turn Search Agents
- ✅ Support Training Multi-turn Multi-modal Game Agents
- ✅ Support Training Multi-turn Web Agents
- ✅ Support Evaluating Commercial LLMs as Agents
Our work is based on the following main dependencies:
Python=3.11, VeRL=0.4.0, PyTorch=2.6.0, and vLLM=0.8.5

👉 Installation guide (optional)
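To check whether an existing environment already matches the pinned versions, a small script along these lines may help. It is a sketch, not part of the repository: the distribution names `verl`, `torch`, and `vllm` are assumed to be the installed package names, and local build suffixes such as `+cu124` are ignored when comparing.

```python
import sys
from importlib import metadata

# Pins copied from the dependency list; distribution names are assumed.
PINNED = {"verl": "0.4.0", "torch": "2.6.0", "vllm": "0.8.5"}


def check_environment(pins=PINNED, python=(3, 11)):
    """Return {name: (expected, found)} for each mismatch; empty means OK."""
    problems = {}
    current = tuple(sys.version_info[:2])
    if current != tuple(python):
        problems["python"] = (".".join(map(str, python)),
                              ".".join(map(str, current)))
    for name, expected in pins.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed at all
        # Strip local build tags like "+cu124" before comparing.
        if found is None or found.split("+")[0] != expected:
            problems[name] = (expected, found)
    return problems


if __name__ == "__main__":
    for name, (expected, found) in check_environment().items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

An empty result means the interpreter and all three packages match the pins.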
# (Optional) Install conda
bash set_conda.sh
# Install main dependencies
bash setup_env.sh
# Install extra requirements for specific tasks
conda activate verl
pip install -r requirements_xxx.txt

🤖 Embodied Agents
# 1. Build the environments
bash prepare_all_embody.sh
# 2. Run the demo code
conda activate agentrl_embody
bash examples/world_agent_trainer/train_xxx.sh

🛒 Web Agents
# 1. Build the webshop environments
bash prepare_all_web.sh
# 2. Run the demo code
conda activate agentrl_web
bash examples/shop_agent_trainer/train_xxxx.sh

🕸️ Search Agents
# 1. Build the RAG server environments
bash prepare_all_search.sh
# 2. Run the demo code
conda activate agentrl_search
bash examples/search_agent_trainer/train_xxx.sh

🎮 Multi-modal Game Agents
# 1. Install the requirements
bash prepare_all_game.sh
# 2. Run the demo code
bash examples/game_agent_trainer/train_xxx.sh

✨ Extensible by Design:
- All task recipes live in `recipe`. Wrap the VERL worker to plug in your own method. [usage]
- Add new environments under `agent_system`.
- Extra dependencies go into `requirements_xxx.txt`.
- Third-party tools? Place them in `AgentRL/sandbox`.
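
As a rough picture of what a new multi-turn environment might look like, here is a toy text environment. The repository's actual base classes and registration hooks under `agent_system` are not described in this README, so the `StepResult` type and the `reset`/`step` interface below are hypothetical and would need to be adapted to the real ones.

```python
from dataclasses import dataclass


@dataclass
class StepResult:
    observation: str  # text shown to the agent on the next turn
    reward: float
    done: bool


class GuessNumberEnv:
    """Toy multi-turn text environment; the interface is hypothetical."""

    def __init__(self, target: int, max_turns: int = 5):
        self.target = target
        self.max_turns = max_turns
        self.turn = 0

    def reset(self) -> str:
        """Start a new episode and return the first observation."""
        self.turn = 0
        return "Guess an integer between 1 and 10."

    def step(self, action: str) -> StepResult:
        """Consume one agent action and return the next observation."""
        self.turn += 1
        out_of_turns = self.turn >= self.max_turns
        try:
            guess = int(action.strip())
        except ValueError:
            # Malformed actions cost a turn but give no reward.
            return StepResult("Please answer with an integer.", 0.0, out_of_turns)
        if guess == self.target:
            return StepResult("Correct.", 1.0, True)
        hint = "higher" if guess < self.target else "lower"
        return StepResult(f"Try a {hint} number.", 0.0, out_of_turns)
```

Each `step` call corresponds to one agent turn, and the episode ends on a correct guess or when the turn budget runs out.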
📈 MLflow analysis setup
# Install requirements
pip install mlflow
# Start server
mlflow server \
--host 0.0.0.0 --port 5000 \
--backend-store-uri sqlite:////tmp/mlruns.db \
--default-artifact-root /tmp/mlruns
export MLFLOW_TRACKING_URI=http://127.0.0.1:5000
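Before launching training, it can be useful to confirm that the tracking server started above is reachable. The sketch below uses only the standard library and queries the `/health` endpoint that the MLflow tracking server exposes; the function name and the default timeout are illustrative choices, not part of this repository.

```python
import os
import urllib.request


def tracking_server_is_up(uri=None, timeout=2.0):
    """Return True if the MLflow server at `uri` answers /health with 200."""
    # Fall back to the exported variable, then to the address used above.
    uri = uri or os.environ.get("MLFLOW_TRACKING_URI", "http://127.0.0.1:5000")
    try:
        url = uri.rstrip("/") + "/health"
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, ValueError):
        # OSError covers refused connections, timeouts, and HTTP errors;
        # ValueError covers malformed URIs.
        return False


if __name__ == "__main__":
    print("MLflow reachable:", tracking_server_is_up())
```

If this reports `False`, check that the server process is running and that `MLFLOW_TRACKING_URI` points at the same host and port.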
# Trainer config
actor_rollout_ref.rollout.trace.backend: mlflow # or weave
actor_rollout_ref.rollout.trace.token2text: True
trainer.logger: ['console', 'mlflow']

@article{wang2026t,
title={T$^2$PO: Uncertainty-Guided Exploration Control for Stable Multi-Turn Agentic Reinforcement Learning},
author={Wang, Haixin and Cui, Hejie and Zhang, Chenwei and Liu, Xin and Jin, Shuowei and Geng, Shijie and Zhang, Xinyang and Zalmout, Nasser and Shi, Zhenyu and Sun, Yizhou},
journal={arXiv preprint arXiv:2605.02178},
year={2026}
}