T²PO: Uncertainty-Guided Exploration Control for Stable Multi-Turn Agentic RL

Figure 1: Overview of the T²PO framework ｜ ICML 2026 Spotlight

Problem: Hesitation is defeat! Multi-turn RL for LLM agents is powerful, but critically limited by poor exploration.

Key idea: Training fails mostly when agents repeat low-value actions or ignore task-level uncertainty.

Our method: T²PO directly controls exploration at both the token and turn level using uncertainty signals, greatly improving stability and sample efficiency.

🛠️ T²PO Framework Design

🔹 Token-level: T²PO tracks marginal uncertainty and triggers interventions when it dips below a threshold.

🔸 Turn-level: T²PO resamples turns with negligible exploration progress, preventing wasted updates.
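The two controls above can be illustrated with a minimal sketch. It is not the T²PO implementation: it assumes that "marginal uncertainty" means the absolute change in next-token entropy between consecutive tokens, and that "exploration progress" is a scalar score per turn. The names `token_entropy`, `token_level_triggers`, and `resample_stalled_turn` are hypothetical and do not come from this repository.

```python
import math
from typing import Callable, List, Sequence, TypeVar

Turn = TypeVar("Turn")


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def token_level_triggers(entropies: Sequence[float], threshold: float) -> List[int]:
    """Return token positions whose marginal uncertainty dips below `threshold`.

    Assumption: marginal uncertainty at position t is |H_t - H_{t-1}|.
    """
    triggers = []
    for t in range(1, len(entropies)):
        marginal = abs(entropies[t] - entropies[t - 1])
        if marginal < threshold:
            triggers.append(t)
    return triggers


def resample_stalled_turn(
    sample_turn: Callable[[], Turn],
    progress_of: Callable[[Turn], float],
    min_progress: float,
    max_retries: int = 3,
) -> Turn:
    """Draw a turn, redrawing up to `max_retries` times while progress is negligible."""
    turn = sample_turn()
    for _ in range(max_retries):
        if progress_of(turn) >= min_progress:
            break
        turn = sample_turn()
    return turn


if __name__ == "__main__":
    # Position 2 barely changes the entropy, so it is the only trigger.
    print(token_level_triggers([1.0, 0.9, 0.895, 0.5], threshold=0.01))
```

In this sketch the caller decides what an intervention is; the functions only report where one would fire and which turn survives resampling.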

📊 Benchmarks: Substantial gains on WebShop, ALFWorld, SearchQA, and more, with significantly better stability and learning efficiency.

🔥 Key Features

✅ Support Training Multi-turn Embodied Agents
✅ Support Training Multi-turn Search Agents
✅ Support Training Multi-turn Multi-modal Game Agents
✅ Support Training Multi-turn Web Agents
✅ Support Evaluating Commercial LLMs as Agents

💡 Getting Started

Our work is based on the following main dependencies:

Python=3.11, VeRL=0.4.0, PyTorch=2.6.0, and vLLM=0.8.5

👉 Installation guide (optional)

# (Optional) Install conda
bash setup_conda.sh

# Install main dependencies
bash setup_env.sh

# Install extra requirements for specific tasks
conda activate verl
pip install -r requirements_xxx.txt
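After installation, a quick check that the active environment matches the pinned versions can save a failed training launch. The sketch below is one way to do that, using only the standard library. The distribution names `verl`, `torch`, and `vllm` are assumptions about how each package is registered in the environment, and `find_mismatches` is a hypothetical helper, not a script shipped with this repository.

```python
import sys
from importlib import metadata

# Pins copied from the dependency list above. The distribution names are
# assumptions; adjust them if your environment registers different names.
EXPECTED = {"verl": "0.4.0", "torch": "2.6.0", "vllm": "0.8.5"}


def installed_version(dist_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(expected, lookup=installed_version):
    """Map each package whose installed version differs from its pin to what was found."""
    mismatches = {}
    for name, wanted in expected.items():
        found = lookup(name)
        # A local build tag such as "2.6.0+cu124" still counts as a match.
        if found is None or found.split("+")[0] != wanted:
            mismatches[name] = found
    return mismatches


if __name__ == "__main__":
    if sys.version_info[:2] != (3, 11):
        print(f"Python {sys.version_info.major}.{sys.version_info.minor} found, 3.11 expected")
    for name, found in find_mismatches(EXPECTED).items():
        print(f"{name}: expected {EXPECTED[name]}, found {found}")
```

Run it inside the activated conda environment; no output means every pin matched.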

🚀 Existing Support

🤖 Embodied Agents

# 1. Build the environments
bash prepare_all_embody.sh

# 2. Run the demo code
conda activate agentrl_embody
bash examples/world_agent_trainer/train_xxx.sh

🛒 Web Agents

# 1. Build the webshop environments
bash prepare_all_web.sh

# 2. Run the demo code
conda activate agentrl_web
bash examples/shop_agent_trainer/train_xxx.sh

🕸️ Search Agents

# 1. Build the RAG server environments
bash prepare_all_search.sh

# 2. Run the demo code
conda activate agentrl_search
bash examples/search_agent_trainer/train_xxx.sh

🎮 Multi-modal Game Agents

# 1. Install the requirements
bash prepare_all_game.sh

# 2. Run the demo code
bash examples/game_agent_trainer/train_xxx.sh

🌊 Easy Extension

✨ Extensible by Design:

All task recipes live in recipe. Wrap the VeRL worker to plug in your own method.
Add new environments under agent_system.
Extra dependencies go into requirements_xxx.txt.
Third-party tools? Place them in AgentRL/sandbox.
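To make "add new environments" concrete, here is a toy multi-turn environment with the usual `reset`/`step` shape. It is purely illustrative: the class `GuessNumberEnv` and its method signatures are hypothetical and are not the interface that `agent_system` expects, so check the existing environments in that folder for the real contract before adapting it.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class GuessNumberEnv:
    """Toy multi-turn task: the agent must guess a hidden integer."""

    target: int
    max_turns: int = 5
    history: List[str] = field(default_factory=list)

    def reset(self) -> str:
        """Clear the episode state and return the first observation."""
        self.history = []
        return "Guess an integer between 0 and 9."

    def step(self, action: str) -> Tuple[str, float, bool]:
        """Apply one text action and return (observation, reward, done)."""
        self.history.append(action)
        guess: Optional[int]
        try:
            guess = int(action.strip())
        except ValueError:
            guess = None
        if guess == self.target:
            return "Correct.", 1.0, True
        # The episode also ends once the turn budget is exhausted.
        done = len(self.history) >= self.max_turns
        if guess is None:
            return "Please answer with an integer.", 0.0, done
        hint = "Too low." if guess < self.target else "Too high."
        return hint, 0.0, done
```

Text observations and a sparse terminal reward are used here because they mirror the multi-turn agent setting described above; a real environment would also carry whatever metadata the trainer needs.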

📊 Further Analysis

📈 MLflow analysis setup

# Install requirements
pip install mlflow

# Start server
mlflow server \
  --host 0.0.0.0 --port 5000 \
  --backend-store-uri sqlite:////tmp/mlruns.db \
  --default-artifact-root /tmp/mlruns

export MLFLOW_TRACKING_URI=http://127.0.0.1:5000

# Trainer config
actor_rollout_ref.rollout.trace.backend: mlflow  # or weave
actor_rollout_ref.rollout.trace.token2text: True
trainer.logger: ['console', 'mlflow']
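The trainer settings above are written as dotted keys. Assuming the trainer reads a nested, Hydra-style configuration (an assumption, not something this README states), each dotted key names a path into that structure. The hypothetical helper `expand_dotted` below shows the nesting those three keys would correspond to.

```python
def expand_dotted(flat):
    """Expand {"a.b.c": v} style keys into nested dictionaries."""
    nested = {}
    for dotted_key, value in flat.items():
        *parents, leaf = dotted_key.split(".")
        node = nested
        for part in parents:
            # Create intermediate levels on demand.
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


# The three settings from the trainer config above.
TRACE_SETTINGS = {
    "actor_rollout_ref.rollout.trace.backend": "mlflow",
    "actor_rollout_ref.rollout.trace.token2text": True,
    "trainer.logger": ["console", "mlflow"],
}

if __name__ == "__main__":
    import json

    print(json.dumps(expand_dotted(TRACE_SETTINGS), indent=2))
```

Both trace keys end up under the same `actor_rollout_ref.rollout.trace` node, which is where to look when setting them in a config file.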

✍️ Citation

@article{wang2026t,
  title={T$^2$PO: Uncertainty-Guided Exploration Control for Stable Multi-Turn Agentic Reinforcement Learning},
  author={Wang, Haixin and Cui, Hejie and Zhang, Chenwei and Liu, Xin and Jin, Shuowei and Geng, Shijie and Zhang, Xinyang and Zalmout, Nasser and Shi, Zhenyu and Sun, Yizhou},
  journal={arXiv preprint arXiv:2605.02178},
  year={2026}
}
