WillDreamer/T2PO


T2PO: Uncertainty-Guided Exploration Control for Stable Multi-Turn Agentic RL

ICML 2026 Spotlight  |  Now Open Source

Figure 1: Overview of the T2PO framework.

Problem: Hesitation is defeat! Multi-turn RL for LLM agents is powerful, but critically limited by poor exploration.

Key idea: Training fails mostly when agents repeat low-value actions or ignore task-level uncertainty.

Our method: T2PO directly controls exploration at both the token and turn level using uncertainty signals, greatly improving stability and sample efficiency.


🛠️ T2PO Framework Design

🔹 Token-level: T2PO tracks marginal uncertainty and triggers interventions when it dips below a threshold.
🔸 Turn-level: T2PO resamples turns with negligible exploration progress, preventing wasted updates.
📊 Benchmarks: Substantial gains on WebShop, ALFWorld, SearchQA and more—significantly better stability and learning efficiency.
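The two control levels described above can be illustrated with a minimal sketch. This is not the repository's implementation: it uses Shannon entropy of the next-token distribution as a stand-in for the uncertainty signal, and the names `token_entropy`, `first_low_uncertainty_step`, `sample_turn_with_resampling`, the thresholds, and the per-turn `progress` score are all hypothetical. The paper's actual definitions may differ.

```python
import math
from typing import Callable, Sequence, Tuple


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def first_low_uncertainty_step(
    step_probs: Sequence[Sequence[float]], threshold: float
) -> int:
    """Index of the first token whose entropy is below threshold, or -1."""
    for i, probs in enumerate(step_probs):
        if token_entropy(probs) < threshold:
            return i
    return -1


def sample_turn_with_resampling(
    sample_turn: Callable[[], Tuple[str, float]],
    min_progress: float,
    max_retries: int = 3,
) -> Tuple[str, float, int]:
    """Draw a turn and redraw while its (hypothetical) progress score is negligible.

    Returns (turn, progress, number_of_resamples).
    """
    turn, progress = sample_turn()
    retries = 0
    while progress < min_progress and retries < max_retries:
        turn, progress = sample_turn()
        retries += 1
    return turn, progress, retries


# Token level: the third distribution is sharply peaked, so it is flagged.
uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]
print(first_low_uncertainty_step([uniform, uniform, peaked], threshold=0.5))  # 2

# Turn level: two turns with no progress are discarded before one is kept.
draws = iter([("look", 0.0), ("look", 0.0), ("open drawer", 0.4)])
print(sample_turn_with_resampling(lambda: next(draws), min_progress=0.1))
# ('open drawer', 0.4, 2)
```

In a real trainer the "intervention" at the flagged token and the progress score for a turn would come from the method itself; here they are reduced to an index and a float so the control flow is easy to follow.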

🔥 Key Features

  • ✅ Support Training Multi-turn Embodied Agents
  • ✅ Support Training Multi-turn Search Agents
  • ✅ Support Training Multi-turn Multi-modal Game Agents
  • ✅ Support Training Multi-turn Web Agents
  • ✅ Support Evaluating Commercial LLMs as Agents

💡 Getting Started

Our work is based on the following main dependencies:

Python=3.11, VeRL=0.4.0, PyTorch=2.6.0, and vLLM=0.8.5
👉 Installation guide (optional)
# (Optional) Install conda
bash set_conda.sh

# Install main dependencies
bash setup_env.sh

# Install extra requirements for specific tasks
conda activate verl
pip install -r requirements_xxx.txt
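After installation, a quick sanity check can confirm that the environment matches the versions listed above. The sketch below is not part of the repository; it assumes the packages are installed under the distribution names `verl`, `torch`, and `vllm`, and the helper `check_versions` is hypothetical.

```python
from importlib import metadata
from typing import Dict

# Versions listed in this README; the distribution names are assumptions.
EXPECTED: Dict[str, str] = {"verl": "0.4.0", "torch": "2.6.0", "vllm": "0.8.5"}


def check_versions(expected: Dict[str, str]) -> Dict[str, str]:
    """Map each package to 'ok', 'missing', or 'mismatch (found X)'."""
    report: Dict[str, str] = {}
    for name, wanted in expected.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        # Local build tags such as '2.6.0+cu124' still count as a match.
        base = found.split("+", 1)[0]
        report[name] = "ok" if base == wanted else f"mismatch (found {found})"
    return report


for package, status in check_versions(EXPECTED).items():
    print(f"{package}: {status}")
```

Run it inside the activated conda environment; any `missing` or `mismatch` line points at a dependency worth reinstalling before launching a training script.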

🚀 Existing Support

🤖 Embodied Agents
# 1. Build the environments
bash prepare_all_embody.sh

# 2. Run the demo code
conda activate agentrl_embody
bash examples/world_agent_trainer/train_xxx.sh
🛒 Web Agents
# 1. Build the webshop environments
bash prepare_all_web.sh

# 2. Run the demo code
conda activate agentrl_web
bash examples/shop_agent_trainer/train_xxxx.sh
🕸️ Search Agents
# 1. Build the RAG server environments
bash prepare_all_search.sh

# 2. Run the demo code
conda activate agentrl_search
bash examples/search_agent_trainer/train_xxx.sh
🎮 Multi-modal Game Agents
# 1. Install the requirements
bash prepare_all_game.sh

# 2. Run the demo code
bash examples/game_agent_trainer/train_xxx.sh

🌊 Easy Extension

Extensible by Design:

  • All task recipes live in recipe. Wrap the VeRL worker to plug in your own method. [usage]
  • Add new environments under agent_system.
  • Extra dependencies go into requirements_xxx.txt.
  • Third-party tools? Place them in AgentRL/sandbox.

📊 Further Analysis

📈 MLflow analysis setup
# Install requirements
pip install mlflow

# Start server
mlflow server \
  --host 0.0.0.0 --port 5000 \
  --backend-store-uri sqlite:////tmp/mlruns.db \
  --default-artifact-root /tmp/mlruns

export MLFLOW_TRACKING_URI=http://127.0.0.1:5000

# Trainer config
actor_rollout_ref.rollout.trace.backend: mlflow  # or weave
actor_rollout_ref.rollout.trace.token2text: True
trainer.logger: ['console', 'mlflow']
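If the training scripts forward Hydra-style `key=value` overrides to the trainer, as VeRL entry points commonly do, the three settings above can also be passed on the command line. The helper below is a hypothetical sketch (`to_overrides` is not part of the repository) that renders the same keys as override strings; whether a given `train_xxx.sh` accepts extra arguments is an assumption to verify against the script.

```python
from typing import Any, Dict, List


def to_overrides(config: Dict[str, Any]) -> List[str]:
    """Render a flat dotted-key config as key=value command-line overrides."""
    rendered: List[str] = []
    for key, value in config.items():
        if isinstance(value, (list, tuple)):
            # Lists are written in bracket form, e.g. [console,mlflow].
            text = "[" + ",".join(str(v) for v in value) + "]"
        else:
            text = str(value)
        rendered.append(f"{key}={text}")
    return rendered


# The same keys as in the trainer config above.
trace_config = {
    "actor_rollout_ref.rollout.trace.backend": "mlflow",
    "actor_rollout_ref.rollout.trace.token2text": True,
    "trainer.logger": ["console", "mlflow"],
}

for override in to_overrides(trace_config):
    print(override)
# actor_rollout_ref.rollout.trace.backend=mlflow
# actor_rollout_ref.rollout.trace.token2text=True
# trainer.logger=[console,mlflow]
```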

✍️ Citation

@article{wang2026t,
  title={T $^2$ PO: Uncertainty-Guided Exploration Control for Stable Multi-Turn Agentic Reinforcement Learning},
  author={Wang, Haixin and Cui, Hejie and Zhang, Chenwei and Liu, Xin and Jin, Shuowei and Geng, Shijie and Zhang, Xinyang and Zalmout, Nasser and Shi, Zhenyu and Sun, Yizhou},
  journal={arXiv preprint arXiv:2605.02178},
  year={2026}
}
