Sequence-Level Agentic-RL with Convergence Guarantees
✨ Multi-turn Agentic RL · ⚡ Critic-free Sequence-level Updates · ✅ Convergence Guarantees
Implementation track: seeupo · Upstream: modelscope/AgentEvolver
SeeUPO is a multi-turn agentic reinforcement learning training pipeline built on BeyondAgent with a modified, vendored copy of verl under external/verl/. This verl has been approved for open source. Training requires this vendored copy; install it with pip install -e external/verl --no-deps. Environment interaction is served by the standalone env_service.
Use the roadmap below to jump directly to the paper summary, repository structure, setup instructions, or training entry points.
- Paper
- Repository layout
- Quick start (SeeUPO)
- Environment setup
- SeeUPO-related training settings
- Run training
- License
- Citation (BibTeX)
Authors: Tianyi Hu, Qingxu Fu, Yanxi Chen, Zhaoyang Liu, Bolin Ding
Submitted: 2026-02-06 | arXiv:2602.06554
TL;DR. The paper studies advantage estimation (GAE vs. GRAE) together with policy updates (REINFORCE vs. proximal / HAML-style). In multi-turn settings, standard critic-free recipes usually do not provide both critic-free training and convergence guarantees. SeeUPO addresses this with reverse-order, turn-wise sequential updates so that backward induction can target global optimality while remaining critic-free.
- REINFORCE + GRAE can converge to a global optimum under undiscounted (γ = 1) conditions; PPO-style (PPU) + GRAE generally does not retain the usual monotonic-improvement guarantee because of structural bias in the clipped objective.
- Multi-turn exposes a trade-off: mainstream recipes rarely achieve both critic-free training and strong convergence-style guarantees.
- SeeUPO treats a multi-turn trajectory as sequential single-turn bandits / virtual agents, updates turn-by-turn in reverse order (T → T−1 → … → 1), and in practice instantiates GRAE + PPO-style mirror updates (paper: SeeUPPO-GRAE).
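As a rough illustration of the reverse-order schedule described in the last bullet, the sketch below walks the turns from T down to 1 and, before each turn's update, computes a weight from the importance ratios of the turns that were already updated. The weighting rule (a clipped running product of later-turn ratios, in the spirit of HAML-style sequential updates) and the names `reverse_order_weights`, `turn_ratios`, `clip_low`, and `clip_high` are assumptions made for illustration; the paper and module/trainer/ba_ray_trainer.py define the actual correction.

```python
def reverse_order_weights(turn_ratios, clip_low=0.2, clip_high=0.2):
    """Return [(turn, weight), ...] in reverse update order (T, T-1, ..., 1).

    turn_ratios[t - 1] is a hypothetical importance ratio pi_new / pi_old for
    turn t, measured after that turn's policy update has been applied.
    """
    schedule = []
    running = 1.0  # product of ratios of turns updated so far
    for turn in range(len(turn_ratios), 0, -1):
        # Assumed rule: clip the running product before it scales the advantage.
        weight = min(max(running, 1.0 - clip_low), 1.0 + clip_high)
        schedule.append((turn, weight))
        running *= turn_ratios[turn - 1]
    return schedule


# Three turns: the last turn is updated first and sees an uncorrected advantage.
schedule = reverse_order_weights([1.1, 0.9, 1.5])
for turn, weight in schedule:
    print(f"update turn {turn} with advantage weight {weight:.2f}")
```

In this toy run, turn 3 is updated with weight 1.00, while turns 2 and 1 see the running product clipped at 1.20.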
ST = single-turn, MT = multi-turn. The table below condenses the claims in §3. Each entry is backed by formal analysis in the paper and appendices, including definitions, assumptions, lemmas, theorems, and proofs.
Reading guide: Advantage and Update name the estimator and policy-update family; Level is token vs. sequence; Example gives a representative method; ST / MT indicate whether the paper’s convergence sketch covers single-turn vs. multi-turn; In this repo points to relevant config knobs.
| Advantage | Update | Level | Example | ST | MT | In this repo |
|---|---|---|---|---|---|---|
| GAE | PPU (PPO-style) | Token | PPO | ✓ | ✓ | adv_estimator: gae (critic on) |
| GRAE | PPU | Token | GRPO, REINFORCE++ | ✗ | ✗ | adv_estimator: grpo (token-level baseline) |
| GRAE | REINFORCE | Sequence | RLOO | ✓ | ✗ | — |
| GRAE | PPU | Sequence | GSPO | ✓ | ✗ | loss_mode: gspo (sequence baseline) |
| GRAE | HAML / sequential | Sequence | SeeUPO | — | ✓ | sequential_update, update_order: reverse, adv_updator: seeupo |
Configs in this repo: launcher/qwen3_appworld/, launcher/qwen3_bfcl/, launcher/qwen25_bfcl/ (YAML + shell helpers).
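For readers unfamiliar with the GRAE rows in the table, the following minimal sketch shows the group-relative idea: each rollout's advantage is its reward minus the mean reward of the rollouts sampled for the same prompt, so no critic is needed. The function name, the `normalize_by_std` flag, and the use of a population standard deviation are assumptions for illustration and are not taken from the repository code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, normalize_by_std=False, eps=1e-6):
    """Critic-free advantages for one group of rollouts of the same prompt."""
    baseline = mean(rewards)  # the group mean acts as the baseline
    centered = [r - baseline for r in rewards]
    if normalize_by_std:
        # Optional GRPO-style scaling; assumed here to use the population std.
        scale = pstdev(rewards) + eps
        centered = [c / scale for c in centered]
    return centered


# Four rollouts of one prompt with sequence-level (undiscounted) rewards.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

With the mean baseline alone, the advantages in a group sum to zero; the optional scaling step corresponds conceptually to group standard-deviation normalization.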
The repository is organized around four pieces: the training core, the vendored verl dependency, the environment service, and benchmark-specific launch/config files.
| Path | Description |
|---|---|
| beyondagent/ | Main training loop and Ray trainer; module/trainer/ba_ray_trainer.py implements SeeUPO-style sequential updates, ratio computation, and adv_updator: seeupo. |
| external/verl/ | Required SeeUPO modified version of verl (approved for open source). Install with pip install -e external/verl --no-deps. You must use the version included here. |
| env_service/ | Environment service for AppWorld, BFCL, OpenWorld, etc.; launch scripts live in env_service/launch_script/. |
| launcher/ | Hydra/YAML experiment entry points; e.g. SeeUPO on BFCL: launcher/qwen3_bfcl/qwen3-seeupo-bfcl.yaml. |
| config/ | Shared Hydra fragments for defaults and dataflow. |
| seeupo_env.yaml | Exported Conda environment for dependency pinning. |
| requirements_NewVerl.txt | Ultra-short install reminder; the English walkthrough lives under Quick start (SeeUPO) / Environment setup below. |
| sync_env_with_yaml.py | Compare / align an activated Conda env with seeupo_env.yaml (strict version sync). |
HTTP API details for environments are documented in env_service/interface.md (ports depend on your setup; training YAMLs typically set env_service.env_url).
Use this section as the shortest path into the project. The full setup still has two layers: (A) benchmark sandboxes under env_service/environments/ and (B) the training Conda stack with pip install -e external/verl, FlashAttention, vLLM, and optional sync_env_with_yaml.py. Exact commands and version pins are given in Environment setup below.
Set up the project in two layers:
- (A) Benchmark sandboxes: local benchmark dependencies under env_service/environments/
- (B) Training infrastructure: Python plus this repo’s external/verl, FlashAttention, and vLLM for launcher.py
Important. You must use the modified verl vendored in this repository at external/verl/. This version has been reviewed and approved for open-sourcing. Do not install verl from PyPI or use another clone. Run pip install -e external/verl --no-deps.
Each benchmark provides a small setup.sh that prepares its local dependencies, datasets, and environment hints. From the repo root, run the script for the benchmark you need:
Benchmark setup.sh commands
# AppWorld
bash env_service/environments/appworld/setup.sh
# BFCL
bash env_service/environments/bfcl/setup.sh
# OpenWorld (optional)
bash env_service/environments/openworld/setup.sh
After setup: read the script output carefully for paths, extra Conda envs, and data downloads. BFCL may additionally require preprocessing steps referenced in env_service/launch_script/bfcl.sh or the BFCL README so that BFCL_DATA_PATH and related files exist. Once ready, start the HTTP environment service with env_service/launch_script/appworld.sh, bfcl.sh, and related scripts, or let the launcher scripts in launcher/ start it for you.
The canonical version pin is seeupo_env.yaml. The recipe below is the recommended high-level path; if anything conflicts, prefer the exact versions in seeupo_env.yaml. This serves the same purpose as requirements_NewVerl.txt, but the YAML is the authoritative source of exact pins.
Conda env recipe (create → seeupo_env.yaml → editable external/verl → flash-attn / vLLM → sync_env_with_yaml.py)
# 1) Create and activate a Conda env (Python 3.10)
conda create -n seeupo python=3.10
conda activate seeupo
# 2) Install most dependencies from seeupo_env.yaml (verl, flash-attn, vllm are added in step 3)
conda env update -f seeupo_env.yaml -n seeupo
# 3) Editable install of THIS REPO's verl (required — not pip install verl from PyPI), then FlashAttention and vLLM
pip install -e external/verl --no-deps
pip install flash-attn==2.7.0.post2 --no-deps --no-build-isolation
pip install vllm==0.8.5
# 4) Strict sync with seeupo_env.yaml (YAML-listed packages only); on resolver/ABI issues, align versions from the YAML instead of ad hoc bumps
python sync_env_with_yaml.py seeupo_env.yaml -n seeupo --compare-only
python sync_env_with_yaml.py seeupo_env.yaml -n seeupo --install
sync_env_with_yaml.py compares your environment to the YAML, then optionally installs mismatches. Other flags: --pip-only, --conda-only, --env-update (full conda env update from the YAML). If you omit the YAML path, the script falls back to seeupo_env.yaml next to the script when the default path is missing.
The snippets below summarize the checked-in launcher/qwen3_bfcl/qwen3-seeupo-bfcl.yaml. Other benchmarks reuse the same algorithm block; in practice you mainly adjust env_service, trainer.nnodes, dataset paths, and model checkpoints.
YAML — algorithm (advantage, loss, sequential SeeUPO core)
algorithm:
  # (1) GRAE-family: use GRPO-style advantage
  adv_estimator: grpo
  use_kl_in_reward: False
  # (2) Sequence-level policy loss (GSPO)
  loss_mode: gspo
  loss_agg_mode: "seq-mean-token-mean"
  # (3) Turn-wise sequential updates (SeeUPO schedule)
  sequential_update: True
  update_order: "reverse" # sequential | reverse | random | custom
  # (4) SeeUPO advantage correction (mirror / IS terms); set None to disable
  adv_updator: seeupo
  adv_clip_ratio_high: 0.2
  adv_clip_ratio_low: 0.2
  # (5) Advantage normalization (batch-level; paper §4.2 / §5.3.2)
  norm_adv_by_std_in_grpo: False # group std norm off (convergence-sensitive)
  special_norm: True # enables batch-level normalization path in code
YAML — env_service (where rollouts talk to the benchmark)
env_service:
  env_url: http://localhost:8080
  env_type: bfcl
  env_feedin_preference: code
YAML — data (batching and file paths)
data:
  train_batch_size: 32
  max_prompt_length: 14000
  max_response_length: 4000
  return_raw_chat: True
  train_files: '//external/bfcl/data/BFCL_v4_multi_turn_base_train.parquet'
  val_files: '//external/bfcl/data/BFCL_v4_multi_turn_base_test.parquet'
Use this block to set hardware scale, experiment naming, and logging. In particular, configure default_local_dir, experiment_name, n_gpus_per_node, nnodes, total_epochs, and your loggers (swanlab, etc.). The checked-in BFCL reference run uses 50 epochs on 8×1 GPUs.
This block controls optimization, rollout behavior, and model wiring for training.
- actor: LR 1e-6, KL penalty (kl_loss_coef: 0.002, low_var_kl), FSDP offload flags, dynamic batch tokens (ppo_max_token_len_per_gpu, etc.).
- rollout: vllm + mode: async, n: 8 rollouts per prompt, multi_turn.max_steps: 10, temperature 0.9, context_template: linear_think (the SeeUPO family uses the linear thinking template), lengths aligned to data (prompt_length / response_length / max_model_len).
- model: set path to your Qwen3 checkpoint; use_qwen3: True, gradient checkpointing / padding as in the YAML.
- critic: for critic-free runs, keep the critic.model block commented as in the file.
For AppWorld, keep the same algorithm block and switch env_service.env_type and URLs to the AppWorld service; see launcher/qwen3_appworld/qwen3-seeupo-appworld.yaml.
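The update_order key in the algorithm block accepts sequential, reverse, random, or custom. The sketch below shows one plausible way to turn such a setting into a concrete list of turn indices; it is an assumption about the semantics, not the trainer's code. In particular, `resolve_turn_order`, `custom_order`, and `seed` are hypothetical names, and how the repository actually specifies a custom order is not covered here.

```python
import random


def resolve_turn_order(num_turns, update_order="reverse", custom_order=None, seed=None):
    """Return the 1-based turn indices in the order they would be updated."""
    turns = list(range(1, num_turns + 1))
    if update_order == "sequential":
        return turns
    if update_order == "reverse":
        return turns[::-1]  # T, T-1, ..., 1 (the SeeUPO schedule)
    if update_order == "random":
        rng = random.Random(seed)  # local RNG so global state is untouched
        rng.shuffle(turns)
        return turns
    if update_order == "custom":
        if custom_order is None or sorted(custom_order) != turns:
            raise ValueError("custom_order must be a permutation of 1..num_turns")
        return list(custom_order)
    raise ValueError(f"unknown update_order: {update_order!r}")


print(resolve_turn_order(4))                # reverse order
print(resolve_turn_order(4, "sequential"))  # forward order
```

Rejecting unknown values early, rather than silently falling back to a default, keeps a typo in the YAML from changing the update schedule unnoticed.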
Prerequisites. Complete Quick start (SeeUPO) / Environment setup first, including benchmark sandboxes, the training Conda environment, and optionally sync_env_with_yaml.py. Before launching, activate your training environment (conda activate seeupo or your own name), cd to the repo root, and verify that your CUDA/driver stack matches the installed vLLM build.
This is the fastest path for single-node experiments. The scripts under launcher/qwen3_bfcl/ and launcher/qwen3_appworld/ start the environment service with nohup, wait for readiness, and then invoke launcher.py with the matching YAML.
| Environment | GRPO | GSPO | SeeUPO |
|---|---|---|---|
| BFCL | bash launcher/qwen3_bfcl/qwen3-grpo-bfcl.sh | bash launcher/qwen3_bfcl/qwen3-gspo-bfcl.sh | bash launcher/qwen3_bfcl/qwen3-seeupo-bfcl.sh |
| AppWorld | bash launcher/qwen3_appworld/qwen3-grpo-appworld.sh | bash launcher/qwen3_appworld/qwen3-gspo-appworld.sh | bash launcher/qwen3_appworld/qwen3-seeupo-appworld.sh |
Required env vars: CONDA_SH, SWANLAB_API_KEY.
Optional env vars: BFCL_CONDA_ENV / APPWORLD_CONDA_ENV (defaults bfcl / appworld), TRAIN_CONDA_ENV (default seeupo), BFCL_ENV_DIR, BFCL_STARTUP_SLEEP / APPWORLD_STARTUP_SLEEP, APPWORLD_ROOT.
If you use the context-template alien LLM path, set DASHSCOPE_API_KEY or provide DASHSCOPE_API_KEYS / DASHSCOPE_API_KEYS_REGULAR together with DASHSCOPE_API_KEYS_BACKUP as comma-separated lists. Full details are documented in the header comments of each script.
Logs: bfcl_service.log / appworld_service.log at the repository root.
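If you start the environment service yourself, you may want a readiness check before launching training. The following is a minimal sketch that polls the TCP port taken from env_service.env_url until it accepts a connection. It is an assumption about how one could wait, not a description of what the launcher scripts do, and `wait_for_service` with its parameters is a hypothetical helper. A successful TCP connect only shows that something is listening; it does not verify the HTTP API described in env_service/interface.md.

```python
import socket
import time
from urllib.parse import urlparse


def wait_for_service(env_url, timeout_s=60.0, poll_interval_s=1.0):
    """Return True once host:port from env_url accepts TCP connections."""
    parsed = urlparse(env_url)
    host = parsed.hostname
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=poll_interval_s):
                return True
        except OSError:
            # Not listening yet (or unreachable); retry until the deadline.
            time.sleep(poll_interval_s)
    return False


# Example (URL from the checked-in BFCL YAML):
# if not wait_for_service("http://localhost:8080", timeout_s=120):
#     raise SystemExit("environment service did not come up; check bfcl_service.log")
```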
Use this path if the environment service is already running. In that case, skip the launcher script’s nohup block and call launcher.py directly from the repo root:
Example — manual launcher.py (BFCL SeeUPO)
python launcher.py --conf launcher/qwen3_bfcl/qwen3-seeupo-bfcl.yaml
For AppWorld, you may also use python launcher.py --conf <yaml> --with-appworld so the launcher starts AppWorld instead of expecting a pre-started service.
Use launcher_multinode.py together with launcher/qwen3_bfcl/qwen3-ppo-bfcl.sh or launcher/qwen3_appworld/qwen3-ppo-appworld.sh for distributed PPO baselines. Your scheduler must provide RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT, CONDA_SH, and SWANLAB_API_KEY. Optional knobs include TRAIN_CONDA_ENV, BFCL_CONDA_ENV / APPWORLD_CONDA_ENV, NUM_GPUS_PER_NODE, NUM_CPUS_PER_NODE, OBJECT_STORE_MEMORY, and NCCL_* / GLOO_*. Service logs are written under logs/bfcl/ or logs/appworld/ at the repo root.
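Because a missing scheduler variable tends to surface late in a distributed run, a small preflight check can help. The sketch below lists which of the required variables named above are unset or empty; `missing_env_vars` and `MULTINODE_REQUIRED` are hypothetical names and are not part of launcher_multinode.py.

```python
import os

# Required variables for the multi-node path, as listed in this README.
MULTINODE_REQUIRED = (
    "RANK",
    "WORLD_SIZE",
    "MASTER_ADDR",
    "MASTER_PORT",
    "CONDA_SH",
    "SWANLAB_API_KEY",
)


def missing_env_vars(required=MULTINODE_REQUIRED, environ=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


# Example with an explicit mapping; pass nothing to check the real environment.
example = {"RANK": "0", "WORLD_SIZE": "2", "MASTER_ADDR": "10.0.0.1"}
print(missing_env_vars(environ=example))
```

For the single-node scripts, the same helper can be called with `required=("CONDA_SH", "SWANLAB_API_KEY")`.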
launcher.py backs up config/, beyondagent/, and the chosen YAML under the experiment directory for reproducibility.
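The backup step can be pictured with the short sketch below, which copies the two source directories and the chosen YAML into a folder under the experiment directory. This is an illustration under assumptions, not launcher.py's code: the function name `snapshot_run_inputs` and the `backup` subfolder name are hypothetical.

```python
import shutil
from pathlib import Path


def snapshot_run_inputs(repo_root, yaml_path, experiment_dir, subdir="backup"):
    """Copy config/, beyondagent/, and the run YAML under experiment_dir/subdir."""
    repo_root = Path(repo_root)
    yaml_path = Path(yaml_path)
    target = Path(experiment_dir) / subdir
    target.mkdir(parents=True, exist_ok=True)
    for name in ("config", "beyondagent"):
        # dirs_exist_ok lets a rerun refresh an existing snapshot (Python 3.8+).
        shutil.copytree(repo_root / name, target / name, dirs_exist_ok=True)
    shutil.copy2(yaml_path, target / yaml_path.name)
    return target


# Example (paths are placeholders):
# snapshot_run_inputs(".", "launcher/qwen3_bfcl/qwen3-seeupo-bfcl.yaml", "<experiment dir>")
```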
This repository is released under Apache License 2.0. See LICENSE.txt for the full text.
Use the following BibTeX entry to cite the paper.
BibTeX
@article{hu2026seeupo,
title={SeeUPO: Sequence-Level Agentic-RL with Convergence Guarantees},
author={Hu, Tianyi and Fu, Qingxu and Chen, Yanxi and Liu, Zhaoyang and Ding, Bolin},
journal={arXiv preprint arXiv:2602.06554},
year={2026},
url={https://arxiv.org/abs/2602.06554}
}
You can also export BibTeX from the arXiv abstract page.