| Paper | Dataset |
This is the official repository for the paper "UserRL: Training Interactive User-Centric Agent via Reinforcement Learning".
We provide a comprehensive framework for training LLMs with reinforcement learning across diverse multi-turn, user-centric gym environments. UserRL implements Group Relative Policy Optimization (GRPO) with multi-turn credit assignment for effective learning in interactive scenarios.
UserRL enables training language models to interact effectively with users across multiple domains through:
- Multi-Turn Conversations: Support for complex, extended dialogues with proper credit assignment
- Diverse Gym Environments: 10+ specialized environments covering reasoning, tool usage, persuasion, and more
- Advanced RL Algorithms: GRPO with turn-level reward attribution and trajectory scoring
- Scalable Training: Multi-GPU support with SGLang backend for efficient inference
- Comprehensive Evaluation: End-to-end pipeline for model assessment across all environments
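Since the training loop centers on GRPO, a minimal, self-contained sketch of the group-relative advantage idea may help orient readers. This is a conceptual illustration under simplifying assumptions, not the implementation in `verl/`:

```python
# Conceptual sketch of GRPO's group-relative advantage (not the verl implementation).
# For each prompt, several trajectories are sampled and scored by the gym; each
# trajectory's advantage is its score normalized against the group statistics.
from statistics import mean, stdev

def group_relative_advantages(scores: list[float], eps: float = 1e-6) -> list[float]:
    mu = mean(scores)
    sigma = stdev(scores) if len(scores) > 1 else 0.0
    return [(s - mu) / (sigma + eps) for s in scores]

# Example: four rollouts of the same prompt with trajectory scores from an environment.
print(group_relative_advantages([0.0, 0.5, 0.5, 1.0]))
```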
```
UserRL/
├── gyms/       # Gymnasium environments for different domains
├── verl/       # Core RL training framework
├── examples/   # Training configurations and data preprocessing
├── sft/        # Supervised fine-tuning pipeline
├── eval/       # Comprehensive evaluation framework
└── data/       # Training and validation datasets
```
- Multi-Environment Training: Train on 10+ diverse environments simultaneously
- Turn-Level Credit Assignment: Advanced reward attribution for multi-turn scenarios
- Efficient Inference: SGLang backend with optimized memory utilization
- Comprehensive Logging: WandB integration with detailed metrics tracking
- Flexible Configuration: Hydra-based configuration system for easy experimentation
- Python 3.12
- CUDA-compatible GPU(s)
- OpenAI API key (for user simulation)
1. Create Environment

   ```bash
   conda create -n userrl python=3.12
   conda activate userrl
   ```

2. Install UserRL

   ```bash
   pip install -e .[sglang]
   pip install flash-attn --no-build-isolation
   ```

3. Install Gym Environments

   ```bash
   bash install_gyms.sh
   ```

4. Configure Environment Variables

   ```bash
   export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
   export OPENAI_API_KEY="your-openai-key"
   export OPENAI_BASE_URL="https://api.openai.com/v1"
   export MULTITURN_MODEL_NAME="gpt-4o"
   ```

5. Update Training Script

   Edit `examples/sglang_multiturn/train.sh` and update:

   ```bash
   PROJECT_DIR="/path/to/your/UserRL"                 # Set your project path
   actor_rollout_ref.model.path=/path/to/your/model   # Set your model path
   trainer.n_gpus_per_node=8                          # Adjust based on available GPUs
   ```

6. Start Training

   ```bash
   bash ./examples/sglang_multiturn/train.sh
   ```
UserRL includes 10+ specialized gym environments:
| Environment | Domain | Description |
|---|---|---|
| FunctionGym | Mathematics | Function discovery and parameter learning |
| IntentionGym | Intent Recognition | User intention inference through conversation |
| PersuadeGym | Persuasion | Strategic persuasive communication |
| SearchGym | Information Retrieval | Web search and information synthesis |
| TauGym | Tool Usage | Multi-agent tool interaction scenarios |
| TelepathyGym | Mind Reading | Entity guessing through strategic questions |
| TravelGym | Travel Planning | Preference elicitation and recommendation |
| TurtleGym | Lateral Thinking | Turtle Soup puzzle solving |
Each environment provides:
- Standardized action formats (`[action]`, `[answer]`, `[finish]`)
- Multi-turn conversation support
- Domain-specific reward mechanisms
- LLM-based evaluation systems
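The exact message format is defined by each gym. Purely as an illustrative sketch (assuming the agent emits one bracketed tag per turn, which may not match the repository's real parsing logic), a minimal dispatcher for these action types could look like:

```python
# Illustrative sketch only: assumes a turn is a single string beginning with one
# of the bracketed tags above. The actual parsing lives inside each gym environment.
def parse_turn(message: str) -> tuple[str, str]:
    """Split an agent message into (action_type, payload)."""
    for tag in ("[action]", "[answer]", "[finish]"):
        if message.startswith(tag):
            return tag.strip("[]"), message[len(tag):].strip()
    raise ValueError(f"Unrecognized action format: {message!r}")

print(parse_turn("[answer] The hidden entity is a lighthouse."))
# ('answer', 'The hidden entity is a lighthouse.')
```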
For improved initialization, start with SFT:
```bash
# See detailed instructions in sft/README.md
cd sft/
# Follow SFT pipeline setup and training
```

Key Training Parameters:
- Algorithm: GRPO with multi-turn credit assignment
- Turn-Level Method: `Equalized`, `R2G`, or `EM` (see the sketch below)
- Trajectory Scoring: `Sum` or `R2G`
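To make these knobs concrete, here is a small, self-contained Python sketch of how a trajectory-level score can be spread back over turns. It only illustrates the `Equalized` (uniform split) and a reward-to-go style (`R2G`, with a discount like `algorithm.gamma`) idea, is not the repository's implementation, and omits the `EM` option:

```python
# Hedged illustration of turn-level credit assignment -- not the repo's actual code.
# "Equalized": every turn receives an equal share of the trajectory score.
# "R2G" (reward-to-go style): each turn is credited with the discounted sum of
# the per-turn rewards that follow it (gamma mirrors algorithm.gamma, e.g. 0.8).
from typing import List

def equalized_credit(trajectory_score: float, num_turns: int) -> List[float]:
    return [trajectory_score / num_turns] * num_turns

def reward_to_go(turn_rewards: List[float], gamma: float = 0.8) -> List[float]:
    credits, running = [], 0.0
    for r in reversed(turn_rewards):
        running = r + gamma * running
        credits.append(running)
    return list(reversed(credits))

turn_rewards = [0.0, 0.2, 1.0]   # e.g. per-turn rewards emitted by a gym
print(equalized_credit(sum(turn_rewards), len(turn_rewards)))  # [0.4, 0.4, 0.4]
print(reward_to_go(turn_rewards))                              # [0.8, 1.0, 1.0]
```

Here the trajectory score itself is taken as the plain sum of turn rewards, mirroring the `Sum` trajectory-scoring option.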
Training Configuration Example:
```yaml
# Key hyperparameters in train.sh
algorithm.adv_estimator: grpo_multiturn
algorithm.gamma: 0.8
data.train_batch_size: 128
actor_rollout_ref.rollout.multi_turn.turn_level_method: "Equalized"
actor_rollout_ref.rollout.multi_turn.trajectory_score_method: "Sum"
```

Comprehensive evaluation across all environments:
```bash
# See detailed instructions in eval/README.md
cd eval/
# Follow evaluation pipeline
```

UserRL supports distributed training across multiple GPUs:
```bash
# Configure GPU usage
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
trainer.n_gpus_per_node=8
trainer.nnodes=1
```

Option 1: OpenAI GPT-4o
export OPENAI_BASE_URL="https://api.openai.com/v1"
export MULTITURN_MODEL_NAME="gpt-4o"Option 2: Local Model
```bash
# We use Qwen3-32B as the simulated user in the paper's experiments
export OPENAI_BASE_URL="http://localhost:8000/v1"
export MULTITURN_MODEL_NAME="Qwen/Qwen3-32B"
```
For large models, configure memory settings:

```yaml
actor_rollout_ref.model.enable_gradient_checkpointing: True
actor_rollout_ref.model.enable_activation_offload: True
actor_rollout_ref.rollout.gpu_memory_utilization: 0.50
```
1. Design New Gym: Follow the patterns in `gyms/README.md`

2. Create Data Preprocessing (see the sketch after this list):

   ```bash
   # Create a new file in examples/data_preprocess/
   # Ensure the data source field starts with "interact_"
   ```

3. Update Dataset Configuration:

   ```bash
   python examples/data_preprocess/merge_customize.py
   ```

4. Register Environment:

   ```python
   # Add to verl/tools/env_manager.py
   # Use the same env_name as in data preprocessing
   ```

5. Begin Training: Run the customized training with the new environment
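As a companion to step 2, here is a deliberately minimal, hypothetical sketch of what such a preprocessing script might produce. The field names (`data_source`, `prompt`) and the output path are assumptions for illustration, so copy the actual schema from the existing scripts in `examples/data_preprocess/`:

```python
# Hypothetical sketch only -- mirror the real schema from examples/data_preprocess/.
import pandas as pd

records = [
    {
        # Field name assumed; the "interact_" prefix requirement comes from step 2 above.
        "data_source": "interact_mygym",
        # Chat-style prompt is an assumption for illustration.
        "prompt": [{"role": "user", "content": "Start a MyGym episode."}],
    }
]
pd.DataFrame(records).to_parquet("data/mygym_train.parquet")
```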
UserRL provides comprehensive logging through:
- Console Output: Real-time training progress
- Weights & Biases: Detailed metrics and visualization
- Checkpointing: Automatic model saving and best model selection
```yaml
trainer.logger: ['console', 'wandb']
trainer.project_name: 'UserRL'
trainer.save_freq: 1
trainer.test_freq: 5
```

We welcome contributions! Please see the individual component READMEs.
```bibtex
@article{qian2025userrl,
  title={UserRL: Training Interactive User-Centric Agent via Reinforcement Learning},
  author={Qian, Cheng and Liu, Zuxin and Prabhakar, Akshara and Qiu, Jielin and Liu, Zhiwei and Chen, Haolin and Kokane, Shirley and Ji, Heng and Yao, Weiran and Heinecke, Shelby and Savarese, Silvio and Xiong, Caiming and Wang, Huan},
  journal={arXiv preprint arXiv:2509.19736},
  year={2025}
}
```

Built on top of the verl training framework and the SGLang inference backend.
For detailed documentation on specific components, please refer to the respective README files in each subdirectory.

