
UserRL Logo

UserRL: Training Proactive User-Centric Agent via Reinforcement Learning

| 📖 Paper | 📊 Dataset |

This is the official repository for the paper "UserRL: Training Proactive User-Centric Agent via Reinforcement Learning".

We provide a comprehensive framework for training LLMs with reinforcement learning across diverse multi-turn, user-centric gym environments. UserRL implements Group Relative Policy Optimization (GRPO) with multi-turn credit assignment for effective learning in interactive scenarios.

DataPipeline

🎯 Overview

UserRL enables training language models to interact effectively with users across multiple domains through:

  • Multi-Turn Conversations: Support for complex, extended dialogues with proper credit assignment
  • Diverse Gym Environments: 10+ specialized environments covering reasoning, tool usage, persuasion, and more
  • Advanced RL Algorithms: GRPO with turn-level reward attribution and trajectory scoring
  • Scalable Training: Multi-GPU support with SGLang backend for efficient inference
  • Comprehensive Evaluation: End-to-end pipeline for model assessment across all environments

๐Ÿ—๏ธ Architecture

Core Components

UserRL/
โ”œโ”€โ”€ gyms/              # Gymnasium environments for different domains
โ”œโ”€โ”€ verl/              # Core RL training framework
โ”œโ”€โ”€ examples/          # Training configurations and data preprocessing
โ”œโ”€โ”€ sft/               # Supervised fine-tuning pipeline
โ”œโ”€โ”€ eval/              # Comprehensive evaluation framework
โ””โ”€โ”€ data/              # Training and validation datasets

Key Features

  • 🤖 Multi-Environment Training: Train on 10+ diverse environments simultaneously
  • 🎯 Turn-Level Credit Assignment: Advanced reward attribution for multi-turn scenarios
  • ⚡ Efficient Inference: SGLang backend with optimized memory utilization
  • 📊 Comprehensive Logging: WandB integration with detailed metrics tracking
  • 🔧 Flexible Configuration: Hydra-based configuration system for easy experimentation

🚀 Quick Start

Prerequisites

  • Python 3.12
  • CUDA-compatible GPU(s)
  • OpenAI API key (for user simulation)

Installation

  1. Create Environment

    conda create -n userrl python=3.12
    conda activate userrl
  2. Install UserRL

    pip install -e .[sglang]
    pip install flash-attn --no-build-isolation
  3. Install Gym Environments

    bash install_gyms.sh
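
Optionally, you can run a quick sanity check that the CUDA build and flash-attn are importable from Python (an illustrative snippet, not part of the pipeline):

# Quick post-install sanity check: CUDA visibility and flash-attn import
import torch
import flash_attn

print("CUDA available:", torch.cuda.is_available())
print("flash-attn version:", flash_attn.__version__)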

Basic Training

  1. Configure Environment Variables

    export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
    export OPENAI_API_KEY="your-openai-key"
    export OPENAI_BASE_URL="https://api.openai.com/v1"
    export MULTITURN_MODEL_NAME="gpt-4o"
  2. Update Training Script

    Edit examples/sglang_multiturn/train.sh and update:

    PROJECT_DIR="/path/to/your/UserRL"  # Set your project path
    actor_rollout_ref.model.path=/path/to/your/model  # Set your model path
    trainer.n_gpus_per_node=8  # Adjust based on available GPUs
  3. Start Training

    bash ./examples/sglang_multiturn/train.sh

🎮 Available Environments

UserRL includes 10+ specialized gym environments:

| Environment | Domain | Description |
| --- | --- | --- |
| FunctionGym | Mathematics | Function discovery and parameter learning |
| IntentionGym | Intent Recognition | User intention inference through conversation |
| PersuadeGym | Persuasion | Strategic persuasive communication |
| SearchGym | Information Retrieval | Web search and information synthesis |
| TauGym | Tool Usage | Multi-agent tool interaction scenarios |
| TelepathyGym | Mind Reading | Entity guessing through strategic questions |
| TravelGym | Travel Planning | Preference elicitation and recommendation |
| TurtleGym | Lateral Thinking | Turtle Soup puzzle solving |

Each environment provides:

  • Standardized action formats ([action], [answer], [finish]), as shown in the sketch after this list
  • Multi-turn conversation support
  • Domain-specific reward mechanisms
  • LLM-based evaluation systems
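
A minimal interaction sketch, assuming a chat-message layout in which the policy wraps each move in one of the bracketed tags above (the tag semantics and the actual gym API are defined in gyms/README.md; the example below is illustrative only):

# Illustrative only: how a multi-turn exchange using the bracketed action tags
# might look as chat messages (loosely styled after a TelepathyGym-like guessing task).
conversation = [
    {"role": "user", "content": "I am thinking of an entity. Ask questions to figure out what it is."},
    {"role": "assistant", "content": "[action] Is the entity a living thing?"},
    {"role": "user", "content": "Yes, it is."},
    {"role": "assistant", "content": "[answer] Is it a cat?"},
    {"role": "user", "content": "Correct!"},
    {"role": "assistant", "content": "[finish] The entity is a cat."},
]

for message in conversation:
    print(f"{message['role']}: {message['content']}")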

๐Ÿ‹๏ธ Training Pipeline

1. Supervised Fine-Tuning (Optional)

For improved initialization, start with SFT:

# See detailed instructions in sft/README.md
cd sft/
# Follow SFT pipeline setup and training

2. Reinforcement Learning Training

Key Training Parameters:

  • Algorithm: GRPO with multi-turn credit assignment
  • Turn-Level Method: Equalized, R2G, or EM
  • Trajectory Scoring: Sum or R2G (both options are illustrated in the sketch after the configuration example below)

Training Configuration Example:

# Key hyperparameters in train.sh
algorithm.adv_estimator: grpo_multiturn
algorithm.gamma: 0.8
data.train_batch_size: 128
actor_rollout_ref.rollout.multi_turn.turn_level_method: "Equalized"
actor_rollout_ref.rollout.multi_turn.trajectory_score_method: "Sum"
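
To make the turn-level options above concrete, here is a rough sketch of one plausible reading of the credit-assignment choices, assuming "Equalized" spreads a trajectory's score evenly across turns, "R2G" uses a gamma-discounted reward-to-go, and "Sum" adds up per-turn rewards. These readings are assumptions for illustration only; the exact formulas are defined in the paper and in verl/.

# Hedged sketch (NOT the repository's implementation) of turn-level credit
# assignment and trajectory scoring under the assumptions stated above.
from typing import List

def turn_level_credit(turn_rewards: List[float], method: str = "Equalized",
                      gamma: float = 0.8) -> List[float]:
    n = len(turn_rewards)
    if method == "Equalized":
        # every turn receives the same share of the trajectory's total reward
        return [sum(turn_rewards) / n] * n
    if method == "R2G":
        # gamma-discounted reward-to-go from each turn onward
        return [sum(gamma ** (k - t) * turn_rewards[k] for k in range(t, n)) for t in range(n)]
    raise ValueError(f"unknown turn-level method: {method}")

def trajectory_score(turn_rewards: List[float], method: str = "Sum",
                     gamma: float = 0.8) -> float:
    if method == "Sum":
        return sum(turn_rewards)                                       # plain sum of per-turn rewards
    if method == "R2G":
        return sum(gamma ** t * r for t, r in enumerate(turn_rewards)) # discounted sum
    raise ValueError(f"unknown trajectory scoring method: {method}")

# Example: a 3-turn episode with per-turn rewards 0.0, 0.5, 1.0
print(turn_level_credit([0.0, 0.5, 1.0], "Equalized"))  # [0.5, 0.5, 0.5]
print(trajectory_score([0.0, 0.5, 1.0], "Sum"))         # 1.5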

3. Model Evaluation

Comprehensive evaluation across all environments:

# See detailed instructions in eval/README.md
cd eval/
# Follow evaluation pipeline

📊 Advanced Configuration

Multi-GPU Training

UserRL supports distributed training across multiple GPUs:

# Configure GPU usage
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
trainer.n_gpus_per_node=8
trainer.nnodes=1

User Simulation Options

Option 1: OpenAI GPT-4o

export OPENAI_BASE_URL="https://api.openai.com/v1"
export MULTITURN_MODEL_NAME="gpt-4o"

Option 2: Local Model

# We use Qwen3-32B as the simulated user in the paper's experiments
export OPENAI_BASE_URL="http://localhost:8000/v1"
export MULTITURN_MODEL_NAME="Qwen/Qwen3-32B"

Memory Optimization

For large models, configure memory settings:

actor_rollout_ref.model.enable_gradient_checkpointing: True
actor_rollout_ref.model.enable_activation_offload: True
actor_rollout_ref.rollout.gpu_memory_utilization: 0.50

🔧 Adding New Environments

  1. Design New Gym: Follow patterns in gyms/README.md

  2. Create Data Preprocessing (see the sketch after these steps):

    # Create new file in examples/data_preprocess/
    # Ensure data source field starts with "interact_"
  3. Update Dataset Configuration:

    python examples/data_preprocess/merge_customize.py
  4. Register Environment:

    # Add to verl/tools/env_manager.py
    # Use same env_name as in data preprocessing
  5. Begin Training: Run customized training with the new environment
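
A minimal sketch of step 2, assuming the preprocessed data is written as a parquet file with a data_source column. Only the "interact_" prefix requirement comes from this README; the other column names, the pandas/parquet usage, and the file paths are assumptions, so follow the existing scripts in examples/data_preprocess/ for the exact schema.

# Hypothetical preprocessing sketch for a new "MyGym" environment (illustrative only).
import pandas as pd

raw_tasks = [
    {"question": "Guess the hidden entity.", "max_turns": 8},
    {"question": "Plan a weekend trip that matches my preferences.", "max_turns": 8},
]

records = []
for idx, task in enumerate(raw_tasks):
    records.append({
        "data_source": "interact_mygym",   # must start with "interact_"
        "prompt": [{"role": "user", "content": task["question"]}],
        "extra_info": {"index": idx, "max_turns": task["max_turns"]},
    })

# Assumed output location; match it to your dataset configuration.
pd.DataFrame(records).to_parquet("data/mygym_train.parquet")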

📈 Monitoring and Logging

UserRL provides comprehensive logging through:

  • Console Output: Real-time training progress
  • Weights & Biases: Detailed metrics and visualization
  • Checkpointing: Automatic model saving and best model selection

Logging-related settings in train.sh:

trainer.logger: ['console', 'wandb']
trainer.project_name: 'UserRL'
trainer.save_freq: 1
trainer.test_freq: 5

๐Ÿค Contributing

We welcome contributions! Please see the README in each component directory for detailed guidelines.

๐Ÿ“ Citation

@article{qian2025userrl,
  title={UserRL: Training Interactive User-Centric Agent via Reinforcement Learning},
  author={Qian, Cheng and Liu, Zuxin and Prabhakar, Akshara and Qiu, Jielin and Liu, Zhiwei and Chen, Haolin and Kokane, Shirley and Ji, Heng and Yao, Weiran and Heinecke, Shelby and Savarese, Silvio and Xiong, Caiming and Wang, Huan},
  journal={arXiv preprint arXiv:2509.19736},
  year={2025}
}

๐Ÿ™ Acknowledgments

Built on top of:

  • VERL - Volcano Engine Reinforcement Learning framework
  • SGLang - Efficient LLM serving

For detailed documentation on specific components, please refer to the respective README files in each subdirectory.
