
James-Stein Policy Optimization

This is the GitHub repository for the paper "Variance-Reduced Reinforcement Learning for Large Reasoning Models via James-Stein Baselines" by Guanning Zeng, Zhaoyi Zhou, Daman Arora, and Andrea Zanette.


[Teaser figure]
Overview of James-Stein Policy Optimization (JSPO). JSPO replaces the usual per-prompt baseline in critic-free RL with an analytically derived shrinkage baseline that pools information across prompts while preserving an unbiased policy-gradient estimator. It consistently lowers gradient variance and improves training stability and accuracy for reasoning LLMs across different rollout budgets.
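To make the shrinkage idea concrete, here is a minimal, self-contained sketch of a James-Stein-style baseline for critic-free advantages. It is an illustration of the general technique only, not the paper's exact estimator: the function name `shrinkage_baselines` and the reward layout (one row of rollout rewards per prompt) are hypothetical, and the paper's unbiasedness construction is not reproduced here.

```python
import numpy as np

def shrinkage_baselines(rewards, shrink=None):
    """Illustrative James-Stein-style baseline (hypothetical helper).

    Shrinks the noisy per-prompt mean rewards toward the grand mean
    across prompts. rewards has shape (num_prompts, num_rollouts).
    """
    per_prompt = rewards.mean(axis=1)   # naive per-prompt baselines
    grand = per_prompt.mean()           # pooled estimate across prompts
    k, n = rewards.shape
    if shrink is None:
        # Classic James-Stein-style shrink factor: the noisier each
        # per-prompt mean is (variance sigma^2 / n), the harder we pull
        # it toward the grand mean.
        noise_var = rewards.var(axis=1, ddof=1).mean() / n
        ss = ((per_prompt - grand) ** 2).sum()
        shrink = max(0.0, 1.0 - (k - 3) * noise_var / ss) if ss > 0 else 0.0
    return grand + shrink * (per_prompt - grand)

# Advantages for a critic-free policy gradient: reward minus baseline.
rewards = np.array([[1., 0., 1., 1.],
                    [0., 0., 1., 0.],
                    [1., 1., 1., 0.]])
baselines = shrinkage_baselines(rewards)
advantages = rewards - baselines[:, None]
```

With `shrink=1.0` this recovers the standard per-prompt mean baseline, and with `shrink=0.0` every prompt shares the single pooled baseline; the interesting regime is in between, where pooling reduces the variance of the baseline estimates.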

Getting Started

Create a conda environment and install dependencies

conda create -n jspo python=3.10
conda activate jspo
USE_MEGATRON=0 bash misc/install_vllm_sglang_mcore.sh
pip install --no-deps -e .
cd peft
pip install -v -e .
cd ..
pip install numpy==1.26 math_verify==0.8.0

Set up a global cache on an SSD with at least 50 GB free

export CACHE=/path/to/your/cache
  • $CACHE: global cache root; must be an absolute path with no ~ or symlinks
  • $CACHE/hf_models/{hf_id}/{hf_name}: default path for models
  • $CACHE/verl-data/{dataset_name}/train.parquet (test.parquet): default path for data
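A quick sketch of preparing that layout up front (the `/tmp/jspo-cache` fallback path here is only a placeholder for testing; point `CACHE` at your own absolute SSD path):

```shell
# Prepare the cache layout the download scripts expect.
export CACHE="${CACHE:-/tmp/jspo-cache}"   # replace with your absolute SSD path
mkdir -p "$CACHE/hf_models" "$CACHE/verl-data"
ls "$CACHE"
```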

Download models and datasets, for example:

python misc/download_model.py
python misc/download_knk_data.py --dataset="self-label-zanette-lab/knight-knave-3" --save_name="train"
python misc/download_knk_data.py --dataset="self-label-zanette-lab/knight-knave-3-OOD-test100" --save_name="test"
python misc/download_math_data.py --dataset="guanning-ai/dapo17k" --no_test
python misc/download_math_data.py --dataset="guanning/aime25" --no_train

Set WANDB_TOKEN and the Slurm settings in run_knk.sh to your own, then launch the experiments:

bash run_knk.sh jspo
bash run_knk.sh rloo            # baseline / comparison

Acknowledgement

The authors gratefully acknowledge Fahim Tajwar, Sheikh Shafayat, and all other members of Zanette's Lab for their helpful suggestions and valuable feedback.

Bibtex

To Be Filled
