
James-Stein Policy Optimization

This is the GitHub repository for the paper "Variance-Reduced Reinforcement Learning for Large Reasoning Models via James-Stein Baselines" by Guanning Zeng, Zhaoyi Zhou, Daman Arora, and Andrea Zanette.


[Teaser figure]
Overview of James-Stein Policy Optimization (JSPO). JSPO replaces the usual per-prompt baseline in critic-free RL with an analytically derived shrinkage baseline that pools information across prompts while preserving an unbiased policy-gradient estimator. It consistently lowers gradient variance and improves training stability and accuracy for reasoning LLMs across different rollout budgets.
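To make the shrinkage idea concrete, here is a minimal, self-contained sketch of a James-Stein-style baseline for critic-free advantages. It is an illustration of the general technique only, not the paper's exact estimator: the function name `shrinkage_baselines` and the reward layout (one row of rollout rewards per prompt) are hypothetical, and the paper's unbiasedness construction is not reproduced here.

```python
import numpy as np

def shrinkage_baselines(rewards, shrink=None):
    """Illustrative James-Stein-style baseline (hypothetical helper).

    Shrinks the noisy per-prompt mean rewards toward the grand mean
    across prompts. rewards has shape (num_prompts, num_rollouts).
    """
    per_prompt = rewards.mean(axis=1)   # naive per-prompt baselines
    grand = per_prompt.mean()           # pooled estimate across prompts
    k, n = rewards.shape
    if shrink is None:
        # Classic James-Stein-style shrink factor: the noisier each
        # per-prompt mean is (variance sigma^2 / n), the harder we pull
        # it toward the grand mean.
        noise_var = rewards.var(axis=1, ddof=1).mean() / n
        ss = ((per_prompt - grand) ** 2).sum()
        shrink = max(0.0, 1.0 - (k - 3) * noise_var / ss) if ss > 0 else 0.0
    return grand + shrink * (per_prompt - grand)

# Advantages for a critic-free policy gradient: reward minus baseline.
rewards = np.array([[1., 0., 1., 1.],
                    [0., 0., 1., 0.],
                    [1., 1., 1., 0.]])
baselines = shrinkage_baselines(rewards)
advantages = rewards - baselines[:, None]
```

With `shrink=1.0` this recovers the standard per-prompt mean baseline, and with `shrink=0.0` every prompt shares the single pooled baseline; the interesting regime is in between, where pooling reduces the variance of the baseline estimates.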

Getting Started

Create a conda environment and install dependencies

conda create -n jspo python=3.10
conda activate jspo
USE_MEGATRON=0 bash misc/install_vllm_sglang_mcore.sh
pip install --no-deps -e .
cd peft
pip install -v -e .
cd ..
pip install numpy==1.26 math_verify==0.8.0

Set up a global cache on an SSD with at least 50 GB free

export CACHE=/path/to/your/cache
  • $CACHE: global cache root; must be an absolute path with no ~ or symlinks
  • $CACHE/hf_models/{hf_id}/{hf_name}: default path for models
  • $CACHE/verl-data/{dataset_name}/train.parquet (test.parquet): default path for data
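A quick sketch of preparing that layout up front (the `/tmp/jspo-cache` fallback path here is only a placeholder for testing; point `CACHE` at your own absolute SSD path):

```shell
# Prepare the cache layout the download scripts expect.
export CACHE="${CACHE:-/tmp/jspo-cache}"   # replace with your absolute SSD path
mkdir -p "$CACHE/hf_models" "$CACHE/verl-data"
ls "$CACHE"
```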

Download models and datasets, for example:

python misc/download_model.py
python misc/download_knk_data.py --dataset="self-label-zanette-lab/knight-knave-3" --save_name="train"
python misc/download_knk_data.py --dataset="self-label-zanette-lab/knight-knave-3-OOD-test100" --save_name="test"
python misc/download_math_data.py --dataset="guanning-ai/dapo17k" --no_test
python misc/download_math_data.py --dataset="guanning/aime25" --no_train

Set WANDB_TOKEN and the Slurm settings in run_knk.sh to your own, then launch the experiments:

bash run_knk.sh jspo
bash run_knk.sh rloo            # baseline / comparison

Acknowledgement

The authors gratefully acknowledge Fahim Tajwar, Sheikh Shafayat, and all other members of Zanette's Lab for their helpful suggestions and valuable feedback.

Bibtex

To Be Filled
