This is the GitHub repository for the paper "Variance-Reduced Reinforcement Learning for Large Reasoning Models via James-Stein Baselines" by Guanning Zeng, Zhaoyi Zhou, Daman Arora, and Andrea Zanette.
Create a conda environment and install the dependencies:

```bash
conda create -n jspo python=3.10
conda activate jspo
USE_MEGATRON=0 bash misc/install_vllm_sglang_mcore.sh
pip install --no-deps -e .
cd peft
pip install -v -e .
cd ..
pip install numpy==1.26 math_verify==0.8.0
```

Set up a global cache on a drive with at least 50 GB of SSD space available:

```bash
export CACHE=/path/to/your/cache
```

- `$CACHE`: global cache directory; must be set to an absolute path, without `~` or soft links
- `$CACHE/hf_models/{hf_id}/{hf_name}`: default path for models
- `$CACHE/verl-data/{dataset_name}/train.parquet` (`test.parquet`): default path for data
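For reference, the default cache layout above can be sketched in Python. The helper names and the example model/dataset identifiers here are hypothetical; only the path templates come from the settings described above.

```python
import os

# Illustrative helpers for the default cache layout (hypothetical names).
def default_model_path(cache, hf_id, hf_name):
    # Models live at $CACHE/hf_models/{hf_id}/{hf_name}
    return os.path.join(cache, "hf_models", hf_id, hf_name)

def default_data_path(cache, dataset_name, split="train"):
    # Data lives at $CACHE/verl-data/{dataset_name}/{split}.parquet
    return os.path.join(cache, "verl-data", dataset_name, f"{split}.parquet")
```

For example, `default_data_path("/data/cache", "knight-knave-3")` resolves to `/data/cache/verl-data/knight-knave-3/train.parquet`.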
Download models and datasets, for example:

```bash
python misc/download_model.py
python misc/download_knk_data.py --dataset="self-label-zanette-lab/knight-knave-3" --save_name="train"
python misc/download_knk_data.py --dataset="self-label-zanette-lab/knight-knave-3-OOD-test100" --save_name="test"
python misc/download_math_data.py --dataset="guanning-ai/dapo17k" --no_test
python misc/download_math_data.py --dataset="guanning/aime25" --no_train
```

Change `WANDB_TOKEN` and the Slurm settings in `run_knk.sh` to your own, then launch the experiments:
```bash
bash run_knk.sh jspo
bash run_knk.sh rloo  # baseline / comparison
```

The authors gratefully acknowledge Fahim Tajwar, Sheikh Shafayat, and all the other members of Zanette's Lab for their helpful suggestions and valuable feedback.
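For orientation, the two baselines compared above can be sketched as follows. This is an illustrative reconstruction, not the repository's code: `rloo_baseline` is the standard leave-one-out mean over a group of sampled rewards, and `james_stein_shrink` is the textbook positive-part James-Stein estimator that shrinks per-prompt mean rewards toward the grand mean; the paper's exact estimator may differ.

```python
import numpy as np

def rloo_baseline(rewards):
    # Leave-one-out baseline: for each sample i, the mean reward of the
    # other n-1 samples drawn for the same prompt.
    r = np.asarray(rewards, dtype=float)
    n = r.shape[0]
    return (r.sum() - r) / (n - 1)

def james_stein_shrink(group_means, noise_var):
    # Positive-part James-Stein estimator shrinking per-prompt mean rewards
    # toward the grand mean (valid for p >= 4 prompts). Illustrative only.
    x = np.asarray(group_means, dtype=float)
    p = x.shape[0]
    grand = x.mean()
    s = np.sum((x - grand) ** 2)
    shrink = max(0.0, 1.0 - (p - 3) * noise_var / s) if s > 0 else 0.0
    return grand + shrink * (x - grand)
```

With rewards `[1, 2, 3]` for one prompt, `rloo_baseline` gives `[2.5, 2.0, 1.5]`; when the noise variance dominates the spread of group means, `james_stein_shrink` collapses every baseline to the grand mean.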
To Be Filled
