The official codebase for *Improving Reinforcement Learning from Human Feedback with Efficient Reward Model Ensemble*.
First, install this repo and its dependencies in a conda environment. We used Python 3.10.
```bash
pip install -e .
```

Download the pretrained models by running the following:
```bash
bash model_prepare/get_llama.sh
bash model_prepare/get_finetuned_models.sh
```

Set up wandb to log PPO training. You may need to create an account on the wandb website first. Then, run:
```bash
wandb login
```

Run:

```bash
bash scripts/reward_modeling/alpaca_reward_modeling.sh
```

This script runs the following scripts sequentially:
- `reward_modeling_ind_exps.sh`: Trains single reward models, which can be used individually or ensembled later. The trained reward models are registered in `model_configs/ind_reward_models_with_pretrain.yaml`.
- `reward_modeling_linear_exps.sh`: Trains linear-layer ensemble reward models (see the conceptual sketch below). The trained reward models are registered in `model_configs/linear_ensemble_reward_models.yaml`.
- `reward_modeling_linear_then_lora_exps.sh`: First trains linear-layer ensemble reward models, then further trains them with LoRA. The trained reward models are registered in `model_configs/lora_ensemble_reward_models.yaml`.
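For intuition, here is a minimal conceptual sketch of what a linear-head reward model ensemble looks like; it is **not** this repo's API. The backbone stand-in, the head count, and the mean / mean-minus-std combination rules are illustrative assumptions only.

```python
# Conceptual sketch of a linear-head reward model ensemble (NOT the repo's API).
# A shared backbone produces a feature vector for a (prompt, response) pair;
# k independent linear heads each map it to a scalar reward, and the ensemble
# combines the k scores (here: mean, or mean minus std as a conservative variant).
import torch
import torch.nn as nn


class LinearEnsembleRewardModel(nn.Module):
    def __init__(self, hidden_size: int = 768, num_heads: int = 3):
        super().__init__()
        # Stand-in for a pretrained LM backbone; in practice this would be a
        # language model producing (e.g.) last-token hidden states.
        self.backbone = nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.Tanh())
        # k linear reward heads, e.g. trained with different seeds / data orders.
        self.heads = nn.ModuleList(nn.Linear(hidden_size, 1) for _ in range(num_heads))

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        h = self.backbone(features)
        # Per-head rewards, shape (batch, num_heads).
        return torch.cat([head(h) for head in self.heads], dim=-1)


if __name__ == "__main__":
    model = LinearEnsembleRewardModel()
    feats = torch.randn(4, 768)                 # fake features for 4 responses
    rewards = model(feats)                      # shape (4, 3)
    mean_reward = rewards.mean(dim=-1)          # simple ensemble: average the heads
    conservative = mean_reward - rewards.std(dim=-1)  # pessimistic variant
    print(mean_reward, conservative)
```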
Run the following:
```bash
bash scripts/best_of_n/run_exps.sh
```

This script does the following:
- **Decoding and Scoring.** It first generates all the samples using a policy model (by default, the SFT-ed model in AlpacaFarm, `sft10k`), producing 200 outputs for each input prompt. Then it computes the rewards of these samples using the different reward models.
- **Alpaca-eval Evaluation.** To evaluate best-of-n, we pick the first n of the generated samples for each prompt, select the one with the highest reward, and evaluate it using Alpaca-eval (see the sketch below).
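For intuition, a minimal sketch of the best-of-n selection step, assuming the samples and their reward scores are kept in parallel lists per prompt; the function and variable names are hypothetical and not taken from this repo.

```python
# Conceptual sketch of best-of-n selection (NOT the repo's code).
# Assume `samples[prompt]` holds the generated outputs for a prompt and
# `rewards[prompt]` the corresponding reward-model scores, in the same order.
from typing import Dict, List


def best_of_n(samples: Dict[str, List[str]],
              rewards: Dict[str, List[float]],
              n: int) -> Dict[str, str]:
    """For each prompt, look at the first n samples and return the one
    with the highest reward."""
    selected = {}
    for prompt, outputs in samples.items():
        scores = rewards[prompt][:n]
        best_idx = max(range(len(scores)), key=scores.__getitem__)
        selected[prompt] = outputs[best_idx]
    return selected


if __name__ == "__main__":
    samples = {"What is RLHF?": ["answer a", "answer b", "answer c"]}
    rewards = {"What is RLHF?": [0.1, 0.7, 0.4]}
    print(best_of_n(samples, rewards, n=2))  # -> {'What is RLHF?': 'answer b'}
```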
Run the following:
```bash
bash scripts/ppo/run_exps.sh
```

This script does the following:
- **Run PPO Training.** Finetune the `sft10k` model using PPO and the trained reward models (a conceptual sketch of typical RLHF reward shaping follows this list).
- **Decoding Using PPO-Trained Policies.** We then run the PPO-finetuned policies to generate samples.
- **Alpaca-eval Evaluation.** Similar to the best-of-n experiments, we evaluate the samples generated in the previous step using Alpaca-eval.
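As background, here is a minimal sketch of the reward shaping commonly used in RLHF-style PPO: the reward-model score is added at the final token, and a per-token KL penalty keeps the policy close to the SFT model. This is a generic illustration, not necessarily the exact objective or hyperparameters used by these scripts; the function name and `kl_coef` value are assumptions.

```python
# Conceptual sketch of RLHF-style PPO reward shaping (NOT this repo's code).
import torch


def shaped_rewards(rm_score: torch.Tensor,
                   policy_logprobs: torch.Tensor,
                   sft_logprobs: torch.Tensor,
                   kl_coef: float = 0.1) -> torch.Tensor:
    """rm_score: (batch,) scalar reward-model score per response.
    policy_logprobs / sft_logprobs: (batch, seq_len) per-token log-probs.
    Returns per-token rewards of shape (batch, seq_len)."""
    # Approximate per-token KL penalty toward the SFT policy.
    kl_penalty = kl_coef * (policy_logprobs - sft_logprobs)
    rewards = -kl_penalty
    # Reward-model score is credited at the final token of the response.
    rewards[:, -1] += rm_score
    return rewards


if __name__ == "__main__":
    batch, seq_len = 2, 5
    rm_score = torch.tensor([1.5, -0.3])
    policy_lp = torch.randn(batch, seq_len)
    sft_lp = torch.randn(batch, seq_len)
    print(shaped_rewards(rm_score, policy_lp, sft_lp).shape)  # torch.Size([2, 5])
```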
Useful references:
- The original AlpacaFarm repo: [tatsu-lab/alpaca_farm](https://github.com/tatsu-lab/alpaca_farm).