
AlpacaFarm with Reward Ensemble

The official codebase for the paper "Improving Reinforcement Learning from Human Feedback with Efficient Reward Model Ensemble".

Preparation

Python Environment

First, install this repo and its dependencies in a conda environment (we used Python 3.10):

pip install -e .

Models

Download the pretrained models by running the following:

bash model_prepare/get_llama.sh
bash model_prepare/get_finetuned_models.sh

Logging

Set up wandb to log PPO training. You may need to create an account on the wandb website first. Then run:

wandb login

Reward Modeling Experiments

Run

bash scripts/reward_modeling/alpaca_reward_modeling.sh

This script runs the following scripts sequentially:

  • reward_modeling_ind_exps.sh: Trains individual reward models, which can be used on their own or ensembled later. The trained reward models are registered in model_configs/ind_reward_models_with_pretrain.yaml.
  • reward_modeling_linear_exps.sh: Trains linear-layer ensemble reward models (see the sketch after this list). The trained reward models are registered in model_configs/linear_ensemble_reward_models.yaml.
  • reward_modeling_linear_then_lora_exps.sh: First trains linear-layer ensemble reward models, then continues training them with LoRA. The trained reward models are registered in model_configs/lora_ensemble_reward_models.yaml.
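
A linear-layer ensemble attaches several scalar reward heads to a shared language-model backbone, so the ensemble costs little more than a single reward model. Below is a minimal conceptual sketch in PyTorch; the class and variable names are illustrative, not this repo's API.

import torch
import torch.nn as nn

class LinearEnsembleRewardHead(nn.Module):
    # num_members scalar reward heads sharing one backbone representation.
    def __init__(self, hidden_size, num_members=3):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Linear(hidden_size, 1) for _ in range(num_members)
        )

    def forward(self, features):
        # features: (batch, hidden_size), e.g. the final-token representation
        # produced by a shared LM backbone.
        return torch.cat([head(features) for head in self.heads], dim=-1)

head = LinearEnsembleRewardHead(hidden_size=4096)
features = torch.randn(4, 4096)                # placeholder backbone features
member_rewards = head(features)                # shape (4, num_members)
ensemble_reward = member_rewards.mean(dim=-1)  # e.g. aggregate by the mean

A LoRA stage can additionally give each member its own low-rank backbone adapters, which keeps members diverse while remaining cheap to train.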

Best-of-n Experiments

Run the following:

bash scripts/best_of_n/run_exps.sh

This script does the following:

Decoding and Scoring. It first generates samples using a policy model (by default, sft10k, the supervised-finetuned model from AlpacaFarm), producing 200 outputs for each input prompt. It then computes the rewards of these samples using the different reward models.

Alpaca-eval Evaluation. For best-of-n evaluation, it takes the first $n$ samples for each prompt, looks up their rewards computed in the previous step, and selects the sample with the highest reward. It then runs Alpaca-eval, which asks GPT-4 to judge the alignment quality of the selected samples.
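
As a sketch of the selection step, assuming per-prompt lists of samples and their precomputed rewards (all names illustrative):

def best_of_n(samples, rewards, n):
    # Among the first n samples, return the one with the highest reward.
    best_index = max(range(min(n, len(samples))), key=lambda i: rewards[i])
    return samples[best_index]

best_of_n(["resp_a", "resp_b", "resp_c"], [0.2, 1.3, 0.7], n=2)  # -> "resp_b"

Because all 200 samples are scored once up front, best-of-n results for any n ≤ 200 can be computed without re-decoding.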

PPO Experiments

Run the following:

bash scripts/ppo/run_exps.sh

This script does the following:

Run PPO Training. Finetunes the sft10k model using PPO with the trained reward models.
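
Conceptually, the only change from standard PPO is where the scalar rollout reward comes from: with an ensemble, each rollout's reward aggregates the member models' scores. A minimal illustration, where reward_fns stands in for the trained reward models (not this repo's exact interface):

def rollout_reward(reward_fns, prompt, response):
    # Average the member rewards. A conservative variant could instead
    # penalize disagreement, e.g. mean minus a multiple of the std.
    scores = [rm(prompt, response) for rm in reward_fns]
    return sum(scores) / len(scores)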

Decoding Using PPO-Trained Policies. It then runs the PPO-finetuned policies to generate samples.

Alpaca-eval Evaluation. Similar to the best-of-n experiments, we evaluate the samples generated in the previous step using Alpaca-eval.

Other Useful Information

Useful references:

  • The original AlpacaFarm repo: https://github.com/tatsu-lab/alpaca_farm.
