The official codebase for *Improving Reinforcement Learning from Human Feedback with Efficient Reward Model Ensemble*.
First, install this repo and its dependencies in a conda environment. We used Python 3.10.
```bash
pip install -e .
```

Download the pretrained models by running the following:
```bash
bash model_prepare/get_llama.sh
bash model_prepare/get_finetuned_models.sh
```

Set up wandb to log PPO training. You may need to create an account on the wandb website first. Then, run:
```bash
wandb login
```

Run:

```bash
bash scripts/reward_modeling/alpaca_reward_modeling.sh
```

This script runs the following scripts sequentially:
- `reward_modeling_ind_exps.sh`: Trains single reward models, which can be used individually or ensembled later. The trained reward models are registered in `model_configs/ind_reward_models_with_pretrain.yaml`.
- `reward_modeling_linear_exps.sh`: Trains linear-layer ensemble reward models (see the conceptual sketch below). The trained reward models are registered in `model_configs/linear_ensemble_reward_models.yaml`.
- `reward_modeling_linear_then_lora_exps.sh`: First trains linear-layer ensemble reward models, then further trains them with LoRA. The trained reward models are registered in `model_configs/lora_ensemble_reward_models.yaml`.
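For intuition, here is a minimal conceptual sketch of what a linear-head reward model ensemble looks like; it is **not** this repo's API. The backbone stand-in, the head count, and the mean / mean-minus-std combination rules are illustrative assumptions only.

```python
# Conceptual sketch of a linear-head reward model ensemble (NOT the repo's API).
# A shared backbone produces a feature vector for a (prompt, response) pair;
# k independent linear heads each map it to a scalar reward, and the ensemble
# combines the k scores (here: mean, or mean minus std as a conservative variant).
import torch
import torch.nn as nn


class LinearEnsembleRewardModel(nn.Module):
    def __init__(self, hidden_size: int = 768, num_heads: int = 3):
        super().__init__()
        # Stand-in for a pretrained LM backbone; in practice this would be a
        # language model producing (e.g.) last-token hidden states.
        self.backbone = nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.Tanh())
        # k linear reward heads, e.g. trained with different seeds / data orders.
        self.heads = nn.ModuleList(nn.Linear(hidden_size, 1) for _ in range(num_heads))

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        h = self.backbone(features)
        # Per-head rewards, shape (batch, num_heads).
        return torch.cat([head(h) for head in self.heads], dim=-1)


if __name__ == "__main__":
    model = LinearEnsembleRewardModel()
    feats = torch.randn(4, 768)                 # fake features for 4 responses
    rewards = model(feats)                      # shape (4, 3)
    mean_reward = rewards.mean(dim=-1)          # simple ensemble: average the heads
    conservative = mean_reward - rewards.std(dim=-1)  # pessimistic variant
    print(mean_reward, conservative)
```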
Run the following:
```bash
bash scripts/best_of_n/run_exps.sh
```

This script does the following:
- **Decoding and Scoring.** It first generates all the samples using a policy model (by default, the SFT-ed model in AlpacaFarm, `sft10k`), producing 200 outputs for each input prompt. Then it computes the rewards of these samples using the different reward models.
- **Alpaca-eval Evaluation.** To evaluate best-of-n, we pick the first n of the generated samples for each prompt, select the one with the highest reward, and evaluate it using Alpaca-eval (see the sketch below).
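For intuition, a minimal sketch of the best-of-n selection step, assuming the samples and their reward scores are kept in parallel lists per prompt; the function and variable names are hypothetical and not taken from this repo.

```python
# Conceptual sketch of best-of-n selection (NOT the repo's code).
# Assume `samples[prompt]` holds the generated outputs for a prompt and
# `rewards[prompt]` the corresponding reward-model scores, in the same order.
from typing import Dict, List


def best_of_n(samples: Dict[str, List[str]],
              rewards: Dict[str, List[float]],
              n: int) -> Dict[str, str]:
    """For each prompt, look at the first n samples and return the one
    with the highest reward."""
    selected = {}
    for prompt, outputs in samples.items():
        scores = rewards[prompt][:n]
        best_idx = max(range(len(scores)), key=scores.__getitem__)
        selected[prompt] = outputs[best_idx]
    return selected


if __name__ == "__main__":
    samples = {"What is RLHF?": ["answer a", "answer b", "answer c"]}
    rewards = {"What is RLHF?": [0.1, 0.7, 0.4]}
    print(best_of_n(samples, rewards, n=2))  # -> {'What is RLHF?': 'answer b'}
```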
Run the following:
```bash
bash scripts/ppo/run_exps.sh
```

This script does the following:
- **Run PPO Training.** Finetune the `sft10k` model using PPO and the trained reward models (a conceptual sketch of typical RLHF reward shaping follows this list).
- **Decoding Using PPO-Trained Policies.** We then run the PPO-finetuned policies to generate samples.
- **Alpaca-eval Evaluation.** Similar to the best-of-n experiments, we evaluate the samples generated in the previous step using Alpaca-eval.
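As background, here is a minimal sketch of the reward shaping commonly used in RLHF-style PPO: the reward-model score is added at the final token, and a per-token KL penalty keeps the policy close to the SFT model. This is a generic illustration, not necessarily the exact objective or hyperparameters used by these scripts; the function name and `kl_coef` value are assumptions.

```python
# Conceptual sketch of RLHF-style PPO reward shaping (NOT this repo's code).
import torch


def shaped_rewards(rm_score: torch.Tensor,
                   policy_logprobs: torch.Tensor,
                   sft_logprobs: torch.Tensor,
                   kl_coef: float = 0.1) -> torch.Tensor:
    """rm_score: (batch,) scalar reward-model score per response.
    policy_logprobs / sft_logprobs: (batch, seq_len) per-token log-probs.
    Returns per-token rewards of shape (batch, seq_len)."""
    # Approximate per-token KL penalty toward the SFT policy.
    kl_penalty = kl_coef * (policy_logprobs - sft_logprobs)
    rewards = -kl_penalty
    # Reward-model score is credited at the final token of the response.
    rewards[:, -1] += rm_score
    return rewards


if __name__ == "__main__":
    batch, seq_len = 2, 5
    rm_score = torch.tensor([1.5, -0.3])
    policy_lp = torch.randn(batch, seq_len)
    sft_lp = torch.randn(batch, seq_len)
    print(shaped_rewards(rm_score, policy_lp, sft_lp).shape)  # torch.Size([2, 5])
```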
Useful references:
- The original AlpacaFarm repo: [tatsu-lab/alpaca_farm](https://github.com/tatsu-lab/alpaca_farm).