
General Exploratory Bonus for Optimistic Exploration in RLHF

Paper | GitHub | Hugging Face Collection

Introduction

Optimistic exploration is central to improving sample efficiency in reinforcement learning with human feedback, yet existing exploratory bonus methods often fail to realize optimism. We provide a theoretical analysis showing that current formulations, under KL or $\alpha$-divergence regularization, unintentionally bias exploration toward high-probability regions of the reference model, thereby reinforcing conservative behavior instead of promoting discovery of uncertain regions. To address this pitfall, we introduce the General Exploratory Bonus (GEB), a novel theoretical framework that provably satisfies the optimism principle. GEB counteracts divergence-induced bias via reference-dependent reward regulation and unifies prior heuristic bonuses as special cases, while extending naturally across the full $\alpha$-divergence family. Empirically, GEB consistently outperforms baselines on alignment tasks across multiple divergence settings and large language model backbones. These results demonstrate that GEB offers both a principled and practical solution for optimistic exploration in RLHF.

How to design and implement your own General Exploratory Bonus (GEB)

You can design and implement your own general exploratory bonus for online iterative f-DPO as follows:

  1. Design a general exploratory bonus (GEB) for a specific $\alpha$-divergence as $$u f'(u) - f(u)$$, where $$f$$ is the generator of the chosen divergence class. You can choose $$u$$ flexibly, as long as $$u > \alpha$$ and $$u$$ is a function of $$\pi$$ that decreases monotonically with respect to $$\pi$$.

    • for example, when choosing the forward KL divergence, the corresponding generator is $$f(u) = -\log u$$. Picking $$u = \frac{1}{\pi}$$, which decreases monotonically with respect to $$\pi$$, gives the final GEB formulation $$-1 - \log \pi$$.
  2. Then substitute policy_rejected_logps.exp() for $$\pi$$ in the GEB formulation and add the resulting term, scaled by kappa, to the f-DPO loss (see the sketch after this list).

    • for example, with the GEB formulation $$-1 - \log \pi$$, the final loss is L_{f-DPO} + kappa * policy_rejected_logps, where kappa is the hyperparameter; the constant term of the GEB formulation is omitted.
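
Below is a minimal PyTorch sketch of the two steps above for the forward-KL case. The function and argument names (`dpo_loss`, `geb_dpo_loss`, `policy_*_logps`, `reference_*_logps`, `beta`) are illustrative assumptions rather than the repository's actual interface; the added term follows step 2 above.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             reference_chosen_logps, reference_rejected_logps, beta=0.1):
    """Standard KL-regularized DPO loss computed from summed per-response log-probs."""
    logits = beta * ((policy_chosen_logps - reference_chosen_logps)
                     - (policy_rejected_logps - reference_rejected_logps))
    return -F.logsigmoid(logits).mean()

def geb_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                 reference_chosen_logps, reference_rejected_logps,
                 beta=0.1, kappa=0.01):
    """f-DPO loss plus the forward-KL GEB term (the constant of -1 - log(pi) is omitted)."""
    base = dpo_loss(policy_chosen_logps, policy_rejected_logps,
                    reference_chosen_logps, reference_rejected_logps, beta)
    # Substitute pi = policy_rejected_logps.exp(); after dropping the constant,
    # the bonus enters the loss as kappa * policy_rejected_logps, as in step 2.
    return base + kappa * policy_rejected_logps.mean()
```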

Reproduction

Train

To run the iterative online RLHF algorithm, please run

bash Llama_SFT_GEB.sh

There are some arguments you might adjust in the script:

loss_type: choose from [dpo, geb_p, geb_f, geb_tanh]
f_div: choose from [kl, hel, fkl]
kappa: the coefficient of the GEB term; adjust it according to Section 5 of our paper
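
For reference, here is a hedged sketch of how the general recipe $$u f'(u) - f(u)$$ with $$u = \frac{1}{\pi}$$ works out for a few standard generators. The mapping of the f_div names to these generators (fkl as forward KL, kl as reverse KL, hel as squared Hellinger) is an assumption for illustration and may not match the training script's internals.

```python
import torch

# Illustrative generators f(u) and their derivatives f'(u); the actual
# mapping of {kl, hel, fkl} inside the training script may differ.
GENERATORS = {
    "fkl": (lambda u: -torch.log(u),       lambda u: -1.0 / u),            # forward KL (assumed)
    "kl":  (lambda u: u * torch.log(u),    lambda u: torch.log(u) + 1.0),  # reverse KL (assumed)
    "hel": (lambda u: (u.sqrt() - 1) ** 2, lambda u: 1.0 - u.rsqrt()),     # squared Hellinger (assumed)
}

def geb_bonus(policy_rejected_logps, f_div="fkl"):
    """General recipe from the how-to section: u * f'(u) - f(u), with u = 1/pi."""
    f, fprime = GENERATORS[f_div]
    u = 1.0 / policy_rejected_logps.exp()
    return u * fprime(u) - f(u)

# Sanity check for the forward-KL case: the bonus reduces to -1 - log(pi),
# i.e., -1 - policy_rejected_logps, matching the worked example above.
```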

Evaluation

1. Generate responses from the models

python generation/generate_eval_test.py --model_name_or_path MODEL_NAME --output_name_or_path FILE_NAME

2. Check the win rate and average reward

accelerate launch --main_process_port 29710 evaluation/check_win_rate.py --data_name test --model_name FILE_NAME

Citation

We now have a paper you can cite:

@article{li2025general,
  title={General Exploratory Bonus for Optimistic Exploration in RLHF},
  author={Li, Wendi and Oh, Changdae and Li, Yixuan},
  journal={arXiv preprint arXiv:2510.03269},
  year={2025}
}
