RPBench-Auto

An automated pipeline for evaluating LLMs for role-playing.

Installation

pip install -r requirements.txt

Usage

First, set the environment variable OPENAI_API_KEY for the judge model, and point the pipeline to the path of the RPBench dataset.

export OPENAI_API_KEY=<API_KEY>
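As an optional sanity check before launching a run, a short Python sketch along these lines can confirm that the key is visible to the process. The helper name `require_env` is hypothetical and is not part of RPBench-Auto.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`; raise if it is unset or empty.

    Hypothetical helper for illustration only, not part of RPBench-Auto.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Example usage (run after the export above):
# require_env("OPENAI_API_KEY")
```

Failing early this way avoids starting an evaluation run that would only fail once the judge model is first called.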

Then, add a config entry for the model you want to evaluate by editing config/api_config.yaml. The OpenAI API (and compatible APIs) and the Anthropic API are currently supported.

Finally, run the pipeline.

python run_character_eval.py --model_1 <CONFIG_NAME>  # Evaluate the model on the character subset
python run_scene_eval.py --model_1 <CONFIG_NAME>  # Evaluate the model on the scene subset
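To run both subsets in one go, the two commands above could be wrapped in a small driver script. This is a minimal sketch that assumes it is launched from the repository root; the function names `build_eval_commands` and `run_all` are hypothetical, while the script names and the `--model_1` flag are taken from the commands above.

```python
import subprocess
import sys
from typing import List

# Script names as they appear in the usage commands above.
EVAL_SCRIPTS = ("run_character_eval.py", "run_scene_eval.py")


def build_eval_commands(config_name: str) -> List[List[str]]:
    """Build one command per evaluation script for the given config name."""
    return [
        [sys.executable, script, "--model_1", config_name]
        for script in EVAL_SCRIPTS
    ]


def run_all(config_name: str) -> None:
    """Run the character and scene evaluations in sequence.

    check=True stops at the first script that exits with a non-zero status.
    """
    for cmd in build_eval_commands(config_name):
        subprocess.run(cmd, check=True)


# Example usage, with <CONFIG_NAME> replaced by an entry from config/api_config.yaml:
# run_all("<CONFIG_NAME>")
```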

Generate the leaderboard.

python generate_leaderboard.py

How to contribute

After running all commands above, you can add your model to the leaderboard by creating a pull request with the updated leaderboard files, leaderboard.csv and leaderboard_for_display.csv, plus the .jsonl files in /results/character and /results/scene. The leaderboard will be updated automatically when the PR is merged.
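Before opening the pull request, it may help to confirm that the files listed above are actually present. The sketch below is a hedged example: `missing_pr_artifacts` is a hypothetical helper, and it assumes the two leaderboard CSV files sit at the repository root, which the text above does not state explicitly.

```python
from pathlib import Path
from typing import List


def missing_pr_artifacts(repo_root: str) -> List[str]:
    """List the expected pull-request files that are not present under repo_root.

    Assumption: leaderboard.csv and leaderboard_for_display.csv live at the
    repository root; adjust the paths if they are written elsewhere.
    """
    root = Path(repo_root)
    missing = []

    # Leaderboard files named in the contribution instructions.
    for name in ("leaderboard.csv", "leaderboard_for_display.csv"):
        if not (root / name).is_file():
            missing.append(name)

    # Each results directory should contain at least one .jsonl file.
    for subdir in ("results/character", "results/scene"):
        directory = root / subdir
        if not (directory.is_dir() and any(directory.glob("*.jsonl"))):
            missing.append(f"{subdir}/*.jsonl")

    return missing


# Example usage from the repository root:
# print(missing_pr_artifacts("."))
```

An empty list means every file mentioned above was found.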

Acknowledgements

This benchmark is heavily inspired by ArenaHard and AlpacaEval. Some code implementations are borrowed from these repositories.
