This is the official codebase for the paper PersonaEval: Are LLM Evaluators Human Enough to Judge Role-Play?.
We use uv for environment management. Install it first:
```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

Then install the project dependencies:

```bash
uv sync
```

The evaluation datasets will be automatically downloaded when you run any command. The system will:
- Check if the required data files exist in the `data/` directory
- If files are missing, automatically download them from the Hugging Face Hub
- Required files: `Literary.csv`, `Drama.csv`, `Expertise.csv`
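For reference, the check-and-download step is equivalent in spirit to the sketch below. This is only an illustration, not the actual implementation (which lives in `personaeval/dataset_prepare.py`); in particular, the Hugging Face repository ID shown is a placeholder assumption.

```python
# Sketch of the check-and-download behaviour described above (not the actual
# implementation). The repo_id below is a placeholder, not the real dataset repo.
from pathlib import Path
from huggingface_hub import hf_hub_download

REQUIRED_FILES = ["Literary.csv", "Drama.csv", "Expertise.csv"]

def download_missing(data_dir: str = "data") -> None:
    data_path = Path(data_dir)
    data_path.mkdir(parents=True, exist_ok=True)
    for name in REQUIRED_FILES:
        if not (data_path / name).exists():  # only fetch files that are missing
            hf_hub_download(
                repo_id="example-org/personaeval-data",  # placeholder repo ID
                repo_type="dataset",
                filename=name,
                local_dir=data_dir,
            )
```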
You can also manually prepare the dataset by running:
```bash
python -c "from personaeval.dataset_prepare import ensure_dataset_ready; ensure_dataset_ready()"
```

Copy the example configuration file and modify it:

```bash
cp configs/default.yaml.example configs/default.yaml
```

Edit `configs/default.yaml` to configure your models, including `base_url`, `api_key`, etc.
The simplest way to start is:
```bash
personaeval run --model gpt-4.1
```

This will run the evaluation on all tracks with the specified model.

Available options:

- `--model, -m`: Model name to evaluate (required)
- `--track, -t`: Track name to evaluate (default: `"all"`)
- `--config, -c`: Path to the configuration file (default: `"configs/default.yaml"`)
- `--no-resume`: Do not resume from existing results
- `--list-tracks`: List available tracks and exit
- `--list-models`: List available models and exit
```bash
# Run on a specific track
personaeval run --track Literary --model gpt-4.1

# Run on all tracks
personaeval run --track all --model claude-sonnet-4-20250514

# Use custom configuration
personaeval run --config configs/my_config.yaml --model gpt-4.1

# List available tracks and models
personaeval run --list-tracks
personaeval run --list-models
```

After running experiments, calculate evaluation metrics:
```bash
# Calculate metrics for a single model
personaeval metrics --models gpt-4.1

# Calculate metrics for multiple models
personaeval metrics --models "gpt-4.1,claude-sonnet-4-20250514"

# Generate comparison plots
personaeval metrics --models "gpt-4.1,claude-sonnet-4-20250514" --plot

# Specify custom output
personaeval metrics --models gpt-4.1 --output my_metrics.csv --plot --plot-output comparison.png
```

Each model in the configuration file supports:
- `url`: API endpoint URL
- `api_key`: Your API key
- `cost_input`: Cost per input token
- `cost_output`: Cost per output token
- `proxy_url`: Optional proxy URL for API requests

The configuration file also defines global settings:

- `max_workers`: Number of concurrent workers (default: 4)
- `max_retries`: Maximum number of retries per request (default: 5)
- `timeout`: Request timeout in seconds (default: 600)
- `temperature`: Model temperature (default: 0.0)
- `save_interval`: Interval in seconds between saves of the result file (default: 60)
- `sleep_interval`: Sleep interval between retries, in seconds (default: 60)
- `reasoning_models`: List of models that require stream mode for API calls
Each track defines:
- `name`: Track name (Literary, Drama, Expertise)
- `data_file`: Path to the CSV data file
- `output_dir`: Directory to save results
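Putting these together, `configs/default.yaml` might look roughly like the sketch below. This is only an illustration assembled from the fields documented above: the top-level key names (`models`, `settings`, `tracks`), the nesting, and all values are assumptions; `configs/default.yaml.example` is the authoritative template.

```yaml
# Illustrative sketch only. Top-level key names, nesting, and values are
# assumptions; follow configs/default.yaml.example for the real layout.
models:
  gpt-4.1:
    url: https://api.openai.com/v1   # API endpoint URL
    api_key: YOUR_API_KEY            # your API key
    cost_input: 0.000002             # cost per input token (placeholder value)
    cost_output: 0.000008            # cost per output token (placeholder value)
    proxy_url: http://127.0.0.1:7890  # optional proxy for API requests

settings:
  max_workers: 4        # concurrent workers
  max_retries: 5        # maximum retries per request
  timeout: 600          # request timeout in seconds
  temperature: 0.0      # model temperature
  save_interval: 60     # seconds between result-file saves
  sleep_interval: 60    # sleep between retries, in seconds
  reasoning_models: []  # models that require stream mode for API calls

tracks:
  - name: Literary
    data_file: data/Literary.csv
    output_dir: results/Literary
  - name: Drama
    data_file: data/Drama.csv
    output_dir: results/Drama
  - name: Expertise
    data_file: data/Expertise.csv
    output_dir: results/Expertise
```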
`personaeval run` runs evaluation experiments on specified tracks and models.
Options:
- `--config, -c`: Configuration file path
- `--track, -t`: Track name or `"all"`
- `--model, -m`: Model name (required)
- `--no-resume`: Disable resume from existing results
- `--list-tracks`: List available tracks
- `--list-models`: List available models
`personaeval metrics` calculates evaluation metrics from experiment results.
Options:
- `--results-dir, -r`: Results directory (default: `"results"`)
- `--models, -m`: Comma-separated list of model names
- `--tracks`: Comma-separated list of track names (default: `"Literary,Drama,Expertise"`)
- `--output, -o`: Output CSV file path (default: `"metrics.csv"`)
- `--plot`: Generate comparison plots
- `--plot-output`: Plot output file path (default: `"metrics_comparison.png"`)
Analyze a single result file.
Arguments:
- `result_file`: Path to the result CSV file
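For example, assuming this command is exposed as `personaeval analyze` (the subcommand name is not documented here, so treat it as an assumption) and that result CSV files are stored under `results/`:

```bash
# Hypothetical invocation; both the "analyze" subcommand name and the result
# file path are assumptions for illustration.
personaeval analyze results/Literary/gpt-4.1.csv
```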
```
personaeval/
├── personaeval/
│   ├── __init__.py
│   ├── cli.py               # Command-line interface
│   ├── config.py            # Configuration management
│   ├── models.py            # Model definitions and API calls
│   ├── evaluator.py         # Main evaluation logic
│   ├── metrics.py           # Metrics calculation
│   ├── dataset_prepare.py   # Dataset preparation utilities
│   └── utils.py             # Utility functions
├── configs/
│   ├── default.yaml         # Main configuration
│   └── default.yaml.example # Example configuration
├── data/                    # Evaluation datasets (auto-downloaded)
├── results/                 # Experiment results
├── pyproject.toml
└── README.md
```
If you find this code useful, please cite our paper:
```bibtex
@article{zhou2025personaeval,
  title={PersonaEval: Are LLM Evaluators Human Enough to Judge Role-Play?},
  author={Zhou, Lingfeng and Zhang, Jialing and Gao, Jin and Jiang, Mohan and Wang, Dequan},
  journal={arXiv preprint arXiv:2508.10014},
  year={2025}
}
```

This project is licensed under the MIT License - see the LICENSE file for details.