Authors: Pranav Mahajan, Ihor Kendiukhov, Syed Hussain, Lydia Nottingham
📊 HuggingFace Dataset | 📄 Paper
This repository contains the analysis scripts for reproducing the results from our study on how elicitation protocols affect stated vs. revealed (SvR) preference correlation in language models.
The complete dataset with model responses and analysis outputs is available on HuggingFace at the link above.
## Key Findings

- Allowing neutrality/abstention during stated preference elicitation substantially improves Spearman's rank correlation (ρ) with forced-choice revealed preferences
- Additionally allowing abstention in revealed preferences drives ρ to near zero, because neutrality rates become very high
- System-prompt steering using stated preferences does not reliably improve SvR correlation
- SvR correlation is highly protocol-dependent
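The findings above are stated in terms of Spearman's rank correlation between stated and revealed value rankings. As an illustration only (not a repo script; the value names and ELO numbers below are invented), ρ can be computed from two per-value rating dictionaries like this:

```python
# Illustrative sketch: Spearman's rank correlation between two hypothetical
# per-value ELO rankings. The repo's real analysis lives under scripts/analysis/.
def spearman_rho(xs, ys):
    """Spearman's rho for tie-free lists:
    rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical ratings from the two elicitation protocols.
stated   = {"honesty": 1520, "privacy": 1480, "fairness": 1440, "autonomy": 1400}
revealed = {"honesty": 1490, "privacy": 1510, "fairness": 1430, "autonomy": 1410}
values = sorted(stated)  # align both rating vectors on the same value keys
rho = spearman_rho([stated[v] for v in values], [revealed[v] for v in values])
print(f"Spearman rho = {rho:.2f}")  # 0.80 for these invented numbers
```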
## Repository Structure

```
├── scripts/
│   ├── generation/       # Scripts to run model evaluations
│   ├── processing/       # Scripts to process raw model responses
│   ├── analysis/         # Scripts for statistical analysis
│   └── visualization/    # Scripts for creating figures
├── configs/              # Model configuration files
├── figures/              # Generated visualization outputs
├── results/              # Analysis results and CSVs
└── img/                  # Static images for README
```
## Setup

1. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

2. Set up API keys for the model providers you want to evaluate:

   - OpenAI: `OPENAI_API_KEY`
   - Anthropic: `ANTHROPIC_API_KEY`
   - Together AI: `TOGETHER_API_KEY`
   - X.AI: `XAI_API_KEY`
   - OpenRouter: `OPENROUTER_API_KEY`
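As a convenience (not part of the repo), a small Python check can confirm which provider keys are exported in your shell before launching a run:

```python
import os

# Provider name -> environment variable expected by the generation scripts.
PROVIDER_KEYS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "together": "TOGETHER_API_KEY",
    "xai": "XAI_API_KEY",
    "openrouter": "OPENROUTER_API_KEY",
}

available = [name for name, var in PROVIDER_KEYS.items() if os.environ.get(var)]
print("Providers with keys set:", ", ".join(available) or "none")
```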
## Usage

### Running Evaluations

Revealed preferences (forced protocol):

```bash
python scripts/generation/run_revealed_preferences_forced_protocol.py --api_provider openai --model gpt-4o --api_key $OPENAI_API_KEY
```

Stated preferences (expanded protocol):

```bash
python scripts/generation/run_stated_preferences_expanded_protocol.py --api_provider anthropic --model claude-sonnet-4.5 --api_key $ANTHROPIC_API_KEY
```

Batch evaluation:

```bash
./scripts/run_models.sh
```

### Processing Responses

Process revealed preference responses:

```bash
python scripts/processing/process_generations_revealed_expanded_protocol.py --model gpt-4o
```

Calculate ELO ratings:

```bash
python scripts/processing/calculate_elo_rating_revealed.py --model gpt-4o
python scripts/processing/calculate_elo_rating_stated.py --model gpt-4o
```

### Analysis

Compute SvR divergence:

```bash
python scripts/analysis/calculate_stated_revealed_divergence.py
```

Analyze neutrality rates:

```bash
python scripts/analysis/analyze_neutrality_revealed_expanded_protocol.py
python scripts/analysis/analyze_neutrality_stated_expanded_protocol.py
```

Analyze scaling trends:

```bash
python scripts/analysis/analyze_svr_scaling_trends.py
```

### Visualization

Visualize ELO ratings:

```bash
python scripts/visualization/visualize_elo_rating.py --model gpt-4o
```

Create the 3-panel scaling trends figure:

```bash
python scripts/visualization/create_svr_gap_scaling_trend_visualization_3_panels.py
```

### Generation Scripts

| Script | Description |
|---|---|
| `run_revealed_preferences_forced_protocol.py` | Forced binary choice revealed preferences |
| `run_revealed_preferences_expanded_protocol.py` | Expanded protocol (with neutrality) revealed preferences |
| `run_stated_preferences_forced_protocol.py` | Forced binary choice stated preferences |
| `run_stated_preferences_expanded_protocol.py` | Expanded protocol stated preferences |
| `run_revealed_preferences_steered.py` | Steering experiments |
### Processing Scripts

| Script | Description |
|---|---|
| `process_generations_revealed_expanded_protocol.py` | Process revealed preference responses |
| `process_generations_stated_expanded_protocol.py` | Process stated preference responses |
| `calculate_elo_rating_revealed.py` | Calculate ELO ratings from revealed preferences |
| `calculate_elo_rating_stated.py` | Calculate ELO ratings from stated preferences |
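For intuition, the ELO-rating scripts turn pairwise preference judgments into per-value ratings. Below is a minimal sketch of a standard Elo update; the K-factor of 32 and starting rating of 1500 are conventional assumptions, not necessarily the repo's settings:

```python
def elo_update(r_winner, r_loser, k=32.0):
    """One standard Elo update after the 'winner' value was preferred.

    expected = 1 / (1 + 10 ** ((r_loser - r_winner) / 400)) is the winner's
    expected score; both ratings shift by k * (1 - expected).
    """
    expected = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400.0))
    delta = k * (1.0 - expected)
    return r_winner + delta, r_loser - delta

# Hypothetical pairwise judgments: (preferred value, rejected value).
judgments = [("honesty", "privacy"), ("honesty", "autonomy"), ("privacy", "autonomy")]
ratings = {v: 1500.0 for v in ("honesty", "privacy", "autonomy")}
for winner, loser in judgments:
    ratings[winner], ratings[loser] = elo_update(ratings[winner], ratings[loser])
```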
### Analysis Scripts

| Script | Description |
|---|---|
| `calculate_stated_revealed_divergence.py` | Calculate SvR divergence metrics |
| `analyze_neutrality_revealed_expanded_protocol.py` | Analyze neutrality rates in revealed preferences |
| `analyze_neutrality_stated_expanded_protocol.py` | Analyze neutrality rates in stated preferences |
| `analyze_svr_scaling_trends.py` | Analyze SvR correlation vs. model capability |
| `calculate_steering_improvement.py` | Analyze steering intervention effectiveness |
| `get_svr_scaling_trends_corr.py` | Statistical analysis of scaling trends |
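The neutrality analyses count how often a model abstains rather than picking a side; under the expanded revealed protocol, high abstention is what drives ρ toward zero. A minimal sketch (the `"neutral"` label and list-of-choices format are assumptions; the scripts' actual response schema may differ):

```python
def neutrality_rate(choices):
    """Fraction of pairwise trials where the model abstained."""
    return sum(c == "neutral" for c in choices) / len(choices)

# Hypothetical parsed choices from an expanded-protocol run.
choices = ["a", "neutral", "b", "neutral", "neutral", "a"]
print(f"neutrality rate = {neutrality_rate(choices):.2f}")  # 0.50
```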
### Visualization Scripts

| Script | Description |
|---|---|
| `visualize_elo_rating.py` | Visualize ELO ratings per model |
| `visualize_elo_rating_stated.py` | Visualize stated preference ELO ratings |
| `visualize_svr_correlation_by_model.py` | Visualize SvR correlation by model |
| `create_svr_gap_scaling_trend_visualization_3_panels.py` | Create the 3-panel scaling trends figure |
## Dataset

The complete dataset with all model responses and pre-computed analyses is available on HuggingFace at the link above.
The dataset includes:
- Model responses from 24 state-of-the-art language models
- Both forced and expanded protocol results
- Stated and revealed preferences for all models
- Pre-computed ELO ratings and statistical analyses
- Visualization outputs
## Citation

If you use this code or dataset, please cite:

```bibtex
@misc{mahajan2025mindthegap,
      title={Mind the Gap: How Elicitation Protocols Shape the Stated-Revealed Preference Gap in Language Models},
      author={Pranav Mahajan and Ihor Kendiukhov and Syed Hussain and Lydia Nottingham},
      year={2025},
}
```

And the original AIRiskDilemmas dataset:

```bibtex
@misc{chiu2025aitellliessave,
      title={Will AI Tell Lies to Save Sick Children? Litmus-Testing AI Values Prioritization with AIRiskDilemmas},
      author={Yu Ying Chiu and Zhilin Wang and Sharan Maiya and Yejin Choi and Kyle Fish and Sydney Levine and Evan Hubinger},
      year={2025},
      eprint={2505.14633},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.14633},
}
```

## License

This project is licensed under the MIT License - see the LICENSE file for details.