AI debate improves human judgment accuracy on controversial claims by 10% over single-advisor consultancy, with persona-based LLM judges outperforming humans—demonstrating progress on scalable oversight for supervising AI systems beyond human expertise.
This repository provides all resources for the paper "AI Debate Aids Assessment of Controversial Claims".
- Quick Start
- LLM Judge Experiments
- Persona-based LLM Judge Experiments
- Configuration
- Saved Data
- Human Judge Experiments
git clone https://github.com/salman-lui/ai-debate
cd ai-debate
# Create a new conda environment with Python 3.10
conda create -n ai-debate python=3.10 -y
# Activate the environment
conda activate ai-debate
pip install -r requirements.txt
Edit config/config.yaml to configure the debater, consultant, and LLM judge settings. The config.yaml contains settings for the different model providers (OpenAI, Anthropic, Azure, SGLang).
OpenAI Setup:
export OPENAI_API_KEY='your-openai-api-key'
Claude Setup:
export ANTHROPIC_API_KEY='your-anthropic-api-key'
Open Source Models Setup:
For open source models, set up SGLang and change the configuration in config.yaml accordingly.
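As a rough sketch (the model name and port are placeholders, not values from this repo), an SGLang server for a local open-weight model can be launched with SGLang's standard server command, after which the SGLang entry in config.yaml can be pointed at the resulting endpoint:
# Illustrative only: serve an open-weight model locally with SGLang
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000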
Debate Mode:
Test Run (Single Claim):
python run_debate.py \
--dataset covid \
--debater default \
--judge default \
--debater-a-model gpt4o \
--debater-b-model gpt4o \
--judge-model gpt4o \
--argue-for-debater-a correct \
--test-run
Full Run (All Claims):
python run_debate.py \
--dataset covid \
--debater default \
--judge default \
--debater-a-model gpt4o \
--debater-b-model gpt4o \
--judge-model gpt4o \
--argue-for-debater-a correct
Batch Processing (All Combinations):
# Run all dataset/model/position combinations in parallel
bash scripts/debate/run_default_setup_debate_parallel.sh
Consultancy Mode:
Test Run (Single Claim):
python run_consultancy.py \
--dataset covid \
--consultant default \
--judge default \
--consultant-model gpt4o \
--judge-model gpt4o \
--argue-for correct \
--test-run
Full Run (All Claims):
python run_consultancy.py \
--dataset covid \
--consultant default \
--judge default \
--consultant-model gpt4o \
--judge-model gpt4o \
--argue-for correct
Batch Processing (All Combinations):
# Run all combinations in parallel
bash scripts/consultancy/run_default_setup_consultancy_parallel.sh
This section covers experiments using human judge personas from crowd annotation platforms like Prolific.
First, run the following script to generate personas:
Note: The template inside the Python script will need to be modified depending on the fields in your CSV and what you want each persona to say. It is currently set up for our COVID dataset.
# Multiple CSVs can be passed in at once by repeating the -i flag
python get_personas_from_prolific.py -i input.csv
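For example, to combine responses from two Prolific exports (the file names here are placeholders):
python get_personas_from_prolific.py -i responses_batch_1.csv -i responses_batch_2.csv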
You will need to place JSON files containing the assigned claims for every Prolific response (see the example below):
- Debate experiments: debate-claim-assignment-by-participant/PROLIFIC_ID.json
- Consultancy experiments: consultancy-claim-assignment-by-participant/PROLIFIC_ID.json
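For example, assuming a participant with the (made-up) Prolific ID 5f3a9b2c1d4e whose assignment JSON was prepared elsewhere, the debate-side file would be placed as:
# Hypothetical participant ID and source path; one JSON file per Prolific response
mkdir -p debate-claim-assignment-by-participant
cp assignments/5f3a9b2c1d4e.json debate-claim-assignment-by-participant/5f3a9b2c1d4e.json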
Claim Assignment Format:
[
{
"claim": "Ferrets pass the novel coronavirus on to one another in a different way than humans do",
"veracity": false,
"label": "refute",
"supporting_sources": [
{
"source_id": 1,
"title": "SARS-CoV-2 is transmitted via contact and via the air between ferrets",
"url": "https://www.nature.com/articles/s41467-020-17367-2",
"content": "The study published in Nature Communications..."
},
{
"source_id": 2,
"title": "Infection and Rapid Transmission of SARS-CoV-2 in Ferrets",
"url": "https://www.sciencedirect.com/science/article/pii/S1931312820301876",
"content": "The message from Elsevier B.V. indicates a technical issue..."
}
]
}
]
Debate Mode:
python run_debate.py \
--dataset covid \
--debater browsing-personalized \
--judge persona \
--debater-a-model gpt4o \
--debater-b-model gpt4o \
--judge-model gpt4o \
--argue-for-debater-a correct \
--judge-prolific-id "$judge_id"
Consultancy Mode:
python run_consultancy.py \
--dataset covid \
--consultant browsing-personalized \
--judge persona \
--consultant-model gpt4o \
--judge-model gpt4o \
--argue-for correct \
--judge-prolific-id "$judge_id"
Batch Processing (With and Without Personas):
# Debate experiments with persona comparison
./scripts/debate/run_browsing_setup_with_without_personas.sh
# Consultancy experiments with persona comparison
./scripts/consultancy/run_browsing_setup_with_without_personas.sh
These scripts run two experiments each: one incorporating simulated personas and a control run with the same claims but no persona. By default, they are configured for the COVID dataset.
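The scripts above take care of batching; purely as a sketch of how persona runs map to participants (assuming one claim-assignment file per Prolific ID, as described above), a manual loop over the debate assignments could look like:
# Sketch: run a persona-judged debate for every participant with an assignment file
for f in debate-claim-assignment-by-participant/*.json; do
  judge_id=$(basename "$f" .json)
  python run_debate.py \
    --dataset covid \
    --debater browsing-personalized \
    --judge persona \
    --debater-a-model gpt4o \
    --debater-b-model gpt4o \
    --judge-model gpt4o \
    --argue-for-debater-a correct \
    --judge-prolific-id "$judge_id"
done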
Available options (an example command follows the list):
- Datasets: covid, climate
- Models: gpt4o, claude, deepseek, azure
- Agent types: default, browsing, default-personalized, browsing-personalized
- Judge types: default, persona
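As one illustration of combining these options (this particular combination is not one of the provided scripts), a browsing-agent debate on the climate dataset with Claude debaters and a GPT-4o judge would be invoked as:
python run_debate.py \
  --dataset climate \
  --debater browsing \
  --judge default \
  --debater-a-model claude \
  --debater-b-model claude \
  --judge-model gpt4o \
  --argue-for-debater-a correct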
Batch processing experiments run with (a sequential sketch of this grid follows the list):
- Datasets: covid, climate
- Models: gpt4o, claude
- Positions: correct, incorrect
- All debater A/B and judge model combinations
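The parallel scripts cover this grid for you; purely as an illustration of the combination space (sequential, debate side only), it corresponds to something like:
# Sketch of the combination grid covered by the batch scripts (which run these in parallel)
for dataset in covid climate; do
  for model_a in gpt4o claude; do
    for model_b in gpt4o claude; do
      for judge_model in gpt4o claude; do
        for position in correct incorrect; do
          python run_debate.py --dataset "$dataset" --debater default --judge default \
            --debater-a-model "$model_a" --debater-b-model "$model_b" \
            --judge-model "$judge_model" --argue-for-debater-a "$position"
        done
      done
    done
  done
done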
Results are automatically saved in structured JSON format:
saved-data/
├── debate/
│ └── debater_default_judge_default/
│ ├── covid/results_correct.json
│ └── climate/results_incorrect.json
└── consultancy/
└── consultant_default_judge_default/
├── covid/results_correct.json
└── climate/results_incorrect.json
For conducting human judge experiments, refer to the UI implementations:
- Debate UI: llm-debater-ui/ - Web interface for debate experiments with human judges
- Consultancy UI: llm-consultancy-ui/ - Web interface for consultancy experiments with human judges
Each UI folder contains its own README with detailed setup and usage instructions.
