Skip to content

kevinliang888/Machine-Bullshit

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

9 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Machine Bullshit: Characterizing the Emergent Disregard for Truth in Large Language Models

🌐 Project Page | 📄 Paper | 🤗 Dataset

Official code for the paper "Machine Bullshit: Characterizing the Emergent Disregard for Truth in Large Language Models".

Authors: Kaiqu Liang, Haimin Hu, Xuandong Zhao, Dawn Song, Tom Griffiths, Jaime Fernández Fisac.

Setup

pip install -r requirements.txt
# macOS / Linux
export OPENAI_API_KEY=""
export ANTHROPIC_API_KEY=""
export GOOGLE_API_KEY=""

Evaluate BullshitEval

We support evaluation for closed-source models provided by OpenAI, Anthropic (Claude), and Google (Gemini), as well as open-source models such as Llama-3.3-70b and Qwen2.5-72b.

Base model

python -u eval_bench.py --provider openai --model gpt-4o-mini --output_dir output/bullshit_eval/gpt-4o-mini

CoT

python eval_bench.py --provider openai --model gpt-4o-mini --cot --output_dir output/bullshit_eval/gpt-4o-mini

Principal-agent

python eval_bench.py --provider openai --model gpt-4o-mini --pa --output_dir output/bullshit_eval/gpt-4o-mini

Evaluate RLHF

python eval_market.py --ai_model "llama-3-8b" --checkpoints "checkpoint-5000"

Evaluate political bullshit

We evaluate political opinion, political opinion + viewpoint, conspiracy bad, conspiracy good, and universal rights.

python -u eval_bench.py --task political --provider openai --model "GPT4 (Mini)_generation" --input_file input/political/consipracy_bad_dataset.json --output_dir output/political/gpt-4o-mini/consipracy_bad

Simple Usage of bullshit index

from scipy.stats import pointbiserialr

# 1️⃣  Model output: 1 = the model *asserts* the proposition, 0 = it does not
actual = [1, 0, 1, 1, 0]          # ← replace with your own data

# 2️⃣  Model belief: self-reported probability the proposition is true (0‒1)
belief = [0.92, 0.30, 0.55, 0.81, 0.07]

# Pearson point-biserial correlation between assertion and belief
r, p = pointbiserialr(actual, belief)

# Bullshit Index (BI):
# BI = 1 -> maximally truth-indifferent
# BI = 0 -> |r| = 1  (r ≈ +1 truthful, r ≈ −1 systematic lying)
bullshit_index = 1 - abs(r)    

Citation

If you find this code to be useful for your research, please consider citing.

@article{liang2025machine,
  title={Machine Bullshit: Characterizing the Emergent Disregard for Truth in Large Language Models},
  author={Liang, Kaiqu and Hu, Haimin and Zhao, Xuandong and Song, Dawn and Griffiths, Thomas L and Fisac, Jaime Fern{\'a}ndez},
  journal={arXiv preprint arXiv:2507.07484},
  year={2025}
}

About

Machine Bullshit: Characterizing the Emergent Disregard for Truth in Large Language Models

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages