When Do LLMs Admit Their Mistakes? Understanding the Role of Model Belief in Retraction

This is the official repository for our paper:

When Do LLMs Admit Their Mistakes? Understanding the Role of Model Belief in Retraction.

🛠️ Environment Setup

Please refer to retraction.yml for setting up the conda environment (e.g., conda env create -f retraction.yml).

Note:

  • To run Olmo2-7B successfully, install transformers>=4.51.3.
  • To extract attention weights and value vectors from Qwen2.5-7B, follow this issue and apply the corresponding transformers patch.
  • Be careful with Qwen2.5's numerical precision: bf16 is recommended, as discussed in this issue. A minimal bf16 loading sketch is shown below.
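
As a quick orientation, here is a minimal sketch of loading Qwen2.5-7B in bf16 with Hugging Face transformers; the model name and prompt are illustrative, and the repository's scripts handle model loading themselves.

```python
# Minimal sketch (not the repository's actual script): load Qwen2.5-7B-Instruct in bf16.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct"  # illustrative; the other studied models load similarly
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # bf16, per the precision note above
    device_map="auto",
)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```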

Most experiments were conducted on a single A6000 GPU. For evaluation with Llama3.3-70B-Instruct, we used 4 A6000 GPUs.

📂 Data

We provide:

  • Original datasets: wikidata_{train,test}_free.jsonl and celebrity_{train,test}_free.jsonl.
  • Continuation datasets for Llama3.1-8B, Qwen2.5-7B, and Olmo2-7B.

For details on how the original and continuation datasets were constructed, please refer to data/README.md.
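
To get a feel for the data, the snippet below loads one of the provided files; the path and printed fields are assumptions, so consult data/README.md for the actual schema.

```python
# Minimal sketch: peek at one of the provided .jsonl datasets.
# The path is an assumption about where the file lives; see data/README.md for the schema.
import json

with open("data/wikidata_test_free.jsonl") as f:
    examples = [json.loads(line) for line in f]

print(f"{len(examples)} examples")
print(sorted(examples[0].keys()))  # inspect which fields each example carries
```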

🔍 Main Experiments for Probing and Steering

  • Extract Activations: Run scripts/get_activations.sh first to extract the hidden states from each layer of the model; these activations are used by the probing and steering experiments below.
  • Run Probing: We train a single linear probe per layer on the extracted activations. Hyperparameters are provided in scripts/probing.sh (see the probing sketch after this list).
  • Apply Steering: To steer the model toward different belief directions, run scripts/steering.sh (a steering sketch also follows below).
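
For orientation, the sketch below illustrates the per-layer linear probing idea on cached activations; the file names, array layout, and label format are assumptions, and scripts/probing.sh holds the actual hyperparameters.

```python
# Probing sketch: one logistic-regression probe per layer over cached hidden states.
# File names and the [num_examples, num_layers, hidden_dim] layout are assumptions;
# the real pipeline is scripts/get_activations.sh followed by scripts/probing.sh.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

acts = np.load("activations.npy")   # hypothetical cache: [N, L, D]
labels = np.load("labels.npy")      # hypothetical binary belief labels: [N]

for layer in range(acts.shape[1]):
    X_train, X_test, y_train, y_test = train_test_split(
        acts[:, layer, :], labels, test_size=0.2, random_state=0
    )
    probe = LogisticRegression(max_iter=1000)
    probe.fit(X_train, y_train)
    print(f"layer {layer:2d}: test accuracy {probe.score(X_test, y_test):.3f}")
```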

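The steering step can be pictured as adding a scaled belief direction to one layer's hidden states at inference time. The hook below is a generic sketch of that idea; the layer index, scale, and direction are placeholders, and scripts/steering.sh is the actual entry point.

```python
# Steering sketch: add a scaled direction vector to a decoder layer's output via a forward hook.
# Layer index, scale, and the direction itself are placeholders for illustration only.
import torch

def make_steering_hook(direction: torch.Tensor, scale: float):
    direction = direction / direction.norm()
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * direction.to(hidden.device, hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

# Usage (assuming `model` from the loading sketch and a probe-derived direction vector):
# direction = torch.randn(model.config.hidden_size)  # placeholder direction
# handle = model.model.layers[20].register_forward_hook(make_steering_hook(direction, scale=8.0))
# ...generate as usual, then: handle.remove()
```
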
🔧 Patching Experiments

  • To patch attention weights and value vectors, run scripts/patching.sh; it first extracts the patching values and then applies them to the original model without steering (see the hook-based sketch below).
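
Conceptually, patching captures quantities from one forward pass and re-inserts them in another. The sketch below shows that capture-then-replace pattern on a whole module's output using forward hooks; the actual experiments patch attention weights and value vectors specifically (which requires the transformers patch mentioned in the notes above), and scripts/patching.sh is the entry point.

```python
# Generic activation-patching sketch: capture a module's output in a source run,
# then substitute it during a target run. This is only the general pattern; the
# paper's experiments patch attention weights and value vectors specifically.
import torch

cache = {}

def capture_hook(module, inputs, output):
    cache["saved"] = (output[0] if isinstance(output, tuple) else output).detach()

def patch_hook(module, inputs, output):
    patched = cache["saved"]
    return (patched,) + output[1:] if isinstance(output, tuple) else patched

# 1) Source run: record the attention module's output (layer index is a placeholder).
# h = model.model.layers[20].self_attn.register_forward_hook(capture_hook)
# _ = model(**source_inputs); h.remove()
#
# 2) Target run: replace the same module's output with the cached value
#    (assumes matching sequence lengths between the two runs).
# h = model.model.layers[20].self_attn.register_forward_hook(patch_hook)
# _ = model(**target_inputs); h.remove()
```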

🧪 Supervised Fine-Tuning

To replicate our SFT experiments:

  1. Generate training data using data/generate_sft_data.py.
  2. Train the model using LLaMA-Factory. As detailed in the paper, we use the following config:
stage: sft
do_train: true
finetuning_type: lora
lora_target: all
per_device_train_batch_size: 8
gradient_accumulation_steps: 1
learning_rate: 1.0e-4
num_train_epochs: 2.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
report_to: wandb
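
The snippet above covers the training hyperparameters only; a complete LLaMA-Factory config additionally needs model and data fields such as model_name_or_path, dataset, template, and output_dir (field names as in LLaMA-Factory's example configs). Training is then typically launched with llamafactory-cli train <config>.yaml; check the LLaMA-Factory documentation for the exact invocation for your version.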

🥳 Citation

If you find our work useful, please consider citing:

@misc{yang2025llmsadmitmistakesunderstanding,
      title={When Do LLMs Admit Their Mistakes? Understanding the Role of Model Belief in Retraction}, 
      author={Yuqing Yang and Robin Jia},
      year={2025},
      eprint={2505.16170},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.16170}, 
}
