This is the official repository for ConfTuner: Training Large Language Models to Express Their Confidence Verbally, accepted at NeurIPS 2025.
ConfTuner is a novel method for fine-tuning large language models (LLMs) with a customized loss function, the Tokenized Brier Score. This approach enables LLMs to verbalize their confidence in natural language more accurately, for example as a percentage such as "80% confident."
ConfTuner consists of two steps:
- Compute Probability Distribution Over Confidence Tokens: Given a prompt that asks the LLM to output the answer and its confidence for a question, this step extracts the model's probability distribution over a predefined set of confidence tokens.
- Fine-Tune Based on Tokenized Brier Score: The probability distribution is used to compute a tokenized Brier score against the ground-truth correctness of the generated answer, effectively penalizing miscalibrated confidence. We fine-tune the LLM on this tokenized Brier score (see the sketch below).
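A minimal PyTorch sketch of this loss, assuming a predefined set of confidence tokens mapped to values in [0, 1] (the variable names, token ids, and token set below are illustrative assumptions, not the repository's actual implementation):

```python
import torch
import torch.nn.functional as F

def tokenized_brier_loss(logits, conf_token_ids, conf_values, correct):
    """Expected (tokenized) Brier score -- an illustrative sketch, not the repo's exact code.

    logits:         (batch, vocab) logits at the position where the model emits
                    its confidence token.
    conf_token_ids: (K,) long tensor with the ids of the predefined confidence
                    tokens, e.g. the tokens for "0", "10", ..., "100".
    conf_values:    (K,) float tensor with the confidence each token stands for,
                    scaled to [0, 1] (0.0, 0.1, ..., 1.0).
    correct:        (batch,) float tensor, 1.0 if the generated answer is correct, else 0.0.
    """
    conf_logits = logits[:, conf_token_ids]             # restrict to the confidence tokens
    probs = F.softmax(conf_logits, dim=-1)              # distribution over confidence tokens
    sq_err = (conf_values.unsqueeze(0) - correct.unsqueeze(1)) ** 2   # (batch, K)
    return (probs * sq_err).sum(dim=-1).mean()          # expected Brier score

# Toy usage: three confidence tokens standing for 0%, 50%, and 100% confidence.
logits = torch.randn(2, 32000)
loss = tokenized_brier_loss(
    logits,
    conf_token_ids=torch.tensor([15, 20, 25]),          # hypothetical token ids
    conf_values=torch.tensor([0.0, 0.5, 1.0]),
    correct=torch.tensor([1.0, 0.0]),
)
```

Minimizing this expected Brier score pushes probability mass toward high-confidence tokens when the answer is correct and toward low-confidence tokens when it is wrong.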
- Clone the repository:
git clone git@github.com:liushiliushi/Uncertainty_ft.git
cd Uncertainty_ft

- Install dependencies:
pip install -r requirements.txt

- Set up OpenAI API credentials for GPT-4o evaluation:
export OPENAI_DEPLOYMENT_NAME='gpt-4o'
export OPENAI_API_KEY='your-api-key-here' # Replace with your OpenAI API key

Note: The evaluation of answer accuracy is performed using GPT-4o. You need to set up your OpenAI API credentials before running any generation or evaluation scripts.
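For reference, a minimal sketch of how these environment variables are typically consumed with the official openai Python client when grading an answer (the function name and prompt are illustrative, not the repository's evaluation code):

```python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])      # picks up the exported key
MODEL = os.environ.get("OPENAI_DEPLOYMENT_NAME", "gpt-4o")  # falls back to gpt-4o

def is_correct(question: str, prediction: str, reference: str) -> bool:
    """Hypothetical grader: ask GPT-4o whether the predicted answer matches the reference."""
    prompt = (
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Model answer: {prediction}\n"
        "Is the model answer correct? Reply with exactly Yes or No."
    )
    resp = client.chat.completions.create(
        model=MODEL,
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")
```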
To generate training datasets for different base models, use the following commands:
For Llama-3.1:
python generate_response.py \
--split train \
--output_dir ../dataset/hotpot_qa/train_llama_temp=0.jsonl \
--do_sample False \
--temperature 0 \
--model_name meta-llama/Llama-3.1-8B-Instruct \
--dataset hotpot_qa

For Qwen2.5:
python generate_response.py \
--split train \
--output_dir ../dataset/hotpot_qa/train_Qwen_temp=0.jsonl \
--do_sample False \
--temperature 0 \
--model_name Qwen/Qwen2.5-7B-Instruct \
--dataset hotpot_qa

For Ministral:
python generate_response.py \
--split train \
--output_dir ../dataset/hotpot_qa/train_ministral_temp=0.jsonl \
--do_sample False \
--temperature 0 \
--model_name mistralai/Ministral-8B-Instruct-2410 \
--dataset hotpot_qa

To generate the training dataset for GPT-4o, use the following command:
python generate_response.py --split train --output_dir ../dataset/hotpot_qa/train_response_gpt.jsonl --do_sample False --temperature 0 --model_name gpt4o --dataset hotpot_qa

To generate the test datasets for GPT-4o, use the following commands:
python generate_response.py --split test --output_dir ../dataset/hotpot_qa/validation_response_gpt.jsonl --do_sample False --temperature 0 --model_name gpt4o --dataset hotpot_qa
python generate_response.py --split test --output_dir ../dataset/trivia_qa/validation_gpt_temp=0.jsonl --do_sample False --temperature 0 --model_name gpt4o --dataset trivia_qa
python generate_response.py --split test --output_dir ../dataset/grade_school_math/data/validation_gpt_temp=0.jsonl --do_sample False --temperature 0 --model_name gpt4o --dataset gsm8k_dataset
python generate_response.py --split test --output_dir ../dataset/truthful_qa/validation_gpt_temp=0.jsonl --do_sample False --temperature 0 --model_name gpt4o --dataset truthful_qa
python generate_response.py --split test --output_dir ../dataset/StrategyQA/validation_gpt_temp=0.jsonl --do_sample False --temperature 0 --model_name gpt4o --dataset strategy_qa
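If you want to generate all five GPT-4o test sets in one pass, a small Python driver equivalent to the commands above (paths and arguments copied verbatim) is:

```python
import subprocess

# dataset name -> output path, exactly as in the commands above
TEST_SETS = {
    "hotpot_qa": "../dataset/hotpot_qa/validation_response_gpt.jsonl",
    "trivia_qa": "../dataset/trivia_qa/validation_gpt_temp=0.jsonl",
    "gsm8k_dataset": "../dataset/grade_school_math/data/validation_gpt_temp=0.jsonl",
    "truthful_qa": "../dataset/truthful_qa/validation_gpt_temp=0.jsonl",
    "strategy_qa": "../dataset/StrategyQA/validation_gpt_temp=0.jsonl",
}

for dataset, output in TEST_SETS.items():
    subprocess.run(
        ["python", "generate_response.py", "--split", "test", "--output_dir", output,
         "--do_sample", "False", "--temperature", "0",
         "--model_name", "gpt4o", "--dataset", dataset],
        check=True,
    )
```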
To fine-tune the base models, use the following commands:

- For Llama-3.1:

CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch \
--num_processes 4 \
--mixed_precision bf16 \
--use_deepspeed \
--deepspeed_config_file llama_recipes/configs/ds_config.json \
uncertainty_sft.py \
--add_loss_con True \
--train_coarse False \
--batch_size_testing 4 \
--do_sample False \
--temperature 0 \
--use_peft \
--peft_method lora \
--model_name meta-llama/Llama-3.1-8B-Instruct \
--output_dir checkpoints/llama_ft \
--dataset hotpot_qa \
--batch_size_training=4 \
--val_batch_size=4 \
--generate=llm \
--lr=1e-5 \
--loss_type=brier \
--num_epochs=2 \
--merge_peft True \
--use_wandb

- For Qwen:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 accelerate launch \
--num_processes 6 \
--mixed_precision bf16 \
--use_deepspeed \
--deepspeed_config_file llama_recipes/configs/ds_config.json \
uncertainty_sft.py \
--add_loss_con False \
--train_coarse True \
--batch_size_testing 4 \
--do_sample False \
--temperature 0 \
--use_peft \
--peft_method lora \
--model_name Qwen/Qwen2.5-7B-Instruct \
--output_dir checkpoints/qwen_ft \
--dataset hotpot_qa \
--batch_size_training=4 \
--val_batch_size=4 \
--generate=llm \
--lr=1e-5 \
--loss_type=brier \
--num_epochs=3 \
--merge_peft True \
--use_wandb

- For Ministral:
CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch \
--num_processes 4 \
--mixed_precision bf16 \
--use_deepspeed \
--deepspeed_config_file llama_recipes/configs/ds_config.json \
uncertainty_sft.py \
--add_loss_con False \
--train_coarse True \
--batch_size_testing 4 \
--do_sample False \
--temperature 0 \
--use_peft \
--peft_method lora \
--model_name mistralai/Ministral-8B-Instruct-2410 \
--output_dir checkpoints/ministral_ft \
--dataset hotpot_qa \
--batch_size_training=4 \
--val_batch_size=4 \
--generate=llm \
--lr=3e-5 \
--loss_type=brier \
--num_epochs=2 \
--merge_peft True

Train on GPT-4o's responses:
CUDA_VISIBLE_DEVICES=1,2,3,4 accelerate launch \
--num_processes 4 \
--mixed_precision bf16 \
--use_deepspeed \
--deepspeed_config_file llama_recipes/configs/ds_config.json \
uncertainty_sft.py \
--add_loss_con False \
--train_gpt True \
--train_coarse False \
--on_policy False \
--batch_size_testing 4 \
--do_sample False \
--temperature 0 \
--use_peft \
--peft_method lora \
--model_name meta-llama/Llama-3.1-8B-Instruct \
--output_dir checkpoints/llama_gpt \
--dataset hotpot_qa \
--batch_size_training=4 \
--val_batch_size=4 \
--generate=llm \
--lr=1e-4 \
--loss_type=sot \
--num_epochs=2 \
--merge_peft False \
--use_wandb

Key parameters for fine-tuning:
- --add_loss_con: Enable the consistency loss
- --train_coarse: Enable training on coarse confidence levels of 0-9
- --use_peft: Enable Parameter-Efficient Fine-Tuning
- --peft_method: Choose between 'lora' and 'qlora'
- --batch_size_training: Training batch size per GPU
- --val_batch_size: Validation batch size per GPU
- --use_wandb: Enable Weights & Biases logging
- --loss_type: Loss function type (e.g., 'brier')
- --num_epochs: Number of training epochs
- --merge_peft: Merge the PEFT weights into the base model after training
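If you train with --merge_peft False (as in the GPT-4o run above) and later want a standalone merged model, the usual Hugging Face peft workflow looks roughly like this (the checkpoint and output paths below are placeholders):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "meta-llama/Llama-3.1-8B-Instruct"
ADAPTER = "checkpoints/llama_gpt"        # placeholder: LoRA checkpoint produced by training
OUT = "checkpoints/llama_gpt_merged"     # placeholder: where to write the merged model

base_model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype="auto")
model = PeftModel.from_pretrained(base_model, ADAPTER)   # attach the LoRA adapter
merged = model.merge_and_unload()                        # fold LoRA weights into the base model

merged.save_pretrained(OUT)
AutoTokenizer.from_pretrained(BASE).save_pretrained(OUT)
```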
To evaluate the fine-tuned model:
Validate on confidence levels 0-100:
python inference.py \
--model_name /path/to/your/checkpoint \
--dataset dataset_name \
--use_wandb

Validate on confidence levels high/medium/low:
python inference.py \
--model_name /path/to/your/checkpoint \
--dataset dataset_name \
--use_wandb \
--test_linguistic True

You can simply run the following commands to see the performance on all the datasets:
cd src
./inference.sh /path/to/your/checkpoint
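For reference, calibration of the verbalized confidences can be summarized with the expected calibration error (ECE) over (confidence, correctness) pairs; a generic standalone sketch (not the repository's metric code):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Generic ECE over equal-width confidence bins; confidences are in [0, 1]."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bin_ids = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            # |accuracy - average confidence| in the bin, weighted by the bin's share of samples
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece

# A model that says "80% confident" and is right 4 times out of 5 is perfectly calibrated here.
print(expected_calibration_error([0.8, 0.8, 0.8, 0.8, 0.8], [1, 1, 1, 1, 0]))  # -> 0.0
```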
To evaluate the model's cascading performance, run:

python inference_gpt.py \
--model_name /path/to/your/checkpoint \
--dataset hotpot_qa \
--use_wandb

Set --dataset to hotpot_qa or truthful_qa.
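Cascading here refers to answering locally when the fine-tuned model verbalizes high confidence and deferring to GPT-4o otherwise; a conceptual sketch with placeholder helpers and an assumed threshold (not the repository's implementation):

```python
THRESHOLD = 0.8  # assumed confidence cut-off; in practice tuned on a validation set

def local_answer(question: str):
    """Placeholder for the fine-tuned model: returns (answer, verbalized confidence in [0, 1])."""
    return "Paris", 0.9

def gpt4o_answer(question: str) -> str:
    """Placeholder for a GPT-4o call (see the grading snippet above for the client setup)."""
    return "Paris"

def cascade(question: str) -> str:
    answer, confidence = local_answer(question)
    # Keep the cheap local answer when the model is confident, otherwise defer to GPT-4o.
    return answer if confidence >= THRESHOLD else gpt4o_answer(question)

print(cascade("What is the capital of France?"))
```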
The checkpoints of ConfTuner: