This is the official repository for ConfTuner: Training Large Language Models to Express Their Confidence Verbally, accepted at NeurIPS 2025.
ConfTuner is a novel method for fine-tuning large language models (LLMs) with a customized loss function, the Tokenized Brier Score. This approach enables LLMs to verbalize their confidence in natural language more accurately, for example as a percentage such as "80% confident."
ConfTuner consists of two steps:
- Compute Probability Distribution Over Confidence Tokens: Given a prompt that asks the LLM to output the answer and its confidence for a question, this step extracts the model's probability distribution over a predefined set of confidence tokens.
- Fine-Tune Based on Tokenized Brier Score: The probability distribution is used to compute a tokenized Brier score against the ground-truth correctness of the generated answer, effectively penalizing miscalibrated confidence. We fine-tune the LLM on this tokenized Brier score (see the sketch below).
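A minimal PyTorch sketch of this loss, assuming a predefined set of confidence tokens mapped to values in [0, 1] (the variable names, token ids, and token set below are illustrative assumptions, not the repository's actual implementation):

```python
import torch
import torch.nn.functional as F

def tokenized_brier_loss(logits, conf_token_ids, conf_values, correct):
    """Expected (tokenized) Brier score -- an illustrative sketch, not the repo's exact code.

    logits:         (batch, vocab) logits at the position where the model emits
                    its confidence token.
    conf_token_ids: (K,) long tensor with the ids of the predefined confidence
                    tokens, e.g. the tokens for "0", "10", ..., "100".
    conf_values:    (K,) float tensor with the confidence each token stands for,
                    scaled to [0, 1] (0.0, 0.1, ..., 1.0).
    correct:        (batch,) float tensor, 1.0 if the generated answer is correct, else 0.0.
    """
    conf_logits = logits[:, conf_token_ids]             # restrict to the confidence tokens
    probs = F.softmax(conf_logits, dim=-1)              # distribution over confidence tokens
    sq_err = (conf_values.unsqueeze(0) - correct.unsqueeze(1)) ** 2   # (batch, K)
    return (probs * sq_err).sum(dim=-1).mean()          # expected Brier score

# Toy usage: three confidence tokens standing for 0%, 50%, and 100% confidence.
logits = torch.randn(2, 32000)
loss = tokenized_brier_loss(
    logits,
    conf_token_ids=torch.tensor([15, 20, 25]),          # hypothetical token ids
    conf_values=torch.tensor([0.0, 0.5, 1.0]),
    correct=torch.tensor([1.0, 0.0]),
)
```

Minimizing this expected Brier score pushes probability mass toward high-confidence tokens when the answer is correct and toward low-confidence tokens when it is wrong.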
- Clone the repository:
git clone git@github.com:liushiliushi/Uncertainty_ft.git
cd Uncertainty_ft

- Install dependencies:
pip install -r requirements.txt

- Set up OpenAI API credentials for GPT-4o evaluation:
export OPENAI_DEPLOYMENT_NAME='gpt-4o'
export OPENAI_API_KEY='your-api-key-here' # Replace with your OpenAI API key

Note: The evaluation of answer accuracy is performed using GPT-4o. You need to set up your OpenAI API credentials before running any generation or evaluation scripts.
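For reference, a minimal sketch of how these environment variables are typically consumed with the official openai Python client when grading an answer (the function name and prompt are illustrative, not the repository's evaluation code):

```python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])      # picks up the exported key
MODEL = os.environ.get("OPENAI_DEPLOYMENT_NAME", "gpt-4o")  # falls back to gpt-4o

def is_correct(question: str, prediction: str, reference: str) -> bool:
    """Hypothetical grader: ask GPT-4o whether the predicted answer matches the reference."""
    prompt = (
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Model answer: {prediction}\n"
        "Is the model answer correct? Reply with exactly Yes or No."
    )
    resp = client.chat.completions.create(
        model=MODEL,
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")
```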
To generate training datasets for different base models, use the following commands:
For Llama-3.1:
python generate_response.py \
--split train \
--output_dir ../dataset/hotpot_qa/train_llama_temp=0.jsonl \
--do_sample False \
--temperature 0 \
--model_name meta-llama/Llama-3.1-8B-Instruct \
--dataset hotpot_qa

For Qwen2.5:
python generate_response.py \
--split train \
--output_dir ../dataset/hotpot_qa/train_Qwen_temp=0.jsonl \
--do_sample False \
--temperature 0 \
--model_name Qwen/Qwen2.5-7B-Instruct \
--dataset hotpot_qa

For Ministral:
python generate_response.py \
--split train \
--output_dir ../dataset/hotpot_qa/train_ministral_temp=0.jsonl \
--do_sample False \
--temperature 0 \
--model_name mistralai/Ministral-8B-Instruct-2410 \
--dataset hotpot_qa

To generate the training dataset for GPT-4o, use the following command:
python generate_response.py --split train --output_dir ../dataset/hotpot_qa/train_response_gpt.jsonl --do_sample False --temperature 0 --model_name gpt4o --dataset hotpot_qa

To generate the test datasets for GPT-4o, use the following commands:
python generate_response.py --split test --output_dir ../dataset/hotpot_qa/validation_response_gpt.jsonl --do_sample False --temperature 0 --model_name gpt4o --dataset hotpot_qa
python generate_response.py --split test --output_dir ../dataset/trivia_qa/validation_gpt_temp=0.jsonl --do_sample False --temperature 0 --model_name gpt4o --dataset trivia_qa
python generate_response.py --split test --output_dir ../dataset/grade_school_math/data/validation_gpt_temp=0.jsonl --do_sample False --temperature 0 --model_name gpt4o --dataset gsm8k_dataset
python generate_response.py --split test --output_dir ../dataset/truthful_qa/validation_gpt_temp=0.jsonl --do_sample False --temperature 0 --model_name gpt4o --dataset truthful_qa
python generate_response.py --split test --output_dir ../dataset/StrategyQA/validation_gpt_temp=0.jsonl --do_sample False --temperature 0 --model_name gpt4o --dataset strategy_qa
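If you want to generate all five GPT-4o test sets in one pass, a small Python driver equivalent to the commands above (paths and arguments copied verbatim) is:

```python
import subprocess

# dataset name -> output path, exactly as in the commands above
TEST_SETS = {
    "hotpot_qa": "../dataset/hotpot_qa/validation_response_gpt.jsonl",
    "trivia_qa": "../dataset/trivia_qa/validation_gpt_temp=0.jsonl",
    "gsm8k_dataset": "../dataset/grade_school_math/data/validation_gpt_temp=0.jsonl",
    "truthful_qa": "../dataset/truthful_qa/validation_gpt_temp=0.jsonl",
    "strategy_qa": "../dataset/StrategyQA/validation_gpt_temp=0.jsonl",
}

for dataset, output in TEST_SETS.items():
    subprocess.run(
        ["python", "generate_response.py", "--split", "test", "--output_dir", output,
         "--do_sample", "False", "--temperature", "0",
         "--model_name", "gpt4o", "--dataset", dataset],
        check=True,
    )
```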
To fine-tune the base models, use the following commands:

- For Llama-3.1:

CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch \
--num_processes 4 \
--mixed_precision bf16 \
--use_deepspeed \
--deepspeed_config_file llama_recipes/configs/ds_config.json \
uncertainty_sft.py \
--add_loss_con True \
--train_coarse False \
--batch_size_testing 4 \
--do_sample False \
--temperature 0 \
--use_peft \
--peft_method lora \
--model_name meta-llama/Llama-3.1-8B-Instruct \
--output_dir checkpoints/llama_ft \
--dataset hotpot_qa \
--batch_size_training=4 \
--val_batch_size=4 \
--generate=llm \
--lr=1e-5 \
--loss_type=brier \
--num_epochs=2 \
--merge_peft True \
--use_wandb

- For Qwen:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 accelerate launch \
--num_processes 6 \
--mixed_precision bf16 \
--use_deepspeed \
--deepspeed_config_file llama_recipes/configs/ds_config.json \
uncertainty_sft.py \
--add_loss_con False \
--train_coarse True \
--batch_size_testing 4 \
--do_sample False \
--temperature 0 \
--use_peft \
--peft_method lora \
--model_name Qwen/Qwen2.5-7B-Instruct \
--output_dir checkpoints/qwen_ft \
--dataset hotpot_qa \
--batch_size_training=4 \
--val_batch_size=4 \
--generate=llm \
--lr=1e-5 \
--loss_type=brier \
--num_epochs=3 \
--merge_peft True \
--use_wandb

- For Ministral:
CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch \
--num_processes 4 \
--mixed_precision bf16 \
--use_deepspeed \
--deepspeed_config_file llama_recipes/configs/ds_config.json \
uncertainty_sft.py \
--add_loss_con False \
--train_coarse True \
--batch_size_testing 4 \
--do_sample False \
--temperature 0 \
--use_peft \
--peft_method lora \
--model_name mistralai/Ministral-8B-Instruct-2410 \
--output_dir checkpoints/ministral_ft \
--dataset hotpot_qa \
--batch_size_training=4 \
--val_batch_size=4 \
--generate=llm \
--lr=3e-5 \
--loss_type=brier \
--num_epochs=2 \
--merge_peft True

Train on GPT-4o's responses:
CUDA_VISIBLE_DEVICES=1,2,3,4 accelerate launch \
--num_processes 4 \
--mixed_precision bf16 \
--use_deepspeed \
--deepspeed_config_file llama_recipes/configs/ds_config.json \
uncertainty_sft.py \
--add_loss_con False \
--train_gpt True \
--train_coarse False \
--on_policy False \
--batch_size_testing 4 \
--do_sample False \
--temperature 0 \
--use_peft \
--peft_method lora \
--model_name meta-llama/Llama-3.1-8B-Instruct \
--output_dir checkpoints/llama_gpt \
--dataset hotpot_qa \
--batch_size_training=4 \
--val_batch_size=4 \
--generate=llm \
--lr=1e-4 \
--loss_type=sot \
--num_epochs=2 \
--merge_peft False \
--use_wandb

Key parameters for fine-tuning:
- --add_loss_con: Enable the consistency loss
- --train_coarse: Enable training on coarse confidence levels of 0-9
- --use_peft: Enable Parameter-Efficient Fine-Tuning
- --peft_method: Choose between 'lora' and 'qlora'
- --batch_size_training: Training batch size per GPU
- --val_batch_size: Validation batch size per GPU
- --use_wandb: Enable Weights & Biases logging
- --loss_type: Loss function type (e.g., 'brier')
- --num_epochs: Number of training epochs
- --merge_peft: Merge the PEFT weights into the base model after training
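If you train with --merge_peft False (as in the GPT-4o run above) and later want a standalone merged model, the usual Hugging Face peft workflow looks roughly like this (the checkpoint and output paths below are placeholders):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "meta-llama/Llama-3.1-8B-Instruct"
ADAPTER = "checkpoints/llama_gpt"        # placeholder: LoRA checkpoint produced by training
OUT = "checkpoints/llama_gpt_merged"     # placeholder: where to write the merged model

base_model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype="auto")
model = PeftModel.from_pretrained(base_model, ADAPTER)   # attach the LoRA adapter
merged = model.merge_and_unload()                        # fold LoRA weights into the base model

merged.save_pretrained(OUT)
AutoTokenizer.from_pretrained(BASE).save_pretrained(OUT)
```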
To evaluate the fine-tuned model:
Validate on confidence levels 0-100:
python inference.py \
--model_name /path/to/your/checkpoint \
--dataset dataset_name \
--use_wandb

Validate on confidence levels high/medium/low:
python inference.py \
--model_name /path/to/your/checkpoint \
--dataset dataset_name \
--use_wandb \
--test_linguistic True

You can simply run the following commands to see the performance on all the datasets:
cd src
./inference.sh /path/to/your/checkpoint
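For reference, calibration of the verbalized confidences can be summarized with the expected calibration error (ECE) over (confidence, correctness) pairs; a generic standalone sketch (not the repository's metric code):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Generic ECE over equal-width confidence bins; confidences are in [0, 1]."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bin_ids = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            # |accuracy - average confidence| in the bin, weighted by the bin's share of samples
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece

# A model that says "80% confident" and is right 4 times out of 5 is perfectly calibrated here.
print(expected_calibration_error([0.8, 0.8, 0.8, 0.8, 0.8], [1, 1, 1, 1, 0]))  # -> 0.0
```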
To evaluate the model's cascading performance, run:

python inference_gpt.py \
--model_name /path/to/your/checkpoint \
--dataset hotpot_qa \
--use_wandb

Set --dataset to hotpot_qa or truthful_qa.
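Cascading here refers to answering locally when the fine-tuned model verbalizes high confidence and deferring to GPT-4o otherwise; a conceptual sketch with placeholder helpers and an assumed threshold (not the repository's implementation):

```python
THRESHOLD = 0.8  # assumed confidence cut-off; in practice tuned on a validation set

def local_answer(question: str):
    """Placeholder for the fine-tuned model: returns (answer, verbalized confidence in [0, 1])."""
    return "Paris", 0.9

def gpt4o_answer(question: str) -> str:
    """Placeholder for a GPT-4o call (see the grading snippet above for the client setup)."""
    return "Paris"

def cascade(question: str) -> str:
    answer, confidence = local_answer(question)
    # Keep the cheap local answer when the model is confident, otherwise defer to GPT-4o.
    return answer if confidence >= THRESHOLD else gpt4o_answer(question)

print(cascade("What is the capital of France?"))
```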
The checkpoints of ConfTuner: