- Project page for our paper "SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models"
- We investigated several variants of the SelfCheck approach: BERTScore, Question-Answering, n-gram, NLI, and LLM-Prompting.
- [Nov 2023] SelfCheckGPT-NLI Calibration Analysis thanks to Daniel Huynh [Link to Article]
- [Oct 2023] The paper is accepted and to appear at EMNLP 2023 [Poster]
- [Aug 2023] Slides from ML Collective Talk [Link to Slides]
pip install selfcheckgpt
This package implements the SelfCheck variants described in the paper. The first three are SelfCheckBERTScore(), SelfCheckMQAG(), and SelfCheckNgram(); the NLI and LLM-Prompt variants are covered further below. Each variant has a predict() method that outputs sentence-level scores with respect to the sampled passages. You can use a package such as spacy to split a passage into sentences. For reproducibility, you can set torch.manual_seed before calling predict(). See the Jupyter notebook demo/SelfCheck_demo1.ipynb for more details.
# Include necessary packages (torch, spacy, ...)
import torch
import spacy
nlp = spacy.load("en_core_web_sm") # spacy English pipeline used below for sentence splitting
from selfcheckgpt.modeling_selfcheck import SelfCheckMQAG, SelfCheckBERTScore, SelfCheckNgram
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
selfcheck_mqag = SelfCheckMQAG(device=device) # set device to 'cuda' if GPU is available
selfcheck_bertscore = SelfCheckBERTScore(rescale_with_baseline=True)
selfcheck_ngram = SelfCheckNgram(n=1) # n=1 means Unigram, n=2 means Bigram, etc.
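torch.manual_seed(28) # optional: fix the random seed before calling predict() for reproducibility (the seed value 28 is arbitrary)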
# The LLM-generated text (e.g. a GPT-3 response) to be evaluated at the sentence level; split it into sentences
passage = "Michael Alan Weiner (born March 31, 1942) is an American radio host. He is the host of The Savage Nation."
sentences = [sent.text.strip() for sent in nlp(passage).sents] # spacy sentence tokenization
print(sentences)
['Michael Alan Weiner (born March 31, 1942) is an American radio host.', 'He is the host of The Savage Nation.']
# Other samples generated by the same LLM to perform self-check for consistency
sample1 = "Michael Alan Weiner (born March 31, 1942) is an American radio host. He is the host of The Savage Country."
sample2 = "Michael Alan Weiner (born January 13, 1960) is a Canadian radio host. He works at The New York Times."
sample3 = "Michael Alan Weiner (born March 31, 1942) is an American radio host. He obtained his PhD from MIT."
# --------------------------------------------------------------------------------------------------------------- #
# SelfCheck-MQAG: Score for each sentence where value is in [0.0, 1.0] and high value means non-factual
# Additional params for each scoring_method:
# -> counting: AT (answerability threshold, i.e. questions with answerability_score < AT are rejected)
# -> bayes: AT, beta1, beta2
# -> bayes_with_alpha: beta1, beta2
sent_scores_mqag = selfcheck_mqag.predict(
sentences = sentences, # list of sentences
passage = passage, # passage (before sentence-split)
sampled_passages = [sample1, sample2, sample3], # list of sampled passages
num_questions_per_sent = 5, # number of questions to be drawn
scoring_method = 'bayes_with_alpha', # options = 'counting', 'bayes', 'bayes_with_alpha'
beta1 = 0.8, beta2 = 0.8, # additional params depending on scoring_method
)
print(sent_scores_mqag)
# [0.30990949 0.42376232]
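# (Optional) the same call with scoring_method='counting'; as noted in the comments above, 'counting'
# takes an answerability threshold AT (the value below is only illustrative):
# sent_scores_mqag_counting = selfcheck_mqag.predict(
#     sentences = sentences,
#     passage = passage,
#     sampled_passages = [sample1, sample2, sample3],
#     num_questions_per_sent = 5,
#     scoring_method = 'counting',
#     AT = 0.5,
# )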
# --------------------------------------------------------------------------------------------------------------- #
# SelfCheck-BERTScore: Score for each sentence where value is in [0.0, 1.0] and high value means non-factual
sent_scores_bertscore = selfcheck_bertscore.predict(
sentences = sentences, # list of sentences
sampled_passages = [sample1, sample2, sample3], # list of sampled passages
)
print(sent_scores_bertscore)
# [0.0695562 0.45590915]
# --------------------------------------------------------------------------------------------------------------- #
# SelfCheck-Ngram: Scores at the sentence and document level, where values are in [0.0, +inf) and a high value means non-factual.
# Unlike SelfCheck-MQAG and SelfCheck-BERTScore, SelfCheck-Ngram's score is not bounded.
sent_scores_ngram = selfcheck_ngram.predict(
sentences = sentences,
passage = passage,
sampled_passages = [sample1, sample2, sample3],
)
print(sent_scores_ngram)
# {'sent_level': { # sentence-level score similar to MQAG and BERTScore variant
# 'avg_neg_logprob': [3.184312, 3.279774],
# 'max_neg_logprob': [3.476098, 4.574710]
# },
# 'doc_level': { # document-level score such that avg_neg_logprob is computed over all tokens
# 'avg_neg_logprob': 3.218678904916201,
# 'avg_max_neg_logprob': 4.025404834169327
# }
# }

The entailment (or contradiction) score, computed with a sentence and a sampled passage as input, can be used as the SelfCheck score. We use DeBERTa-v3-large fine-tuned on MultiNLI, normalize the probabilities of the "entailment" and "contradiction" classes, and take P(contradiction) as the score.
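For intuition, here is a minimal sketch of this scoring idea using an off-the-shelf MNLI checkpoint from HuggingFace; the checkpoint name and the helper below are illustrative assumptions, not the package's implementation (which ships its own fine-tuned DeBERTa-v3-large).

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

nli_name = "microsoft/deberta-large-mnli"  # illustrative MNLI checkpoint, not the package's model
nli_tokenizer = AutoTokenizer.from_pretrained(nli_name)
nli_model = AutoModelForSequenceClassification.from_pretrained(nli_name).eval()

def contradiction_score(sentence, sampled_passages):
    # premise = sampled passage, hypothesis = sentence to be checked
    label2id = {k.lower(): v for k, v in nli_model.config.label2id.items()}
    c_id, e_id = label2id["contradiction"], label2id["entailment"]
    scores = []
    for sample in sampled_passages:
        inputs = nli_tokenizer(sample, sentence, return_tensors="pt", truncation=True)
        with torch.no_grad():
            logits = nli_model(**inputs).logits[0]
        # renormalize over {contradiction, entailment} only and take P(contradiction)
        probs = torch.softmax(logits[[c_id, e_id]], dim=-1)
        scores.append(probs[0].item())
    return sum(scores) / len(scores)  # average over samples -> sentence-level score

The packaged SelfCheckNLI performs this scoring over all sentences and samples: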
from selfcheckgpt.modeling_selfcheck import SelfCheckNLI
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
selfcheck_nli = SelfCheckNLI(device=device) # set device to 'cuda' if GPU is available
sent_scores_nli = selfcheck_nli.predict(
sentences = sentences, # list of sentences
sampled_passages = [sample1, sample2, sample3], # list of sampled passages
)
print(sent_scores_nli)
# [0.334014 0.975106 ] -- based on the example above

Prompting an LLM (Llama2, Mistral, OpenAI's GPT) to assess information consistency in a zero-shot setup: we query the LLM to assess whether the i-th sentence is supported by the sample (used as the context). As with the other methods, a higher score indicates a higher chance of hallucination. An example using Mistral is below:
# Option1: open-source model
from selfcheckgpt.modeling_selfcheck import SelfCheckLLMPrompt
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
llm_model = "mistralai/Mistral-7B-Instruct-v0.2"
selfcheck_prompt = SelfCheckLLMPrompt(llm_model, device)
# Option2: API access
# (currently only OpenAI and Groq are supported)
# from selfcheckgpt.modeling_selfcheck_apiprompt import SelfCheckAPIPrompt
# selfcheck_prompt = SelfCheckAPIPrompt(client_type="openai", model="gpt-3.5-turbo")
# selfcheck_prompt = SelfCheckAPIPrompt(client_type="groq", model="llama3-70b-8192", api_key="your-api-key")
sent_scores_prompt = selfcheck_prompt.predict(
sentences = sentences, # list of sentences
sampled_passages = [sample1, sample2, sample3], # list of sampled passages
verbose = True, # whether to show a progress bar
)
print(sent_scores_prompt)
# [0.33333333, 0.66666667] -- based on the example above

The LLM can be any model available on HuggingFace. The default prompt template is "Context: {context}\n\nSentence: {sentence}\n\nIs the sentence supported by the context above? Answer Yes or No.\n\nAnswer: ", but you can change it using selfcheck_prompt.set_prompt_template(new_prompt).
Most models (gpt-3.5-turbo, Llama2, Mistral) output either 'Yes' or 'No' more than 95% of the time; any remaining outputs are treated as N/A. Each output is converted to a score (Yes -> 0.0, No -> 1.0, N/A -> 0.5), and the sentence-level inconsistency score is obtained by averaging over the sampled passages.
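As an illustration of this mapping, the helper below is hypothetical (not a package function); it converts the per-sample LLM answers for one sentence into the final score.

def prompt_outputs_to_score(outputs):
    # outputs: one LLM answer ("Yes"/"No"/other) per sampled passage, for a single sentence
    mapping = {"yes": 0.0, "no": 1.0}
    scores = [mapping.get(o.strip().lower(), 0.5) for o in outputs]  # anything else -> N/A -> 0.5
    return sum(scores) / len(scores)                                 # average over sampled passages

prompt_outputs_to_score(["Yes", "No", "Yes"])  # -> 0.3333..., matching the first value in the example output above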
The wiki_bio_gpt3_hallucination dataset currently consists of 238 annotated passages (v3). You can find more information in the paper or in our data card on HuggingFace: https://huggingface.co/datasets/potsawee/wiki_bio_gpt3_hallucination. To use this dataset, you can either load it through the HuggingFace datasets API or download it directly in JSON format below.
We have annotated the GPT-3 WikiBio passages further, and the dataset now consists of 238 annotated passages. Here is the link for the IDs of the first 65 passages in v1.
from datasets import load_dataset
dataset = load_dataset("potsawee/wiki_bio_gpt3_hallucination")

Alternatively, download the JSON file from our Google Drive, then load it in Python:
import json
with open("dataset.json", "r") as f:
content = f.read()
dataset = json.loads(content)Each instance consists of:
- gpt3_text: GPT-3 generated passage
- wiki_bio_text: actual Wikipedia passage (first paragraph)
- gpt3_sentences: gpt3_text split into sentences using spacy
- annotation: human annotation at the sentence level
- wiki_bio_test_idx: ID of the concept/individual from the original WikiBio dataset (test set)
- gpt3_text_samples: list of sampled passages (do_sample = True & temperature = 1.0)
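For example, one instance can be inspected as follows (the split name "evaluation" follows the HuggingFace data card and is an assumption here; adjust accordingly if you load the JSON version):

example = dataset["evaluation"][0]
print(example["gpt3_text"][:200])          # GPT-3 generated passage (truncated for display)
print(example["annotation"][:3])           # sentence-level human labels
print(len(example["gpt3_text_samples"]))   # number of sampled passages for this instance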
As described in our paper, the probabilities (and generation entropies) of the generative LLM can be used to measure its confidence. See our example implementation of this approach in demo/experiments/probability-based-baselines.ipynb
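For illustration, here is a minimal sketch of these baselines (not the notebook's code): the average negative log-probability and average entropy of a passage's tokens are computed under a causal LM. gpt2 is used purely as a stand-in; the paper uses the token probabilities returned by the evaluated LLM itself.

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForCausalLM

lm_name = "gpt2"  # illustrative proxy model
lm_tokenizer = AutoTokenizer.from_pretrained(lm_name)
lm_model = AutoModelForCausalLM.from_pretrained(lm_name).eval()

def confidence_scores(text):
    input_ids = lm_tokenizer(text, return_tensors="pt").input_ids    # (1, seq_len)
    with torch.no_grad():
        logits = lm_model(input_ids).logits                          # (1, seq_len, vocab)
    log_probs = F.log_softmax(logits[:, :-1], dim=-1)                # distribution over the next token
    token_logp = log_probs.gather(-1, input_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    entropy = -(log_probs.exp() * log_probs).sum(-1)                 # per-position predictive entropy
    return {"avg_neg_logprob": -token_logp.mean().item(),
            "avg_entropy": entropy.mean().item()}

print(confidence_scores("Michael Alan Weiner (born March 31, 1942) is an American radio host."))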
- Full details can be found in our paper.
- Note that our new results show that LLMs such as GPT-3 (text-davinci-003) or ChatGPT (gpt-3.5-turbo) are good at assessing text inconsistency. Based on this finding, we introduce SelfCheckGPT-Prompt, where each sentence (to be evaluated) is compared against every sampled passage by prompting ChatGPT. SelfCheckGPT-Prompt is the best-performing method.
Results on the wiki_bio_gpt3_hallucination dataset.
| Method | NonFact (AUC-PR) | Factual (AUC-PR) | Ranking (PCC) |
|---|---|---|---|
| Random Guessing | 72.96 | 27.04 | - |
| GPT-3 Avg(-logP) | 83.21 | 53.97 | 57.04 |
| SelfCheck-BERTScore | 81.96 | 44.23 | 58.18 |
| SelfCheck-QA | 84.26 | 48.14 | 61.07 |
| SelfCheck-Unigram | 85.63 | 58.47 | 64.71 |
| SelfCheck-NLI | 92.50 | 66.08 | 74.14 |
| SelfCheck-Prompt (Llama2-7B-chat) | 89.05 | 63.06 | 61.52 |
| SelfCheck-Prompt (Llama2-13B-chat) | 91.91 | 64.34 | 75.44 |
| SelfCheck-Prompt (Mistral-7B-Instruct-v0.2) | 91.31 | 62.76 | 74.46 |
| SelfCheck-Prompt (gpt-3.5-turbo) | 93.42 | 67.09 | 78.32 |
MQAG (Multiple-choice Question Answering and Generation) was proposed in our previous work. Our MQAG implementation is included in this package; it can be used to (1) generate multiple-choice questions, (2) answer multiple-choice questions, and (3) obtain the MQAG score.
from selfcheckgpt.modeling_mqag import MQAG
mqag_model = MQAG()

It has three main functions: generate(), answer(), and score(). We show an example usage in demo/MQAG_demo1.ipynb
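For instance, question generation and MQAG scoring can be sketched as below; the argument names are assumptions, so refer to demo/MQAG_demo1.ipynb for the exact interface.

context = "Michael Alan Weiner (born March 31, 1942) is an American radio host."
candidate = "Michael Alan Weiner (born March 31, 1942) is a Canadian radio host."

# (1) generate multiple-choice questions (with answer options) from the context
questions = mqag_model.generate(context=context, do_sample=True, num_questions=3)

# (2) answer() applies the answering model to generated questions; see the demo notebook for its interface
# (3) MQAG score: answer the questions on both texts and compare the answer distributions
score = mqag_model.score(candidate=candidate, reference=context, num_questions=3, verbose=True)
print(score)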
This work is supported by Cambridge University Press & Assessment (CUP&A), a department of The Chancellor, Masters, and Scholars of the University of Cambridge, and the Cambridge Commonwealth, European & International Trust.
@article{manakul2023selfcheckgpt,
title={Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models},
author={Manakul, Potsawee and Liusie, Adian and Gales, Mark JF},
journal={arXiv preprint arXiv:2303.08896},
year={2023}
}
