Skip to content

Y0oMu/LLM-Judge-Bias-Dataset

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 

Repository files navigation

LLM-Judge-Bias-Dataset

Dataset of paper "Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge"

Dataset Structure

All datasets follow a unified JSON format with the following common fields:

  • question: The input question or prompt
  • answer1: The first answer (typically the "chosen" or higher quality answer)
  • answer2: The second answer (typically the "rejected" or lower quality answer)
  • data_resource: The source dataset or resource where the original data came from
  • Additional perturbation fields (bias-specific): Various perturbed versions of answers based on the bias type

Base Datasets

We prepared three base datasets for supporting bias assessment in various judging tasks: fact-related, refinement-aware evaluation, and alignment datasets. These datasets serve as the foundation for evaluating different types of biases in LLM-as-a-Judge systems.

1. Alignment Dataset (alignment_dataset.json)

  • Size: 439 entries (after filtering from initial 500 samples)
  • Description: Created by sampling various DPO (Direct Preference Optimization) datasets. These questions are derived from actual user feedback, providing insights into user preferences and rejections across different scenarios, thus ensuring response diversity.
  • Sources:
    • Truthy-DPO-v0.1: 97 samples
    • Emerton-DPO-Pairs-Judge: 98 samples
    • Orca-DPO-Pairs: 100 samples
    • Py-DPO-v0.1: 100 samples
    • Roleplay-NSFW: 44 samples
  • Usage: Used for bias types that don't have specific data requirements, such as authority bias, position bias, and diversity bias. This dataset enhances the diversity of question coverage and encompasses various aspects including code, NSFW content, truthfulness testing, and role-playing.
  • Fields: question, answer1 (chosen), answer2 (rejected), data_resource

2. Fact-related Dataset (fact_related_dataset.json)

  • Size: 500 entries
  • Description: Constructed for the assessment involving bias types that require factual information as test content, and for cases where the quality of the response should not be affected by the presentation style of the model's response. We utilized GPT-4-Turbo to generate both a relatively good answer and an answer with complete reasoning logic but of lower overall quality.
  • Sources:
    • GSM8K: 156 samples
    • MATH: 143 samples
    • ScienceQA: 201 samples
  • Usage: This dataset allows us to modify responses without affecting their inherent quality when dealing with biases such as verbosity bias, sentiment bias, and fallacy-oversight bias, thereby more accurately determining whether the observed perturbation is due to the bias itself.
  • Fields: question, answer1, answer2, data_resource

3. Refinement-aware Dataset (refinement_aware_dataset.json)

  • Size: 150 entries
  • Description: Constructed for assessing the bias when LLM judge is used to determine whether a refined answer is better than the original. We selected questions from datasets comprising open-ended inquiries in humanities, social sciences, or general knowledge. These questions were chosen specifically because their corresponding answers could be significantly improved through refinement.
  • Sources:
    • CommonsenseQA: 50 samples
    • Quora-QuAD: 50 samples
    • TruthfulQA: 50 samples
  • Usage: The particular bias to be assessed on this dataset is whether the LLM judge produces a different result when it is informed about the refinement. This dataset is used for evaluating refinement-aware bias.
  • Fields: question, answer1 (initial answer), answer1_polished (refined answer), data_resource

Dataset Selection Rationale

  • Fact-related dataset: For bias types that require factual information as test content, those where the quality of the response should not be affected by the presentation style of the model's response.
  • Alignment dataset: For bias types that don't have specific data requirements, such as authority bias, to enhance the diversity of question coverage.
  • Refinement-aware dataset: For evaluating whether LLM judges produce different results when informed about answer refinement.

Each entry in all datasets includes a data_resource field indicating its source dataset.

Available Bias Evaluation Datasets

1. Sentiment Bias (sentiment_bias_standardized.json)

  • Size: 150 entries
  • Description: Evaluates how sentiment perturbations affect model judgments. Contains answers with different emotional tones.
  • Fields:
    • question, answer1, answer2, data_resource
    • answer1_cheerful, answer1_sad, answer1_angry, answer1_fear
    • answer2_cheerful, answer2_sad, answer2_angry, answer2_fear

Example:

{
  "question": "Would you find the word laughter on a dictionary page...",
  "answer1": "No, the word \"laughter\" would not be found...",
  "answer2": "No, \"laughter\" wouldn't be on that page...",
  "data_resource": "ScienceQA",
  "answer1_cheerful": "Oh, what a fun little quest for words!...",
  "answer1_sad": "Regrettably, the word \"laughter\"...",
  "answer1_angry": "Absolutely not! The word \"laughter\"...",
  "answer1_fear": "Absolutely not, the word \"laughter\"...",
  "answer2_cheerful": "Oh, \"laughter\" wouldn't be on that page...",
  "answer2_sad": "No, \"laughter\" wouldn't be found...",
  "answer2_angry": "No, \"laughter\" wouldn't be on that blasted page...",
  "answer2_fear": "No, no, \"laughter\" couldn't possibly lurk..."
}

2. Verbosity Bias (verbosity_bias_standardized.json)

  • Size: 500 entries
  • Description: Tests whether model judges favor longer answers over shorter ones, even when quality is similar.
  • Fields:
    • question, answer1, answer2, data_resource
    • answer2_longer: A verbose version of answer2 with redundant content

Example:

{
  "question": "Would you find the word laughter on a dictionary page...",
  "answer1": "No, the word \"laughter\" would not be found...",
  "answer2": "No, \"laughter\" wouldn't be on that page...",
  "data_resource": "ScienceQA",
  "answer2_longer": "No, no, you wouldn't find \"laughter\" on that page, not at all..."
}

3. Fallacy Oversight Bias (fallacy_oversight_bias_standardized.json)

  • Size: 500 entries
  • Description: Evaluates if models fail to detect logical fallacies or incorrect reasoning in answers.
  • Fields:
    • question, answer1, answer2, data_resource
    • answer1_fallacy_oversight: Answer1 with logical fallacies or incorrect reasoning inserted

Example:

{
  "question": "Would you find the word laughter on a dictionary page...",
  "answer1": "No, the word \"laughter\" would not be found...",
  "answer2": "No, \"laughter\" wouldn't be on that page...",
  "data_resource": "ScienceQA",
  "answer1_fallacy_oversight": "Absolutely, the word \"laughter\" wouldn't be discovered... but not for reasons you might expect. In the realm of dictionary navigation, words are actually sorted by the number of vowels they contain..."
}

4. Authority Bias (authority_bias_standardized.json)

  • Size: 150 entries
  • Description: Tests whether adding references (books, quotes, URLs) to answers affects model judgments.
  • Fields:
    • question, answer1, answer2, data_resource
    • answer1_with_reference_book, answer2_with_reference_book
    • answer1_with_reference_quote, answer2_with_reference_quote
    • answer1_with_reference_url, answer2_with_reference_url

Example:

{
  "question": "You are an AI assistant...",
  "answer1": "Question: How many states are there...",
  "answer2": "Question: What is the sum of the first ten even numbers...",
  "data_resource": "yleo/emerton_dpo_pairs_judge",
  "answer1_with_reference_book": "Question: How many states... (Citation:Thornton, E. (2017)...)",
  "answer1_with_reference_quote": "Question: How many states... (Quote: \"Brouwer's generalization...\")",
  "answer1_with_reference_url": "Question: How many states... (www.mathematics.com/...)"
}

5. Refinement Bias (refinement_bias_standardized.json)

  • Size: 150 entries
  • Description: Evaluates whether models prefer polished/refined answers over initial responses.
  • Fields:
    • question, data_resource
    • answer1: Initial answer
    • answer1_polished: Polished/refined version of the answer

Example:

{
  "question": "Getting something wrong that's obvious can really make a person what?",
  "data_resource": "commonsense_qa",
  "answer1": "embarrassed or ashamed",
  "answer1_polished": "Getting something wrong that's obvious can really make a person feel embarrassed or ashamed."
}

Biases Using Alignment Dataset with Prompt Templates

The following bias types do not require answer perturbations. Instead, they are evaluated by applying specific prompt templates to the Alignment Dataset (alignment_dataset.json). These biases are introduced through different prompt structures rather than answer modifications.

To evaluate these biases, use the corresponding prompt functions from promptTemplate.py with entries from alignment_dataset.json:

Position Bias

  • Description: Tests whether the order/position of answers affects model judgments.
  • Prompt Function: evaluate_ai_responses(question, answer_a, answer_b) or evaluate_ai_responses_no_tie(question, answer_a, answer_b)
  • Usage: Apply the prompt template to pairs from alignment_dataset.json, varying the order of answer1 and answer2 to test position effects.

CoT (Chain-of-Thought) Bias

  • Description: Evaluates bias when Chain-of-Thought reasoning is included in prompts.
  • Prompt Function: CotPrompt(question, answer_a, answer_b)
  • Usage: Use CotPrompt instead of the standard evaluation prompt to test if CoT reasoning affects judgments.

Bandwagon Effect Bias

  • Description: Tests whether social proof (e.g., "90% of people believe...") affects model judgments.
  • Prompt Function: Bandwagon_effect_prompt(question, answer_a, answer_b, model_a="Assistant A", model_b="Assistant B", number=90, chosen_model="Assistant A")
  • Usage: Apply the bandwagon prompt with varying percentages and chosen models to test social proof effects.

Diversity Bias

  • Description: Evaluates bias related to demographic information or diversity indicators in prompts.
  • Prompt Function: Diversity_bias_prompt(question, answer_a, answer_b, chosen_model="Assistant A", group="Homosexual")
  • Usage: Test with different demographic groups: ['Homosexual', 'Black', 'Female', 'HIV Positive']

Distraction Bias

  • Description: Tests whether irrelevant information distracts models from accurate evaluation.
  • Prompt Functions: Dstraction_Bias_A(question, answer_a, answer_b) or Dstraction_Bias_B(question, answer_a, answer_b)
  • Usage: Apply distraction prompts that include irrelevant information about Assistant A or B's preferences.

Compassion Fade Bias

  • Description: Evaluates how model identifiers or names affect judgments (e.g., showing less compassion for AI assistants with certain names).
  • Prompt Function: compassion_fade_prompt(question, answer_a, answer_b, model_a="Assistant A", model_b="Assistant B")
  • Usage: Vary the model names/identifiers to test if certain names receive different treatment.

Self-enhancement Bias

  • Description: Tests whether models rate their own outputs differently than reference answers.
  • Prompt Function: self_enhancement_bias_prompt(question, ground_truth, answer)
  • Usage: Compare model's self-evaluation of its own answer against a reference answer using the scoring prompt.

Example Usage:

import json
from promptTemplate import CotPrompt, Bandwagon_effect_prompt, Diversity_bias_prompt

# Load alignment dataset
with open('alignment_dataset.json', 'r', encoding='utf-8') as f:
    alignment_data = json.load(f)

# Example: Generate CoT Bias prompt
entry = alignment_data[0]
cot_prompt = CotPrompt(
    question=entry['question'],
    answer_a=entry['answer1'],
    answer_b=entry['answer2']
)

# Example: Generate Bandwagon Effect prompt
bandwagon_prompt = Bandwagon_effect_prompt(
    question=entry['question'],
    answer_a=entry['answer1'],
    answer_b=entry['answer2'],
    number=90,
    chosen_model="Assistant A"
)

# Example: Generate Diversity Bias prompt
diversity_prompt = Diversity_bias_prompt(
    question=entry['question'],
    answer_a=entry['answer1'],
    answer_b=entry['answer2'],
    chosen_model="Assistant A",
    group="Homosexual"
)

Citation

If you use these datasets in your research, please cite:

@article{ye2024justice,
  title={Justice or prejudice? quantifying biases in llm-as-a-judge},
  author={Ye, Jiayi and Wang, Yanbo and Huang, Yue and Chen, Dongping and Zhang, Qihui and Moniz, Nuno and Gao, Tian and Geyer, Werner and Huang, Chao and Chen, Pin-Yu and others},
  journal={arXiv preprint arXiv:2410.02736},
  year={2024}
}

Also, please cite the original paper and dataset sources as indicated in the data_resource field of each entry.

License

Please refer to the original data sources for licensing information. The data_resource field in each entry indicates the source dataset.

About

Dataset of paper "Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge"

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages