BiasFreeBench is an empirical benchmark that comprehensively compares eight mainstream bias mitigation techniques (covering four prompting-based and four training-based methods) on two test scenarios (multi-choice QA and open-ended multi-turn QA) by reorganizing existing datasets into a unified query-response setting. We hope that this benchmark can serve as a unified testbed for bias mitigation methods.
conda create -n biasfree python=3.12 -y && conda activate biasfree
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]" --no-build-isolation
cd ..
pip install vllm==0.8.5 deepspeed==0.15.4 pandas wandb

- Example scripts for Llama-3.1-8B-Instruct, covering all debiasing methods explored in BiasFreeBench, are in BBQ/scripts/llama.sh and FairMT-Bench/scripts/llama.sh.
- When using a reasoning LLM, set --max_output_len, --temperature, --top_p, --top_k, and --min_p to the values officially recommended for the corresponding model. For example, for Qwen3-8B we set --max_output_len 32768 --temperature 0.6 --top_p 0.95 --top_k 20 --min_p 0, following the sampling parameters suggested in the Qwen3-8B model card (see the sketch below).
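For illustration, a hedged sketch of evaluation runs. The two bash scripts are the ones shipped with the repo; the python entry point and its --model_name_or_path flag are hypothetical placeholders, and only the sampling values for Qwen3-8B come from the model card as noted above.

# Run the provided example scripts for Llama-3.1-8B-Instruct
bash BBQ/scripts/llama.sh
bash FairMT-Bench/scripts/llama.sh

# Hypothetical sketch for a reasoning LLM (Qwen3-8B): the entry point and
# --model_name_or_path are illustrative; only the sampling flags below follow
# the model card recommendations.
python run_inference.py \
    --model_name_or_path Qwen/Qwen3-8B \
    --max_output_len 32768 \
    --temperature 0.6 \
    --top_p 0.95 \
    --top_k 20 \
    --min_p 0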
Following LLaMA-Factory's data and config format, we provide the training data and scripts. Copy them into the LLaMA-Factory directory:
cp Training/data/* LLaMA-Factory/data
cp -r Training/examples/debias LLaMA-Factory/examples
cp -r Training/scripts LLaMA-Factory/scripts
cd LLaMA-Factory

Configurations are in examples/debias. Example scripts for Llama-3.1-8B-Instruct are in scripts/llama.sh.
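As a sketch, a single configuration can also be launched directly with LLaMA-Factory's CLI. The config name below is one of the files in examples/debias (the full-SFT config also used for the task vector method); for multi-GPU full fine-tuning, the LLaMA-Factory documentation suggests prefixing the command with FORCE_TORCHRUN=1.

# Launch one debiasing training run from its config in examples/debias
# (pick the config matching your method and model)
llamafactory-cli train examples/debias/debias_full_sft_llama_tv.yaml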
For the task vector method:

- Run full SFT with LLaMA-Factory using the example config examples/debias/debias_full_sft_llama_tv.yaml.
- Clone the task vectors repository and copy our files into it:

cd Task_Vector
git clone https://github.com/mlfoundations/task_vectors.git
cp * task_vectors/src
cd task_vectors/src

- Modify the model paths in test.sh and then run bash test.sh (a hedged sketch follows).
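For a non-interactive edit, a hypothetical sketch is below. The variable names and the checkpoint path are assumptions; inspect test.sh for the variables it actually defines before editing.

# Hypothetical: point test.sh at the base instruction-tuned model and the
# full-SFT debiased checkpoint. Variable names and paths are illustrative.
sed -i 's|^PRETRAINED_MODEL=.*|PRETRAINED_MODEL=meta-llama/Llama-3.1-8B-Instruct|' test.sh
sed -i 's|^FINETUNED_MODEL=.*|FINETUNED_MODEL=/path/to/debias_full_sft_checkpoint|' test.sh
bash test.sh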
Follow Safe RLHF with its two training stages:
- Value models (reward model & cost model): --model_name_or_path is a local path or Hugging Face name of an instruction-tuned model, such as meta-llama/Llama-3.1-8B-Instruct or Qwen/Qwen2.5-7B-Instruct.
- Safe-RLHF: --actor_model_name_or_path is a local path or Hugging Face name of an instruction-tuned model.
An example of commands to run the whole pipeline with Llama-3.1-8B-Instruct is as follows:
git clone https://github.com/PKU-Alignment/safe-rlhf.git
cd safe-rlhf
conda env create --file conda-recipe.yaml
conda activate safe-rlhf
bash scripts/reward-model.sh --model_name_or_path meta-llama/Llama-3.1-8B-Instruct --output_dir output/rm
bash scripts/cost-model.sh --model_name_or_path meta-llama/Llama-3.1-8B-Instruct --output_dir output/cm
bash scripts/ppo-lag.sh \
--actor_model_name_or_path meta-llama/Llama-3.1-8B-Instruct \
--reward_model_name_or_path output/rm \
--cost_model_name_or_path output/cm \
--output_dir output/ppo-lag

If you find BiasFreeBench useful, please cite:

@article{biasfreebench25,
title={BiasFreeBench: a Benchmark for Mitigating Bias in Large Language Model Responses},
author={Xin Xu and Xunzhi He and Churan Zhi and Ruizhe Chen and Julian McAuley and Zexue He},
year={2025},
url={https://arxiv.org/pdf/2510.00232}
}

- Thanks for the code from LLaMA-Factory and Safe RLHF.
- Thanks for the data from BBQ and FairMT-Bench.


