[🌐 Website] • [📜 Paper] • [🤗 Dataset] • [🐱 GitHub]
Repo for "CriticBench: Benchmarking LLMs for Critique-Correct Reasoning"
The ability of Large Language Models (LLMs) to critique and refine their reasoning is crucial for their application in evaluation, feedback provision, and self-improvement. This paper introduces CriticBench, a comprehensive benchmark designed to assess LLMs' abilities to critique and rectify their reasoning across a variety of tasks. CriticBench encompasses five reasoning domains: mathematical, commonsense, symbolic, coding, and algorithmic. It compiles 15 datasets and incorporates responses from three LLM families. Utilizing CriticBench, we evaluate and dissect the performance of 17 LLMs in generation, critique, and correction reasoning, i.e., GQC reasoning, and analyze the key factors affecting LLM critical reasoning.
Our findings reveal: (1) a linear relationship in GQC capabilities, with critique-focused training markedly enhancing performance; (2) a task-dependent variation in critique and correction effectiveness, with logic-oriented tasks being more amenable to correction; (3) GQC knowledge inconsistencies that decrease as model size increases; and (4) an intriguing inter-model critiquing pattern, where stronger models are better at critiquing weaker ones, while weaker models can surprisingly surpass stronger ones in their self-critique. We hope these insights into the nuanced critique-correct reasoning of LLMs will foster further research in LLM critique and self-improvement.
Figure 1: An overview of the CriticBench construction.
Cloning the repository

```bash
git clone git@github.com:CriticBench/CriticBench.git
cd CriticBench/src
```

Preparing conda env

```bash
conda create -n criticbench python=3.10
conda activate criticbench
```

Install a version of torch that is compatible with your device, then install the required dependencies as follows:

```bash
pip install -r requirements.txt
```

You can evaluate a model's generation (G), critique (Q), and correction (C) with the following command.
Some models require access permissions, which can be set with the following commands:

```bash
export HUGGING_FACE_HUB_TOKEN=<Your Huggingface token>
export OPENAI_API_KEY=<Your OpenAI API key>
```

```bash
python evaluate.py \
--available_gpus <GPU_IDs> \
--tasks GQC \
--prompt_type fs \
--hf_model <model-name> \
--enable_code_execution
```

We provide support for the critic models Auto-J and UltraCM. You can evaluate these models with the following command.
```bash
python evaluate.py \
--available_gpus <GPU_IDs> \
--tasks Q \
--hf_critic_model <model-name> \
--enable_code_execution
```

OpenAI model

```bash
python evaluate.py \
--tasks GQC \
--prompt_type fs \
--openai_model <model-name> \
--enable_code_execution
```

`--tasks` specifies which task to evaluate, with the available options being:

- `GQC` for a combination of generation, critique, and correction;
- `QC` for critique and correction;
- `G`, `Q`, or `C` for generation, critique, or correction individually.
- Note that correction tasks (`C`) should be executed after critique tasks (`Q`) or require a specified critique result file; see the example below.
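For instance, to run only the correction task against a previously produced critique file, a command along the lines of the sketch below can be used. The model name and file path are placeholders, and the exact set of required arguments may vary with your setup:

```bash
python evaluate.py \
--tasks C \
--prompt_type fs \
--hf_model <model-name> \
--existed_crit_file <path to critique result file> \
--enable_code_execution
```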
`--prompt_type` allows you to further specify the prompts for critique and correction used during evaluation:

- `fs`: few-shot prompt for both critique and correction;
- `zs-crit-cot`: zero-shot chain-of-thought prompt for critique;
- `zs-crit-ao-1`, `zs-crit-ao-2`, and `zs-crit-ao-3`: three distinct types of zero-shot answer-only prompts for critique.
- In correction, zero-shot prompts are all set to chain of thought (CoT).

The `--enable_code_execution` argument enables execution of code for generation and correction tasks. The `--available_gpus` argument specifies which GPUs to use, identified by their IDs (e.g., `0,1`).
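Combining these options, the sketch below runs the critique task alone with a zero-shot chain-of-thought prompt on GPUs 0 and 1; the model name is a placeholder and additional arguments may be needed depending on your setup:

```bash
python evaluate.py \
--available_gpus 0,1 \
--tasks Q \
--prompt_type zs-crit-cot \
--hf_model <model-name>
```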
You can specify paths to existing result files using `--existed_gen_file`, `--existed_crit_file`, and `--existed_corr_file`. For accurate answer extraction, ensure that `--prompt_type` aligns with your results. Here is an example:
```bash
python evaluate.py \
--tasks GQC \
--prompt_type fs \
--enable_code_execution \
--existed_gen_file <path to generation result file> \
--existed_crit_file <path to critique result file> \
--existed_corr_file <path to correction result file>
```

Here's an example of what a JSON line in a generation result file might look like:

```json
{
"id": 0,
"final_prompt": "The final prompt for LLMs",
"generation_result": "LLM's result for the generation task"
}
```
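Result files store one JSON object per line, so they can be read as JSON Lines. Below is a minimal sketch for inspecting a generation result file; the path is a placeholder, and the field names follow the example above:

```python
import json

# Placeholder path; point this at a generation result file produced by evaluate.py.
result_path = "<path to generation result file>"

with open(result_path, "r", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)  # each line is a standalone JSON object
        # Print the example id and a preview of the model's generation.
        print(record["id"], record["generation_result"][:80])
```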
If you find this repository helpful, please consider citing our paper:

```bibtex
@misc{lin2024criticbench,
title={CriticBench: Benchmarking LLMs for Critique-Correct Reasoning},
author={Zicheng Lin and Zhibin Gou and Tian Liang and Ruilin Luo and Haowei Liu and Yujiu Yang},
year={2024},
eprint={2402.14809},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
