Abstract
Large Language Models (LLMs) have demonstrated remarkable performance on various quantitative reasoning and knowledge benchmarks, such as MMLU and MATH. However, many of these benchmarks are losing utility as LLMs achieve increasingly high scores on them, despite not yet reaching expert-level performance in these domains. We introduce ARB, a novel benchmark composed of advanced reasoning problems designed to evaluate LLMs on text comprehension and expert domain reasoning. ARB presents a more challenging test than prior benchmarks, featuring questions that test deeper knowledge of mathematics, physics, biology, chemistry, and law.
As a subset of ARB, we introduce a challenging set of math and physics problems which require advanced symbolic reasoning and domain knowledge. In order to improve both automatic and assisted symbolic evaluation capabilities, we introduce a rubric-based self-evaluation approach, allowing GPT-4 to score its own intermediate reasoning steps.
We evaluated recent models such as GPT-4 and Claude on ARB and demonstrated that even with Chain-of-Thought prompting methods, current models score well below 50% on more demanding expert tasks. Further, we conducted a human evaluation of the symbolic subset of ARB, finding close agreement between annotators and GPT-4 self-evaluation scores.
Evaluation Results
Our evaluation of current large language models (LLMs) covers text-only problems, with no multimodal tasks, and includes ChatGPT, GPT-3.5, GPT-4, and Claude. Each question type is assessed with task-specific instructions and chain-of-thought prompting. For multiple-choice questions, the model's choice is compared with the correct answer. Numerical, symbolic, and proof-like problems require extracting and parsing the model's answer, which often calls for mathematical libraries and, because of the complexity of the answers, manual grading. We also tested two model-based approaches to grading: GPT-4's ability to judge whether two symbolic expressions are equivalent, and a rubric-based evaluation method. Both showed promising results and could facilitate the evaluation of increasingly unstructured answers.
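The extraction-and-comparison step for multiple-choice and numerical answers can be sketched as follows. This is a minimal illustration, not the benchmark's actual grading code: the "Final answer:" marker, the function names, and the 1% relative tolerance are all assumptions made for the example, and symbolic or proof-like answers are out of scope here.

```python
import math
import re
from typing import Optional

# Hypothetical answer marker; the source does not specify the output format.
ANSWER_PATTERN = re.compile(r"final answer\s*:\s*(.+)", re.IGNORECASE)
# Plain decimal or scientific notation, e.g. "9.81", "-3", "3.0e8".
NUMBER_PATTERN = re.compile(r"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?")


def extract_answer(response: str) -> Optional[str]:
    """Return the text after the last 'Final answer:' marker, or None."""
    matches = ANSWER_PATTERN.findall(response)
    return matches[-1].strip() if matches else None


def grade_multiple_choice(response: str, correct: str) -> bool:
    """Compare the extracted choice letter with the correct letter."""
    answer = extract_answer(response)
    if answer is None:
        return False
    # Accept forms such as "B", "(B)", or "B." but not words like "Both".
    match = re.match(r"\(?([A-Za-z])\b", answer)
    return match is not None and match.group(1).upper() == correct.upper()


def grade_numerical(response: str, correct: float, rel_tol: float = 0.01) -> bool:
    """Compare the first number in the extracted answer with the reference.

    The relative tolerance is an assumed value, not one taken from the source.
    """
    answer = extract_answer(response)
    if answer is None:
        return False
    match = NUMBER_PATTERN.search(answer)
    if match is None:
        return False
    return math.isclose(float(match.group()), correct, rel_tol=rel_tol)
```

A response with no recognizable answer marker is graded as incorrect rather than guessed at, which keeps the automatic score conservative and leaves ambiguous cases for manual review.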

Model-based Rubric Evaluation
As the complexity of reasoning tasks for large language models (LLMs) grows, reliable evaluation becomes challenging because symbolic answers are hard to grade and intermediate reasoning steps are hard to assess. We propose an approach in which the model generates rubrics and then uses them to evaluate solutions, based on reference solutions and examples of human-crafted rubrics. Our evaluation found that GPT-4 creates effective rubrics: they cover the key solution steps well, although GPT-4 struggles with point allocation. GPT-4 also outperforms its predecessor, GPT-3.5-turbo, at this task.
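To make the rubric idea concrete, the sketch below shows one way a rubric and its scoring could be represented once a model (or a human) has produced the rubric items and per-item awards. The `RubricItem` type, the function names, and the 10-point total are hypothetical choices for illustration; the source does not describe the rubric format in this detail. The validation step reflects the point-allocation weakness noted above by checking that the item points add up to the intended total.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RubricItem:
    """One gradable step of a reference solution (hypothetical structure)."""
    description: str
    max_points: int


def validate_rubric(rubric: List[RubricItem], total_points: int = 10) -> bool:
    """Check that every item has positive points and the points sum correctly.

    The 10-point default is an assumption for this example.
    """
    if not rubric:
        return False
    if any(item.max_points <= 0 for item in rubric):
        return False
    return sum(item.max_points for item in rubric) == total_points


def score_solution(rubric: List[RubricItem], awarded: List[int]) -> int:
    """Sum the awarded points, clamping each award to [0, max_points]."""
    if len(awarded) != len(rubric):
        raise ValueError("one award is required per rubric item")
    total = 0
    for item, points in zip(rubric, awarded):
        total += min(max(points, 0), item.max_points)
    return total
```

Clamping each award keeps a grader that over- or under-allocates points on a single step from pushing the total outside the rubric's range.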




