GRAB

A Challenging GRaph Analysis Benchmark for Large Multimodal Models

University of Cambridge, The University of Hong Kong
ICCV 2025

Overall performance on GRAB at release. Our benchmark proves challenging for frontier LMMs.
The highest-performing model, Claude 3.5 Sonnet 🥇, attains an accuracy of just 21.7%.

Overview

Large multimodal models (LMMs) have exhibited proficiency across many visual tasks. Although numerous benchmarks exist to evaluate model performance, they increasingly have insufficient headroom and are unfit to evaluate the next generation of frontier LMMs.

To overcome this, we present GRAB, a challenging 2,170-question benchmark focused on the tasks human analysts typically perform when interpreting figures. Such tasks include estimating the means, intercepts, or correlations of functions and data series, and performing transforms. We evaluate an initial suite of 20 LMMs on GRAB via exact matching, finding it to be a challenging benchmark, with the best model at release scoring just 21.7%.
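For illustration, the sketch below shows one way exact-match scoring could be implemented: the model's free-form answer and the ground truth are normalised (case, whitespace, units, trailing zeros) and then compared for string equality. The normalisation rules here are assumptions made for the example, not the benchmark's released evaluation code.

```python
import re

def normalise(ans: str) -> str:
    """Lower-case, strip whitespace/units, and canonicalise plain numbers
    so that '0.50', ' 0.5 ' and '0.5\n' all compare equal."""
    ans = ans.strip().lower()
    ans = re.sub(r"[^0-9a-z\.\-]", "", ans)   # drop units, spaces, punctuation
    if re.fullmatch(r"-?\d+(\.\d+)?", ans):   # canonicalise plain numbers
        ans = str(float(ans))
    return ans

def exact_match(prediction: str, ground_truth: str) -> bool:
    """Score 1 only if the normalised strings are identical."""
    return normalise(prediction) == normalise(ground_truth)

assert exact_match(" 0.50 ", "0.5")
assert not exact_match("0.51", "0.5")
```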

To complement the main GRAB benchmark, which is constructed from synthetic figures, we also introduce GRAB-real, a 1,114-question set of more realistic figures organised into 4 tasks: paper sketches, whiteboard sketches, figures embedded in computer environments, and figures with added noise.

In light of the recent development of reasoning models, we also introduce GRAB-lite, a lightweight, task-balanced 500-question subset of GRAB and GRAB-real, and evaluate leading frontier LMMs on it.

Leaderboards

Select to view the leaderboards for GRAB, GRAB-real, and GRAB-lite below. The tables are sortable by clicking on the column headers. Initial evaluations at release (for non-thinking models) were carried out to maximize determinism, using temperature 0 and setting seeds and top-k to 0 where possible. For more recent thinking models, evaluations were carried out with default settings. More model results will be added over time.
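As a rough illustration of those decoding settings, the sketch below issues a single greedy request through the OpenAI Python client; the model name and prompt are placeholders, the `seed` parameter is only honoured by some providers and models, and parameters an API does not expose (e.g. top-k) are simply left at their defaults. This is a sketch, not the exact harness used for the leaderboard.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Greedy, (near-)deterministic decoding: temperature 0 plus a fixed seed
# where the API supports one.
response = client.chat.completions.create(
    model="gpt-4o",            # placeholder model name
    temperature=0,
    seed=0,                    # honoured by some providers/models only
    messages=[
        {"role": "user", "content": "Estimate the gradient of the plotted line."}
    ],
)
print(response.choices[0].message.content)
```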
Model                  Properties  Functions  Series  Transforms  Overall
GPT-5.2                      75.6       64.5    48.4        70.6     65.1
GPT-5.1                      66.8       38.0    32.7        53.9     47.8
GPT-5-nano                   48.3       26.2    25.1        30.3     33.3
GPT-5-mini                   65.0       38.7    36.1        54.5     48.4
GPT-5                        66.8       41.1    35.5        57.4     50.0
Gemini 3 Pro                 69.8       54.9    46.7        58.7     58.2
Gemini 3 Flash               76.4       58.7    48.4        68.4     63.1
Claude 4.5 Sonnet            56.5       34.1    36.7        41.6     42.6
Claude 3.5 Sonnet            41.8       15.5    11.0        10.0     21.7
Gemini 1.5 Pro               34.2       11.4    13.3         6.5     18.1
Gemini 1.5 Flash             28.5       11.5     8.4         9.0     15.6
GPT-4o                       24.7       10.8     9.2         3.5     13.6
Claude 3 Sonnet              15.3        8.6     4.5         4.8      9.2
Reka Flash                   13.2       10.1     6.3         3.9      9.3
GPT-4 Turbo                  18.5        8.5     4.9         3.5     10.0
Claude 3 Haiku               14.2        6.6     8.8         3.9      9.0
TransCore-M                   7.9        9.2     7.6         3.9      7.6
Yi-VL-6b                      5.6        8.6     7.1         4.2      6.7
LLaVA-1.5 13b                 5.0        7.7     8.4         3.9      6.5
CogVLM-Chat                   7.0        4.9     5.1         3.9      5.4
GPT-4o mini                  15.8        6.8     5.7         2.9      8.7
LLaVA-1.5 7b                  4.7        7.5     6.5         4.8      6.0
Yi-VL-34b                     7.6        5.9     5.5         2.3      5.8
Qwen-VL-Chat                 10.2        6.6     5.1         2.9      6.8
OmniLMM-3b                    6.7        4.9     4.1         4.5      5.2
Reka Core                     1.7        0.0     4.3         0.3      1.5
Gemini 1.0 Pro Vision        20.2        5.8     6.9         6.1     10.5
Reka Edge                    11.8        8.7    11.6         1.9      9.4

🎉 To add your GRAB results, please contact the authors by email.

GRAB, GRAB-real and GRAB-lite datasets

The GRAB benchmark suite evaluates the graph analysis capabilities of large multimodal models through questions that mirror the tasks human analysts typically perform when interpreting figures: estimating the means, intercepts, and correlations of functions and data series, and performing transforms. The core task categories are:

  • Properties focuses on the analysis of features of individual functions and series
  • Functions requires computing the mean of properties across multiple functions
  • Series requires computing the mean of properties across multiple series
  • Transforms involves determining the properties of a function after it has undergone a series of transforms
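
As a concrete (hypothetical) example of how a Functions-style question can be paired with an exact ground truth, the sketch below plots a few straight lines with known coefficients and records their mean y-intercept as the answer. The plotting style and question wording are illustrative assumptions, not GRAB's actual generation pipeline.

```python
import numpy as np
import matplotlib.pyplot as plt

# Known coefficients -> exact ground truth, so no estimation is needed at creation time.
lines = [(0.5, 1.0), (-1.2, 3.0), (2.0, -2.0)]    # (gradient, y-intercept) pairs
mean_intercept = np.mean([c for _, c in lines])   # ground truth: (1.0 + 3.0 - 2.0) / 3

x = np.linspace(-5, 5, 100)
fig, ax = plt.subplots()
for m, c in lines:
    ax.plot(x, m * x + c, label=f"y = {m:g}x {c:+g}")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
fig.savefig("functions_example.png")

question = "What is the mean y-intercept of the functions shown? Give your answer to 2 d.p."
answer = f"{mean_intercept:.2f}"                  # "0.67"
```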

GRAB

The main benchmark consists of 2,170 questions centered on high-quality synthetic graphs, spanning 23 different graph properties.

GRAB-real

A 1,114-question set featuring more realistic figures organised into 4 sub-tasks: paper sketches, whiteboard sketches, figures embedded in computer environments, and figures with added noise.

GRAB-lite

A lightweight, task-balanced 500-question subset combining questions from both GRAB and GRAB-real, with 100 questions from each of the five task categories (Properties, Functions, Series, Transforms, and Real).
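
A minimal sketch of the kind of task-balanced sampling described above, assuming each question is stored as a dict with a 'task' field; the field name, seed, and function name are illustrative, not GRAB-lite's actual construction script.

```python
import random

def balanced_subset(questions, per_task=100, seed=0):
    """Draw `per_task` questions from each task category with a fixed seed."""
    rng = random.Random(seed)
    by_task = {}
    for q in questions:
        by_task.setdefault(q["task"], []).append(q)
    subset = []
    for task in sorted(by_task):                  # deterministic task order
        subset.extend(rng.sample(by_task[task], per_task))
    return subset

# e.g. 5 task categories x 100 questions each -> a 500-question subset:
# grab_lite = balanced_subset(grab_questions + grab_real_questions, per_task=100)
```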

Additional Experimental Results

BibTeX

@inproceedings{roberts2025grab,
  title={GRAB: A challenging graph analysis benchmark for large multimodal models},
  author={Roberts, Jonathan and Han, Kai and Albanie, Samuel},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={1644--1654},
  year={2025}
}