Ari Mayer edited this page Mar 23, 2026 · 5 revisions

Benchmarks

Compare model performance across 15+ benchmarks with ~400 entries from Artificial Analysis. The Benchmarks tab offers browse and compare modes with head-to-head tables, scatter plots, and radar charts. The CLI provides filtering, sorting, and JSON output.

(Screenshot: the Benchmarks tab)

TUI

Layout

The Benchmarks tab has two modes:

  • Browse mode -- model list on the left, detail panel on the right
  • Compare mode -- model list on the left, comparison view on the right (H2H table, scatter plot, or radar chart)

The left panel can toggle between a Models list and a Creators sidebar with t.

Quick sort

Press a metric's number key once to sort by that metric; press the same key again to toggle the sort direction:

Key  Metric
1    Intelligence index
2    Release date
3    Speed (tokens/second)
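
The press-once / press-again behavior can be sketched as follows (a minimal Python sketch; the key map and the default direction are assumptions, not the tool's actual code):

```python
# Quick-sort keys as described above; names are illustrative.
QUICK_SORT_KEYS = {"1": "intelligence", "2": "release_date", "3": "speed"}

def handle_quick_sort(state, key):
    """First press sorts by the metric; a second press flips direction."""
    metric = QUICK_SORT_KEYS[key]
    if state.get("sort_by") == metric:
        state["descending"] = not state["descending"]
    else:
        state["sort_by"] = metric
        state["descending"] = True  # assumed default: best values first
    return state
```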

Filters

Key  Action
4    Cycle source filter (All / Open / Closed)
5    Cycle region filter (US / China / Europe / ...)
6    Cycle type filter (Startup / Big Tech / Research)
7    Cycle reasoning filter (All / Reasoning / Non-reasoning)
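
Each filter key steps through a fixed list of options and wraps around, which can be sketched like this (a hedged Python sketch; the example list is the source filter from the table above):

```python
def cycle_filter(current, options):
    """Advance to the next option in the cycle, wrapping to the first."""
    return options[(options.index(current) + 1) % len(options)]

# Source filter options as listed in the table above.
SOURCE_OPTIONS = ["All", "Open", "Closed"]
```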

Sort picker

Key  Action
s    Open sort picker popup with all available metrics
S    Toggle sort direction (ascending/descending)

The sort picker popup lists all available benchmark metrics. Select one with Enter or dismiss with Esc.

Creators sidebar

Press t to toggle the left panel between the model list and the creators sidebar. The sidebar shows 40+ model creators with counts, filterable by region, type, and open/closed source. Select a creator to filter the model list to their models only.

Detail panel

In browse mode, the detail panel shows full benchmark data for the selected model:

  • Indexes -- Intelligence, Coding, Math, GPQA Diamond
  • Scores -- individual benchmark scores
  • Performance -- speed (tokens/second), latency, time to first token
  • Pricing -- input and output cost per million tokens

Values are formatted as {:.1} (one decimal place) for scores and indexes. Missing values display as an em-dash.
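
The formatting rule can be expressed as a small helper (a Python sketch of the described behavior, not the tool's actual code):

```python
def fmt_metric(value):
    """One decimal place for present values; an em-dash for missing ones."""
    return "—" if value is None else f"{value:.1f}"
```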

Compare mode

(Screenshot: Benchmarks compare mode)

Select up to 8 models for comparison:

Key           Action
Space         Toggle model selection (max 8)
v             Cycle comparison view (H2H table, scatter plot, radar chart)
c             Clear all selections
Left / Right  Switch focus between list and compare panel
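
The Space toggle with its 8-model cap can be sketched as follows (hypothetical names; this sketch assumes a press beyond the cap is simply ignored):

```python
MAX_COMPARE = 8  # cap stated in the key table above

def toggle_selection(selected, model):
    """Add or remove a model; ignore additions beyond the cap."""
    if model in selected:
        selected.remove(model)
    elif len(selected) < MAX_COMPARE:
        selected.append(model)
    return selected
```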

Head-to-head table

A side-by-side comparison table showing all metrics for selected models. Press d to show the detail overlay. Scroll with arrow keys when the compare panel is focused.

Scatter plot

A two-axis scatter plot comparing selected models. Cycle the axes:

Key  Action
x    Cycle X-axis metric
y    Cycle Y-axis metric

Radar chart

A multi-axis radar chart overlaying selected models. Press a to cycle through radar presets (different metric combinations).

Each selected model gets a unique color from the comparison palette for consistent identification across all three views and the legend.
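
One way to get the consistent identification described above is to derive each model's color from its selection order; a hedged sketch (the palette values are placeholders, not the tool's theme):

```python
# Placeholder palette; the real colors come from the tool's theme.
# Eight entries match the 8-model selection cap.
PALETTE = ["red", "green", "blue", "yellow", "magenta", "cyan", "orange", "purple"]

def assign_colors(selected):
    """Map each selected model to a palette color by selection order."""
    return {model: PALETTE[i] for i, model in enumerate(selected)}
```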

Search

Press / to search benchmark entries by name, slug, or creator.
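
A case-insensitive substring match over those three fields can be sketched as (field names are assumptions):

```python
def matches(entry, query):
    """True if the query appears in the entry's name, slug, or creator."""
    q = query.lower()
    return any(q in entry[field].lower() for field in ("name", "slug", "creator"))
```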

Actions

Key  Action
o    Open the selected model's Artificial Analysis page in the browser

CLI

(Screenshot: CLI benchmarks list)

The benchmarks CLI can be invoked as models benchmarks <command> or as a standalone benchmarks <command> via a symlink (see Installation#setting-up-command-aliases).

Interactive benchmark picker

models benchmarks list
models benchmarks list --sort speed --limit 10
models benchmarks list --creator openai --reasoning
models benchmarks list --open --sort price-input --asc

Opens an inline terminal picker with a model table and a detail preview. Inside the picker:

  • / starts a live text filter over name, slug, and creator
  • s cycles sort metrics
  • S reverses the current sort
  • Enter prints the selected model's benchmark details

Show benchmark details

models benchmarks show gpt-4o
models benchmarks show "Claude Sonnet 4"

Prints a formatted benchmark breakdown. If the query matches multiple variants in an interactive terminal, the picker reopens with just the matching candidates.

Filtering flags

Flag              Description
--creator <name>  Filter by creator name
--open            Show only open-source models
--closed          Show only closed-source models
--reasoning       Show only reasoning models
--sort <metric>   Sort by metric (intelligence, coding, math, speed, price-input, etc.)
--asc             Sort ascending (default is descending)
--limit <n>       Limit results

JSON output

models benchmarks list --creator anthropic --json
models benchmarks show gpt-4o --json
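
The JSON output is convenient for scripting. A hedged sketch using made-up sample data, since the actual schema is not documented here (field names are assumptions):

```python
import json

# Stand-in for `models benchmarks list --json` output; real field names
# may differ.
raw = '[{"name": "Model A", "intelligence": 52.1}, {"name": "Model B", "intelligence": 60.3}]'

entries = json.loads(raw)
best = max(entries, key=lambda e: e["intelligence"])
print(best["name"])  # → Model B
```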

Data freshness

Benchmark data is fetched fresh from the Artificial Analysis CDN on every launch -- there is no local cache for benchmark data. The upstream dataset is updated automatically every 30 minutes via a GitHub Actions workflow.
