Benchmarks
Compare model performance across 15+ benchmarks with ~400 entries from Artificial Analysis. The Benchmarks tab offers browse and compare modes with head-to-head tables, scatter plots, and radar charts. The CLI provides filtering, sorting, and JSON output.

The Benchmarks tab has two modes:
- Browse mode -- model list on the left, detail panel on the right
- Compare mode -- model list on the left, comparison view on the right (H2H table, scatter plot, or radar chart)
The left panel can toggle between a Models list and a Creators sidebar with the t key.
Press the number key once to sort by that metric; press again to toggle direction:
| Key | Metric |
|---|---|
| 1 | Intelligence index |
| 2 | Release date |
| 3 | Speed (tokens/second) |
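The press-once-to-sort, press-again-to-flip behavior can be sketched in a few lines. This is an illustrative model of the described keybinding logic, not the app's actual implementation; all names are invented:

```python
# Sketch: the first press of a number key selects that metric
# (descending by default); pressing the same key again flips the
# direction. State keys here are hypothetical.
def handle_sort_key(state, metric):
    if state.get("sort_metric") == metric:
        # Same metric pressed again: toggle direction.
        state["sort_ascending"] = not state.get("sort_ascending", False)
    else:
        # New metric: select it, descending first.
        state["sort_metric"] = metric
        state["sort_ascending"] = False
    return state

state = {}
handle_sort_key(state, "intelligence")  # sort by intelligence, descending
handle_sort_key(state, "intelligence")  # same key again: now ascending
print(state)  # {'sort_metric': 'intelligence', 'sort_ascending': True}
```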
| Key | Action |
|---|---|
| 4 | Cycle source filter (All / Open / Closed) |
| 5 | Cycle region filter (US / China / Europe / ...) |
| 6 | Cycle type filter (Startup / Big Tech / Research) |
| 7 | Cycle reasoning filter (All / Reasoning / Non-reasoning) |
| Key | Action |
|---|---|
| s | Open sort picker popup with all available metrics |
| S | Toggle sort direction (ascending/descending) |
The sort picker popup lists all available benchmark metrics. Select one with Enter or dismiss with Esc.
Press t to toggle the left panel between the model list and the creators sidebar. The sidebar shows 40+ model creators with counts, filterable by region, type, and open/closed source. Select a creator to filter the model list to their models only.
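Selecting a creator narrows the model list to that creator's models. A minimal sketch of that filtering step, with invented data and field names:

```python
# Sketch: filter the model list to a single creator's models.
# The records and field names are illustrative, not the app's data model.
models = [
    {"name": "gpt-4o", "creator": "OpenAI"},
    {"name": "claude-sonnet-4", "creator": "Anthropic"},
    {"name": "o3", "creator": "OpenAI"},
]

def by_creator(models, creator):
    return [m["name"] for m in models if m["creator"] == creator]

print(by_creator(models, "OpenAI"))  # ['gpt-4o', 'o3']
```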
In browse mode, the detail panel shows full benchmark data for the selected model:
- Indexes -- Intelligence, Coding, Math, GPQA Diamond
- Scores -- individual benchmark scores
- Performance -- speed (tokens/second), latency, time to first token
- Pricing -- input and output cost per million tokens
Scores and indexes are formatted to one decimal place (`{:.1}`). Missing values display as an em-dash.
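That formatting rule can be mimicked in a couple of lines (a sketch, not the app's code):

```python
# One decimal place for present values, an em-dash for missing ones,
# mirroring the `{:.1}`-style formatting described above.
def fmt_score(value):
    return "\u2014" if value is None else f"{value:.1f}"

print(fmt_score(87.349))  # 87.3
print(fmt_score(None))    # —
```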
Select up to 8 models for comparison:
| Key | Action |
|---|---|
| Space | Toggle model selection (max 8) |
| v | Cycle comparison view (H2H table, Scatter plot, Radar chart) |
| c | Clear all selections |
| Left / Right | Switch focus between list and compare panel |
A side-by-side comparison table showing all metrics for selected models. Press d to show the detail overlay. Scroll with arrow keys when the compare panel is focused.
A two-axis scatter plot comparing selected models. Cycle the axes:
| Key | Action |
|---|---|
| x | Cycle X axis metric |
| y | Cycle Y axis metric |
A multi-axis radar chart overlaying selected models. Press a to cycle through radar presets (different metric combinations).
Each selected model gets a unique color from the comparison palette for consistent identification across all three views and the legend.
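One way to keep a model's color stable across all three views is to derive it from the model's position in the selection order. A minimal sketch with an invented palette (the real palette and lookup are not documented here):

```python
# Sketch: assign each selected model a color by selection index so the
# same model renders in the same color in H2H, scatter, and radar views.
# Palette values are placeholders, not the app's actual colors.
PALETTE = ["red", "blue", "green", "yellow", "magenta", "cyan", "orange", "purple"]

def color_for(selected, model):
    # Selection order is stable, so the color stays consistent per model.
    return PALETTE[selected.index(model) % len(PALETTE)]

selected = ["gpt-4o", "claude-sonnet-4"]
print(color_for(selected, "gpt-4o"))           # red
print(color_for(selected, "claude-sonnet-4"))  # blue
```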
Press / to search benchmark entries by name, slug, or creator.
| Key | Action |
|---|---|
| o | Open the selected model's Artificial Analysis page in browser |
The benchmarks CLI can be invoked as models benchmarks <command> or as a standalone benchmarks <command> via a symlink (see Installation#setting-up-command-aliases).
```
models benchmarks list
models benchmarks list --sort speed --limit 10
models benchmarks list --creator openai --reasoning
models benchmarks list --open --sort price-input --asc
```
Opens an inline terminal picker with a model table and detail preview. Inside the picker:
- / starts a live text filter over name, slug, and creator
- s cycles sort metrics
- S reverses the current sort
- Enter prints the selected model's benchmark details
```
models benchmarks show gpt-4o
models benchmarks show "Claude Sonnet 4"
```
Prints a formatted benchmark breakdown. If the query matches multiple variants in an interactive terminal, the picker reopens with just the matching candidates.
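The narrowing step described above can be sketched as a case-insensitive substring match; the real matcher may score or rank differently:

```python
# Sketch: find all model names matching a query so a picker can reopen
# with just those candidates. Illustrative only.
def match_candidates(query, models):
    q = query.lower()
    return [m for m in models if q in m.lower()]

models = ["Claude Sonnet 4", "Claude Sonnet 4 Thinking", "GPT-4o"]
hits = match_candidates("claude sonnet 4", models)
print(hits)  # both Claude Sonnet 4 variants; the picker would show these
```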
| Flag | Description |
|---|---|
| `--creator <name>` | Filter by creator name |
| `--open` | Show only open-source models |
| `--closed` | Show only closed-source models |
| `--reasoning` | Show only reasoning models |
| `--sort <metric>` | Sort by metric (intelligence, coding, math, speed, price-input, etc.) |
| `--asc` | Sort ascending (default is descending) |
| `--limit <n>` | Limit results |
```
models benchmarks list --creator anthropic --json
models benchmarks show gpt-4o --json
```
Benchmark data is fetched fresh from the Artificial Analysis CDN on every launch -- there is no local cache for benchmark data. The upstream dataset is updated automatically every 30 minutes via a GitHub Actions workflow.
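The `--json` output is meant for scripting and can be post-processed with standard tools. A sketch of consuming it in Python; the sample payload and its field names are assumptions about the output shape, not documented by the project:

```python
import json

# Hypothetical sample of `models benchmarks list --json` output;
# actual field names may differ.
raw = '[{"name": "m1", "speed": 120.5}, {"name": "m2", "speed": 95.0}]'

entries = json.loads(raw)
fastest = max(entries, key=lambda e: e["speed"])
print(fastest["name"])  # m1
```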