Leaderboard

Benchmark rankings on real-world test data, stratified by dataset and training paradigm. This page shows 9 metrics in a two-column grid. Use Top 5 / All to toggle how many models are shown for each metric. Note: Update Ratio is reported only for the Real-world finetuning paradigm (simulated pretraining → real-world finetuning).

Bars are sorted best → worst for each metric. Bar length is min–max normalized across all models in the current dataset + training paradigm (best = 100%). For error metrics (↓), smaller raw values correspond to longer bars; for R² (↑), larger values correspond to longer bars.
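The sorting and bar-length rule described above can be sketched as follows. This is a minimal illustration, not the page's actual implementation: the function names `bar_lengths` and `ranked`, the model names, and the tie-handling rule are assumptions. It uses plain min–max scaling, so the worst model gets a bar of 0%; the real page may apply a minimum visible length.

```python
from typing import Dict, List, Optional, Tuple


def bar_lengths(scores: Dict[str, float], lower_is_better: bool) -> Dict[str, float]:
    """Map raw metric values to bar lengths in percent (best = 100).

    Normalization uses every model passed in, so it should be called with
    all models for the current dataset + training paradigm, not only the
    models currently displayed.
    """
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    if span == 0:
        # Assumption: when all models tie, draw every bar at full length.
        return {name: 100.0 for name in scores}
    if lower_is_better:
        # Error metrics: the smallest raw value gets the longest bar.
        return {name: 100.0 * (hi - v) / span for name, v in scores.items()}
    # Metrics such as R^2: the largest raw value gets the longest bar.
    return {name: 100.0 * (v - lo) / span for name, v in scores.items()}


def ranked(
    scores: Dict[str, float], lower_is_better: bool, top_k: Optional[int] = None
) -> List[Tuple[str, float]]:
    """Sort models best to worst; top_k=5 mimics a Top 5 view, None shows all."""
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=not lower_is_better)
    return ordered if top_k is None else ordered[:top_k]


# Hypothetical error-metric values for three made-up models.
errors = {"model_a": 1.0, "model_b": 2.0, "model_c": 3.0}
print(bar_lengths(errors, lower_is_better=True))
# {'model_a': 100.0, 'model_b': 50.0, 'model_c': 0.0}
```

Because the lengths are computed over all models before any truncation, switching between a Top 5 view and a full view in this sketch changes which bars appear but not how long they are.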

Notes: All reported metrics are evaluated on real-world test data. Values that are unavailable are omitted.
