Leaderboard
Benchmark rankings on real-world test data, stratified by dataset and training paradigm. This page shows 9 metrics in a two-column grid. Use Top 5 / All to toggle how many models are shown for each metric. Note: Update Ratio is reported only for the Real-world finetuning paradigm (simulated pretraining → real-world finetuning).
Bars are sorted best → worst for each metric.
Bar length is min–max normalized across all models in the current dataset + training paradigm (best = 100%).
For error metrics (↓), smaller raw values correspond to longer bars; for R² (↑), larger values correspond to longer bars.
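The bar-length rule above can be sketched in a few lines of Python. This is a minimal illustration, not the page's actual implementation: the function name `bar_lengths`, the `higher_is_better` flag, and the choice to draw full-length bars when all models tie are assumptions made for the example. It also assumes the worst model maps to 0%, which is what plain min–max normalization gives.

```python
from typing import Dict, Optional


def bar_lengths(
    values: Dict[str, Optional[float]],
    higher_is_better: bool,
) -> Dict[str, float]:
    """Map raw metric values to bar lengths in percent (best = 100%).

    `values` maps a model name to its raw metric value, or None when the
    metric is unavailable for that model (such entries are omitted).
    """
    # Drop models with no reported value for this metric.
    present = {m: v for m, v in values.items() if v is not None}
    if not present:
        return {}

    lo, hi = min(present.values()), max(present.values())
    span = hi - lo

    lengths = {}
    for model, v in present.items():
        if span == 0:
            # Assumption: if every model ties, draw all bars at full length.
            pct = 100.0
        elif higher_is_better:
            # e.g. R² (↑): the largest value gets the longest bar.
            pct = 100.0 * (v - lo) / span
        else:
            # Error metrics (↓): the smallest value gets the longest bar.
            pct = 100.0 * (hi - v) / span
        lengths[model] = pct

    # Sort best → worst, i.e. longest bar first.
    return dict(sorted(lengths.items(), key=lambda kv: kv[1], reverse=True))


if __name__ == "__main__":
    # Hypothetical error values for three placeholder models.
    errors = {"model_a": 1.0, "model_b": 2.0, "model_c": 3.0}
    print(bar_lengths(errors, higher_is_better=False))
```

Because normalization is relative to the models in the current dataset and training paradigm, a bar's length says how a model compares with the others in that view, not how large its raw value is.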
Notes:
Reported metrics are evaluated on real-world test data. Where a metric is unavailable for a model, its value is omitted.