I read the ARC-AGI-3 paper in full, and I'm unimpressed.
The "100% human-solvable, <1% AI solved" is basically p-hacking. They cook their metrics to guarantee high human scores and punish any sub-human score. They also prevent measurement of super-human performance, so in practice it's close to a binary metric of "matches best human or not".
There are also a number of inconsistencies in the stated methodology, but they're non-central.
Their metric is:
An environment must be solved by at least 2 of 10 human testers. Among the successful runs, take the median (¤) number of actions; that's the baseline (per level of the environment), call it b.
Human performance is defined as 100% by virtue of being the baseline (with no analysis of how many humans solve the environment, whether the average human score would actually be 100%, or anything deeper about human performance).
An environment has n levels. Levels are attempted sequentially, in increasing order of difficulty; solving one unlocks the next one. The environment is solved if all levels are completed.
If a model doesn't solve a level, it scores 0 on that level (and on all subsequent ones, which stay locked). If it solves it in m steps, it receives a score of (b/m)². (*)
Then take the weighted average of its scores across levels, where level k is weighted by k.
(*) If the model is better than the humans (m < b), its level score is clamped at 1.15, but tbh it doesn't really matter. Also, the environment score is clamped at 1 for some reason.
(¤) They say "upper-median best", which doesn't make sense; their worked example takes the median over the people who solved the environment, so I'm going with that interpretation.
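
To pin the definition down, here's a minimal sketch of the metric as I read it. The paper doesn't ship reference code as far as I know, so the function names and the convention of None for an unsolved level are mine:

```python
from statistics import median

def level_baseline(human_action_counts):
    """Median (¤) action count among the humans who solved the level."""
    assert len(human_action_counts) >= 2, "needs >= 2/10 human solvers"
    return median(human_action_counts)

def level_score(b, m):
    """(b/m)^2 for a level solved in m steps, capped at 1.15 (*);
    m is None when the level wasn't solved."""
    if m is None:
        return 0.0
    return min((b / m) ** 2, 1.15)

def environment_score(baselines, model_steps):
    """Weighted average of level scores, level k weighted by k, capped at 1.
    Once a level goes unsolved, every later level scores 0 (it stays locked)."""
    scores, locked = [], False
    for b, m in zip(baselines, model_steps):
        locked = locked or m is None
        scores.append(0.0 if locked else level_score(b, m))
    n = len(scores)
    weighted = sum(k * s for k, s in enumerate(scores, start=1))
    return min(weighted / (n * (n + 1) / 2), 1.0)
```

For example, with baselines [10, 20, 30] and model step counts [12, 25, None], this gives (1·(10/12)² + 2·(20/25)² + 0)/6 ≈ 0.33.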
There are two problems with this metric:
- Human variance. Depending on the environment, the baseline might be ultra-optimized and close to optimal, or it might not be. Judging by their empirical estimate of the optimal score (probably derived from non-first-run human performance?), the baseline is clearly very noisy (see the simulation after this list).
- The way it's calculated punishes sub-human performance quadratically for no clear reason (also illustrated below), and upweighs the hardest levels.
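
Both problems are easy to demonstrate with a tiny simulation. All numbers below are synthetic (I'm inventing a plausible spread of human step counts), so the point is the shape of the effects, not the specific values:

```python
import random
from statistics import median

random.seed(0)

# Problem 1: baseline noise. Pretend the true human step-count distribution
# for one level is uniform on [40, 120] (made-up numbers), and the baseline
# is the median of however many of the 10 attempts happened to succeed.
resampled_baselines = []
for _ in range(1000):
    solvers = [random.randint(40, 120) for _ in range(random.randint(2, 10))]
    resampled_baselines.append(median(solvers))
print(f"baseline b ranges over {min(resampled_baselines)}.."
      f"{max(resampled_baselines)} across resamples of the same level")

# Problem 2: quadratic punishment. Modest slowdowns relative to b crater
# the level score.
b = 80
for slowdown in (1.0, 1.25, 1.5, 2.0, 3.0):
    m = b * slowdown
    print(f"m = {slowdown:.2f} x baseline -> level score {(b / m) ** 2:.2f}")
```

A model taking just 1.5× the median number of steps already scores (1/1.5)² ≈ 0.44, and doubling the steps puts it at 0.25.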