EdgeBench | Scaling Laws of Environment Learning

An ultra-long-horizon benchmark built to measure learning from environments

Most benchmarks score what a model already knows. EdgeBench is built to measure something else. It asks how an agent learns from a real-world environment when it is given the time, the feedback, and the room to improve.

First benchmark to measure real-world environment learning

Every workspace, feedback signal, and judge approximates real practice, so a high score reflects what an agent learns.
Ultra-long-horizon real-world tasks
≥12h

Each task runs 12+ hours of continuous operation, long enough for experience to compound. Selected extended runs continue beyond 72 h.
Six diverse task families, mostly built from scratch
134 tasks

Tasks span science, software engineering, optimization, knowledge work, formal math, and games. Most are brand-new, built from zero.
Researcher reviewed, with expert effort tracked
57.2h

Domain experts review and iterate each task. Among tasks with recorded human effort, expert work averages 57.2 h and reaches 320 h in the largest cases.

We're releasing an initial 51 of the 134 tasks, together with the full evaluation framework, so the community can study how agents learn from real-world environments.

4/39 Scientific & ML
13/36 Systems & SE
14/19 Optimization
4/19 Knowledge
8/13 Formal
8/8 Games

Why learning from environments matters

Why study agents' ability to learn from their environments? Real-world use of AI depends on more than what a model learned during training. Some of the knowledge an agent needs never appears in training data, such as private records and internal tools. Even when raw data exists, it omits the human process behind it: the trial and error, the interpretation of evidence, and the adaptation to feedback through which experts actually reach results. The real world also never stands still: human knowledge keeps advancing, and new tools, discoveries, and problems continually emerge that no fixed training corpus can anticipate. An agent's ability to learn from its environment and improve its task performance is therefore central to deploying AI systems at scale in the real world.

The environment-learning loop

1. Attempt: Test a candidate solution in the live environment.
2. Observe: Receive new information from the environment during each attempt.
3. Absorb: Analyze environmental feedback and interaction history.
4. Improve: Convert experience into better plans and artifacts.

Repeats across a 12-hour run
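The four steps above can be sketched as a small loop. This is a toy illustration only, not the EdgeBench harness: `run_learning_loop`, `toy_score`, and `toy_propose` are hypothetical names, and the "environment" is a made-up scoring function with a hidden optimum.

```python
import random


def run_learning_loop(score_fn, propose_fn, initial, n_attempts, seed=0):
    """Toy Attempt/Observe/Absorb/Improve loop; returns (best, best-so-far curve)."""
    rng = random.Random(seed)
    best, best_score = initial, score_fn(initial)    # first attempt sets the baseline
    history = [(initial, best_score)]
    curve = [best_score]
    for _ in range(n_attempts - 1):
        candidate = propose_fn(best, history, rng)   # Improve: plan from experience
        score = score_fn(candidate)                  # Attempt + Observe: run and read feedback
        history.append((candidate, score))           # Absorb: keep the interaction history
        if score > best_score:
            best, best_score = candidate, score
        curve.append(best_score)                     # best-so-far never decreases
    return best, curve


def toy_score(x):
    """Hypothetical environment: score peaks at a hidden optimum x = 3.0."""
    return 100.0 - (x - 3.0) ** 2


def toy_propose(best, history, rng):
    """Hypothetical agent: perturb the best candidate found so far."""
    return best + rng.gauss(0.0, 0.5)


best, curve = run_learning_loop(toy_score, toy_propose, initial=0.0, n_attempts=200)
```

The returned `curve` is the kind of best-so-far trajectory the benchmark plots; a real agent would replace the random perturbation with reasoning over `history`.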

Agents continuously learn from environments and improve their performance

Across EdgeBench, agents do not simply submit once and stop. They interact with task environments, receive informative feedback, and learn from that feedback to improve performance. The representative curves below, drawn from all six capability families, show that models learn differently across tasks. Some improve gradually, some move quickly early and then plateau, and others remain flat until a late breakthrough. Each colored line is a model's best-so-far score over a 12-hour run, showing that agents continually turn environment feedback into better artifacts, strategies, and final outcomes.

38,000 hours of noisy curves, one clean law

For each model, EdgeBench records 402 learning curves from 134 tasks, spanning 12- to 72-hour interaction windows. We average those curves point by point in interaction time. The noisy task-specific trajectories collapse onto a simple log-sigmoid, \(S(t) = S_{\max}/(1 + (t_{\mathrm{mid}}/t)^{\beta})\), with mean \(R^2 = 0.998\). Here \(S_{\max}\) is the asymptotic score, \(t_{\mathrm{mid}}\) is the time at which the score reaches \(S_{\max}/2\), and \(\beta\) sets the steepness.
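The page does not say how the fit is performed. As a minimal sketch, assuming \(S_{\max}\) is already known, the law linearizes as \(\ln\frac{S}{S_{\max}-S} = \beta\ln t - \beta\ln t_{\mathrm{mid}}\), so \(\beta\) and \(t_{\mathrm{mid}}\) follow from an ordinary least-squares line. The function names and the synthetic data below are illustrative, and a least-squares fit in logit space weights points differently from a direct fit in score space.

```python
import math


def log_sigmoid(t, s_max, t_mid, beta):
    """S(t) = S_max / (1 + (t_mid / t)^beta)."""
    return s_max / (1.0 + (t_mid / t) ** beta)


def fit_log_sigmoid(times, scores, s_max):
    """Estimate (t_mid, beta) for a known s_max; requires 0 < score < s_max."""
    xs = [math.log(t) for t in times]
    ys = [math.log(s / (s_max - s)) for s in scores]   # logit of normalized score
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx                                   # slope of the line
    intercept = mean_y - beta * mean_x                 # equals -beta * ln(t_mid)
    t_mid = math.exp(-intercept / beta)
    return t_mid, beta


def r_squared(observed, predicted):
    """Coefficient of determination in score space."""
    mean = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Synthetic, noise-free example (values are made up, not EdgeBench data).
times = [0.5, 1.0, 2.0, 4.0, 8.0, 12.0]
scores = [log_sigmoid(t, 60.0, 3.0, 1.2) for t in times]
t_mid_hat, beta_hat = fit_log_sigmoid(times, scores, s_max=60.0)
fitted = [log_sigmoid(t, 60.0, t_mid_hat, beta_hat) for t in times]
r2 = r_squared(scores, fitted)
```

On real averaged curves \(S_{\max}\) would also have to be estimated, for example by scanning candidate values and keeping the one with the highest \(R^2\).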

Per-task curves

A theory of the log-sigmoid law

The fits are clean, but why should learning from an environment take this exact shape? Write \(x(t) = S(t)/S_{\max}\) for the normalized score so far. We show that \(x\) obeys

\[ \frac{dx}{d\ln t} = \beta\,x(1-x), \]

which is enough to recover the log-sigmoid shape.

Score is built from many small units. A task's total adds up from many small wins — a fact learned, a test passed. Imagine each of them as a node in a hidden graph, either unlocked or still locked. So \(x(t)\) is the fraction of nodes that have been unlocked so far. The key assumption: unlocking a node makes its locked neighbors easier to reach, so learning spreads outward from there.

Progress moves as a frontier. Because a node opens only once a neighbor already has, new score appears at the boundary where unlocked nodes meet the locked ones. Learning is the frontier moving forward: from what the agent knows to what it doesn't.

Pace is proportional to \(x(1-x)\). Two forces set the frontier's speed: what is already unlocked, \(x\), powers the next unlock, and what is still locked, \(1-x\), is the room left to grow. In the many-unit limit, a mean-field argument shows that the process converges to \(\frac{dx}{du} = \beta\,x(1-x)\), where \(u\) is the effective time variable identified in the next step.
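A rough way to see the \(x(1-x)\) pace emerge is to assume a well-mixed graph, in which every locked unit unlocks at rate \(\beta k/N\) when \(k\) of \(N\) units are unlocked. This well-mixed rate is an illustrative assumption, not the construction in the paper. The total unlock rate is then \(\beta k(N-k)/N\), and summing the expected waiting times gives a staircase that tracks the logistic curve in \(u\).

```python
import math


def expected_unlock_times(n_units, beta, k_start):
    """Expected effective time u at which k units are unlocked, for k >= k_start.

    Assumes a well-mixed graph: with k units unlocked, the total unlock
    rate is beta * k * (n_units - k) / n_units, so the expected wait for
    the next unlock is the reciprocal of that rate.
    """
    u = 0.0
    times = {k_start: 0.0}
    for k in range(k_start, n_units):
        total_rate = beta * k * (n_units - k) / n_units
        u += 1.0 / total_rate
        times[k + 1] = u
    return times


def logistic_u(x, beta):
    """Inverse of x(u) = 1 / (1 + exp(-beta * u))."""
    return math.log(x / (1.0 - x)) / beta


# Start at the midpoint (x = 1/2 at u = 0) and compare at x = 0.9.
n_units, beta = 1000, 1.0
times = expected_unlock_times(n_units, beta, k_start=n_units // 2)
u_discrete = times[900]              # expected u when 90% of units are unlocked
u_mean_field = logistic_u(0.9, beta)
gap = abs(u_discrete - u_mean_field)
```

The gap between the discrete staircase and the mean-field curve shrinks as `n_units` grows, which is the sense in which the smooth equation is a many-unit limit.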

Self-similarity implies logarithmic time. The task graph is self-similar: each step up in difficulty exposes multiplicatively more structure. With steady effort, the difficulty reached grows like \(\ln t\), so the right time axis is \(u = \ln(t/t_{\mathrm{mid}})\), not the raw time \(t\).
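One way to read this step, under the illustrative assumption (not stated on this page) that clearing difficulty level \(k\) costs \(c\,b^{k}\) units of effort with \(b > 1\): the time needed to reach level \(d\) at a steady pace is a geometric sum, so the level reached grows logarithmically in time.

\[ t(d) = \sum_{k=0}^{d-1} c\,b^{k} = c\,\frac{b^{d}-1}{b-1} \quad\Longrightarrow\quad d(t) = \log_b\!\Big(1 + \frac{(b-1)\,t}{c}\Big) \approx \frac{\ln t}{\ln b} + \text{const} \quad (t \gg c). \]

Under this reading, equal steps in difficulty correspond to equal steps in \(\ln t\), which is why \(u\) rather than \(t\) is the natural clock.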

Put together, they give the log-sigmoid. From above we have \(\frac{dx}{d\ln t} = \beta\,x(1-x)\), and solving it gives back exactly the fitted law:

\[ S(t) = \frac{S_{\max}}{1 + (t_{\mathrm{mid}}/t)^{\beta}}. \]
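The solution step can be written out explicitly. With \(u = \ln(t/t_{\mathrm{mid}})\), so that \(du = d\ln t\), separating variables and using \(\frac{1}{x(1-x)} = \frac{1}{x} + \frac{1}{1-x}\) gives

\[ \int \frac{dx}{x(1-x)} = \int \beta\,du \quad\Longrightarrow\quad \ln\frac{x}{1-x} = \beta u + C. \]

Taking \(t_{\mathrm{mid}}\) to be the time at which \(x = \tfrac{1}{2}\) fixes \(C = 0\), and therefore

\[ x = \frac{1}{1 + e^{-\beta u}} = \frac{1}{1 + (t_{\mathrm{mid}}/t)^{\beta}}, \qquad S(t) = S_{\max}\,x(t). \]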

The law emerges from many-task averaging. Any single task is jagged: because its task graph is finite, its curve consists of long plateaus and sudden jumps. The model describes only the limiting trend. Averaging over many tasks cancels the individual task noise, so the log-sigmoid emerges as a population-level trend.
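The noise-cancelling effect can be illustrated with a deliberately simplified simulation. It replaces the frontier dynamics with independent unlock times drawn from a logistic distribution in \(u\), chosen so that the population mean is the log-sigmoid by construction; the unit weights, task count, and function names are all made up for illustration.

```python
import math
import random


def sample_task(rng, n_units, beta):
    """One toy task: a few weighted score units with random unlock times in u."""
    weights = [rng.uniform(0.5, 1.5) for _ in range(n_units)]
    unlock_u = []
    for _ in range(n_units):
        p = min(max(rng.random(), 1e-12), 1.0 - 1e-12)   # keep the logit finite
        unlock_u.append(math.log(p / (1.0 - p)) / beta)  # logistic-distributed u
    return weights, unlock_u


def task_score(task, u):
    """Normalized staircase score: weight unlocked by time u over total weight."""
    weights, unlock_u = task
    return sum(w for w, t in zip(weights, unlock_u) if t <= u) / sum(weights)


def average_curve(tasks, grid):
    """Point-by-point mean of the per-task staircases."""
    return [sum(task_score(task, u) for task in tasks) / len(tasks) for u in grid]


rng = random.Random(0)
beta = 1.0
grid = [-5.0 + 0.25 * i for i in range(41)]
tasks = [sample_task(rng, n_units=5, beta=beta) for _ in range(2000)]

single = [task_score(tasks[0], u) for u in grid]           # jagged: at most 6 levels
average = average_curve(tasks, grid)                       # smooth population trend
law = [1.0 / (1.0 + math.exp(-beta * u)) for u in grid]
max_dev = max(abs(a - l) for a, l in zip(average, law))
```

A single task can only take a handful of score levels, while the average over thousands of tasks stays close to the smooth curve; shrinking the task count makes `max_dev` larger.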

For the complete theory, please refer to our paper.

Left: a toy capability graph — the frontier unlocks outward from a seed. Right: the score (the fraction of units unlocked) traces the log-sigmoid law \(x(u) = 1/(1 + e^{-\beta u})\), \(u = \ln(t/t_{\mathrm{mid}})\), \(\beta = 1.0\).

Left: many small task graphs, each running its own frontier expansion on a shared clock with its own midpoint and speed. Dot size is a unit's score weight, so a big dot lighting up makes that task's staircase take a big jump. Right: each thin staircase is one task's score, jagged because a finite task has only a few score units; the thick line is their average. The jumps cancel and the smooth log-sigmoid emerges at the population level: as the number of tasks increases, \(R^2_{\mathrm{avg}}\) climbs.

AI learns from environments roughly twice as fast every three months

To isolate environment learning from prior knowledge, we selected 18 tasks on which models start from similar initial performance. We then ran each model released between September 2025 and May 2026 for two hours on these tasks and used the performance improvement within that window as the learning-speed metric. The frontier trend shows that the speed at which AI learns from environments roughly doubles every three months.
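To put the doubling rate in concrete terms, the arithmetic below converts a three-month doubling time into a cumulative factor. It treats the stated trend as an exact exponential, which the page only claims approximately, and it counts September 2025 to May 2026 as eight months; `speedup` is an illustrative helper.

```python
def speedup(months_elapsed, doubling_months=3.0):
    """Cumulative learning-speed factor implied by a fixed doubling time."""
    return 2.0 ** (months_elapsed / doubling_months)


factor_over_window = speedup(8)    # September 2025 to May 2026, taken as 8 months
factor_per_year = speedup(12)      # four doublings in a year
```

Under these assumptions the frontier learning speed grows a little over sixfold across the measured window and sixteenfold per year.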

Inside a single 12-hour run

One GPT-5.5 run on the gravitational-wave task, traced submission by submission. Across 247 scored attempts the best-so-far score climbs from 42.8 to 67.0, with seven turning points where the agent reframes the problem rather than just tuning.

Figure: best-so-far score for one 12-hour GPT-5.5 run on the task "Reconstruct a gravitational-wave signal from LIGO strain", rising from 42.8 to 67.0. Legend: best learned behavior; best-so-far envelope.

A sparse but structured learning loop

Only 27 of 224 agent submissions improve the best-so-far score by at least 0.1 pp, but the useful updates reveal a diagnose-edit-evaluate loop. The agent makes the task measurable, decomposes failures, isolates a bottleneck, then keeps the working core while repairing what remains.

1. The agent first makes the problem measurable before making it better. The first valid submission turns an underspecified analysis into a scoreable pipeline, then early feedback drives stabilization and a +4.5 pp gain.

2. When direct repair stalls, the agent decomposes the failure into searchable subproblems. The agent splits waveform mismatch into reference anchoring, time-frequency localization, and detector alignment, producing seven meaningful updates and lifting the score to 52.3.

3. Identifying a main bottleneck lets the agent keep searching productively. Component feedback points to velocity and separation as the dominant gap, so the agent searches within source-mass calibration and creates the largest jump in the run.

4. After finding a stable solution, the agent keeps the core and repairs only the remaining errors. The final hours focus on targeted residual, phase, and narrow-band corrections, raising H1 waveform quality while the aggregate score reaches 67.0.
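The 0.1 pp threshold used above to separate useful updates from the rest can be sketched as follows. The function names and example scores are made up for illustration, and the choice not to count the first scored submission as an improvement is an assumption; the page does not specify it.

```python
from itertools import accumulate


def best_so_far(scores):
    """Running maximum of the submission scores (the best-so-far envelope)."""
    return list(accumulate(scores, max))


def count_meaningful_improvements(scores, min_gain=0.1):
    """Count submissions that raise the best-so-far score by at least min_gain."""
    count = 0
    best = None
    for score in scores:
        if best is None:
            best = score                 # first submission sets the baseline
        elif score - best >= min_gain:
            count += 1                   # a meaningful update
            best = score
        elif score > best:
            best = score                 # a tiny gain still moves the envelope
    return count


# Hypothetical scores in percentage points, not taken from the actual run.
example_scores = [42.8, 42.5, 43.0, 43.05, 47.3, 46.0, 47.3]
envelope = best_so_far(example_scores)
n_meaningful = count_meaningful_improvements(example_scores)
```

In this example only two of the seven submissions clear the threshold, mirroring how sparse the useful updates are in a real run.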
