EdgeBench

Unveiling scaling laws of learning from real-world environments.

An ultra-long-horizon benchmark built to measure learning from environments

Most benchmarks score what a model already knows. EdgeBench is built to measure something else. It asks how an agent learns from a real-world environment when it is given the time, the feedback, and the room to improve.

  • First benchmark to measure real-world environment learning

    Every workspace, feedback signal, and judge approximates real practice, so a high score reflects what an agent learns.

  • Ultra-long-horizon real-world tasks

    ≥12h

    Each task runs 12+ hours of continuous operation, long enough for experience to compound. Selected extended runs continue beyond 72 h.

  • Six diverse task families, mostly built from scratch

    134 tasks

    Tasks span science, software engineering, optimization, knowledge work, formal math, and games. Most are brand-new, built from zero.

  • Researcher reviewed, with expert effort tracked

    57.2h

    Domain experts review and iterate each task. Among tasks with recorded human effort, expert work averages 57.2 h and reaches 320 h in the largest cases.

Scientific Problems & ML 39 tasks

Each task uses real-world research data and experimental settings sourced from working scientists. Domain expertise is essential: agents must formulate hypotheses, choose models, validate against noisy observations, and refine iteratively. Many problems are open-ended, with no known optimal solution.

Gravitational-wave detection
3-D gravity inversion
Groundwater plume modeling
Solar power forecasting
Battery health forecasting
···+34 more
Systems & Software Engineering 36 tasks

Agents work on production-grade codebases where a single task may require thousands of lines of change, with over 100,000 lines in the largest cases. Because the code spans interdependent modules, an agent must reason about cross-module coupling while meeting both correctness and performance targets.

RISC-V CPU design
Matching-engine optimization
Regex engine repair
PocketBase development
TLS 1.3 implementation
···+31 more
Combinatorial Optimization 19 tasks

These are open-ended, predominantly NP-hard problems where exact methods are intractable and progress depends on designing, tuning, and iterating on heuristic search strategies. Even strong solvers have room to improve with additional time and feedback.

Vehicle routing
SAT / SMT solving
Molecular self-assembly
Job-shop scheduling
2-D irregular nesting
···+14 more
Professional Knowledge Work 19 tasks

These tasks reproduce real white-collar deliverables across finance, education, healthcare, and legal domains, matching work that would take a human professional with three or more years of experience roughly three full days to complete. Many tasks feature carefully designed rubrics and multi-round delivery feedback that approximate real client review cycles, so agents can learn from structured critique and revise iteratively.

CTA risk budgeting
Cross-border compliance
Claim-ring fraud audit
AIGC storyboarding
Brand annual planning
···+14 more
Formal Math & Theorem Proving 13 tasks

These tasks sit at the frontier of mathematical difficulty and require building large-scale machine-checked proofs in Lean, coupling deep mathematical insight with substantial formal-verification engineering. Most are newly created for EdgeBench and designed to support iterative progress: agents receive structured intermediate guidance and can extend partial proofs incrementally.

Fermat (regular case)
Sphere eversion
Combinatorial games
Erdős-Graham problem
Prime Number Theorem
···+8 more
Interactive Games & Simulators 8 tasks

These are real games designed for human players, where proficient humans typically invest tens of hours to master the mechanics. The state spaces are enormous and each run is procedurally distinct, so agents face strong out-of-distribution pressure. Agents must develop and refine strategies through high-frequency interaction across many episodes.

NetHack
Dungeon Crawl
Transport tycoon sim
Text adventures
Wesnoth
···+3 more

We're releasing an initial 51 of the 134 tasks, together with the full evaluation framework, so the community can study how agents learn from real-world environments.

  • 4/39 Scientific & ML
  • 13/36 Systems & SE
  • 14/19 Optimization
  • 4/19 Knowledge
  • 8/13 Formal
  • 8/8 Games

Why learning from environments matters

Why study agents' ability to learn from their environments? Real-world use of AI depends on more than what a model learned during training. Some of the knowledge it needs never appears in training data, such as private records and internal tools. Even when raw data exists, it omits the human process behind it: the trial and error, the interpretation of evidence, and the adaptation to feedback through which experts actually reach results. The real world also never stands still: human knowledge keeps advancing, and new tools, discoveries, and problems emerge that no fixed training corpus can anticipate. An agent's ability to learn from its environment and improve its task performance is therefore central to deploying AI systems at scale in the real world.

The environment-learning loop

  1. Attempt: Test a candidate solution in the live environment.
  2. Observe: Receive new information from the environment during each attempt.
  3. Absorb: Analyze environmental feedback and interaction history.
  4. Improve: Convert experience into better plans and artifacts.

Repeats across a 12-hour run
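The loop above can be sketched in a few lines of Python. This is a minimal illustration, not the EdgeBench harness: `run_learning_loop`, `toy_evaluate`, and `toy_propose` are hypothetical names, the environment is a toy scoring function, and a fixed attempt budget stands in for the 12-hour time limit.

```python
import random

def run_learning_loop(evaluate, propose, initial, budget):
    """Attempt -> Observe -> Absorb -> Improve, tracking the best-so-far score."""
    best, best_score = initial, evaluate(initial)
    history = [(initial, best_score)]        # interaction history read by the agent
    envelope = [best_score]                  # best-so-far score after each attempt
    for _ in range(budget):
        candidate = propose(best, history)   # Improve: turn experience into a new plan
        score = evaluate(candidate)          # Attempt + Observe: feedback from the environment
        history.append((candidate, score))   # Absorb: record what happened
        if score > best_score:
            best, best_score = candidate, score
        envelope.append(best_score)
    return best, envelope

def toy_evaluate(x):
    # Hypothetical environment: the score peaks at 100 when x hits a hidden target.
    return 100.0 - 100.0 * abs(x - 0.7)

rng = random.Random(0)

def toy_propose(best, history):
    # Hypothetical agent: perturb the current best candidate (history is unused here).
    return best + rng.uniform(-0.1, 0.1)

best, envelope = run_learning_loop(toy_evaluate, toy_propose, initial=0.0, budget=200)
print(f"start {envelope[0]:.1f}, end {envelope[-1]:.1f}")
```

A real agent would use `history` to decide what to try next; the toy proposer ignores it, so it only illustrates the control flow and the best-so-far bookkeeping.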

Agents continuously learn from environments and improve their performance

Across EdgeBench, agents do not simply submit once and stop. They interact with task environments, receive informative feedback, and learn from that feedback to improve performance. The representative curves below, drawn from all six task families, show how different models learn differently across tasks. Some improve gradually, some move quickly early and then plateau, and others remain flat until a late breakthrough. Each colored line is a model's best-so-far score over a 12-hour run, showing how agents turn environment feedback into better artifacts, strategies, and final outcomes.

38,000 hours of noisy curves, one clean law

For each model, EdgeBench records 402 learning curves from 134 tasks, spanning 12- to 72-hour interaction windows. We average those curves point by point in interaction time. The noisy task-specific trajectories collapse into a simple log-sigmoid, \(S(t) = S_{\max}/(1 + (t_{\mathrm{mid}}/t)^{\beta})\), with a mean \(R^2 = 0.998\). Here \(S_{\max}\) is the asymptotic score, \(t_{\mathrm{mid}}\) is the time at which the score reaches \(S_{\max}/2\), and \(\beta\) sets how steeply the curve rises in log-time.
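As an illustration of how such a fit could be done, the sketch below evaluates the log-sigmoid and recovers \(\beta\) and \(t_{\mathrm{mid}}\) by linear regression on the logit of the normalized score. It assumes \(S_{\max}\) is known and held fixed, and that all scores lie strictly between 0 and \(S_{\max}\); the fitting procedure EdgeBench actually uses is not described here, and the function names and sample values are made up.

```python
import math

def log_sigmoid(t, s_max, t_mid, beta):
    """S(t) = S_max / (1 + (t_mid / t)^beta), for t > 0."""
    return s_max / (1.0 + (t_mid / t) ** beta)

def fit_log_sigmoid(times, scores, s_max):
    """Least-squares fit of beta and t_mid with S_max held fixed (an assumption).

    Uses logit(S / S_max) = beta * (ln t - ln t_mid), which is linear in ln t.
    """
    xs = [math.log(t) for t in times]
    ys = [math.log(s / (s_max - s)) for s in scores]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    beta = sxy / sxx
    intercept = my - beta * mx               # equals -beta * ln(t_mid)
    t_mid = math.exp(-intercept / beta)
    return beta, t_mid

def r_squared(times, scores, s_max, t_mid, beta):
    """Coefficient of determination of the fitted curve against the scores."""
    mean = sum(scores) / len(scores)
    ss_res = sum((s - log_sigmoid(t, s_max, t_mid, beta)) ** 2
                 for t, s in zip(times, scores))
    ss_tot = sum((s - mean) ** 2 for s in scores)
    return 1.0 - ss_res / ss_tot

# Synthetic, noise-free data generated from known parameters (hours, score points).
times = [0.25, 0.5, 1, 2, 4, 8, 12]
scores = [log_sigmoid(t, 60.0, 3.0, 1.2) for t in times]
beta, t_mid = fit_log_sigmoid(times, scores, s_max=60.0)
r2 = r_squared(times, scores, 60.0, t_mid, beta)
```

On noisy data a nonlinear least-squares fit of all three parameters would usually be preferable; the logit trick is shown because it makes the role of \(\ln t\) explicit.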

Per-task curves

A theory of the log-sigmoid law

The fits are clean, but why should learning from an environment take this exact shape? Write \(x(t) = S(t)/S_{\max}\) for the normalized score so far. We show that the dynamics of \(x\) follow

\[ \frac{dx}{d\ln t} = \beta\,x(1-x), \]

which is enough to recover the log-sigmoid shape.
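A quick numerical check makes this concrete. The sketch below integrates the equation in the variable \(u = \ln(t/t_{\mathrm{mid}})\) with a classical fourth-order Runge-Kutta scheme and compares the result with the closed-form log-sigmoid. The initial condition \(x = 1/2\) at \(t = t_{\mathrm{mid}}\) follows from the fitted law; the parameter values are arbitrary.

```python
import math

def integrate_logistic(beta, u_end, steps, x0=0.5):
    """Integrate dx/du = beta * x * (1 - x) from u = 0 to u_end with classical RK4."""
    def f(x):
        return beta * x * (1.0 - x)

    h = u_end / steps
    x = x0
    for _ in range(steps):
        k1 = f(x)
        k2 = f(x + 0.5 * h * k1)
        k3 = f(x + 0.5 * h * k2)
        k4 = f(x + h * k3)
        x += h * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0
    return x

beta, t_mid, t = 1.5, 3.0, 12.0              # arbitrary illustrative values
u = math.log(t / t_mid)                      # log-time elapsed since the midpoint
numeric = integrate_logistic(beta, u, steps=1000)
closed_form = 1.0 / (1.0 + (t_mid / t) ** beta)
print(abs(numeric - closed_form))            # difference should be tiny
```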

Score is built from many small units. A task's total score adds up from many small wins, such as a fact learned or a test passed. Picture each win as a node in a hidden graph that is either unlocked or still locked, so that \(x(t)\) is the fraction of nodes unlocked so far. The key assumption is that unlocking a node makes its locked neighbors easier to reach, so learning spreads outward from what is already unlocked.

Progress moves as a frontier. Because a node opens only once a neighbor already has, new score appears at the boundary where unlocked nodes meet the locked ones. Learning is the frontier moving forward: from what the agent knows to what it doesn't.

Pace is proportional to \(x(1-x)\). Two factors set the frontier's speed: what is already unlocked \((x)\) powers the next unlock, and what is still locked \((1-x)\) is the room left to grow. In the many-unit limit, a mean-field argument shows that the process converges to \(\frac{dx}{du} = \beta\,x(1-x)\), where \(u\) is the rescaled time variable identified in the next step.

Self-similarity implies logarithmic time. The task graph is self-similar: each step up in difficulty exposes multiplicatively more structure. With steady effort, the difficulty reached grows like \(\ln t\), so the right time axis is \(u = \ln(t/t_{\mathrm{mid}})\), not the raw time \(t\).

Put together, they give the log-sigmoid. From above we have \(\frac{dx}{d\ln t} = \beta\,x(1-x)\), and solving it gives back exactly the fitted law:

\[ S(t) = \frac{S_{\max}}{1 + (t_{\mathrm{mid}}/t)^{\beta}}. \]
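For readers who want the intermediate steps, the solution follows by separation of variables. With \(u = \ln(t/t_{\mathrm{mid}})\) and the integration constant absorbed into \(t_{\mathrm{mid}}\) (so that \(x = 1/2\) at \(t = t_{\mathrm{mid}}\)):

\[
\begin{aligned}
\frac{dx}{x(1-x)} &= \beta\,du
&&\Longrightarrow\quad \left(\frac{1}{x} + \frac{1}{1-x}\right)dx = \beta\,du, \\
\ln\frac{x}{1-x} &= \beta u
&&\Longrightarrow\quad \frac{x}{1-x} = e^{\beta u} = \left(\frac{t}{t_{\mathrm{mid}}}\right)^{\beta}, \\
x(t) &= \frac{1}{1 + (t_{\mathrm{mid}}/t)^{\beta}}
&&\Longrightarrow\quad S(t) = S_{\max}\,x(t) = \frac{S_{\max}}{1 + (t_{\mathrm{mid}}/t)^{\beta}}.
\end{aligned}
\]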

The law emerges from many-task averaging. Any single task is jagged, consisting of long plateaus and sudden jumps due to the finite-size task graph. However, the model describes the limiting trend. Averaging over many tasks cancels the individual task noise, so the log-sigmoid emerges as a population-level trend.
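The averaging effect can be illustrated with a toy simulation. The sketch below does not implement the graph-frontier mechanism. It simply assumes that each task has five units whose unlock times are drawn independently from a log-logistic distribution, chosen so that the population mean equals the log-sigmoid. Each simulated task is a staircase, yet the point-by-point average over 400 tasks stays close to the smooth law. All names and parameter values are illustrative.

```python
import random

def sample_unlock_time(rng, t_mid, beta):
    """Inverse-CDF sample from F(t) = 1 / (1 + (t_mid / t)^beta)."""
    p = rng.random()
    return t_mid * (p / (1.0 - p)) ** (1.0 / beta)

def task_curve(unlock_times, grid):
    """Jagged single-task curve: fraction of units unlocked by each time in grid."""
    k = len(unlock_times)
    return [sum(u <= t for u in unlock_times) / k for t in grid]

def mean_curve(curves):
    """Point-by-point average of curves sampled on the same time grid."""
    return [sum(col) / len(col) for col in zip(*curves)]

rng = random.Random(0)
t_mid, beta = 3.0, 1.2                       # assumed values, in hours
grid = [0.25 * i for i in range(1, 49)]      # 0.25 h to 12 h
curves = [
    task_curve([sample_unlock_time(rng, t_mid, beta) for _ in range(5)], grid)
    for _ in range(400)
]
avg = mean_curve(curves)
law = [1.0 / (1.0 + (t_mid / t) ** beta) for t in grid]
max_gap = max(abs(a - b) for a, b in zip(avg, law))
print(f"largest gap between averaged curve and law: {max_gap:.3f}")
```

Any single entry of `curves` moves in steps of 0.2, which mirrors the plateaus and jumps described above; only the average is smooth.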

For the complete theory, please refer to our paper.

Figure: Left, a toy capability graph in which the frontier unlocks outward from a seed. Right, the score (the fraction of units unlocked) traces the log-sigmoid law \(x(u) = 1/(1 + e^{-\beta u})\), \(u = \ln(t/t_{\mathrm{mid}})\), \(\beta = 1.0\).

AI learns from environments roughly twice as fast every three months

To isolate environment learning from prior knowledge, we selected 18 tasks on which models start from similar initial performance. We then ran models released between September 2025 and May 2026 on these tasks for two hours each, and used the performance improvement within that window as the learning-speed metric. The frontier trend shows that the speed at which AI learns from environments roughly doubles every three months.
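To make the doubling claim concrete, a doubling time can be estimated from (release date, learning speed) pairs by a least-squares line through the logarithm of the speed. The sketch below uses made-up speeds generated to double exactly every three months; it is not EdgeBench data, and `doubling_time` is a hypothetical helper.

```python
import math

def doubling_time(months, speeds):
    """Fit ln(speed) = a + b * month by least squares; doubling time is ln 2 / b."""
    ys = [math.log(s) for s in speeds]
    n = len(months)
    mx, my = sum(months) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(months, ys))
    sxx = sum((x - mx) ** 2 for x in months)
    return math.log(2.0) / (sxy / sxx)

# Hypothetical frontier speeds, constructed to double every 3 months.
months = [0, 2, 4, 6, 8]                     # months since September 2025
speeds = [1.5 * 2 ** (m / 3) for m in months]
t_double = doubling_time(months, speeds)
growth = 2 ** (8 / t_double)                 # implied gain over the 8-month span
print(f"doubling time {t_double:.2f} months, growth {growth:.2f}x")
```

A three-month doubling time sustained from September 2025 to May 2026 (eight months) compounds to a factor of \(2^{8/3} \approx 6.3\).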

Inside a single 12-hour run

One GPT-5.5 run on the gravitational-wave task, traced submission by submission. Across 247 scored attempts the best-so-far score climbs from 42.8 to 67.0, with seven turning points where the agent reframes the problem rather than just tuning.

Figure: Performance of GPT-5.5 over a 12-hour run on the task "Reconstruct a gravitational-wave signal from LIGO strain", with the score rising from 42.8 to 67.0. Legend: best learned behavior; best-so-far envelope.

A sparse but structured learning loop

Only 27 of 224 agent submissions improve the best-so-far score by at least 0.1 pp, but the useful updates reveal a diagnose-edit-evaluate loop. The agent makes the task measurable, decomposes failures, isolates a bottleneck, then keeps the working core while repairing what remains.

  1. The agent first makes the problem measurable before making it better. The first valid submission turns an underspecified analysis into a scoreable pipeline, then early feedback drives stabilization and a +4.5 pp gain.
  2. When direct repair stalls, the agent decomposes the failure into searchable subproblems. The agent splits waveform mismatch into reference anchoring, time-frequency localization, and detector alignment, producing seven meaningful updates and lifting the score to 52.3.
  3. Identifying a main bottleneck lets the agent keep searching productively. Component feedback points to velocity and separation as the dominant gap, so the agent searches within source-mass calibration and creates the largest jump in the run.
  4. After finding a stable solution, the agent keeps the core and repairs only the remaining errors. The final hours focus on targeted residual, phase, and narrow-band corrections, raising H1 waveform quality while the aggregate score reaches 67.0.
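The sparsity statistic quoted for this run (submissions that raise the best-so-far score by at least 0.1 pp) can be computed from a raw score log as sketched below. The score list is hypothetical, and treating the first submission as the baseline rather than as an improvement is an assumption.

```python
from itertools import accumulate

def best_so_far(scores):
    """Running maximum: the best-so-far envelope of a submission log."""
    return list(accumulate(scores, max))

def count_improvements(scores, min_gain=0.1):
    """Count submissions that beat the previous best by at least min_gain points."""
    count, best = 0, None
    for s in scores:
        if best is None:
            best = s                 # first submission sets the baseline
        elif s - best >= min_gain:
            count += 1
            best = s
        else:
            best = max(best, s)      # tiny gains move the envelope but are not counted
    return count

scores = [42.8, 41.0, 43.0, 43.05, 47.3, 46.9, 52.3]   # hypothetical log
envelope = best_so_far(scores)       # [42.8, 42.8, 43.0, 43.05, 47.3, 47.3, 52.3]
improving = count_improvements(scores)                   # 3
```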

Benchmark leaderboard across 134 day-long tasks

To evaluate on the full task set, please contact [email protected].
