EdgeBench
Unveiling scaling laws of learning from real-world environments.
An ultra-long-horizon benchmark built to measure learning from environments
Most benchmarks score what a model already knows. EdgeBench is built to measure something else. It asks how an agent learns from a real-world environment when it is given the time, the feedback, and the room to improve.
-
First benchmark to measure real-world environment learning
Every workspace, feedback signal, and judge approximates real practice, so a high score reflects what an agent learns.
-
Ultra-long-horizon real-world tasks
≥12h: Each task runs 12+ hours of continuous operation, long enough for experience to compound. Selected extended runs continue beyond 72 h.
-
Six diverse task families, mostly built from scratch
134 tasks: Tasks span science, software engineering, optimization, knowledge work, formal math, and games. Most are brand-new, built from zero.
-
Researcher reviewed, with expert effort tracked
57.2h: Domain experts review and iterate each task. Among tasks with recorded human effort, expert work averages 57.2 h and reaches 320 h in the largest cases.
Each task uses real-world research data and experimental settings sourced from working scientists. Domain expertise is essential: agents must formulate hypotheses, choose models, validate against noisy observations, and refine iteratively. Many problems are open-ended, with no known optimal solution.
Feedback: experimental errors, constraint violations, physical-consistency checks, etc.
Gravitational-wave detection
3-D gravity inversion
Groundwater plume modeling
Solar power forecasting
Battery health forecasting
Agents work on production-grade codebases where a single task may require thousands of lines of change, with over 100,000 lines in the largest cases. Because the code spans interdependent modules, an agent must reason about cross-module coupling while meeting both correctness and performance targets.
Feedback: build logs, test failures, profiler output, etc.
RISC-V CPU design
Matching-engine optimization
Regex engine repair
PocketBase development
TLS 1.3 implementation
These are open-ended, predominantly NP-hard problems where exact methods are intractable and progress depends on designing, tuning, and iterating on heuristic search strategies. Even strong solvers have room to improve with additional time and feedback.
Feedback: feasibility checks, objective values, simulator traces, etc.
Vehicle routing
SAT / SMT solving
Molecular self-assembly
Job-shop scheduling
2-D irregular nesting
These tasks reproduce real white-collar deliverables across finance, education, healthcare, and legal domains, matching work that would take a human professional with three or more years of experience roughly three full days to complete. Many tasks feature carefully designed rubrics and multi-round delivery feedback that approximate real client review cycles, so agents can learn from structured critique and revise iteratively.
Feedback: revision notes, test results, format checks, etc.
CTA risk budgeting
Cross-border compliance
Claim-ring fraud audit
AIGC storyboarding
Brand annual planning
These tasks sit at the frontier of mathematical difficulty and require building large-scale machine-checked proofs in Lean, coupling deep mathematical insight with substantial formal-verification engineering. Most are newly created for EdgeBench and designed to support iterative progress: agents receive structured intermediate guidance and can extend partial proofs incrementally.
Feedback: proof states, compiler errors, tactic failures, etc.
Fermat (regular case)
Sphere eversion
Combinatorial games
Erdős-Graham problem
Prime Number Theorem
These are real games designed for human players, where proficient humans typically invest tens of hours to master the mechanics. The state spaces are enormous and each run is procedurally distinct, so agents face strong out-of-distribution pressure. Agents must develop and refine strategies through high-frequency interaction across many episodes.
Feedback: player-visible observations, event logs, episode scores, etc.
NetHack
Dungeon Crawl
Transport tycoon sim
Text adventures
Wesnoth
We're releasing an initial 51 of the 134 tasks, together with the full evaluation framework, so the community can study how agents learn from real-world environments.
- 4/39 Scientific & ML
- 13/36 Systems & SE
- 14/19 Optimization
- 4/19 Knowledge
- 8/13 Formal
- 8/8 Games
Why learning from environments matters
Why study agents' ability to learn from their environments? Real-world use of AI depends on more than what a model learned during training. Some needed knowledge never appears in training data, such as private records and internal tools. Even when raw data exists, it omits the human process behind it: the trial and error, the interpretation of evidence, and the adaptation to feedback through which experts actually reach results. The real world also never stands still: human knowledge keeps advancing, and new tools, discoveries, and problems continually emerge that no fixed training corpus can anticipate. An agent's ability to learn from its environment and improve task performance is therefore central to deploying AI systems at scale in the real world.
The environment-learning loop
- 1 Attempt: Test a candidate solution in the live environment.
- 2 Observe: Receive new information from the environment during each attempt.
- 3 Absorb: Analyze environmental feedback and interaction history.
- 4 Improve: Convert experience into better plans and artifacts.
Repeats across a 12-hour run
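The four steps above can be sketched as a small loop. This is a toy illustration only: `ToyEnvironment`, `ToyAgent`, and `run_loop` are hypothetical names invented here, and the "task" (guessing a hidden number from too-high/too-low feedback) stands in for a real EdgeBench environment, whose interface is not described on this page.

```python
from dataclasses import dataclass, field


@dataclass
class ToyEnvironment:
    """Hypothetical environment: hides a target and reports how a guess compares."""
    target: float

    def evaluate(self, candidate: float) -> dict:
        error = candidate - self.target
        # Score is higher (closer to 0) when the candidate is nearer the target.
        return {"score": -abs(error), "hint": "too_high" if error > 0 else "too_low"}


@dataclass
class ToyAgent:
    """Hypothetical agent that narrows a search interval using feedback."""
    low: float = 0.0
    high: float = 100.0
    history: list = field(default_factory=list)

    def propose(self) -> float:
        # Improve: the next plan is built from everything absorbed so far.
        return (self.low + self.high) / 2.0

    def absorb(self, candidate: float, feedback: dict) -> None:
        # Absorb: record the interaction and update what the agent believes.
        self.history.append((candidate, feedback["score"]))
        if feedback["hint"] == "too_high":
            self.high = candidate
        else:
            self.low = candidate


def run_loop(env: ToyEnvironment, agent: ToyAgent, max_attempts: int) -> float:
    """Run attempt -> observe -> absorb -> improve and return the best score seen."""
    best = float("-inf")
    for _ in range(max_attempts):
        candidate = agent.propose()         # Attempt
        feedback = env.evaluate(candidate)  # Observe
        agent.absorb(candidate, feedback)   # Absorb (Improve happens in propose)
        best = max(best, feedback["score"])
    return best
```

In a real run the attempt budget would be wall-clock time rather than a fixed count; a count is used here only to keep the sketch deterministic.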
Agents continuously learn from environments and improve their performance
Across EdgeBench, agents do not simply submit once and stop. They interact with task environments, receive informative feedback, and learn from that feedback to improve performance. The representative curves below, drawn from all six capability families, show how different models learn differently across tasks. Some improve gradually, some move quickly early and then plateau, and others remain flat until a late breakthrough. Each colored line is a model's best-so-far score over a 12-hour run, showing how agents continually turn environment feedback into better artifacts, strategies, and final outcomes.
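A best-so-far curve is the running maximum of the per-submission scores. A minimal sketch, assuming scores arrive as a plain list in submission order (the function name and the sample numbers are made up for illustration):

```python
from itertools import accumulate


def best_so_far(scores):
    """Return the running maximum of a sequence of submission scores."""
    return list(accumulate(scores, max))


# Illustrative scores only; a regression never lowers the curve.
curve = best_so_far([10.0, 8.5, 12.0, 11.0, 15.5])
print(curve)  # [10.0, 10.0, 12.0, 12.0, 15.5]
```

Because the curve never decreases, a long flat stretch means no submission beat the previous best, not that the agent stopped working.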
38,000 hours of noisy curves, one clean law
For each model, EdgeBench records 402 learning curves from 134 tasks, spanning 12- to 72-hour interaction windows. We average those curves point by point in interaction time. The noisy task-specific trajectories collapse into a simple log-sigmoid, \(S(t) = S_{\max}/(1 + (t_{\mathrm{mid}}/t)^{\beta})\), with high precision and mean \(R^2 = 0.998\).
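To make the averaging and the fitted form concrete, here is a minimal sketch using only the standard library. It assumes every curve is sampled on the same time grid and that \(S_{\max}\) is already known, so that \(\beta\) and \(t_{\mathrm{mid}}\) can be recovered with a straight-line fit of \(\ln\!\big(S/(S_{\max}-S)\big)\) against \(\ln t\). This is a simplification chosen for illustration, not necessarily the fitting procedure behind the reported \(R^2\); all function names are hypothetical.

```python
import math


def log_sigmoid(t, s_max, t_mid, beta):
    """S(t) = S_max / (1 + (t_mid / t) ** beta)."""
    return s_max / (1.0 + (t_mid / t) ** beta)


def mean_curve(curves):
    """Average curves point by point; all curves must share one time grid."""
    return [sum(points) / len(points) for points in zip(*curves)]


def fit_beta_t_mid(times, scores, s_max):
    """Fit beta and t_mid by least squares on the linearized law.

    ln(S / (S_max - S)) = beta * ln(t) - beta * ln(t_mid),
    so the slope is beta and t_mid = exp(-intercept / slope).
    Assumes s_max is known and 0 < S < s_max at every point.
    """
    xs = [math.log(t) for t in times]
    ys = [math.log(s / (s_max - s)) for s in scores]
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, math.exp(-intercept / slope)
```

On noiseless synthetic data generated from `log_sigmoid`, the fit returns the generating \(\beta\) and \(t_{\mathrm{mid}}\) up to floating-point error; on real curves a nonlinear fit over all three parameters would usually be preferable.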
A theory of the log-sigmoid law
The fits are clean, but why should learning from an environment take this exact shape? Write \(x(t) = S(t)/S_{\max}\) for the normalized score so far. We show that the dynamics of \(x\) follow
\[ \frac{dx}{d\ln t} = \beta\,x(1-x), \]
which is enough to recover the log-sigmoid shape.
Score is built from many small units. A task's total adds up from many small wins — a fact learned, a test passed. Imagine each of them as a node in a hidden graph, either unlocked or still locked. So \(x(t)\) is the fraction of nodes that have been unlocked so far. The key assumption: unlocking a node makes its locked neighbors easier to reach, so learning spreads outward from there.
Progress moves as a frontier. Because a node opens only once a neighbor already has, new score appears at the boundary where unlocked nodes meet the locked ones. Learning is the frontier moving forward: from what the agent knows to what it doesn't.
Pace is proportional to \(x(1-x)\). Two forces set the frontier's speed: what's already unlocked \((x)\) powers the next; what's still locked \((1-x)\) is the room left to grow. In the many-unit limit, a mean-field argument shows that the process converges to \(\frac{dx}{du} = \beta\,x(1-x)\), where \(u\) is the progress variable that the next step identifies with logarithmic time.
Self-similarity implies logarithmic time. The task graph is self-similar: each step up in difficulty exposes multiplicatively more structure. With steady effort, the difficulty reached grows like \(\ln t\), so the right time axis is \(u = \ln(t/t_{\mathrm{mid}})\), not the raw time \(t\).
Put together, they give the log-sigmoid. From above we have \(\frac{dx}{d\ln t} = \beta\,x(1-x)\), and solving it gives back exactly the fitted law:
\[ S(t) = \frac{S_{\max}}{1 + (t_{\mathrm{mid}}/t)^{\beta}}. \]
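The solving step is a standard separation of variables, written out here for readers who want to check it. The only convention assumed is that \(t_{\mathrm{mid}}\) marks the time at which \(x = 1/2\), which is how it enters the fitted law. With \(u = \ln(t/t_{\mathrm{mid}})\),

\[ \frac{dx}{x(1-x)} = \beta\,du \quad\Longrightarrow\quad \ln\frac{x}{1-x} = \beta u + C. \]

Setting \(x = 1/2\) at \(u = 0\) gives \(C = 0\), so

\[ x(t) = \frac{1}{1 + e^{-\beta u}} = \frac{1}{1 + (t_{\mathrm{mid}}/t)^{\beta}}, \qquad S(t) = S_{\max}\,x(t). \]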
The law emerges from many-task averaging. Any single task is jagged, consisting of long plateaus and sudden jumps due to the finite-size task graph. However, the model describes the limiting trend. Averaging over many tasks cancels the individual task noise, so the log-sigmoid emerges as a population-level trend.
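The averaging effect can be illustrated with a deliberately simplified simulation. It does not implement the graph model above: it assumes each task consists of a few independent units whose unlock times are drawn so that the expected unlocked fraction is exactly the log-sigmoid. Single tasks then look like staircases, while the mean over many tasks approaches the smooth curve. All names and parameter values are hypothetical.

```python
import random


def sample_unlock_time(rng, t_mid, beta):
    """Draw a time T with P(T <= t) = 1 / (1 + (t_mid / t) ** beta)."""
    u = rng.random()
    u = min(max(u, 1e-12), 1.0 - 1e-12)  # keep the inverse CDF finite
    return t_mid * (u / (1.0 - u)) ** (1.0 / beta)


def task_curve(rng, times, n_units, t_mid, beta):
    """Fraction of units unlocked at each time for one task (a staircase)."""
    unlock_times = [sample_unlock_time(rng, t_mid, beta) for _ in range(n_units)]
    return [sum(u <= t for u in unlock_times) / n_units for t in times]


def averaged_curve(times, n_tasks, n_units, t_mid, beta, seed=0):
    """Point-by-point mean of many independent task staircases."""
    rng = random.Random(seed)
    curves = [task_curve(rng, times, n_units, t_mid, beta) for _ in range(n_tasks)]
    return [sum(column) / n_tasks for column in zip(*curves)]
```

With five units per task, any single curve can only take the values 0, 0.2, ..., 1.0, yet the average over a few hundred tasks sits close to 0.5 at \(t = t_{\mathrm{mid}}\), as the law predicts.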
For the complete theory, please refer to our paper.
AI learns from environments roughly twice as fast every three months
To isolate environment learning from prior knowledge, we selected 18 tasks where models start from similar initial performance. We then evaluated model releases from September 2025 to May 2026 for two hours, using the performance improvement within that window as the learning-speed metric. The frontier trend shows that AI learning speed from environments roughly doubles every three months.
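As a rough reading of that trend, assume the doubling time is exactly three months and constant. Writing \(v(m)\) for the learning-speed metric \(m\) months after the first release in the window (a symbol introduced here for illustration),

\[ v(m) = v_0 \cdot 2^{m/3}, \qquad \frac{v(8)}{v_0} = 2^{8/3} \approx 6.3. \]

Over the eight months from September 2025 to May 2026, an exact three-month doubling would therefore correspond to roughly a sixfold increase. This is an arithmetic consequence of the stated trend under that assumption, not a separately reported measurement.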
Inside a single 12-hour run
One GPT-5.5 run on the gravitational-wave task, traced submission by submission. Across 247 scored attempts the best-so-far score climbs from 42.8 to 67.0, with seven turning points where the agent reframes the problem rather than just tuning.
Reconstruct a gravitational-wave signal from LIGO strain.
A sparse but structured learning loop
Only 27 of 224 agent submissions improve the best-so-far score by at least 0.1 pp, but the useful updates reveal a diagnose-edit-evaluate loop. The agent makes the task measurable, decomposes failures, isolates a bottleneck, then keeps the working core while repairing what remains.
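Counting "useful" submissions in this sense is a small bookkeeping exercise. The sketch below assumes that the first scored submission only sets the baseline and that a submission counts when it beats the previous best by at least the threshold (0.1 pp here); the exact counting convention used for the figure above is not stated, and the function name and sample scores are made up.

```python
def count_improving(scores, min_gain=0.1):
    """Count submissions that raise the best-so-far score by at least min_gain."""
    best = None
    count = 0
    for score in scores:
        if best is None:
            best = score  # assumption: the first submission is the baseline
            continue
        if score - best >= min_gain:
            count += 1
        if score > best:
            best = score  # smaller gains still move the best-so-far score
    return count


# Illustrative scores: only 10.5 and 12.0 clear the 0.1 threshold.
print(count_improving([10.0, 10.05, 10.5, 10.4, 12.0]))  # 2
```

Gains that land exactly on the threshold are sensitive to floating-point rounding, so a production version might compare against scores rounded to the reporting precision.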
-
1. The agent first makes the problem measurable before making it better. The first valid submission turns an underspecified analysis into a scoreable pipeline, then early feedback drives stabilization and a +4.5 pp gain.
-
2. When direct repair stalls, the agent decomposes the failure into searchable subproblems. The agent splits waveform mismatch into reference anchoring, time-frequency localization, and detector alignment, producing seven meaningful updates and lifting the score to 52.3.
-
3. Identifying a main bottleneck lets the agent keep searching productively. Component feedback points to velocity and separation as the dominant gap, so the agent searches within source-mass calibration and creates the largest jump in the run.
-
4. After finding a stable solution, the agent keeps the core and repairs only the remaining errors. The final hours focus on targeted residual, phase, and narrow-band corrections, raising H1 waveform quality while the aggregate score reaches 67.0.
Benchmark leaderboard across 134 day-long tasks
To evaluate on the full task set, please contact [email protected].