Datacurve (@datacurve) / X

Datacurve

12 posts

Research and data to advance frontier models.

San Francisco

Joined February 2024

Pinned
Datacurve
@datacurve
May 30
Opus 4.8 is now on DeepSWE. On the default high thinking effort, it scores 6% higher than Opus 4.7 xhigh, while also lowering average cost per task.
Datacurve reposted
Artificial Analysis
@ArtificialAnlys
Jun 12
We've updated the Artificial Analysis Coding Agent Index, replacing SWE-Bench Pro with Datacurve's DeepSWE benchmark - the swap lifts Codex with GPT-5.5 (xhigh) above Claude Code with Opus 4.8 (max), while the newly released Claude Fable 5 (max) in Claude Code debuts at the top
Datacurve
@datacurve
May 30
Replying to @datacurve
Full deep dive coming soon. Check out the full benchmark here →
DeepSWE
From deepswe.datacurve.ai
Datacurve
@datacurve
May 30
Replying to @datacurve
Opus 4.8 delivers efficiency gains by solving tasks in fewer steps, directly reducing the total number of input tokens required per task.
Datacurve reposted
Matthew Berman
@MatthewBerman
May 27
DeepSWE reflects what I’m hearing from engineers better than any other benchmark. They took the hard path to build a good one.
Serena Ge (Datacurve)
@serenaa_ge
May 26
Today we’re releasing DeepSWE, a new standard for agentic coding benchmarks. On public leaderboards, top models often look relatively close in capability. DeepSWE shows where they actually diverge, reflecting the realistic experience of developers in their day-to-day work.
Datacurve reposted
Garry Tan
@garrytan
May 26
This is the new standard for engineering evals
Serena Ge (Datacurve)
@serenaa_ge
May 26
Today we’re releasing DeepSWE, a new standard for agentic coding benchmarks. On public leaderboards, top models often look relatively close in capability. DeepSWE shows where they actually diverge, reflecting the realistic experience of developers in their day-to-day work.
Datacurve reposted
Serena Ge (Datacurve)
@serenaa_ge
Apr 4, 2024
I presented today at Demo day Day 2 and @TechCrunch featured us @datacurve! Just been reading TC and listening to TC Daily Crunch since high school mornings... a surreal feeling to see us on it. Also, post-demo sadness cuz now YC is coming to an end