honeycomb-sh (Contributor) commented Aug 20, 2024

Hey SWE-bench team, thanks so much for the benchmark! The dockerized evals were really helpful for us. The Honeycomb team would like to submit results for test, verified, and lite:

dataset    solved   total   accuracy
test          506    2294     22.06%
verified      203     500      40.6%
lite          115     300     38.33%
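
For reference, a minimal sketch (illustrative only, not part of the SWE-bench harness or our submission tooling) showing how the accuracy column follows from the solved/total counts above:

```python
# Recompute the reported accuracy for each split from solved/total.
results = {
    "test": (506, 2294),
    "verified": (203, 500),
    "lite": (115, 300),
}

for dataset, (solved, total) in results.items():
    accuracy = 100 * solved / total
    print(f"{dataset:<9} {solved}/{total} = {accuracy:.2f}%")
```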

Only Claude 3.5 Sonnet and GPT-4o (both unfinetuned) were used in our runs. Please let us know if there's anything else you need. Thanks!

john-b-yang (Member) commented:

@honeycomb-sh This submission looks great, congrats on the strong results! I just verified that these are the same numbers I'm seeing. I have merged this PR and will update the leaderboard later today.

john-b-yang merged commit bfcd185 into SWE-bench:main on Sep 3, 2024
EwoutH commented Sep 3, 2024

Second on Lite and Verified, but more importantly first on the Full SWE-Bench. Congratulations!

john-b-yang added a commit that referenced this pull request Oct 15, 2024