honeycomb-sh (Contributor) commented Aug 20, 2024

Hey SWE-bench team, thanks so much for the benchmark! The dockerized evals were really helpful for us. The Honeycomb team would like to submit results for test, verified, and lite:

dataset    solved   total   accuracy
test          506    2294     22.06%
verified      203     500      40.6%
lite          115     300     38.33%
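
For reference, a minimal sketch (illustrative only, not part of the SWE-bench harness or our submission tooling) showing how the accuracy column follows from the solved/total counts above:

```python
# Recompute the reported accuracy for each split from solved/total.
results = {
    "test": (506, 2294),
    "verified": (203, 500),
    "lite": (115, 300),
}

for dataset, (solved, total) in results.items():
    accuracy = 100 * solved / total
    print(f"{dataset:<9} {solved}/{total} = {accuracy:.2f}%")
```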

Only Claude 3.5 Sonnet and GPT-4o (both unfinetuned) were used in our runs. Please let us know if there's anything else you need. Thanks!

john-b-yang (Member) commented:

@honeycomb-sh This submission looks great, congrats on the strong results! I just verified that these are the same numbers I'm seeing. I have merged this PR and will update the leaderboard later today.

john-b-yang merged commit bfcd185 into SWE-bench:main on Sep 3, 2024
EwoutH commented Sep 3, 2024

Second on Lite and Verified, but more importantly first on the Full SWE-Bench. Congratulations!

john-b-yang added a commit that referenced this pull request Oct 15, 2024