Submission for CosineAI Genie model #45
Conversation
Hey! Thanks for spending so much time on this eval, and particularly for dockerizing the entire thing, which has completely changed the game.

I believe I've done everything correctly for the submission; if I've missed anything, happy to provide it ASAP.

EDIT: Our scores as they stand are: […]

---

I believe that using SWE-bench data as training data and then using it for evaluation can cause data leakage issues. For instance, if you look at the golden patch for 'django__django-14730' and Genie's patch, the warning content is exactly the same.

golden patch: [snippet not captured]

Genie's patch: [snippet not captured]
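(For reference, the check the golden patch adds looks roughly like the following. This is reconstructed from memory of `django/db/models/fields/related.py`, not a verbatim quote of either patch; the check id `fields.W345` and the attribute names are assumptions.)

```python
# Rough reconstruction of the hunk the golden patch adds inside
# ManyToManyField._check_ignored_options; id and attribute names are
# from memory and may differ from the actual commit.
from django.core import checks

def _check_ignored_options(self, **kwargs):
    warnings = []
    # ... existing warnings elided ...
    if self.remote_field.symmetrical and self._related_name:
        warnings.append(
            checks.Warning(
                'related_name has no effect on ManyToManyField '
                'with a symmetrical relationship, e.g. to "self".',
                obj=self,
                id='fields.W345',
            )
        )
    return warnings
```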
---

We did check our dataset ahead of time for contamination and made sure that there wasn't any; in fact we didn't train on any Django code (happy to provide evidence of this if necessary). Of course, we can't do anything about the base model's memory of patches or of anything else in the SWE-bench set.

Edit: After some further investigation, the file the model edited contains plenty of examples of existing warning messages; the code directly above the hunk that the model inserted is a run of such warnings (sketched below). The model therefore saw how the previous warning messages were written and wrote the new one in the same style, particularly as the existing messages are all written in a very predictable and consistent format.
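For illustration (reconstructed from memory of the file rather than quoted verbatim; ids and exact wording are approximate), the warnings directly above the inserted hunk look roughly like:

```python
# Excerpt (reconstruction) from ManyToManyField._check_ignored_options in
# django/db/models/fields/related.py, directly above the model's hunk;
# note how formulaic and consistent the message style is.
if self.has_null_arg:
    warnings.append(
        checks.Warning(
            'null has no effect on ManyToManyField.',
            obj=self,
            id='fields.W340',
        )
    )
if self._validators:
    warnings.append(
        checks.Warning(
            'ManyToManyField does not support validators.',
            obj=self,
            id='fields.W341',
        )
    )
```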
---

So how would the model even know to attach an example, "self", at the end of this warning message? Such an example is not present in the sample warning messages provided.
---

There is an example: [snippet not captured]
---

Sorry for the misunderstanding. But I have carefully reviewed the instance `django__django-14730`.
---

I see what you mean: there are multiple (three in total) usages of the phraseology 'to "self"' in that file. And if we dig a little deeper into what these examples are referring to when they talk about 'to self', they are talking about relationships, which is incidentally what the model patch is referring to.

I think it's reasonable to assume that as the model is performing inference and writing the error message, the attention mechanism will obviously pay attention to existing error messages as examples; but also, given the links of the model patch to the concept of relationships, it makes sense that it would be paying attention to the areas of the file tied to relationships, and all of the usages of 'to self' are in such areas.

Now, it can't be said with certainty why the model said what it said, but I did store logprobs for every message the model returned during inference, so I've taken a look at them to try to get a better picture of what was said:

[logprob output not captured]

From these we can see the model's certainty of saying the token […]
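(Cosine hasn't described its inference stack; purely as a generic illustration of the kind of per-token logprob capture described above, an OpenAI-style chat API can record them like this. The model name and prompt are placeholders, not Cosine's actual setup.)

```python
# Generic sketch of storing and inspecting per-token logprobs at inference
# time with the OpenAI Python client; model and prompt are placeholders.
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o",  # placeholder model
    messages=[{"role": "user", "content": "Write the new check message."}],
    logprobs=True,
    top_logprobs=5,  # also keep the runner-up tokens at each position
)

for entry in resp.choices[0].logprobs.content:
    # Each entry carries the sampled token, its logprob, and the top
    # alternatives, which is what lets you ask how certain the model was
    # of emitting a particular token such as `self`.
    alts = ", ".join(f"{t.token!r}:{t.logprob:.2f}" for t in entry.top_logprobs)
    print(f"{entry.token!r}\t{entry.logprob:.2f}\t[{alts}]")
```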
---

Are there other strings in that file that resemble 'with a symmetrical relationship, e.g. to "self".'? Or in other files that the agent had seen at that point?
---

@ofirpress Yes there are, one example is here:

[snippet not captured]

You can see there are plenty of references to symmetrical relationships and calls to [identifier not captured].

As a final point, it's fairly clear to me that the base model does remember this code from its pretraining, something we unfortunately can't help. As an example, I took a chunk of the file from the base commit, from the beginning of the file until the first line of the hunk the model wrote, and used it as a prompt for the base model:

[base-model completion not captured]
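(That kind of probe is straightforward to reproduce; here is a minimal sketch, assuming a locally loadable base model. The model name and prefix file path are placeholders, since Genie's base model isn't public.)

```python
# Minimal memorization probe: greedily complete the file prefix and check
# whether the base model reproduces the golden patch's warning verbatim.
# The model name and the prefix file path are placeholders/assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "some-org/base-model"  # placeholder: Genie's base model is not public
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Contents of django/db/models/fields/related.py at the base commit, from
# the top of the file to the first line of the model's hunk.
prefix = open("related_py_prefix.txt").read()

inputs = tok(prefix, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=80, do_sample=False)
completion = tok.decode(out[0][inputs["input_ids"].shape[1]:])

# If the greedy continuation contains the exact warning string, the base
# model has very likely memorized this region of the file from pretraining.
print('symmetrical relationship, e.g. to "self"' in completion)
```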
---

My question was about whether there were any strings that had wording similar to the "with a symmetrical relationship" message; you pointed to a variable name, so my question still stands. You are right, though, with regard to data leakage into foundation models: this could be because of that.
---

Please see our new rules for submitting to SWE-bench: https://github.com/swe-bench/experiments?tab=readme-ov-file#reasoning-traces

If you want to be included in the leaderboard you should add reasoning trajectories to your submission. Thanks
---

Understood @ofirpress, I'll commit our trajectories in the morning. Once they're submitted, does this mean we'll have to wait until next Monday to be eligible for the leaderboard, or can the trajectories be checked as part of this Monday's submission, given that at the time we submitted the rules weren't as they are now?
---

To update: we're still figuring out as a company how we feel about sharing trajectories, particularly given that the model is closed-source and fine-tuned, so in the interest of keeping things tidy I'm going to close this PR until we make a decision on the subject.