Seeking Community Feedback: Adding a “Train Split Seen” Indicator to the Leaderboard #47
-
Hi everyone,
Having a “Train Split Seen” option would certainly be a workable solution and would improve the interpretability of the leaderboard. But rather than adding another column, wouldn't it be simpler to have “Zero-Shot” as a “model type”? Models that also use the training data of the evaluation datasets could still be called “pretrained”, while those that don't would be labeled “Zero-Shot”. This kind of distinction has precedent (for example, shortly after the release of TimeCopilot). In short:
Given the current leaderboard setup, this clarification is urgently needed, as the status quo creates a highly misleading advantage in favor of Moirai 2.0 and risks undermining the credibility of the benchmark. Thanks again for opening this up to community input!
-
Thanks for opening this @cuthalionn! I empathize with your dilemma here, and my feeling is that you're running into problems that are inevitable with a static benchmark. In the long (or, in this case, short) run, if a benchmark is static and somehow becomes the measure of progress, people will find creative ways to "beat" it. This is exactly what happened with the "ETT" benchmark, though, to be fair, that was a terrible benchmark in the first place. My imperfect suggestions would be:
For the first two, you can ask model submitters to self-report, but with reasoning. Maybe also include that reasoning somewhere in the public domain? On the topic of "zero-shot", I agree there's no way to ensure something is truly zero-shot, but it's also unclear what zero-shot means from a machine learning perspective. If we think of it as out-of-distribution generalization, the natural question to ask is: how far out?
-
I agree with most of the points above from @largraf, and also with @abdulfatir's observation about static benchmarks. I think questions like "how much out of distribution?" are too cumbersome to define, address, and maintain; that would be too high a burden for benchmark maintainers.
-
Thanks everyone for your input! After considering the discussion here, we’ll be following the community’s suggestions and introducing a new model type label: “Zero-Shot”. This label will only be assigned to models that have not been trained on the train split of the evaluation datasets.
By adding this label, we’re making sure the leaderboard transparently reflects these choices while giving submitters the flexibility to decide how they want to train their models. With the new option, users can filter and compare zero-shot models alongside others. We’ll update the leaderboard and documentation accordingly.
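For concreteness, here is a rough, non-final sketch of how the new label could surface for users working with a tabular export of the leaderboard; the column names and rows below are placeholders only, not the final schema.

```python
# Minimal sketch: filtering a hypothetical tabular export of the leaderboard
# by the new model type label. Column names and rows are placeholders only.
import pandas as pd

leaderboard = pd.DataFrame(
    [
        {"model": "model_a", "model_type": "pretrained", "train_split_seen": True},
        {"model": "model_b", "model_type": "zero-shot", "train_split_seen": False},
    ]
)

# Compare only models that were never trained on the evaluation datasets' train split.
zero_shot_only = leaderboard[leaderboard["model_type"] == "zero-shot"]
print(zero_shot_only)
```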
-
Hi everyone,
Following recent conversations in issue #46, we’d like to open a broader discussion on introducing a “Train Split Seen” indicator on the GIFT-Eval leaderboard, to make it easier to distinguish between models that have seen the train split of the evaluation datasets and those that haven’t.
Right now, the leaderboard only indicates whether a model has test data leakage. This means two models could both be marked “No” for leakage, yet one might have been trained on the train split of evaluation datasets and the other not.
The reason we relaxed the restrictions on using the train split is that it’s extremely difficult to ensure all models are truly zero-shot, i.e., entirely free from exposure to data highly similar to the test set. In practice, a model could curate or find data from other sources that is very close in distribution to the target datasets and still be considered zero-shot under a narrow definition. Drawing and enforcing that boundary quickly becomes subjective and open to disagreement. Our thinking has been that giving every model access to the same train split provides a consistent, shared baseline: no team is at a disadvantage simply because they didn’t, or couldn’t, find data similar to the target datasets, and the evaluation remains more grounded and comparable across submissions.
That said, we’ve heard from some community members that a clear “Zero-shot” distinction could improve transparency and fairness. The purpose of this discussion is to see whether the majority of the community shares this view and, if so, how we might introduce such an indicator in a way that is practical, transparent, and broadly supported.
We’d love to get direct input from everyone, especially active model contributors, on:
1. Would you like to see this as an additional column (e.g., “Train Split Seen: Yes/No”)?
2. If yes, should it be a binary flag or something more nuanced, like a percentage overlap (noting that this might require more effort from submitters)?
Please share your preferences on both of these points, and feel free to suggest alternatives or raise potential pitfalls. A rough, non-binding sketch of what the two reporting options could look like is included below.
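To make the two options more concrete, here is that sketch of how each could be self-reported by submitters; the field names, dataset names, and overlap definition are placeholders for discussion, not a proposed final format.

```python
# Rough sketch of the two self-reporting options; field names, dataset names,
# and the overlap definition are placeholders for discussion only.

# Option 1: a simple binary flag.
submission_binary = {"model": "example_model", "train_split_seen": True}


# Option 2: a more nuanced percentage, e.g. the share of evaluation datasets
# whose train split appeared anywhere in the model's training corpus.
def train_split_overlap(train_splits_used: set[str], eval_datasets: set[str]) -> float:
    """Return the fraction of evaluation datasets whose train split was used."""
    if not eval_datasets:
        return 0.0
    return len(train_splits_used & eval_datasets) / len(eval_datasets)


submission_percentage = {
    "model": "example_model",
    "train_split_overlap": train_split_overlap(
        train_splits_used={"m4_hourly"},
        eval_datasets={"m4_hourly", "electricity", "traffic"},
    ),
}
print(submission_percentage)  # {'model': 'example_model', 'train_split_overlap': 0.333...}
```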
We’ll aim to gather as many viewpoints as possible. The goal is for any change to reflect what we as a community believe will make the leaderboard more transparent, fair, and trusted.
Looking forward to hearing your thoughts!
— The GIFT-Eval Team