Seeking Community Feedback: Adding a “Train Split Seen” Indicator to the Leaderboard #47

cuthalionn · 2025-08-14T09:02:12Z

cuthalionn
Aug 14, 2025
Maintainer

Hi everyone,

Following recent conversations in this issue #46 , we’d like to open a broader discussion on introducing a “Train Split Seen” indicator on the GIFT-Eval leaderboard — to make it easier to distinguish between models that have seen the train split of evaluation datasets and those that haven’t.

Right now, the leaderboard only indicates whether a model has test data leakage. This means two models could both be marked “No” for leakage, yet one might have been trained on the train split of evaluation datasets and the other not.

The reason we lightened the restrictions on using the train split is that it’s extremely difficult to ensure all models are truly zero-shot — i.e., entirely free from exposure to data highly similar to the test set. In practice, a model could curate or find data from other sources that is very close in distribution to the target datasets, and still be considered zero-shot under a narrow definition. Drawing and enforcing that boundary can quickly become subjective and open to disagreement. Our thinking has been that giving every model access to the same train split would provide a consistent, shared baseline. This way, no team is at a disadvantage simply because they didn’t, or couldn’t, find data similar to the target datasets, and the evaluation remains more grounded and comparable across submissions.

That said, we’ve heard from some community members that a clear “Zero-shot” distinction could improve transparency and fairness. The purpose of this discussion is to see whether the majority of the community shares this view and, if so, how we might introduce such an indicator in a way that is practical, transparent, and broadly supported.

We’d love to get direct input from everyone — especially active model contributors — on:

Would you like to see this as an additional column (e.g., “Train Split Seen: Yes/No”)?
If yes, should it be a binary flag or something more nuanced like a percentage overlap (noting that this might require more effort from submitters)?

Please share your preferences on both of these points — and feel free to suggest alternatives or raise potential pitfalls.

We’ll aim to gather as many viewpoints as possible. The goal is for any change to reflect what we as a community believe will make the leaderboard more transparent, fair, and trusted.

Looking forward to hearing your thoughts!
— The GIFT-Eval Team

largraf · 2025-08-14T09:40:18Z

largraf
Aug 14, 2025

Hi everyone,
Thanks @cuthalionn for moving forward on this issue.

Having a Train Split Seen option would definitely be a possible solution, which would improve interpretability of the leaderboard.
In my opinion, a binary indicator should suffice, since it's hard to compare results with varying percentages anyway.

Besides having an additional column, wouldn't it be simpler to have “zero-shot” as a “model type”?
Zero-Shot forecasting was a central theme of the Gift-eval paper, and right now looking at the leaderboard it is almost impossible to see which models are zero-shot.
Adding a type “Zero-Shot” would allow viewers to select it as a model type and compare between zero-shot models.

Models which also use the training data of the evaluation datasets could still be called “pretrained” and those who don't would be changed to “Zero-Shot”.

This kind of distinction has precedent — for example, shortly after the release of TimeCopilot, a new "Agentic" model type was introduced to reflect its uniqueness. It seems natural to apply the same logic here.

In short:

I support either a new column or a model type — both would improve clarity.
A binary indicator is likely the most practical and effective.
A "Zero-Shot" model type would align well with the original goals of the benchmark and make comparisons easier.

Given the current leaderboard setup, this clarification is urgently needed, as it presents a highly misleading advantage in favor of Moirai 2.0 and risks undermining the credibility of the benchmark.

Thanks again for opening this up to community input!

0 replies

abdulfatir · 2025-08-15T08:57:29Z

abdulfatir
Aug 15, 2025

Thanks for opening this @cuthalionn!

I empathize with your dilemma here and my feeling is that you're running into problems that are inevitable with a static benchmark. In the long (or short, in this case) run, if a benchmark is static and somehow becomes the measure of progress, people will find creative ways to "beat" the benchmark. This is exactly what happened with the "ETT" bench but to be fair that was a terrible benchmark in the first place. My imperfect suggestions would be:

Add a column for "train-test data overlap". If possible, make it a percentage (of tasks) and not just binary yes/no.
Add a column for "potential test leakage". You already have this but again maybe make it a percentage and not just yes/no.
Still sort the benchmark normally (based on WQL/MASE) and readers would be free to take these columns into account when drawing their conclusions.

For the first 2, you can ask model submitters to self-report but with reasoning. Maybe also include the reasoning somewhere in the public domain?

On the topic of "zero-shot", I agree there's no way to ensure something is truly zero-shot but it's also unclear what zero-shot means from a machine learning perspective. If we think of this as out of distribution generalization, the natural question to ask would be: how far out?

0 replies

rajatsen91 · 2025-08-15T17:17:37Z

rajatsen91
Aug 15, 2025

I agree with most of the points above from @largraf and also the observation about static benchmark from @abdulfatir . I think it is too cumbersome to address, define and maintain questions like "how much out of distribution" -- it might be too high of a burden for benchmark maintainers.

I think the "test" data leakage column is definitely a right column to have and I still believe that the ranking only between test data "no leak" models make sense to me. Even in all other fields like NLP that is a bare minimum that model trainers adhere to. So I think I would support this ranking.
I also agree with @largraf that we should have a zero-shot mark on the model otherwise people can just train on the train set of gift eval and then we are basically judging a huge supervised train/val/test split kind of setting and not judging transfer among tasks. As for the definition of zero-shot may be its fine to go with the simplest one: for a task in gift eval, the model should not have been trained on all available training portions of that data source.

0 replies

cuthalionn · 2025-08-22T02:27:17Z

cuthalionn
Aug 22, 2025
Maintainer Author

Thanks everyone for your input!

After considering the discussion here, we’ll be following the community’s suggestions and introducing a new model type label: zero-shot.

This label will only be assigned to models that:

Do not leak test data, and
Do not train on the train split of GIFT-Eval datasets.

By adding this label, we’re making sure the leaderboard transparently reflects these choices while giving submitters the flexibility to decide how they want to train their models. With the new option, users can filter and compare zero-shot models alongside others.

We’ll update the leaderboard and documentation accordingly.
Thanks again for helping shape this change.

2 replies

MoradLaglil Sep 15, 2025

Hi,
Thank you for your effort in making GIFT-Eval more trustworthy.
I have a question about Moirai, which is listed as a zero-shot model on the leaderboard. In the GIFT-Eval paper, it is mentioned that Moirai’s pre-training datasets exhibit partial data leakage issues for GIFT-Eval. This is a bit confusing to me. Could you clarify which version of Moirai is labeled as zero-shot (Moirai 1.0.R, 1.1.R, or 2.0.R)?
Thanks,

cuthalionn Sep 16, 2025
Maintainer Author

Hi @MoradLaglil,

Undestandable confusion, I explained this before in this issue comment --> #46 (comment)

So Moirai small, base and large are a version of Moirai 1 trained on non-leaking datasets. Moirai 2 is not labeled as zero-shot following the discussion here because it uses the train split.
Hope this clarifies.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Seeking Community Feedback: Adding a “Train Split Seen” Indicator to the Leaderboard #47

Uh oh!

{{title}}

Uh oh!

Replies: 4 comments 2 replies

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Seeking Community Feedback: Adding a “Train Split Seen” Indicator to the Leaderboard #47

Uh oh!

cuthalionn Aug 14, 2025 Maintainer

Replies: 4 comments · 2 replies

Uh oh!

Uh oh!

largraf Aug 14, 2025

Uh oh!

Uh oh!

abdulfatir Aug 15, 2025

Uh oh!

rajatsen91 Aug 15, 2025

Uh oh!

cuthalionn Aug 22, 2025 Maintainer Author

Uh oh!

MoradLaglil Sep 15, 2025

Uh oh!

cuthalionn Sep 16, 2025 Maintainer Author

cuthalionn
Aug 14, 2025
Maintainer

Replies: 4 comments 2 replies

largraf
Aug 14, 2025

abdulfatir
Aug 15, 2025

rajatsen91
Aug 15, 2025

cuthalionn
Aug 22, 2025
Maintainer Author

cuthalionn Sep 16, 2025
Maintainer Author