Conversation

Contributor

@shchur shchur commented Aug 29, 2025

Issue #, if available:

Description of changes:

  • Refactor the analysis methods leaderboard() and pairwise_comparison() to use skill scores + win rates, and report confidence intervals based on bootstrapping (see the sketch below).
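
For reference, a minimal sketch of how a skill score (1 - geometric mean relative error vs. a baseline) and a win rate with bootstrap confidence intervals could be computed from per-task errors. The function name, the per-task error layout, and the baseline handling below are illustrative assumptions, not the actual fev implementation.

```python
import numpy as np
import pandas as pd


def skill_score_with_ci(
    model_errors: pd.Series,     # per-task errors of the candidate model
    baseline_errors: pd.Series,  # per-task errors of the baseline model (same task index)
    n_bootstrap: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> dict:
    """Skill score = 1 - geometric mean of (model error / baseline error),
    with a bootstrap CI obtained by resampling tasks. Illustrative only."""
    rel_error = (model_errors / baseline_errors).to_numpy()
    rng = np.random.default_rng(seed)

    def skill(errs: np.ndarray) -> float:
        return 1.0 - np.exp(np.mean(np.log(errs)))

    point = skill(rel_error)
    # Resample tasks with replacement to build the bootstrap distribution
    boot = np.array(
        [skill(rng.choice(rel_error, size=len(rel_error), replace=True)) for _ in range(n_bootstrap)]
    )
    lower, upper = np.quantile(boot, [alpha / 2, 1 - alpha / 2])
    # Win rate: fraction of tasks where the model has a lower error than the baseline
    win_rate = float(np.mean(rel_error < 1.0))
    return {"skill_score": point, "ci_lower": lower, "ci_upper": upper, "win_rate": win_rate}
```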

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@shchur shchur marked this pull request as draft August 29, 2025 14:48
@shchur shchur force-pushed the refactor-analysis-code branch from 5a0f38d to abc9578 Compare September 1, 2025 08:21
Copy link
Collaborator

@abdulfatir abdulfatir left a comment


Thanks @shchur! Overall looks okay to me. I left some comments and questions.

Computes skill score (1 - geometric mean relative error) and win rate with bootstrap confidence
intervals across all tasks. Models are ranked by skill score.
Missing results are handled in 2 ways:
Collaborator

@abdulfatir abdulfatir Sep 5, 2025


Maybe it would be helpful to specify what the default strategy is.

Contributor Author


I have refactored this logic to use the missing_strategy kwarg and updated the docstring accordingly.
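
As an aside, here is a hedged sketch of what a missing_strategy kwarg could look like for a (task x model) error table; the strategy names ("drop", "impute") and the imputation rule are hypothetical placeholders, not taken from the PR.

```python
import pandas as pd


def handle_missing(errors: pd.DataFrame, missing_strategy: str = "drop") -> pd.DataFrame:
    """Handle tasks where some models have no result.

    `errors` is a (task x model) table of errors. The strategy names used here
    are hypothetical placeholders, not the actual fev options.
    """
    if missing_strategy == "drop":
        # Keep only tasks for which every model has a result
        return errors.dropna(axis=0, how="any")
    if missing_strategy == "impute":
        # Fill a missing result with the worst (largest) error observed on that task
        return errors.apply(lambda row: row.fillna(row.max()), axis=1)
    raise ValueError(f"Unknown missing_strategy: {missing_strategy!r}")
```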

def pairwise_comparison(
summaries: SummaryType | list[SummaryType],
metric_column: str = "test_error",
task_columns: str | list[str] = "dataset_path",
Collaborator


Should this be "task_name"?

Contributor Author


Removed task_columns from the API completely; it now always uses all the default Task columns.

baseline_model: str | None = None,
) -> pd.DataFrame:
"""Compute the average score for each model for each task.
"""Convert summaries into a pivot table with index equal to `task_columns` and columns equal to model names.
Collaborator


Can we update the docstring and preferably also the function name to indicate that this also scales by the baseline scores?
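
To make the discussed behavior concrete, a minimal pandas sketch of the pivot-and-scale step: build a (task x model) pivot table and optionally divide each row by the baseline model's score. The helper name and the "model_name" column are assumptions; only the "test_error" metric column and the "dataset_path" task column come from the diff above.

```python
import pandas as pd


def pivot_and_scale_by_baseline(
    summaries: pd.DataFrame,
    metric_column: str = "test_error",
    task_column: str = "dataset_path",
    baseline_model: str | None = None,
) -> pd.DataFrame:
    """Pivot per-task scores into a (task x model) table and, if a baseline is
    given, divide each row by the baseline model's score on that task.
    Hypothetical helper illustrating the step discussed above, not the actual code."""
    scores = summaries.pivot_table(
        index=task_column, columns="model_name", values=metric_column
    )
    if baseline_model is not None:
        # Relative error: each model's score divided by the baseline's score on the same task
        scores = scores.div(scores[baseline_model], axis=0)
    return scores
```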

@shchur shchur changed the title from "[WIP] Refactor analysis code" to "Refactor analysis code" Sep 5, 2025
@shchur shchur marked this pull request as ready for review September 5, 2025 15:16
@shchur shchur merged commit 5f83dbc into pre-v1.0.0 Sep 5, 2025
@shchur shchur deleted the refactor-analysis-code branch September 5, 2025 15:16
shchur added a commit that referenced this pull request Sep 16, 2025