Refactor analysis code #36
Conversation
Force-pushed from 5a0f38d to abc9578
abdulfatir left a comment:
Thanks @shchur! Overall looks okay to me. I left some comments and questions.
src/fev/analysis.py
Outdated
```
Computes skill score (1 - geometric mean relative error) and win rate with bootstrap confidence
intervals across all tasks. Models are ranked by skill score.
Missing results are handled in 2 ways:
```
Maybe it would be helpful to specify what the default strategy is.
I have refactored this logic to use the missing_strategy kwarg and updated the docstring accordingly.
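For context, here is a minimal sketch of what a missing_strategy kwarg could do, assuming a long-format summaries DataFrame with model_name, task, and test_error columns. The column names and the two strategies shown are assumptions for illustration, not the actual fev implementation:

```python
import pandas as pd


def handle_missing(summaries: pd.DataFrame, missing_strategy: str = "impute") -> pd.DataFrame:
    """Illustrative sketch: pivot to a (task x model) error table and handle missing results."""
    pivot = summaries.pivot_table(index="task", columns="model_name", values="test_error")
    if missing_strategy == "drop":
        # Keep only tasks for which every model has a result.
        return pivot.dropna()
    elif missing_strategy == "impute":
        # Fill a model's missing result with the worst (highest) error observed on that task.
        return pivot.apply(lambda row: row.fillna(row.max()), axis=1)
    raise ValueError(f"Unknown missing_strategy: {missing_strategy!r}")
```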
src/fev/analysis.py
Outdated
```python
def pairwise_comparison(
    summaries: SummaryType | list[SummaryType],
    metric_column: str = "test_error",
    task_columns: str | list[str] = "dataset_path",
```
Should this be "task_name"?
Removed task_columns from the API completely; it now always uses all the default Task columns.
src/fev/analysis.py
Outdated
```diff
      baseline_model: str | None = None,
  ) -> pd.DataFrame:
-     """Compute the average score for each model for each task.
+     """Convert summaries into a pivot table with index equal to `task_columns` and columns equal to model names.
```
Can we update the docstring and preferably also the function name to indicate that this also scales by the baseline scores?
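For illustration, here is a rough sketch of the "pivot and scale by baseline" step being discussed. The column names (model_name, dataset_path, dataset_config) are assumed stand-ins for the actual default Task columns; this is not the fev implementation:

```python
import pandas as pd

# Placeholder for the default Task columns; the real list comes from fev's Task definition.
TASK_COLUMNS = ["dataset_path", "dataset_config"]


def pivot_and_scale(
    summaries: pd.DataFrame,
    metric_column: str = "test_error",
    baseline_model: str | None = None,
) -> pd.DataFrame:
    """Pivot summaries into a (task x model) table of errors, optionally divided by the baseline's errors."""
    pivot = summaries.pivot_table(index=TASK_COLUMNS, columns="model_name", values=metric_column)
    if baseline_model is not None:
        # Relative error: each model's error divided by the baseline model's error on the same task.
        pivot = pivot.div(pivot[baseline_model], axis=0)
    return pivot
```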
Issue #, if available:
Description of changes:
Refactored leaderboard() and pairwise_comparison() to use skill scores + win rates, and report the CIs based on bootstrap.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
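As a rough illustration of the skill-score / win-rate / bootstrap-CI summary described above (not the actual leaderboard() code; the win-rate definition and the 95% CI level are assumptions):

```python
import numpy as np
import pandas as pd


def skill_and_win_rate(rel_errors: pd.DataFrame, n_bootstrap: int = 1000, seed: int = 0):
    """Summarize a (task x model) table of relative errors (assumed strictly positive, no NaNs).

    Returns point estimates plus lower/upper bootstrap bounds for each model.
    """
    rng = np.random.default_rng(seed)

    def summarize(df: pd.DataFrame) -> pd.DataFrame:
        # Skill score = 1 - geometric mean of the relative errors across tasks.
        skill = 1.0 - np.exp(np.log(df).mean(axis=0))
        # Win rate (assumed definition): fraction of per-task pairwise comparisons a model wins.
        ranks = df.rank(axis=1, ascending=True)
        win_rate = (1.0 - (ranks - 1.0) / (df.shape[1] - 1)).mean(axis=0)
        return pd.DataFrame({"skill_score": skill, "win_rate": win_rate})

    point = summarize(rel_errors)
    # Bootstrap over tasks: resample rows with replacement and recompute the summary.
    boot = pd.concat(
        summarize(rel_errors.sample(frac=1.0, replace=True, random_state=int(rng.integers(2**31))))
        for _ in range(n_bootstrap)
    )
    lower = boot.groupby(level=0).quantile(0.025)
    upper = boot.groupby(level=0).quantile(0.975)
    return point, lower, upper
```

A leaderboard built on top of this would then sort the point estimates by skill_score in descending order, matching the "models are ranked by skill score" behavior quoted in the docstring above.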