Conversation

Contributor

@shchur shchur commented Aug 29, 2025

Issue #, if available:

Description of changes:

  • Refactor the analysis methods leaderboard() and pairwise_comparison() to use skill scores + win rates, and report confidence intervals based on bootstrapping (see the sketch below).
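
For reference, a minimal sketch of how a skill score (1 - geometric mean relative error vs. a baseline) and a win rate with bootstrap confidence intervals could be computed from per-task errors. The function name, the per-task error layout, and the baseline handling below are illustrative assumptions, not the actual fev implementation.

```python
import numpy as np
import pandas as pd


def skill_score_with_ci(
    model_errors: pd.Series,     # per-task errors of the candidate model
    baseline_errors: pd.Series,  # per-task errors of the baseline model (same task index)
    n_bootstrap: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> dict:
    """Skill score = 1 - geometric mean of (model error / baseline error),
    with a bootstrap CI obtained by resampling tasks. Illustrative only."""
    rel_error = (model_errors / baseline_errors).to_numpy()
    rng = np.random.default_rng(seed)

    def skill(errs: np.ndarray) -> float:
        return 1.0 - np.exp(np.mean(np.log(errs)))

    point = skill(rel_error)
    # Resample tasks with replacement to build the bootstrap distribution
    boot = np.array(
        [skill(rng.choice(rel_error, size=len(rel_error), replace=True)) for _ in range(n_bootstrap)]
    )
    lower, upper = np.quantile(boot, [alpha / 2, 1 - alpha / 2])
    # Win rate: fraction of tasks where the model has a lower error than the baseline
    win_rate = float(np.mean(rel_error < 1.0))
    return {"skill_score": point, "ci_lower": lower, "ci_upper": upper, "win_rate": win_rate}
```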

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@shchur shchur marked this pull request as draft August 29, 2025 14:48
@shchur shchur force-pushed the refactor-analysis-code branch from 5a0f38d to abc9578 Compare September 1, 2025 08:21
Copy link
Collaborator

@abdulfatir abdulfatir left a comment


Thanks @shchur! Overall looks okay to me. I left some comments and questions.

Computes skill score (1 - geometric mean relative error) and win rate with bootstrap confidence
intervals across all tasks. Models are ranked by skill score.
Missing results are handled in 2 ways:
Collaborator

@abdulfatir abdulfatir Sep 5, 2025


Maybe it would be helpful to specify what the default strategy is.

Contributor Author


I have refactored this logic to use the missing_strategy kwarg and updated the docstring accordingly.
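
As an aside, here is a hedged sketch of what a missing_strategy kwarg could look like for a (task x model) error table; the strategy names ("drop", "impute") and the imputation rule are hypothetical placeholders, not taken from the PR.

```python
import pandas as pd


def handle_missing(errors: pd.DataFrame, missing_strategy: str = "drop") -> pd.DataFrame:
    """Handle tasks where some models have no result.

    `errors` is a (task x model) table of errors. The strategy names used here
    are hypothetical placeholders, not the actual fev options.
    """
    if missing_strategy == "drop":
        # Keep only tasks for which every model has a result
        return errors.dropna(axis=0, how="any")
    if missing_strategy == "impute":
        # Fill a missing result with the worst (largest) error observed on that task
        return errors.apply(lambda row: row.fillna(row.max()), axis=1)
    raise ValueError(f"Unknown missing_strategy: {missing_strategy!r}")
```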

def pairwise_comparison(
summaries: SummaryType | list[SummaryType],
metric_column: str = "test_error",
task_columns: str | list[str] = "dataset_path",
Collaborator


Should this be "task_name"?

Contributor Author


Removed task_columns from the API completely; it now always uses all the default Task columns.

baseline_model: str | None = None,
) -> pd.DataFrame:
"""Compute the average score for each model for each task.
"""Convert summaries into a pivot table with index equal to `task_columns` and columns equal to model names.
Collaborator


Can we update the docstring and preferably also the function name to indicate that this also scales by the baseline scores?
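
To make the discussed behavior concrete, a minimal pandas sketch of the pivot-and-scale step: build a (task x model) pivot table and optionally divide each row by the baseline model's score. The helper name and the "model_name" column are assumptions; only the "test_error" metric column and the "dataset_path" task column come from the diff above.

```python
import pandas as pd


def pivot_and_scale_by_baseline(
    summaries: pd.DataFrame,
    metric_column: str = "test_error",
    task_column: str = "dataset_path",
    baseline_model: str | None = None,
) -> pd.DataFrame:
    """Pivot per-task scores into a (task x model) table and, if a baseline is
    given, divide each row by the baseline model's score on that task.
    Hypothetical helper illustrating the step discussed above, not the actual code."""
    scores = summaries.pivot_table(
        index=task_column, columns="model_name", values=metric_column
    )
    if baseline_model is not None:
        # Relative error: each model's score divided by the baseline's score on the same task
        scores = scores.div(scores[baseline_model], axis=0)
    return scores
```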

@shchur shchur changed the title from "[WIP] Refactor analysis code" to "Refactor analysis code" Sep 5, 2025
@shchur shchur marked this pull request as ready for review September 5, 2025 15:16
@shchur shchur merged commit 5f83dbc into pre-v1.0.0 Sep 5, 2025
@shchur shchur deleted the refactor-analysis-code branch September 5, 2025 15:16
shchur added a commit that referenced this pull request Sep 16, 2025