Remove `median_absolute_error` from `METRICS_WITHOUT_SAMPLE_WEIGHT` by lucyleeow · Pull Request #30787 · scikit-learn/scikit-learn

lucyleeow · 2025-02-08T10:34:05Z

What does this implement/fix? Explain your changes.

sample_weights was added to median_absolute_error in 0.24 but it was never removed from METRICS_WITHOUT_SAMPLE_WEIGHT. Removing it has highlighted several issues:

Redundancy in checking

I had to add check_consistent_length to median_absolute_error to get the same error message as other metrics when y_pred, y_true or sample_weights are not of the same length. - it looks like this check was added to most sample weight metrics in #9903 but not median_absolute_error.

However it is worth noting that there is redundancy is our checking.

_check_reg_targets* - checks y_pred, y_true are of consistent length, performs check_array , checks multioutput is acceptable, and various other reg related checks.
check_consistent_length - checks that y_pred, y_true and sample_weights are of the same length, used in most (all?) regression metrics
_check_sample_weight - checks sample_weights is the same length as y, performs check_array on sample_weights, various other checks.

If all 3 checks are done in a metrics, we are effectively checking the sample_weight is the correct length 3 times.

Quantile problems

median_absolute_error fails check_sample_weight_invariance in test_regression_sample_weight_invariance - I will put detailed description in review comments.

Classification data used to test regression metrics

test_multilabel_sample_weight_invariance fails with:

ValueError: Unweighted and weighted scores are unexpectedly almost equal (0.0) and (0.0) for median_absolute_error

This makes sense because we are passing multilabel classification data (0/1's) to MULTIOUTPUT_METRICS which are Regression metrics with "multioutput-continuous" format support (e.g., "mean_squared_error", "r2_score" etc). I am not sure why we would not use regression data for these metrics? The tests do pass for all the other regression metrics, but as abs(y_pred - y_true) would be either 1 or 0 for every sample, it is very likely that weighted and unweighted median_absolute_error would be the same value.

I think we should amend test_multilabel_sample_weight_invariance so multi-output regression data is passed to the MULTIOUTPUT_METRICS tests (any maybe even change the name of this various to make it clear that these are regression metrics).

Any other comments?

This is a draft as I don't think this PR should be merged without resolving underlying issues.
cc @glemaitre @ogrisel

github-actions · 2025-02-08T10:35:22Z

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

_{Generated for commit: 3c0fce6. Link to the linter CI: here}

lucyleeow · 2025-02-08T10:36:38Z

sklearn/metrics/tests/test_common.py

 def test_regression_sample_weight_invariance(name):
-    n_samples = 50
-    random_state = check_random_state(0)
+    n_samples = 51


This cannot be an even number due to #17370, inspired by @ogrisel last paragraph in #17370 (comment), even though I agree it's a temporary fix.

Since #29907 was merged, it should be possible to update median_absolute_error to handle even number of data points (with or without weights) as explained below.

lucyleeow · 2025-02-08T10:48:00Z

sklearn/metrics/tests/test_common.py

-    n_samples = 50
-    random_state = check_random_state(0)
+    n_samples = 51
+    random_state = check_random_state(1)


I don't quite understand why this test would be 'flaky' for median_absolute_error for the 'check that the weighted and unweighted scores are unequal' check in check_sample_weight_invariance. Both 0 and 2 fail here 🤷

To check that it was not our _weighted_percentile doing something wrong, I used numpys quantile with sample weights (which I trust, because I saw that PR and the scrutiny and all the tests):

import numpy as np from sklearn.metrics import median_absolute_error from sklearn.utils.validation import check_random_state n_samples = 51 random_state = check_random_state(0) # regression y_true = random_state.random_sample(size=(n_samples,)) y_pred = random_state.random_sample(size=(n_samples,)) rng = np.random.RandomState(0) sample_weight = rng.randint(1, 10, size=len(y_true)) no_weights = np.median(np.abs(y_pred - y_true), axis=0) # gives np.float64(0.26234528145390845) weights = np.quantile( np.abs(y_pred - y_true), 0.5, axis=0, method='inverted_cdf', weights=sample_weight ) # gives array(0.26234528)

I may be missing something, but I don't know why this would be likely to be the same with and without wights.

Note that changing max value in sample_weight = rng.randint(1, 10, size=len(y_true)) from 10 to a smaller value, 5 or 8 etc, was also very effective.

On even numbers of samples (taking weights into account), inverted_cdf yields an asymmetric result, while np.median (on repeated samples) yields a symmetric result (as it uses the "linear" interpolation definition of "in between data points" quantiles).

We need to use a symmetric version of _weighted_percentile which named _averaged_weighted_percentile which was merged yesterday as part of #29907 which is equivalent to the method="averaged_inverted_cdf" of np.quantile.

Sorry I was not clear, random state seed being 0 or 2 fail. But 1 passes.
The specific check that is failing is:

scikit-learn/sklearn/metrics/tests/test_common.py

Lines 1466 to 1475 in 2c2e970

# check that the weighted and unweighted scores are unequal

weighted_score = metric(y1, y2, sample_weight=sample_weight)

# use context manager to supply custom error message

with pytest.raises(AssertionError):

assert_allclose(unweighted_score, weighted_score)

raise ValueError(

"Unweighted and weighted scores are unexpectedly "

"almost equal (%s) and (%s) "

"for %s" % (unweighted_score, weighted_score, name)

Which is odd right? See above sanity check.

Also I agree that it would be better to use numpy implementation of weights but our min support version is not high enough. Do you think we should vendor the code?

Okay I looked into this more, and it seems that this is just not a good check for median_absolute_error (red line is median value, weighted 'median' is calculated with np.quantile(method="inverted_cdf")):

Code

fig, axes = plt.subplots(nrows=3, ncols=3, figsize=(12, 10)) # Different weight ranges for each row weight_ranges = [(None, "No Weights"), (np.arange(1, 10), "Weights (1-9)"), (np.arange(1, 5), "Weights (1-5)")] # Different random seeds for each column seeds = [0, 1, 2] # Loop over rows (weights) and columns (seeds) for row, (weight_range, weight_label) in enumerate(weight_ranges): for col, seed in enumerate(seeds): random_state = check_random_state(seed) y_true = random_state.random_sample(size=(51,)) y_pred = random_state.random_sample(size=(51,)) values = np.abs(y_pred - y_true) rng = np.random.RandomState(seed) sample_weight = rng.randint(weight_range.min(), weight_range.max() + 1, size=len(y_true)) if weight_range is not None else None # Plot histogram if sample_weight is None: axes[row, col].hist(values, bins=15, edgecolor="black", alpha=0.7, density=True) median_value = np.median(values) else: axes[row, col].hist(values, bins=15, weights=sample_weight, edgecolor="black", alpha=0.7, density=True) median_value = np.quantile(values, 0.5, method="inverted_cdf", weights=sample_weight) # Add median line axes[row, col].axvline(median_value, color="red", linestyle="dashed", linewidth=1, label="Median") # Set title axes[row, col].set_title(f"Seed {seed} - {weight_label}") # Adjust layout for readability plt.tight_layout() plt.show()

Weights don't change the distribution of abs(y_pred - y_true) enough to change the median value.

We can change the test to make this check (e.g., we could try adding only high or low weights to a subset of the values), but I am not 100% sure about changing checking this for all regression metrics. To be fair this test is just ensuring that weighted and un-weighted are different, so it should probably be okay?

cc @ogrisel

Also I agree that it would be better to use numpy implementation of weights but our min support version is not high enough. Do you think we should vendor the code?

If the complexity of vendoring is not to high, we should definitely benefit from it.

Considering we added _averaged_weighted_percentile recently in #29907 and also the open PR #29034 , I agree that vendoring is a good option... (as we are having to spend a lot of effort maintaining this function anyway)

OK I see. I think that I would define weights as np.arange(len(y_true)) and I think that it would good enough for this test. I indeed agree with your analysis @lucyleeow.

I'll check the subsequent changes in the test to see if np.arange could be an issue.

I would like to use _averaged_weighted_percentile directly to latter be able to add an array API compatible version.

At the time of writing numpy.quantile(..., method="averaged_inverted_cdf") does not support weights (yet) and is not part of the array API spec. So we need a version that supports both weights, the symmetric behavior with even numbers of data points, and possible array API support.

ogrisel · 2025-02-08T15:20:36Z

If all 3 checks are done in a metrics, we are effectively checking the sample_weight is the correct length 3 times.

This is a waste, but shape consistency checks are not very costly (compared to value dependent checks, e.g. checking that there are no negative weights or no nan/inf weights), so maybe we can live with it. Unless, removing those redundant checks can help simplify the code, in which case we should do it.

ogrisel · 2025-02-08T15:22:37Z

I think we should amend test_multilabel_sample_weight_invariance so multi-output regression data is passed to the MULTIOUTPUT_METRICS tests (any maybe even change the name of this various to make it clear that these are regression metrics)

test_multilabel_sample_weight_invariance implies classification to me. We should probably have two different functions, one for multilabel classification and another for multioutput regression. And maybe another one for multioutput multiclass classification (if needed, I am not sure we have any metric with an API that can handle this type of predictions).

lucyleeow · 2025-02-11T11:15:09Z

Thanks @ogrisel.

With regard to duplicate checks: #30787 (comment), thinking about it more, I think the problem is more:

Lack of consistency between metrics. Only 2 regression metrics _check_sample_weight but AFAICT other metrics that accept sample_weight would also benefit from this check?
It's hard to tell what is being checked where, and which check will ultimately raise an error that the user sees:
- It looks like the new array API supporting (regression) metrics use _check_reg_targets_with_floating_dtype, which does take sample_weight BUT it only matches sample_weights dtype, and does not do length consistency check or check_array because _check_reg_targets does not accept sample_weight
- check_consistent_length is generally called twice. The first time it is called within _check_reg_targets and raises an error when y_pred and y_true are inconsistent. The second time, it is passed y_pred, y_true and sample_weight but will only raise an error when sample_weight is incorrect, as it was already called inside _check_reg_targets to check y_pred and y_true.

What do you think of _check_reg_targets_with_floating_dtype also doing the sample_weight checks, if sample_weight is passed? Or at least we could reduce to 2 checks; _check_reg_targets* and _check_sample_weight?

WDYT?

ogrisel · 2025-02-11T13:08:25Z

It sounds like a good plan.

lucyleeow · 2025-02-12T22:55:12Z

sklearn/metrics/tests/test_common.py

    # regression
    y_true = random_state.random_sample(size=(n_samples,))
    y_pred = random_state.random_sample(size=(n_samples,))
+    sample_weight = np.arange(len(y_true))


Changed as suggested in #30787 (comment). This works great.

I've amended check_sample_weight_invariance to optionally take sample_weight as I didn't want to change the sample_weight for all tests that use check_sample_weight_invariance, WDYT @glemaitre ?

lucyleeow · 2025-05-30T05:16:32Z

sklearn/metrics/tests/test_common.py

        # check that the score is invariant under scaling of the weights by a
        # common factor
-        for scaling in [2, 0.3]:
+        scaling_values = [2] if name == "median_absolute_error" else [2, 0.3]


This is a problem due to numerical instability in cumulative sum (NOT fixed by stable_cumsum) in _weighted_percentile. It is rare for this problem to appear.

In this test, the final value in cumulative_sum was a small amount (within tolerance in stable_cumsum) higher than 'actual' value, this resulted in the adjusted_percentile_rank being a small amount higher than the 'actual' value:

adjusted_percentile_rank was 17.400000000000002 , when it should have been 17.4, which just happens to be a value in weight_cdf. Thus when we do searchsorted

scikit-learn/sklearn/utils/stats.py

Lines 98 to 99 in bff3d7d

xp.searchsorted(

weight_cdf[feature_idx, ...], adjusted_percentile_rank[feature_idx]

the index should be that of the 17.4 element in weight_cdf, but instead it is the next index. stable_cumsum does not fix this particular problem as it does not matter how close adjusted_percentile_rank value is to the true value, if the true value is itself present within weight_cdf, searchsorted will take the adjacent index.

Note that I checked using numpy.quantile (with "inverted_cdf`, which now supports weights) and got the same test failure).

For completeness, I will reference the recent discussion regarding use of stable_cumsum (#29431 (comment)) - it was decided to not be required in _weighted_percentile (cc @ogrisel).
Further numpy's quantile implementation simply uses cumsum.

Note that I checked using numpy.quantile (with "inverted_cdf`, which now supports weights) and got the same test failure).

Can you check the behavior of numpy.quantile with "averaged_inverted_cdf" and _averaged_weighted_percentile on this kind of edge case?

EDIT: I am not sure if the comment above was written before the switch from _weighted_percentile to _averaged_weighted_percentile or not.

Sorry I should have clarified that the failure persists with _averaged_weighted_percentile. But I agree that I also thought _averaged_weighted_percentile should fix the problem so I did some more digging.

_averaged_weighted_percentile errors with:

E Max absolute difference among violations: 0.00704464 E Max relative difference among violations: 0.0095152 E ACTUAL: array(0.733313) E DESIRED: array(0.740357)

_weighted_percentile errors with:

E Max absolute difference among violations: 0.01408929 E Max relative difference among violations: 0.01903039 E ACTUAL: array(0.726268) E DESIRED: array(0.740357)

So _averaged_weighted_percentile halves the error. And the reason is because:

with (+)array the cumulative sum has the instability problem (i.e. adjusted_percentile_rank is 17.400000000000002)

with -array the cumulative sum does not have the instability problem (i.e. adjusted_percentile_rank is 17.4) 🙃

If cumulative sum had the instability problem with both array and -array I think the problem would have been fixed.
numpy.quantile with weights only supports inverted_cdf (I double checked in dev docs https://numpy.org/devdocs/reference/generated/numpy.quantile.html and numpy PRs).

Just documenting a reminder to myself - potentially when we refactor _averaged_weighted_percentile to avoid sorting twice, we won't use -array, in which case scale=0.3 may pass.

(I've been wanting to do the refactoring, but it has not reached the top of my todo yet 😬 )

Note, refactoring does not avoid this error. In #31775 we still use the adjusted_percentile_rank and base the 'next higher' data point on this so cumulative sum errors still affect result.

ogrisel · 2025-06-11T09:41:24Z

pylatest_pip_openblas_pandas reached a timeout at 2h after having successfully executed 97% of the tests. The last test executed before timeout was in utils/tests/test_optimize.py which seems unrelated to this PR. All tests in metrics/tests/test_common.py passed.

We probably need to enable faulthandler on the CI to understand what's going on. Maybe we should always enable faulthander with a global timeout to 45 min or so (or a 30s timeout per test) before dumping the tracebacks.

ogrisel

The diff of the PR looks good to me, but we need to understand where the CI failure comes from. Maybe it's not related to that PR in particular?

If so, +1 for merge on my side.

lucyleeow · 2025-06-12T03:45:09Z

Gentle ping to @glemaitre.

I re-run the failed check and it passed, so seems to just be a timeout. I'll try and look into faulthandler if I have time.

glemaitre

LGTM as well.

…T` (scikit-learn#30787)

lucyleeow added 3 commits February 7, 2025 16:31

add len check

2e9796e

fix test list

890aa74

fix tests

5bfc5b6

lucyleeow marked this pull request as draft February 8, 2025 10:34

github-actions bot added the module:metrics label Feb 8, 2025

lucyleeow commented Feb 8, 2025

View reviewed changes

lint

967212c

lucyleeow added the No Changelog Needed label Feb 8, 2025

glemaitre self-requested a review February 12, 2025 09:55

use arange sample weight

92884bc

lucyleeow commented Feb 12, 2025

View reviewed changes

lucyleeow force-pushed the med_abs_er branch from b353a95 to 92884bc Compare February 13, 2025 00:47

lucyleeow mentioned this pull request Feb 14, 2025

MNT Use regression data for check_sample_weight_invariance test on multioutput regression metrics #30829

Merged

lucyleeow mentioned this pull request Feb 24, 2025

MNT Make sample_weight checking more consistent in regression metrics #30886

Merged

lucyleeow added 2 commits May 29, 2025 21:31

merge main

0b6e9b6

try av weighted percentile

acd8442

lucyleeow mentioned this pull request May 30, 2025

Add array API support to median_absolute_error #31406

Merged

amend test

0bfb464

lucyleeow marked this pull request as ready for review May 30, 2025 04:42

revert even sample size

20e97f4

lucyleeow commented May 30, 2025

View reviewed changes

add comment

4c7b87e

ogrisel removed the No Changelog Needed label Jun 2, 2025

lucyleeow added 2 commits June 4, 2025 10:52

merge main

c672285

amend comment

917930c

lucyleeow added 3 commits June 4, 2025 14:16

add whats new

c2cb13e

fix comment

0f54650

remove cruft

3c0fce6

ogrisel approved these changes Jun 11, 2025

View reviewed changes

lucyleeow added the Waiting for Second Reviewer First reviewer is done, need a second one! label Jun 12, 2025

glemaitre approved these changes Jun 12, 2025

View reviewed changes

glemaitre merged commit d03054b into scikit-learn:main Jun 12, 2025
40 checks passed

lucyleeow deleted the med_abs_er branch June 13, 2025 01:27

jeremiedbb pushed a commit to jeremiedbb/scikit-learn that referenced this pull request Jul 15, 2025

FIX Remove median_absolute_error from `METRICS_WITHOUT_SAMPLE_WEIGH…

3ccb677

…T` (scikit-learn#30787)

jeremiedbb mentioned this pull request Jul 15, 2025

Release 1.7.1 #31762

Merged

13 tasks

jeremiedbb mentioned this pull request Sep 3, 2025

Release 1.7.2 #32092

Merged

13 tasks

lucyleeow mentioned this pull request Sep 13, 2025

MNT Refactor _average_weighted_percentile to avoid double sort #31775

Merged

	# check that the weighted and unweighted scores are unequal
	weighted_score = metric(y1, y2, sample_weight=sample_weight)

	# use context manager to supply custom error message
	with pytest.raises(AssertionError):
	assert_allclose(unweighted_score, weighted_score)
	raise ValueError(
	"Unweighted and weighted scores are unexpectedly "
	"almost equal (%s) and (%s) "
	"for %s" % (unweighted_score, weighted_score, name)

	xp.searchsorted(
	weight_cdf[feature_idx, ...], adjusted_percentile_rank[feature_idx]

Uh oh!

Conversation

lucyleeow commented Feb 8, 2025

What does this implement/fix? Explain your changes.

Redundancy in checking

Quantile problems

Classification data used to test regression metrics

Any other comments?

Uh oh!

github-actions bot commented Feb 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✔️ Linting Passed

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

lucyleeow Feb 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

lucyleeow Feb 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ogrisel commented Feb 8, 2025

Uh oh!

ogrisel commented Feb 8, 2025

Uh oh!

lucyleeow commented Feb 11, 2025

Uh oh!

ogrisel commented Feb 11, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

lucyleeow May 30, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

lucyleeow Jun 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

lucyleeow Jun 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ogrisel commented Jun 11, 2025

Uh oh!

ogrisel left a comment

Choose a reason for hiding this comment

Uh oh!

lucyleeow commented Jun 12, 2025

Uh oh!

glemaitre left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

github-actions bot commented Feb 8, 2025 •

edited

Loading

lucyleeow Feb 12, 2025 •

edited

Loading

lucyleeow Feb 12, 2025 •

edited

Loading

lucyleeow May 30, 2025 •

edited

Loading

lucyleeow Jun 4, 2025 •

edited

Loading

lucyleeow Jun 19, 2025 •

edited

Loading