MNT Add option to raise when all sample weights are 0 in _check_sample_weight#32212

Merged
ogrisel merged 10 commits into scikit-learn:main from j-hendricks:weighted-percentile-nan-zero-weights
Jan 8, 2026

Conversation

@j-hendricks
Contributor

Reference Issues/PRs

Fixes #31032.

What does this implement/fix? Explain your changes.

Make _weighted_percentile return nan, and _check_sample_weight raise a ValueError, when all sample weights are 0.

Previously, _weighted_percentile would return the last element in the array, which was unexpected behavior and unintuitive to the user. To ensure this issue is caught further upstream,
_check_sample_weight now raises a ValueError when sample weights are all 0.

Additionally, the parameter allow_all_zero_weights was added to _check_sample_weight for flexibility in whether the ValueError is raised.

Any other comments?

Modified tests in utils/tests/test_stats.py and utils/tests/test_validation.py to check for these changes.
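The intended behavior can be sketched with a minimal stand-in. The function below is illustrative only: the name, signature, and messages approximate the scikit-learn helper described in this PR rather than reproduce it.

```python
import numpy as np

def check_sample_weight(sample_weight, n_samples, allow_all_zero_weights=False):
    # Illustrative stand-in for the validation described above, not the
    # actual scikit-learn _check_sample_weight implementation.
    sample_weight = np.asarray(sample_weight, dtype=np.float64)
    if sample_weight.shape != (n_samples,):
        raise ValueError(
            f"sample_weight.shape == {sample_weight.shape}, expected ({n_samples},)"
        )
    # The new behavior: reject all-zero weights unless explicitly allowed.
    if not allow_all_zero_weights and not np.any(sample_weight != 0):
        raise ValueError(
            "Sample weights must contain at least one non-zero number."
        )
    return sample_weight
```

With the default allow_all_zero_weights=False, an all-zero vector raises; passing allow_all_zero_weights=True restores the permissive behavior for callers that handle the case themselves.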

@github-actions

github-actions bot commented Sep 17, 2025

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 61605cc.

@j-hendricks j-hendricks marked this pull request as draft September 18, 2025 00:49
@j-hendricks
Contributor Author

#31775 needs to be merged before this is ready for review.

@j-hendricks j-hendricks changed the title return nan when all sample weights are 0 Return nan when all sample weights are 0 Sep 25, 2025
@j-hendricks j-hendricks changed the title Return nan when all sample weights are 0 MNT Return nan when all sample weights are 0 Sep 25, 2025
@j-hendricks j-hendricks marked this pull request as ready for review September 25, 2025 14:39
Member

@lucyleeow lucyleeow left a comment


Thanks for the PR. Small nits only.

Just a note, please make sure lines, even in rst files, are <88 characters in length

@j-hendricks
Contributor Author

@lucyleeow thanks for the recommendations! I agree with all of them and have updated the PR accordingly :)

@lucyleeow
Member

Just noticed that the default is to error, i.e. allow_all_zero_weights=False

In which case we have changed the behaviour everywhere that _check_sample_weight is used. We may want to update any common sample weight tests to include a check that the error is raised correctly.

We will need to add a whats new entry to advise of this change in behaviour. It will be many estimators and metrics affected. @ogrisel do we want to list all of them in maybe doc/whats_new/upcoming_changes/many-modules ?

@ogrisel
Member

ogrisel commented Sep 27, 2025

+1 for extending a common test to check for invalid-sample-weight error messages, and for a global changed-models changelog entry.

@j-hendricks
Contributor Author

I added a common test for all zero sample weights, but came across some edge cases that I need to investigate further.

If you run pytest sklearn/tests/test_common.py::test_check_all_zero_sample_weights_error, the following estimators fail:

  • NuSVC(): "ValueError: negative dimensions are not allowed"
  • Perceptron(max_iter=5): "AssertionError: Did not raise: [<class 'ValueError'>]"
  • SGDClassifier(max_iter=5): "AssertionError: Did not raise: [<class 'ValueError'>]"

I'm not sure what's going on with NuSVC and SGDClassifier, but I think Perceptron is failing because it doesn't validate sample weights despite supporting this parameter (hence no ValueError raised).
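The shape of such a common test can be sketched without scikit-learn. The dummy estimator below is purely illustrative: it validates sample_weight the way this PR describes so that the assertion pattern has something to exercise; it is not a real estimator.

```python
import numpy as np

class DummyEstimator:
    # Illustrative stand-in, not a scikit-learn class: it rejects all-zero
    # sample weights with the same kind of ValueError this PR adds.
    def fit(self, X, y, sample_weight=None):
        if sample_weight is not None:
            sample_weight = np.asarray(sample_weight, dtype=float)
            if not np.any(sample_weight != 0):
                raise ValueError(
                    "Sample weights must contain at least one non-zero number."
                )
        return self

# Common-test pattern: fitting with all-zero weights must raise ValueError.
X = np.zeros((4, 2))
y = np.array([0, 1, 0, 1])
try:
    DummyEstimator().fit(X, y, sample_weight=np.zeros(4))
    raised = False
except ValueError:
    raised = True
```

The estimators listed above fail exactly this kind of check: either a different error surfaces (NuSVC) or no error is raised at all (Perceptron, SGDClassifier).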

@j-hendricks j-hendricks force-pushed the weighted-percentile-nan-zero-weights branch from 1e80aba to a2cd6c0 Compare September 28, 2025 08:26
@j-hendricks
Contributor Author

I figured out the issues I raised yesterday:

  • NuSVC(): Was not included in the list of SVC estimators that raise a more informative ValueError for all 0 sample weights. Now added in sklearn/svm/src/libsvm/svm.cpp.
  • Perceptron(max_iter=5): _make_validation_split raises a more informative error only when early_stopping=True. I've adjusted the logic.
  • SGDClassifier(max_iter=5): same as the perceptron above.

I've solved each of these with some minor tweaks. Thanks for your patience while I figured this out!

@j-hendricks j-hendricks force-pushed the weighted-percentile-nan-zero-weights branch from d391eac to b104e63 Compare September 30, 2025 03:01
@j-hendricks
Contributor Author

@lucyleeow I think we're ready for final reviews before merging, but before we do that I'm going to add you as a co-author given you were the one who outlined the approach I implemented.

Member

@lucyleeow lucyleeow left a comment


Small comments but pretty much LGTM

Comment on lines +2 to +4
ineffective sampling during fitting. This change applies to all estimators that
support the parameter `sample_weight`. This change also affects metrics that validate
sample weights.
Member


Are there any metrics that support sample_weight but do not validate sample weights?

Contributor Author

@j-hendricks j-hendricks Nov 18, 2025


Short answer is yes, assuming your definition of "validate" means sample_weight goes through _check_sample_weight. The one example I found was r2_score in the metrics package. Although it does apply utils.validation.column_or_1d on sample_weight, it does not apply _check_sample_weight.

Given the above example, I think the statement "This change also affects metrics that validate sample weights" still applies.
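The failure mode for a metric that skips _check_sample_weight is easy to reproduce with plain NumPy: with all-zero weights the weighted average itself is undefined. This is only an illustration of the underlying problem, not r2_score's actual code path.

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0])
weights = np.zeros_like(y_true)

# np.average normalizes by the sum of the weights, which is zero here,
# so NumPy refuses to compute it.
try:
    np.average(y_true, weights=weights)
    failed = False
except ZeroDivisionError:
    failed = True
```

So even metrics that only pass sample_weight through column_or_1d will hit a confusing downstream error rather than the clear message from _check_sample_weight.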

@lucyleeow lucyleeow added the Waiting for Second Reviewer First reviewer is done, need a second one! label Oct 29, 2025
@lucyleeow
Member

@ogrisel may be interested to take a look?

@lucyleeow lucyleeow changed the title MNT Return nan when all sample weights are 0 MNT Add option to raise when all sample weights are 0 in _check_sample_weight Oct 31, 2025
@lucyleeow
Member

@j-hendricks just checking if you are still interested in working on this?

@j-hendricks
Contributor Author

@j-hendricks just checking if you are still interested in working on this?

@lucyleeow Yup! Working on it right now

@j-hendricks j-hendricks force-pushed the weighted-percentile-nan-zero-weights branch from b037bba to 212606e Compare November 18, 2025 14:12
Member

@lucyleeow lucyleeow left a comment


The CI failure is at check_classifiers_one_label_sample_weights for RandomForestClassifier(n_estimators=5)

The data looks like:

    X_train = rnd.uniform(size=(10, 10))
    X_test = rnd.uniform(size=(10, 10))
    y = np.arange(10) % 2
    sample_weight = y.copy()  # select a single class

With sample size of 10, and half of those being 0 sample weight, just by chance we have a case where for one tree, all subsampled samples have sample weight of 0. Thus we end up raising this new error we added, instead of the one we are checking for.

CI test output
 ../sklearn/ensemble/_forest.py:188: in _parallel_build_trees
    tree._fit(
        X          = array([[0.5488135 , 0.71518934, 0.60276335, 0.5448832 , 0.4236548 ,
        0.6458941 , 0.4375872 , 0.891773  , 0.9636...787, 0.7163272 , 0.2894061 ,
        0.18319136, 0.5865129 , 0.02010755, 0.82894003, 0.00469548]],
      dtype=float32)
        bootstrap  = True
        class_weight = None
        curr_sample_weight = array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

see: https://github.com/scikit-learn/scikit-learn/actions/runs/19469178351/job/55711625308?pr=32212#step:6:994
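How likely is that draw by chance? A quick back-of-the-envelope check, assuming a plain bootstrap of size 10 over the 10 samples, 5 of which have zero weight (this approximates, rather than reproduces, the forest's actual sampling):

```python
import numpy as np

# Probability that one bootstrap resample of size 10 picks only the
# 5 zero-weight samples: (5/10) ** 10 ≈ 0.001.
p_one_tree = (5 / 10) ** 10

# With n_estimators=5 independent trees, chance that at least one tree
# gets an all-zero-weight resample.
p_any_tree = 1 - (1 - p_one_tree) ** 5

# Monte Carlo sanity check of the single-tree probability.
rng = np.random.default_rng(0)
draws = rng.integers(0, 10, size=(200_000, 10))
# With y = np.arange(10) % 2 used as sample_weight, even indices have weight 0.
p_estimate = np.mean(np.all(draws % 2 == 0, axis=1))
```

Roughly half a percent per test run: rare enough to pass most of the time, but frequent enough to surface as a flaky CI failure across many runs.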

This test is run on all classifiers, so we should be careful when amending it, so as not to make the test suite too computationally expensive or affect tests on other estimators. We could:

  • reduce the number of 0 sample weights. This would also make the classes imbalanced, as we want the 0 sample weights to all correspond to one class, but as we are only testing the error message, this should be okay (?)
  • increase the sample size so all 0 sample weights is less likely

cc @ogrisel

Comment on lines +1497 to +1506
"""The following estimators have custom error messages:

NuSVC: Invalid input - all samples have zero or negative weights.

Perceptron: The sample weights for validation set are all zero, consider using a
different random state.

SGDClassifier: The sample weights for validation set are all zero, consider using a
different random state.
"""
Member


Let's use # instead.

Member


And let's specify that all others will output the "Sample weights must contain at least one non-zero number." message from _check_sample_weight.

Comment on lines +631 to +638
# Skip check that validation weights are not all zero when `early_stopping` is
# set to True as `_make_validation_split` will raise a more informative error.
sample_weight = _check_sample_weight(
    sample_weight,
    X,
    dtype=X.dtype,
    allow_all_zero_weights=self.early_stopping,
)
Member


For the second reviewer: I don't love this, but I can't think of a better way to force the more informative error message to be raised.

@ogrisel
Member

ogrisel commented Nov 27, 2025

reduce the number of 0 sample weights. This would also make imbalanced classes, as we want the 0 sample weights to all correspond to one class, but as we are only testing the error message, this should be okay (?)

We could do that. Or we could just add this particular common test to PER_ESTIMATOR_XFAIL_CHECKS for random forests for now. I am pretty sure that #31529 will fix this problem (just in case you want to review ;) )

Member

@ogrisel ogrisel left a comment


Besides the following and the above, LGTM!

@j-hendricks j-hendricks force-pushed the weighted-percentile-nan-zero-weights branch from 212606e to e32b3c5 Compare November 28, 2025 23:32
@adrinjalali adrinjalali added this to Labs Dec 11, 2025
@adrinjalali adrinjalali moved this to Todo in Labs Dec 11, 2025
@adrinjalali adrinjalali moved this from Todo to In progress in Labs Dec 11, 2025
@lucyleeow
Member

lucyleeow commented Dec 19, 2025

@ogrisel are you happy to merge this or do you want to wait for #31529 (FYI I reviewed that, with minor nits only)?

Member

@ogrisel ogrisel left a comment


I did another pass with the latest changes and this looks good to merge. No need to wait for the concurrent RF fix.

@ogrisel ogrisel enabled auto-merge (squash) January 8, 2026 16:38
@ogrisel ogrisel merged commit 66fbe2d into scikit-learn:main Jan 8, 2026
38 checks passed
@github-project-automation github-project-automation bot moved this from In progress to Done in Labs Jan 8, 2026
dschult pushed a commit to dschult/scikit-learn that referenced this pull request Jan 25, 2026

Labels

module:utils · Waiting for Second Reviewer (First reviewer is done, need a second one!)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

weighted_percentile should error/warn when all sample weights 0

5 participants