MNT Add option to raise when all sample weights are 0 in `_check_sample_weight` by j-hendricks · Pull Request #32212 · scikit-learn/scikit-learn

j-hendricks · 2025-09-17T17:31:33Z

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Make _weighted_percentile return nan and _check_sample_weight raise error when all sample weights are 0.

Previously, _weighted_percentile would return the last element in the array, which was unexpected and unintuitive for the user. To ensure this issue is caught further upstream, _check_sample_weight now raises a ValueError when sample weights are all 0.
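
As a rough illustration of the nan behaviour described above, here is a minimal sketch of a weighted percentile with an explicit guard for all-zero weights. The name `weighted_percentile_sketch` and the simple inverted-CDF lookup are assumptions made for this example; it is not scikit-learn's `_weighted_percentile` implementation.

```python
import numpy as np


def weighted_percentile_sketch(values, weights, percentile=50):
    """Hypothetical, simplified weighted percentile (inverted-CDF style)."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)

    # With no non-zero weight there is no meaningful percentile: return nan
    # instead of silently picking an arbitrary element.
    if not np.any(weights):
        return np.nan

    # Sort the values and accumulate the weights in the same order.
    order = np.argsort(values)
    sorted_values = values[order]
    weight_cdf = np.cumsum(weights[order])

    # First position where the cumulative weight reaches the target mass.
    target = percentile / 100 * weight_cdf[-1]
    idx = np.searchsorted(weight_cdf, target, side="left")
    idx = min(idx, len(sorted_values) - 1)
    return sorted_values[idx]


print(weighted_percentile_sketch([3.0, 1.0, 2.0], [1, 1, 1]))
print(weighted_percentile_sketch([3.0, 1.0, 2.0], [0, 0, 0]))
```

The guard is placed before any sorting so that the all-zero case is handled in one obvious place.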

Additionally, a parameter allow_all_zero_weights was added to _check_sample_weight so that callers can opt out of the ValueError.
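
The validation step can be sketched as follows. This is a standalone approximation, not the private scikit-learn helper: the function name `check_sample_weight_sketch` is hypothetical, the flag name `allow_all_zero_weights` is taken from the diff quoted later in this thread, and the error message is the one cited in the review comments.

```python
import numpy as np


def check_sample_weight_sketch(sample_weight, n_samples, allow_all_zero_weights=False):
    """Hypothetical stand-in for the validation described in this PR."""
    # No weights given: fall back to uniform weights.
    if sample_weight is None:
        return np.ones(n_samples, dtype=float)

    sample_weight = np.asarray(sample_weight, dtype=float)
    if sample_weight.shape != (n_samples,):
        raise ValueError(
            f"sample_weight.shape == {sample_weight.shape}, expected {(n_samples,)}"
        )

    # Raise by default when every weight is zero; callers can opt out.
    if not allow_all_zero_weights and not np.any(sample_weight):
        raise ValueError("Sample weights must contain at least one non-zero number.")
    return sample_weight


# Opting out returns the weights unchanged.
print(check_sample_weight_sketch([0, 0, 0], 3, allow_all_zero_weights=True))
```

Defaulting the flag to False means every existing caller starts raising, which is why the thread later discusses a common test and a changelog entry.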

Any other comments?

Modified tests in utils/tests/test_stats.py and utils/tests/test_validation.py to check for these changes.

github-actions · 2025-09-17T17:32:33Z

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 61605cc. Link to the linter CI: here

j-hendricks · 2025-09-18T00:50:27Z

#31775 to be merged before ready for review

lucyleeow

Thanks for the PR. Small nits only.

Just a note, please make sure lines, even in rst files, are <88 characters in length

sklearn/linear_model/_stochastic_gradient.py

sklearn/utils/tests/test_validation.py

doc/whats_new/upcoming_changes/sklearn.utils/32212.fix.rst

sklearn/utils/validation.py

j-hendricks · 2025-09-27T01:01:25Z

@lucyleeow thanks for the recommendations! I agree with all of them and have updated the PR accordingly :)

sklearn/linear_model/tests/test_sgd.py

lucyleeow · 2025-09-27T01:11:32Z

Just noticed that the default is to error, i.e. allow_all_zero_weights=False

In which case we have changed the behaviour everywhere that _check_sample_weight is used. We may want to update any common sample weight tests to include a check that the error is raised correctly.

We will need to add a what's new entry to advise of this change in behaviour. Many estimators and metrics will be affected. @ogrisel do we want to list all of them in maybe doc/whats_new/upcoming_changes/many-modules ?

ogrisel · 2025-09-27T04:51:56Z

+1 for extending a common test to check the error messages raised for invalid sample weights, and for a global-level changelog entry.

j-hendricks · 2025-09-27T22:19:01Z

I added a common test for all zero sample weights, but came across some edge cases that I need to investigate further.

If you run pytest sklearn/tests/test_common.py::test_check_all_zero_sample_weights_error, the following estimators fail:

NuSVC(): "ValueError: negative dimensions are not allowed"
Perceptron(max_iter=5): "AssertionError: Did not raise: [<class 'ValueError'>]"
SGDClassifier(max_iter=5): "AssertionError: Did not raise: [<class 'ValueError'>]"

I'm not sure what's going on with NuSVC and SGDClassifier, but I think Perceptron is failing because it doesn't validate sample weights despite supporting this parameter (hence no ValueError raised).

j-hendricks · 2025-09-28T20:25:10Z

I figured out the issues I raised yesterday:

NuSVC(): Was not included in the list of SVC estimators that raise a more informative ValueError for all 0 sample weights. Now added in sklearn/svm/src/libsvm/svm.cpp.
Perceptron(max_iter=5): _make_validation_split raises a more informative error only when early_stopping=True. I've adjusted the logic.
SGDClassifier(max_iter=5): same as the perceptron above.

I've solved each of these with some minor tweaks. Thanks for your patience while I figured this out!
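
For readers unfamiliar with common tests, the idea can be sketched as below. Everything here is hypothetical: `ToyWeightedMean` is a made-up estimator and `check_all_zero_weights_raise` is an illustrative helper, not the check added to sklearn/utils/estimator_checks.py. The sketch accepts any ValueError, since some estimators raise their own, more specific message.

```python
import numpy as np


class ToyWeightedMean:
    """Hypothetical estimator that predicts the weighted mean of y."""

    def fit(self, X, y, sample_weight=None):
        if sample_weight is None:
            w = np.ones(len(y))
        else:
            w = np.asarray(sample_weight, dtype=float)
        if not np.any(w):
            raise ValueError("Sample weights must contain at least one non-zero number.")
        self.mean_ = float(np.average(y, weights=w))
        return self


def check_all_zero_weights_raise(estimator, n_samples=20, seed=0):
    """Return True if fitting with all-zero weights raises a ValueError."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(size=(n_samples, 3))
    y = np.arange(n_samples) % 2
    try:
        estimator.fit(X, y, sample_weight=np.zeros(n_samples))
    except ValueError:
        return True
    return False


print(check_all_zero_weights_raise(ToyWeightedMean()))
```

An estimator that accepts sample_weight but never validates it would make this helper return False, which mirrors the "Did not raise" failures reported above.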

doc/whats_new/upcoming_changes/many-modules/32212.fix.rst

sklearn/utils/estimator_checks.py

doc/whats_new/upcoming_changes/many-modules/32212.fix.rst

sklearn/linear_model/_stochastic_gradient.py

j-hendricks · 2025-10-06T11:44:00Z

@lucyleeow I think we're ready for final reviews before merging, but before we do that I'm going to add you as a co-author given you were the one who outlined the approach I implemented.

sklearn/svm/tests/test_svm.py

lucyleeow

Small comments but pretty much LGTM

lucyleeow · 2025-10-29T05:03:25Z

doc/whats_new/upcoming_changes/many-modules/32212.fix.rst

+  ineffective sampling during fitting. This change applies to all estimators that
+  support the parameter `sample_weight`. This change also affects metrics that validate
+  sample weights.


Are there any metrics that support sample_weight but do not validate sample weights?

Short answer is yes, assuming your definition of "validate" means sample_weight goes through _check_sample_weight. The one example I found was r2_score in the metrics package. Although it does apply utils.validation.column_or_1d on sample_weight, it does not apply _check_sample_weight.

Given the above example, I think the statement "This change also affects metrics that validate sample weights" still applies.

sklearn/linear_model/_stochastic_gradient.py

sklearn/utils/estimator_checks.py

doc/whats_new/upcoming_changes/many-modules/32212.fix.rst

lucyleeow · 2025-10-29T05:12:02Z

@ogrisel may be interested to take a look?

lucyleeow · 2025-11-18T10:16:56Z

@j-hendricks just checking if you are still interested in working on this?

j-hendricks · 2025-11-18T12:11:18Z

@lucyleeow Yup! Working on it right now

lucyleeow

The CI failure is at check_classifiers_one_label_sample_weights for RandomForestClassifier(n_estimators=5)

The data looks like:

    X_train = rnd.uniform(size=(10, 10))
    X_test = rnd.uniform(size=(10, 10))
    y = np.arange(10) % 2
    sample_weight = y.copy()  # select a single class

With a sample size of 10 and half of the samples having a sample weight of 0, one tree's subsample happened, just by chance, to contain only samples with a weight of 0. Thus we end up raising the new error added in this PR instead of the one the test is checking for.

CI test output

 ../sklearn/ensemble/_forest.py:188: in _parallel_build_trees
    tree._fit(
        X          = array([[0.5488135 , 0.71518934, 0.60276335, 0.5448832 , 0.4236548 ,
        0.6458941 , 0.4375872 , 0.891773  , 0.9636...787, 0.7163272 , 0.2894061 ,
        0.18319136, 0.5865129 , 0.02010755, 0.82894003, 0.00469548]],
      dtype=float32)
        bootstrap  = True
        class_weight = None
        curr_sample_weight = array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

see: https://github.com/scikit-learn/scikit-learn/actions/runs/19469178351/job/55711625308?pr=32212#step:6:994

This test is run on all classifiers so we should be careful in amending - to not make the test suite too computationally expensive or affect tests on other estimators. We could:

reduce the number of 0 sample weights. This would also make imbalanced classes, as we want the 0 sample weights to all correspond to one class, but as we are only testing the error message, this should be okay (?)
increase the sample size so all 0 sample weights is less likely

cc @ogrisel
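
To get a feel for the two options, the chance of an all-zero subsample can be estimated with a back-of-the-envelope calculation. This sketch assumes each tree draws n_samples indices uniformly with replacement and that trees are independent; the function name is hypothetical. In the actual CI run the seed is fixed, so the failure is deterministic rather than random.

```python
def prob_all_zero_bootstrap(n_samples, n_zero, n_trees=1):
    """Probability that at least one tree sees only zero-weight samples.

    Assumes uniform sampling with replacement of n_samples indices per tree.
    """
    # Every one of the n_samples draws must hit a zero-weight sample.
    p_single_tree = (n_zero / n_samples) ** n_samples
    # Complement of "no tree gets an all-zero subsample".
    return 1 - (1 - p_single_tree) ** n_trees


# Current test data: 10 samples, 5 with zero weight, 5 trees.
print(prob_all_zero_bootstrap(10, 5, n_trees=5))
# Option 1: fewer zero weights. Option 2: larger sample size.
print(prob_all_zero_bootstrap(10, 2, n_trees=5))
print(prob_all_zero_bootstrap(100, 50, n_trees=5))
```

Under these assumptions, both reducing the share of zero weights and increasing the sample size shrink the probability quickly, because the per-tree term is raised to the power n_samples.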

lucyleeow · 2025-11-19T03:38:39Z

sklearn/utils/estimator_checks.py

+    """The following estimators have custom error messages:
+
+    NuSVC: Invalid input - all samples have zero or negative weights.
+
+    Perceptron: The sample weights for validation set are all zero, consider using a
+    different random state.
+
+    SGDClassifier: The sample weights for validation set are all zero, consider using a
+    different random state.
+    """


Let's use # instead.

And let's specify that all other estimators output the "Sample weights must contain at least one non-zero number." message from _check_sample_weight.

lucyleeow · 2025-11-19T03:48:33Z

sklearn/linear_model/_stochastic_gradient.py

+        # Skip check that validation weights are not all zero when `early_stopping` is
+        # set to True as `_make_validation_split` will raise a more informative error.
+        sample_weight = _check_sample_weight(
+            sample_weight,
+            X,
+            dtype=X.dtype,
+            allow_all_zero_weights=self.early_stopping,
+        )


For second reviewer: I don't love this but I can't think of a better way to force the more informative error message to be raised.
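
The control flow being discussed can be sketched in isolation. All function names below are hypothetical simplifications; only the flag name and the two error messages come from this thread. The point is that when early stopping is on, the generic check is skipped so that the validation split can raise its own message.

```python
import numpy as np

GENERIC_MSG = "Sample weights must contain at least one non-zero number."
VALIDATION_MSG = (
    "The sample weights for validation set are all zero, consider using a "
    "different random state."
)


def check_weights(sample_weight, allow_all_zero_weights=False):
    # Generic validation: raises unless the caller opts out.
    sample_weight = np.asarray(sample_weight, dtype=float)
    if not allow_all_zero_weights and not np.any(sample_weight):
        raise ValueError(GENERIC_MSG)
    return sample_weight


def make_validation_split(sample_weight, validation_fraction=0.5, seed=0):
    # Pick a random validation subset and require some non-zero weight in it.
    rng = np.random.default_rng(seed)
    n_samples = sample_weight.shape[0]
    n_val = max(1, int(validation_fraction * n_samples))
    val_idx = rng.permutation(n_samples)[:n_val]
    if not np.any(sample_weight[val_idx]):
        raise ValueError(VALIDATION_MSG)
    return val_idx


def fit_sketch(sample_weight, early_stopping):
    # Defer to the split's more specific error when early stopping is enabled.
    weights = check_weights(sample_weight, allow_all_zero_weights=early_stopping)
    if early_stopping:
        make_validation_split(weights)
    return weights
```

Calling `fit_sketch(np.zeros(4), early_stopping=False)` raises the generic message, while `early_stopping=True` raises the validation-set message instead.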

ogrisel · 2025-11-27T11:11:34Z

reduce the number of 0 sample weights. This would also make imbalanced classes, as we want the 0 sample weights to all correspond to one class, but as we are only testing the error message, this should be okay (?)

We could do that. Or we could just add this particular common test to PER_ESTIMATOR_XFAIL_CHECKS for random forests for now. I am pretty sure that #31529 will fix this problem (just in case you want to review ;)

ogrisel

Besides the following and the above, LGTM!

sklearn/utils/tests/test_stats.py

sklearn/utils/estimator_checks.py

lucyleeow · 2025-12-19T01:58:12Z

@ogrisel are you happy to merge this or do you want to wait for #31529 (FYI I reviewed that, with minor nits only)?

ogrisel

I did another pass with the latest changes and this looks good to merge. No need to wait for the concurrent RF fix.

…le_weight` (scikit-learn#32212) Co-authored-by: John Hendricks <[email protected]> Co-authored-by: Olivier Grisel <[email protected]>

github-actions bot added the module:utils label Sep 17, 2025

j-hendricks marked this pull request as draft September 18, 2025 00:49

j-hendricks changed the title ~~return nan when all sample weights are 0~~ Return nan when all sample weights are 0 Sep 25, 2025

j-hendricks changed the title ~~Return nan when all sample weights are 0~~ MNT Return nan when all sample weights are 0 Sep 25, 2025

j-hendricks marked this pull request as ready for review September 25, 2025 14:39

lucyleeow reviewed Sep 26, 2025

lucyleeow mentioned this pull request Sep 26, 2025

Error in metrics/estimators where all zero weights does not make sense #32277

Closed

j-hendricks force-pushed the weighted-percentile-nan-zero-weights branch from 10b0788 to eb35131 Compare September 27, 2025 00:08

lucyleeow reviewed Sep 27, 2025

j-hendricks force-pushed the weighted-percentile-nan-zero-weights branch from 1e80aba to a2cd6c0 Compare September 28, 2025 08:26

lucyleeow reviewed Sep 29, 2025

j-hendricks force-pushed the weighted-percentile-nan-zero-weights branch from d391eac to b104e63 Compare September 30, 2025 03:01

lucyleeow reviewed Oct 29, 2025

lucyleeow reviewed Oct 29, 2025

lucyleeow added the Waiting for Second Reviewer First reviewer is done, need a second one! label Oct 29, 2025

lucyleeow mentioned this pull request Oct 31, 2025

Made _weighted_percentile raise value error when weights are all zero. #31041

Closed

lucyleeow changed the title ~~MNT Return nan when all sample weights are 0~~ MNT Add option to raise when all sample weights are 0 in _check_sample_weight Oct 31, 2025

j-hendricks force-pushed the weighted-percentile-nan-zero-weights branch from b037bba to 212606e Compare November 18, 2025 14:12

lucyleeow reviewed Nov 19, 2025

ogrisel approved these changes Nov 27, 2025

jhendricks487 and others added 8 commits November 28, 2025 18:06

Added back ValueError and renamed new parameters

654e8fc

Implement common test (WIP - edge cases pending)

559ee24

Resolved failing edge cases

83f0890

Replace error message in test_svm.py

a72cf2f

Improved formatting

27ce9c2

Add co-author

b9b307e

Add what's new and additional comments

faca889

Convert string to comment

e32b3c5

j-hendricks force-pushed the weighted-percentile-nan-zero-weights branch from 212606e to e32b3c5 Compare November 28, 2025 23:32

Handle edge cases where sample weights may be all zero

61605cc

adrinjalali added this to Labs Dec 11, 2025

adrinjalali moved this to Todo in Labs Dec 11, 2025

adrinjalali assigned ogrisel Dec 11, 2025

adrinjalali moved this from Todo to In progress in Labs Dec 11, 2025

adrinjalali unassigned ogrisel Jan 6, 2026

Merge branch 'main' into weighted-percentile-nan-zero-weights

22ed510

ogrisel approved these changes Jan 8, 2026

ogrisel enabled auto-merge (squash) January 8, 2026 16:38

ogrisel mentioned this pull request Jan 8, 2026

FIX: Raise error for zero sample_weight classes in compute_class_weight #32941

Open

4 tasks

ogrisel merged commit 66fbe2d into scikit-learn:main Jan 8, 2026
38 checks passed

github-project-automation bot moved this from In progress to Done in Labs Jan 8, 2026

This was referenced Jan 15, 2026

[python-package] support all-0 sample weights in scikit-learn 1.9 lightgbm-org/LightGBM#7128

Merged

[ci] fail linting CI job if PowerShell ScriptAnalyzer failed lightgbm-org/LightGBM#7127

Merged

This was referenced Mar 3, 2026

SGDClassifier.partial_fit mutates the model when sample_weight is all zeros #33436

Open

FIX SGD partial_fit no-op on all-zero sample_weight #33449

Closed
