DOC update and improve the `sample_weight` entry in the glossary by ogrisel · Pull Request #30564 · scikit-learn/scikit-learn

ogrisel · 2024-12-31T17:00:27Z

As discussed in #29907 (comment).

Link to practical usage examples.
Removed some FIXMEs and TODOs.
Fixed the description of the min_samples in DBSCAN and contrasted it with the min_sample_leaf / min_weight_fraction_in_leaf parameters pair.
Refine the description of the interplay with the class_weight parameter.
Reference ongoing work and the tracking meta-issue.

github-actions · 2024-12-31T17:01:41Z

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

_{Generated for commit: 8c8d043. Link to the linter CI: here}

ogrisel · 2025-01-02T15:21:43Z

For information, for the examples of sample weight usage, we could be more specifice, for instance:

Survival Analysis: accelerated failure time modeling via a reduction to weighted linear regression
- AFT:
  - paper: https://www.sciencedirect.com/science/article/pii/S0047259X83710286?via%3Dihub
  - code: https://github.com/sebp/scikit-survival/
    blob/7c8bdb95fc5df4b24e19342d8021501cba3b0cf9/sksurv/linear_model/aft.py#L184-L185
- Hazardous' SurvivalBoost:
  - https://hal.science/hal-04617672
  - code: https://github.com/soda-inria/hazardous/blob/main/hazardous/_survival_boost.py#L444
Causal Inference: double Machine Learning
- paper: https://arxiv.org/abs/1712.04912
- code: https://github.com/py-why/EconML/blob/d995bf32f7867512d98461b544dba3e66b30a702/econml/dml/dml.py#L175-L181
Unfairness mitigation: exponentiated gradient reduction
- paper: http://proceedings.mlr.press/v80/agarwal18a.html
- code: https://github.com/fairlearn/fairlearn/blob/main/fairlearn/reductions/_exponentiated_gradient/_lagrangian.py#L222

But linking to the papers does not necessarily make it explicit that this can be implemented via scikit-learn's sample_weight and the code of those libraries is not always easy to understand without reading the papers first and furthermore I am not sure how to link to a source code snippet while making it easy to check that the link stays relevant over time.

virchan

Just wanted to offer a few alternative word choices for consideration.

doc/glossary.rst

lorentzenchr · 2025-01-03T17:25:05Z

Could you wait for my review before merging? Even if it takes 2 weeks?

lorentzenchr

Just a high-level review: This PR makes sample weights the longest entry in the glossary. I would rather prefer a concise entry.

ogrisel · 2025-01-08T09:04:31Z

Just a high-level review: This PR makes sample weights the longest entry in the glossary. I would rather prefer a concise entry.

I could move some of the details to a dedicated section of the user guide (maybe part of the estimator API) and keep a short version in the glossary with a link to the user guide for more details if people prefer.

lorentzenchr · 2025-01-10T17:43:53Z

FYI: In supervised learning, one of the main reasons for using weights, e.g. in insurance frequency models (Poisson loss), is to account for the different size/time/exposure/... of each observation. The assumptions are usually:

$Y=\frac{C}{w}$ is the ratio of interest, C is an amount (counts, EUR, kilogram, ...).
$Var[Y] \sim \frac{1}{w}$

In the end, the weights enter the loss. The idea is that the "correct" weights improve the efficiency of the estimation of the expected loss** (=empirical mean of per observation losses). However, to my knowledge, there is no direct connection between $Var[Y_i]$ and $Var[loss_i]$ , see Section 5.2.2 in https://arxiv.org/abs/2202.12780.

**expected loss = statistical risk

doc/glossary.rst

GaelVaroquaux

I would also mention reweighting for transfer learning, and define in one sentence what it means.

Also, I find that this paragraph is very useful, but either put it in a fold or in a section of the docs

StefanieSenger

Thanks @ogrisel. I have added a few comments.

doc/glossary.rst

Co-authored-by: Gael Varoquaux <[email protected]> Co-authored-by: Stefanie Senger <[email protected]>

ogrisel · 2025-02-24T13:55:53Z

Thanks for the reviews @StefanieSenger @GaelVaroquaux @lorentzenchr. I addressed all the specific feedback. Personally, I am fine with keeping all this info centralized into a largish glossary entry, but I can also split it into a dedicated section in the user guide (where?) and cross-reference if people prefer.

betatim

I think this is a good entry, even though it is quite long.

It would be nicer to have a short glossary entry and a dedicated user guide section. But I think we'd need to put in more work to make the user guide section. For example expanding on the third party paragraph from this PR with some links to examples of third-party estimators (I'm wondering if the text as it is, is too brief for people who aren't experts to understand it and too long for those who are experts/third-party implementers).

The final paragraph on KMeans and sampling is another one that I think we could expand in the user guide. At least I am not sure I fully understand what it is trying to tell me without an example.

However, maybe we should merge this now already and make an issue to improve the user-guide and reduce the glossary entry. Mostly because making the user guide entry would be quite a bit of additional work

adrinjalali · 2025-04-30T09:06:31Z

doc/glossary.rst

+        floats, so that sample weights are usually equivalent up to a constant
+        positive scaling factor.


equivalent to what? I don't understand this sentence.

Invariance to the scale of weights is actually not guaranteed for all estimators or hyperparameter choices. So let's be more generic.

Suggested change

floats, so that sample weights are usually equivalent up to a constant

positive scaling factor.

floats to express fractional relative importance of data points with respect

to one another others.

Do you mean to say this?

Suggested change

floats, so that sample weights are usually equivalent up to a constant

positive scaling factor.

floats, since for many estimators, only the relative values of weights matter,

not their absolute scale.

I did not want to explicitly state that because we are not yet sure to which extent this is true or not: we do haven't started testing this aspect systematically, but we know that this is not the case for many important ones such as LogisticRegression and Ridge. Both those estimators have a regularization parameter whose effect depends on the scale of the weights.

I think the suggestion above:

floats to express fractional relative importance of data points with respect to one another others.

is more easily understood. But if we decide to keep existing, I don't think we need the "up":

"so that sample weights are usually equivalent to a constant positive scaling factor."

Alternatively, what if we state the scale invariance does not hold in all cases? As from:

but we know that this is not the case for many important ones such as LogisticRegression and Ridge

we know it does not hold for some? If I read "usually equivalent", it would just make me want to ask more questions.

I pushed a rephrasing of the end of this paragraph to be more explicit. Let me know what you think.

doc/glossary.rst

adrinjalali · 2025-04-30T09:10:23Z

doc/glossary.rst

+        Third-party libraries can also use `sample_weight`-compatible
+        estimators as building blocks to reduce a specific statistical task
+        into a weighted regression or classification task. For instance sample
+        weights can be constructed to adjust a time-to-event model for
+        censoring in a predictive survival analysis setting. In causal
+        inference, it is possible to reduce a conditional average treatment
+        effect estimation task to a weighted regression task under some
+        assumptions. Sample weights can also be used to mitigate
+        fairness-related harms based on a given quantitative definition of
+        fairness.


This paragraph doesn't feel like it belongs here.

Personally, I find this paragraph very important: it grounds an abstract concept into practical application cases. Have those use cases in mind is helpful both for users and contributors.

I agree with @adrinjalali, that this paragraph doesn't operate in the same scope as the rest of this glossary entry. A glossary is meant to define terms clearly and serve as a quick lookup.
In the user guide, this highlighting a specific use case would be great in a toggle or visually marked in a different style. Here however, it breaks the format and usability of the glossary.

Point taken: this means we will need a dedicated section in the doc to speak about the semantics and the practical use of the sample weights in more details.

doc/glossary.rst

lucyleeow

Thanks for this, it's a great improvement! Not expert here, so can only make readability remarks and ask some questions.

One thought about including the cases where weights could be used is that are we confident our implementation of weights match the use cases? I really do not know the answer to this. In the stata docs on analytical weights they say:

Most commands that allow aweights handle them in this manner. That is, if you specify aweights,
they are
1. normalized to sum to N and then
2. inserted in the calculation formulas in the same way as fweights.

not sure if we necessarily need to normalize but I think this implies that these weights are scale invariant, which I don't think all our implementations are...?

Also, this may be a can of worms we do not want to open, but do we want to make any general remarks about handling of negative weights? I think we do not dis-allow it. Apparently it is useful in high energy physics #12464

(Also +0.5 to expanding more in the user guide and trying to cut the glossary section down, but admit this would be a bit of work. Happy for this to go in and iterate later, as it is already a good improvement from current)

lucyleeow · 2025-09-10T10:23:48Z

doc/glossary.rst

-
-        In classification, sample weights can also be specified as a function
-        of class with the :term:`class_weight` estimator :term:`parameter`.
+        A relative weight for each sample. Intuitively, if all weights are


To me 'relative weight' means scale invariant. If scale invariance is not guaranteed as we discuss below, maybe we shouldn't use the term 'relative weight'?

Indeed, weight scale invariance is not guaranteed everywhere: we do not test for this property uniformly at the moment, and we know several examples where this test would fail.

doc/glossary.rst

lucyleeow · 2025-09-10T10:27:57Z

doc/glossary.rst

+        floats, so that sample weights are usually equivalent up to a constant
+        positive scaling factor.


I think the suggestion above:

floats to express fractional relative importance of data points with respect to one another others.

is more easily understood. But if we decide to keep existing, I don't think we need the "up":

"so that sample weights are usually equivalent to a constant positive scaling factor."

Alternatively, what if we state the scale invariance does not hold in all cases? As from:

but we know that this is not the case for many important ones such as LogisticRegression and Ridge

we know it does not hold for some? If I read "usually equivalent", it would just make me want to ask more questions.

lucyleeow · 2025-09-11T03:33:04Z

doc/glossary.rst

+        also useful to model the frequency of an event of interest per unit of
+        time on a dataset of observations with different exposure durations per
+        individual (see


I think (?) in these 2 examples, where we use exposure as weights, it is an example of what other packages call 'analytical weights' (stata term, I've also seen this be referred to as 'precision weights', 'reliability weights' and 'inverse variance weights). If we wanted to talk about using weights for this type of case, I think it would be nicer to introduce the general idea of these types of weights instead of being more specific, like we have here. Also using these terms could bring us in line with terms used in other statistical packages.

It is not mentioned explicitly in the examples but I think we are weighting samples with a longer exposure duration higher, as these would be considered more reliable, or lower variance. I think in the Poisson model variance equals mean, which equals count / exposure , thus longer exposure -> lower variance. This seems to match what Christian is describing here: #30564 (comment) , where variance is inversely proportional to weight.

It's difficult to find a good source/definition of these types of weights. Julia stats says:

These are typically used when the observations being weighted are aggregate values (e.g., averages) with differing variances.

Stata docs (see section 20.23) says:

Analytic weights — analytic is a term we made up — statistically arise in one particular problem:
linear regression on data that are themselves observed means

Later (when talking about the difference between analytical weights and sampling weights (aka inverse propensity weighting)), there is the example:

Consider 2 observations, one recording means over two subjects and the
other means over 100,000 subjects. You would expect the variance of the residual to be less in the
100,000-subject observation; that is, there is more information in the 100,000-subject observation than
in the two-subject observation

The best I could find on wiki is this section on 'reliability weights' inside the section on 'Weighted arithmetic mean':

If the weights are instead reliability weights (non-random values reflecting the sample's relative trustworthiness, often derived from sample variance), we can determine a correction factor to yield an unbiased estimator.

I think we indeed need to put more thinking into the weight semantics we want to enforce consistently throughout the library and we should probably write more detailed doc in the user guide beyond what the glossary entry can provide. This is related to audit the state of the library w.r.t. weight scale invariance I believe.

Let me remove that sentence for now.

doc/glossary.rst

Co-authored-by: Lucy Liu <[email protected]>

…review comments.

adrinjalali

Thanks @ogrisel . LGTM.

I'll let another set of eyes have a look and possibly merge.

lucyleeow

Sorry for delay, just a question and nitpick, otherwise LGTM.

I will make an issue just to document that we want to add a user guide for sample weights.

lucyleeow · 2025-10-23T10:47:59Z

doc/glossary.rst

+        specified as floats: many estimators and scorers are invariant to a
+        rescaling of the weights by a constant positive factor but there are
+        exceptions.


Sorry but I don't quite follow how "Weights may also be specified as floats" relates to "many estimators and scorers are invariant to a rescaling of the weights by a constant positive factor but there are exceptions".

Because floating-point valued weights could be used to express relative weights as opposed to absolute (integer) repetitions if you divide integer weights by their sum across the training set, for instance. However, there are known case where this is not equivalent (and we are still not sure if it's a bug or a feature).

Feel free to make a suggestion for rephrasing this sentence based on your understanding of my above reply. I don't know to convey the above without being overly verbose.

I think I see what you are saying.
Integer weights could also be expressed by a float e.g., float 'relative' weights of [0.1, 0.2, 0.3], can be rescaled by a constant factor of 10 and would become integer weights [1, 2, 3] ?

I guess if weights were scale invariant, the above weights would effectively be the same.

I was recently reading this conversation: numpy/numpy#8935 (comment) (it's about weights specifically in quantiles) but it made me question if some interpretations of weight is sample size dependent (i.e. scale variant)..?

Sorry, this is probably not helpful in moving this forward.

Yes, the point is that many scikit-learn estimators have a fit method that is naturally sample_weight-scale invariant, but there are exceptions (sometimes only for some combinations of constructor param values). We would need to conduct an exhaustive survey similarly to what we did to check the validity of the integer frequency semantics.

For estimators that are weight-scale invariant, the semantics of fractional weights (passed as floating point values) is easy to define: they are the same as the integer frequency weights because dividing or multiplying by an arbitrary constant that does not impact the results.

For estimators that are not weight scale invariant, the semantics of non-integer weights is not easy to define, but I don't want to make the glossary entry too complex.

Also note that sample_weight scale invariance is strongly related to invariance to duplicating the whole training set: some estimators have this property by default, others will require to change a regularization parameter or a max_samples parameter to get the same behavior (in distribution if their fit method is stochastic).

What about:

Weights can also be specified as floats, and can have the same effect as above, as many estimators and scorers are scale invariant. For example, weights [1, 2, 3] would be equivalent to weights [0.1, 0.2, 0.3] as they differ by a constant factor of 10. Note, there are exceptions to this scale invariance (see below).

This is not ideal, open to re-phrasing.

I pushed 8c8d043 but I removed the "(see below)" reference because we actually no longer speak about exceptions to weight scale invariance anymore in the rest of the glossary entry.

doc/glossary.rst

Co-authored-by: Lucy Liu <[email protected]>

lucyleeow

LGTM, will merge tomorrow, just in case @ogrisel you want to do any last changes.

lucyleeow · 2025-10-28T01:54:32Z

doc/glossary.rst

+        At the time of writing (version 1.8), not all scikit-learn estimators
+        correctly implement the weight-repetition equivalence property. The
+        `#16298 meta issue
+        <https://github.com/scikit-learn/scikit-learn/issues/16298>`_ tracks
+        ongoing work to detect and fix remaining discrepancies.


The 'see below' was referring to this part, but I think this is fine as is!

lucyleeow · 2025-10-31T05:07:30Z

Thanks for working on this @ogrisel !

…kit-learn#30564) Co-authored-by: Gael Varoquaux <[email protected]> Co-authored-by: Stefanie Senger <[email protected]> Co-authored-by: Lucy Liu <[email protected]>

DOC update and improve the sample_weight entry in the glossary

7771446

github-actions bot added the Documentation label Dec 31, 2024

ogrisel mentioned this pull request Dec 31, 2024

Fix sample weight passing in KBinsDiscretizer #29907

Merged

8 tasks

ogrisel mentioned this pull request Jan 2, 2025

List of estimators with known incorrect handling of sample_weight #16298

Open

54 tasks

ogrisel changed the title ~~DOC update and improve the sample_weight entry in the glossary~~ DOC update and improve the sample_weight entry in the glossary Jan 2, 2025

virchan reviewed Jan 3, 2025

View reviewed changes

doc/glossary.rst Outdated Show resolved Hide resolved

doc/glossary.rst Outdated Show resolved Hide resolved

doc/glossary.rst Outdated Show resolved Hide resolved

Improve phrasing

876261d

ogrisel commented Jan 3, 2025

View reviewed changes

doc/glossary.rst Outdated Show resolved Hide resolved

Typo

a93a221

ogrisel commented Jan 3, 2025

View reviewed changes

doc/glossary.rst Show resolved Hide resolved

lorentzenchr reviewed Jan 6, 2025

View reviewed changes

ogrisel mentioned this pull request Jan 10, 2025

SLEP006: default routing #26179

Open

GaelVaroquaux reviewed Jan 15, 2025

View reviewed changes

doc/glossary.rst Outdated Show resolved Hide resolved

GaelVaroquaux reviewed Jan 15, 2025

View reviewed changes

StefanieSenger reviewed Jan 16, 2025

View reviewed changes

doc/glossary.rst Outdated Show resolved Hide resolved

doc/glossary.rst Outdated Show resolved Hide resolved

doc/glossary.rst Outdated Show resolved Hide resolved

ogrisel and others added 4 commits February 24, 2025 14:32

Apply suggestions from code review

865eed7

Co-authored-by: Gael Varoquaux <[email protected]> Co-authored-by: Stefanie Senger <[email protected]>

Reorder paragraphs and reformat

1ece29e

Simpler phrasing

ef31eb9

Cross-reference metadata-routing

5e2d5f6

betatim approved these changes Feb 26, 2025

View reviewed changes

ogrisel added the Waiting for Second Reviewer First reviewer is done, need a second one! label Mar 11, 2025

ogrisel mentioned this pull request Mar 27, 2025

TargetEncoder should respect sample_weights #28881

Open

adrinjalali reviewed Apr 30, 2025

View reviewed changes

ogrisel mentioned this pull request Jun 16, 2025

FIX Draw indices using sample_weight in Bagging #31414

Merged

ogrisel commented Jun 18, 2025

View reviewed changes

doc/glossary.rst Outdated Show resolved Hide resolved

Apply suggestions from code review

d592044

ogrisel commented Jun 18, 2025

View reviewed changes

doc/glossary.rst Outdated Show resolved Hide resolved

Apply suggestions from code review

55e5e0a

lucyleeow reviewed Sep 11, 2025

View reviewed changes

ogrisel and others added 3 commits October 7, 2025 11:47

Apply suggestions from code review

3f34620

Co-authored-by: Lucy Liu <[email protected]>

Merge branch 'main' into enh-sample-weight-glossary and address some …

2ef867b

…review comments.

Remove sentence about modeling the frequency per unit of time

c7c78a8

adrinjalali approved these changes Oct 8, 2025

View reviewed changes

lucyleeow approved these changes Oct 23, 2025

View reviewed changes

ogrisel and others added 2 commits October 23, 2025 17:34

Update doc/glossary.rst

11ab2ec

Co-authored-by: Lucy Liu <[email protected]>

Merge branch 'main' into enh-sample-weight-glossary

a58dd6f

lucyleeow mentioned this pull request Oct 24, 2025

DOC: Add user guide entry for sample weights #32561

Open

ghost mentioned this pull request Oct 25, 2025

DOC: Update sample_weight glossary entry with comprehensive examples #32561 #32573

Closed

Expand a bit on floating point weights

8c8d043

lucyleeow approved these changes Oct 28, 2025

View reviewed changes

lucyleeow merged commit 1e9b523 into scikit-learn:main Oct 31, 2025
38 checks passed

github-project-automation bot moved this from In Progress to Done in Losses and solvers Oct 31, 2025

lucyleeow mentioned this pull request Dec 23, 2025

FIX Draw indices using sample_weight in Random Forests #31529

Merged

6 tasks

		floats, so that sample weights are usually equivalent up to a constant
		positive scaling factor.

Uh oh!

Conversation

ogrisel commented Dec 31, 2024

Uh oh!

github-actions bot commented Dec 31, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✔️ Linting Passed

Uh oh!

ogrisel commented Jan 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

virchan left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

lorentzenchr commented Jan 3, 2025

Uh oh!

lorentzenchr left a comment

Choose a reason for hiding this comment

Uh oh!

ogrisel commented Jan 8, 2025

Uh oh!

lorentzenchr commented Jan 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

GaelVaroquaux left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

StefanieSenger left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ogrisel commented Feb 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

betatim left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ogrisel Oct 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

lucyleeow left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

github-actions bot commented Dec 31, 2024 •

edited

Loading

ogrisel commented Jan 2, 2025 •

edited

Loading

lorentzenchr commented Jan 10, 2025 •

edited

Loading

GaelVaroquaux left a comment •

edited

Loading

ogrisel commented Feb 24, 2025 •

edited

Loading

ogrisel Oct 7, 2025 •

edited

Loading

lucyleeow left a comment •

edited

Loading