ENH Support sample weights in HGBT #14696
Conversation
Needs more tests, but already passes some edge case tests. @NicolasHug wanna have a look?
NicolasHug
left a comment
Thanks Adrin! Made a first pass.
Apart from the minor hessians_are_constant issue I think this looks pretty good now; we just need some tests. Some basic ones I can think of are
Oh, and I just realized: we need the PDPs to take SW into account now ^^
This test fails, and I'm trying to figure out why:
import numpy as np
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.datasets import make_classification
model = HistGradientBoostingClassifier(max_iter=2, max_depth=2,
random_state=42,
validation_fraction=None)
X, y = make_classification(n_classes=2, flip_y=.3, n_redundant=10,
random_state=41)
n_samples = X.shape[0]
X_ = np.r_[X, X[:n_samples // 2, :]]
y_ = np.r_[y, y[:n_samples // 2, ]]
sample_weight = np.ones(shape=(n_samples))
sample_weight[:n_samples // 2] = 2
no_dup_no_sw = model.fit(X, y).predict(X)
dup_no_sw = model.fit(X_, y_).predict(X)
no_dup_sw = model.fit(X, y, sample_weight=sample_weight).predict(X)
print(np.all(dup_no_sw == no_dup_sw))
print(np.all(no_dup_no_sw == dup_no_sw))
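The invariant being probed here (an integer weight of w should behave like duplicating a sample w times) can be seen on a simpler statistic than a fitted model; a minimal sketch with a weighted mean:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=10)

# give the first half a weight of 2 ...
w = np.ones(10)
w[:5] = 2

# ... which should be equivalent to duplicating those samples
y_dup = np.r_[y, y[:5]]

assert np.isclose(np.average(y, weights=w), y_dup.mean())
```

This is the same equivalence the snippet above tests on HistGradientBoostingClassifier predictions, just on a statistic where it is easy to verify by hand.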
glemaitre
left a comment
sample_weight = np.ones(shape=(n_samples))
sample_weight[:n_samples // 2] = 2
This is weird for me. Instead, should it be:
sample_weight = np.zeros(shape=(n_samples,))
sample_weight[:n_samples // 2] = 1
I would also be happy with removing support for weights in the binning (wrong but still good enough), merging the PR, and dealing with that later. The estimators are still experimental, so that's fine. I'd like to get this PR merged so we can focus on the categorical support, which I'd really like to get in for 0.23. If we want to release in May, we need to start soon.
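For reference, weight-aware binning would mean placing bin edges at weighted quantiles of each feature rather than plain quantiles. A rough numpy sketch of the idea (illustrative only; `weighted_quantile_thresholds` is a hypothetical helper, not the PR's or scikit-learn's implementation):

```python
import numpy as np

def weighted_quantile_thresholds(x, sample_weight, n_bins):
    """Sketch: bin thresholds at weighted quantiles of x."""
    order = np.argsort(x)
    x_sorted = x[order]
    # normalized weighted CDF over the sorted values
    cum_w = np.cumsum(sample_weight[order])
    cum_w = cum_w / cum_w[-1]
    # interior quantile levels, e.g. [0.25, 0.5, 0.75] for 4 bins
    quantiles = np.linspace(0, 1, n_bins + 1)[1:-1]
    # threshold = smallest x whose weighted CDF reaches the level
    idx = np.searchsorted(cum_w, quantiles)
    return x_sorted[idx]

x = np.arange(10, dtype=float)
w = np.ones(10)
print(weighted_quantile_thresholds(x, w, 4))  # → [2. 4. 7.]
```

With uniform weights this reduces to ordinary quantile binning, which is why dropping SW support in the binner is "wrong but still good enough" in practice.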
I'd love to move forward with this so we can get started on the categorical variables. So I see three options:
Honestly I'm happy with any of these. 3) adds a bit more complexity, but it's not terrible, and I'd rather move forward so we can do categorical variables for 0.23.
@ogrisel did you have a chance to look into this and the weighted percentiles?
I believe @ogrisel was OK with my proposal according to #14696 (comment). If that helps @adrinjalali, you can consider resampling with replacement as a general framework for handling sample_weight. This is applicable to any estimator, not just to the binner. Again, I'm fine with reverting to the previous state where we don't handle sample_weight in the binner. I'm happy to help moving this forward if you need.
NicolasHug
left a comment
Thanks @adrinjalali
@ogrisel the SW support for binning has been reverted (like in LightGBM). The PR is in a solid state and I think reviewing should be reasonably easy now
doc/whats_new/v0.22.rst
Outdated
- |Feature| Estimators now support :term:`sample_weight`. :pr:`14696` by
  `Adrin Jalali`_ and `Nicolas Hug`_.
if getattr(self, '_fitted_with_sw', False):
    raise NotImplementedError("{} does not support partial dependence"
                              " plots when sample weights were given "
"with the 'recursion' method"
(for some reason I can't make suggestions on GitHub)
Maybe @thomasjpfan @glemaitre can also give it a second look.
>>> gb.predict([[1, 0]])
array([1])
As you can see, the `[1, 0]` is comfortably classified as `1` since the first
Would the probability show more comfort?
gb.predict_proba([[1, 0]])[0, 1]
# 0.99...

# gradient = sign(raw_predictions[i] - y_true[i]) * sample_weight[i]
gradients[i] = sample_weight[i] * (2 *
                                   (y_true[i] - raw_predictions[i] < 0) - 1)
hessians[i] = sample_weight[i]
Does this work because sample_weight is non-negative? If so, let's leave a comment?
Does this work because sample_weight is non-negative?
No, this works because:
- accounting for SW means we need to multiply gradients and hessians by SW
- without SW, the hessian of this loss is constant and is equal to 1.
I was thinking about the math. LightGBM does the same thing with the L1 loss.
When I see the derivative of sign(x), I think of the Dirac delta function: https://en.wikipedia.org/wiki/Sign_function
Ah I think I see what you did here: #13896 (comment)
yeah, also, if you look at the sklearn wrapper in lightgbm, all hessians and gradients are simply multiplied by SW.
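The weighted L1 update discussed in this thread can be sketched in plain numpy (illustrative only; the PR implements this in Cython, and `update_gradients_hessians_l1` is a hypothetical name):

```python
import numpy as np

def update_gradients_hessians_l1(gradients, hessians, y_true,
                                 raw_predictions, sample_weight):
    # subgradient of |y_true - raw| w.r.t. raw is sign(raw - y_true),
    # scaled by the sample weight
    sign = 2 * (y_true - raw_predictions < 0) - 1
    gradients[:] = sample_weight * sign
    # the unweighted "hessian" surrogate for this loss is the
    # constant 1, so the weighted one is just the sample weight
    hessians[:] = sample_weight

g = np.empty(4)
h = np.empty(4)
y = np.array([1.0, 2.0, 3.0, 4.0])
raw = np.array([0.5, 2.5, 3.0, 5.0])
w = np.array([1.0, 2.0, 1.0, 0.5])
update_gradients_hessians_l1(g, h, y, raw, w)
# g is [-1., 2., -1., 0.5], h equals w
```

As noted above, this is just the unweighted gradient and hessian each multiplied by SW, which is also what LightGBM's sklearn wrapper does.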
if sample_weight == 'ones':
    sample_weight = np.ones(shape=n_samples, dtype=Y_DTYPE)
else:
    sample_weight = rng.normal(size=n_samples).astype(Y_DTYPE)
Do we want to include negative numbers here?
There doesn't seem to be a place where we check for non-negative weights, and I'm not sure if the math requires the weights to be positive.
For ref, this is discussed in #15531
I think we can safely ignore the topic for now. (I'm reasonably confident that negative SW would work "as expected" here, though I've no idea what it would look like with binning)
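To illustrate why negative SW is delicate: with a squared-error-style loss the per-sample hessian equals the sample weight, so negative weights can drive a leaf's hessian sum toward (or below) zero and amplify the leaf value. A small arithmetic sketch, assuming the usual second-order leaf value formula (not code from the PR):

```python
import numpy as np

# leaf value in second-order boosting: -sum(grad) / (sum(hess) + l2)
l2 = 1.0
grad_unweighted = np.array([0.5, -0.2, 0.3])
w = np.array([1.0, 1.0, -2.5])   # one negative sample weight

grad = grad_unweighted * w       # weighted gradients
hess = w.copy()                  # weighted hessians (unweighted ones are 1)

denom = hess.sum() + l2          # hessian sum is -0.5, denom shrinks to 0.5
value = -grad.sum() / denom
# the negative weight nearly cancels the regularized hessian sum,
# inflating the leaf value
```

With a sufficiently negative weight the denominator can cross zero entirely, which is part of why negative weights are deferred to the #15531 discussion.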
Hi, I'm experiencing an ImportError when building the docs. Here is the full log:
sphinx-build -D plot_gallery=0 -b html -T -d _build/doctrees -j auto . _build/html/stable
Running Sphinx v2.4.1
Traceback (most recent call last):
File "/Users/raimibinkarim/anaconda3/envs/sklearn/lib/python3.7/site-packages/sphinx/config.py", line 348, in eval_config_file
execfile_(filename, namespace)
File "/Users/raimibinkarim/anaconda3/envs/sklearn/lib/python3.7/site-packages/sphinx/util/pycompat.py", line 81, in execfile_
exec(code, _globals)
File "/Users/raimibinkarim/Desktop/scikit-learn/doc/conf.py", line 324, in <module>
from sklearn.experimental import enable_hist_gradient_boosting # noqa
File "/Users/raimibinkarim/Desktop/scikit-learn/sklearn/experimental/enable_hist_gradient_boosting.py", line 22, in <module>
from ..ensemble._hist_gradient_boosting.gradient_boosting import (
File "/Users/raimibinkarim/Desktop/scikit-learn/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py", line 23, in <module>
from .loss import _LOSSES
File "/Users/raimibinkarim/Desktop/scikit-learn/sklearn/ensemble/_hist_gradient_boosting/loss.py", line 21, in <module>
from ._loss import _update_gradients_hessians_least_squares
ImportError: cannot import name '_update_gradients_hessians_least_squares' from 'sklearn.ensemble._hist_gradient_boosting._loss' (/Users/raimibinkarim/Desktop/scikit-learn/sklearn/ensemble/_hist_gradient_boosting/_loss.cpython-37m-darwin.so)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/raimibinkarim/anaconda3/envs/sklearn/lib/python3.7/site-packages/sphinx/cmd/build.py", line 275, in build_main
args.tags, args.verbosity, args.jobs, args.keep_going)
File "/Users/raimibinkarim/anaconda3/envs/sklearn/lib/python3.7/site-packages/sphinx/application.py", line 219, in __init__
self.config = Config.read(self.confdir, confoverrides or {}, self.tags)
File "/Users/raimibinkarim/anaconda3/envs/sklearn/lib/python3.7/site-packages/sphinx/config.py", line 193, in read
namespace = eval_config_file(filename, tags)
File "/Users/raimibinkarim/anaconda3/envs/sklearn/lib/python3.7/site-packages/sphinx/config.py", line 358, in eval_config_file
raise ConfigError(msg % traceback.format_exc())
sphinx.errors.ConfigError: There is a programmable error in your configuration file:
Traceback (most recent call last):
File "/Users/raimibinkarim/anaconda3/envs/sklearn/lib/python3.7/site-packages/sphinx/config.py", line 348, in eval_config_file
execfile_(filename, namespace)
File "/Users/raimibinkarim/anaconda3/envs/sklearn/lib/python3.7/site-packages/sphinx/util/pycompat.py", line 81, in execfile_
exec(code, _globals)
File "/Users/raimibinkarim/Desktop/scikit-learn/doc/conf.py", line 324, in <module>
from sklearn.experimental import enable_hist_gradient_boosting # noqa
File "/Users/raimibinkarim/Desktop/scikit-learn/sklearn/experimental/enable_hist_gradient_boosting.py", line 22, in <module>
from ..ensemble._hist_gradient_boosting.gradient_boosting import (
File "/Users/raimibinkarim/Desktop/scikit-learn/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py", line 23, in <module>
from .loss import _LOSSES
File "/Users/raimibinkarim/Desktop/scikit-learn/sklearn/ensemble/_hist_gradient_boosting/loss.py", line 21, in <module>
from ._loss import _update_gradients_hessians_least_squares
ImportError: cannot import name '_update_gradients_hessians_least_squares' from 'sklearn.ensemble._hist_gradient_boosting._loss' (/Users/raimibinkarim/Desktop/scikit-learn/sklearn/ensemble/_hist_gradient_boosting/_loss.cpython-37m-darwin.so)
Configuration error:
There is a programmable error in your configuration file:
Traceback (most recent call last):
File "/Users/raimibinkarim/anaconda3/envs/sklearn/lib/python3.7/site-packages/sphinx/config.py", line 348, in eval_config_file
execfile_(filename, namespace)
File "/Users/raimibinkarim/anaconda3/envs/sklearn/lib/python3.7/site-packages/sphinx/util/pycompat.py", line 81, in execfile_
exec(code, _globals)
File "/Users/raimibinkarim/Desktop/scikit-learn/doc/conf.py", line 324, in <module>
from sklearn.experimental import enable_hist_gradient_boosting # noqa
File "/Users/raimibinkarim/Desktop/scikit-learn/sklearn/experimental/enable_hist_gradient_boosting.py", line 22, in <module>
from ..ensemble._hist_gradient_boosting.gradient_boosting import (
File "/Users/raimibinkarim/Desktop/scikit-learn/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py", line 23, in <module>
from .loss import _LOSSES
File "/Users/raimibinkarim/Desktop/scikit-learn/sklearn/ensemble/_hist_gradient_boosting/loss.py", line 21, in <module>
from ._loss import _update_gradients_hessians_least_squares
ImportError: cannot import name '_update_gradients_hessians_least_squares' from 'sklearn.ensemble._hist_gradient_boosting._loss' (/Users/raimibinkarim/Desktop/scikit-learn/sklearn/ensemble/_hist_gradient_boosting/_loss.cpython-37m-darwin.so)
make: *** [html-noplot] Error 2
Is anyone experiencing the same problem or am I missing something?
It's probably an issue with different sklearn versions in your path or something. If you start with a clean environment and install the nightly build, it should work, I suppose. @remykarem
It's probably just due to the fact that you haven't cythonized the new files. Before cleaning your env, I'd suggest just running
That works, thank you for the tip!
Fixes #14830
This PR adds sample weight support to HGBT estimators.