ENH Support sample weights in HGBT #14696
Conversation
Needs more tests, but already passes some edge case tests. @NicolasHug wanna have a look?
NicolasHug
left a comment
Thanks Adrin! Made a first pass.
Apart from the minor hessians_are_constant issue I think this looks pretty good now; we just need some tests. Some basic ones I can think of are
Oh, and I just realized: we need the PDPs to take SW into account now ^^
This test fails, and I'm trying to figure out why:
import numpy as np
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.datasets import make_classification
model = HistGradientBoostingClassifier(max_iter=2, max_depth=2,
random_state=42,
validation_fraction=None)
X, y = make_classification(n_classes=2, flip_y=.3, n_redundant=10,
random_state=41)
n_samples = X.shape[0]
X_ = np.r_[X, X[:n_samples // 2, :]]
y_ = np.r_[y, y[:n_samples // 2, ]]
sample_weight = np.ones(shape=(n_samples))
sample_weight[:n_samples // 2] = 2
no_dup_no_sw = model.fit(X, y).predict(X)
dup_no_sw = model.fit(X_, y_).predict(X)
no_dup_sw = model.fit(X, y, sample_weight=sample_weight).predict(X)
print(np.all(dup_no_sw == no_dup_sw))
print(np.all(no_dup_no_sw == dup_no_sw))
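The invariant being probed here (an integer weight of w should behave like duplicating a sample w times) can be seen on a simpler statistic than a fitted model; a minimal sketch with a weighted mean:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=10)

# give the first half a weight of 2 ...
w = np.ones(10)
w[:5] = 2

# ... which should be equivalent to duplicating those samples
y_dup = np.r_[y, y[:5]]

assert np.isclose(np.average(y, weights=w), y_dup.mean())
```

This is the same equivalence the snippet above tests on HistGradientBoostingClassifier predictions, just on a statistic where it is easy to verify by hand.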
glemaitre
left a comment
sample_weight = np.ones(shape=(n_samples))
sample_weight[:n_samples // 2] = 2
This is weird for me. Instead, should it be:
sample_weight = np.zeros(shape=(n_samples,))
sample_weight[:n_samples // 2] = 1
I would also be happy with removing support for weights in the binning (wrong but still good enough), merging the PR, and dealing with that later. The estimators are still experimental, so that's fine. I'd like to get this PR merged so we can focus on the categorical support, which I'd really like to get in for 0.23. If we want to release in May, we need to start soon.
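For reference, weight-aware binning would mean placing bin edges at weighted quantiles of each feature rather than plain quantiles. A rough numpy sketch of the idea (illustrative only; `weighted_quantile_thresholds` is a hypothetical helper, not the PR's or scikit-learn's implementation):

```python
import numpy as np

def weighted_quantile_thresholds(x, sample_weight, n_bins):
    """Sketch: bin thresholds at weighted quantiles of x."""
    order = np.argsort(x)
    x_sorted = x[order]
    # normalized weighted CDF over the sorted values
    cum_w = np.cumsum(sample_weight[order])
    cum_w = cum_w / cum_w[-1]
    # interior quantile levels, e.g. [0.25, 0.5, 0.75] for 4 bins
    quantiles = np.linspace(0, 1, n_bins + 1)[1:-1]
    # threshold = smallest x whose weighted CDF reaches the level
    idx = np.searchsorted(cum_w, quantiles)
    return x_sorted[idx]

x = np.arange(10, dtype=float)
w = np.ones(10)
print(weighted_quantile_thresholds(x, w, 4))  # → [2. 4. 7.]
```

With uniform weights this reduces to ordinary quantile binning, which is why dropping SW support in the binner is "wrong but still good enough" in practice.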
I'd love to move forward with this so we can get started on the categorical variables. So I see three options:
Honestly I'm happy with any of these. 3) adds a bit more complexity, but it's not terrible, and I'd rather move forward so we can do categorical variables for 0.23.
@ogrisel did you have a chance to look into this and the weighted percentiles?
I believe @ogrisel was OK with my proposal according to #14696 (comment). If that helps @adrinjalali, you can consider resampling with replacement as a general framework for handling sample_weight. This is applicable to any estimator, not just to the binner. Again, I'm fine with reverting to the previous state where we don't handle sample_weight in the binner. I'm happy to help moving this forward if you need.
NicolasHug
left a comment
Thanks @adrinjalali
@ogrisel the SW support for binning has been reverted (like in LightGBM). The PR is in a solid state and I think reviewing should be reasonably easy now
doc/whats_new/v0.22.rst
Outdated
- |Feature| Estimators now support :term:`sample_weight`. :pr:`14696` by
  `Adrin Jalali`_ and `Nicolas Hug`_.
if getattr(self, '_fitted_with_sw', False):
    raise NotImplementedError("{} does not support partial dependence"
                              " plots when sample weights were given "
"with the 'recursion' method"
(for some reason I can't make suggestions on GitHub)
Maybe @thomasjpfan @glemaitre can also give it a second look.
>>> gb.predict([[1, 0]])
array([1])
As you can see, the `[1, 0]` is comfortably classified as `1` since the first
Would the probability show more comfort?
gb.predict_proba([[1, 0]])[0, 1]
# 0.99...

# gradient = sign(raw_predictions[i] - y_true[i]) * sample_weight[i]
gradients[i] = sample_weight[i] * (2 *
                                   (y_true[i] - raw_predictions[i] < 0) - 1)
hessians[i] = sample_weight[i]
Does this work because sample_weight is non-negative? If so, let's leave a comment?
Does this work because sample_weight is non-negative?
No, this works because:
- accounting for SW means we need to multiply gradients and hessians by SW
- without SW, the hessian of this loss is constant and is equal to 1.
I was thinking about the math. LightGBM does the same thing with the L1 loss.
When I see the derivative of sign(x), I think of the Dirac delta function: https://en.wikipedia.org/wiki/Sign_function
Ah I think I see what you did here: #13896 (comment)
yeah, also, if you look at the sklearn wrapper in lightgbm, all hessians and gradients are simply multiplied by SW.
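The weighted L1 update discussed in this thread can be sketched in plain numpy (illustrative only; the PR implements this in Cython, and `update_gradients_hessians_l1` is a hypothetical name):

```python
import numpy as np

def update_gradients_hessians_l1(gradients, hessians, y_true,
                                 raw_predictions, sample_weight):
    # subgradient of |y_true - raw| w.r.t. raw is sign(raw - y_true),
    # scaled by the sample weight
    sign = 2 * (y_true - raw_predictions < 0) - 1
    gradients[:] = sample_weight * sign
    # the unweighted "hessian" surrogate for this loss is the
    # constant 1, so the weighted one is just the sample weight
    hessians[:] = sample_weight

g = np.empty(4)
h = np.empty(4)
y = np.array([1.0, 2.0, 3.0, 4.0])
raw = np.array([0.5, 2.5, 3.0, 5.0])
w = np.array([1.0, 2.0, 1.0, 0.5])
update_gradients_hessians_l1(g, h, y, raw, w)
# g is [-1., 2., -1., 0.5], h equals w
```

As noted above, this is just the unweighted gradient and hessian each multiplied by SW, which is also what LightGBM's sklearn wrapper does.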
if sample_weight == 'ones':
    sample_weight = np.ones(shape=n_samples, dtype=Y_DTYPE)
else:
    sample_weight = rng.normal(size=n_samples).astype(Y_DTYPE)
Do we want to include negative numbers here?
There doesn't seem to be a place where we check for non-negative weights, and I'm not sure if the math requires the weights to be positive.
For ref, this is discussed in #15531
I think we can safely ignore the topic for now. (I'm reasonably confident that negative SW would work "as expected" here, though I've no idea what it would look like with binning)
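To illustrate why negative SW is delicate: with a squared-error-style loss the per-sample hessian equals the sample weight, so negative weights can drive a leaf's hessian sum toward (or below) zero and amplify the leaf value. A small arithmetic sketch, assuming the usual second-order leaf value formula (not code from the PR):

```python
import numpy as np

# leaf value in second-order boosting: -sum(grad) / (sum(hess) + l2)
l2 = 1.0
grad_unweighted = np.array([0.5, -0.2, 0.3])
w = np.array([1.0, 1.0, -2.5])   # one negative sample weight

grad = grad_unweighted * w       # weighted gradients
hess = w.copy()                  # weighted hessians (unweighted ones are 1)

denom = hess.sum() + l2          # hessian sum is -0.5, denom shrinks to 0.5
value = -grad.sum() / denom
# the negative weight nearly cancels the regularized hessian sum,
# inflating the leaf value
```

With a sufficiently negative weight the denominator can cross zero entirely, which is part of why negative weights are deferred to the #15531 discussion.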
Hi, I'm experiencing an ImportError when building the docs. Here is the full log:
sphinx-build -D plot_gallery=0 -b html -T -d _build/doctrees -j auto . _build/html/stable
Running Sphinx v2.4.1
Traceback (most recent call last):
File "/Users/raimibinkarim/anaconda3/envs/sklearn/lib/python3.7/site-packages/sphinx/config.py", line 348, in eval_config_file
execfile_(filename, namespace)
File "/Users/raimibinkarim/anaconda3/envs/sklearn/lib/python3.7/site-packages/sphinx/util/pycompat.py", line 81, in execfile_
exec(code, _globals)
File "/Users/raimibinkarim/Desktop/scikit-learn/doc/conf.py", line 324, in <module>
from sklearn.experimental import enable_hist_gradient_boosting # noqa
File "/Users/raimibinkarim/Desktop/scikit-learn/sklearn/experimental/enable_hist_gradient_boosting.py", line 22, in <module>
from ..ensemble._hist_gradient_boosting.gradient_boosting import (
File "/Users/raimibinkarim/Desktop/scikit-learn/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py", line 23, in <module>
from .loss import _LOSSES
File "/Users/raimibinkarim/Desktop/scikit-learn/sklearn/ensemble/_hist_gradient_boosting/loss.py", line 21, in <module>
from ._loss import _update_gradients_hessians_least_squares
ImportError: cannot import name '_update_gradients_hessians_least_squares' from 'sklearn.ensemble._hist_gradient_boosting._loss' (/Users/raimibinkarim/Desktop/scikit-learn/sklearn/ensemble/_hist_gradient_boosting/_loss.cpython-37m-darwin.so)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/raimibinkarim/anaconda3/envs/sklearn/lib/python3.7/site-packages/sphinx/cmd/build.py", line 275, in build_main
args.tags, args.verbosity, args.jobs, args.keep_going)
File "/Users/raimibinkarim/anaconda3/envs/sklearn/lib/python3.7/site-packages/sphinx/application.py", line 219, in __init__
self.config = Config.read(self.confdir, confoverrides or {}, self.tags)
File "/Users/raimibinkarim/anaconda3/envs/sklearn/lib/python3.7/site-packages/sphinx/config.py", line 193, in read
namespace = eval_config_file(filename, tags)
File "/Users/raimibinkarim/anaconda3/envs/sklearn/lib/python3.7/site-packages/sphinx/config.py", line 358, in eval_config_file
raise ConfigError(msg % traceback.format_exc())
sphinx.errors.ConfigError: There is a programmable error in your configuration file:
Traceback (most recent call last):
File "/Users/raimibinkarim/anaconda3/envs/sklearn/lib/python3.7/site-packages/sphinx/config.py", line 348, in eval_config_file
execfile_(filename, namespace)
File "/Users/raimibinkarim/anaconda3/envs/sklearn/lib/python3.7/site-packages/sphinx/util/pycompat.py", line 81, in execfile_
exec(code, _globals)
File "/Users/raimibinkarim/Desktop/scikit-learn/doc/conf.py", line 324, in <module>
from sklearn.experimental import enable_hist_gradient_boosting # noqa
File "/Users/raimibinkarim/Desktop/scikit-learn/sklearn/experimental/enable_hist_gradient_boosting.py", line 22, in <module>
from ..ensemble._hist_gradient_boosting.gradient_boosting import (
File "/Users/raimibinkarim/Desktop/scikit-learn/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py", line 23, in <module>
from .loss import _LOSSES
File "/Users/raimibinkarim/Desktop/scikit-learn/sklearn/ensemble/_hist_gradient_boosting/loss.py", line 21, in <module>
from ._loss import _update_gradients_hessians_least_squares
ImportError: cannot import name '_update_gradients_hessians_least_squares' from 'sklearn.ensemble._hist_gradient_boosting._loss' (/Users/raimibinkarim/Desktop/scikit-learn/sklearn/ensemble/_hist_gradient_boosting/_loss.cpython-37m-darwin.so)
Configuration error:
There is a programmable error in your configuration file:
Traceback (most recent call last):
File "/Users/raimibinkarim/anaconda3/envs/sklearn/lib/python3.7/site-packages/sphinx/config.py", line 348, in eval_config_file
execfile_(filename, namespace)
File "/Users/raimibinkarim/anaconda3/envs/sklearn/lib/python3.7/site-packages/sphinx/util/pycompat.py", line 81, in execfile_
exec(code, _globals)
File "/Users/raimibinkarim/Desktop/scikit-learn/doc/conf.py", line 324, in <module>
from sklearn.experimental import enable_hist_gradient_boosting # noqa
File "/Users/raimibinkarim/Desktop/scikit-learn/sklearn/experimental/enable_hist_gradient_boosting.py", line 22, in <module>
from ..ensemble._hist_gradient_boosting.gradient_boosting import (
File "/Users/raimibinkarim/Desktop/scikit-learn/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py", line 23, in <module>
from .loss import _LOSSES
File "/Users/raimibinkarim/Desktop/scikit-learn/sklearn/ensemble/_hist_gradient_boosting/loss.py", line 21, in <module>
from ._loss import _update_gradients_hessians_least_squares
ImportError: cannot import name '_update_gradients_hessians_least_squares' from 'sklearn.ensemble._hist_gradient_boosting._loss' (/Users/raimibinkarim/Desktop/scikit-learn/sklearn/ensemble/_hist_gradient_boosting/_loss.cpython-37m-darwin.so)
make: *** [html-noplot] Error 2
Is anyone experiencing the same problem or am I missing something?
It's probably an issue with different sklearn versions in your path or something. If you start with a clean environment and install the nightly build, it should work, I suppose. @remykarem
It's probably just due to the fact that you haven't cythonized the new files. Before cleaning your env, I'd suggest just running
That works, thank you for the tip!
Fixes #14830
This PR adds sample weight support to HGBT estimators.