[MRG+1] Add sample_weight support to Dummy Regressor #3779
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -14,6 +14,7 @@ | |
| from .utils.validation import check_consistent_length | ||
| from .utils import deprecated | ||
| from .utils.random import random_choice_csc | ||
| from .utils.stats import _weighted_percentile | ||
| from .utils.multiclass import class_distribution | ||
|
|
||
|
|
||
|
|
@@ -366,7 +367,7 @@ def y_mean_(self): | |
| return self.constant_ | ||
| raise AttributeError | ||
|
|
||
| def fit(self, X, y): | ||
| def fit(self, X, y, sample_weight=None): | ||
| """Fit the random regressor. | ||
|
|
||
| Parameters | ||
|
|
@@ -378,6 +379,9 @@ def fit(self, X, y): | |
| y : array-like, shape = [n_samples] or [n_samples, n_outputs] | ||
| Target values. | ||
|
|
||
| sample_weight : array-like of shape = [n_samples], optional | ||
| Sample weights. | ||
|
|
||
| Returns | ||
| ------- | ||
| self : object | ||
|
|
@@ -389,25 +393,40 @@ def fit(self, X, y): | |
| "'mean', 'median', 'quantile' or 'constant'" | ||
| % self.strategy) | ||
|
|
||
| y = check_array(y, accept_sparse='csr', ensure_2d=False) | ||
| y = check_array(y, ensure_2d=False) | ||
| if len(y) == 0: | ||
| raise ValueError("y must not be empty.") | ||
| self.output_2d_ = (y.ndim == 2) | ||
|
|
||
| check_consistent_length(X, y) | ||
| self.output_2d_ = y.ndim == 2 | ||
| if y.ndim == 1: | ||
| y = np.reshape(y, (-1, 1)) | ||
|
Member
Noob question: out of curiosity, is there any difference between doing this and I always use the latter.
Member
Author
The
Member
oh yes, I remember now :) |
||
| self.n_outputs_ = y.shape[1] | ||
|
|
||
| check_consistent_length(X, y, sample_weight) | ||
|
|
||
| if self.strategy == "mean": | ||
| self.constant_ = np.mean(y, axis=0) | ||
| self.constant_ = np.average(y, axis=0, weights=sample_weight) | ||
|
|
||
| elif self.strategy == "median": | ||
| self.constant_ = np.median(y, axis=0) | ||
| if sample_weight is None: | ||
| self.constant_ = np.median(y, axis=0) | ||
| else: | ||
| self.constant_ = [_weighted_percentile(y[:, k], sample_weight, | ||
| percentile=50.) | ||
| for k in range(self.n_outputs_)] | ||
|
|
||
|
Member
Member
Author
I think this is handled by the reshape. I will check tomorrow.
Member
Yes, sorry, I looked only at the diff. I take back my comment.
Member
Just a nitpick: why not just set it to
Member
Author
The reshape must be done for the other np.mean and np.median. I think that the
Member
Sorry about my pretentious comments then ;)
Member
Author
No worry :-) |
||
| elif self.strategy == "quantile": | ||
| if self.quantile is None or not np.isscalar(self.quantile): | ||
| raise ValueError("Quantile must be a scalar in the range " | ||
| "[0.0, 1.0], but got %s." % self.quantile) | ||
|
|
||
| self.constant_ = np.percentile(y, axis=0, q=self.quantile * 100.0) | ||
| percentile = self.quantile * 100.0 | ||
| if sample_weight is None: | ||
| self.constant_ = np.percentile(y, axis=0, q=percentile) | ||
| else: | ||
| self.constant_ = [_weighted_percentile(y[:, k], sample_weight, | ||
| percentile=percentile) | ||
| for k in range(self.n_outputs_)] | ||
|
|
||
| elif self.strategy == "constant": | ||
| if self.constant is None: | ||
|
|
@@ -426,7 +445,6 @@ def fit(self, X, y): | |
| self.constant_ = self.constant | ||
|
|
||
| self.constant_ = np.reshape(self.constant_, (1, -1)) | ||
| self.n_outputs_ = np.size(self.constant_) # y.shape[1] is not safe | ||
| return self | ||
|
|
||
| def predict(self, X): | ||
|
|
||
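The mean branch in the diff above switches from np.mean to np.average without adding a `sample_weight is None` check. That works because np.average falls back to an ordinary mean when weights is None. A small sketch of that behaviour, using made-up illustration data that is not part of the PR:

```python
import numpy as np

# Made-up illustration data: 4 samples, 2 outputs.
y = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0],
              [4.0, 40.0]])
sample_weight = np.array([1, 2, 0, 1])

# With weights=None, np.average is an ordinary mean over the samples.
unweighted = np.average(y, axis=0, weights=None)

# A 1-D weight vector is applied along axis 0, one weight per sample.
weighted = np.average(y, axis=0, weights=sample_weight)

# Integer weights behave like repeating each row that many times
# (a weight of 0 drops the row entirely).
repeated = np.mean(np.repeat(y, sample_weight, axis=0), axis=0)

print(unweighted, weighted, repeated)
```

The median and quantile branches have no such fallback in np.median or np.percentile, which is presumably why the diff keeps an explicit `sample_weight is None` test there.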
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -11,6 +11,7 @@ | |
| from sklearn.utils.testing import assert_raises | ||
| from sklearn.utils.testing import assert_true | ||
| from sklearn.utils.testing import assert_warns_message | ||
| from sklearn.utils.stats import _weighted_percentile | ||
|
|
||
| from sklearn.dummy import DummyClassifier, DummyRegressor | ||
|
|
||
|
|
@@ -572,6 +573,24 @@ def test_most_frequent_strategy_sparse_target(): | |
| np.zeros((n_samples, 1))])) | ||
|
|
||
|
|
||
| def test_dummy_regressor_sample_weight(n_samples=10): | ||
| random_state = np.random.RandomState(seed=1) | ||
|
|
||
| X = [[0]] * n_samples | ||
|
Member
Would it be better to generate X randomly? Just for a sanity check. |
||
| y = random_state.rand(n_samples) | ||
| sample_weight = random_state.rand(n_samples) | ||
|
|
||
| est = DummyRegressor(strategy="mean").fit(X, y, sample_weight) | ||
| assert_equal(est.constant_, np.average(y, weights=sample_weight)) | ||
|
|
||
| est = DummyRegressor(strategy="median").fit(X, y, sample_weight) | ||
| assert_equal(est.constant_, _weighted_percentile(y, sample_weight, 50.)) | ||
|
|
||
| est = DummyRegressor(strategy="quantile", quantile=.95).fit(X, y, | ||
| sample_weight) | ||
| assert_equal(est.constant_, _weighted_percentile(y, sample_weight, 95.)) | ||
|
|
||
|
|
||
| if __name__ == '__main__': | ||
| import nose | ||
| nose.runmodule() | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -44,3 +44,16 @@ def _rankdata(a, method="average"): | |
|
|
||
| except TypeError as e: | ||
| rankdata = _rankdata | ||
|
|
||
|
|
||
| def _weighted_percentile(array, sample_weight, percentile=50): | ||
|
Member
Any reason why this is private?
Member
Author
Because it is in utils.
Member
OK, I was confused since there are many functions in utils which are not private. |
||
| """Compute the weighted ``percentile`` of ``array`` with ``sample_weight``. """ | ||
| sorted_idx = np.argsort(array) | ||
|
|
||
| # Find index of median prediction for each sample | ||
| weight_cdf = sample_weight[sorted_idx].cumsum() | ||
| percentile_or_above = weight_cdf >= (percentile / 100.0) * weight_cdf[-1] | ||
|
Contributor
Sorry for being stupid, but I am not able to get this to work. My arguments are [3, 2, 4] and [1, 2, 3] for array and sample_weight respectively. The sorted_idx is an array, and indexing with it throws a TypeError. I wonder what the expected arguments are here.
Member
sample_weight should be a numpy array.
Contributor
Thanks!
Member
Author
Thanks @MechCoder! |
||
| percentile_idx = percentile_or_above.argmax() | ||
|
Member
If I'm understanding right, these two lines can be replaced by or am I wrong?
Member
Author
Do you think this could be optimized in another PR? I have just taken what @pprett has done previously and put it there to be useful to more than just gradient boosting.
Member
Okay, unless @pprett thinks it is OK to change this here. |
||
|
|
||
| return array[sorted_idx[percentile_idx]] | ||
|
|
||
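The review thread above points out that sample_weight must be a NumPy array, because a plain list cannot be indexed with the array returned by np.argsort. Below is a minimal, self-contained sketch of the helper as it appears in this diff. The name weighted_percentile_sketch is hypothetical, and the np.asarray conversion is an assumption of this sketch rather than part of the PR; it is added only so that plain lists are accepted too.

```python
import numpy as np


def weighted_percentile_sketch(array, sample_weight, percentile=50):
    """Hypothetical stand-in for the ``_weighted_percentile`` helper in this diff."""
    # Assumption of this sketch: coerce inputs so plain lists are accepted.
    array = np.asarray(array)
    sample_weight = np.asarray(sample_weight, dtype=float)

    # Sort the values and accumulate their weights in that order.
    sorted_idx = np.argsort(array)
    weight_cdf = sample_weight[sorted_idx].cumsum()

    # First sorted position whose cumulative weight reaches the requested
    # fraction of the total weight.
    percentile_or_above = weight_cdf >= (percentile / 100.0) * weight_cdf[-1]
    percentile_idx = percentile_or_above.argmax()
    return array[sorted_idx[percentile_idx]]


# The arguments from the review thread: values [3, 2, 4], weights [1, 2, 3].
median = weighted_percentile_sketch([3, 2, 4], [1, 2, 3])
print(median)
```

For these inputs the sorted values are 2, 3, 4 with cumulative weights 2, 3, 6. Half of the total weight is 3, which is first reached at the value 3, so the weighted median is 3. Note that the helper returns an element of the input rather than interpolating between neighbours.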
I removed accept_sparse='csr' since it's not supported.