[MRG + 1] Improve tests for sample_weight in LinearRegression and Ridge#5526
Conversation
Force-pushed b58187f to 66633b8
|
ping @sonnyhu @agramfort @ogrisel ? |
|
or @GaelVaroquaux, who reviewed the PR together with me. |
|
LGTM |
|
hmm, shouldn't this also show up in @ainafp 's common tests? |
|
👍 |
sklearn/linear_model/base.py
This is not equivalent: now you rescale the non-centered data, which means that sample weights interact with the intercept.
Can you clarify why it's more correct this way?
How do I update scikit-learn to get this bug fixed?
On Thu, Oct 22, 2015 at 9:56 AM, replying by email to Alexandre Gramfort's review of sklearn/linear_model/base.py (#5526), quoting the diff under discussion:

    if sample_weight is not None:
        # Sample weight can be implemented via a simple rescaling.
        X, y = _rescale_data(X, y, sample_weight)

    X, y, X_mean, y_mean, X_std = self._center_data(
        X, y, self.fit_intercept, self.normalize, self.copy_X,
        sample_weight=None)

or I saw #5357 (comment)
@Sandy4321 we are working on solving the issue at the moment. In the meantime, if you want to use sample weights, you can instead multiply X (row by row) and y by the square root of the weights, as is done in the code above. Fitting LinearRegression without sample weights still runs without issues.
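The workaround described above can be sketched as follows. This is a minimal illustration with made-up data, not the library's internal code; `X_w`, `y_w`, and `sqrt_w` are names chosen here for clarity:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.rand(20, 3)
y = rng.rand(20)
sample_weight = rng.uniform(0.5, 2.0, size=20)

# Rescale each row of X, and y, by the square root of its weight,
# then fit without passing sample_weight at all.
sqrt_w = np.sqrt(sample_weight)
X_w = X * sqrt_w[:, np.newaxis]
y_w = y * sqrt_w

# fit_intercept=False because the plain rescaling is only exact without
# an intercept (or with the intercept modeled as an all-ones column);
# the interaction with the intercept is exactly the bug discussed here.
reg = LinearRegression(fit_intercept=False).fit(X_w, y_w)
```

With `fit_intercept=False`, this is the same solution ordinary least squares would give on the rescaled data.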
I see, but the question is: how, technically, do I do this? Is a simple pip reinstall enough?
|
@agramfort We are reducing a weighted least squares problem to a non-weighted one by first performing a change of variables. In the new variables, normalization/centering are weight-independent. I believe |
|
I believe intercept_ has to interact with sample_weight. If we think of the
intercept as a dummy column of X, then from a linear algebra point of view the
diagonal matrix of sample weights multiplies that column too.
Am I correct in thinking that it's the same picture as a weighted mean?
|
|
It is. We are minimizing a (weighted) L2-norm here. Indeed, the current implementation does the trick a priori, before the change of variables. See here. |
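Spelled out, the exchange above amounts to the following (with $s_i$ the sample weights, $w$ the coefficients, and $b$ the intercept):

```latex
% Weighted least squares with intercept b and weights s_i:
\min_{w,\,b}\ \sum_i s_i \bigl(y_i - x_i^\top w - b\bigr)^2
  = \bigl\lVert \mathrm{diag}(\sqrt{s})\,\bigl(y - X w - b\,\mathbf{1}\bigr) \bigr\rVert_2^2
  = \bigl\lVert \tilde{y} - \tilde{X} w - b\,\tilde{\mathbf{1}} \bigr\rVert_2^2,
% with \tilde{X} = \mathrm{diag}(\sqrt{s})\,X and \tilde{y} = \mathrm{diag}(\sqrt{s})\,y.
```

The dummy all-ones column $\mathbf{1}$ is rescaled along with $X$, which is why the fitted intercept must depend on the weights, exactly as a weighted mean does.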
Force-pushed 70ea898 to c1c35d0
|
I think I agree with @giorgiop. More simply, if we just take the loss function as |
This should just be a bug with fit_intercept=True, right? But I guess it is good to test.
|
IIRC, when you do the orthogonalization against a constant (i.e. centering), in
|
I don't know. I had a look at the test but I don't have an answer just yet. |
Force-pushed 2c2897a to c7e276a
|
So, |
|
#5515 shouldn't have a constant sample weight vector. It is drawn uniformly
|
|
Using |
|
Maybe linear regression is not among the checked estimators for some reason?
|
Force-pushed b2e6dcb to 2dcad5e
|
@mblondel I think that kind of test is not supposed to pass. I have removed it and used the idea based on weighted least squares that we discussed, with data augmentation by a dummy variable. It should be all right. I am amending with some cosmetics as suggested by @agramfort in person, and squashing. |
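The data-augmentation idea can be sketched like this. It is a hypothetical illustration with made-up data, not the actual test added in the PR: treat the intercept as an explicit all-ones column, solve the rescaled problem directly with plain least squares, and check that a weighted fit agrees.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(42)
n, d = 30, 4
X = rng.rand(n, d)
y = rng.rand(n)
sw = rng.uniform(0.5, 2.0, size=n)

# Augment X with a dummy all-ones column standing in for the intercept.
X_aug = np.hstack([X, np.ones((n, 1))])

# Ground truth: ordinary least squares on the sqrt-weight-rescaled,
# augmented data (the change of variables discussed above).
sqrt_w = np.sqrt(sw)
coef_aug, *_ = np.linalg.lstsq(X_aug * sqrt_w[:, None], y * sqrt_w, rcond=None)

# A correct weighted fit must recover the same coefficients and intercept.
reg = LinearRegression(fit_intercept=True).fit(X, y, sample_weight=sw)
```

On a version where the bug is fixed, `reg.coef_` matches `coef_aug[:-1]` and `reg.intercept_` matches the dummy column's coefficient `coef_aug[-1]`.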
|
@agramfort also suggested that we move the test based on weighted least squares on |
|
That looks like it could be done within, or in conjunction with, the other PR.
|
Force-pushed 2dcad5e to b76d5a4
|
Sure we can. It's going to follow a different logic, but has the same goal. |
|
let me know when I shall review
|
Please go ahead! |
|
Also, please add a note that the same tests are needed for sparse data.
|
that's it for me! |
There is a comment already in the code for this. I will move it to a place where it is more evident. |
Force-pushed b76d5a4 to 1d16ec4
|
ok. Let me know when I can do another round
|
|
Comments should have been addressed. |
Indeed this is a much more solid test. Thanks for that! |
|
thanks @giorgiop ! |
this loop is pretty useless now, right?
Yes it is, my bad.
I have noticed in #5357 that a test introduced in 0.17 was not correctly checking for the effect of
sample_weight in LinearRegression. To reproduce the issue on master:
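The original reproduction snippet is not included in this excerpt. As a hedged sketch of the kind of check involved (made-up data; not the code originally posted), one can compare a direct weighted fit against the equivalent manual fit on rescaled, augmented data; on the buggy 0.17-era master the two disagreed:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.rand(10, 2)
y = rng.rand(10)
sw = rng.uniform(1.0, 2.0, size=10)

# Fit with sample_weight directly.
direct = LinearRegression().fit(X, y, sample_weight=sw)

# Equivalent fit: rescale [X, 1] and y by sqrt(weights), no intercept.
sqrt_w = np.sqrt(sw)
X_aug = np.hstack([X, np.ones((10, 1))]) * sqrt_w[:, np.newaxis]
manual = LinearRegression(fit_intercept=False).fit(X_aug, y * sqrt_w)

# After the fix, direct.coef_ matches manual.coef_[:2] and
# direct.intercept_ matches manual.coef_[2]; on the buggy version they differed.
```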