[MRG]: MAINT center_data for linear models #5357
giorgiop wants to merge 1 commit into scikit-learn:master
Conversation
(force-pushed from 23273f5 to 876c8c8)
Regarding point (a). Edited from this test.
(force-pushed from 492c5cb to 73b30a5)
Were you able to figure out the bugs in …?

Which issue are you referring to?
In the Travis build. We were trying to fix this issue before, but since …
(force-pushed from 682b219 to f1356ff)
Question: can we just implement the …?
(force-pushed from 05a9d0e to 6e8290b)
Down to 81 errors.

I found …
(force-pushed from 7176457 to 610c21b)
Down to 35.
If LassoLars has normalize=True, LassoLarsCV hopefully has normalize=True too, right?

Yes, all classes in least_angle.py.
(force-pushed from 610c21b to c183cf2)
@GaelVaroquaux as you are around, do you have a better idea than to warn by default in lars? I don't like it.
What's the question? (I have a grant proposal to write before the sprint, …)
We are deprecating …
@GaelVaroquaux Also good luck :)

Thanks. I hate grant writing.
As I mentioned in my comment on that issue, this is really a bad idea. I …

The whole value of this option is that it makes linear models robust in …

We convinced ourselves that we shouldn't do this change and keep the …

I am sorry, I said 👍 to this change a few days ago. But Giorgio hit …
OK, so we keep the fact that if fit_intercept=False, normalize=True has no effect?
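The semantics being agreed on here can be illustrated with a small numpy sketch. The helper name and structure below are hypothetical, not the actual scikit-learn code; the point is only that when fit_intercept=False the data passes through untouched, so normalize=True is silently ignored:

```python
import numpy as np

def center_and_maybe_normalize(X, fit_intercept, normalize):
    """Hypothetical helper mirroring the semantics discussed above:
    normalize is silently ignored when fit_intercept=False."""
    if not fit_intercept:
        return X  # data left untouched, whatever normalize says
    X = X - X.mean(axis=0)                     # center columns
    if normalize:
        X = X / np.linalg.norm(X, axis=0)      # scale columns to unit L2 norm
    return X

X = np.arange(12.0).reshape(4, 3)
out = center_and_maybe_normalize(X, fit_intercept=False, normalize=True)
print(np.array_equal(out, X))  # True: normalize=True had no effect
```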
sklearn/linear_model/base.py
Outdated
This will blow up the memory if you have a large number of samples.
Indeed, what was the problem with the existing implementation in master?
Sorry, my bad. I don't remember what I was thinking at that time.
That's it for me now. Sorry for the super slow reaction time... ping me when I should take a look again. Thanks @giorgiop!
(force-pushed from c29c00b to ea666c5)
@agramfort I finally had time to resume this one :)
Looks OK at first sight. Maybe @ogrisel has some time to have a look?
sklearn/linear_model/base.py
Outdated
normalize : boolean, optional, default False
    If True, the regressors X will be normalized before regression.
    When the regressors are normalized, the fitted `coef_` are the same
I don't quite understand this. Did you mean "fitted coef_ are of the same scale"? How will they be the same?
I agree this may not be the best wording. The discussion is in #5447 with @GaelVaroquaux. Let me know if you guys have a better, more concise way to explain this property here.
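The property being debated can be checked numerically. A minimal numpy sketch (not the scikit-learn implementation): fit least squares on centered, unit-norm columns, then divide by the column norms, and you recover exactly the coefficients of the unnormalized fit, which is presumably what "the same" was meant to convey:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(60, 3) * np.array([1.0, 10.0, 100.0])  # wildly different scales
y = X @ np.array([1.0, 2.0, 3.0]) + 0.01 * rng.randn(60)

# center X and y, then scale columns of X to unit L2 norm
Xc, yc = X - X.mean(axis=0), y - y.mean()
norms = np.linalg.norm(Xc, axis=0)
coef_n = np.linalg.lstsq(Xc / norms, yc, rcond=None)[0]

# dividing by the norms recovers the coefficients of the unnormalized fit
coef = coef_n / norms
coef_direct = np.linalg.lstsq(Xc, yc, rcond=None)[0]
print(np.allclose(coef, coef_direct))  # True
```

Rescaling a column of X by a constant rescales its coefficient by the inverse constant, so the two fits are equivalent up to that bookkeeping.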
(force-pushed from effdf97 to c822947)
sklearn/linear_model/base.py
Outdated
normalize : boolean, optional, default False
    If True, the regressors X will be normalized before regression.
    This parameter is ignored when `fit_intercept` is set to `False`.
OK. This is what I infer: when X is scaled down to unit variance, the squared loss term goes up by a certain factor (I'm not sure we can determine it precisely, because y is not normalized), hence alpha has to be set differently for different numbers of samples to achieve the same effect. Even when X is scaled to unit length there is this effect, but to a much lesser extent.
Should we just write something like "Setting normalized equals to true scales down the input regressors to unit length. Note that this makes the hyperparameters learnt more robust and almost independent of the number of samples. The same property contd ... "?
I am fine with the current formulation.
It's just the "fitted coef_ are the same" phrasing that I am worried about.
indeed. "same" is not correct.
+1 for
Note that this makes the hyperparameters learnt more robust and almost independent of the number of samples.
thx @MechCoder
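The suggested wording ("hyperparameters learnt are almost independent of the number of samples") can be demonstrated with a toy ridge solver. This is a sketch of the property, not scikit-learn's code, and the function name is made up: with centered, unit-L2-norm columns, the same alpha produces nearly the same shrunk coefficients whether we fit on 2,000 or 20,000 samples.

```python
import numpy as np

def ridge_normalize(X, y, alpha):
    """Toy ridge fit on centered, unit-L2-norm columns; coefficients are
    returned in the original units (a sketch, not sklearn's implementation)."""
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    norms = np.linalg.norm(Xc, axis=0)
    Xn = Xc / norms
    G = Xn.T @ Xn + alpha * np.eye(X.shape[1])   # regularized Gram matrix
    return np.linalg.solve(G, Xn.T @ yc) / norms

rng = np.random.RandomState(0)
w_true = np.array([1.0, 2.0, 3.0])

def sample(n):
    X = rng.randn(n, 3)
    return X, X @ w_true + 0.01 * rng.randn(n)

X_small, y_small = sample(2000)
X_big, y_big = sample(20000)
w_small = ridge_normalize(X_small, y_small, alpha=1.0)
w_big = ridge_normalize(X_big, y_big, alpha=1.0)
# the same alpha yields nearly identical (shrunk) coefficients at both sizes
print(np.round(w_small, 2), np.round(w_big, 2))
```

Because the normalized Gram matrix has unit diagonal regardless of n_samples, the penalty's strength relative to the data term stays comparable, which is the robustness being described.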
Just left 2 comments.
    # inplace_csr_row_normalize_l2 must be changed such that it
    # can return also the norms computed internally

    # transform variance to norm in-place
We can use csr_row_norms which already does this, but we can leave that for later.
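The "transform variance to norm in-place" step presumably relies on the identity sum(x**2) = n * (var + mean**2), which lets column norms be recovered from per-column means and variances without a second pass over the data (for centered columns this reduces to norm = sqrt(n * var)). A quick numpy check of the identity:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(100, 4) * np.array([1.0, 2.0, 3.0, 4.0]) + 5.0
n_samples = X.shape[0]

# sum(x**2) == n * (var + mean**2), so the column L2 norm follows
# directly from the per-column mean and (population) variance
norms_from_var = np.sqrt(n_samples * (X.var(axis=0) + X.mean(axis=0) ** 2))
norms_direct = np.linalg.norm(X, axis=0)
print(np.allclose(norms_from_var, norms_direct))  # True
```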
(force-pushed from 43d0f2c to d40f1f1)
@giorgiop Can you fix the "fitted coef_ are the same" comment for all estimators? After that we can merge.
(force-pushed from d40f1f1 to ec0a97a)
(force-pushed from ec0a97a to acdd1aa)
Done!
Merged @giorgiop
Thanks heaps @giorgiop! 🍻
Thanks to all reviewers for the combined effort!
~~Fixes #2601~~ no more.

TO DO:
- merge `center_data` and `sparse_center_data` into a new private function, and deprecate them
- `# TODO` and `# XXX` markers

AND NOT TO DO (see #2601 and discussion below):
- deprecate `normalize` (signature and property) and introduce `standardize` behaviour
- refactor `center_data` internally + related tests
- `sample_weights` with unused variables (a). Done in [MRG + 1] Improve tests for sample_weight in LinearRegression and Ridge #5526
- `center_data` behaviour w.r.t. `fit_intercept`. Currently, input data is not touched if we are not fitting the intercept. See separate issue: BUG: linear model inputs are not normalized when fit_intercept=False #5799