Fix PLS coef_ bug #7819
Conversation
|
Yes, it looks like this is a bug. Though, we could choose to report the coefficients with respect to the unscaled data, and then not do the scaling by x_std_ in predict. |
|
Yes, we could also keep the normalization by (1/x_std_) in coef_ and remove it from predict. However, we still need to remove the mean of X in the predict method, so it seems more consistent to leave the full standardization of X (X -> (X - x_mean_)/x_std_) in the predict method. Actually, to be fully consistent we should remove the y_std_ factor from coef_ as well and move it to the predict method. |
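To make the two options above concrete, here is a minimal sketch (illustrative only, not the scikit-learn source; x_mean_, x_std_ and y_mean_ stand for the usual fitted standardization statistics, and the y_std_ factor is left out) of how coef_ and predict have to agree:

```python
import numpy as np

# Option A (roughly what this PR settles on): coef_ maps *standardized* X,
# so predict performs the full standardization of X before applying coef_.
def predict_with_standardized_coef(X, coef, x_mean, x_std, y_mean):
    return ((X - x_mean) / x_std) @ coef + y_mean

# Option B (the alternative mentioned above): coef_ already absorbs the
# 1/x_std_ factor, so predict only needs to remove the mean of X.
def predict_with_raw_space_coef(X, coef_raw, x_mean, y_mean):
    return (X - x_mean) @ coef_raw + y_mean

# The bug was a mix of the two: coef_ contained a 1/x_std_ factor *and*
# predict divided X by x_std_, so the prediction was divided by x_std_ twice.
```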
|
Ping @arthurmensch for an opinion...? |
|
Is it possible to have a test or improve existing ones? Maybe compare with the R solution? |
|
I'm not an R user, but it's easy to show with a simple example that the proposed fix produces estimates with a significantly better r2: |
|
Yes, something like that might be an appropriate test, but not with a separate validation set: can we just test that the model is able to accurately fit and predict a training set? |
|
Either that or build a test around a toy case or something written up in a paper if possible. |
|
You will get the same results if you look at r2 on the training set. I think the change to coef_ is too trivial to need a paper written about it. |
|
Maybe this is a clearer example of the bug. Just multiplying X by a factor of 1000, you see the big difference in r2 between the current implementation and the new one: |
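The example referred to here is not preserved in this thread; a minimal sketch of what such a check could look like (the data shapes and the factor of 1000 are assumptions, this is not the author's original snippet):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.RandomState(0)
Y = rng.randn(1000, 5)
X = Y @ rng.randn(5, 10) + 2 * rng.randn(1000, 10)

# Fitting on X and on 1000 * X should give essentially the same r2,
# since PLSRegression standardizes X internally by default.
pls = PLSRegression(n_components=5)
r2_raw = pls.fit(X, Y).score(X, Y)
r2_big = pls.fit(1000 * X, Y).score(1000 * X, Y)
print(r2_raw, r2_big)  # nearly equal with the fix, very different on master
```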
|
I'm not saying you need to write a paper about it. I'm saying that a good test is still needed.
|
|
Or maybe can you provide the solution with an R or Matlab package showing the numbers match? Your code is certainly correct, but these tests are the guarantee that we don't break it later... thanks |
|
Not an expert at all, but can we not compare the score on non-scaled and scaled data and check that they are comparable, i.e. something like this:
import numpy as np
from numpy.testing import assert_approx_equal
from sklearn.cross_decomposition import PLSRegression

T = 1000
N = 5
M = 10
Q = np.random.randn(N, M)
Y = np.random.randn(T, N)
X = np.dot(Y, Q) + 2 * np.random.randn(T, M)
X_scaled = 1000 * X

pls = PLSRegression(n_components=5)
pls.fit(X, Y)
score = pls.score(X, Y)
pls.fit(X_scaled, Y)
score_scaled = pls.score(X_scaled, Y)

# The two scores should match, since PLSRegression standardizes X internally.
assert_approx_equal(score, score_scaled)
This snippet fails on master and passes on this PR. |
|
For completeness, the scores are quite different on master: |
|
What is the current status of this issue? I use these methods extensively, so I can help with verifying. A quick note though - lesteve's test will not fail if the PLSRegression object is called with scale=False. Also, there is no strict necessity for the algorithm to be run with standardized data; in fact in my field (Metabolomics/Chemometrics), we use it with a varying range of scaling factors, including mean centring only. Why not just remove the scaling from the object - maybe even make mean centring optional (so RobustScalers can be used before...) and keep the behaviour in a way that can be nicely used in an sklearn Pipeline, as with other sklearn objects? Btw, I have compared this algorithm recently with other software (commercial and open source), and so far everything worked flawlessly ;) - but I always run scale=False and handle the scaling explicitly myself... Let me know what the status is, maybe there is something I can help with. |
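A minimal sketch of the workflow described above (external scaling handled in a Pipeline, with the PLS object's own scaling turned off); the data and preprocessing choices here are illustrative, not part of the PR:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cross_decomposition import PLSRegression

rng = np.random.RandomState(0)
Y = rng.randn(200, 3)
X = Y @ rng.randn(3, 8) + rng.randn(200, 8)

# Scaling is done by an explicit preprocessing step; the PLS object itself is
# told not to scale (scale=False), so the two responsibilities stay separate.
# Note that StandardScaler in the pipeline standardizes X only, not Y.
model = make_pipeline(StandardScaler(), PLSRegression(n_components=3, scale=False))
model.fit(X, Y)
print(model.score(X, Y))
```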
Help would be great. I don't think anyone on the development team is an expert on PLS. |
|
As it stands, I am of the opinion that this patch is correct, but from a more "strategic/big picture" point of view, I am in favour of just unequivocally splitting/removing scaling considerations from the algorithm/PLS object. I'm basing this on my experience with this code: it works fine, as long as I tell it to leave these things to me! Also, is this not related to pull request #6077? |
Good to know. What do you think about the test in #7819 (comment)? Could you suggest a better one? Maybe something that we can check in a textbook or against another implementation.
Thanks for digging this up, I knew there was a similar PR but I was not able to find it in a reasonable amount of time. There may be related issues lying around as well. The problem with #6077 was that there seemed to be a lot of unrelated changes, which is probably the main reason why it was never merged. |
|
pinging @arthurmensch just in case he has some memories of #6077. |
We should go for the small fix first. To be completely honest, I don't think it is very likely that we are going to drop the scaling from the PLS object. |
|
I think that the fix is important, and that most users probably want to have the scaling done right by default, without handling it explicitly. So I'm in favor of merging that one, but with a test, e.g. based on the above example. |
Remove x_std divisor in coef_ because X is normalized in fit function
PEP8
|
Ah, it's been a few weeks since I looked into this so I forgot completely, but I had some extensions to @lesteve's previous test to suggest... They are just some extra checks, to see if the behaviour of scale is still coherent when using the StandardScaler. Basically, additionally checking that using non-scaled data with scale=True is the same as using pre-scaled data with scale=False, and that pre-scaled data should give the same result with either scale=True or scale=False. What do you think? Btw, running this through the current version, all of them fail.
import numpy as np
from numpy.testing import assert_approx_equal
from sklearn.cross_decomposition import PLSRegression
from sklearn.preprocessing import StandardScaler

T = 1000
N = 5
M = 10
Q = np.random.randn(N, M)
Y = np.random.randn(T, N)
X = np.dot(Y, Q) + 2 * np.random.randn(T, M)
X = 1000 * X

# One instance of the PLS regression object with scale=False...
pls_nscale = PLSRegression(n_components=5, scale=False)
# ...and another with scale=True
pls = PLSRegression(n_components=5, scale=True)

# Fit both to the "original" data
pls.fit(X, Y)
pls_nscale.fit(X, Y)
# Obtain both scores - they should be different here
score_xoriginal_scale_off = pls_nscale.score(X, Y)
score_xoriginal_scale_on = pls.score(X, Y)

# Scale the data
X_scaled = StandardScaler().fit_transform(X)
# Refit...
pls.fit(X_scaled, Y)
pls_nscale.fit(X_scaled, Y)
# Re-score: now they should be the same, as the data was pre-scaled
score_xscaled_scale_on = pls.score(X_scaled, Y)
score_xscaled_scale_off = pls_nscale.score(X_scaled, Y)

# Scaling inside the PLS object should give the same behaviour as scaling the data independently
assert_approx_equal(score_xscaled_scale_off, score_xoriginal_scale_on)
assert_approx_equal(score_xscaled_scale_on, score_xoriginal_scale_on)
# If the data has been pre-scaled, there should be no difference between the two anyway
assert_approx_equal(score_xscaled_scale_on, score_xscaled_scale_off) |
|
Nice, thanks @Gscorreia89. I think it makes sense to add these tests as well. |
|
I guess the scores are not that different so we could relax the equality check. |
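For reference, numpy's assert_approx_equal takes a significant-digits argument, so relaxing the check could look like this (the choice of 2 significant digits is just an illustrative value, not something agreed on in the thread):

```python
from numpy.testing import assert_approx_equal

# Require agreement to only 2 significant digits instead of the default 7.
assert_approx_equal(score_xscaled_scale_on, score_xoriginal_scale_on, significant=2)
```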
|
Hum... True. PS: thanks for the ```py tip ;) |
|
Let me add a related point: for consistency with linear_model, we should probably replace |
I think I saw in another issue that |
|
But we should not demean X if the user does not want to. |
|
@bthirion - agree, although in this case it's quite standard to do so (same behaviour as in PCA). I already mentioned before that I personally would prefer to have the scaling completely off and tell the user to apply the StandardScaler or equivalent anyway. But I think there are more pressing issues here: fix the scaler behaviour, then solve a couple of other PLS related issues (including confirming the overlap of this with #6077, which was never fully applied?). |
|
Regarding #6077: clearly both PRs are tackling the same issue indeed. I have the impression that your fix is more appropriate, but this has to be checked. |
|
I am in favour of merging this first and tackling deeper issues in another PR. |
|
A couple of things before merging: the problem with #6077 is that there is a lot going on there, especially with point 3. Point 2 brings another scaling issue - what is happening to the scaling of Y? Now a bit of my opinion and arguments regarding the issue with the coefficients: scaling changes what the model actually "sees" in the data, as expected, so I am of the opinion that the model coefficients should always reflect how the model actually works, not a back-projection to the original data - because the R2, score, projections etc. we are obtaining are only valid for the transformed data, not for the original. These are not simply standardized regression coefficients. The parameters of the PLS components that maximized covariance between X and Y in the scaled data will not be the same, even up to a scaling factor, as those that would have been found if the user hadn't scaled the data, so back-projecting the coefficients will not give "the direction of maximum covariance in the raw data space", just a non-representative and possibly confusing view. I am of the opinion that if the user scaled the data, it's on them to remember what the interpretation of the coefficients should be. |
|
A clarification of something I wrote in a previous post that was not absolutely true: "because the R2, score, projections etc. we are obtaining are only valid for the transformed data, not for the original." Not quite true, as the R2 is being calculated on the back-scaled data correctly (or will be once predict is fixed) - but the model "maximized R2" by fitting on the scaled data, so by not scaling the data it might be possible to get a better R2 for the non-scaled Y. Some of these things might be quite subtle or show up more obviously in multi-Y cases. Most PLS software and open source implementations follow the convention of providing R2 values obtained on the scaled data, and on the original data if no scaling is selected. |
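A small sketch of the point above, comparing the coefficients fitted with and without internal scaling (the data and shapes are illustrative assumptions); in general the two coefficient matrices are not related by a simple per-feature rescaling:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.RandomState(0)
Y = rng.randn(300, 2)
X = Y @ rng.randn(2, 6) + rng.randn(300, 6)
X[:, 0] *= 100  # give one feature a much larger variance

coef_scaled = PLSRegression(n_components=2, scale=True).fit(X, Y).coef_
coef_unscaled = PLSRegression(n_components=2, scale=False).fit(X, Y).coef_

# The entry-wise ratio varies, i.e. the two solutions differ by more than a
# per-feature rescaling: scaling changes which directions the model picks.
print(coef_scaled / coef_unscaled)
```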
Cosmetic improvement of whats_new
|
I am in favour of merging this as is, i.e. simple fix + simple test. @Gscorreia89 feel free to open PRs about adding more tests. |
|
I'll wait for the CIs and merge this. |
|
I think the predict method, as mentioned in #6077, should be patched next. Then it's more a matter of documentation. |
|
Merging this one, thanks a lot for your input @Gscorreia89 and @bthirion ! |
|
@Gscorreia89 do you have a good reference for using PLS with 1d y? |
Reference Issue
What does this implement/fix? Explain your changes.
In the PLS class, the predicted Y was effectively being divided twice by x_std_: the coef_ matrix is estimated on standardized X inside the fit function (and was additionally divided by x_std_), while the predict method standardizes X again before applying coef_.
I have removed the x_std_ divisor from self.coef_.
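A minimal sketch of the inconsistency and the fix (the variable names are illustrative stand-ins for the fitted attributes, not the actual scikit-learn source):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(50, 4)
x_mean, x_std = X.mean(axis=0), X.std(axis=0)
y_mean = 0.0
coef = rng.randn(4, 2)  # coefficients mapping *standardized* X to Y (illustrative)

# Old behaviour (sketch): coef_ additionally carried a 1/x_std_ factor,
# while predict standardized X again, so x_std_ was applied twice.
y_buggy = ((X - x_mean) / x_std) @ (coef / x_std[:, None]) + y_mean

# Fixed behaviour (sketch): coef_ maps standardized X, predict standardizes once.
y_fixed = ((X - x_mean) / x_std) @ coef + y_mean
```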
Any other comments?