Zero init for the intercept of GLMs #7
Conversation
```diff
-assert_allclose(model.coef_, np.r_[coef, coef], rtol=rtol)
+atol = 1e-6
+assert model.intercept_ == pytest.approx(intercept, rel=rtol, abs=atol)
+assert_allclose(model.coef_, np.r_[coef, coef], rtol=rtol, atol=atol)
```
Note that I had to add a fairly large atol for this test to pass with the new init. Otherwise the test would fail for a few random seeds, with either wide or long datasets and various distribution families.
I am not very happy about this, but I am not sure what I can do about it.
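For context, `assert_allclose` passes when `|actual - desired| <= atol + rtol * |desired|` elementwise, so a nonzero `atol` mainly relaxes the check for near-zero entries; a minimal illustration of why `rtol` alone is brittle there:

```python
import numpy as np
from numpy.testing import assert_allclose

desired = np.array([1.0, 0.0])
actual = desired + 5e-7  # tiny absolute error on every entry

# rtol alone fails on the zero entry: its tolerance is rtol * |0| = 0.
try:
    assert_allclose(actual, desired, rtol=1e-4)
    rtol_only_passed = True
except AssertionError:
    rtol_only_passed = False

# Adding atol=1e-6 absorbs the absolute error on the zero entry.
assert_allclose(actual, desired, rtol=1e-4, atol=1e-6)
print(rtol_only_passed)  # False
```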
```python
# Possible causes: 1 error in function or gradient evaluation;
# 2 rounding error dominate computation.
pytest.xfail()
```
This would no longer fail, so I removed the xfail. I am not sure if it's related to the change in this PR or if it is a consequence of the increased MAXLS setting in 79ec862.
I think it's because of this PR.
This is not ready to merge: I got the following failures on my local all-random-seeds run (details omitted). On top of these tolerance adjustments, I also observed weirder failures in other tests (details omitted).
Zero init intercept causes |
For Tweedie with power >= 1 (e.g. Poisson) and the identity link, we must initialize away from zero; otherwise we have `y_pred = identity(raw_prediction) = 0`, which is not allowed, i.e. it produces inf in the loss.
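A minimal sketch of that failure mode (assuming the Poisson loss term `y_pred - y * log(y_pred)`, up to constants; names are illustrative):

```python
import numpy as np

y = np.array([1.0, 2.0])
raw_prediction = np.zeros_like(y)  # zero-init intercept, no feature contribution
y_pred = raw_prediction            # identity link: y_pred = raw_prediction = 0

with np.errstate(divide="ignore", invalid="ignore"):
    # Poisson loss term (up to constants): the log diverges at y_pred = 0.
    loss = y_pred - y * np.log(y_pred)

print(loss)  # [inf inf] -- the very first loss/gradient evaluation blows up
```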
Ok, I pushed a hack to make all tests pass (I tried with the default 42 random seed and am currently waiting for 'all' to complete). I had to do several hacks:
All in all, I am not sure it's worth it. The smart intercept init in
It took 27 min, but it completed successfully on this PR.
TODO next: quantify impact on convergence speed on realistic datasets. |
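As a possible starting point for that (a toy sketch with synthetic data and illustrative names, not the realistic datasets meant here): fit an unpenalized Poisson GLM with scipy's L-BFGS-B and compare iteration counts for a zero intercept init versus a `link(mean(y))` "smart" init:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
y = rng.poisson(np.exp(X @ rng.normal(scale=0.3, size=d) + 1.0)).astype(float)
X_aug = np.hstack([X, np.ones((n, 1))])  # last column models the intercept

def neg_loglik(w):
    # Unpenalized Poisson negative log-likelihood (up to constants), log link.
    raw = X_aug @ w
    return np.mean(np.exp(raw) - y * raw)

def grad(w):
    raw = X_aug @ w
    return X_aug.T @ (np.exp(raw) - y) / n

def fit(intercept_init):
    w0 = np.zeros(d + 1)
    w0[-1] = intercept_init
    return minimize(neg_loglik, w0, jac=grad, method="L-BFGS-B")

res_zero = fit(0.0)                 # zero init (this PR)
res_smart = fit(np.log(y.mean()))   # "smart" init: log link applied to mean(y)
print(res_zero.nit, res_smart.nit)  # iteration counts to compare
```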
```python
# infinite gradients
CLIP = 1e100
np.clip(grad_per_sample, -CLIP, CLIP, out=grad_per_sample)
np.clip(loss, None, CLIP, out=loss)
```
This might not be needed. We might just want to silence the RuntimeWarning if we know that the code is robust to inf values, as suggested in scikit-learn#23314 (comment)
The comment means that the line search handles inf correctly. It does not mean that the result will become finite again if all gradient elements are inf.
If possible, we should get rid of these clips.
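For what it's worth, a quick check of what these clips do (illustrative values): `np.clip` maps `inf` to the finite bound, which keeps LBFGS' internal arithmetic finite, even though the direction information of an all-inf gradient is lost:

```python
import numpy as np

grad_per_sample = np.array([1.0, np.inf, -np.inf])
loss = np.array([np.inf])
CLIP = 1e100

np.clip(grad_per_sample, -CLIP, CLIP, out=grad_per_sample)  # inf -> +/-1e100
np.clip(loss, None, CLIP, out=loss)                         # inf -> 1e100

print(grad_per_sample, loss)
```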
lorentzenchr left a comment
@ogrisel Thanks for investigating this.
sklearn/linear_model/_glm/glm.py (outdated)
```python
# link function (in particular for Tweedie with p>=1
# with identity link) to be able to compute the first
# gradient.
coef[-1] = 64 * np.finfo(loss_dtype).eps
```
The magic 64 again 😉
I think it works for a variety of small eps values, but getting too close to the minimum eps would make convergence less numerically stable for some combinations of family / link.
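For scale (assuming float64, as in the diff above): `64 * eps` is about `1.4e-14`, i.e. effectively zero as an intercept but strictly positive, so logs and link inverses stay finite:

```python
import numpy as np

eps = np.finfo(np.float64).eps  # ~2.22e-16
intercept_init = 64 * eps       # ~1.42e-14

print(intercept_init)
print(np.isfinite(np.log(intercept_init)))  # True: log(1.42e-14) is finite
```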
```python
# XXX: Do we have any theoretical guarantees why this should be the
# case?
```
There are papers about gradient descent and minimum-norm solutions.
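For intuition (a toy least-squares sketch, not the GLM/LBFGS setting of this PR): plain gradient descent started at zero never leaves the row space of `X`, so on an underdetermined problem it converges to the minimum-norm (pseudoinverse) solution:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 20))  # wide: more features than samples
y = rng.normal(size=5)

w = np.zeros(20)  # zero init keeps every iterate in the row space of X
lr = 0.01
for _ in range(20_000):
    w -= lr * X.T @ (X @ w - y)  # gradient of 0.5 * ||X @ w - y||**2

w_min_norm = np.linalg.pinv(X) @ y
print(np.allclose(w, w_min_norm, atol=1e-6))  # True
```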
Do they cover L-BFGS, line search, and the combination of stopping criteria we use?
Not a direct answer to the question above, but the following fairly recent paper seems relevant:
In particular, this paper is interesting because it also investigates the case of under-parameterized linear classifiers trained on linearly separable data.
Maybe @genji would like to comment on this PR?
I think we should conduct some experiments to see the impact of the intercept init on the test accuracy.
We could also explore how to do the smart intercept init and then re-project onto the data span, as done in the paper above (not sure it makes sense for all GLMs).
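A hedged sketch of what such a re-projection onto the data span could look like (names illustrative; projection onto the row space of `X` via the Gram matrix, which leaves the predictions unchanged):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 20))   # n_samples=5, n_features=20
w_smart = rng.normal(size=20)  # hypothetical smartly initialized coefficients

# Orthogonal projector onto the row space of X (the span of the data points):
P = X.T @ np.linalg.pinv(X @ X.T) @ X
w_proj = P @ w_smart

print(np.allclose(X @ w_proj, X @ w_smart))  # True: same predictions, smaller norm
```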
After giving it a lot of thought: if it's too hard to reach the minimum-norm solution for GLMs, then it's maybe not worth it.
I agree. However, I think it's still interesting to understand whether gradient clipping is needed, because the same convergence problem can happen with
Co-authored-by: Christian Lorentzen <[email protected]>
Next steps:
I'm closing this, as we can now open PRs directly against scikit-learn main.
Fixes scikit-learn#23670.
This is a PR to scikit-learn#23619 (it needs to be updated to target `main` instead) to make sure that GLMs with `fit_intercept=True` and the default LBFGS solver converge to the minimum norm solution when `alpha=0`.
The main change is to init the intercept at zero (or very close to zero for GLMs with an identity link) instead of the previous smart init of the intercept, which was set to:
This PR also updates the tolerances in some tests and adds a few hacks to keep the model converging in all cases, which seems harder with a zero-init intercept than with the previous "smart init" of the intercept.
Note
I have not yet benchmarked whether using a zero init for the intercept has any impact on convergence speed on a non-toy dataset.
Note 2
`test_ridge.py` has a similar problem:
https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/linear_model/tests/test_ridge.py#L317-L322
and I believe the solution would probably be similar. However, the `Ridge` class has many different solvers, so I am afraid to start investigating before getting a good understanding of the impact in terms of convergence speed and numerical stability in the case of GLMs with the LBFGS solver.