-
-
Notifications
You must be signed in to change notification settings - Fork 26.5k
[MRG] ENH simplify transform for uniform output in QuantileTransformer #12827
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG] ENH simplify transform for uniform output in QuantileTransformer #12827
Conversation
| # for inverse transform, match a uniform distribution | ||
| with np.errstate(invalid='ignore'): # hide NaN comparison warnings | ||
| X_col = output_distribution.cdf(X_col) | ||
| if output_distribution == 'normal': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about we use != 'uniform' to make this a bit more future-ready.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and still use getattr(stats, output_distribution) as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, okay... I'm not too fussed.
| # for inverse transform, match a uniform distribution | ||
| with np.errstate(invalid='ignore'): # hide NaN comparison warnings | ||
| X_col = output_distribution.cdf(X_col) | ||
| if output_distribution == 'normal': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, okay... I'm not too fussed.
|
Ok I just fixed a typo in the doc. Thanks for the reviews @jnothman and @peterkinalex. |
| # find the value to clip the data to avoid mapping to | ||
| # infinity. Clip such that the inverse transform will be | ||
| # consistent | ||
| clip_min = stats.norm.ppf(BOUNDS_THRESHOLD - np.spacing(1)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should clip the data when output_distribution == 'uniform' to avoid backward incompatibility and to keep inverse_transform consistent with transform? though the difference seems small (1e-7)
(Actually I agree that we don't need to clip when output_distribution == 'uniform', but if we decide to do so, we'll need to update inverse_transform accordingly)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gentle ping if you have time here :) @albertcthomas
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In fact I think that for the uniform distribution the clipping is done L2246
X_col[upper_bounds_idx] = upper_bound_y
X_col[lower_bounds_idx] = lower_bound_ywhich is done for transform and inverse_transform
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But we're talking about different things right? @albertcthomas
import numpy as np
from sklearn.preprocessing import QuantileTransformer
rng = np.random.RandomState(0)
X = [[1], [2], [3], [4]]
qt = QuantileTransformer(n_quantiles=10, random_state=0)
qt.fit_transform(X)
Before your PR:
array([[9.99999998e-08],
[3.33333333e-01],
[6.66666667e-01],
[9.99999900e-01]])
After your PR
array([[0. ],
[0.33333333],
[0.66666667],
[1. ]])
Although the new version might be more reasonable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm even wondering whether we need clip_min and clip_max when output_distribution == 'normal'.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the easiest solution IMO will be to avoid backward incompatibility (i.e., clip when output_distribution == 'uniform') and open an issue to discuss whether we need clip_min and clip_max.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes thanks for the clarification.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for me it's a bug fix as there is no reason to clip for uniform distribution. Difference is also super tiny (order of float32 eps which is what is used for BOUND_THRESHOLDS). I am fine with the current PR as proposed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for me it's a bug fix as there is no reason to clip for uniform distribution. Difference is also super tiny (order of float32 eps which is what is used for BOUND_THRESHOLDS). I am fine with the current PR as proposed.
Happy to regard it as a bug fix.
If so, we need to update inverse_transform accordingly. (around L2222)
|
I should have some time to take your review into account in the next few weeks @qinhanmin2014. Thanks for the review and sorry for the delay |
|
I think we should be good now. I also added a test that fails if we take BOUNDS_THRESHOLD into account for the uniform distribution. |
qinhanmin2014
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, I don't think we need a what's new entry since the difference is small, but feel free to add one if you want.
|
We need someone to double check this PR, ping @jnothman @agramfort |
|
Thanks @qinhanmin2014. It could be useful to have the opinion of @glemaitre or @ogrisel as they worked a lot on the original implementation. |
| upper_bounds_idx = (X_col + BOUNDS_THRESHOLD > | ||
| upper_bound_x) | ||
| if output_distribution == 'uniform': | ||
| lower_bounds_idx = (X_col == lower_bound_x) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is still a float comparison, isn't it?
| lower_bounds_idx = (X_col == lower_bound_x) | |
| lower_bounds_idx = np.isclose(X_col, lower_bound_x) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test that I added is failing with np.isclose. For transform, lower_bound is a float but one element of X_col (the min). For inverse_transform, lower_bound is an int.
glemaitre
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good
Co-Authored-By: albertcthomas <[email protected]>
|
Thanks @glemaitre! |
|
Thanks albert
…On Tue, 26 Feb 2019 at 17:46, Albert Thomas ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In sklearn/preprocessing/data.py
<#12827 (comment)>
:
> # find index for lower and higher bounds
with np.errstate(invalid='ignore'): # hide NaN comparison warnings
- lower_bounds_idx = (X_col - BOUNDS_THRESHOLD <
- lower_bound_x)
- upper_bounds_idx = (X_col + BOUNDS_THRESHOLD >
- upper_bound_x)
+ if output_distribution == 'normal':
+ lower_bounds_idx = (X_col - BOUNDS_THRESHOLD <
+ lower_bound_x)
+ upper_bounds_idx = (X_col + BOUNDS_THRESHOLD >
+ upper_bound_x)
+ if output_distribution == 'uniform':
+ lower_bounds_idx = (X_col == lower_bound_x)
The test that I added is failing with np.isclose. For transform,
lower_bound is a float but one element of X_col (the min). For
inverse_transform, lower_bound is an int.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#12827 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AHG9P3lbP3k67BEXzcse98YS2gKPuiirks5vRWTRgaJpZM4ZZ1Gi>
.
--
Guillaume Lemaitre
INRIA Saclay - Parietal team
Center for Data Science Paris-Saclay
https://glemaitre.github.io/
|
…rmer (scikit-learn#12827)" This reverts commit ad918d4.
…rmer (scikit-learn#12827)" This reverts commit ad918d4.


Reference Issues/PRs
Fixes #12775
What does this implement/fix? Explain your changes.
Any other comments?
Check whether we can use the samples instead ofI will try this on my side and open a separate PR if successful.n_quantiles.