Conversation

@albertcthomas
Contributor

@albertcthomas commented Dec 19, 2018

Reference Issues/PRs

Fixes #12775

What does this implement/fix? Explain your changes.

  • Add details in the doc about what transform does: essentially G^-1(F(X)), where F is the empirical cdf of the data and G the cdf of the output distribution. For the uniform output distribution this reduces to F(X). Also add in the doc that such a transform is monotonic and thus preserves the rank of the values along each feature (see the sketch after this list).
  • Simplify transform for the uniform output distribution, since the cdf and ppf of the uniform distribution on [0, 1] are the identity function.
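
A minimal sketch of the mapping described above (an illustration, not the scikit-learn implementation; the helper name and the interpolation-based estimate of the empirical cdf are assumptions):

import numpy as np
from scipy import stats

def quantile_transform_sketch(x_train, x, output_distribution='uniform'):
    # F: empirical cdf estimated from the training data
    quantiles = np.sort(x_train)
    ranks = np.linspace(0, 1, len(quantiles))
    f_x = np.interp(x, quantiles, ranks)
    if output_distribution == 'uniform':
        # cdf and ppf of the uniform distribution on [0, 1] are the
        # identity function, so G^-1(F(x)) reduces to F(x)
        return f_x
    # G^-1: ppf of the requested scipy.stats distribution, e.g. 'norm'
    # (scikit-learn additionally clips so the extremes do not map to +/- inf)
    return getattr(stats, output_distribution).ppf(f_x)

Both F and G^-1 are non-decreasing, which is why the composition is monotonic and rank-preserving.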

Any other comments?

Check whether we can use the samples instead of n_quantiles. I will try this on my side and open a separate PR if successful.

@albertcthomas changed the title from [WIP] ENH simplify transform for uniform output in QuantileTransformer to [MRG] ENH simplify transform for uniform output in QuantileTransformer on Dec 19, 2018
@albertcthomas
Contributor Author

The examples are unchanged:
uniform
gaussian

@albertcthomas
Contributor Author

Rendered doc

# for inverse transform, match a uniform distribution
with np.errstate(invalid='ignore'):  # hide NaN comparison warnings
    X_col = output_distribution.cdf(X_col)
if output_distribution == 'normal':
Member

How about we use != 'uniform' to make this a bit more future-ready?

Contributor Author

and still use getattr(stats, output_distribution) as well?
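
A hypothetical sketch of the combination being discussed (the function name is an assumption, and 'norm' is the scipy.stats spelling of the normal distribution):

import numpy as np
from scipy import stats

def to_uniform(X_col, dist_name):
    # with != 'uniform', every non-uniform output goes through its cdf
    if dist_name != 'uniform':
        dist = getattr(stats, dist_name)  # e.g. stats.norm
        # for inverse transform, match a uniform distribution
        X_col = dist.cdf(X_col)
    return X_col

print(to_uniform(np.array([-1.96, 0.0, 1.96]), 'norm'))  # ~[0.025 0.5 0.975]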

Member

Yeah, okay... I'm not too fussed.

@albertcthomas
Contributor Author

Ok I just fixed a typo in the doc. Thanks for the reviews @jnothman and @peterkinalex.

# find the value to clip the data to avoid mapping to
# infinity. Clip such that the inverse transform will be
# consistent
clip_min = stats.norm.ppf(BOUNDS_THRESHOLD - np.spacing(1))
Member

Maybe we should clip the data when output_distribution == 'uniform' to avoid backward incompatibility and to keep inverse_transform consistent with transform? though the difference seems small (1e-7)
(Actually I agree that we don't need to clip when output_distribution == 'uniform', but if we decide to do so, we'll need to update inverse_transform accordingly)
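
An illustration of why the clip exists for the normal output distribution (assuming BOUNDS_THRESHOLD = 1e-7, its value in scikit-learn's preprocessing module):

import numpy as np
from scipy import stats

BOUNDS_THRESHOLD = 1e-7

# exact 0 or 1 quantiles map to +/- infinity under the normal ppf
print(stats.norm.ppf(0.0))  # -inf
print(stats.norm.ppf(1.0))  # inf

# clipping just inside the threshold keeps the output finite
print(stats.norm.ppf(BOUNDS_THRESHOLD - np.spacing(1)))        # about -5.2
print(stats.norm.ppf(1 - (BOUNDS_THRESHOLD - np.spacing(1))))  # about 5.2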

Member

gentle ping if you have time here :) @albertcthomas

Contributor Author

In fact I think that for the uniform distribution the clipping is done at L2246:

X_col[upper_bounds_idx] = upper_bound_y
X_col[lower_bounds_idx] = lower_bound_y

which is done for both transform and inverse_transform.

Member

But we're talking about different things, right? @albertcthomas

import numpy as np
from sklearn.preprocessing import QuantileTransformer
X = [[1], [2], [3], [4]]
qt = QuantileTransformer(n_quantiles=10, random_state=0)
qt.fit_transform(X)

Before your PR:

array([[9.99999998e-08],
       [3.33333333e-01],
       [6.66666667e-01],
       [9.99999900e-01]])

After your PR:

array([[0.        ],
       [0.33333333],
       [0.66666667],
       [1.        ]])

Although the new version might be more reasonable.

Member

I'm even wondering whether we need clip_min and clip_max when output_distribution == 'normal'.

Member

The easiest solution IMO is to avoid the backward incompatibility (i.e., clip when output_distribution == 'uniform') and open an issue to discuss whether we need clip_min and clip_max.

Contributor Author

Ah yes thanks for the clarification.

Member

For me it's a bug fix, as there is no reason to clip for the uniform distribution. The difference is also super tiny (on the order of float32 eps, which is what is used for BOUNDS_THRESHOLD). I am fine with the current PR as proposed.

Member

> For me it's a bug fix, as there is no reason to clip for the uniform distribution. The difference is also super tiny (on the order of float32 eps, which is what is used for BOUNDS_THRESHOLD). I am fine with the current PR as proposed.

Happy to regard it as a bug fix.
If so, we need to update inverse_transform accordingly. (around L2222)

@albertcthomas
Contributor Author

I should have some time to take your review into account in the next few weeks @qinhanmin2014. Thanks for the review and sorry for the delay.

@albertcthomas
Contributor Author

I think we should be good now. I also added a test that fails if we take BOUNDS_THRESHOLD into account for the uniform distribution.
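
A hypothetical sketch of what such a test can look like (not the actual test added in the PR; the test name and n_quantiles are assumptions):

import numpy as np
from sklearn.preprocessing import QuantileTransformer

def test_uniform_output_not_clipped():
    X = np.array([[1.0], [2.0], [3.0], [4.0]])
    qt = QuantileTransformer(n_quantiles=4)  # output_distribution='uniform' is the default
    Xt = qt.fit_transform(X)
    # without BOUNDS_THRESHOLD clipping, the extremes map exactly to 0 and 1
    assert Xt.min() == 0.0
    assert Xt.max() == 1.0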

Member

@qinhanmin2014 left a comment

LGTM, I don't think we need a what's new entry since the difference is small, but feel free to add one if you want.

@qinhanmin2014
Member

We need someone to double-check this PR, ping @jnothman @agramfort

@albertcthomas
Contributor Author

Thanks @qinhanmin2014. It could be useful to have the opinion of @glemaitre or @ogrisel as they worked a lot on the original implementation.

upper_bounds_idx = (X_col + BOUNDS_THRESHOLD >
                    upper_bound_x)
if output_distribution == 'uniform':
    lower_bounds_idx = (X_col == lower_bound_x)
Member

This is still a float comparison, isn't it?

Suggested change:
-lower_bounds_idx = (X_col == lower_bound_x)
+lower_bounds_idx = np.isclose(X_col, lower_bound_x)

Contributor Author

The test that I added fails with np.isclose. For transform, lower_bound is a float, but it is exactly one element of X_col (the min). For inverse_transform, lower_bound is an int.
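
An illustration with hypothetical values of why np.isclose would also flag values merely near the bound:

import numpy as np

lower_bound_x = 0.0
X_col = np.array([0.0, 1e-9, 0.5])
print(X_col == lower_bound_x)            # [ True False False]
print(np.isclose(X_col, lower_bound_x))  # [ True  True False] (default atol=1e-8 also matches 1e-9)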

Member

@glemaitre left a comment

Looks good

Co-Authored-By: albertcthomas <[email protected]>
@glemaitre merged commit 0f09297 into scikit-learn:master Feb 26, 2019
@albertcthomas
Contributor Author

Thanks @glemaitre!

@glemaitre
Member

glemaitre commented Feb 26, 2019 via email

xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
koenvandevelde pushed a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019

Development

Successfully merging this pull request may close these issues.

DOC: add details to QuantileTransformer documentation
