Skip to content

Conversation

@CloseChoice
Copy link
Contributor

@CloseChoice CloseChoice commented May 17, 2020

fixes #14685

Order output of np.quantile/np.percentile monotonically if quantiles/percentiles are ordered monotonically.

Change the way of performing the linear interpolation from (2) to (1) as given in
https://math.stackexchange.com/questions/907327/accurate-floating-point-linear-interpolation

@CloseChoice CloseChoice force-pushed the BUG-order_percentile-monotonically branch from 717997a to 229f14c Compare May 17, 2020 13:51
@eric-wieser
Copy link
Member

This is a duplicate of #15098. For reviewers:

@CloseChoice
Copy link
Contributor Author

This is a duplicate of #15098. For reviewers:

* `master` uses `lerp(a, b, t) = a*(1-t) + b*t`

* This Pr uses `lerp(a, b, t) = a + (b-a)*t`

* #15098 uses `lerp(a, b, t)  = a + (b-a)*t if t < 0.5 else b - (b-a)*(1-t)`

Hi,
thanks for the answer and you're totally correct concerning my and master's formula. I just realized that there is an open PR when I was done writing my code, so I thought I'd try to give it a chance. Can close this PR though, if you advice to.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually thought of writing this as a hypothesis test but hypothesis module was never imported and I don't know your conventions. But can change it hypothesis for sure if that is desired.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We recently decided we were willing to accept hypothesis tests - if you think the test is easier to read with hypothesis, please use it, so that maintainers like me who haven't used it can get the hang of it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a hypothesis test. Am curious for your feedback. The purpose of this test is not to increase readability but to cover more test cases without writing multiple tests and to find failing egde cases we might not have thought of.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not understand what out means in this context. If you can give me a hint I might be able to simplify the code even more. I guess there is the possibility to get rid of x1 = x1.squeeze(0)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In working out what hint to give you, I ended up doing some cleanup of this code myself. Thanks for drawing it to my attention. It might be best for you to wait until my cleanup goes through, and then to rebase your tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your patience - my cleanup is in. It's unlikely you'll be able to usefully merge the implementation, but you should be able to keep the tests.

Your cleanup made it way easier to implement the montonic lerp. Thanks for that.

@charris charris changed the title Bug order percentile monotonically BUG: Order percentile monotonically May 17, 2020
@eric-wieser
Copy link
Member

Thanks for your patience - my cleanup is in. It's unlikely you'll be able to usefully merge the implementation, but you should be able to keep the tests.

@CloseChoice CloseChoice force-pushed the BUG-order_percentile-monotonically branch from af5dd48 to be592a4 Compare May 21, 2020 10:30
@CloseChoice CloseChoice requested a review from eric-wieser May 21, 2020 10:34
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#import pdb; pdb.set_trace()

Comment on lines 3115 to 3125
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
equals_sorted = np.sort(quantile) == quantile
assert equals_sorted.all()
assert_equal(np.sort(quantile), quantile)

@eric-wieser
Copy link
Member

My fear with this patch is that we also care about the other properties in #14685 (comment) (vs #15098 which claims to satisfy all of them).

Do you think you could put together a hypothesis tests that can prove that those properties are not satisfied with your solution?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
assert np.lib.function_base._lerp(a, b, t) == np.lib.function_base._lerp(b, a, t)
assert np.lib.function_base._lerp(a, b, t) == np.lib.function_base._lerp(b, a, 1-t)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or more likely,

Suggested change
assert np.lib.function_base._lerp(a, b, t) == np.lib.function_base._lerp(b, a, t)
assert np.lib.function_base._lerp(a, b, 1 - (1- t)) == np.lib.function_base._lerp(b, a, 1 - t)

To ensure that the precision of t is the same in both directions

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

@CloseChoice
Copy link
Contributor Author

CloseChoice commented May 26, 2020

My fear with this patch is that we also care about the other properties in #14685 (comment) (vs #15098 which claims to satisfy all of them).

Do you think you could put together a hypothesis tests that can prove that those properties are not satisfied with your solution?

I added the tests but both solutions (mine and #15098) don't satisfy the exact symmetry condition but of course both satisfy the condition with np.isclose.

EDIT: This is outdated. When we apply assert np.lib.function_base._lerp(a, b, 1 - (1- t)) == np.lib.function_base._lerp(b, a, 1 - t) #15098 passes the test while my initial formula fails it

@eric-wieser
Copy link
Member

Even with #16273 (comment)?

@CloseChoice
Copy link
Contributor Author

Even with #16273 (comment)?

Only with #16273 (comment). Using assert np.lib.function_base._lerp(a, b, t) == np.lib.function_base._lerp(b, a, (1 - t)) fails it.

@CloseChoice CloseChoice force-pushed the BUG-order_percentile-monotonically branch from 90eba5f to d895113 Compare May 26, 2020 19:46
@CloseChoice CloseChoice force-pushed the BUG-order_percentile-monotonically branch from 18815e4 to 214e830 Compare May 27, 2020 17:51
Comment on lines 3134 to 3136
width=32),
b= st.floats(allow_nan=False, allow_infinity=False,
width=32))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These need to be 64 for this test to be interesting.

Copy link
Contributor Author

@CloseChoice CloseChoice May 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changing to 64bit leads ot overflows in the basic add and subtract methods:

b=1.7976931348623157e+308
a=-1.7976931348623157e+308
np.subtract(b, a)
*** RuntimeWarning: overflow encountered in subtract

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's potentially a problem, the previous implementation did not fail like that for those cases.

For now, can you do something like clip a and b to reasonable magnitudes? We still want to test the precision of 64-bit, even if we decide we don't care about the range.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the range from -1e300 up to 1e300, seems to work fine.

Are you sure that it wasn't a problem before? The overflow happens in the basic add and subtract methods and should be reproducible for specific combinations for the former implementation aswell since we are still using those methods.

Copy link
Member

@eric-wieser eric-wieser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, assuming @seberg is happy.

Copy link
Member

@seberg seberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, may just merge soon. I cannot say, I love the way the indentation works out for the hypothesis tests. I think I would prefer a hanging indent, maybe of 8 spaces. There may be long lines.

I also wonder if those tests (except the symmetric one maybe?), are actually likely to stress the issue. On first sight, it feels like we may need an absurd amount of fuzzy trials until the test could proof that the old version was bad?

EDIT: To be clear, I am not willing to stress on code style in the tests here...

@seberg seberg self-requested a review June 18, 2020 18:05
@seberg
Copy link
Member

seberg commented Jun 27, 2020

Thanks @CloseChoice, lets put put this in :), I think the tests should cover the relevant paths well, even if I am not sure that all the tests actually stress the monotonicity very strongly.

EDIT: To be clear, if someone feels like doing style/test touch-ups that is of course welcome, but I don't feel its important enough to delay/stress out over.

@seberg seberg merged commit 4d5b255 into numpy:master Jun 27, 2020
@charris charris added 09 - Backport-Candidate PRs tagged should be backported and removed 09 - Backport-Candidate PRs tagged should be backported labels Jun 27, 2020
@CloseChoice
Copy link
Contributor Author

@seberg : Thanks for merging ;)

@glemaitre
Copy link

@CloseChoice @seberg @eric-wieser Thanks for taking care of this issue and sorry to have disappear from the map.
This will be really useful.

def _lerp(a, b, t, out=None):
""" Linearly interpolate from a to b by a factor of t """
return add(a*(1 - t), b*t, out=out)
diff_b_a = subtract(b, a)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just in case this was overlooked mistakenly, seems like this now disallows booleans in percentile:

>>> np.percentile([True, False, False], q=0.5)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<__array_function__ internals>", line 5, in percentile
  File "/.../python3.8/site-packages/numpy/lib/function_base.py", line 3818, in percentile
    return _quantile_unchecked(
  File "/.../python3.8/site-packages/numpy/lib/function_base.py", line 3937, in _quantile_unchecked
    r, k = _ureduce(a, func=_quantile_ureduce_func, q=q, axis=axis, out=out,
  File "/.../python3.8/site-packages/numpy/lib/function_base.py", line 3515, in _ureduce
    r = func(a, **kwargs)
  File "/.../python3.8/site-packages/numpy/lib/function_base.py", line 4064, in _quantile_ureduce_func
    r = _lerp(x_below, x_above, weights_above, out=out)
  File "/.../python3.8/site-packages/numpy/lib/function_base.py", line 3961, in _lerp
    diff_b_a = subtract(b, a)
TypeError: numpy boolean subtract, the `-` operator, is not supported, use the bitwise_xor, the `^` operator, or the logical_xor function instead.

which worked before (NumPy 1.19):

0.0


FWIW, this behaviour change can also be triggered by pandas's:

>>> pd.DataFrame({"i": [0, 1, 2], "b": [False, False, True], "s": ["x", "y", "z"]}).quantile(q=0.5, numeric_only=True)

Before:

i    1.0
b    0.0
Name: 0.5, dtype: float64

After:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../python3.8/site-packages/pandas/core/frame.py", line 9266, in quantile
    result = data._mgr.quantile(
  File "/.../python3.8/site-packages/pandas/core/internals/managers.py", line 491, in quantile
    block = b.quantile(axis=axis, qs=qs, interpolation=interpolation)
  File "/.../python3.8/site-packages/pandas/core/internals/blocks.py", line 1592, in quantile
    result = nanpercentile(
  File "/.../python3.8/site-packages/pandas/core/nanops.py", line 1675, in nanpercentile
    return np.percentile(values, q, axis=axis, interpolation=interpolation)
  File "<__array_function__ internals>", line 5, in percentile
  File "/.../python3.8/site-packages/numpy/lib/function_base.py", line 3818, in percentile
    return _quantile_unchecked(
  File "/.../python3.8/site-packages/numpy/lib/function_base.py", line 3937, in _quantile_unchecked
    r, k = _ureduce(a, func=_quantile_ureduce_func, q=q, axis=axis, out=out,
  File "/.../python3.8/site-packages/numpy/lib/function_base.py", line 3515, in _ureduce
    r = func(a, **kwargs)
  File "/.../python3.8/site-packages/numpy/lib/function_base.py", line 4064, in _quantile_ureduce_func
    r = _lerp(x_below, x_above, weights_above, out=out)
  File "/.../python3.8/site-packages/numpy/lib/function_base.py", line 3961, in _lerp
    diff_b_a = subtract(b, a)
TypeError: numpy boolean subtract, the `-` operator, is not supported, use the bitwise_xor, the `^` operator, or the logical_xor function instead.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be better to open an issue instead of commenting on the merge PR. I am not sure that it will draw attention.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that I am not sure if this is an intended behaviour or a bug. It could make sense either way 1. bool != numeric but 2. bool inherits int in Python.

I believe I provided enough details with a self-contained reproducer so I would defer to other maintainers here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually opening an issue with the regression tag would make sense here. What do you thi
nk @eric-wieser @seberg ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please open the issue, otherwise the discussion will get lost even quicker. If we figure that it is OK if this breaks now, we can just close the issue again. Thanks!

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done: #19154

HyukjinKwon referenced this pull request in apache/spark May 28, 2021
…mns_should_be_discarded_if_numeric_only_is_true

### What changes were proposed in this pull request?

This PR proposes to fix and reenable `test_stats_on_non_numeric_columns_should_be_discarded_if_numeric_only_is_true` that was disabled when we upgrade Python 3.9 in CI at #32657.

Seems like this is because of the latest NumPy's behaviour change, see also `https://github.com/numpy/numpy/pull/16273#discussion_r641264085`.

pandas inherits this behaviour but it doesn't make sense when `numeric_only` is set to `True` in pandas. I will track and follow the status of the issue between pandas and NumPy.

For the time being, I propose to exclude boolean case alone in percentile/quartile test case

### Why are the changes needed?

To keep the test coverage.

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

I roughly locally tested. But it should pass in CI.

Closes #32690 from HyukjinKwon/SPARK-35510.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

BUG: numpy.percentile output is not sorted

6 participants