-
-
Notifications
You must be signed in to change notification settings - Fork 11.9k
BUG: Order percentile monotonically #16273
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Order percentile monotonically #16273
Conversation
717997a to
229f14c
Compare
|
This is a duplicate of #15098. For reviewers:
|
Hi, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually thought of writing this as a hypothesis test but hypothesis module was never imported and I don't know your conventions. But can change it hypothesis for sure if that is desired.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We recently decided we were willing to accept hypothesis tests - if you think the test is easier to read with hypothesis, please use it, so that maintainers like me who haven't used it can get the hang of it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a hypothesis test. Am curious for your feedback. The purpose of this test is not to increase readability but to cover more test cases without writing multiple tests and to find failing egde cases we might not have thought of.
numpy/lib/function_base.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not understand what out means in this context. If you can give me a hint I might be able to simplify the code even more. I guess there is the possibility to get rid of x1 = x1.squeeze(0)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In working out what hint to give you, I ended up doing some cleanup of this code myself. Thanks for drawing it to my attention. It might be best for you to wait until my cleanup goes through, and then to rebase your tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your patience - my cleanup is in. It's unlikely you'll be able to usefully merge the implementation, but you should be able to keep the tests.
Your cleanup made it way easier to implement the montonic lerp. Thanks for that.
|
Thanks for your patience - my cleanup is in. It's unlikely you'll be able to usefully merge the implementation, but you should be able to keep the tests. |
af5dd48 to
be592a4
Compare
numpy/lib/function_base.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| #import pdb; pdb.set_trace() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| equals_sorted = np.sort(quantile) == quantile | |
| assert equals_sorted.all() | |
| assert_equal(np.sort(quantile), quantile) |
|
My fear with this patch is that we also care about the other properties in #14685 (comment) (vs #15098 which claims to satisfy all of them). Do you think you could put together a hypothesis tests that can prove that those properties are not satisfied with your solution? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| assert np.lib.function_base._lerp(a, b, t) == np.lib.function_base._lerp(b, a, t) | |
| assert np.lib.function_base._lerp(a, b, t) == np.lib.function_base._lerp(b, a, 1-t) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or more likely,
| assert np.lib.function_base._lerp(a, b, t) == np.lib.function_base._lerp(b, a, t) | |
| assert np.lib.function_base._lerp(a, b, 1 - (1- t)) == np.lib.function_base._lerp(b, a, 1 - t) |
To ensure that the precision of t is the same in both directions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
I added the tests but both solutions (mine and #15098) don't satisfy the exact symmetry condition but of course both satisfy the condition with EDIT: This is outdated. When we apply |
|
Even with #16273 (comment)? |
Only with #16273 (comment). Using assert |
90eba5f to
d895113
Compare
Co-authored-by: Eric Wieser <[email protected]>
18815e4 to
214e830
Compare
| width=32), | ||
| b= st.floats(allow_nan=False, allow_infinity=False, | ||
| width=32)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These need to be 64 for this test to be interesting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changing to 64bit leads ot overflows in the basic add and subtract methods:
b=1.7976931348623157e+308
a=-1.7976931348623157e+308
np.subtract(b, a)
*** RuntimeWarning: overflow encountered in subtract
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's potentially a problem, the previous implementation did not fail like that for those cases.
For now, can you do something like clip a and b to reasonable magnitudes? We still want to test the precision of 64-bit, even if we decide we don't care about the range.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed the range from -1e300 up to 1e300, seems to work fine.
Are you sure that it wasn't a problem before? The overflow happens in the basic add and subtract methods and should be reproducible for specific combinations for the former implementation aswell since we are still using those methods.
Co-authored-by: Eric Wieser <[email protected]>
eric-wieser
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, assuming @seberg is happy.
Co-authored-by: Eric Wieser <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, may just merge soon. I cannot say, I love the way the indentation works out for the hypothesis tests. I think I would prefer a hanging indent, maybe of 8 spaces. There may be long lines.
I also wonder if those tests (except the symmetric one maybe?), are actually likely to stress the issue. On first sight, it feels like we may need an absurd amount of fuzzy trials until the test could proof that the old version was bad?
EDIT: To be clear, I am not willing to stress on code style in the tests here...
|
Thanks @CloseChoice, lets put put this in :), I think the tests should cover the relevant paths well, even if I am not sure that all the tests actually stress the monotonicity very strongly. EDIT: To be clear, if someone feels like doing style/test touch-ups that is of course welcome, but I don't feel its important enough to delay/stress out over. |
|
@seberg : Thanks for merging ;) |
|
@CloseChoice @seberg @eric-wieser Thanks for taking care of this issue and sorry to have disappear from the map. |
| def _lerp(a, b, t, out=None): | ||
| """ Linearly interpolate from a to b by a factor of t """ | ||
| return add(a*(1 - t), b*t, out=out) | ||
| diff_b_a = subtract(b, a) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just in case this was overlooked mistakenly, seems like this now disallows booleans in percentile:
>>> np.percentile([True, False, False], q=0.5)Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<__array_function__ internals>", line 5, in percentile
File "/.../python3.8/site-packages/numpy/lib/function_base.py", line 3818, in percentile
return _quantile_unchecked(
File "/.../python3.8/site-packages/numpy/lib/function_base.py", line 3937, in _quantile_unchecked
r, k = _ureduce(a, func=_quantile_ureduce_func, q=q, axis=axis, out=out,
File "/.../python3.8/site-packages/numpy/lib/function_base.py", line 3515, in _ureduce
r = func(a, **kwargs)
File "/.../python3.8/site-packages/numpy/lib/function_base.py", line 4064, in _quantile_ureduce_func
r = _lerp(x_below, x_above, weights_above, out=out)
File "/.../python3.8/site-packages/numpy/lib/function_base.py", line 3961, in _lerp
diff_b_a = subtract(b, a)
TypeError: numpy boolean subtract, the `-` operator, is not supported, use the bitwise_xor, the `^` operator, or the logical_xor function instead.
which worked before (NumPy 1.19):
0.0
FWIW, this behaviour change can also be triggered by pandas's:
>>> pd.DataFrame({"i": [0, 1, 2], "b": [False, False, True], "s": ["x", "y", "z"]}).quantile(q=0.5, numeric_only=True)Before:
i 1.0
b 0.0
Name: 0.5, dtype: float64
After:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/.../python3.8/site-packages/pandas/core/frame.py", line 9266, in quantile
result = data._mgr.quantile(
File "/.../python3.8/site-packages/pandas/core/internals/managers.py", line 491, in quantile
block = b.quantile(axis=axis, qs=qs, interpolation=interpolation)
File "/.../python3.8/site-packages/pandas/core/internals/blocks.py", line 1592, in quantile
result = nanpercentile(
File "/.../python3.8/site-packages/pandas/core/nanops.py", line 1675, in nanpercentile
return np.percentile(values, q, axis=axis, interpolation=interpolation)
File "<__array_function__ internals>", line 5, in percentile
File "/.../python3.8/site-packages/numpy/lib/function_base.py", line 3818, in percentile
return _quantile_unchecked(
File "/.../python3.8/site-packages/numpy/lib/function_base.py", line 3937, in _quantile_unchecked
r, k = _ureduce(a, func=_quantile_ureduce_func, q=q, axis=axis, out=out,
File "/.../python3.8/site-packages/numpy/lib/function_base.py", line 3515, in _ureduce
r = func(a, **kwargs)
File "/.../python3.8/site-packages/numpy/lib/function_base.py", line 4064, in _quantile_ureduce_func
r = _lerp(x_below, x_above, weights_above, out=out)
File "/.../python3.8/site-packages/numpy/lib/function_base.py", line 3961, in _lerp
diff_b_a = subtract(b, a)
TypeError: numpy boolean subtract, the `-` operator, is not supported, use the bitwise_xor, the `^` operator, or the logical_xor function instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be better to open an issue instead of commenting on the merge PR. I am not sure that it will draw attention.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is that I am not sure if this is an intended behaviour or a bug. It could make sense either way 1. bool != numeric but 2. bool inherits int in Python.
I believe I provided enough details with a self-contained reproducer so I would defer to other maintainers here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually opening an issue with the regression tag would make sense here. What do you thi
nk @eric-wieser @seberg ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please open the issue, otherwise the discussion will get lost even quicker. If we figure that it is OK if this breaks now, we can just close the issue again. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done: #19154
…mns_should_be_discarded_if_numeric_only_is_true ### What changes were proposed in this pull request? This PR proposes to fix and reenable `test_stats_on_non_numeric_columns_should_be_discarded_if_numeric_only_is_true` that was disabled when we upgrade Python 3.9 in CI at #32657. Seems like this is because of the latest NumPy's behaviour change, see also `https://github.com/numpy/numpy/pull/16273#discussion_r641264085`. pandas inherits this behaviour but it doesn't make sense when `numeric_only` is set to `True` in pandas. I will track and follow the status of the issue between pandas and NumPy. For the time being, I propose to exclude boolean case alone in percentile/quartile test case ### Why are the changes needed? To keep the test coverage. ### Does this PR introduce _any_ user-facing change? No, test-only. ### How was this patch tested? I roughly locally tested. But it should pass in CI. Closes #32690 from HyukjinKwon/SPARK-35510. Authored-by: Hyukjin Kwon <[email protected]> Signed-off-by: Hyukjin Kwon <[email protected]>
fixes #14685
Order output of
np.quantile/np.percentilemonotonically if quantiles/percentiles are ordered monotonically.Change the way of performing the linear interpolation from (2) to (1) as given in
https://math.stackexchange.com/questions/907327/accurate-floating-point-linear-interpolation