Initial sort() by support and complex sort() change by vrakesh · Pull Request #16700 · numpy/numpy

vrakesh · 2020-06-27T22:47:10Z

This is the first step in addressing gh-15981

Related mailing list thread can be found here

As stated in the original thread, we need to start with a sort() function for complex numbers that orders by keys rather than by plain arithmetic ordering.

There are two broad ways to approach a sorting function that supports keys (not just for complex numbers):

1. Add a key kwarg to sort() (function and method) to support key-based sorting on arrays.
2. Use a new function along the lines of sortby(c_arr, key=(c_arr.real, c_arr.imag)).

In this PR I have chosen approach 1 for the following reasons:

Approach 1 makes it easier to deal with both the in-place method and the function. Since we can make the change in the C sort function, the Python layer needs only minimal changes. I hope this results in minimal impact on current code that handles complex sorting. One example within NumPy is the linalg module's svd() function.
With approach 2, when we deprecate complex arithmetic ordering, existing methods that use sort() for complex types would need to update their signatures.

As it stands, the PR does the following 3 things within the Python-C array method implementation of sort:

Checks for complex type: if the array is of complex type and no key is passed, it creates a default key that mimics the current arithmetic ordering in NumPy.
Uses the keys to perform a Py_LexSort and generate indices.
Performs the take_along_axis via a C callback and copies the result over to the original array (pseudo in-place).

I am requesting feedback/help on implementing the take_along_axis logic at the C level in an in-place manner, and on the approach in general.

This will further feed into max() and min() as well. Once we figure this out, the next step would be to deprecate arithmetic ordering for complex types (which I think will be a PR of its own).

UPDATE:
The latest version uses a new function, PyArray_Keysort(), to argsort a 1D slice of indices and then uses the indices to move the contents of the same 1D slice.
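For readers who want to see the three steps from the description at the Python level, here is a minimal sketch using only existing NumPy calls (np.lexsort and np.take_along_axis). It assumes the default keys are the real part followed by the imaginary part, and it is not the C implementation from this PR; the variable names are illustrative.

```python
import numpy as np

# Reversed 3x3x3 complex array, as in the benchmark later in this thread.
carr = np.arange(27, dtype=np.complex128)[::-1].reshape(3, 3, 3)

# Default keys that mimic the current ordering: real part first, then
# imaginary part. np.lexsort treats the LAST key in the tuple as primary.
order = np.lexsort((carr.imag, carr.real), axis=-1)

# Gather along the same axis the indices were computed for.
sorted_carr = np.take_along_axis(carr, order, axis=-1)

# "Pseudo in-place": copy the gathered result back over the original.
carr[...] = sorted_carr
```

The copy in the last line is what the description calls pseudo in-place: the result is built in a temporary array and then written back.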

vrakesh · 2020-07-15T19:34:29Z

Some initial performance comparisons

In [5]: carr = np.arange(27, dtype=np.complex128)[::-1].reshape(3,3,3)

# New keysort() C function internally handling complex numbers
In [6]: %timeit carr.sort()
11.2 µs ± 749 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [7]: carr = np.arange(27, dtype=np.complex128)[::-1].reshape(3,3,3)

# Complex sorting using lexsort + take_along_axis
In [8]: %timeit np.take_along_axis(carr, np.lexsort((carr.imag, carr.real,)), axis=0)
14.9 µs ± 724 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

seberg

Thanks, I like this approach much more. It could be interesting how it holds up with other use-cases, such as:

c = np.random.random(size=(1000, 1000)) * (1 - 1j)
%timeit np.sort(c, by=(c.real, c.imag), axis=0)
%timeit np.sort(c, by=(c.real, c.imag), axis=0)

seberg · 2020-07-16T20:38:46Z

numpy/core/src/multiarray/item_selection.c

I think you can delete rit entirely. It is only actually used in the !needcopy copy, since in the other path indbuffer is used, and there is no reason to actually copy it back. You can use indbuffer identically in that branch, since we do not actually return the index array.

Deleting rit includes deleting ret entirely.

Yes, I was also thinking the same. Will get it done

seberg · 2020-07-16T20:39:59Z

numpy/core/src/multiarray/item_selection.c

This seems incorrect, self can have any datatype, so it should be selstride != PyArray_DESCR(self)->elsize.

seberg · 2020-07-16T20:52:55Z

numpy/core/src/multiarray/item_selection.c

I have to admit, I am curious if we cannot just replace the below code with a manual memmove, you have to move elements (elsize), we know that every item ends up in exactly one other place, and we do not have to check the indices for out-of-bound values...

so a simple for loop with a memmove inside could be just as well. If it turns out a bit slow for some (smaller) dtypes, we could take the approach we have in npy_fastputmask to have the compiler generate different versions for the most common dtype sizes.

After doing that, we could experiment with an in-place algorithm, which seems like it should be possible as long as we also modify indbuffer at the same time.

In any case, right now I think a simple for-loop with a memmove inside is probably just as fast (no overhead and no index checking) or faster. And probably even slightly easier to read.

Ah, will look into this.
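As a rough illustration of the "simple for loop with a memmove inside" idea, the sketch below reorders fixed-size items in a byte buffer through a scratch buffer. The helper name apply_order and the use of bytearray slices in place of memmove are assumptions for illustration; the PR does this in C on the array's data.

```python
import struct
from typing import Sequence


def apply_order(data: bytearray, order: Sequence[int], elsize: int) -> None:
    """Reorder items so that item i becomes the old item order[i].

    `order` must be a permutation, so every source item lands in exactly
    one destination and no bounds checks are needed inside the loop.
    """
    scratch = bytearray(len(data))
    for dst, src in enumerate(order):
        # Stand-in for memmove(scratch + dst*elsize, data + src*elsize, elsize).
        scratch[dst * elsize:(dst + 1) * elsize] = data[src * elsize:(src + 1) * elsize]
    data[:] = scratch  # copy the reordered items back


# Three float64 items and their argsort.
buf = bytearray(struct.pack("3d", 3.0, 1.0, 2.0))
apply_order(buf, [1, 2, 0], elsize=8)
result = list(struct.unpack("3d", buf))
```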

hameerabbasi

Question: Do we want to allow the sorting keys to be broadcastable?

hameerabbasi · 2020-07-16T05:56:31Z

numpy/core/fromnumeric.py

Suggested change

- if by is not None and not isinstance(by, tuple):
+ if by is not None and not isinstance(by, (tuple, list)):

To match block.

hameerabbasi · 2020-07-16T05:58:26Z

numpy/core/src/multiarray/item_selection.c

Nit:

Suggested change

-                                       PyArray_DIMS(mps[0]),
-                                       PyArray_NDIM(mps[0])))) {
+                                          PyArray_DIMS(mps[0]),
+                                          PyArray_NDIM(mps[0])))) {

hameerabbasi

A couple of notes, and a question, do we want the sorting keys to be broadcastable to the array shape?

vrakesh · 2020-07-18T22:11:43Z

@hameerabbasi I am not sure if we want it to be broadcastable. Thoughts, @seberg?

seberg · 2020-07-18T22:20:16Z

Broadcastable sounds nice, but unless lexsort has it (which I do not think), it also doesn't strike me as a vital feature for sorting?

vrakesh · 2020-07-22T18:45:11Z

Performance numbers on the most recent commit:

In [1]: import numpy as np

In [2]: c = np.random.random(size=(1000, 1000)) * (1 - 1j)
   ...: %timeit np.sort(c, by=(c.real, c.imag), axis=0)
65.6 ms ± 556 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [3]: c = np.random.random(size=(1000, 1000)) * (1 - 1j)

In [4]: %timeit np.take_along_axis(c, np.lexsort((c.imag, c.real,)), axis=0)
70 ms ± 617 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

seberg · 2020-07-22T19:12:30Z

numpy/core/src/multiarray/item_selection.c

So this is not much faster, but I still like it a bit more, plus there is slightly more opportunity to speed it up if we ever want to, and less memory overhead.

Can't you do this in the for-loop directly? Maybe we could also make a small helper for it, since I think it should look identical in both paths of the needscopy if/else.

Made the change on this

anirudh2290 · 2020-07-22T21:58:56Z

numpy/core/src/multiarray/item_selection.c

Should the shape of each individual key be the same as the shape of the array we are sorting?

Yes they should, IIUC. We're not broadcasting. #16700 (comment)

anirudh2290 · 2020-07-22T22:03:19Z

numpy/core/src/multiarray/item_selection.c

Will this do a stable sort even if another sort kind is provided, and is that the expected behavior?

That is a very good point, we have to use stable sort (except for the first sort I guess), otherwise the approach does not work. We should maybe give an error if by is passed together with the sort-kind (or sort-kind other than "stable").
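To make the stability requirement concrete, here is a small pure-Python sketch of multi-pass key sorting. It relies on Python's sorted() being stable; the helper name lexsort_indices is hypothetical, and the key order (last key is primary) follows np.lexsort's convention.

```python
def lexsort_indices(keys):
    """Argsort by several keys; the LAST key in `keys` is the primary one."""
    order = list(range(len(keys[0])))
    for key in keys:  # least significant key first
        # sorted() is stable, so ties keep the order from earlier passes.
        order = sorted(order, key=lambda i: key[i])
    return order


real = [1, 0, 1, 0]
imag = [0, 1, -1, -1]

# Sort by real part, breaking ties with the imaginary part.
order = lexsort_indices((imag, real))
pairs = [(real[i], imag[i]) for i in order]
```

If any pass after the first used an unstable sort, ties on that pass could scramble the order established by the earlier passes, which is why combining by= with a non-stable kind would not work.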

anirudh2290 · 2020-07-22T22:04:25Z

numpy/core/src/multiarray/item_selection.c

Can we choose a better name to describe what this int is supposed to do?

anirudh2290 · 2020-07-22T22:13:11Z

numpy/core/src/multiarray/item_selection.c

Is there a reason we want to deprecate this?

I think this was copied over from lexsort(?). 0-d arrays can't be reasonably sorted along an axis (except None, which has ravel behavior), but we allow it.

@vrakesh this also means that I think we should just go through with the deprecation. Which I think just means deleting the first if block so that check_and_adjust_axis is always run. np.array(0).sort(axis=None) fails, so it should fail here as well.

This one is important I think, and should get a test-case.

anirudh2290 · 2020-07-22T22:33:13Z

numpy/core/src/multiarray/item_selection.c

nit:

Suggested change

- for(i=0;i<N;i++){
+ for (i = 0; i < N; i++) {

anirudh2290 · 2020-07-22T22:36:05Z

numpy/core/src/multiarray/item_selection.c

Suggested change

- for(i=0;i<N;i++){
+ for (i = 0; i < N; i++) {

anirudh2290 · 2020-07-22T22:45:48Z

numpy/core/fromnumeric.py

I think it would help here to call out the order in which the keys are sorted (similar to the lexsort documentation). Following lexsort, it seems that to sort complex numbers lexicographically by keys one should do arr.sort(by=(c.imag, c.real)).
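The key-order convention being referred to can be checked directly with np.lexsort, which already exists in NumPy. This sketch only demonstrates lexsort's behavior; whether the proposed by= argument follows the same order is exactly the question raised in the comment.

```python
import numpy as np

c = np.array([1 + 2j, 0 + 5j, 1 - 1j, 0 - 3j])

# The last key (c.real) is primary; the first key (c.imag) breaks ties.
by_real_then_imag = c[np.lexsort((c.imag, c.real))]

# Swapping the tuple makes the imaginary part primary instead.
by_imag_then_real = c[np.lexsort((c.real, c.imag))]
```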

anirudh2290 · 2020-07-22T22:47:31Z

numpy/core/tests/test_multiarray.py

maybe also add the carr.real, carr.imag test mentioned above. Also, would help to check for different shapes of keys and mismatch between keys and arr shape for error assertions.

eric-wieser · 2020-07-23T11:04:37Z

numpy/core/src/multiarray/item_selection.c

Can we push these declarations down-file to where they are first used, C99-style?

Not sure on this one, every other function in the same file does not follow it, shouldn't be a problem changing, only question is uniformity.

That's because other functions were written before the language allowed it. IMO we should write all new code in C99 style.

eric-wieser · 2020-07-23T11:06:04Z

numpy/core/src/multiarray/item_selection.c

Suggested change

- swaps = malloc(NPY_LIKELY(n > 0) ? n * sizeof(int) : 1);
+ swaps = malloc(NPY_LIKELY(n > 0) ? n * sizeof(*swaps) : 1);

I agree the swaps variable can be removed, will do it.

eric-wieser · 2020-07-23T11:06:54Z

numpy/core/src/multiarray/item_selection.c

Suggested change

- if(sortedbuffer == NULL) {
+ if (sortedbuffer == NULL) {

eric-wieser · 2020-07-23T11:07:18Z

numpy/core/src/multiarray/item_selection.c

This should be addressed at the point those things happen, not here

mattip · 2020-08-09T08:13:12Z

numpy/core/code_generators/numpy_api.py

Should be removed. We try to err on the side of not changing the C API if possible.

mattip · 2020-08-09T08:15:37Z

numpy/core/fromnumeric.py

+1 for not doing this conversion. The argument is documented as a sequence.

mattip · 2020-08-09T08:17:04Z

numpy/core/include/numpy/ndarraytypes.h

see above comment

vrakesh · 2020-08-13T17:36:34Z

@seberg Got the in-place part done as well. Does this look good to go?

seberg · 2020-08-13T18:41:49Z

numpy/core/src/multiarray/item_selection.c

I think it would be good to refactor this into a small helper. That deduplicates the code, explains what is going on in this case (through the function name), and adds an obvious place for a comment on what's going on, and maybe a reference for the approach (if you have one).

I will look at the rest soon, sorry, please ping me if I forget...

Sure, will refactor it into a helper.

seberg

Thanks @vrakesh sorry for taking so long to make another pass. I think this should be settling up with some smaller cleanups (mostly style nits), and removing the complex special path.

seberg · 2020-08-31T17:27:07Z

numpy/core/fromnumeric.py

This seems like it should be unnecessary, because the check should live in ndarray.sort().

I am actually happy with this decision: I.e. force a tuple() right now, and think about generalizing later. In which case I think we should go with explicitly tuple, and even disallow lists.

Thanks Sebastian, will address these changes

seberg · 2020-08-31T17:27:48Z

numpy/core/src/multiarray/item_selection.c

Suggested change

#include "mapping.h"

Probably not necessary anymore.

seberg · 2020-08-31T17:29:01Z

numpy/core/src/multiarray/item_selection.c

Some style nitpicks, like a space before the *. I like to try and do some doxygen style comments nowadays, but no need (style nitpicks also apply below, always space before and after the *).

seberg · 2020-08-31T17:30:34Z

numpy/core/src/multiarray/item_selection.c

You actually mean "contiguous" here and not aligned (although aligned is also true). I think you can actually just remove the stride part of the comment, stride always means the same thing in the NumPy code base.

seberg · 2020-08-31T17:38:13Z

numpy/core/src/multiarray/item_selection.c

Not sure if it helps without thinking too much about it, but maybe we can write: The following code reorders the data with respect to the index. The inner while loop moves one element at a time to its correct place until it completes a full cycle (reaches an already ordered element). Each element may be visited twice, but will be sorted on the first visit. The second visit finds it already sorted and immediately continues.
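A pure-Python sketch of the cycle-following reorder described in that proposed comment is given below. It is the standard in-place permutation technique, not necessarily the exact loop in the PR, and the name permute_in_place is hypothetical. Note that the index buffer is modified along the way, as suggested earlier in the review.

```python
def permute_in_place(data, order):
    """Rearrange `data` so that data[i] becomes the old data[order[i]].

    `order` is consumed: on return it is the identity permutation.
    """
    for start in range(len(data)):
        if order[start] == start:
            continue  # already in place, possibly fixed by an earlier cycle
        saved = data[start]  # the only temporary the whole cycle needs
        dst = start
        while order[dst] != start:
            src = order[dst]
            data[dst] = data[src]  # move one element to its final slot
            order[dst] = dst       # mark the slot as done
            dst = src
        data[dst] = saved          # close the cycle
        order[dst] = dst


values = ["d", "a", "c", "b"]
index = [1, 3, 2, 0]  # argsort of values
permute_in_place(values, index)
```

Each slot is written once, and a later visit by the outer loop sees order[i] == i and skips it, which matches the "visited twice, sorted on the first visit" wording.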

seberg · 2020-08-31T17:46:38Z

numpy/core/src/multiarray/methods.c

Still holds, this path should be removed, IMO (it can be replaced with a deprecation warning, but it is likely nicer to do that in a followup)

seberg · 2020-08-31T17:47:30Z

numpy/core/src/multiarray/methods.c

Style nit: No need for the \ and I would probably just put the text on the next line with 8 space indentation.

seberg · 2020-08-31T17:49:28Z

numpy/core/tests/test_multiarray.py

This test should be transformed into one using by=; let's not modify the complex path right now.

seberg · 2020-08-31T17:51:27Z

numpy/core/src/multiarray/item_selection.c

This one is important I think, and should get a test-case.

seberg · 2020-08-31T17:52:57Z

numpy/core/src/multiarray/methods.c

Would be good to add a test for this choice, and I think if we limit so much (which I like), we might as well limit it to only tuples. Otherwise "sequence" would be arbitrary python sequences.
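The restrictions discussed in this review (tuple only, keys with the same shape as the array, and no non-stable sort kind) could be expressed roughly as follows. This is a hypothetical Python-level helper for illustration only; check_by is not NumPy API, and the PR performs its checks in C.

```python
import numpy as np


def check_by(arr, by, kind=None):
    """Hypothetical validation for a by= argument (not part of NumPy)."""
    if not isinstance(by, tuple):
        raise TypeError("by must be a tuple of arrays")
    if kind not in (None, "stable"):
        raise ValueError("by= requires a stable sort")
    for key in by:
        # No broadcasting: every key must match the array's shape exactly.
        if np.shape(key) != np.shape(arr):
            raise ValueError("each key must have the same shape as the array")


c = np.zeros((2, 3), dtype=complex)
check_by(c, (c.real, c.imag))  # passes silently
```

A test for this choice would then assert that a list of keys, a key of the wrong shape, and kind="quicksort" each raise.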

BvB93 · 2020-08-31T18:43:04Z

numpy/__init__.pyi

Suggested change

- by: Union[None, Sequence[ndarray], ndarray] = ...,
+ by: Optional[Sequence[ArrayLike]] = ...,

Currently only sequences (i.e. lists and tuples) of array-like objects are allowed, correct?

Thanks, I am thinking if we should just make it tuple here. I don't mind making it sequences, but not sure there is much reason for not being strict (but I may be missing some earlier discussion).

Makes sense, will make the change

If we're going with tuple then Optional[Tuple[ArrayLike, ...]] should do the trick.

vrakesh · 2020-09-05T06:02:51Z

@seberg Addressed the changes, hopefully this is good to go

seberg · 2023-02-08T18:01:14Z

I guess we should close this PR, unfortunately. We discussed this, and maybe we should not block deprecating complex ordering on having this new API. It was a nice idea, but sorting complex arrays efficiently is likely not important enough to avoid arr[np.lexsort((arr.imag, arr.real))].

I don't hate an API like this though and this was a great push! But pushing this forward seems like a larger thing and probably not vital.

vrakesh requested review from anirudh2290, eric-wieser, mattip and seberg June 27, 2020 22:47

seberg marked this pull request as draft June 28, 2020 03:48

charris added 03 - Maintenance component: numpy._core labels Jun 28, 2020

vrakesh force-pushed the complex_types branch from 384f22f to 001f6a3 July 11, 2020 18:43

BvB93 reviewed Jul 12, 2020 (numpy/__init__.pyi, outdated)

vrakesh force-pushed the complex_types branch from 3f50298 to 13413f2 July 15, 2020 19:04

vrakesh requested a review from hameerabbasi July 15, 2020 19:04

seberg reviewed Jul 16, 2020

hameerabbasi reviewed Jul 17, 2020

vrakesh force-pushed the complex_types branch 3 times, most recently from 8e00f51 to 69aabd9 July 22, 2020 18:39

seberg reviewed Jul 22, 2020

vrakesh force-pushed the complex_types branch from 69aabd9 to 3f97db2 July 22, 2020 19:33

anirudh2290 reviewed Jul 22, 2020

eric-wieser reviewed Jul 23, 2020 (numpy/core/src/multiarray/item_selection.c, outdated)

eric-wieser reviewed Jul 23, 2020 (numpy/core/src/multiarray/item_selection.c, outdated)

vrakesh mentioned this pull request Aug 8, 2020: MAINT: Deprecate Complex Ordering comparisons #17030 (Closed)

mattip reviewed Aug 9, 2020

eric-wieser reviewed Aug 9, 2020 (numpy/core/fromnumeric.py, outdated)

eric-wieser reviewed Aug 9, 2020 (numpy/core/src/multiarray/item_selection.c, outdated)

vrakesh force-pushed the complex_types branch 2 times, most recently from b114f5f to 2396c7a August 13, 2020 02:17

seberg reviewed Aug 13, 2020

vrakesh force-pushed the complex_types branch 5 times, most recently from 319fc3f to edaa2ea August 14, 2020 16:42

seberg reviewed Aug 31, 2020

BvB93 reviewed Aug 31, 2020

vrakesh added 6 commits September 4, 2020 15:52

fd0dc75  WIP: Initial keys() support and complex sort() change
4d1ede2  Make a long axis, implementation in C
94417b2  By as parameter
347bf6d  keysort() C function to sort 1D slice at a time
0abf3a8  Iterate on Keysort() copy mechanism, simplify
ab2214b  CodeReview suggestions

vrakesh force-pushed the complex_types branch from edaa2ea to ab2214b September 5, 2020 06:02

Base automatically changed from master to main March 4, 2021 02:04

seberg mentioned this pull request Nov 23, 2022: BUG: Quantile function on complex numbers doesn't error #22652 (Closed)

InessaPawson added the triage review (Issue/PR to be discussed at the next triage meeting) label Dec 22, 2022

seberg added the 64 - Good Idea (Inactive PR with a good start or idea. Consider studying it if you are working on a related issue.) label Feb 8, 2023

seberg closed this Feb 8, 2023

InessaPawson removed the triage review (Issue/PR to be discussed at the next triage meeting) label Feb 8, 2023
