ENH: New-style sorting for StringDType#30112
Conversation
dab50e6 to
220338d
Compare
|
Running this script: import numpy
import random
import timeit
print(numpy.__version__)
options = ["left", "right", "leftovers", "righty", "up", "down", numpy.nan]
lst = random.choices(options, k=1000)
arr_s = numpy.array(lst, dtype="T")
time = timeit.timeit(
"numpy.sort(arr_s, kind='stable')",
globals=globals(),
number=1000,
setup="import numpy",
)
print(f"Time taken for stable sort: {time:.6f} seconds")produces on main: and with this PR: Edit: just noting that this doesn't implement quick (non stable) sort so far - I didn't find a significant performance difference when working with this script and a few other tests, but I didn't really try out a lot of variety. It would also be more verbose if we added it. I'll try running experiments sometime. |
| // "output DType is parametric. (method: %s)", | ||
| // spec->name); | ||
| // return -1; | ||
| // } |
There was a problem hiding this comment.
This should not be merged!!! I noticed that we actually cannot skip resolve_descriptors for any parametric dtype, at least for sorts, because the output is parametric. Should we special-case somehow?
There was a problem hiding this comment.
I didn't follow all the discussions that led to using an ArrayMethod here. I think you explained elsewhere but I'm not finding it: why don't the sort ArrayMethods care about resolve_descriptors? Because sorts have nin = nout = 1 and the input is the output, or something along those lines? If that's what it is, maybe this check needs to be conditional on the input and output descriptors so that trivial ArrayMethods can skip it.
There was a problem hiding this comment.
Yes, it's because the output is essentially fake for sorts (we enforce data[0] == data[1]. Do you mean we can check if that is the case here and only error if it's not? If so makes sense -- I'll do that, thank you!
There was a problem hiding this comment.
Sorry, I just realized this is initialization, so we don't actually have the descriptors. Not sure what the condition could be...
|
|
||
| NPY_ARRAYMETHOD_FLAGS flags = 0; | ||
| method_flags = &flags; | ||
|
|
There was a problem hiding this comment.
Fixed a bug here and below!
| Py_INCREF(argsort_method->method); | ||
| Py_DECREF(argsort_method); | ||
|
|
||
| return 0; |
There was a problem hiding this comment.
Unfortunately we cannot use PyUFunc_AddLoopsWithSpecs if we group this with the ufuncs because NumPy is not fully initialized when multiarray is initialized.
There was a problem hiding this comment.
There was a problem hiding this comment.
Unfortunately I tried init_string_dtype at first, and it still doesn't work because that is also initialized in multiarray as far as I can tell.
|
Added a 2.4.0 milestone - it'd be really nice to get the performance boost shipped this month. I'll try to give this a once-over this week sometime. |
ngoldbaum
left a comment
There was a problem hiding this comment.
Overall seems nice but incomplete. Left some comments inline.
Did you think about using C++ instead of C?
IMO this needs explicit string sorting tests, especially given the custom algorithm code in this PR. I'm not sure offhand if there's a good test corpus - maybe look in the CPython test suite?
Even if you change to use a C++ standard library sorting algorithm, I'd still like to see some more tests for sorting in test_stringdtype.py before this is merged.
| extern "C" { | ||
| #endif | ||
|
|
||
| #define SMALL_STRING_MERGESORT 20 |
There was a problem hiding this comment.
what motivates this choice? maybe add a comment
There was a problem hiding this comment.
Same as the algorithm, this is directly taken from npysort :
| // "output DType is parametric. (method: %s)", | ||
| // spec->name); | ||
| // return -1; | ||
| // } |
There was a problem hiding this comment.
I didn't follow all the discussions that led to using an ArrayMethod here. I think you explained elsewhere but I'm not finding it: why don't the sort ArrayMethods care about resolve_descriptors? Because sorts have nin = nout = 1 and the input is the output, or something along those lines? If that's what it is, maybe this check needs to be conditional on the input and output descriptors so that trivial ArrayMethods can skip it.
| memcpy(pj, vp, elsize); | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
This is pretty tricky C code - I would be very careful to make sure you're implementing it correctly (e.g. by adding a lot of tests). Or maybe use the C++ standard library instead?
There was a problem hiding this comment.
Sure, one could move to C++ standard library! I didn't exactly think about it because this code is directly copied and adjusted from the npysort merge sorts:
numpy/numpy/_core/src/npysort/mergesort.cpp
Lines 338 to 412 in 2b216b1
I think we need an extensive test suite either way, so I'll start with that (great point, thanks)! I don't think it matters, but it also might be slightly harder to make future optimizations or extensions if we use C++ sorts.
| *pj = vi; | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
this too - can we maybe rely on the C++ standard library instead of having an algorithm implementation here in the stringdtype internals?
There was a problem hiding this comment.
Same as above - I'd add an extensive test suite, and if it still seems to be an issue (due to complexity for example) we can see!
|
|
||
| if (init_stringdtype_sorts() < 0) { | ||
| return -1; | ||
| } |
There was a problem hiding this comment.
maybe you can avoid the numpy initialization problem you're having if you move this into init_string_dtype over in the dtype.c file in the multiarray/stringdtype folder.
There was a problem hiding this comment.
| needidxbuffer = rstride != sizeof(npy_intp); | ||
|
|
||
| if (method_flags != NULL) { | ||
| if (strided_loop != NULL) { |
There was a problem hiding this comment.
are the two changes above bugfixes?
There was a problem hiding this comment.
Yes, bug fixes. This adjusts for the fact (below) that method_flags should actually never point to NULL, so it can be set in get_strided_loop.
| Py_INCREF(argsort_method->method); | ||
| Py_DECREF(argsort_method); | ||
|
|
||
| return 0; |
There was a problem hiding this comment.
MaanasArora
left a comment
There was a problem hiding this comment.
Thanks for reviewing! I didn't look much at C++ - it makes sense to do that I think, though as I've said in a comment I'm unsure if using their standard sorts is the way to go:
- The sorting code is directly copied from the
npy_mergesortcode innpysort, so this implementation provides some consistency. - In the future, if someone wants to do this more efficiently, add features, or even use SIMD (!), having custom code makes that easier.
That being said I totally see it can get very complicated to have lots of tricky algorithm code lying around. I actually see it as a very good reason to actually make our sorting infrastructure reusable, a public library, and perhaps use C++ templates to expose the generic sorts. Maybe this would be preferable to using C++ sorts simply because we can make extensions, even if we actually use C++ sorts internally! (See #29987 for a very initial step).
For the short term though, if it is very tricky to review this even given the reuse, I am happy to move to the C++ standard library - just making a general point. I'll add expanded test coverage soon. Thanks again!
| needidxbuffer = rstride != sizeof(npy_intp); | ||
|
|
||
| if (method_flags != NULL) { | ||
| if (strided_loop != NULL) { |
There was a problem hiding this comment.
Yes, bug fixes. This adjusts for the fact (below) that method_flags should actually never point to NULL, so it can be set in get_strided_loop.
| memcpy(pj, vp, elsize); | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
Sure, one could move to C++ standard library! I didn't exactly think about it because this code is directly copied and adjusted from the npysort merge sorts:
numpy/numpy/_core/src/npysort/mergesort.cpp
Lines 338 to 412 in 2b216b1
I think we need an extensive test suite either way, so I'll start with that (great point, thanks)! I don't think it matters, but it also might be slightly harder to make future optimizations or extensions if we use C++ sorts.
| *pj = vi; | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
Same as above - I'd add an extensive test suite, and if it still seems to be an issue (due to complexity for example) we can see!
| Py_INCREF(argsort_method->method); | ||
| Py_DECREF(argsort_method); | ||
|
|
||
| return 0; |
There was a problem hiding this comment.
Unfortunately I tried init_string_dtype at first, and it still doesn't work because that is also initialized in multiarray as far as I can tell.
| extern "C" { | ||
| #endif | ||
|
|
||
| #define SMALL_STRING_MERGESORT 20 |
There was a problem hiding this comment.
Same as the algorithm, this is directly taken from npysort :
|
|
||
| if (init_stringdtype_sorts() < 0) { | ||
| return -1; | ||
| } |
There was a problem hiding this comment.
| // "output DType is parametric. (method: %s)", | ||
| // spec->name); | ||
| // return -1; | ||
| // } |
There was a problem hiding this comment.
Yes, it's because the output is essentially fake for sorts (we enforce data[0] == data[1]. Do you mean we can check if that is the case here and only error if it's not? If so makes sense -- I'll do that, thank you!
|
Ah, that makes more sense, I didn't realize the sorting algorithm code was copied from elsewhere. Maybe we can refactor things so this calls into I'll try to poke around in a debugger to see if I can come up with a suggestion for the |
|
Thanks! Yeah unfortunately the main issue with calling into |
|
I guess that argues for merging that PR before this one. I'll try to give that one a look soon. Sorry for letting it drop... |
|
No worries at all! It actually does |
|
I've added tests using an ad-hoc string corpus; I couldn't find anything standard, sorry. It is two lorem ipsum passsages with some random unicode strings in between. I'm getting some very bizzare failures with running the new test, only for argsort, and only when |
|
Closing in favor of #30266. |
Implements merge sorts for StringDType using the feature introduced in #29737. The sorts allocate the mutex lock only once, thus fixes #26510.
Instead of utilizing the generic sorts as done in #29987, this PR completely rewrites the sorts. While that is more verbose, just duplicating the code can help move to more specialized string sorting algorithms and make optimizations independently from the
npysortcode. Also, we don't need to rely on the 'hack' in the other PR i.e. indirect use of sort compare function.Posting a benchmark in a comment below; there is ~3x gain in speed on stable sort. ping @seberg - thanks!
Edit: also supports descending sorts now!