ENH: New-style sorting for StringDType by MaanasArora · Pull Request #30112 · numpy/numpy

MaanasArora · 2025-10-31T03:37:54Z

Implements merge sorts for StringDType using the feature introduced in #29737. The sorts allocate the mutex lock only once, thus fixes #26510.

Instead of utilizing the generic sorts as done in #29987, this PR completely rewrites the sorts. While that is more verbose, just duplicating the code can help move to more specialized string sorting algorithms and make optimizations independently from the npysort code. Also, we don't need to rely on the 'hack' in the other PR i.e. indirect use of sort compare function.

Posting a benchmark in a comment below; there is ~3x gain in speed on stable sort. ping @seberg - thanks!

Edit: also supports descending sorts now!

MaanasArora · 2025-10-31T03:48:08Z

Running this script:

import numpy
import random
import timeit

print(numpy.__version__)

options = ["left", "right", "leftovers", "righty", "up", "down", numpy.nan]
lst = random.choices(options, k=1000)
arr_s = numpy.array(lst, dtype="T")

time = timeit.timeit(
    "numpy.sort(arr_s, kind='stable')",
    globals=globals(),
    number=1000,
    setup="import numpy",
)
print(f"Time taken for stable sort: {time:.6f} seconds")

produces on main:

2.4.0.dev0+git20251030.d8eb225
Time taken for stable sort: 0.397690 seconds

and with this PR:

2.4.0.dev0+git20251030.dab50e6
Time taken for stable sort: 0.120393 seconds

Edit: just noting that this doesn't implement quick (non stable) sort so far - I didn't find a significant performance difference when working with this script and a few other tests, but I didn't really try out a lot of variety. It would also be more verbose if we added it. I'll try running experiments sometime.

MaanasArora · 2025-10-31T03:49:29Z

numpy/_core/src/multiarray/array_method.c

+            //             "output DType is parametric. (method: %s)",
+            //             spec->name);
+            //     return -1;
+            // }


This should not be merged!!! I noticed that we actually cannot skip resolve_descriptors for any parametric dtype, at least for sorts, because the output is parametric. Should we special-case somehow?

I didn't follow all the discussions that led to using an ArrayMethod here. I think you explained elsewhere but I'm not finding it: why don't the sort ArrayMethods care about resolve_descriptors? Because sorts have nin = nout = 1 and the input is the output, or something along those lines? If that's what it is, maybe this check needs to be conditional on the input and output descriptors so that trivial ArrayMethods can skip it.

Yes, it's because the output is essentially fake for sorts (we enforce data[0] == data[1]. Do you mean we can check if that is the case here and only error if it's not? If so makes sense -- I'll do that, thank you!

Sorry, I just realized this is initialization, so we don't actually have the descriptors. Not sure what the condition could be...

MaanasArora · 2025-10-31T03:49:44Z

numpy/_core/src/multiarray/item_selection.c


+        NPY_ARRAYMETHOD_FLAGS flags = 0;
+        method_flags = &flags;
+


Fixed a bug here and below!

MaanasArora · 2025-10-31T03:50:33Z

numpy/_core/src/multiarray/stringdtype/sorts.c

+    Py_INCREF(argsort_method->method);
+    Py_DECREF(argsort_method);
+
+    return 0;


Unfortunately we cannot use PyUFunc_AddLoopsWithSpecs if we group this with the ufuncs because NumPy is not fully initialized when multiarray is initialized.

see https://github.com/numpy/numpy/pull/30112/files#r2501032204

Unfortunately I tried init_string_dtype at first, and it still doesn't work because that is also initialized in multiarray as far as I can tell.

… safety and consistency

ngoldbaum · 2025-11-03T20:47:19Z

Added a 2.4.0 milestone - it'd be really nice to get the performance boost shipped this month. I'll try to give this a once-over this week sometime.

ngoldbaum

Overall seems nice but incomplete. Left some comments inline.

Did you think about using C++ instead of C?

IMO this needs explicit string sorting tests, especially given the custom algorithm code in this PR. I'm not sure offhand if there's a good test corpus - maybe look in the CPython test suite?

Even if you change to use a C++ standard library sorting algorithm, I'd still like to see some more tests for sorting in test_stringdtype.py before this is merged.

ngoldbaum · 2025-11-06T22:06:05Z

numpy/_core/src/multiarray/stringdtype/sorts.h

+extern "C" {
+#endif
+
+#define SMALL_STRING_MERGESORT 20


what motivates this choice? maybe add a comment

Same as the algorithm, this is directly taken from npysort :

numpy/numpy/_core/src/npysort/mergesort.cpp

Line 40 in 2b216b1

#define SMALL_MERGESORT 20

ngoldbaum · 2025-11-06T22:09:58Z

numpy/_core/src/multiarray/array_method.c

+            //             "output DType is parametric. (method: %s)",
+            //             spec->name);
+            //     return -1;
+            // }


I didn't follow all the discussions that led to using an ArrayMethod here. I think you explained elsewhere but I'm not finding it: why don't the sort ArrayMethods care about resolve_descriptors? Because sorts have nin = nout = 1 and the input is the output, or something along those lines? If that's what it is, maybe this check needs to be conditional on the input and output descriptors so that trivial ArrayMethods can skip it.

ngoldbaum · 2025-11-06T22:11:16Z

numpy/_core/src/multiarray/stringdtype/sorts.c

+            memcpy(pj, vp, elsize);
+        }
+    }
+}


This is pretty tricky C code - I would be very careful to make sure you're implementing it correctly (e.g. by adding a lot of tests). Or maybe use the C++ standard library instead?

Sure, one could move to C++ standard library! I didn't exactly think about it because this code is directly copied and adjusted from the npysort merge sorts:

numpy/numpy/_core/src/npysort/mergesort.cpp

Lines 338 to 412 in 2b216b1

static void

npy_mergesort0(char *pl, char *pr, char *pw, char *vp, npy_intp elsize,

PyArray_CompareFunc *cmp, PyArrayObject *arr)

{

char *pi, *pj, *pk, *pm;

if (pr - pl > SMALL_MERGESORT * elsize) {

/* merge sort */

pm = pl + (((pr - pl) / elsize) >> 1) * elsize;

npy_mergesort0(pl, pm, pw, vp, elsize, cmp, arr);

npy_mergesort0(pm, pr, pw, vp, elsize, cmp, arr);

GENERIC_COPY(pw, pl, pm - pl);

pi = pw + (pm - pl);

pj = pw;

pk = pl;

while (pj < pi && pm < pr) {

if (cmp(pm, pj, arr) < 0) {

GENERIC_COPY(pk, pm, elsize);

pm += elsize;

pk += elsize;

}

else {

GENERIC_COPY(pk, pj, elsize);

pj += elsize;

pk += elsize;

}

}

GENERIC_COPY(pk, pj, pi - pj);

}

else {

/* insertion sort */

for (pi = pl + elsize; pi < pr; pi += elsize) {

GENERIC_COPY(vp, pi, elsize);

pj = pi;

pk = pi - elsize;

while (pj > pl && cmp(vp, pk, arr) < 0) {

GENERIC_COPY(pj, pk, elsize);

pj -= elsize;

pk -= elsize;

}

GENERIC_COPY(pj, vp, elsize);

}

}

}

NPY_NO_EXPORT int

npy_mergesort(void *start, npy_intp num, void *varr)

{

PyArrayObject *arr = (PyArrayObject *)varr;

npy_intp elsize = PyArray_ITEMSIZE(arr);

PyArray_CompareFunc *cmp = PyDataType_GetArrFuncs(PyArray_DESCR(arr))->compare;

char *pl = (char *)start;

char *pr = pl + num * elsize;

char *pw;

char *vp;

int err = -NPY_ENOMEM;

/* Items that have zero size don't make sense to sort */

if (elsize == 0) {

return 0;

}

pw = (char *)malloc((num >> 1) * elsize);

vp = (char *)malloc(elsize);

if (pw != NULL && vp != NULL) {

npy_mergesort0(pl, pr, pw, vp, elsize, cmp, arr);

err = 0;

}

free(vp);

free(pw);

return err;

}

I think we need an extensive test suite either way, so I'll start with that (great point, thanks)! I don't think it matters, but it also might be slightly harder to make future optimizations or extensions if we use C++ sorts.

ngoldbaum · 2025-11-06T22:12:22Z

numpy/_core/src/multiarray/stringdtype/sorts.c

+            *pj = vi;
+        }
+    }
+}


this too - can we maybe rely on the C++ standard library instead of having an algorithm implementation here in the stringdtype internals?

Same as above - I'd add an extensive test suite, and if it still seems to be an issue (due to complexity for example) we can see!

ngoldbaum · 2025-11-06T22:14:47Z

numpy/_core/src/umath/stringdtype_ufuncs.cpp


+    if (init_stringdtype_sorts() < 0) {
+        return -1;
+    }


maybe you can avoid the numpy initialization problem you're having if you move this into init_string_dtype over in the dtype.c file in the multiarray/stringdtype folder.

See https://github.com/numpy/numpy/pull/30112/files#r2480095384

ngoldbaum · 2025-11-06T22:15:42Z

numpy/_core/src/multiarray/item_selection.c

    needidxbuffer = rstride != sizeof(npy_intp);

-    if (method_flags != NULL) {
+    if (strided_loop != NULL) {


are the two changes above bugfixes?

Yes, bug fixes. This adjusts for the fact (below) that method_flags should actually never point to NULL, so it can be set in get_strided_loop.

ngoldbaum · 2025-11-06T22:20:17Z

numpy/_core/src/multiarray/stringdtype/sorts.c

+    Py_INCREF(argsort_method->method);
+    Py_DECREF(argsort_method);
+
+    return 0;


see https://github.com/numpy/numpy/pull/30112/files#r2501032204

MaanasArora

Thanks for reviewing! I didn't look much at C++ - it makes sense to do that I think, though as I've said in a comment I'm unsure if using their standard sorts is the way to go:

The sorting code is directly copied from the npy_mergesort code in npysort, so this implementation provides some consistency.
In the future, if someone wants to do this more efficiently, add features, or even use SIMD (!), having custom code makes that easier.

That being said I totally see it can get very complicated to have lots of tricky algorithm code lying around. I actually see it as a very good reason to actually make our sorting infrastructure reusable, a public library, and perhaps use C++ templates to expose the generic sorts. Maybe this would be preferable to using C++ sorts simply because we can make extensions, even if we actually use C++ sorts internally! (See #29987 for a very initial step).

For the short term though, if it is very tricky to review this even given the reuse, I am happy to move to the C++ standard library - just making a general point. I'll add expanded test coverage soon. Thanks again!

MaanasArora · 2025-11-07T01:50:53Z

numpy/_core/src/multiarray/item_selection.c

    needidxbuffer = rstride != sizeof(npy_intp);

-    if (method_flags != NULL) {
+    if (strided_loop != NULL) {


Yes, bug fixes. This adjusts for the fact (below) that method_flags should actually never point to NULL, so it can be set in get_strided_loop.

MaanasArora · 2025-11-07T01:54:27Z

numpy/_core/src/multiarray/stringdtype/sorts.c

+            memcpy(pj, vp, elsize);
+        }
+    }
+}


Sure, one could move to C++ standard library! I didn't exactly think about it because this code is directly copied and adjusted from the npysort merge sorts:

numpy/numpy/_core/src/npysort/mergesort.cpp

Lines 338 to 412 in 2b216b1

static void

npy_mergesort0(char *pl, char *pr, char *pw, char *vp, npy_intp elsize,

PyArray_CompareFunc *cmp, PyArrayObject *arr)

{

char *pi, *pj, *pk, *pm;

if (pr - pl > SMALL_MERGESORT * elsize) {

/* merge sort */

pm = pl + (((pr - pl) / elsize) >> 1) * elsize;

npy_mergesort0(pl, pm, pw, vp, elsize, cmp, arr);

npy_mergesort0(pm, pr, pw, vp, elsize, cmp, arr);

GENERIC_COPY(pw, pl, pm - pl);

pi = pw + (pm - pl);

pj = pw;

pk = pl;

while (pj < pi && pm < pr) {

if (cmp(pm, pj, arr) < 0) {

GENERIC_COPY(pk, pm, elsize);

pm += elsize;

pk += elsize;

}

else {

GENERIC_COPY(pk, pj, elsize);

pj += elsize;

pk += elsize;

}

}

GENERIC_COPY(pk, pj, pi - pj);

}

else {

/* insertion sort */

for (pi = pl + elsize; pi < pr; pi += elsize) {

GENERIC_COPY(vp, pi, elsize);

pj = pi;

pk = pi - elsize;

while (pj > pl && cmp(vp, pk, arr) < 0) {

GENERIC_COPY(pj, pk, elsize);

pj -= elsize;

pk -= elsize;

}

GENERIC_COPY(pj, vp, elsize);

}

}

}

NPY_NO_EXPORT int

npy_mergesort(void *start, npy_intp num, void *varr)

{

PyArrayObject *arr = (PyArrayObject *)varr;

npy_intp elsize = PyArray_ITEMSIZE(arr);

PyArray_CompareFunc *cmp = PyDataType_GetArrFuncs(PyArray_DESCR(arr))->compare;

char *pl = (char *)start;

char *pr = pl + num * elsize;

char *pw;

char *vp;

int err = -NPY_ENOMEM;

/* Items that have zero size don't make sense to sort */

if (elsize == 0) {

return 0;

}

pw = (char *)malloc((num >> 1) * elsize);

vp = (char *)malloc(elsize);

if (pw != NULL && vp != NULL) {

npy_mergesort0(pl, pr, pw, vp, elsize, cmp, arr);

err = 0;

}

free(vp);

free(pw);

return err;

}

I think we need an extensive test suite either way, so I'll start with that (great point, thanks)! I don't think it matters, but it also might be slightly harder to make future optimizations or extensions if we use C++ sorts.

MaanasArora · 2025-11-07T01:56:36Z

numpy/_core/src/multiarray/stringdtype/sorts.c

+            *pj = vi;
+        }
+    }
+}


Same as above - I'd add an extensive test suite, and if it still seems to be an issue (due to complexity for example) we can see!

MaanasArora · 2025-11-07T01:57:21Z

numpy/_core/src/multiarray/stringdtype/sorts.c

+    Py_INCREF(argsort_method->method);
+    Py_DECREF(argsort_method);
+
+    return 0;


Unfortunately I tried init_string_dtype at first, and it still doesn't work because that is also initialized in multiarray as far as I can tell.

MaanasArora · 2025-11-07T01:58:51Z

numpy/_core/src/multiarray/stringdtype/sorts.h

+extern "C" {
+#endif
+
+#define SMALL_STRING_MERGESORT 20


Same as the algorithm, this is directly taken from npysort :

numpy/numpy/_core/src/npysort/mergesort.cpp

Line 40 in 2b216b1

#define SMALL_MERGESORT 20

MaanasArora · 2025-11-07T01:59:03Z

numpy/_core/src/umath/stringdtype_ufuncs.cpp


+    if (init_stringdtype_sorts() < 0) {
+        return -1;
+    }


See https://github.com/numpy/numpy/pull/30112/files#r2480095384

MaanasArora · 2025-11-07T02:00:10Z

numpy/_core/src/multiarray/array_method.c

+            //             "output DType is parametric. (method: %s)",
+            //             spec->name);
+            //     return -1;
+            // }


Yes, it's because the output is essentially fake for sorts (we enforce data[0] == data[1]. Do you mean we can check if that is the case here and only error if it's not? If so makes sense -- I'll do that, thank you!

ngoldbaum · 2025-11-07T02:26:18Z

Ah, that makes more sense, I didn't realize the sorting algorithm code was copied from elsewhere. Maybe we can refactor things so this calls into npy_mergesort? That would avoid duplication. Or maybe that's what you were saying about making the sorting code reusable.

I'll try to poke around in a debugger to see if I can come up with a suggestion for the resolve_descriptors stuff.

MaanasArora · 2025-11-07T02:32:34Z

Thanks! Yeah unfortunately the main issue with calling into npy_mergesort is that the sort_compare is not implemented yet, nor descending. (Actually, #29987 has a workaround for exactly this.)

ngoldbaum · 2025-11-07T02:35:31Z

I guess that argues for merging that PR before this one. I'll try to give that one a look soon. Sorry for letting it drop...

MaanasArora · 2025-11-07T02:38:42Z

No worries at all! It actually does StringDType too, but it's a larger change and there are some confusing points about the API there. Completely fine with doing this just for now!

…t corpus

MaanasArora · 2025-11-07T05:36:24Z

I've added tests using an ad-hoc string corpus; I couldn't find anything standard, sorry. It is two lorem ipsum passsages with some random unicode strings in between.

I'm getting some very bizzare failures with running the new test, only for argsort, and only when na_object is unset and the corpus has duplicates... I suspect it has to do something with the allocator. Debugging now.

MaanasArora · 2025-11-20T00:33:09Z

Closing in favor of #30266.

github-actions bot added the 01 - Enhancement label Oct 31, 2025

ENH: New-style sorting for StringDType

220338d

MaanasArora force-pushed the enh/string-sorts branch from dab50e6 to 220338d Compare October 31, 2025 03:46

MaanasArora commented Oct 31, 2025

View reviewed changes

MaanasArora added 2 commits November 1, 2025 13:52

BUG: Improve handling of method_flags in PyArray_(Arg)Sort for memory…

ce3f070

… safety and consistency

ENH: Add descending order support to string sorting

d4aa6c7

ngoldbaum added this to the 2.4.0 release milestone Nov 3, 2025

ngoldbaum reviewed Nov 6, 2025

View reviewed changes

MaanasArora commented Nov 7, 2025

View reviewed changes

MaanasArora added 2 commits November 7, 2025 00:25

TST: Refactor stringdtype sort tests and add test with new string sor…

2fcc5c5

…t corpus

STYLE: remove trailing whitespace

ff1d631

MaanasArora mentioned this pull request Nov 20, 2025

ENH: New-style sorting for StringDType #30266

Merged

MaanasArora closed this Nov 20, 2025

	static void
	npy_mergesort0(char pl, char pr, char pw, char vp, npy_intp elsize,
	PyArray_CompareFunc cmp, PyArrayObject arr)
	{
	char pi, pj, pk, pm;

	if (pr - pl > SMALL_MERGESORT * elsize) {
	/* merge sort */
	pm = pl + (((pr - pl) / elsize) >> 1) * elsize;
	npy_mergesort0(pl, pm, pw, vp, elsize, cmp, arr);
	npy_mergesort0(pm, pr, pw, vp, elsize, cmp, arr);
	GENERIC_COPY(pw, pl, pm - pl);
	pi = pw + (pm - pl);
	pj = pw;
	pk = pl;
	while (pj < pi && pm < pr) {
	if (cmp(pm, pj, arr) < 0) {
	GENERIC_COPY(pk, pm, elsize);
	pm += elsize;
	pk += elsize;
	}
	else {
	GENERIC_COPY(pk, pj, elsize);
	pj += elsize;
	pk += elsize;
	}
	}
	GENERIC_COPY(pk, pj, pi - pj);
	}
	else {
	/* insertion sort */
	for (pi = pl + elsize; pi < pr; pi += elsize) {
	GENERIC_COPY(vp, pi, elsize);
	pj = pi;
	pk = pi - elsize;
	while (pj > pl && cmp(vp, pk, arr) < 0) {
	GENERIC_COPY(pj, pk, elsize);
	pj -= elsize;
	pk -= elsize;
	}
	GENERIC_COPY(pj, vp, elsize);
	}
	}
	}

	NPY_NO_EXPORT int
	npy_mergesort(void start, npy_intp num, void varr)
	{
	PyArrayObject arr = (PyArrayObject )varr;
	npy_intp elsize = PyArray_ITEMSIZE(arr);
	PyArray_CompareFunc *cmp = PyDataType_GetArrFuncs(PyArray_DESCR(arr))->compare;
	char pl = (char )start;
	char pr = pl + num elsize;
	char *pw;
	char *vp;
	int err = -NPY_ENOMEM;

	/* Items that have zero size don't make sense to sort */
	if (elsize == 0) {
	return 0;
	}

	pw = (char )malloc((num >> 1) elsize);
	vp = (char *)malloc(elsize);

	if (pw != NULL && vp != NULL) {
	npy_mergesort0(pl, pr, pw, vp, elsize, cmp, arr);
	err = 0;
	}

	free(vp);
	free(pw);

	return err;
	}

Uh oh!

Conversation

MaanasArora commented Oct 31, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

MaanasArora commented Oct 31, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ngoldbaum commented Nov 3, 2025

Uh oh!

ngoldbaum left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

MaanasArora left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ngoldbaum commented Nov 7, 2025

Uh oh!

MaanasArora commented Nov 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ngoldbaum commented Nov 7, 2025

Uh oh!

MaanasArora commented Nov 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

MaanasArora commented Oct 31, 2025 •

edited

Loading

MaanasArora commented Oct 31, 2025 •

edited

Loading

MaanasArora commented Nov 7, 2025 •

edited

Loading

MaanasArora commented Nov 7, 2025 •

edited

Loading

MaanasArora commented Nov 7, 2025 •

edited

Loading