
@mhvk
Contributor

@mhvk mhvk commented Sep 27, 2025

It has always been a bit annoying that math.sin is so much faster than np.sin on a scalar.

This PR adds a fast check for numerical scalars, improving the speed by about a factor of 5-6, with essentially no cost for arrays. For now, the fast path covers only single-input, single-output ufuncs called with just the input argument. It assumes that the output has the same type as the input; if that assumption fails, the regular path is taken.
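For intuition, here is a rough Python sketch of the gatekeeping idea (the actual check lives in C inside ufunc_object.c and is stricter; the function name here is made up for illustration):

```python
import numbers
import numpy as np

def is_trivial_scalar_candidate(arg):
    # Rough sketch of the gate: plain Python/NumPy numeric scalars
    # qualify; arrays and everything else fall through to the regular
    # ufunc machinery.
    return (not isinstance(arg, np.ndarray)
            and isinstance(arg, (numbers.Number, np.number)))
```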

Sample speed test:

%timeit np.sin(1.5)
103 ns ± 3.32 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)  # this PR
688 ns ± 12.4 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)   # on main
%timeit math.sin(1.5)
38.3 ns ± 4.77 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

Note that I did some trials to find out where most time (690 ns on my laptop when plugged in) is spent in the regular path (tested by simply returning the input argument at various places):

  • 36 ns at start, OK
  • 39 ns -> +3 ns for memory allocation
  • 77 ns -> +38 ns for turning args into the full_args.in tuple
  • 94 ns -> +14 ns for override check
  • ! 344 ns -> +250 ns for convert_ufunc_arguments
  • 369 ns -> +15 ns for promote_and_get_ufuncimpl
  • 400 ns -> +31 ns for resolve_descriptors
  • ! 621 ns -> +221 ns for PyUFunc_GenericFunctionInternal
  • 651 ns -> +30 ns for wrap-check and return

So, about 40% of the time is spent in convert_ufunc_arguments, mostly in turning the scalars into arrays. PyArray_FromAny could perhaps be sped up a bit (the time needed is similar to that of np.array(scalar)), though I think it is dominated by actually creating an array structure in the first place.
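The conversion in question is roughly what np.asarray does from Python: even a bare float becomes a full 0-d float64 array before iteration can start.

```python
import numpy as np

# The regular path converts each scalar argument into a full 0-d array
# (roughly what np.asarray does) before setting up the iteration;
# building that array structure is where most of the time goes.
arr = np.asarray(1.5)
print(arr.shape, arr.dtype)   # () float64
```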

But there would also seem to be room for improvement at the 10% level for regular arrays, e.g., by not creating the full_args.in tuple until it is actually needed (for overrides). There may also be little bits to gain in PyUFunc_GenericFunctionInternal.

@charris
Member

charris commented Sep 28, 2025

The error is unrelated.

Member

@seberg seberg left a comment

Yeah, also one of those long-standing things... I suppose this is rather simple; how much does it add to the simplest array paths?

This is of course (and has to be) specific to NumPy scalars and 1 in/1 out ufuncs.

While I think we need to add a few more defensive checks here, which will slow it down slightly, it is simple enough that it seems worthwhile to just do this.

I am really curious how far we can get without a fully special path here, e.g. by trying to make creation of (simple) 0-D arrays very fast (without that, one would have to delay array creation, since channeling non-arrays into the iterator wouldn't be fun, unfortunately.)

@mhvk
Contributor Author

mhvk commented Sep 29, 2025

Yeah, also one of those long-standing things... I suppose this is rather simple; how much does it add to the simplest array paths?

As far as I could tell, the impact on arrays is negligible, because of the !PyArray_Check(args[0]) check in front of the call. (Note that len_args == 1 should be the very first check - I've got that committed, but will hold off pushing until I've been able to answer your comments.)

This is of course (and has to be) specific to NumPy scalars and 1 in/1 out ufuncs.

And to Python floats! For 2-in/1-out ufuncs it is much less problematic, since most of those are called indirectly and the operators can avoid the trouble.
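To illustrate which calls the fast path targets (unary calls on Python floats and NumPy numeric scalars; results are unchanged, only the dispatch gets faster):

```python
import math
import numpy as np

r1 = np.sin(1.5)              # Python float input
r2 = np.sin(np.float64(1.5))  # NumPy scalar input
# Either way the result is a NumPy scalar matching math.sin's value.
assert isinstance(r1, np.float64) and r1 == r2
assert abs(r1 - math.sin(1.5)) < 1e-12
```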

While I think we need to add a few more defensive checks here, which will slow it down slightly, it is simple enough that it seems worthwhile to just do this.

I tried hard to think what could go wrong with what I did, bailing in the face of uncertainty. Let me know if I missed something!

I am really curious how far we can get without a fully special path here, e.g. by trying to make creation of (simple) 0-D arrays very fast (without that, one would have to delay array creation, since channeling non-arrays into the iterator wouldn't be fun, unfortunately.)

Agreed, we should make array creation faster. But just doing np.empty(()) is already slow, and it seems hard to believe one could reduce that ~10-fold (which would be needed to become competitive).

@seberg
Member

seberg commented Sep 29, 2025

It seems hard to believe one could reduce it ~10-fold

Yeah, probably not quite that much. One could have a free-list, which might be very fast. But we may have to check for a custom allocator being set, and that check alone might actually be significant in this context (compared to what you have now).

I tried hard to think what could go wrong with what I did, bailing in the face of uncertainty. Let me know if I missed something!

Just the two promotion/resolving steps:

  • I think you didn't check that the loop DTypes match, that is, that both the input and output DTypes are what you expect them to be. That may be just an identity check, but it is crucial.
  • Skipping resolve_descriptors should be OK-ish, but if it costs just ~5%, I might prefer not to skip it (maybe a check for the default one could also recover part of that). You could imagine a ufunc that does something more, but that is hard to imagine for the builtin numeric ones (it would have to be something like a byteswap() ufunc, which would be useless).

@mhvk mhvk force-pushed the ufunc-scalar-speed-up branch from f7f5900 to c3f0573 on September 29, 2025 15:58
@mhvk
Contributor Author

mhvk commented Sep 29, 2025

OK, I pushed the changes so far.

I can try to measure how much extra time resolve_descriptors takes. I think I was confused about it: I thought you meant the function inside ufunc_object (which takes operands[] that I don't have), but now realize you probably meant method->resolve_descriptors. Anyway, I've got to teach soon, so better focus on that now. Will look at it later.

Member

@seberg seberg left a comment

Thanks, a few smaller comments. I suppose we should rarely add overhead (mainly in functions that currently always promote).
But if we promote and go down the slow path (i.e. scalar in), the added overhead will be overshadowed by a lot.

Still a bit curious about just using resolve_descriptors (maybe with a check whether it is the default one). But now with the DType check I think this is OK.
(I am also not 100% sure that there might not be some problem there.)

So overall, I suspect we should actually just do it, it's simple enough compared to the speedup.

This might actually propel us fairly close to the scalarmath binary operators speed-wise, but I guess those are unfortunately a bit more special...

Comment on lines 4369 to 4370
// Clear error, may just be because of wrong assumption.
PyErr_Clear();
Member

Suggested change (remove these lines):
// Clear error, may just be because of wrong assumption.
PyErr_Clear();

We've got to remove this. We really can't rely on this ever erroring. The reason this works at all is solely the old-style ufuncs, which have to do a second dance of loop lookup (which is a waste, but can fail gracefully).

But... I assume that the DType check above will make this check unnecessary. (We are still gambling on resolve_descriptors, but that is a pretty small gamble for these dtypes).

Contributor Author

OK, let me just return -1 and change the comment to say that this should not happen.
...
At least locally, all the tests pass.

@mhvk mhvk force-pushed the ufunc-scalar-speed-up branch from 00b71fb to 9aee826 on October 14, 2025 19:57
Contributor Author

@mhvk mhvk left a comment

Thanks for the detailed review. All comments addressed, I think!

@mhvk
Contributor Author

mhvk commented Oct 14, 2025

This might actually propel us fairly close to the scalarmath binary operators speed-wise, but I guess those are unfortunately a bit more special...

I did think about the binary operators, but, as you say, they already have a special path, and typically one will do scalar1 + scalar2, so it is less important that np.add(scalar1, scalar2) is very fast too. They will hopefully get the benefit from storing on the instance, though.
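A quick check of the two routes mentioned (the results agree; the difference is purely which dispatch path is taken):

```python
import numpy as np

a, b = np.float64(1.5), np.float64(2.5)
# a + b takes NumPy's dedicated scalarmath path; np.add(a, b) goes
# through the general ufunc machinery.  Same answer either way.
assert a + b == np.add(a, b) == 4.0
```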

@mhvk
Contributor Author

mhvk commented Oct 14, 2025

Too bad with all the efforts of speeding up the allocation, but in the end this seems worthwhile.

The allocation work will still help a lot with things like np.multiply(small_array, scalar), etc.

@mhvk mhvk force-pushed the ufunc-scalar-speed-up branch from 9aee826 to 9258649 on October 14, 2025 20:39
@mhvk
Contributor Author

mhvk commented Oct 14, 2025

The circle ci/build failure is due to a timeout error.

@seberg
Member

seberg commented Oct 15, 2025

Thanks, I think I'll try to add a check for resolve_descriptors being the default one (or maybe even call it) -- the weird resolve with scalars doesn't matter for unary ufuncs, I think.

I would take a <10% performance hit there to be on the safe side, and that might also unlock supporting a few more cases (though I'm not suggesting we necessarily do that now).

@mhvk
Contributor Author

mhvk commented Nov 11, 2025

@seberg - your comment just above suggested you still wanted to make a small change. What would that be? Or maybe it's OK to just go with what I have here?

This adds a fast check for numerical scalars, improving the speed by
about a factor of 6, with essentially no cost for arrays.  At the
moment, just for single-input, single-output ufuncs, with the
assumption that the output has the same type as the input (if that
fails, the regular path is taken).
@seberg seberg force-pushed the ufunc-scalar-speed-up branch from 9258649 to c47a5c6 on November 11, 2025 09:58
@seberg
Member

seberg commented Nov 11, 2025

Sorry, forgot about this. I pushed a commit that adds it; it was a bit more unwieldy than I hoped. I think the "scalar" special path doesn't matter for unary ufuncs. (I did rebase on main, so unfortunately it's not just a pull.)

For me, it pushes things from 77 ns to 82 ns for the np.sin(1.5) example. It does make np.signbit fast, not that I think it particularly matters.
But overall, I would prefer to add this, since it is <10% overhead and does fix potential subtle issues.

Contributor Author

@mhvk mhvk left a comment

@seberg - Thanks! But I'm still confused why we need resolve_descriptors when the goal is not to be general but just to make a common case much faster (i.e., it is fine to bail on signbit!)

I looked back at NEP 43 to try to understand why we would need resolve_descriptors and it states (omitting strings):

  • In general, whenever an output DType is parametric the parameters have to be found (resolved).
  • When they are not parametric, a default implementation is provided which fills in the default dtype instances (ensuring for example native byte order).

Neither would seem relevant here - since we only accept standard Python and NumPy scalars, are we not guaranteed to deal with a non-parametric dtype with native byte order?

I do see that my original check that the output had to have the same dtype as the input would fail for signbit, but I'm still confused how it could fail for, say, np.sin, i.e., how can resolve_descriptors be relevant?
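The distinction can be seen directly from Python: np.signbit's output dtype differs from its input dtype (so the same-dtype assumption bails), while np.sin's matches:

```python
import numpy as np

x = np.float64(-1.5)
assert isinstance(np.sin(x), np.float64)     # output dtype == input dtype
assert isinstance(np.signbit(x), np.bool_)   # output dtype != input dtype
assert np.signbit(x)                         # -1.5 has its sign bit set
```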

@mhvk
Contributor Author

mhvk commented Nov 11, 2025

p.s. The core dumps seem related...

@seberg
Member

seberg commented Nov 11, 2025

Ack, yeah, the crash is certainly related.

Yeah, maybe it is rather unnecessary; it would make timedelta work (at least in principle) and would bring us closer to allowing more scalars, but that requires some restructuring anyway.

You are right it seems unnecessary, I don't love that there is no guarantee nobody writes a byteswap_input ufunc, but that is pretty unlikely in practice. So yeah... probably should just undo this again (and at best add a small comment).

@mhvk
Contributor Author

mhvk commented Nov 11, 2025

I don't love that there is no guarantee nobody writes a byteswap_input ufunc

I'd be happy to guard against that, as long as it is fast (maybe, ufunc->types != NULL?)

But it also seems fine not to guard against the hypothetical, when it is so difficult to come up with an example where it might be problematic.

@seberg seberg force-pushed the ufunc-scalar-speed-up branch from c47a5c6 to a2d99b7 on November 13, 2025 09:16
@seberg
Member

seberg commented Nov 13, 2025

Yeah, let's try what you had then. I do think allowing a different output dtype makes sense (because it should be pretty much free), and I also think adding the resolve-dtypes call is likely good if just for silly safety.
But... I can't disagree that it shouldn't really matter in practice.

(Just one last note because I don't think I said it: Unary ufuncs don't do special scalar promotions, which means the int/float handling here should be correct.)
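For example, illustrating that note: a Python int input to a unary ufunc simply dispatches to the float64 loop, same as a float, with no value-based special casing.

```python
import numpy as np

# Both inputs dispatch to the double loop and give identical results.
assert isinstance(np.sin(1), np.float64)
assert np.sin(1) == np.sin(1.0)
```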

@seberg seberg merged commit 5d4dc2a into numpy:main Nov 13, 2025
77 checks passed
@mhvk mhvk deleted the ufunc-scalar-speed-up branch November 13, 2025 13:39
@hawkinsp
Contributor

hawkinsp commented Nov 16, 2025

I think this PR introduced a race condition that we see in JAX's tsan CI (https://github.com/jax-ml/jax/actions/runs/19400717531):

WARNING: ThreadSanitizer: data race (pid=95179)
  Write of size 8 at 0x7fffb4060f78 by thread T3 (mutexes: read M0, write M1):
    #0 PyArrayIdentityHash_SetItem /__w/jax/jax/numpy/.mesonpy-tk47p_zd/../numpy/_core/src/common/npy_hashtable.cpp:224:20 (_multiarray_umath.cpython-314t-x86_64-linux-gnu.so+0x41dbbb) (BuildId: fe065841b00019c85589d862249e0c30081a9a3f)
    #1 promote_and_get_info_and_ufuncimpl(_tagPyUFuncObject*, tagPyArrayObject* const*, PyArray_DTypeMeta**, PyArray_DTypeMeta**, unsigned char) /__w/jax/jax/numpy/.mesonpy-tk47p_zd/../numpy/_core/src/umath/dispatching.cpp:871:17 (_multiarray_umath.cpython-314t-x86_64-linux-gnu.so+0x435463) (BuildId: fe065841b00019c85589d862249e0c30081a9a3f)
    #2 promote_and_get_info_and_ufuncimpl_with_locking(_tagPyUFuncObject*, tagPyArrayObject* const*, PyArray_DTypeMeta**, PyArray_DTypeMeta**, unsigned char) /__w/jax/jax/numpy/.mesonpy-tk47p_zd/../numpy/_core/src/umath/dispatching.cpp:1004:12 (_multiarray_umath.cpython-314t-x86_64-linux-gnu.so+0x432fd9) (BuildId: fe065841b00019c85589d862249e0c30081a9a3f)
    #3 promote_and_get_ufuncimpl /__w/jax/jax/numpy/.mesonpy-tk47p_zd/../numpy/_core/src/umath/dispatching.cpp:1101:22 (_multiarray_umath.cpython-314t-x86_64-linux-gnu.so+0x432fd9)
    #4 ufunc_generic_fastcall /__w/jax/jax/numpy/.mesonpy-tk47p_zd/../numpy/_core/src/umath/ufunc_object.c:4683:38 (_multiarray_umath.cpython-314t-x86_64-linux-gnu.so+0x4404df) (BuildId: fe065841b00019c85589d862249e0c30081a9a3f)
...

  Previous read of size 8 at 0x7fffb4060f78 by thread T2 (mutexes: read M0):
    #0 find_item(PyArrayIdentityHash const*, _object* const*) /__w/jax/jax/numpy/.mesonpy-tk47p_zd/../numpy/_core/src/common/npy_hashtable.cpp:79:13 (_multiarray_umath.cpython-314t-x86_64-linux-gnu.so+0x41ddd0) (BuildId: fe065841b00019c85589d862249e0c30081a9a3f)
    #1 PyArrayIdentityHash_GetItem /__w/jax/jax/numpy/.mesonpy-tk47p_zd/../numpy/_core/src/common/npy_hashtable.cpp:240:21 (_multiarray_umath.cpython-314t-x86_64-linux-gnu.so+0x41ddd0)
    #2 try_trivial_scalar_call /__w/jax/jax/numpy/.mesonpy-tk47p_zd/../numpy/_core/src/umath/ufunc_object.c:4340:22 (_multiarray_umath.cpython-314t-x86_64-linux-gnu.so+0x43f606) (BuildId: fe065841b00019c85589d862249e0c30081a9a3f)
    #3 ufunc_generic_fastcall /__w/jax/jax/numpy/.mesonpy-tk47p_zd/../numpy/_core/src/umath/ufunc_object.c:4425:13 (_multiarray_umath.cpython-314t-x86_64-linux-gnu.so+0x43f606)
    #4 ufunc_generic_vectorcall /__w/jax/jax/numpy/.mesonpy-tk47p_zd/../numpy/_core/src/umath/ufunc_object.c:4766:12 (_multiarray_umath.cpython-314t-x86_64-linux-gnu.so+0x43d60e) (BuildId: fe065841b00019c85589d862249e0c30081a9a3f)
    #5 _PyObject_VectorcallTstate /__w/jax/jax/cpython/./Include/internal/pycore_call.h:169:11 (python3.14+0x1fa17a) (BuildId: 010258c9a6c23915fd3f27ed01befca5ea57f252)

I suspect that's enough for you to figure it out but if you like I can make a better repro.

@mhvk
Contributor Author

mhvk commented Nov 16, 2025

@hawkinsp - Ouch, that seems clearly a mistake here.

Looking through your trace, the data race occurs when the new code reads the ufunc implementation cache, presumably for the case where the identity-hash entry is not yet defined, as the race is with a SetItem inside promote_and_get_ufuncimpl in dispatching.cpp. I think the problem is that I simply cribbed my bit of code from promote_and_get_info_and_ufuncimpl, while I should have added a case for Py_GIL_DISABLED following the start of promote_and_get_info_and_ufuncimpl_with_locking.

Obviously not difficult to fix in principle, though seeing this, it may be better to think through how to make this safer generally. For instance, should PyArrayIdentityHash_GetItem not take care of this itself when the GIL is disabled? After all, it is the cache which carries the mutex, so it would seem logical to do the locking directly in the code that uses it. Looking at npy_hashtable, that would seem relatively simple: just have separate versions of *_GetItem and *_SetItem with and without the GIL. If that works, one could get rid of promote_and_get_ufuncimpl_with_locking.

I am happy to try that (though not before Thursday), but one problem is that I'm not sure how to write a good test case... @ngoldbaum, since you were the one who introduced the _with_locking variant, could you advise?

@seberg
Member

seberg commented Nov 17, 2025

I think we can just add the same locking code? For testing, I think this test covers it:

def test_parallel_ufunc_execution():

it just needs parametrization to also cover the scalar path probably.

As for the error-checking issue (not discussed here): you should be able to make the PyErr_Occurred() check conditional on the legacy wrapping; not sure how slow it actually is (i.e. whether that is worthwhile). I think the ufunc code may currently be doing it unconditionally.

May look at it later too; both seem like pretty straightforward follow-ups.
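A minimal sketch of such a parametrized stress test (the thread count and the function name hammer are assumptions, not the actual NumPy test):

```python
import threading
import numpy as np

def hammer(results, i, n=10_000):
    # Repeatedly hit the scalar fast path; under free-threaded CPython
    # this exercises the dispatch-cache read against concurrent writes.
    acc = 0.0
    for _ in range(n):
        acc = np.sin(1.5)
    results[i] = acc

results = [None] * 4
threads = [threading.Thread(target=hammer, args=(results, i))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert all(r == np.sin(1.5) for r in results)
```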

@mhvk
Contributor Author

mhvk commented Nov 17, 2025

@seberg - yes, agreed it is not too hard. The main question would seem to be whether to add it in my new fast path, or to move things more generally to where they are actually needed, i.e., locking in npy_hashtable itself, error checking inside the legacy loop wrapper. But having a quick local fix now and refactoring later would be a good solution too.

p.s. Unlikely to get to this myself until Thursday, so certainly happy if anyone else has a fix beforehand!

@seberg
Member

seberg commented Nov 17, 2025

error checking inside the legacy loop wrapper

I have a PR for that one, and no, that doesn't work, because you don't have the GIL there. The second one is probably less urgent, but let's see.

@mhvk
Contributor Author

mhvk commented Nov 17, 2025

error checking inside the legacy loop wrapper

I have a PR for that one, and no that doesn't work, because you don't have the GIL there.

Ah yes, I have now seen the PR. Since it apparently has very little cost, it is definitely easiest to go with a local check. (Conceptually, I'd still like to push it into the legacy loop wrappers even if it is fairly costly there, but I'm saying that without really knowing how much hassle it would be...)

@ngoldbaum
Member

@mhvk let me know if you want me to take a crack at this. I can probably fix this on funded time for Quansight's free-threading project.

@mhvk
Contributor Author

mhvk commented Nov 18, 2025

@ngoldbaum - I think it would be faster for you than for me, so definitely fine for you to do it. I'm very unlikely to have time myself before Friday.
