
@mhvk
Contributor

@mhvk mhvk commented Sep 27, 2025

It has always been a bit annoying that math.sin is so much faster than np.sin on a scalar.

This PR adds a fast check for numerical scalars, improving the speed by about a factor of 5-6, with essentially no cost for arrays. For now, the fast path covers only single-input, single-output ufuncs called with just the input argument. It assumes that the output has the same type as the input; if that assumption fails, the regular path is taken.
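For intuition, here is a rough Python sketch of the gatekeeping idea (the actual check lives in C inside ufunc_object.c and is stricter; the function name here is made up for illustration):

```python
import numbers
import numpy as np

def is_trivial_scalar_candidate(arg):
    # Rough sketch of the gate: plain Python/NumPy numeric scalars
    # qualify; arrays and everything else fall through to the regular
    # ufunc machinery.
    return (not isinstance(arg, np.ndarray)
            and isinstance(arg, (numbers.Number, np.number)))
```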

Sample speed test:

%timeit np.sin(1.5)
103 ns ± 3.32 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)  # this PR
688 ns ± 12.4 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)   # on main
%timeit math.sin(1.5)
38.3 ns ± 4.77 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

Note that I did some trials to find out where most time (690 ns on my laptop when plugged in) is spent in the regular path (tested by simply returning the input argument at various places):

  • 36 ns at start, OK
  • 39 ns -> +3 ns for memory allocation
  • 77 ns -> +38 ns for turning args into the full_args.in tuple
  • 94 ns -> +14 ns for override check
  • ! 344 ns -> +250 ns for convert_ufunc_arguments
  • 369 ns -> +15 ns for promote_and_get_ufuncimpl
  • 400 ns -> +31 ns for resolve_descriptors
  • ! 621 ns -> +221 ns for PyUFunc_GenericFunctionInternal
  • 651 ns -> +30 ns for wrap-check and return

So, about 40% of the time is spent in convert_ufunc_arguments, mostly in turning the scalars into arrays. PyArray_FromAny could perhaps be sped up a bit (the time needed is similar to that of np.array(scalar)), though I think it is dominated by actually creating an array structure in the first place.
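The conversion in question is roughly what np.asarray does from Python: even a bare float becomes a full 0-d float64 array before iteration can start.

```python
import numpy as np

# The regular path converts each scalar argument into a full 0-d array
# (roughly what np.asarray does) before setting up the iteration;
# building that array structure is where most of the time goes.
arr = np.asarray(1.5)
print(arr.shape, arr.dtype)   # () float64
```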

But there would also seem to be room for improvement at the 10% level for regular arrays, e.g., by not creating the full_args.in tuple until it is actually needed (for overrides). There may also be little bits to gain in PyUFunc_GenericFunctionInternal.

@charris
Member

charris commented Sep 28, 2025

The error is unrelated.

Member

@seberg seberg left a comment

Yeah, also one of those long-standing things... I suppose this is rather simple; how much does it add to the simplest array paths?

This is of course (and has to be) specific to NumPy scalars and 1 in/1 out ufuncs.

While I think we need to add a few more defensive checks here, which will slow it down slightly, it is simple enough that it seems worthwhile to just do this.

I am really curious how far we can get without a fully special path here, e.g. by trying to make creation of (simple) 0-D arrays very fast (without that, one would have to delay array creation, since channeling non-arrays into the iterator wouldn't be fun, unfortunately.)

@mhvk
Contributor Author

mhvk commented Sep 29, 2025

Yeah, also one of those long-standing things... I suppose this is rather simple; how much does it add to the simplest array paths?

As far as I could tell, the impact on arrays is negligible, because of the !PyArray_Check(args[0]) check in front of the call. (Note that len_args == 1 should be the very first check - I've got that committed, but will hold off pushing until I've been able to answer your comments.)

This is of course (and has to be) specific to NumPy scalars and 1 in/1 out ufuncs.

And to Python floats! For 2-in/1-out ufuncs it is much less problematic, since most of those are called indirectly and the operators can avoid the trouble.
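To illustrate which calls the fast path targets (unary calls on Python floats and NumPy numeric scalars; results are unchanged, only the dispatch gets faster):

```python
import math
import numpy as np

r1 = np.sin(1.5)              # Python float input
r2 = np.sin(np.float64(1.5))  # NumPy scalar input
# Either way the result is a NumPy scalar matching math.sin's value.
assert isinstance(r1, np.float64) and r1 == r2
assert abs(r1 - math.sin(1.5)) < 1e-12
```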

While I think we need to add a few more defensive checks here, which will slow it down slightly, it is simple enough that it seems worthwhile to just do this.

I tried hard to think what could go wrong with what I did, bailing in the face of uncertainty. Let me know if I missed something!

I am really curious how far we can get without a fully special path here, e.g. by trying to make creation of (simple) 0-D arrays very fast (without that, one would have to delay array creation, since channeling non-arrays into the iterator wouldn't be fun, unfortunately.)

Agreed, we should make array creation faster. But just doing np.empty(()) is already slow, and it seems hard to believe one could reduce that ~10-fold (which would be needed to become competitive).

@seberg
Member

seberg commented Sep 29, 2025

It seems hard to believe one could reduce it ~10-fold

Yeah, probably not quite that much. One could have a free-list, which might be very fast. But we may have to check for a custom allocator being set, and that check alone might actually be significant in this context (compared to what you have now).

I tried hard to think what could go wrong with what I did, bailing in the face of uncertainty. Let me know if I missed something!

Just the two promotion/resolving steps:

  • I think you didn't check that the loop DTypes match, that is, that both the input and output DTypes are what you expect them to be. That may be just an identity check, but it is crucial.
  • Skipping resolve_descriptors should be OK-ish, but if it costs just ~5%, I might prefer not to skip it (maybe a check for the default one could also recover part of that). You could imagine a ufunc that does something more, but that is hard to imagine for the builtin numeric ones (it would have to be something like a byteswap() ufunc, which would be useless).

@mhvk mhvk force-pushed the ufunc-scalar-speed-up branch from f7f5900 to c3f0573 on September 29, 2025 15:58
@mhvk
Contributor Author

mhvk commented Sep 29, 2025

OK, I pushed the changes so far.

I can try to measure how much extra time resolve_descriptors takes. I think I was confused about it: I thought you meant the function inside ufunc_object (which takes operands[] that I don't have), but now realize you probably meant method->resolve_descriptors. Anyway, I've got to teach soon, so better focus on that now. Will look at it later.

Member

@seberg seberg left a comment

Thanks, a few smaller comments. I suppose we should rarely add overhead (mainly in functions that currently always promote).
But if we promote and go down the slow path (i.e. scalar in), the added overhead will be overshadowed by a lot.

Still a bit curious about just using resolve_descriptors (maybe with a check whether it is the default one). But now with the DType check I think this is OK.
(I am also not 100% sure that there might not be some problem there.)

So overall, I suspect we should actually just do it, it's simple enough compared to the speedup.

This might actually propel us fairly close to the scalarmath binary operators speed-wise, but I guess those are unfortunately a bit more special...

Comment on lines 4369 to 4370
// Clear error, may just be because of wrong assumption.
PyErr_Clear();
Member

Suggested change (remove these lines):
// Clear error, may just be because of wrong assumption.
PyErr_Clear();

We've got to remove this. We really can't rely on this ever erroring. The reason this works at all is solely the old-style ufuncs, which have to do a second dance of loop lookup (which is a waste, but can fail gracefully).

But... I assume that the DType check above will make this check unnecessary. (We are still gambling on resolve_descriptors, but that is a pretty small gamble for these dtypes).

Contributor Author

OK, let me just return -1 and change the comment to say that this should not happen.
...
At least locally, all the tests pass.

@mhvk mhvk force-pushed the ufunc-scalar-speed-up branch from 00b71fb to 9aee826 on October 14, 2025 19:57
Contributor Author

@mhvk mhvk left a comment

Thanks for the detailed review. All comments addressed, I think!

@mhvk
Contributor Author

mhvk commented Oct 14, 2025

This might actually propel us fairly close to the scalarmath binary operators speed-wise, but I guess those are unfortunately a bit more special...

I did think about the binary operators, but, as you say, they already have a special path, and typically one will do scalar1 + scalar2, so it is less important that np.add(scalar1, scalar2) is very fast too. They will hopefully get the benefit from storing on the instance, though.
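A quick check of the two routes mentioned (the results agree; the difference is purely which dispatch path is taken):

```python
import numpy as np

a, b = np.float64(1.5), np.float64(2.5)
# a + b takes NumPy's dedicated scalarmath path; np.add(a, b) goes
# through the general ufunc machinery.  Same answer either way.
assert a + b == np.add(a, b) == 4.0
```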

@mhvk
Contributor Author

mhvk commented Oct 14, 2025

Too bad with all the efforts of speeding up the allocation, but in the end this seems worthwhile.

The allocation work will still help a lot with things like np.multiply(small_array, scalar), etc.

@mhvk mhvk force-pushed the ufunc-scalar-speed-up branch from 9aee826 to 9258649 on October 14, 2025 20:39
@mhvk
Contributor Author

mhvk commented Oct 14, 2025

The circle ci/build failure is due to a timeout error.

@seberg
Member

seberg commented Oct 15, 2025

Thanks, I think I'll try to add a check for resolve_descriptors being the default one (or maybe even call it) -- the weird resolve with scalars doesn't matter for unary ufuncs, I think.

I would take a <10% performance hit there to be on the safe side, and that might also unlock supporting a few more cases (though I'm not suggesting we necessarily do that now).

@mhvk
Contributor Author

mhvk commented Nov 11, 2025

@seberg - your comment just above suggested you still wanted to make a small change. What would that be? Or maybe it's OK to just go with what I have here?

This adds a fast check for numerical scalars, improving the speed by
about a factor of 6, with essentially no cost for arrays.  At the
moment, just for single-input, single-output ufuncs, with the
assumption that the output has the same type as the input (if that
fails, the regular path is taken).
@seberg seberg force-pushed the ufunc-scalar-speed-up branch from 9258649 to c47a5c6 on November 11, 2025 09:58
@seberg
Member

seberg commented Nov 11, 2025

Sorry, forgot about this. I pushed a commit that adds it; it was a bit more unwieldy than I hoped. I think the "scalar" special path doesn't matter for unary ufuncs. (I did rebase on main, so unfortunately it's not just a pull.)

For me, it pushes things from 77 ns to 82 ns for the np.sin(1.5) example. It does make np.signbit fast, not that I think it particularly matters.
But overall, I would prefer to add this, since it is <10% overhead and does fix potential subtle issues.

Contributor Author

@mhvk mhvk left a comment

@seberg - Thanks! But I'm still confused why we need resolve_descriptors when the goal is not to be general but just to make a common case much faster (i.e., it is fine to bail on signbit!)

I looked back at NEP 43 to try to understand why we would need resolve_descriptors and it states (omitting strings):

  • In general, whenever an output DType is parametric the parameters have to be found (resolved).
  • When they are not parametric, a default implementation is provided which fills in the default dtype instances (ensuring for example native byte order).

Neither would seem relevant here - since we only accept standard Python and NumPy scalars, are we not guaranteed to deal with a non-parametric dtype with native byte order?

I do see that my original check that the output had to have the same dtype as the input would fail for signbit, but I'm still confused how it could fail for, say, np.sin, i.e., how can resolve_descriptors be relevant?
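The distinction can be seen directly from Python: np.signbit's output dtype differs from its input dtype (so the same-dtype assumption bails), while np.sin's matches:

```python
import numpy as np

x = np.float64(-1.5)
assert isinstance(np.sin(x), np.float64)     # output dtype == input dtype
assert isinstance(np.signbit(x), np.bool_)   # output dtype != input dtype
assert np.signbit(x)                         # -1.5 has its sign bit set
```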

@mhvk
Contributor Author

mhvk commented Nov 11, 2025

p.s. The core dumps seem related...

@seberg
Member

seberg commented Nov 11, 2025

Ack, yeah, the crash is certainly related.

Yeah, maybe it is rather unnecessary; it would make timedelta work (at least in principle) and would bring us closer to allowing more scalars, but that requires some restructuring anyway.

You are right it seems unnecessary, I don't love that there is no guarantee nobody writes a byteswap_input ufunc, but that is pretty unlikely in practice. So yeah... probably should just undo this again (and at best add a small comment).

@mhvk
Contributor Author

mhvk commented Nov 11, 2025

I don't love that there is no guarantee nobody writes a byteswap_input ufunc

I'd be happy to guard against that, as long as it is fast (maybe, ufunc->types != NULL?)

But it also seems fine not to guard against the hypothetical, when it is so difficult to come up with an example where it might be problematic.

@seberg seberg force-pushed the ufunc-scalar-speed-up branch from c47a5c6 to a2d99b7 on November 13, 2025 09:16
@seberg
Member

seberg commented Nov 13, 2025

Yeah, let's try what you had then. I do think allowing a different output dtype makes sense (because it should be pretty much free), and I also think adding the resolve-dtypes call is likely good if just for silly safety.
But... I can't disagree that it shouldn't really matter in practice.

(Just one last note because I don't think I said it: Unary ufuncs don't do special scalar promotions, which means the int/float handling here should be correct.)
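For example, illustrating that note: a Python int input to a unary ufunc simply dispatches to the float64 loop, same as a float, with no value-based special casing.

```python
import numpy as np

# Both inputs dispatch to the double loop and give identical results.
assert isinstance(np.sin(1), np.float64)
assert np.sin(1) == np.sin(1.0)
```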

@seberg seberg merged commit 5d4dc2a into numpy:main Nov 13, 2025
77 checks passed
@mhvk mhvk deleted the ufunc-scalar-speed-up branch November 13, 2025 13:39
@hawkinsp
Contributor

hawkinsp commented Nov 16, 2025

I think this PR introduced a race condition that we see in JAX's tsan CI (https://github.com/jax-ml/jax/actions/runs/19400717531):

WARNING: ThreadSanitizer: data race (pid=95179)
  Write of size 8 at 0x7fffb4060f78 by thread T3 (mutexes: read M0, write M1):
    #0 PyArrayIdentityHash_SetItem /__w/jax/jax/numpy/.mesonpy-tk47p_zd/../numpy/_core/src/common/npy_hashtable.cpp:224:20 (_multiarray_umath.cpython-314t-x86_64-linux-gnu.so+0x41dbbb) (BuildId: fe065841b00019c85589d862249e0c30081a9a3f)
    #1 promote_and_get_info_and_ufuncimpl(_tagPyUFuncObject*, tagPyArrayObject* const*, PyArray_DTypeMeta**, PyArray_DTypeMeta**, unsigned char) /__w/jax/jax/numpy/.mesonpy-tk47p_zd/../numpy/_core/src/umath/dispatching.cpp:871:17 (_multiarray_umath.cpython-314t-x86_64-linux-gnu.so+0x435463) (BuildId: fe065841b00019c85589d862249e0c30081a9a3f)
    #2 promote_and_get_info_and_ufuncimpl_with_locking(_tagPyUFuncObject*, tagPyArrayObject* const*, PyArray_DTypeMeta**, PyArray_DTypeMeta**, unsigned char) /__w/jax/jax/numpy/.mesonpy-tk47p_zd/../numpy/_core/src/umath/dispatching.cpp:1004:12 (_multiarray_umath.cpython-314t-x86_64-linux-gnu.so+0x432fd9) (BuildId: fe065841b00019c85589d862249e0c30081a9a3f)
    #3 promote_and_get_ufuncimpl /__w/jax/jax/numpy/.mesonpy-tk47p_zd/../numpy/_core/src/umath/dispatching.cpp:1101:22 (_multiarray_umath.cpython-314t-x86_64-linux-gnu.so+0x432fd9)
    #4 ufunc_generic_fastcall /__w/jax/jax/numpy/.mesonpy-tk47p_zd/../numpy/_core/src/umath/ufunc_object.c:4683:38 (_multiarray_umath.cpython-314t-x86_64-linux-gnu.so+0x4404df) (BuildId: fe065841b00019c85589d862249e0c30081a9a3f)
...

  Previous read of size 8 at 0x7fffb4060f78 by thread T2 (mutexes: read M0):
    #0 find_item(PyArrayIdentityHash const*, _object* const*) /__w/jax/jax/numpy/.mesonpy-tk47p_zd/../numpy/_core/src/common/npy_hashtable.cpp:79:13 (_multiarray_umath.cpython-314t-x86_64-linux-gnu.so+0x41ddd0) (BuildId: fe065841b00019c85589d862249e0c30081a9a3f)
    #1 PyArrayIdentityHash_GetItem /__w/jax/jax/numpy/.mesonpy-tk47p_zd/../numpy/_core/src/common/npy_hashtable.cpp:240:21 (_multiarray_umath.cpython-314t-x86_64-linux-gnu.so+0x41ddd0)
    #2 try_trivial_scalar_call /__w/jax/jax/numpy/.mesonpy-tk47p_zd/../numpy/_core/src/umath/ufunc_object.c:4340:22 (_multiarray_umath.cpython-314t-x86_64-linux-gnu.so+0x43f606) (BuildId: fe065841b00019c85589d862249e0c30081a9a3f)
    #3 ufunc_generic_fastcall /__w/jax/jax/numpy/.mesonpy-tk47p_zd/../numpy/_core/src/umath/ufunc_object.c:4425:13 (_multiarray_umath.cpython-314t-x86_64-linux-gnu.so+0x43f606)
    #4 ufunc_generic_vectorcall /__w/jax/jax/numpy/.mesonpy-tk47p_zd/../numpy/_core/src/umath/ufunc_object.c:4766:12 (_multiarray_umath.cpython-314t-x86_64-linux-gnu.so+0x43d60e) (BuildId: fe065841b00019c85589d862249e0c30081a9a3f)
    #5 _PyObject_VectorcallTstate /__w/jax/jax/cpython/./Include/internal/pycore_call.h:169:11 (python3.14+0x1fa17a) (BuildId: 010258c9a6c23915fd3f27ed01befca5ea57f252)

I suspect that's enough for you to figure it out but if you like I can make a better repro.

@mhvk
Contributor Author

mhvk commented Nov 16, 2025

@hawkinsp - Ouch, that seems clearly a mistake here.

Looking through your trace, the data race occurs when the new code reads the ufunc implementation cache, presumably for the case where the identity-hash entry is not yet defined, as the race is with a SetItem inside promote_and_get_ufuncimpl in dispatching.cpp. I think the problem is that I simply cribbed my bit of code from promote_and_get_info_and_ufuncimpl, while I should have added a case for Py_GIL_DISABLED following the start of promote_and_get_info_and_ufuncimpl_with_locking.

Obviously not difficult to fix in principle, though seeing this, it may be better to think through how to make this safer generally. For instance, should PyArrayIdentityHash_GetItem not take care of this itself when the GIL is disabled? After all, it is the cache which carries the mutex, so it would seem logical to do the locking directly in the code that uses it. Looking at npy_hashtable, that would seem relatively simple: just have separate versions of *_GetItem and *_SetItem with and without the GIL. If that works, one could get rid of promote_and_get_ufuncimpl_with_locking.

I am happy to try that (though not before Thursday), but one problem is that I'm not sure how to write a good test case... @ngoldbaum, since you were the one who introduced the _with_locking variant, could you advise?

@seberg
Member

seberg commented Nov 17, 2025

I think we can just add the same locking code? For testing, I think this test covers it:

def test_parallel_ufunc_execution():

it just needs parametrization to also cover the scalar path probably.

As for the error-checking issue (not discussed here): you should be able to make the PyErr_Occurred() check conditional on the legacy wrapping; not sure how slow it actually is (i.e. whether that is worthwhile). I think the ufunc code may currently be doing it unconditionally.

May look at it later too; both seem like pretty straightforward follow-ups.
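A minimal sketch of such a parametrized stress test (the thread count and the function name hammer are assumptions, not the actual NumPy test):

```python
import threading
import numpy as np

def hammer(results, i, n=10_000):
    # Repeatedly hit the scalar fast path; under free-threaded CPython
    # this exercises the dispatch-cache read against concurrent writes.
    acc = 0.0
    for _ in range(n):
        acc = np.sin(1.5)
    results[i] = acc

results = [None] * 4
threads = [threading.Thread(target=hammer, args=(results, i))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert all(r == np.sin(1.5) for r in results)
```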

@mhvk
Contributor Author

mhvk commented Nov 17, 2025

@seberg - yes, agreed it is not too hard. The main question would seem to be whether to add it in my new fast path, or to move things more generally to where they are actually needed, i.e., locking in npy_hashtable itself, error checking inside the legacy loop wrapper. But having a quick local fix now and refactoring later would be a good solution too.

p.s. Unlikely to get to this myself until Thursday, so certainly happy if anyone else has a fix beforehand!

@seberg
Member

seberg commented Nov 17, 2025

error checking inside the legacy loop wrapper

I have a PR for that one, and no, that doesn't work, because you don't have the GIL there. The second one is probably less urgent, but let's see.

@mhvk
Contributor Author

mhvk commented Nov 17, 2025

error checking inside the legacy loop wrapper

I have a PR for that one, and no that doesn't work, because you don't have the GIL there.

Ah yes, I have now seen the PR. Since it apparently has very little cost, it is definitely easiest to go with a local check. (Conceptually, I'd still like to push it into the legacy loop wrappers even if it is fairly costly there, but I'm saying that without really knowing how much hassle it would be...)

@ngoldbaum
Member

@mhvk let me know if you want me to take a crack at this. I can probably fix this on funded time for Quansight's free-threading project.

@mhvk
Contributor Author

mhvk commented Nov 18, 2025

@ngoldbaum - I think it would be faster for you than for me, so definitely fine for you to do it. I'm very unlikely to have time myself before Friday.
