ENH: Add fast path in ufuncs for numerical scalars. #29819
Conversation
(force-pushed 9a7a45e → f7f5900)
The error is unrelated.
seberg left a comment
Yeah, also one of those long-standing things... I suppose this is rather simple; how much does it add to the simplest array paths?
This is of course (and has to be) specific to NumPy scalars and 1 in/1 out ufuncs.
While we need to do a few more defensive guesses here I think, which will slow it down slightly, it is simple enough that it seems worthwhile to just do this.
I am really curious how far we can get without a fully special path here, e.g. by trying to make creation of (simple) 0-D arrays very fast (without that, one would have to delay array creation, since channeling non-arrays into the iterator wouldn't be fun, unfortunately.)
As far as I could tell, negligible impact on arrays, because of the ...
And to python floats! For 2in/1out it is much less problematic since most of those are called indirectly, and operators can avoid the trouble.
I tried hard to think what could go wrong with what I did, bailing in the face of uncertainty. Let me know if I missed something!
Agreed, we should make array creation faster. But just doing ...
Yeah, not quite that much probably. One could have a free-list, which might be very fast. But we may have to check for a custom allocator being set, and that alone might actually be significant in this context (compared to what you have now).
Just the two promotion/resolving steps:
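(The timing snippet itself is not preserved above. As a rough, hedged illustration of what measuring "just the promotion/resolving steps" can look like from Python: ufunc.resolve_dtypes, public since NumPy 1.24, runs promotion and descriptor resolution without executing the inner loop. Numbers are machine-dependent and not the original measurement.)

```python
# Hedged illustration (not the original measurement): time only the
# promotion/descriptor-resolution step of a unary ufunc.  ufunc.resolve_dtypes
# (NumPy >= 1.24) performs promotion and resolves the descriptors without
# running the inner loop.
import timeit
import numpy as np

f8 = np.dtype(np.float64)
n = 100_000
t = timeit.timeit(lambda: np.sin.resolve_dtypes((f8, None)), number=n)
print(f"promotion + descriptor resolution: ~{t / n * 1e9:.0f} ns per call")
```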
(force-pushed f7f5900 → c3f0573)
OK, I pushed the changes so far. I can try how much extra time ...
(force-pushed c3f0573 → 00b71fb)
seberg left a comment
Thanks, a few smaller comments. I suppose we should rarely add overhead (mainly in functions that always promote currently).
But if we promote and go the slow path (i.e. scalar in), the added overhead will be overshadowed by a lot.
Still a bit curious about just using resolve_descriptors (maybe with a check whether it is the default one). But now with the DType check I think this is OK.
(I am also not 100% sure that there might not be some problem there.)
So overall, I suspect we should actually just do it, it's simple enough compared to the speedup.
This might actually propel us fairly close to the scalarmath binary operators speed-wise, but I guess those are unfortunately a bit more special...
numpy/_core/src/umath/ufunc_object.c (outdated)

```c
// Clear error, may just be because of wrong assumption.
PyErr_Clear();
```
Got to remove this. We really can't rely on this ever erroring. The reason this works is solely for the old-style ufuncs which have to do a second dance of loop lookup (which is a waste but can fail gracefully).
But... I assume that the DType check above will make this check unnecessary. (We are still gambling on resolve_descriptors, but that is a pretty small gamble for these dtypes).
OK, let me just return -1 and change the comment to say that this should not happen.
...
At least locally, all the tests pass.
(force-pushed 00b71fb → 9aee826)
mhvk left a comment
Thanks for the detailed review. All comments addressed, I think!
I did think about the binary operators, but, as you say, they already have a special path, and typically one will do ...
The allocation stuff will still help a lot with things like ...
(force-pushed 9aee826 → 9258649)
The Circle CI build failure is due to a timeout error.
Thanks, I think I'll try to add a check for ... I would take a <10% performance hit there to be on the safe side of things, and that might also unlock supporting a few more cases (but I am not suggesting to necessarily do that now).
@seberg - your comment just above suggested you still wanted to make a small change. What would that be? Or maybe OK to just go with what I have here?
This adds a fast check for numerical scalars, improving the speed by about a factor of 6, with essentially no cost for arrays. At the moment, just for single-input, single-output ufuncs, with the assumption that the output has the same type as the input (if that fails, the regular path is taken).
(force-pushed 9258649 → c47a5c6)
Sorry, forgot about this. I pushed a commit that adds it, it was a bit more unwieldy than I hoped. I think the "scalar" special path doesn't matter for unary ufuncs. (I did rebase main so unfortunately, it's just a pull.) For me, it pushes things from 77ns to 82ns for the ...
mhvk left a comment
@seberg - Thanks! But I'm still confused why we need resolve_descriptors when the goal is not to be general but just to make a common case much faster (i.e., it is fine to bail on signbit!)
I looked back at NEP 43 to try to understand why we would need resolve_descriptors and it states (omitting strings):
- In general, whenever an output DType is parametric the parameters have to be found (resolved).
- When they are not parametric, a default implementation is provided which fills in the default dtype instances (ensuring for example native byte order).
Neither would seem relevant here - we only accept standard python and numpy scalars, so are we not guaranteed to deal with a non-parametric dtype with native byte order?
I do see that my original check that the output had to have the same dtype as the input would fail for signbit, but I'm still confused how it could fail for, say, np.sin, i.e., how can resolve_descriptors be relevant?
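(A quick illustration of the distinction, my own example rather than one from the thread: for np.sin the resolved output descriptor matches the input, while np.signbit resolves to bool, which is exactly the case the same-dtype assumption has to bail on.)

```python
# Illustration: np.sin keeps the input dtype, np.signbit resolves to bool.
# A fast path that assumes "output dtype == input dtype" therefore works for
# np.sin but must fall back to the regular path for np.signbit.
import numpy as np

f8 = np.dtype(np.float64)
print(np.sin.resolve_dtypes((f8, None)))      # (dtype('float64'), dtype('float64'))
print(np.signbit.resolve_dtypes((f8, None)))  # (dtype('float64'), dtype('bool'))
```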
p.s. The core dumps seem related...
Ack, yeah, the crash is certainly related. Yeah, maybe it is rather unnecessary, it would make timedelta work (at least in principle) and would bring us closer to allowing more scalars, but that requires some restructure anyway. You are right it seems unnecessary, I don't love that there is no guarantee nobody writes a ...
I'd be happy to guard against that, as long as it is fast (maybe ...). But it also seems fine not to guard against the hypothetical, when it is so difficult to come up with an example where it might be problematic.
(force-pushed c47a5c6 → a2d99b7)
Yeah, let's try what you had then. I do think allowing a different output dtype makes sense (because it should be pretty much free), and I also think adding the resolve-dtypes call is likely good if just for silly safety. (Just one last note because I don't think I said it: unary ufuncs don't do special scalar promotions, which means the int/float handling here should be correct.)
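(A small illustration of that last note, my example rather than the PR's: for a unary ufunc the Python scalar just gets its default dtype, whereas mixed binary operations apply the special "weak" scalar promotion.)

```python
# Illustration: unary ufuncs simply use the scalar's default dtype; there is
# no special scalar promotion step, unlike mixed-dtype binary operations.
import numpy as np

print(np.negative(5).dtype)            # default integer dtype (int64 on most 64-bit platforms)
print(np.negative(5.0).dtype)          # float64
print((np.float32(1.0) + 5.0).dtype)   # float32 - the Python float promotes "weakly"
```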
I think this PR introduced a race condition that we see in JAX's tsan CI (https://github.com/jax-ml/jax/actions/runs/19400717531): ... I suspect that's enough for you to figure it out, but if you like I can make a better repro.
@hawkinsp - Ouch, that seems clearly a mistake here. Looking through your trace, the data race occurs when the new code reads the ufunc implementation cache, presumably for the case that the identity hash is not yet defined, as the race is with a ...

Obviously, not difficult in principle to fix, though seeing this, it may be better to think through how to make this safer generally. For instance, should ...

I am happy to try that (though not before Thursday), but one problem is that I'm not sure how to write a good test case... @ngoldbaum, since you were the one that introduced the ...
I think we can just add the same locking code? For testing, I think this test covers it: ... it just needs parametrization to also cover the scalar path, probably.

As for the error-checking issue (not discussed here): you should be able to make the ... May look at it later too; both seem like pretty straightforward follow-ups.
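(A hypothetical sketch of the kind of parametrization meant above - the actual test referenced is not shown here - running the same concurrent first-call pattern with both an array and a scalar argument so the new scalar path is exercised as well.)

```python
# Hypothetical sketch (not the actual test referenced above): call a unary
# ufunc from several threads at once, parametrized over array and scalar
# inputs so the scalar fast path is also covered on free-threaded builds.
import threading
import numpy as np
import pytest

@pytest.mark.parametrize(
    "arg", [np.arange(10.0), np.float64(0.5)], ids=["array", "scalar"]
)
def test_parallel_unary_ufunc_dispatch(arg):
    barrier = threading.Barrier(8)

    def call():
        barrier.wait()      # line all threads up so the first calls race
        np.positive(arg)

    threads = [threading.Thread(target=call) for _ in range(8)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```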
@seberg - yes, agreed it is not too hard. Main question would seem to be whether to add it in my new fast path, or whether to move things more generally to where it is actually needed, i.e., locking in ...

p.s. Unlikely to get to this myself until Thursday, so certainly happy if anyone else has a fix beforehand!
I have a PR for that one, and no that doesn't work, because you don't have the GIL there. The second one is less urgent probably, but let's see.
Ah, yes, I now saw the PR. Since it apparently has very little cost, definitely the easiest to go with a local check. (Conceptually, I'd still like to push it into the legacy loop wrappers even if it is fairly costly there, but am saying that without really knowing how much hassle it would be...)
@mhvk let me know if you want me to take a crack at this. I can probably fix this on funded time for Quansight's free-threading project.
@ngoldbaum - I think it would be faster for you than for me, so definitely fine for you to do it. I'm very unlikely to have time myself before Friday.
It has always been a bit annoying that math.sin is so much faster than np.sin on a scalar. This PR adds a fast check for numerical scalars, improving the speed by about a factor of 5--6, with essentially no cost for arrays. At the moment, just for single-input, single-output ufuncs, with only the input passed in. The trial assumes that the output has the same type as the input (if that fails, the regular path is taken).
Sample speed test:
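(The original benchmark snippet is not preserved above; the following is a hedged reconstruction of the kind of comparison described - math.sin versus np.sin on a scalar - with machine-dependent output rather than the numbers from the PR.)

```python
# Hedged reconstruction of the comparison described in this PR (not the
# original snippet): time np.sin on a NumPy scalar against math.sin.
import math
import timeit
import numpy as np

x = np.float64(0.5)
n = 100_000

t_np = timeit.timeit(lambda: np.sin(x), number=n) / n
t_math = timeit.timeit(lambda: math.sin(0.5), number=n) / n
print(f"np.sin(np.float64(0.5)): {t_np * 1e9:.0f} ns per call")
print(f"math.sin(0.5):           {t_math * 1e9:.0f} ns per call")
```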
Note that I did some trials to find out where most time (690 ns on my laptop when plugged in) is spent in the regular path (tested by simply returning the input argument at various places):
- convert_ufunc_arguments
- promote_and_get_ufuncimpl
- resolve_descriptors
- PyUFunc_GenericFunctionInternal

So, about 40% of the time is spent in convert_ufunc_arguments, mostly in turning the scalars into arrays - PyArray_FromAny perhaps could be sped up a bit (time needed is similar to that of np.array(scalar)), though I think the time is dominated by actually creating an array structure in the first place.

But there would seem to be some space for improvements at the 10% level also for regular arrays by, e.g., not creating the full_args.in tuple until it is actually needed (for overrides). There may also be little bits to gain in PyUFunc_GenericFunctionInternal.