ENH, PERF: allocate memory as part of the array object for scalars #29878
Conversation
The overhead of allocating memory for array data becomes large when the inputs are scalars. So for those this PR allocates memory as part of the object using `tp_alloc`. (This solves the same problem #29876 aimed to solve, but arguably in a much nicer manner.)

EDIT: quite a bit of the changes are just changes in indentation, so you may want to review ignoring whitespace, via https://github.com/numpy/numpy/pull/29878/files?w=1

Timings:

```
%timeit np.empty(())   112 -> 67 ns
a = 1.6
%timeit np.array(a)    134 -> 89 ns
%timeit np.add(a, a)   527 -> 413 ns (390 ns for np.float64)
```

In principle, this could be trivially extended to doing this generally when the required amount of memory is small.

Notes/queries:

- The above requires setting `tp_itemsize=1` for the array object, which means that at least 1 extra byte always gets allocated [1]. I verified that with the current struct, this does not increase the size of the object (it remains 96 bytes). Alignment should be OK, since the last item on the current struct is a pointer.
- For `__setstate__`, it might be an idea to add a flag that tells whether the data is on the object; that avoids comparing pointers. Logically, one might want to use `mem_handler == NULL` for that, but apparently that is already assumed to mean the data was gotten with `malloc`. Maybe `mem_handler = Py_None` is reasonable?

[1] https://github.com/python/cpython/blob/37d16f7f620d79af7db1c00c8d638122af889bf9/Objects/typeobject.c#L2490-L2497
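As an illustration of the mechanism (assumed shape, not the PR's actual code; the helper name is made up): with `tp_itemsize == 1`, `tp_alloc(subtype, n)` hands back `tp_basicsize + n` zeroed bytes, so the array data can live directly behind the fixed-size struct.

```c
#include <Python.h>
#include <numpy/arrayobject.h>

/* Minimal sketch only: allocate the object plus nbytes of trailing data
 * through tp_alloc and point the data pointer at the trailing bytes. */
static PyArrayObject_fields *
alloc_array_with_inline_data(PyTypeObject *subtype, Py_ssize_t nbytes)
{
    PyArrayObject_fields *fa =
        (PyArrayObject_fields *)subtype->tp_alloc(subtype, nbytes);
    if (fa == NULL) {
        return NULL;
    }
    /* Point the data pointer at the extra bytes trailing the fixed struct. */
    fa->data = (char *)fa + subtype->tp_basicsize;
    return fa;
}
```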
Force-pushed from 78f87ed to 70d058c.
seberg left a comment:
Just a comment, the speedup does look promising. Unfortunately, I think we should check PyDataMem_GetHandler for being the default one (which will add 10ns or so).
Just commenting for now; we may need to iterate a bit, but it's a nice PoC!
```c
    PyVarObject_HEAD_INIT(NULL, 0)
    .tp_name = "numpy.ndarray",
    .tp_basicsize = sizeof(PyArrayObject_fields),
    .tp_itemsize = 1,
```
Unfortunately, I think this might break subclasses if they were written in C. But I do suspect if we have a basetype, we can just manually add similar logic.
(Also, I don't think elements have alignment guarantees, so padding may be necessary.)
Do you mean that they might have used tp_itemsize differently? (note that I changed things so that I don't rely anymore on PyArray_Type.tp_basicsize, but now properly take Py_TYPE(self)->tp_basicsize)
Yeah, I suppose a custom tp_itemsize may be the only issue, but it may not be checked, so if we test for no subtype we would be safe also?
It's a bit awkward, because we do not actually have a VarObject (a var object would have an ob_size field). The fact that the above has PyVarObject_HEAD_INIT is just wrong; it is setting the *data field or so to 0, which ends up working out.
But while we can't insert an ob_size field, that field is also irrelevant except at allocation time, I think (i.e., Python will write garbage there at allocation, but we just overwrite it).
We have broken some C classes before when increasing the size of the struct, so that might just happen again, I guess.
(Part of me is tempted to just call PyObject_Malloc() directly; we have enough code that does that, but I am not sure it is ideal -- especially on free-threaded builds, which seem to use different allocation heaps for different types of objects. But it would side-step most of these worries.)
```diff
-    if ((fa->flags & NPY_ARRAY_OWNDATA) && fa->data) {
+    /* Deallocate data, if present and not a small allocation in the object */
+    if ((fa->flags & NPY_ARRAY_OWNDATA) && fa->data != NULL
+            && fa->data != (void *)((char *)fa + PyArray_Type.tp_basicsize)) {
```
We could add a (private) flag as well, although maybe this check is nice enough.
Yes, I realized the same. How about setting mem_handler = Py_None? (cannot use NULL as that is interpreted as meaning that data might have been gotten with malloc... see around the changes to array_dealloc).
Hmm, mem_handler = Py_None is probably asking for problems, as surely there is code that just acts on it being non-NULL. One would need a no-op function or so. But that becomes more complicated than just having a new flag...
So, right now I went for a semi-private _PyArray_DATA_ON_INSTANCE in-line function in ndarraytypes.h - that way we can always decide later whether or not to define a flag, etc.
I'm trying to understand why it might be a problem if a scalar is on the object itself (mostly because numpy scalars are stored on their own object anyway...). I was thinking that the allocator would be mostly used for very large arrays. But might it be needed in case someone, e.g., expects all data to be allocated in GPU memory or so? But wouldn't they then want the whole object there anyway? Shall I perhaps send an e-mail to numpy-dev to ask? It would be silly to guard against something that is actually a benefit also for those with their own memory allocator.
p.s. Am confused why the tests all fail here, when on my machine they pass.
Yes, precisely; someone might be asking for an alignment of 512 bytes or so, and any new array allocated within a context and returned to the user should be expected to have this.
To be fair, a lot of tests are passing here also :). Maybe try running with
Now for the actual issue: OK, at least I can reproduce it; it has to do with being on python 3.11. It is also clear it has something to do with subclassing.
Hmm, this is trickier than I hoped. Under python 3.11, I can get rid of the crashes by starting

Under python 3.13, this all works fine. I'll push the version with the change anyway, since it would seem to be correct, but it is unclear if I really can proceed like this...
Force-pushed from 76f01b1 to 43f1b29.
Standard tp_alloc worked on python 3.13, but not python 3.11 for subclassing. Could not figure out what the problem was. Possibly needed a PyObject_VAR_HEAD, but that also gave problems (though at least no segfaults).
Force-pushed from 43f1b29 to 93f2c69.
OK, I ended up just doing the allocation directly, instead of via `tp_alloc`.

p.s. Still need to do the check for the default memory handler.
seberg left a comment:
I am fine with this as such, but I would like to have slightly more confidence about bare PyObject_Malloc use here.
There is also an interesting question of whether this whole thing is just much faster on Python 3.14 or so due to the use of mimalloc (or is that only free-threading)?
And in the end, I also have to think about arr.resize calls, which might be weird, but we can likely figure something out there (although we need a release note; I wouldn't be surprised if someone is insane enough to replace NumPy array allocations under our feet...).
numpy/_core/src/multiarray/ctors.c (outdated):

```c
     * calculate full nbytes up front and just use that here?
     *
     * Note: alignment constraint written such that compiler can just
     * show it is true.
```
Makes sense, but since this is a static check, I think it should be a static_assert (otherwise you won't notice if it starts failing). Might also use NPY_ALIGNOF(npy_clongdouble), not sure that we can actually trust sizeof(npy_intp) to be a maximally aligned type.
Ah, yes, I was wondering how best to do that, thanks!
Actually, not sure about NPY_ALIGNOF(npy_clongdouble) - if that can be more than sizeof(npy_intp), then I'm not sure we can count on PyObject_Malloc to allocate to that level (i.e., our basic size may be 96 and thus divisible by 16, but malloc may only care about alignment of 8?). If that's a problem, it might have to be a check on data itself...
But probably it is not a problem. How about:

```c
static_assert((sizeof(PyArrayObject_fields)
               & (sizeof(npy_intp) - 1)) == 0);
static_assert(NPY_ALIGNOF(npy_clongdouble) <= sizeof(npy_intp));
```
Hmm, `static_assert(NPY_ALIGNOF(npy_clongdouble) <= sizeof(npy_intp));` fails.
I wonder, though, whether this is a problem - we do not seem to allocate our data at that alignment either (or at least I don't see it in PyDataMem_UserNEW, which seems to just use malloc under the hood).
Any malloc implementation (including Python's) guarantees at least the maximum alignment of any type on the system (Not necessarily the maximum of any instruction, but that is a different matter).
If that assertion fails, that is OK. Should have just written the previous one with NPY_ALIGNOF(npy_clongdouble) instead (might as well use % == 0 since speed is irrelevant).
If that also fails, then that is also OK! Just introduce a helper struct to clarify this. Maybe you prefer that either way even...
```c
struct {
    PyArrayObject_fields arr;
    npy_clongdouble data[];
} _sizeof_arr_plus_data_struct;
```

then you can skip the assertions and use the sizeof of that.
(If we want even more alignment, I would probably just do the math manually, but I suspect the struct thing is enough. I.e., on most systems the malloc functions probably all give 16-byte alignment, which is probably larger than the maximum alignment of any type -- it might make sense to just honor that.)
EDIT: Sorry, alignof longdouble is 16, so everything adds up.
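For what it is worth, "doing the math manually" could look something like the following (the constant name is made up; assumes the PyArrayObject_fields definition from ndarraytypes.h):

```c
/* Round the fixed struct size up to a 16-byte boundary before placing the
 * inline data (16 matching the alignof(long double) mentioned above). */
#define INLINE_DATA_ALIGNMENT 16
static const size_t inline_data_offset =
    (sizeof(PyArrayObject_fields) + (INLINE_DATA_ALIGNMENT - 1))
    & ~(size_t)(INLINE_DATA_ALIGNMENT - 1);
```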
Hmm, funnily enough, the python documentation suggests something similar for variable-size objects (see the last example in https://docs.python.org/3/c-api/typeobj.html#examples):

```c
typedef struct {
    PyObject_VAR_HEAD
    const char *data[1];
} MyObject;

static PyTypeObject MyObject_Type = {
    PyVarObject_HEAD_INIT(NULL, 0)
    .tp_name = "mymod.MyObject",
    .tp_basicsize = sizeof(MyObject) - sizeof(char *),
    .tp_itemsize = sizeof(char *),
};
```
Notice the tp_basicsize calculation. I quite like this. Do you think we could do the same here, i.e., just add npy_clongdouble data[] to the end of the regular PyArrayObject_fields struct, but remove it from tp_basicsize?
I ask in part since if the problems I had before with subclasses giving errors on python 3.11 are solved on python 3.12, then perhaps we can go for allocating extra stuff using the regular allocator -- and either work around python 3.11 or count on support for it being dropped very soon (2025 quarter 4 according to SPEC 0, i.e., basically now).
Yes, we could. But no need to "remove" anything; the only reason for the removal there is that it uses data[1] rather than data[]. And the only reason for that is that the data[] notation is C99 (and I am not sure which C++ versions support it).
Hopefully C++17 does, because otherwise we may have to work around things. But we do use data[] a lot in NumPy; I am just not sure that it is in public headers yet (and this has to be public).
EDIT: Actually, I am not 100% sure about it needing to be public, but I suspect it does (although we can say it is private, of course).
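For concreteness, the size-1 variant of that pattern applied to an array type could look roughly like this (illustrative only: the wrapper struct and type name are made up, and the PR instead adds the member, named `_storage`, directly to `PyArrayObject_fields`):

```c
/* Illustration only; assumes Python.h and the numpy headers. */
typedef struct {
    PyArrayObject_fields fields;
    npy_clongdouble _storage[1];   /* data[1] rather than data[], as discussed */
} example_array_with_storage;

static PyTypeObject Example_ArrayType = {
    PyVarObject_HEAD_INIT(NULL, 0)
    .tp_name = "example.ndarray_with_storage",
    /* do not count the size-1 trailing member in the basic size */
    .tp_basicsize = sizeof(example_array_with_storage) - sizeof(npy_clongdouble),
    .tp_itemsize = 1,
};
```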
Sorry, about Python versions. Without thinking it through: Yes, I can see doing such tricks only on Python versions that properly support it.
I may try - the nice thing about having some npy_clongdouble extra[] is that the test for whether the data is stored on the instance becomes a more obvious one: self->data == &self->extra (would need to think of a better name...).
I think it is OK to note in the header file that this field is private and only exists if data points to it (or something like that). A name starting with an underscore would probably be wise...
OK, I went ahead and added _storage, with an explanation. Also added a _PyArray_DATA_ON_INSTANCE function for checking whether that is the case. It makes the other code changes more obvious. Not sure it should be in ndarraytypes.h, but didn't know where else to put it (and there are other private ones in there).
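A plausible shape for that helper, based on the `self->data == &self->extra` check described above (sketch only; the PR's actual definition in ndarraytypes.h may differ):

```c
/* Sketch: the data is "on the instance" exactly when the data pointer
 * points at the trailing _storage member added in this PR. */
static inline int
_PyArray_DATA_ON_INSTANCE(PyArrayObject *arr)
{
    PyArrayObject_fields *fa = (PyArrayObject_fields *)arr;
    return fa->data == (char *)fa->_storage;
}
```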
```c
    }
    memset(alloc, '\0', size);
    PyObject_Init((PyObject *)alloc, subtype);
    fa = (PyArrayObject_fields *)alloc;
```
If we don't use the ndim == 0 check anymore here, then I wonder if we shouldn't always use this, at least for the shape/dims allocation, and potentially for all very small arrays.
@ngoldbaum I wonder if you know how bad it is to use PyObject_Malloc directly. I am sure it works (numpy has a lot of weird code that does it), but it is very unclear to me if there are downsides.
Yes, I was wondering about storing the strides on the object as well, even if just to keep the actual values near the object in memory.
Note that this code is just a condensed version of PyType_GenericAlloc -- it uses PyObject_Malloc too.
(Note that in an earlier version, I wrote a separate array_alloc. If we go that route, we can of course do what we want, but may need an array_free as well. This seemed less invasive...)
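For orientation, a condensed PyType_GenericAlloc-style path along the lines discussed here might look like this (the helper name and details are assumptions, not the PR's code; heap-type refcounting and GC tracking are ignored, which is fine for the base ndarray type):

```c
/* Assumes <string.h>, Python.h and the numpy headers. */
static PyArrayObject_fields *
array_alloc_inline(PyTypeObject *subtype, size_t extra_nbytes)
{
    size_t size = (size_t)subtype->tp_basicsize + extra_nbytes;
    char *alloc = (char *)PyObject_Malloc(size);
    if (alloc == NULL) {
        PyErr_NoMemory();
        return NULL;
    }
    memset(alloc, '\0', size);
    PyObject_Init((PyObject *)alloc, subtype);   /* sets ob_type and refcount */

    PyArrayObject_fields *fa = (PyArrayObject_fields *)alloc;
    fa->data = alloc + subtype->tp_basicsize;    /* data lives on the instance */
    return fa;
}
```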
Hmmm, that may be reasonable if I think about it (a custom alloc). I don't think that means you need a custom free at all; it means it is inherited just fine (the only problem would be if someone added an itemsize, but I think I might gamble on that).
```diff
-    fa = (PyArrayObject_fields *) subtype->tp_alloc(subtype, 0);
+    npy_bool data_on_instance = NPY_FALSE;
+    if (subtype == &PyArray_Type) {
```
In principle one could add a check for tp_basicsize == our size and tp_alloc == PyType_GenericAlloc (with the alignment logic but without the tp_basicsize check). But in the end, this is mainly about scalars, so subclasses do not matter.
Let me add a note to the already long comment... We'd probably want tp_itemsize=0 too...
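The check being suggested could be captured in a small helper along these lines (hypothetical name; not the PR's code):

```c
/* Only take the inline-data fast path for the base ndarray type, or for
 * subtypes that customized neither size nor allocation (sketch only). */
static inline int
can_store_data_on_instance(PyTypeObject *subtype)
{
    if (subtype == &PyArray_Type) {
        return 1;
    }
    return (subtype->tp_basicsize == PyArray_Type.tp_basicsize
            && subtype->tp_itemsize == 0
            && subtype->tp_alloc == PyType_GenericAlloc);
}
```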
As mentioned in-line, that use comes straight from PyType_GenericAlloc.
I'll see if I can get 3.14rc3 to install...
Oops, I thought I had looked carefully for other uses of the memory handler, but missed it because I was looking for
OK, I now also added the check for the default memory handler.

Separately, you mentioned somewhere (I think) that the actual allocation functions might be sped up a bit. Were you thinking about the fact that the handler capsule is opened every time? It would indeed seem worthwhile to check whether it is the default handler and then avoid that pointer chasing...
Right. I had a start here with a slightly heavy-handed approach to add a custom one.

It's not much; I was surprised to see that an attempt to avoid any allocation for scalar ufunc inputs made only a <10% difference, but then profiling isn't quite as simple as that. (I should time this; I wonder if I would barely notice it on the M1 in a simple benchmark.)
I think one could be faster by just defining a little in-line function that has a shortcut, and then using that in all the relevant places, which would be nicer regardless.

p.s. Am slightly sad that the handler isn't just supposed to take care of the tracking as well, i.e., that the handler itself contains pointers to the tracking functions.
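A sketch of that shortcut idea (hypothetical helper; it deliberately ignores the allocation tracking mentioned in the p.s., which a real version would still need to handle):

```c
/* Assumes <stdlib.h>, Python.h and the numpy headers.  PyDataMem_DefaultHandler
 * and PyDataMem_UserNEW are the existing memory-handler API. */
static inline void *
npy_data_alloc_shortcut(size_t nbytes, PyObject *mem_handler)
{
    if (mem_handler == PyDataMem_DefaultHandler) {
        return malloc(nbytes);                        /* common, default path */
    }
    return PyDataMem_UserNEW(nbytes, mem_handler);    /* general, handler-aware path */
}
```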
Yes, that was my first thought also; then I realized that the
Yeah, but I also think it probably made sense to not make it very hard to support tracing in a custom allocator. I.e., it does seem like an orthogonal option to me, one that maybe we can even do with a global.
Force-pushed from 3051737 to a3c5541.
Hmm, MSVC does not like the zero-sized array, unfortunately. I went for the size-1 version, where that member is not included in tp_basicsize.
I checked, and with python 3.12, defining an

So, perhaps what's here is better. Do you think I should go for a separate
I now tested, and on python 3.12, things work properly using

I'm also still wondering about creating our own

Anyway, let me know how far you think we should pursue this... It is still a very nice performance gain (though for the ufuncs, #29819 remains a much better solution...).
Yes, this is effectively impossible, unfortunately. (You would have to force everyone to recompile and fix a lot of downstream projects that don't just use our headers directly. So yeah, that is completely impossible; doing it would need at least a year-long preparation.)
I'll try discussing it in the meeting tonight and see if others have opinions. I think I need to let it sink in a bit. It may be ugly enough that I am not sure it's worthwhile unless it is a large speedup, but it did seem pretty large. And I could also see that if we do this, then merging the shape/strides is actually more worthwhile than targeting scalars.
OK, I'll abandon any thought about

I will think a bit about incorporating strides. It would be trivial if we could guarantee they would never change, i.e., in the absence of the ability to set

Another obvious extension would be for small arrays (say, less than

I'm not sure I can make the meeting (not sure I even have the e-mail still), but I think it is worth mentioning that numpy scalar to array conversion happens a lot, and small arrays are common, so speeding up those paths is worthwhile.

Arguably, a larger question is whether it makes sense to move towards a future where everything is part of the instance.
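To make the small-array idea concrete, a size check could look something like this (the helper name and the 32-byte cutoff are made up for illustration):

```c
/* Hypothetical: is the total data small enough to store on the instance?
 * (Overflow handling omitted; sketch only.) */
#define EXAMPLE_INLINE_DATA_MAX 32

static inline int
data_fits_on_instance(int ndim, npy_intp const *dims, npy_intp itemsize)
{
    npy_intp nbytes = itemsize;
    for (int i = 0; i < ndim; i++) {
        nbytes *= dims[i];
    }
    return (nbytes >= 0 && nbytes <= EXAMPLE_INLINE_DATA_MAX);
}
```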