
Conversation

@mhvk (Contributor) commented Oct 10, 2025

While trying to store data on the instance, I found a few places where code needlessly relies on details of memory allocation. Those cases are worth fixing even though I'm considering a better scheme that would avoid the problems entirely, so I thought I might as well put them in a separate PR.

For ease of reference, the three commits are:

MAINT: Avoid "fooling deallocator" in array_descr_set
Found this caused problems while trying to implement storing the
dimensions and strides on the instance. Splitting it out since it is
generally bad form to assume too much about the inner workings of, e.g.,
the deallocator, and it has no real benefits beyond trying to be clever.

MAINT: Use PyArray_Resize to grow arrays in FromFile and FromIter
Found this while trying to implement storage on the instance, but
factored it out since it seems better form to have just one place
in numpy where arrays are resized in place (which ideally would not
happen at all, but here it is quite logical).

For PyArray_FromFile, there is little benefit beyond that, but in
PyArray_FromIter, it allows removing quite a bit of code that
duplicated what PyArray_Resize does.

MAINT: Ensure PyArray_Resize replaces empty allocation with 1 byte.
Minimum allocation should be 1, not elsize. Also use the default
function to deallocate dimension & strides.
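
For context, the "fooling" that the first commit removes is a pattern like the following (a schematic reconstruction based on the snippet discussed below, not the exact code): before dropping the temporary array, its fields are cleared so that the deallocator will not free the dimensions buffer that has been handed over to another array.

    /* Schematic of the anti-pattern: clear temp's fields so that
     * array_dealloc skips freeing the dimensions/strides buffers,
     * which have been transplanted onto another array.  This only
     * keeps working as long as the deallocator's internals never
     * change. */
    ((PyArrayObject_fields *)temp)->nd = 0;
    ((PyArrayObject_fields *)temp)->dimensions = NULL;
    Py_DECREF(temp);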

@seberg (Member) left a comment


Wondering slightly if a more raw thing might be helpful, but not sure; then you still need to update the shape correctly. Although... all of these functions that reasonably resize clearly have to update the shape all the time anyway?

(But otherwise the None return is a thorn, although I am not strictly sure it's worth removing...)

One thing: I think you should just use dims[NPY_MAXDIMS] at least for any of the temporary ones just to simplify the code...
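
A minimal sketch of that suggestion for the 1-d case (illustrative only; the names elcount and new_dims are taken from the diffs further down):

    /* Sketch: a fixed-size scratch shape avoids a heap allocation and
     * the corresponding allocation-error check for temporaries;
     * NPY_MAXDIMS bounds the dimensionality of any ndarray. */
    npy_intp dims[NPY_MAXDIMS];
    PyArray_Dims new_dims = {dims, 1};
    dims[0] = elcount;  /* desired new length of the 1-d array */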

@mhvk mhvk force-pushed the array-avoid-assumptions branch from db194f6 to 545a8e5 on October 12, 2025 02:10
@mhvk (Contributor Author) left a comment


@seberg - thanks for the comments! With your suggestion of just using a fixed dims[NPY_MAXDIMS] the code is indeed clearer (and more correct, in not forgetting to check for allocation errors...).

Note that I found another culprit, the textreader (which only uses UserRENEW, not UserNEW, hence an earlier grep failed to find it...). Here the use of PyArray_Resize is even nicer, since the equivalent of dims is already available in the form of result_shape.

        goto error;
    }
    if (needs_init) {
        memset(PyArray_BYTES(data_array), 0, PyArray_NBYTES(data_array));
@mhvk (Contributor Author):


There is no need to initialize the memory here even if the dtype needs it - that is already done by PyArray_SimpleNewFromDescr.

    data_ptr = PyArray_DATA(data_array) + row_count * row_size;
    data_allocated_rows = new_rows;
    if (needs_init) {
        memset(data_ptr, '\0', (new_rows - row_count) * row_size);
@mhvk (Contributor Author):


PyArray_Resize also takes care of initialization. Nice number of lines saved!
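
For reference, a sketch of the resulting pattern, assuming the grow step from this reader (data_array and new_rows appear in the snippets above; dims and new_dims as in the earlier sketch; the Py_DECREF of the None return is the thorn mentioned earlier):

    /* Sketch: grow in place via PyArray_Resize instead of
     * PyDataMem_UserRENEW plus a manual memset.  On success it
     * returns a new reference to None, which must be released, and
     * it fills the newly grown part as np.zeros would. */
    dims[0] = new_rows;
    PyObject *res = PyArray_Resize(data_array, &new_dims, 0, NPY_CORDER);
    if (res == NULL) {
        goto error;
    }
    Py_DECREF(res);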

@mhvk mhvk force-pushed the array-avoid-assumptions branch from 545a8e5 to 3b81728 on October 12, 2025 02:19
@seberg (Member) left a comment


OK, took a bit closer look; I am happy with this overall. A few comments (but I can make a pass tomorrow or so).

(Would be curious whether this is noticeable speed-wise, but I may try myself also.)

-    new_data = PyDataMem_UserRENEW(
-            PyArray_BYTES(ret), nbytes, PyArray_HANDLER(ret));
+    dims[0] = elcount;
+    res = PyArray_Resize(ret, &new_dims, 0, 0);
@seberg (Member):


Maybe we should put NPY_CORDER as the last argument, for clarity.

@mhvk (Contributor Author):


Now that I've created the internal version, I just omitted that entry - it was unused.

        elcount = (i >> 1) + (i < 4 ? 4 : 2) + i;
        npy_intp nbytes = 0;
        PyObject *res = NULL;
        if (!npy_mul_sizes_with_overflow(&nbytes, elcount, elsize)) {
@seberg (Member):


This mul_sizes_with_overflow protects against overflows (mainly on 32-bit systems).
PyArray_Resize() should already do that (I dunno if I trust it to, though).

So this branch should be unnecessary now (which also cleans up the slightly weird error logic).

@mhvk (Contributor Author):


Good catch, indeed it does check. And I think we should trust it...

    /* Fool deallocator not to delete these */
    ((PyArrayObject_fields *)temp)->nd = 0;
    ((PyArrayObject_fields *)temp)->dimensions = NULL;
    Py_INCREF(newtype);
@seberg (Member):


I guess we still mess with it in one error path, but yeah, I don't mind doing this slightly longer dance here.

@mhvk (Contributor Author):


Agreed it is still not perfect, but it doesn't fit resize, and here I just wanted not to mess with temp.

@mhvk (Contributor Author) commented Oct 15, 2025


I've got a PR halfway done that changes setstate to basically deallocate everything but the object itself (using that part of array_dealloc) and hopefully then allocate using the relevant part of PyArray_FromDescr_int.

For here, though, I'd suggest leaving it at this -- at least the deallocation of temp is now not messed with.
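
For reference, a hypothetical sketch of that slightly longer dance (PyDimMem_NEW is the public dimension allocator; the actual commit may use numpy's internal dimension cache instead): self gets its own copy of temp's shape and strides, and temp is then released without any trickery.

    /* Hypothetical sketch: copy temp's shape/strides into a buffer
     * owned by self instead of stealing temp's buffers and NULLing
     * its fields.  (Freeing self's previous buffer is elided.) */
    PyArrayObject_fields *fa = (PyArrayObject_fields *)self;
    int nd = PyArray_NDIM(temp);
    npy_intp *newdims = PyDimMem_NEW(2 * nd);  /* dims, then strides */
    if (newdims == NULL) {
        Py_DECREF(temp);
        return -1;
    }
    memcpy(newdims, PyArray_DIMS(temp), nd * sizeof(npy_intp));
    memcpy(newdims + nd, PyArray_STRIDES(temp), nd * sizeof(npy_intp));
    fa->dimensions = newdims;
    fa->strides = newdims + nd;
    Py_DECREF(temp);  /* deallocates normally, frees its own buffers */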

     }
     new_data = PyDataMem_UserRENEW(PyArray_DATA(self),
-                                   newnbytes == 0 ? elsize : newnbytes,
+                                   newnbytes == 0 ? 1 : newnbytes,
@seberg (Member):


Nice catch, was changed everywhere else!

@mhvk mhvk force-pushed the array-avoid-assumptions branch from 3b81728 to e259933 on October 12, 2025 21:42
@mhvk (Contributor Author) left a comment


Thanks for the detailed review! The multiple times I forgot to decref the None returned by PyArray_Resize convinced me that we should, after all, have an internal version...
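
A minimal sketch of what such an internal version can look like (the name resize_inplace is made up; the helper actually added in the PR may differ):

    /* Sketch: same semantics as PyArray_Resize, but with an int
     * status return, so callers cannot leak the None return. */
    static int
    resize_inplace(PyArrayObject *arr, PyArray_Dims *newshape)
    {
        PyObject *res = PyArray_Resize(arr, newshape, 0, NPY_CORDER);
        if (res == NULL) {
            return -1;
        }
        Py_DECREF(res);  /* PyArray_Resize returns a new reference to None */
        return 0;
    }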


@seberg (Member) commented Oct 15, 2025

Looking at the freeing bug, there is one fun little difference here for objects. Unlike Resize, the old code always just NULLed everything for initialization, which means shrinking doesn't need to clean up here.
But Resize actually fills as if you called np.zeros, which is a bit weird. Dunno if that should worry us; if we have objects, we probably don't need to worry much about the speed of that.
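
To make the difference concrete, a hypothetical illustration (zero_object_buffer is a made-up name): np.zeros-style filling of an object buffer stores a real reference in every slot, which cleanup code must later release, whereas NULL slots can simply be skipped.

    /* Hypothetical illustration: np.zeros-style initialization of an
     * object buffer inserts a reference (Py_None in the discussion
     * above) into every slot, unlike leaving the slots NULL. */
    static void
    zero_object_buffer(PyObject **items, npy_intp n)
    {
        for (npy_intp i = 0; i < n; i++) {
            Py_INCREF(Py_None);
            items[i] = Py_None;
        }
    }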

@mhvk (Contributor Author) commented Oct 15, 2025

Hmm, I actually think it is better to be safe. But really, what happens should be the same between construction of an array and resizing. I saw that PyArray_FromDescr_int has a whole dance between calloc and fill_zero_info.func, which seems very similar to what PyArray_ZeroContiguousBuffer does (and maybe could be factored out?). So it seems reasonable to do the same for resizing of arrays in FromIter and FromFile.

@seberg (Member) commented Oct 15, 2025

Hmm, I actually think it is better to be safe.

I don't mind cutting this corner (since we size it to exactly what we wrote, everything else would be horribly wrong in more than one way). But... it is maybe not worthwhile avoiding this (at least for shrinking). FWIW, for growing, I wouldn't mind being able to just init rather than zero fill.

PyArray_FromDescr_int could maybe use the PyArray_ZeroContiguousBuffer helper. But it has to do parts of the same setup to be able to decide whether it can use calloc or not.

@mhvk (Contributor Author) commented Nov 11, 2025

@seberg - was looking at this PR and trying to see what needs to be done. I think we were down to wondering whether or not objects should be initialized as NULL or None - my sense for now is that this only matters for speed in a corner case where speed is not very important, so maybe keep it simple here? If we ever want to, we can rewrite resize to initialize objects to NULL instead of None...

@seberg (Member) commented Nov 11, 2025

Yes, agreed. I tried timing briefly, reading a CSV file with:

In [3]: with open("/tmp/asdf.csv", "w") as f:
   ...:     f.write("1\n" * 1000000)
   ...: 

In [4]: %timeit np.loadtxt("/tmp/asdf.csv", dtype=object)

This went from 26 ms to 29 ms on my computer (hot file cache), and from 46 ms to 37 ms if I use dtype="O,i", delimiter="," and a two-column file.

That isn't nothing, but also not a crazy slowdown, I guess. So yeah, I am fine with just doing this. I suppose conceptually, I still think these internal uses are a bit different from Python side resizing (where a np.zeros style API is also weird, but full initialization makes more sense).

mhvk added 4 commits November 12, 2025 12:36
@mhvk mhvk force-pushed the array-avoid-assumptions branch from e259933 to 9b8027f on November 12, 2025 17:55
@mhvk (Contributor Author) commented Nov 12, 2025

I think the reading slowdowns are probably at an acceptable level, especially as they only affect object dtypes, which are not very common. I also think we can solve them if we allow passing something in to tell PyArray_ZeroContiguousBuffer whether zeroing an object means writing NULL or inserting Py_None. Indeed, that might well help code elsewhere...

@mhvk (Contributor Author) commented Nov 12, 2025

P.S. Rebased to get rid of the merge conflict.

@seberg (Member) commented Nov 12, 2025

I also think we can solve them if we allow passing something in to tell PyArray_ZeroContiguousBuffer whether zeroing an object means writing NULL, or inserting Py_None. Indeed, that might well help code elsewhere...

An additional helper PyArray_InitContiguousBufferIfNeeded() may be useful, but these seem like two very different concepts. ZeroContiguousBuffer is explicitly for np.zeros(), which isn't used in many places, so I don't think it would be helpful to mix up the two.

But yeah, I am tempted to add a flag to the _int version (call it "unsafe" if you like) that causes only init-if-needed for growing and maybe also promises to only shrink unused space.
Basically, because any sane code where this isn't OK is buggy anyway (i.e. file readers/fromiter).

Anyway, we can just put it in when this passes; I may follow up.

@mhvk mhvk merged commit 60878d2 into numpy:main Nov 12, 2025
77 checks passed
@mhvk (Contributor Author) commented Nov 12, 2025

Agreed with the "init_if_needed" idea. And also with doing it in a follow-up 😺 (tests passed, so I went ahead and merged...)

@mhvk mhvk deleted the array-avoid-assumptions branch November 12, 2025 19:20
@seberg (Member) commented Nov 13, 2025

Thanks for merging, sorry for forgetting.

