Skip to content

BUG: Remove unnecessary copying and casting from out array in choose#28206

Merged
seberg merged 29 commits intonumpy:mainfrom
MaanasArora:bug/choose-unnecessary-cast
Feb 26, 2025
Merged

BUG: Remove unnecessary copying and casting from out array in choose#28206
seberg merged 29 commits intonumpy:mainfrom
MaanasArora:bug/choose-unnecessary-cast

Conversation

@MaanasArora
Copy link
Copy Markdown
Contributor

Resolves #22237.

This PR does not copy the existing out array in np.choose, instead creating the new output array from scratch and copying it back manually.

Thank you for reviewing!

@ngoldbaum
Copy link
Copy Markdown
Member

Neat! It's a good sign the tests pass. I don't know offhand if we have great test coverage for np.choose. One thing you could do is add a test to this PR based on the test code in the linked issue demonstrating that np.choose no longer generates a spurious warning. There's an example in the pytest docs showing how to convert warnings into errors so the test fails by raising an error if a warning is generated.

@MaanasArora
Copy link
Copy Markdown
Contributor Author

@ngoldbaum Added the test, thanks! It fails in main and passes in this branch.

Copy link
Copy Markdown
Member

@ngoldbaum ngoldbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one small nitpick. I thought carefully about reference counts and didn't see any issues.

@seberg can you give this a once-over once the failure case for CopyAnyInto gets handled and hit the merge button assuming you don't see any additional issues?

if (out != NULL && out != obj) {
Py_INCREF(out);
PyArray_ResolveWritebackIfCopy(obj);
PyArray_CopyAnyInto(out, obj);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can fail so you need to check the failure case. The same bug existed in the code you're updating - ResolveWritebackIfCopy also returns -1 on failure.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a simple failure check; hopefully that is sufficient. Thank you!

if (out != NULL && out != obj) {
Py_INCREF(out);
PyArray_ResolveWritebackIfCopy(obj);
if (PyArray_CopyAnyInto(out, obj) != 0) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (PyArray_CopyAnyInto(out, obj) != 0) {
if (PyArray_CopyAnyInto(out, obj) < 0) {

Slightly more defensive in case another non-error return code is added in the future (almost certainly won't happen but a good defensive programming habit to get into).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense; I made this change, thanks.

@MaanasArora
Copy link
Copy Markdown
Contributor Author

Sorry, it seems some unrelated changes creeped in so I removed them. This new approach should be ready for review now. Thank you

@ngoldbaum
Copy link
Copy Markdown
Member

Maybe also another test to make sure the unnecessary copy the old approach triggered doesn't happen?

@MaanasArora
Copy link
Copy Markdown
Contributor Author

MaanasArora commented Feb 1, 2025

Not entirely sure if there's a convenient way to test for that in pytest? In either approach, there would probably be no data indicating a copy was made, after the array is overwritten.

@MaanasArora
Copy link
Copy Markdown
Contributor Author

Sorry about the accidental changes--I'll be more careful with Git

Copy link
Copy Markdown
Member

@seberg seberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot, yes I think this is correct now! I also think we can simplify things more, by just adding a must_copy flag and using your new branch then unconditionally.

Py_INCREF(dtype);
obj = (PyArrayObject *)PyArray_FromArray(out, dtype, flags);
if (!PyArray_EquivTypes(dtype, PyArray_DESCR(out))) {
obj = (PyArrayObject *)PyArray_NewFromDescr(Py_TYPE(out),
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there is any reason to use Py_TYPE(out) here, just pass NULL and get a base-class array, I think.

(There is a subtle issue here about that, I think, in that ufuncs would ensure they call __array_wrap__ when done and we don't. But this a completely different issue!)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense thanks, but looking at the doc for PyArray_NewFromDescr, should we pass &PyArray_Type instead? NULL causes a ValueError.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, of course. Somehow thought it accepted NULL as PyArray_Type there.

Copy link
Copy Markdown
Contributor Author

@MaanasArora MaanasArora Feb 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made the change but it seems to fail a regression test test_attributes that tests for an identical info string assigned to the out array. Would you like me to push?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe push, but may not look at it today. I really don't see why it matters since obj is never returned to the user! (And not understanding that worries me more than the change ;)).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem at all, thanks for the feedback. Yes, it is a bit confusing...

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, I think I was editing the wrong line! Things work fine now, pushed. Thanks.

PyArray_ResolveWritebackIfCopy(obj);
int success;
if (is_newarr) {
success = PyArray_CopyAnyInto(out, obj);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In theory, you could set the base-object to your allocation above and set to use WritebackIfCopy even with your custom allocation. But, TBH, I think a direct copy is actually clearer (frankly, I am not sure if we couldn't use it in half the places ResolveWritebackIfCopy is used ;)).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see--good to know!


# gh-28206 test new StringDType casting
x = np.array(['a', 'b', 'c'], dtype=np.dtypes.StringDType())
y = np.array(['abc', 'def', 'ghi'], dtype=np.dtypes.StringDType())
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Subtle thing, but make sure to include a long string (otherwise you can't trigger the path which might have crashed).

Copy link
Copy Markdown
Contributor Author

@MaanasArora MaanasArora Feb 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually that still seems to crash, thanks for noticing. I think PyArray_ConvertToCommonType (it seems to be a backward compatibility function) does not cast properly to a common scenario for StringDType, and EquivTypes passes the dtype unchanged--so, we end up with the dtype only for the first array. Should we do a better promotion of some sort if the types are equivalent? Or perhaps, for now, we can ensure they are identical?

}
if (success < 0) {
Py_DECREF(obj);
goto fail;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just noticed: this is wrong! At this point the code already did most/all cleanup and doing that again will cause crashes. So you just need to do a straight return NULL.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, the other functions do the cleanup already--fixed this!

Copy link
Copy Markdown
Member

@seberg seberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, but one more iteration, I think. The idea here was to always manually copy out rather than use any auto-setup via PyArray_FromArray and that final writebackifcopy finalization.

So, I think both should be gone at the end, because otherwise it is confusing when what happenes.

}
dtype = PyArray_DESCR(mps[0]);

int is_newarr = 0;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you find a different name for this? At the end of this, this should have the same meaning as out != NULL && obj != out. (or make it local and keep using that)

I.e. something clarifying that the user passed out array was copied (or must be copied).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used copy_existing_out, which hopefully isn't too clumsy!

Copy link
Copy Markdown
Contributor Author

@MaanasArora MaanasArora Feb 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, wasn't is_newarr maybe more fitting as it also applies when the array is created from scratch (in out == NULL)? It seems to be a special case of out != NULL && obj != out after out is (re)assigned. I'll do out_newarr maybe?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe, in my mind it doesn't apply to out == NULL, because to me it tells us if we need to copy back at the end (as well).

Copy link
Copy Markdown
Contributor Author

@MaanasArora MaanasArora Feb 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah okay, I agree! What about copy_out? Since it's not always user passed...

NPY_ARRAY_WRITEBACKIFCOPY |
NPY_ARRAY_FORCECAST;

obj = (PyArrayObject *)PyArray_FromArray(out, dtype, flags);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This branch should now never copy, so the only thing it should be doing is check for PyArray_FailUnlessWriteable. That also means you don't need the writeback-if-copy branch below.

(We can check that earlier also, since it doesn't make sense to wait until the last copy for this to fail in the other branch either.)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, thanks! Addressed this.

}
else {
success = PyArray_ResolveWritebackIfCopy(obj);
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And thus, you never need this branch here either.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This as well.

NULL, NULL, 0,
(PyObject *)ap);
out_newarr = 1;
copy_out = 1;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
copy_out = 1;

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yes, your naming suggestion makes total sense now 😓 I'll change that back too, thanks!

goto fail;
}

PyArray_FailUnlessWriteable(out, "output array");
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add a test for this, because this doesn't work, you need an error return check and deal with that (this is not C++ with exceptions).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes sorry, fixed this, adding a test now.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test is pushed.

(PyObject *)out);
}
else {
obj = (PyArrayObject *)PyArray_FromArray(out, dtype, 0);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please only incref here and assign to obj. We rely on that being true now.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, just to be clear, this would be the dtype? Made that change.

Copy link
Copy Markdown
Member

@seberg seberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, modulo some refactors (e.g. error refcounting was wrong).

@MaanasArora
Copy link
Copy Markdown
Contributor Author

Thank you for reviewing.

@seberg seberg merged commit cdb44f7 into numpy:main Feb 26, 2025
61 of 62 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

BUG: choose does an unnecessary cast of the initial out values

3 participants