
Conversation

@francois-rozet
Contributor

Closes #21274

@jorenham jorenham changed the title BUG: fix np.vectorize for object otype BUG: fix np.vectorize for object dtype Apr 2, 2025
@seberg
Member

seberg commented Apr 2, 2025

Thanks, but this doesn't seem right. If your element-wise function returns arrays, then yes, it is correct to wrap them into another level of NumPy array and repeat that based on the input shape.

I agree it is awkward if you try to spell this:

arr = np.empty((), dtype=object)
arr[()] = np.arange(3)

f(arr)  # arr is a 0-D array containing an object array

And the fact that this doesn't return a 0-D array itself is even more awkward (I am considering doing something about that, even if just out=... to prevent it for now).

I agree dtype=object is unfortunately awkward in some cases. The reason is a fundamental difficulty, though. NumPy must make a choice about whether np.arange(3) is a scalar or an array input (for casting), and that decision is that it is an array (to be cast).
That won't be right in all contexts, but unless you write a custom dtype that is what you have for dtype=object...

In the end, np.frompyfunc does everything right, I expect. You can also compare with np.positive(x, dtype=object) (of course that makes it harder to return arrays for the element-wise function).
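For reference, a minimal sketch of the np.frompyfunc behavior being compared against (illustrative, using the same lambda as the examples below):

```python
import numpy as np

# np.frompyfunc keeps object outputs as-is: a scalar input yields the
# raw Python object, an array input yields an object array wrapping it.
f = np.frompyfunc(lambda x: [1, 2, 3], 1, 1)

scalar_out = f(None)    # the list itself
array_out = f([None])   # 1-D object array containing the list
```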

If vectorize diverges from those, that would be wrong (I am not quite sure if it does).


If you want non-object-arr input to be viewed as a scalar, you need a bigger hammer somewhere.
That is plausible in principle but it would have to be indicated on the DType. A custom DType could actually do that but it can't be the default for the object dtype.

So basically, it might work with a dtype=np.dtypes.ObjectDType(non_obj_arrays_are_scalars=True), but then the ufunc/frompyfunc would still have to add special logic to use this for the input.
Or a custom DType (or a built-in additional one, maybe a subclass of np.dtypes.ObjectDType); then you would still have to say that the dtype (like otypes=) is not object, but that new dtype.

For all of this we have the machinery available in principle (some bugs may need fixing), but of course it isn't a very small project.

@francois-rozet
Contributor Author

francois-rozet commented Apr 2, 2025

I think there is a misunderstanding here. My issue is not about input arrays. It is about outputs. Let's say my function returns a list (or any valid array-like) and I wish to keep it as a list because it is very large and I don't want to duplicate its content. Then, depending on the shape of the input, the vectorized output can be completely different.

>>> f = lambda x: [1, 2, 3]
>>> v = np.vectorize(f, otypes=[object])
>>> v(None)
array([1, 2, 3], dtype=object)
>>> v([None])
array([list([1, 2, 3])], dtype=object)
>>> v([[None]])
array([[list([1, 2, 3])]], dtype=object)

I believe the current behavior when the input has at-least one axis is correct. But the behavior for "scalar" inputs is inconsistent, with itself and np.frompyfunc.

What my PR does is fix the behavior when the input is a "scalar" by adding a new axis and then removing it.

>>> f = lambda x: [1, 2, 3]
>>> v = np.vectorize(f, otypes=[object])
>>> v(None)
array(list([1, 2, 3]), dtype=object)

An alternative, to match np.frompyfunc more closely, would be to index the new axis instead of squeezing it, which would lead to:

>>> f = lambda x: [1, 2, 3]
>>> v = np.vectorize(f, otypes=[object])
>>> v(None)
[1, 2, 3]
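That indexing alternative can already be emulated by hand on releases without the fix (a sketch; the [0] plays the role of the added-then-removed axis):

```python
import numpy as np

f = lambda x: [1, 2, 3]
v = np.vectorize(f, otypes=[object])

# Wrap the scalar input in a 1-element list, then index the result,
# so the returned list survives as a single object instead of being
# unpacked into an array.
result = v([None])[0]   # the list itself
```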

@seberg
Member

seberg commented Apr 2, 2025

Sorry, yeah, I misjudged! I somehow read it as changing more. This diverges from frompyfunc which I agree is wrong. I suppose we could be OK with always returning an array.

I just merged #28576, maybe you can have a look? That allows using out=... to ensure an array return. That in turn means you don't need the asanyarray() call there at all (you can just call res.astype() instead). Admittedly, a proper output_dtype= would be even nicer but we don't have that yet...

Then we still need to decide whether we want to return arr[()] (i.e. the scalar) at the end, but I suspect you may not even care about that part.
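For context, returning arr[()] from a 0-D object array unwraps the stored Python object rather than keeping the array (a quick illustration):

```python
import numpy as np

# A 0-D object array holding a list.
arr = np.empty((), dtype=object)
arr[()] = [1, 2, 3]

# Indexing with the empty tuple returns the stored object itself.
unwrapped = arr[()]   # the list, not a 0-D array
```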


I do wonder if there may be cases where the old behavior gave something sane, so maybe a release note is good (maybe "change") see doc/release/upcoming_changes/README.md for how to write one.

@francois-rozet
Contributor Author

After giving it more thought, I think there are other issues with the current behavior of np.vectorize. I think array-like outputs should never be implicitly cast to arrays. Why? Because it is buggy and inconsistent.

>>> np.vectorize(lambda x: [1, 2, 3])(None)
array([1, 2, 3])
>>> np.vectorize(lambda x: [1, 2, 3])([None])
...
TypeError: int() argument must be a string, a bytes-like object or a real number, not 'list'
>>> np.vectorize(lambda x: [1, 2, object()])(None)
array([1, 2, <object object at 0x7f60b67d66d0>], dtype=object)
>>> np.vectorize(lambda x: [1, 2, object()])([None])
array([list([1, 2, <object object at 0x7f60b67d66e0>])], dtype=object)
>>> np.vectorize(lambda x: [1, 2, [3, 4]])(None)
...
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (3,) + inhomogeneous part.

Instead, I think that np.vectorize should detect types with

>>> def get_dtype(x):
...     return x.dtype if isinstance(x, np.ndarray) else np.dtype(type(x))
>>> get_dtype(1)
dtype('int64')
>>> get_dtype(np.array(1.0))
dtype('float64')
>>> get_dtype([1, 2])
dtype('O')

@seberg
Member

seberg commented Apr 2, 2025

@francois-rozet but we are talking about two distinct problems maybe? In this PR, you are passing otypes=[object] and for that, the behavior seems perfectly well defined. While in the new examples now, you are not passing it.

I agree that if you don't pass otypes things don't make much sense unless the return value is indeed a scalar. That is similar to:

np.array([None, [1, 2, 3]])

which raises an error now, although:

np.array([None, [1, 2, 3]], dtype=object)

is -- at least for now -- allowed. Maybe not the best analogy, but the point is that we can put anything into an object dtype, but for everything else, the result should be a scalar (or at least 0-D).

So, I think it would make sense to error if it isn't 0-D, since, arguably, either the result can be 0-D (the user can easily convert it in the function), or the extra dimensions have a meaning, and the right thing would be to use signature= (even if vectorize with signature= isn't ideal).

I think I would be happy to deprecate your examples, but we should try to deprecate it slowly, I suspect. (It seems easy to end up with a single-element but not 0-D array for some odd reason.)

@francois-rozet
Contributor Author

francois-rozet commented Apr 2, 2025

I just merged #28576, maybe you can have a look? That allows using out=... to ensure an array return.

I am not sure I understand where I can set this. In the call to the ufunc or in its creation?

@seberg
Member

seberg commented Apr 2, 2025

I am not sure I understand where I can set this. In the call to the ufunc or in its creation?

In the call to the ufunc, you can add out=... now.

@francois-rozet
Contributor Author

francois-rozet commented Apr 2, 2025

but we are talking about two distinct problems maybe?

Yes you are right, the type detection and casting are not the same problems. But I figured they are related. Should I remove the fix in the last commit, which ensures that list and tuple in the output are not cast to arrays implicitly?

Note that I do not necessarily agree with you that the output should be a "scalar". When the output is already an array, I think it makes sense for vectorize to stack the outputs into a single array.

@francois-rozet
Contributor Author

In the call to the ufunc, you can add out=... now.

Is my last commit what you meant?

@seberg
Member

seberg commented Apr 2, 2025

I think it makes sense for vectorize to stack the outputs into a single array.

Yeah, fair. For a normal ufunc that is what signature= should indicate (that the output shape is different). But I would agree with not messing with things here unnecessarily.
I suppose the point is whether you get into a confusing mess because object and non-object otypes differ here in obscure ways.

Should I remove the fix in the last commit, which ensures that list and tuple in the output are not cast to arrays implicitly?

Not sure I follow, so please do what you think is best for now. I'll have to think about it a bit more later.

Is my last commit what you meant?

Yes, I think we may be able to change some of the following code a bit with that (things are already arrays), but not sure it is needed.

@francois-rozet
Contributor Author

francois-rozet commented Apr 2, 2025

To summarize the modifications:

  1. When otype=object, np.ndarray outputs are not cast to dtype=object anymore.
  2. When otypes is not set, list and tuple outputs are not cast to arrays anymore.

@seberg
Member

seberg commented Apr 2, 2025

Thanks for the summary, now I see what you meant above. Maybe it is best to split the two, my first gut feeling is that change 1 is quite a bit safer than 2.
(but again, we are iterating quickly, and I am only paying half attention right now)

@francois-rozet
Contributor Author

Sorry, I made a mistake. The two changes are not separable.

The original issue was that for scalar inputs the output of the ufunc was not wrapped into an object array, resulting in inconsistent casts for array-like outputs. This was resolved with out=....

Now, the outputs of the ufunc are always object arrays. If these object arrays contain Python objects, including list and tuple, they cannot be cast to anything but dtype=object.

However, vectorize currently sets the otype of lists and tuples to the type of their content, instead of object. For example, [1, 2, 3] has otype int. Then, when it attempts to cast the output of the ufunc, it crashes.

In summary, without fixing the otype detection, vectorize does not work with any function that returns lists or tuples.
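A simplified sketch of the detection step described above (not the actual vectorize internals, which live in the private _get_ufunc_and_otypes helper):

```python
import numpy as np

# vectorize infers otypes by calling the function on the first input
# elements and inspecting the result with asarray -- roughly like this:
sample = (lambda x: [1, 2, 3])(None)
detected = np.asarray(sample).dtype   # an int dtype, not object
```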

@seberg
Member

seberg commented Apr 3, 2025

In summary, without fixing the otype detection, vectorize does not work with any function that returns lists or tuples.

Right, but to me it seems possibly OK to force users to set otypes=[object] here? We still have two distinct things: if otypes is set, we have a clear fix for scalar results.
If otypes isn't set, you are making certain usage patterns possible that were not before.

I appreciate that you want to make things more useful here, but I am not sure if that is the right way to make it more useful yet. E.g. signature= should already provide a way for list/tuple returns to be incorporated into a numeric array in principle.
(Or I could also say, that I am very comfortable with asking users to rely on otypes= more, because it is safer anyway.)

@francois-rozet
Contributor Author

Or I could also say, that I am very comfortable with asking users to rely on otypes= more, because it is safer anyway.

So should we let the code crash? Or should we add an assert warning the user to use otypes if the output is a list?

@seberg
Member

seberg commented Apr 3, 2025

If we know it's definitely going to crash while figuring out otypes, then raising a more specific error seems good. But maybe it is still easier to split the two?

@francois-rozet
Contributor Author

_get_ufunc_and_otypes will crash if the output is a nested list/tuple with uneven lengths. If the output is a list of NumPy-compatible types (int, float, str, ...), _get_ufunc_and_otypes will detect otypes incorrectly and vectorize will crash when casting later on.
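The second failure mode can be reproduced directly (a sketch; the exact exception type and message vary between versions, so both are caught here):

```python
import numpy as np

# Without otypes, the first call detects an int otype for [1, 2, 3];
# the later cast of list-valued ufunc outputs to int then fails.
v = np.vectorize(lambda x: [1, 2, 3])
try:
    v([None])
    crashed = False
except (TypeError, ValueError):
    crashed = True
```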

@seberg
Member

seberg commented Apr 4, 2025

But by "crash" you mean "raises a hard-to-understand exception". And if I accept that exception (which for now I do), then the only problem is the "hard to understand" part (especially w.r.t. how to work around it).

Which is why I like to split it out, so we can work out there whether we want to improve the error (so it triggers immediately and always, and informs about otypes=[object], for example).
(I may be slow to respond for a bit, but I think the first part just seems much clearer since I think in all cases it fixes what is clearly wrong.)

EDIT: We have a hidden function (that we should unhide in parts) np._core.multiarray_umath._discover_array_parameters, since we never added an ndmax parameter to np.asarray() which may be useful there. We can add it without any thoughts there (it's in C, but it is extremely easy to add there). I am not actually sure how that parameter would behave for a tuple (go to object or raise an error unless dtype=object).

@francois-rozet
Contributor Author

francois-rozet commented Apr 13, 2025

@seberg I have reverted the otypes detection to the old behavior. The PR now only fixes the incorrect casting when the input is a "scalar". All the bugs reported in #28624 (comment) are therefore still present.


@seberg seberg left a comment


Thanks for reverting and the follow-ups! I think this is now indeed always a bug fix, so let's put it in.

We can discuss the other changes on a new PR.

@seberg seberg merged commit 461d1c0 into numpy:main Apr 22, 2025
71 of 72 checks passed
@github-project-automation github-project-automation bot moved this from Pending authors' response to Completed in NumPy first-time contributor PRs Apr 22, 2025
@seberg
Copy link
Member

seberg commented May 26, 2025

Scratch that. Just fix the tests!

You never supported vectorize. Nothing significant there changed at all. The only change is that we stopped calling asarray() rather than asanyarray() (i.e. something you like) and because of that the error changed.

And it only changed if otypes= isn't used anyway.

EDIT: Sorry got the wrong window, this was for astropy



Successfully merging this pull request may close these issues.

BUG: np.vectorize is inconsistent for scalar inputs and array-like outputs
