BUG: fix np.vectorize for object dtype
#28624
Conversation
|
Thanks, but this doesn't seem right. If your element-wise function returns arrays, then yes, it is correct to wrap them into another level of NumPy array and repeat it based on the input shape. I agree it is awkward if you try to spell this, and the fact that this doesn't return a 0-D array itself is even more awkward (I am considering doing something about that).

In the end, if vectorize diverges from those semantics, that would be wrong (I am not quite sure if it does). If you want the extra dimensions to carry meaning, it might work with a signature instead. For all of this we have the machinery available in principle (some bugs may need fixing), but of course it isn't a very small project.
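For concreteness, a minimal REPL sketch of the wrapping behavior described above, assuming an element-wise function that returns arrays (the lambda, the otypes choice, and the inputs are illustrations, not taken from the discussion):

>>> import numpy as np
>>> f = np.vectorize(lambda x: np.arange(3), otypes=[object])
>>> res = f([10, 20])   # the input shape (2,) is repeated on the outside
>>> res.shape
(2,)
>>> res[0]              # each element is the array returned by the function
array([0, 1, 2])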
|
I think there is a misunderstanding here. My issue is not about input arrays, it is about outputs. Let's say my function returns a list (or any valid array-like), and I wish to keep it as a list because it is very large and I don't want to duplicate its content. Then, depending on the shape of the input, the vectorized output can be completely different.

>>> f = lambda x: [1, 2, 3]
>>> v = np.vectorize(f, otypes=[object])
>>> v(None)
array([1, 2, 3], dtype=object)
>>> v([None])
array([list([1, 2, 3])], dtype=object)
>>> v([[None]])
array([[list([1, 2, 3])]], dtype=object)

I believe the current behavior when the input has at least one axis is correct. But the behavior for "scalar" inputs is inconsistent with itself and with the array-input cases. What my PR does is fix the behavior when the input is a "scalar" by adding a new axis and removing it afterwards.

>>> f = lambda x: [1, 2, 3]
>>> v = np.vectorize(f, otypes=[object])
>>> v(None)
array(list([1, 2, 3]), dtype=object)

An alternative would be to return the output object itself:

>>> f = lambda x: [1, 2, 3]
>>> v = np.vectorize(f, otypes=[object])
>>> v(None)
[1, 2, 3]
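For illustration only (a sketch built on the public API, not the PR's actual implementation), the add-an-axis-then-remove-it idea can be emulated like this:

>>> f = lambda x: [1, 2, 3]
>>> v = np.vectorize(f, otypes=[object])
>>> out = v([None])      # add an axis: the 1-D case already behaves consistently
>>> out
array([list([1, 2, 3])], dtype=object)
>>> out.reshape(())      # ...then drop the added axis again
array(list([1, 2, 3]), dtype=object)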
|
Sorry, yeah, I misjudged! I somehow read it as changing more. This still diverges a bit, though.

I just merged #28576, maybe you can have a look? Then we still need to decide whether we want to return a 0-D array or the object itself. I do wonder if there may be cases where the old behavior gave something sane, so maybe a release note is good (maybe a "change" note).
|
After giving it more thought, I think there are other issues with the current behavior of the otype detection in np.vectorize.

>>> np.vectorize(lambda x: [1, 2, 3])(None)
array([1, 2, 3])
>>> np.vectorize(lambda x: [1, 2, 3])([None])
...
TypeError: int() argument must be a string, a bytes-like object or a real number, not 'list'

>>> np.vectorize(lambda x: [1, 2, object()])(None)
array([1, 2, <object object at 0x7f60b67d66d0>], dtype=object)
>>> np.vectorize(lambda x: [1, 2, object()])([None])
array([list([1, 2, <object object at 0x7f60b67d66e0>])], dtype=object)

>>> np.vectorize(lambda x: [1, 2, [3, 4]])(None)
...
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (3,) + inhomogeneous part.

Instead, I think that the otype detection should be something like:

>>> def get_dtype(x):
...     return x.dtype if isinstance(x, np.ndarray) else np.dtype(type(x))
>>> get_dtype(1)
dtype('int64')
>>> get_dtype(np.array(1.0))
dtype('float64')
>>> get_dtype([1, 2])
dtype('O')
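As a hypothetical usage note (this is not current NumPy behavior, just the proposed rule applied by hand), such a detection would classify all three of the problematic outputs above as object dtype instead of attempting an element-wise cast:

>>> for sample in ([1, 2, 3], [1, 2, object()], [1, 2, [3, 4]]):
...     print(get_dtype(sample))   # every sample is a list, so np.dtype(list) gives object
...
object
object
object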
|
@francois-rozet but we are talking about two distinct problems, maybe? In this PR, you are passing an explicit object otype. I agree that if you don't pass one, there are cases which raise an error now, although others are -- at least for now -- allowed.

Maybe not the best analogy, but the point is that we can put anything into an object dtype, but for everything else the result should be a scalar (or at least 0-D). So I think it would make sense to error if it isn't 0-D: either the result can be 0-D (the user can easily convert it in the function), or the extra dimensions have a meaning, and the right thing would be to use a signature.

I think I would be happy to deprecate your examples, but we should try to deprecate it slowly, I suspect. (It seems easy to end up with a single-element but not 0-D array for some odd reason.)
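For reference, a small REPL sketch of the signature-based alternative alluded to above, using np.vectorize's existing signature parameter (the lambda and the inputs are only examples):

>>> import numpy as np
>>> g = np.vectorize(lambda x: np.array([x, x + 1, x + 2]), signature="()->(n)")
>>> g(0)            # the trailing dimension is a real axis, not an opaque object
array([0, 1, 2])
>>> g([0, 10])
array([[ 0,  1,  2],
       [10, 11, 12]])
|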
I am not sure I understand where I can set this. In the call to the ufunc? |
In the call to the ufunc, yes, you can add it there. |
Yes, you are right, the type detection and casting are not the same problems, but I figured they are related. Should I remove the fix in the last commit, which changes the otype detection?

Note that I do not necessarily agree with you that the output should be a "scalar". When the output is already an array, I think it makes sense to return it as it is.
|
Is my last commit what you meant? |
Yeah, fair. For a normal dtype, though, the result should be a scalar (or at least 0-D).

Not sure I follow, so please do what you think is best for now. I'll have to think about it a bit more later.

Yes, I think we may be able to change some of the following code a bit with that (things are already arrays), but I am not sure it is needed.
|
To summarize the modifications:

1. for "scalar" inputs, the output of the ufunc is now wrapped into an object array before casting, fixing the inconsistent casts for array-like outputs;
2. the otype detection is changed so that array-like outputs such as lists are detected as object dtype (as in the get_dtype example above).
|
Thanks for the summary, now I see what you meant above. Maybe it is best to split the two; my first gut feeling is that change 1 is quite a bit safer than 2.
|
Sorry, I made a mistake. The two changes are not separable. The original issue was that for scalar inputs the output of the ufunc was not wrapped into an object array, resulting in inconsistent casts for array-like outputs. This was resolved with change 1.

Now, the outputs of the ufunc are always object arrays. If these object arrays contain Python objects, including lists, casting them to a detected non-object otype fails (as in the errors above). In summary, without fixing the otype detection, change 1 alone breaks cases that currently work.
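A hypothetical REPL illustration of the casting problem described above, built directly with object arrays rather than through np.vectorize:

>>> import numpy as np
>>> out = np.empty(1, dtype=object)   # stand-in for the ufunc's object-array output
>>> out[0] = [1, 2, 3]                # the element-wise result is itself a list
>>> out.astype(np.int64)              # casting to a detected non-object otype fails
Traceback (most recent call last):
  ...
TypeError: int() argument must be a string, a bytes-like object or a real number, not 'list'
|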
Right, but to me it seems possibly OK to force users to set otypes explicitly in that case. I appreciate that you want to make things more useful here, but I am not sure if that is the right way to make it more useful yet. |
So should we let the code crash? Or should we add an informative error?
|
If we know it's definitely going to crash while figuring out the otypes, why not raise a clearer error right away?
|
But by "crash" you mean, "raises a hard to understand exception". And if I accept that exception (which for now I do), then the only problem is the "hard to understand" (especially w.r.t. to how to work around). Which is why I like to split it out, so we can work out there whether we want to improve the error (so it tends to trigger immediately, always and informs of EDIT: We have a hidden function (that we should unhide in parts) |
|
@seberg I have reverted the otypes detection to the old behavior. The PR now only fixes the incorrect casting when the input is a "scalar". All the bugs reported in #28624 (comment) are therefore still present. |
seberg left a comment
Thanks for reverting and the follow-ups! I think this is now indeed always a bug fix, so let's put it in.
We can discuss the other changes in a new PR.
|
Scratch that. Just fix the tests! EDIT: Sorry, got the wrong window, this comment was meant for astropy. |
Closes #21274