Add support for __array_function__ protocol #4567
Conversation
|
OK, I'm not really sure why that's failing only on newer Python/NumPy versions. Does anybody see something obvious I've missed? |
|
NumPy 1.13 and 1.11 didn't have this protocol... So what you added was essentially dead code then. |
|
I'm surprised to see tests with sparse and cupy in this PR. From my perspective this is more about having Dask Arrays respond to NumPy functions, rather than having the blocks of a Dask array respond to NumPy functions. So I would be more interested in checks like the following:

    x = da.ones(...)
    y = np.concatenate([x, x, x])
    assert isinstance(y, da.Array)

where we would also want to check things like |
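To illustrate the behavior being requested, here is a self-contained toy version of the protocol using a made-up ToyArray class instead of dask (the names ToyArray and _unwrap are illustrative assumptions, not dask code; requires NumPy >= 1.17, or the experimental flag on 1.16):

```python
import numpy as np

def _unwrap(obj):
    # Recursively replace ToyArray instances with their underlying ndarrays,
    # including inside lists/tuples (np.concatenate receives a sequence).
    if isinstance(obj, ToyArray):
        return obj.data
    if isinstance(obj, (list, tuple)):
        return type(obj)(_unwrap(o) for o in obj)
    return obj

class ToyArray:
    """Toy stand-in for da.Array, just enough to demonstrate dispatch."""
    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_function__(self, func, types, args, kwargs):
        # Re-invoke the NumPy function on plain ndarrays, then re-wrap,
        # so the result type is preserved as in the check above.
        return ToyArray(func(*_unwrap(args), **_unwrap(kwargs)))

x = ToyArray(np.ones(3))
y = np.concatenate([x, x, x])  # dispatches to ToyArray.__array_function__
```

Here np.concatenate returns a ToyArray rather than an ndarray, which is the analogue of asserting isinstance(y, da.Array).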
Oh yes, that makes sense. I've been testing with upstream NumPy, that's why it passed here. |
|
@mrocklin it's good that you mention that, I had totally forgotten about that case. But either way, are you suggesting we drop the Sparse and CuPy tests? I understand it may be a little early to have them properly set up due to different version requirements, but ultimately, isn't that what we want, to have Dask also working with other arrays? |
|
We do want those to work, yes, but I think that effort is somewhat orthogonal to the solution you've added here. Or, in testing terms, you've got some great integration tests here; it would be good to see some unit tests as well that just test dask array's support of the __array_function__ protocol. |
|
OK, so here's my suggestion: I will add the tests you suggested (which I really forgot before), and skip the tests we currently have if NumPy < 1.17. Does that sound acceptable? |
|
Sure. More tests are always great (assuming they're decently fast)
|
Agreed, the current ones were tweaked to be fast. I didn't profile them exactly, but all 3 tests were certainly < 1 sec. |
|
Ah, now I realize why the tests are failing: before 1.17 (currently, upstream), we have to set the environment variable NUMPY_EXPERIMENTAL_ARRAY_FUNCTION=1. So the question here is, can we enable that in CI for newer versions of Python? Otherwise, I think the only alternative is to skip these tests until v1.17 is released. |
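For reference, a minimal sketch of the setup being discussed, assuming NumPy 1.16's behavior that the protocol is only enabled when the variable is set before NumPy is first imported:

```python
import os

# Assumption per the discussion above: on NumPy 1.16 the __array_function__
# protocol is only active if this variable is set before NumPy's first import,
# so in CI it must be exported (or set here) ahead of the test run.
os.environ["NUMPY_EXPERIMENTAL_ARRAY_FUNCTION"] = "1"

import numpy as np  # imported only after setting the flag
```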
|
I think that enabling that is fine
|
Alright, I added some tests. Regarding the NumPy version, we need at least v1.16 (now skipped for older versions) and NUMPY_EXPERIMENTAL_ARRAY_FUNCTION=1 set. |
mrocklin left a comment:
Ah, looks like I didn't submit this review. I repeat a bit of what @shoyer said, but add some other comments as well.
dask/array/core.py (outdated):

    if da_func is func:
        return NotImplemented
    if not all(issubclass(t, Array) for t in types):
        return NotImplemented
|
So in many cases a dask array function will work with numpy objects (or presumably anything on which it can effectively call asarray). I suggest a test like np.tensordot(dask_array, numpy_array) and replacing the all above with an any.
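A sketch of that suggestion with a minimal stand-in class (MyArray is hypothetical, not dask code; requires NumPy >= 1.17 for the protocol to be active):

```python
import numpy as np

class MyArray:
    """Minimal duck array that accepts mixed MyArray/ndarray arguments."""
    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_function__(self, func, types, args, kwargs):
        # "any" rather than "all": plain ndarrays are fine alongside MyArray,
        # since they can be consumed directly by the underlying function.
        if not any(issubclass(t, (MyArray, np.ndarray)) for t in types):
            return NotImplemented
        unwrapped = tuple(a.data if isinstance(a, MyArray) else a for a in args)
        return MyArray(func(*unwrapped, **kwargs))

# Mixed-type call, analogous to the suggested np.tensordot(dask_array, numpy_array)
result = np.tensordot(MyArray([[1.0, 2.0], [3.0, 4.0]]), np.eye(2))
```

With the all-based check, the presence of a plain ndarray in types would cause NotImplemented to be returned even though the operation is well defined.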
|
The __array_function__ protocol guarantees that it will only get called if dask.array.Array is in types. So if that's all you wanted to check, you could drop this entirely.
The problem is that you want to support operations involving some but not all other array types. For example, xarray.DataArray wraps dask arrays, not the other way around. So dask.Array.__array_function__ should return NotImplemented in that case, leaving it up to xarray.DataArray.__array_function__ to define the operation.
We basically need some protocol or registration system to mark arrays as "can be coerced into dask array", e.g., either
sparse.Array.__dask_chunk_compatible__ = True
or
dask.array.register_chunk_type(sparse.Array)
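The second option could look roughly like this; a hypothetical sketch, where register_chunk_type, the registry name, and FakeSparseArray are all made up for illustration rather than existing dask API:

```python
# Hypothetical registry marking types that may serve as the blocks (chunks)
# of a dask Array, as opposed to types that wrap dask arrays (e.g. xarray).
_HANDLED_CHUNK_TYPES = []

def register_chunk_type(cls):
    """Declare that instances of cls can be used as blocks of a dask Array."""
    _HANDLED_CHUNK_TYPES.append(cls)

def is_valid_chunk_type(cls):
    # __array_function__ could then return NotImplemented for any type
    # that is neither a registered chunk type nor a dask Array itself.
    return any(issubclass(cls, t) for t in _HANDLED_CHUNK_TYPES)

class FakeSparseArray:  # stand-in for something like sparse's array class
    pass

register_chunk_type(FakeSparseArray)
```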
|
Hrm, that's an interesting problem. Maybe we try to turn all of the args into dask arrays with da.asarray and, if that works, proceed?
I'm not sure that this would produce the correct behavior for xarray.DataArray though. I genuinely don't know what the right behavior is there.
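A toy version of that "coerce and proceed" idea, with a hypothetical helper name (dask's real entry point would be da.asarray, which is more involved):

```python
import numpy as np

def try_coerce_all(args):
    """Return coerced arrays if every argument is array-like, else None.

    Stand-in for 'try asarray on everything and, if that works, proceed;
    otherwise return NotImplemented from __array_function__'.
    """
    try:
        return [np.asarray(a, dtype=float) for a in args]
    except (TypeError, ValueError):
        return None
```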
|
I actually think my idea for mixins can help with this. Specifically, we could check the specific mixins that Dask absolutely requires, I’m assuming that’d be the indexing mixin.
|
Hrm, that's an interesting problem. Maybe we try to turn all of the args into dask arrays with da.asarray and, if that works, proceed?
I actually like this idea for its simplicity.
If we want to be completely general, we could use hasattr(t, '__array_function__'). But this would also include Pandas dataframes at some point I'm assuming...
I also like this idea in general, but I think it would then be best to first agree on all the mixins we expect to have/support. Plus, if I understood correctly, it feels like this will require much more type checking, which may become complicated to handle.
|
I think that would only help to address the second point above. I'm not sure that it would solve it though. As a worst case scenario, I suspect that a project like TensorFlow would not inherit from those classes. We could get NumPy, Dask, Sparse, and Xarray on board, but I'm not sure about any of the others out there.
I personally think that duck typing is more likely to work across projects than explicit typing.
Mixins may still be a good idea for the reason you mention: you can easily implement a lot of NumPy-like functionality without too much effort. But I don't think that they're a good choice for testing NumPy-ness.
|
I do agree that it's hard to get traction for mixins or add dependencies, but adding a protocol isn't hard.
While you're right that TensorFlow doesn't plan on implementing any of these protocols, it has always historically done its own thing, and it has its own distributed system. I suspect that with Google's historical "rewrite the world and do it better" philosophy, it'll be hard to get them to play along anyway.
In contrast, anyone specifically aiming to be a NumPy duck array won't have a problem implementing protocols, and that is what I'm planning to do, to allow mixins to implement methods and so forth based on the implementation of protocols.
|
For the time being, I've used @shoyer's and @mrocklin's suggestions, replacing only that last condition in __array_function__ with:

    if not all(issubclass(t, Array) for t in types):
        return NotImplemented

It seems to me that the mixin discussion has grown enough that we should move it to a Dask issue, discuss it there, and later review the current code. Does anybody disagree with my proposal?
|
Since nobody opposed, I moved the discussion to #4583.
|
I think that with @shoyer 's statements below we can drop this conditional entirely. It should always pass
The __array_function__ protocol guarantees that it will only get called if dask.array.Array is in types. So if that's all you wanted to check, you could drop this entirely.
You need to add that for 1.16, where it was experimental. For 1.17, though, the env var isn't needed currently, but people are proposing that we keep the "experimental" status of the NEP for one more version... |
While I agree that features should be experimental until they can really be considered stable, I was really expecting that v1.17 would introduce this feature as stable. Without that, it's difficult to get other projects to even run unit tests for it, e.g., CuPy. I'm totally willing to help improve everything we can to attempt having this stable in v1.17; are there many major concerns about its maturity? |
CuPy has been using it for a while (as short as whiles go)... So I think not. There might be features added but not removed. |
Tests added:
* Submodules
* Functionality not implemented in Dask
|
One thing I must mention I like about this implementation is that it specifically avoids a NumPy dependency or mapping. |
I fixed this issue in my latest commit. If anyone thinks this is wrong, please let me know. I would say this is ready for another round of review now. |
That looks correct to me, @pentschev. Those are not really ufuncs, but they happened to more or less work for the purposes of the original tests. So it is more correct to test them separately. Thanks for correcting these. |
jakirkham left a comment:
This seems fine to me. I added a small comment above about adjusting the tests a little, but don't view this as a blocker.
|
@jakirkham thanks for the review! |
    dweights = da.from_array(weights, chunks=2)
    e = da.bincount(d, weights=dweights, minlength=6)
    assert_eq(e, np.bincount(x, weights=dweights, minlength=6))
    assert_eq(e, np.bincount(x, weights=dweights.compute(), minlength=6))
|
Why was this change needed?
|
Because in this case, the right argument of assert_eq() is used as the baseline, and with x being a NumPy array, it fails here:

    @wraps(np.bincount)
    def bincount(x, weights=None, minlength=None):
        if minlength is None:
            raise TypeError("Must specify minlength argument in da.bincount")
        assert x.ndim == 1
        if weights is not None:
    >       assert weights.chunks == x.chunks
    E       AttributeError: 'numpy.ndarray' object has no attribute 'chunks'

Therefore, to me it makes sense to make both arguments NumPy arrays and let it fall back to NumPy's implementation.
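For concreteness, a NumPy-only sketch of what the right-hand baseline then computes (the data values here are made up for illustration):

```python
import numpy as np

# Both arguments are plain NumPy arrays, so np.bincount runs NumPy's own
# implementation rather than dispatching to dask's bincount.
x = np.array([0, 1, 1, 2, 2, 2])
weights = np.array([0.5, 1.0, 1.0, 2.0, 2.0, 2.0])
baseline = np.bincount(x, weights=weights, minlength=6)
```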
|
I think the reverse should be done, with weights being coerced to da.Array with the same chunks as x.
|
Ah I see. I think I agree with @pentschev here. The left argument, e, is testing the dask/dask situation while the right argument is testing the numpy/numpy situation.
|
I think the reverse should be done, with weights being coerced to da.Array with the same chunks as x.

That would cause da.bincount() to be dispatched, and then we would be testing against e, which computes exactly that.
dask/array/core.py (outdated):

    if da_func is func:
        return NotImplemented
    if not all(issubclass(t, Array) for t in types):
        return NotImplemented
|
I think that with @shoyer 's statements below we can drop this conditional entirely. It should always pass
The __array_function__ protocol guarantees that it will only get called if dask.array.Array is in types. So if that's all you wanted to check, you could drop this entirely.
I think I've left this in for a reason, though I can't remember for sure. I think it was for the case of having arrays of different types. Wouldn't it trigger if one array was a Dask array and another a CuPy one, for example? |
|
Currently the code is

    if not any(issubclass(t, (Array, np.ndarray)) for t in types):
        return NotImplemented

But we know that |
|
@mrocklin I'm confused, are you saying now that we should get rid of the entire condition or just of |
|
I'm saying that I think we can remove the entire condition. If what @shoyer suggests is true then I don't think it will ever be triggered. |
|
I was thinking of other types, like CuPy and maybe Sparse. I thought I had seen one such case, but maybe I got confused. I'll remove it for now, and if need be, we can always add it back later. |
|
Currently your condition will not trigger if any of the arguments is of type Array or np.ndarray. I may not fully understand things though. |
|
@mrocklin I don't either, but that's most likely right; since Dask can operate on different types as well, it should probably be capable of operating on mixed types too. For now I've removed this condition. |
|
Merging this in. Thanks for the effort here @pentschev and thank you @hameerabbasi, @jakirkham, and @shoyer for the review |
|
For testing, I suppose this is OK, but I do think it's a really bad idea for dask not to include any type checking inside __array_function__. But let's discuss more in #4583. |
|
To be clear, thanks @pentschev and @mrocklin for pushing this forward. I'm really excited about the potential here! I just want to get the design right. Given that we were involved in writing the NEP, I feel like we should try hard to be an exemplar implementation of __array_function__. |
|
Ah, my apologies if I jumped the gun here. Happy to revert and reopen if desired. |
|
No need to revert, we can fix this forward. |
|
Thanks for the reviews @mrocklin, @shoyer, @hameerabbasi, @jakirkham. I absolutely agree with @shoyer, we should get the design right. I'm happy to help with whatever I can. I'm not sure I can help much now with coming up with a design, since I'm not aware of many of the use cases and how they should work, but count on me later on for implementation and testing. |
Add __array_function__ support to Array class
Resolves #4563.
flake8 dask