Add simple chunk type registry and defer as appropriate to upcast types #6393
Conversation
Thank you for writing this up @jthielen. I'm somewhat against an explicit whitelist of types by default. If some small third-party library makes a numpy-compatible array library I don't want them to have to explicitly talk to us or import dask in order to have things work. Ideally with protocols we're able to avoid pairwise library interactions like this. I may be missing subtle points here though.
Thank you for the feedback! I took the list-of-downcast-types approach based on @shoyer's direction in #4583 (in contrast to the list-of-upcast-types approach we took in Pint in hgrecco/pint#959). For a library like Dask that is generally in the middle of the type casting hierarchy, I would think either of those two approaches would be good, and I would have no problem swapping over to a registry of upcast types (e.g., xarray Variable/DataArray/Dataset, Pint Quantity) instead. Also, please anyone correct me if I am wrong in this, but part of the reason I think those are the two options to decide between is that pairwise library interactions are unavoidable. The type casting hierarchy as defined in NEP 13 is a directed acyclic graph (and not a strictly ordered list of array priorities), so pairwise casting relationships are inherent to its structure and use. Perhaps a protocol establishing these pairwise relationships could be determined, but I'm not sure that gives any benefits over upcast/downcast type registries. Plus, given NEPs 13 and 18 and the work done across all the duck array libraries, we are very close to having structures like xarray > Pint > Dask > Sparse (among other combinations) working properly, and I'd hate to wait for a new protocol to be approved when it seems like we already have almost everything that we need once #4583 and pydata/xarray#4208 are resolved. But, I must admit that in all this duck array work, I've been hyper-focused on the type casting hierarchy, so I very well may be missing the forest for the trees here.
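To make the two registry styles concrete, here is a minimal sketch (illustrative only; `MiddleArray` and both registry names are made up, not actual Dask or Pint code) of how a library in the middle of the hierarchy could consult either a list of downcast types it may wrap or a list of upcast types it must defer to:

```python
import numpy as np

# Hypothetical registries; neither name is real Dask or Pint API.
DOWNCAST_TYPES = [np.ndarray]  # types this library may wrap (this PR's style)
UPCAST_TYPES = []              # types this library defers to (the Pint style)


class MiddleArray:
    """Toy stand-in for a library in the middle of the hierarchy, like Dask."""

    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # Defer to any operand registered as an upcast type...
        if any(isinstance(x, tuple(UPCAST_TYPES)) for x in inputs):
            return NotImplemented
        # ...and only handle operands we know how to wrap.
        if not all(
            isinstance(x, (MiddleArray, *DOWNCAST_TYPES)) or np.isscalar(x)
            for x in inputs
        ):
            return NotImplemented
        arrays = [x.data if isinstance(x, MiddleArray) else x for x in inputs]
        return MiddleArray(getattr(ufunc, method)(*arrays, **kwargs))
```

Either way the deferral decision is explicit; the difference is just which side of the hierarchy maintains the list.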
I think we need something like this to make the duck array ecosystem work. There's no a priori way to know if dask arrays should handle pint or the other way around. Some alternatives come to mind:
On the whole, I think either this PR or some version of (1) would be the best option in the long term, but alternative (1) would be a lot more work and would be slow to roll out.
@shoyer It looks like we doubled up on our comments! I definitely agree with your points (and it is particularly good to note that your option 3 gives undefined behavior rather than an error, something I neglected to consider in my "either/or" point above).
I haven't thought about this problem too deeply, but here are my two cents. Duck-arrays can be divided broadly into three categories, based on their interface:
In my (admittedly limited) knowledge, any DAG should work so long as it topologically sorts things in the correct order as given above. The question is where Dask fits in... Does it go into category 1 or 2? Well, it's a bit special: it shouldn't wrap things from 2, only 1. So I'd put it at 1.5. I do agree with @shoyer that these libraries rarely coordinate on this front, but I think there's a reason for that: there's no good guidance on how to choose.

Going even further, I think one can (for full functionality) pick something from category 1, multiple things from category 2, and one from category 3 again. So it's really only the libraries in category 2 that have to do some extra work; all others can just reject/accept as appropriate.
I don't think Dask is so special here. Nearly every duck array adds at least something on top of NumPy -- otherwise it wouldn't have a good reason for existing! E.g., even with pydata/sparse, you can pull out attributes indicating the underlying representation of a sparse array, which is occasionally quite useful. So most of your examples might actually fall in between 1 and 2. We might attempt to do a (near) complete census of hypothetical duck array types, and could perhaps come up with something like the following order:
Hopefully the issues with this sort of thing are self-evident:
Also, to follow up on @shoyer's points, another problem with trying to base casting rules on an ordered/tiered list (rather than the full information of the DAG) is the implicit assumption that an array at any given level has to support wrapping any of those below, which may not necessarily be the case. Take the example hierarchy from NEP 13: in any topological ordering of the DAG, C will be above D, but C cannot handle interactions with D (without B wrapping D). So, unless we force the DAG to be dense by requiring support for all arrays lower on the ordered/tiered list, the pairwise relationships between types (specific edges of the DAG) are required.
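As a toy illustration of that point (the edges here are inferred from the comment above rather than copied from NEP 13), a topological order exists but overstates what C can handle:

```python
# "X handles Y" means X knows how to wrap/consume Y.
handles = {
    "A": set(),
    "B": {"A", "D"},  # B wraps A and D
    "C": {"A", "B"},  # C wraps A and B, but *not* D
    "D": {"A"},
}

# A valid topological order of this DAG, highest in the hierarchy first.
order = ["C", "B", "D", "A"]

# C outranks D in the ordering...
assert order.index("C") < order.index("D")
# ...but the edge C -> D is absent, so a flat priority list would wrongly
# imply that C can consume D directly.
assert "D" not in handles["C"]
```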
I'm happy to relax my concern on operations where many operands may want to take charge, like `x + y`. I think that other operations should be more permissive, though.
Would you be able to clarify this a bit further? Unless I'm not realizing the full extent of what `elemwise` covers…
I don't think we end up doing computation directly on the original array type with `elemwise`. As for …
Ah, got it. Although, for both prototypical upcast types (Pint Quantities and xarray DataArrays), … So is there a better place for such a type check on the chunks on creation? Or would it make more sense to instead just write some tests that upcast types cannot mistakenly become chunks of a dask array, and see where that leads?
That makes sense if the user calls …
The `NotImplemented` logic could be handled inside `__mul__` or `__array_ufunc__`. `elemwise` itself never should return `NotImplemented`; it doesn't need to worry about dispatch.
Makes sense now, thanks for clarifying! So, I'll plan on removing that check in `elemwise` and replacing it with a decorator for all the dunder/operator methods that does the same check as `__array_ufunc__`.
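A sketch of what such a decorator could look like (the helper name and the hard-coded `HANDLED_TYPES` tuple are illustrative; the real implementation would consult the chunk-type registry instead):

```python
from functools import wraps
import numbers

import numpy as np

# Illustrative stand-in for "types Dask knows how to handle".
HANDLED_TYPES = (np.ndarray, numbers.Number)


def check_if_handled_given_other(f):
    """Guard a binary dunder method: if the other operand isn't a handled
    type, return NotImplemented so Python falls back to the reflected
    method on the other operand (e.g. Quantity.__rmul__)."""
    @wraps(f)
    def wrapper(self, other):
        if isinstance(other, (type(self), *HANDLED_TYPES)):
            return f(self, other)
        return NotImplemented
    return wrapper


class ToyArray:
    def __init__(self, x):
        self.x = np.asarray(x)

    @check_if_handled_given_other
    def __mul__(self, other):
        other = other.x if isinstance(other, ToyArray) else other
        return ToyArray(self.x * other)
```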
Mostly I'm saying that I want Dask to be as permissive as possible, except in cases where the behavior should definitely be undefined, as in `x + y`.
Yes, but hopefully you're not putting a Pint array inside a Dask array inside a Pint array. This isn't a recommended pattern, but I don't think we need to raise an error since it can't happen accidentally.
Sounds good. Implementing the type check on the dunder methods themselves rather than in `elemwise` should hopefully take care of that.

Sounds good as well. I guess I need to stop worrying so much about guarding for user mistakes. With all that said, next up I'll go ahead with those changes and high-level tests for the type casting hierarchy working properly, and then check in again for a review. I do want to hold off on low-level tests for now until it is clearly decided which option (the current opt-in registry or one of @shoyer's 4 alternatives) to settle on.
FWIW I think that either of these are entirely reasonable actions: …
Unfortunately, that indeed works right now and it is part of the problem I was trying to fix in this PR. Take for example the following code snippet:

```python
from pint import UnitRegistry
import dask.array as da

ureg = UnitRegistry()

l = da.array([0, 1, 2] * ureg.m)
t = 1 * ureg.s
s = (1 / t) * l
print(s.units)
```

which currently outputs `1 / second` when the correct unit is `meter / second`. In order for intermediate operations with Quantities to work out in physical calculations, we need the units to be managed consistently, which can break entirely with this kind of Pint > Dask > Pint construction. I'm totally fine with not raising an error in Dask on this as @shoyer said, but it definitely needs to be discouraged when working with Pint (which is how I hope we can manage it in MetPy).
Right, I agree that for operations where both sides want to assume control we need to be careful. In case you all were interested in the Dask(Pint) approach, this example would work well-ish in Dask today if Pint didn't try to consume Dask.

```python
In [1]: from pint import UnitRegistry
   ...: import dask.array as da
   ...: ureg = UnitRegistry()
   ...: l = da.array([0, 1, 2] * ureg.m)
   ...: t = 1 * ureg.s
   ...: s = (1 / t) * l
   ...: s2 = l / t

In [2]: s.units
Out[2]: <Unit('1 / second')>

In [3]: s2._meta.units
Out[3]: <Unit('meter / second')>
```

But there are probably lots of cases where this wouldn't work as cleanly, and regardless, Dask would need to learn to pass certain getattr calls through to `_meta`.
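For illustration only, the kind of passthrough being suggested might look like this (purely hypothetical; not something Dask actually does):

```python
class MetaForwardingArray:
    """Toy sketch: forward unknown attribute lookups to the `_meta` object
    that records the chunk type, so `arr.units` would resolve to
    `arr._meta.units` for Quantity-backed chunks."""

    def __init__(self, meta):
        self._meta = meta

    def __getattr__(self, name):
        # __getattr__ is only invoked when normal attribute lookup fails.
        return getattr(self._meta, name)
```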
Indeed, this all goes back to the point that either Dask(Pint) or Pint(Dask) would technically work just fine, but we have to choose one and only one for everything to work out cleanly.
I've just rebased to latest master, and the CI is all green, so unless there are any final comments, this should be ready to merge! |
hameerabbasi left a comment: One pedantic comment to preserve the type of a list/tuple. Otherwise this looks great to me!
Thank you for your work here @jthielen. I apologize for the long delay. Given the positive reviews by the other folks here I'm going to go ahead and merge. Also, I notice that this is your first code contribution to this repository. Welcome! It's quite an initial PR :)
Glad to be able to help! Thank you everyone for the great feedback along the way! |
```python
# First try to find a matching function name. If that doesn't work, we may
# First, verify that all types are handled by Dask. Otherwise, return NotImplemented.
if not all(type is Array or is_valid_chunk_type(type) for type in types):
```
@jthielen what was your reasoning for `type is Array` versus `issubclass(type, Array)`?

This breaks subclassing `da.Array` (or at least `__array_function__` dispatch no longer works on subclasses). I don't want to `register_chunk_type` on my `da.Array` subclass, since it's not actually a valid chunk type.

I'm happy to open a PR switching to `issubclass` if you think that'll work, but I haven't read this whole PR carefully enough to follow all the nuances.
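The suggested change would presumably look something like this (a sketch of the proposal, not the actual follow-up diff, which also defers to subclasses that override `__array_function__`):

```python
# Inside Array.__array_function__: accept Array *and its subclasses* in
# dispatch, while still consulting the chunk-type registry for other types.
if not all(issubclass(t, Array) or is_valid_chunk_type(t) for t in types):
    return NotImplemented
```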
Follow-up commit referencing this PR:
- Support `__array_function__` dispatching on `Array` subclasses (#6393 (comment))
- Defer to subclasses, but don't hardcode `Array`
- Accept superclasses in NEP-13/17 dispatching

Adds a basic list-based registry of valid chunk types for use within `dask.array.Array`, with the ability to add otherwise unknown types with `dask.array.register_chunk_type`. When a type not in this list or defined/handled by Dask is encountered, defer to it instead. This means raising `TypeError` on creation/wrapping and returning `NotImplemented` in `elemwise`, `__array_function__`, and `__array_ufunc__`. With the changes of this PR, `dask.array.Array` will be better able to respect a well-defined type casting hierarchy and will by default follow the generally accepted hierarchy among the common PyData/SciPy packages that I'm aware of. It is very much a breaking change for any array types not known by default to Dask, but these otherwise unknown types can be easily added with `dask.array.register_chunk_type`. It fixes bugs with non-commutative operations with Pint Quantities and xarray DataArrays.

Right now, this is a draft PR since I have not added tests for the new registry functions or for `dask.array.Array` respecting the type casting hierarchy. If at all possible, I'd like to get feedback/approval on this general approach before spending a bunch of time on the tests of it, in case it needs to be redone in a different fashion.

- Passes `black dask` / `flake8 dask`

Tagging @pentschev, @shoyer, @hameerabbasi, and @mrocklin based on input in #4583. I would greatly appreciate any of your feedback on this!
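For example, opting in a third-party type might look like this (`MyDuckArray` is a made-up stand-in; `dask.array.register_chunk_type` is the new public function this PR adds):

```python
import numpy as np
import dask.array as da

class MyDuckArray(np.ndarray):
    """Made-up duck array standing in for a third-party chunk type."""

# Without this call, the type is unknown to Dask and gets deferred to
# (or rejected on creation/wrapping); with it, it is a valid chunk type.
da.register_chunk_type(MyDuckArray)

darr = da.from_array(np.arange(6).view(MyDuckArray), chunks=3)
result = (darr + 1).compute()
```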
Closes #4583