Conversation

@jthielen
Contributor

jthielen commented Jul 10, 2020

Adds a basic list-based registry of valid chunk types for use within dask.array.Array, with the ability to add otherwise unknown types via dask.array.register_chunk_type. When a type that is neither in this list nor defined/handled by Dask is encountered, Dask defers to it instead: raising TypeError on creation/wrapping, and returning NotImplemented in elemwise, __array_function__, and __array_ufunc__. With the changes in this PR, dask.array.Array will better respect a well-defined type casting hierarchy, by default following the generally accepted hierarchy among the common PyData/SciPy packages that I'm aware of. This is very much a breaking change for any array types not known to Dask by default, but such types can easily be added with dask.array.register_chunk_type. It fixes bugs with non-commutative operations involving Pint Quantities and xarray DataArrays.
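
For illustration, here is a minimal sketch of how the new registration hook is meant to be used. The FlaggedArray class is a hypothetical duck array invented for this example, not part of the PR:

import numpy as np
import dask.array as da

class FlaggedArray:
    # Hypothetical NumPy-compatible duck array, defined only for this example
    def __init__(self, data, flag=False):
        self.data = np.asarray(data)
        self.flag = flag

    # Duck-array attributes Dask inspects on chunk types
    @property
    def shape(self):
        return self.data.shape

    @property
    def dtype(self):
        return self.data.dtype

    @property
    def ndim(self):
        return self.data.ndim

# Opt in: without this registration, wrapping FlaggedArray chunks in a
# dask.array.Array raises TypeError under this PR
da.register_chunk_type(FlaggedArray)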

Right now, this is a draft PR since I have not added tests for the new registry functions or for dask.array.Array respecting the type casting hierarchy. If at all possible, I'd like to get feedback/approval on this general approach before spending a bunch of time on the tests of it, in case it needs to be redone in a different fashion.

  • Tests added
  • Tests passed
  • Passes black dask / flake8 dask

Tagging @pentschev, @shoyer, @hameerabbasi, and @mrocklin based on input in #4583. I would greatly appreciate any of your feedback on this!

Closes #4583

@mrocklin
Member

Thank you for writing this up @jthielen .

I'm somewhat against an explicit whitelist of types by default. If some small third-party library makes a numpy-compatible array library I don't want them to have to explicitly talk to us or import dask in order to have things work. Ideally with protocols we're able to avoid pairwise library interactions like this.

I may be missing subtle points here though.

@jthielen
Contributor Author

jthielen commented Jul 10, 2020

I'm somewhat against an explicit whitelist of types by default. If some small third-party library makes a numpy-compatible array library I don't want them to have to explicitly talk to us or import dask in order to have things work. Ideally with protocols we're able to avoid pairwise library interactions like this.

Thank you for the feedback! I took the list-of-downcast-types approach based on @shoyer's direction in #4583 (in contrast to the list-of-upcast-types approach we took in Pint in hgrecco/pint#959). For a library like Dask that is generally in the middle of the type casting hierarchy, I would think either of those two approaches would be good, and I would have no problem swapping over to a registry of upcast types (e.g., xarray Variable/DataArray/Dataset, Pint Quantity) instead.

Also, please anyone correct me if I am wrong in this, but part of the reason I think those are the two options to decide between is that pairwise library interactions are unavoidable. The type casting hierarchy as defined in NEP 13 is a directed acyclic graph (and not a strictly ordered list of array priorities), so pairwise casting relationships are inherent to its structure and use. Perhaps a protocol establishing these pairwise relationships could be determined, but I'm not sure that gives any benefits over upcast/downcast type registries. Plus, given NEPs 13 and 18 and the work done across all the duck array libraries, we are very close to having structures like xarray > Pint > Dask > Sparse (among other combinations) working properly, and I'd hate to wait for a new protocol to be approved when it seems like we already have almost everything that we need once #4583 and pydata/xarray#4208 are resolved.

But, I must admit that in all this duck array work, I've been hyper-focused on the type casting hierarchy, so I very well may be missing the forest for the trees here.

@shoyer
Member

shoyer commented Jul 10, 2020

I think we need something like this to make the duck array ecosystem work. There’s no a priori way to know if dask arrays should handle pint or the other way around.

Some alternatives come to mind:

  1. Put this logic (the DAG of which array type handles which others) into a third-party library, maybe eventually even into NumPy itself? This might make it easier to enforce invariants, like ensuring that type handling is indeed a DAG.

  2. Use a protocol to enforce a global ordering between array types, like NumPy’s deprecated use of __array_priority__. This sounds fine in theory (any DAG can be topologically sorted, after all) but doesn’t scale — every library has to coordinate on what priorities they pick, and in practice that just doesn’t seem to happen.

  3. Make registration opt-out instead of opt-in. But this means the default behavior would be undefined behavior rather than an error.

  4. Use some sort of protocol attribute / method for recognizing valid (or invalid) chunk array types. This is a bit less flexible than explicit registration, though, because it needs to be implemented on the Array object (at least for C extensions and objects that use __slots__). I think it's more convenient to let anyone register chunk types.

On the whole, I think either this PR or some version of (1) would be the best option in the long term, but alternative (1) would be a lot more work and would be slow to roll out.

@jthielen
Contributor Author

jthielen commented Jul 10, 2020

@shoyer It looks like we doubled up on our comments! I definitely agree with your points (and your note that option 3 gives undefined behavior rather than an error is a particularly good one; I neglected to consider that in my "either/or" point above).

@hameerabbasi
Contributor

hameerabbasi commented Jul 10, 2020

I haven't thought about this problem too deeply, but here are my two cents. Duck-arrays can be divided broadly into three categories, based on their interface:

  1. The first is those that provide a (nearly) ndarray interface (only one of these at a time):
    • CuPy
    • PyData/Sparse
    • JAX Array
    • (am I missing any?)
  2. The second are those that purposefully diverge from the ndarray interface, but still follow it quite closely. These usually provide extra functionality and wrap other arrays (AFAICT, only one of these can be in a hierarchy at a time unless it forwards __getattr__ or some similar magic):
    • pint Quantity
    • udiff
  3. Those that do not follow the ndarray interface at all, just wrap other arrays:
    • xarray Dataset/DataArray/Variable
    • Pandas (in the future, maybe?)

To my (admittedly limited) knowledge, any DAG should work so long as it topologically sorts things into the order given above. The question is where Dask fits in... Does it go into category 1 or 2? Well, it's a bit special: it shouldn't wrap things from 2, only from 1. So I'd put it at 1.5.

I do agree with @shoyer that these libraries rarely coordinate on this front, but I think there's a reason for that: there's no good guidance on how to choose __array_priority__ in the first place, other than going out and doing some research. Maybe the above could be added as guidance, perhaps multiplying the category numbers by 100?

Going even further, I think one can (for full functionality) pick something from category 1, multiple things from category 2, and one thing from category 3. So it's really only the libraries in category 2 that have to do some extra work; all others can just reject/accept as appropriate.

@shoyer
Member

shoyer commented Jul 10, 2020

To my (admittedly limited) knowledge, any DAG should work so long as it topologically sorts things into the order given above. The question is where Dask fits in... Does it go into category 1 or 2? Well, it's a bit special: it shouldn't wrap things from 2, only from 1. So I'd put it at 1.5.

I don't think Dask is so special here. Nearly every duck array adds at least something on top of NumPy -- otherwise it wouldn't have a good reason for existing! E.g., even with pydata/sparse, you can pull out attributes indicating the underlying representation of a sparse array, which is occasionally quite useful. So most of your examples might actually fall in between 1 and 2.

We might attempt to do a (near) complete census of hypothetical duck array types, and could perhaps come up with something like the following order:

  1. CPU/GPU arrays
  2. New dtype arrays (e.g., for bfloat16 or units or...)
  3. Sparse arrays
  4. Distributed arrays
  5. Autodiff arrays (hypothetical duck array version of autograd)
  6. Masked arrays (hypothetical duck array version of np.ma)
  7. Labeled arrays (with compatible semantics to NumPy, like PyTorch)

Hopefully the issues with this sort of thing are self-evident:

  • It's almost assuredly still incomplete. Someone could come along with a new duck array type like awkward array (for ragged arrays) and we'd have to figure out how to slot it in.
  • Some duck array types fall into multiple categories. For example, JAX currently does (1), (2), (4) and (5).
  • The right ordering may depend on details of how array types are implemented, not just their category. For example, is a new dtype array implemented as a duck array on top of existing dtypes, or in terms of low-level operations specialized for particular hardware?
  • Sometimes the right ordering is rather arbitrary (e.g., for pint vs dask). There are arguments for doing it both ways. The most important thing is that it's picked consistently.

@jthielen
Contributor Author

Also, to follow up on @shoyer's points, another problem with trying to base casting rules on an ordered/tiered list (rather than the full information of the DAG) is the implicit assumption that an array at any given level has to support wrapping any of those below it, which may not necessarily be the case. Take the example hierarchy from NEP 13:

[DAG diagram from NEP 13 not reproduced here; in it, B handles interactions with both C and D, while C and D have no relationship to each other.]

In any topological ordering of that DAG, C will be above D, but C cannot handle interactions with D (without B wrapping D). So, unless we force the DAG to be dense by requiring support for all arrays lower on the ordered/tiered list, the pairwise relationships between types (the specific edges of the DAG) are required.
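
A self-contained toy illustration of that situation, using Python's binary-operator protocol (B, C, and D are stand-ins invented for this example, not real array types):

class D:
    def __add__(self, other):
        if isinstance(other, D):
            return "handled by D"
        return NotImplemented

class C:
    def __add__(self, other):
        # C knows nothing about D, so it defers to the other operand
        if isinstance(other, C):
            return "handled by C"
        return NotImplemented
    __radd__ = __add__

class B:
    # B sits above both C and D in the DAG and can handle either
    def __add__(self, other):
        if isinstance(other, (B, C, D)):
            return "handled by B"
        return NotImplemented
    __radd__ = __add__

print(B() + C())  # handled by B
print(C() + B())  # handled by B: C defers, so B.__radd__ takes over
# C() + D() raises TypeError: neither type knows the other, even though
# C sorts above D in any topological ordering of the DAG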

@mrocklin
Member

I'm happy to relax my concern on operations where many operands may want to take charge, like __array_function__ and my_dask_array + my_xarray.

I think that other operations like from_array or elemwise should work regardless of the input types. The user is asking to turn that thing into a dask array. Also, it's going to be near-impossible to predict every array storage type.

@jthielen
Contributor Author

jthielen commented Jul 10, 2020

I'm happy to relax my concern on operations where many operands may want to take charge, like __array_function__ and my_dask_array + my_xarray.

I think that other operations like from_array or elemwise should work regardless of the input types. The user is asking to turn that thing into a dask array. Also, it's going to be near-impossible to predict every array storage type.

Would you be able to clarify this a bit further? Unless I'm not realizing the full extent of what elemwise does, I understood it as the mechanism by which all the operator methods (__add__, __lt__, etc.) are defined, so those all need to be able to defer properly to upcast types in the same way that __array_function__ and __array_ufunc__ do. Likewise, I think construction of a dask array needs to prohibit upcast types inside the dask array; otherwise you could easily end up with nasty constructs like Pint wrapping Dask wrapping Pint if a user or downstream library isn't careful. With that in mind, if we wanted those operations to work more flexibly with unknown input types, then we are in essence back to @shoyer's option 3 of opt-out. Does it make sense for some operations to be opt-in with the registry of this PR (or a duck array DAG package, like @shoyer's option 1) and others to be opt-out?

@shoyer
Member

shoyer commented Jul 10, 2020

I think that other operations like from_array or elemwise should work regardless of the input types. The user is asking to turn that thing into a dask array. Also, it's going to be near-impossible to predict every array storage type.

Would you be able to clarify this a bit further? Unless I'm not realizing the full extent of what elemwise does, I understood it as the mechanism by which all the operator methods (__add__, __lt__, etc.) are defined, so those all need to be able to defer properly to upcast types in the same way that __array_function__ and __array_ufunc__ do.

I don't think we end up doing computation directly on the original array type with from_array, e.g., on zarr arrays. Slices get inserted, so arithmetic is actually done on chunk_array[:], which will typically be a NumPy array.

As for elemwise, yes, dispatch will need to happen inside of each chunk -- but that should happen automatically. We already know the result should be a dask array, since the user called dask.elemwise.

@jthielen
Contributor Author

jthielen commented Jul 10, 2020

I don't think we end up doing computation directly on the original array type with from_array, e.g., on zarr arrays. Slices get inserted, so arithmetic is actually done on chunk_array[:], which will typically be a NumPy array.

Ah, got it. Although, for both prototypical upcast types (Pint Quantities and xarray DataArrays), chunk_array[:] is still just (a copy of) chunk_array, which we don't want inside the dask array either.
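
As a quick check of that claim, a minimal snippet (assuming pint is installed):

import pint

ureg = pint.UnitRegistry()
q = [0, 1, 2] * ureg.m

# Slicing does not unwrap the Quantity: q[:] is still a pint Quantity,
# not the NumPy array underneath
print(isinstance(q[:], ureg.Quantity))  # True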

So is there a better place for such a type check on the chunks on creation? Or would it make more sense to instead just write some tests that upcast types cannot mistakenly become chunks of a dask array, and see where that leads?

As for elemwise, yes, dispatch will need to happen inside of each chunk -- but that should happen automatically. We already know the result should be a dask array since user called dask.elemwise.

That makes sense if the user calls elemwise directly (knowing that we're already within the realm of Dask-handled arrays, which excludes upcast types). However, again unless I'm missing something big, I don't see where elemwise is used in this way; I can't find it anywhere in the docs. What looks to be more common is elemwise being reached through the dunder/operator methods, like my_dask_array * my_pint_unit. Here my_dask_array needs to return NotImplemented in elemwise (which is all __mul__ is calling) before delegating to blockwise application of operator.mul, so that my_pint_unit properly handles the operation with its __rmul__ and constructs a Quantity wrapping my_dask_array.

@shoyer
Member

shoyer commented Jul 10, 2020

Here my_dask_array needs to return NotImplemented in elemwise (which is all __mul__ is calling) before delegating to blockwise application of operator.mul, so that my_pint_unit properly handles the operation with its __rmul__ and constructs a Quantity wrapping my_dask_array.

The NotImplemented logic could be handled inside __mul__ or __array_ufunc__. elemwise itself should never return NotImplemented; it doesn't need to worry about dispatch.

@jthielen
Contributor Author

The NotImplemented logic could be handled inside __mul__ or __array_ufunc__. elemwise itself should never return NotImplemented; it doesn't need to worry about dispatch.

Makes sense now, thanks for clarifying! So, I'll plan on removing that check in elemwise and replacing it with a decorator for all the dunder/operator methods that does the same check as __array_ufunc__.
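
A minimal sketch of what such a decorator might look like; the helper is_registered_chunk_type and the exact set of handled types here are placeholders for this discussion, not the final implementation:

from functools import wraps
from numbers import Number

import numpy as np

def is_registered_chunk_type(x):
    # Placeholder for the registry lookup this PR adds
    return isinstance(x, np.ndarray)

def check_if_handled_given_other(f):
    # Return NotImplemented when the other operand is an unknown type, so
    # that its reflected dunder method (e.g. pint's __rmul__) gets the
    # chance to handle the operation instead
    @wraps(f)
    def wrapper(self, other):
        if isinstance(other, (type(self), Number)) or is_registered_chunk_type(other):
            return f(self, other)
        return NotImplemented
    return wrapper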

@mrocklin
Member

mrocklin commented Jul 10, 2020 via email

Mostly I'm saying that I want Dask to be as permissive as possible, except in cases where the behavior should definitely be undefined, as in x + y.

@shoyer
Member

shoyer commented Jul 10, 2020

Ah, got it. Although, for both prototypical upcast types (Pint Quantities and xarray DataArrays), chunk_array[:] is still just (a copy of) chunk_array, which we don't want inside the dask array either.

Yes, but hopefully you're not putting a pint.Quantity or xarray.DataArray as input into dask.array.from_array.

This isn't a recommended pattern, but I don't think we need to raise an error, since it can't happen accidentally.

@jthielen
Contributor Author

Mostly I'm saying that I want Dask to be as permissive as possible, except in cases where the behavior should definitely be undefined, as in x + y.

Sounds good. Implementing the type check on the dunder methods themselves rather than in elemwise should hopefully take care of that.

Yes, but hopefully you're not putting a pint.Quantity or xarray.DataArray as input into dask.array.from_array.

This isn't a recommended pattern, but I don't think we need to raise an error, since it can't happen accidentally.

Sounds good as well. I guess I need to stop worrying so much about guarding for user mistakes.

With all that said, next up I'll go ahead with those changes and high-level tests for the type casting hierarchy working properly, and then check in again for a review. I do want to hold off on low-level tests for now until it is clearly decided which option (the current opt-in registry or one of @shoyer's 4 alternatives) to settle on.

@mrocklin
Member

Yes, but hopefully you're not putting a pint.Quantity or xarray.DataArray as input into dask.array.from_array.

FWIW I think that either of these is an entirely reasonable action, pint.Quantity especially, which I would hope would mostly work today.

@jthielen
Contributor Author

jthielen commented Jul 10, 2020

Yes, but hopefully you're not putting a pint.Quantity or xarray.DataArray as input into dask.array.from_array.

FWIW I think that either of these is an entirely reasonable action, pint.Quantity especially, which I would hope would mostly work today.

Unfortunately, that indeed works right now, and it is part of the problem I was trying to fix in this PR. Take, for example, the following code snippet:

from pint import UnitRegistry
import dask.array as da
ureg = UnitRegistry()
l = da.array([0, 1, 2] * ureg.m)
t = 1 * ureg.s
s = (1 / t) * l
print(s.units)

which currently outputs

1 / second

when the correct unit is

meter / second

In order for intermediate operations with Quantities to work out in physical calculations, we need the units to be managed consistently, which can break entirely with this kind of Pint > Dask > Pint construction. I'm totally fine with not raising an error in Dask on this as @shoyer said, but it definitely needs to be discouraged when working with Pint (which is how I hope we can manage it in MetPy).

@mrocklin
Member

Right, I agree that for operations where both sides want to assume control we need to be careful.

In case you all were interested in the Dask(Pint) approach, this example would work well-ish in Dask today if Pint didn't try to consume Dask.

In [1]: from pint import UnitRegistry
   ...: import dask.array as da
   ...: ureg = UnitRegistry()
   ...: l = da.array([0, 1, 2] * ureg.m)
   ...: t = 1 * ureg.s
   ...: s = (1 / t) * l
   ...: s2 = l / t

In [2]: s.units
Out[2]: <Unit('1 / second')>

In [3]: s2._meta.units
Out[3]: <Unit('meter / second')>

But there are probably lots of cases where this wouldn't work as cleanly, and regardless, Dask would need to learn to pass certain getattr calls through to _meta.

@jthielen
Contributor Author

Right, I agree that for operations where both sides want to assume control we need to be careful.

In case you all were interested in the Dask(Pint) approach, this example would work well-ish in Dask today if Pint didn't try to consume Dask.
...
But there are probably lots of cases where this wouldn't work as cleanly, and regardless, Dask would need to learn to pass certain getattr calls through to _meta.

Indeed, this all goes back to the point that either Dask(Pint) or Pint(Dask) would technically work just fine, but we have to choose one and only one for everything to work out cleanly.

jthielen force-pushed the type-casting-hierarchy branch 2 times, most recently from 4bb9241 to 26ca741, July 11, 2020
jthielen force-pushed the type-casting-hierarchy branch 5 times, most recently from a174cfe to ce09852, July 13, 2020
jthielen force-pushed the type-casting-hierarchy branch from 8f8354b to 9aad0ab, August 24, 2020
@jthielen
Contributor Author

I've just rebased to latest master, and the CI is all green, so unless there are any final comments, this should be ready to merge!

@hameerabbasi
Contributor

hameerabbasi left a comment


One pedantic comment to preserve the type of a list/tuple. Otherwise this looks great to me!

@mrocklin
Member

mrocklin commented Sep 2, 2020

Thank you for your work here @jthielen. I apologize for the long delay. Given the positive reviews by the other folks here, I'm going to go ahead and merge.

Also, I notice that this is your first code contribution to this repository. Welcome! It's quite an initial PR :)

mrocklin merged commit 0ca1607 into dask:master Sep 2, 2020
@jthielen
Contributor Author

jthielen commented Sep 2, 2020

Glad to be able to help! Thank you everyone for the great feedback along the way!


# First try to find a matching function name. If that doesn't work, we may
# First, verify that all types are handled by Dask. Otherwise, return NotImplemented.
if not all(type is Array or is_valid_chunk_type(type) for type in types):
Collaborator


@jthielen what was your reasoning for type is Array versus issubclass(type, Array)?

This breaks subclassing da.Array—or at least __array_function__ dispatch no longer works on subclasses. I don't want to register_chunk_type on my da.Array subclass, since it's not actually a valid chunk type.

I'm happy to open a PR switching to issubclass if you think that'll work, but I haven't read this whole PR carefully enough to follow all the nuances.
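
For reference, a sketch of the suggested direction against the line quoted above (the reviewer's proposal, not the exact patch that was eventually merged):

# Behavior from this PR: subclasses of Array fail the check
if not all(type is Array or is_valid_chunk_type(type) for type in types):
    return NotImplemented

# Suggested: accept Array subclasses as well as registered chunk types
if not all(issubclass(type, Array) or is_valid_chunk_type(type) for type in types):
    return NotImplemented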

gjoseph92 added a commit to gjoseph92/dask that referenced this pull request Oct 1, 2020
kumarprabhu1988 pushed a commit to kumarprabhu1988/dask that referenced this pull request Oct 29, 2020
…es (dask#6393)

Adds a basic list-based registry of valid chunk types for use within
dask.array.Array, with the ability to add otherwise unknown types with
dask.array.register_chunk_type. When a type *not* in this list or
defined/handled by Dask is encountered in dunder operations or NumPy
functions/ufuncs, return NotImplemented.
ncclementi pushed a commit that referenced this pull request Aug 25, 2022
* Support __array_function__ dispatching on Array subclasses

#6393 (comment)

* defer to subclasses, but don't hardcode Array

* Accept superclasses in NEP-13/17 dispatching