-
-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Add __array_ufunc__ protocol to dask.array #2438
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Now also supporting the In [1]: import dask.array as da
In [2]: import numpy as np
In [3]: x = da.ones((15, 15), chunks=(5, 5))
In [4]: np.add(x, 100, out=x)
In [5]: empty = da.zeros((15,), chunks=(5,))
In [6]: np.sum(x, axis=0, out=empty)
In [7]: empty.visualize('dask.png')
Out[7]: <IPython.core.display.Image object>
In [8]: empty.compute()
Out[8]:
array([ 1515., 1515., 1515., 1515., 1515., 1515., 1515., 1515.,
1515., 1515., 1515., 1515., 1515., 1515., 1515.]) |
shoyer
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, you probably realize this, but just to note that __array_ufunc__ is only used for NumPy ufunc method. It's not used for reductions like sum or cumsum, though those use an ad-hoc dispatch system based on calling methods.
dask/array/core.py
Outdated
| from . import ufunc | ||
| da_ufunc = getattr(ufunc, numpy_ufunc.__name__) | ||
| if method == '__call__': | ||
| return da_ufunc(*inputs, **kwargs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest writing just elemwise(numpy_ufunc, *inputs, **kwargs), once you've verified numpy_ufunc.signature is None (which means that isn't a gufunc). This would allow you to handle arbitrary ufuncs without dask wrappers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've done this in most cases except those where the ufunc returns multiple outputs. In that case I'm reverting back to calling the dask.array version.
dask/array/core.py
Outdated
| return da_ufunc.outer(*inputs, **kwargs) | ||
| else: | ||
| from ..base import compute | ||
| inputs2 = compute(*inputs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this is a good idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would you recommend? I've changed this to returning NotImplemented for now.
dask/array/core.py
Outdated
| def __array_ufunc__(self, numpy_ufunc, method, *inputs, **kwargs): | ||
| import pdb; pdb.set_trace() | ||
| from . import ufunc | ||
| da_ufunc = getattr(ufunc, numpy_ufunc.__name__) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need to return NotImplemented here for unrecognized types, e.g., something along these lines. Otherwise, dask will take precedence over everything else, even libraries like xarray.
(There are also some subtleties around output types and the out argument that I would recommend copying from that example, though possibly it would make sense to always return NotImplemented if out is set.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're now returning NotImplemented when we don't know how to handle the inputs.
TomAugspurger
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great. This will need a pretty big warning for users who (maybe unknowingly via a library) relied on numpy ufuncs evaluating dask arrays eagerly.
| def __array_ufunc__(self, numpy_ufunc, method, *inputs, **kwargs): | ||
| out = kwargs.get('out', ()) | ||
| for x in inputs + out: | ||
| if not isinstance(x, (np.ndarray, Number, Array)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the desired behavior when interacting with, say xarray.DataArrays? I think let them decide how to handle the inputs (so NotImplemented is correct here).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's exactly why we need this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add the sparse array type that is allowed in chunks here?
| return handle_out(out, result) | ||
|
|
||
|
|
||
| def handle_out(out, result): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This maybe could be implemented as a decorator, when functions have multiple return points (though I see you changed the one case where that happened). If elemwise was the only one that needed to be changed, then a regular function seems preferable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm somewhat cautious about using decorators for internal things like this. I think that we're capable of tracking down the few corner cases here and I would be concerned about reducing the explicit-ness of the code. This is pretty subjective though and just a personal preference.
Perhaps we can squeeze a FutureWarning about this behavior into the upcoming release. |
| def __array_ufunc__(self, numpy_ufunc, method, *inputs, **kwargs): | ||
| out = kwargs.get('out', ()) | ||
| for x in inputs + out: | ||
| if not isinstance(x, (np.ndarray, Number, Array)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's exactly why we need this.
dask/array/core.py
Outdated
| if method == '__call__': | ||
| if numpy_ufunc.signature is not None: | ||
| return NotImplemented | ||
| if numpy_ufunc.__name__ in ('frexp', 'modf'): # multi-output |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should check numpy_ufunc.nout, so you can at least raise for other ufuncs with multiple outputs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, that's much cleaner
dask/array/core.py
Outdated
| return elemwise(numpy_ufunc, *inputs, **kwargs) | ||
| elif method == 'outer': | ||
| from . import ufunc | ||
| da_ufunc = getattr(ufunc, numpy_ufunc.__name__) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Slightly safer:
try:
da_ufunc = getattr(ufunc, numpy_ufunc.__name)
except AttributeError:
return NotImplemented
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| def __array_ufunc__(self, numpy_ufunc, method, *inputs, **kwargs): | ||
| out = kwargs.get('out', ()) | ||
| for x in inputs + out: | ||
| if not isinstance(x, (np.ndarray, Number, Array)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add the sparse array type that is allowed in chunks here?
OK, this might be hard. The only thing that gets triggered on our end is the |
|
If we can't meaningfully provide warnings then do we want this change in the upcoming release? |
Correct me if I'm wrong, but could the next release include a simple definition for |
|
That sounds good. I was trying to solve this for older versions of numpy, but that's probably not the right way to think about this. |
|
I think the way to provide warnings would be to implement a version of But I'm not entirely sure this is worth the trouble -- possibly better to save this for a breaking release. |
|
We could easily make the next release such a breaking release. I'm fine either way. I think that having a release with a warning is probably the mature thing to do. However I also don't know of anyone who intentionally uses numpy ufuncs on dask.arrays and there would be some community value in getting a project out that used the new NumPy 1.13.0 features. |
|
OK, I'm inclined to use this as an opportunity to break a couple of other things in dask/distributed and issue a larger release. |
|
Looks like we're no longer raising a NotImplementedError when using the matmul operator. I'm not quite sure why this is. def test_matmul():
x = np.random.random((5, 5))
y = np.random.random((5, 2))
a = from_array(x, chunks=(1, 5))
b = from_array(y, chunks=(5, 1))
assert_eq(operator.matmul(a, b), a.dot(b))
assert_eq(operator.matmul(a, b), operator.matmul(x, y))
assert_eq(operator.matmul(a, y), operator.matmul(x, b))
list_vec = list(range(1, 6))
assert_eq(operator.matmul(list_vec, b), operator.matmul(list_vec, y))
assert_eq(operator.matmul(x, list_vec), operator.matmul(a, list_vec))
z = np.random.random((5, 5, 5))
with pytest.raises(NotImplementedError):
operator.matmul(a, z)
c = from_array(z, chunks=(1, 5, 1))
with pytest.raises(NotImplementedError):
> operator.matmul(x, c)
E Failed: DID NOT RAISE <class 'NotImplementedError'> |
|
This looks like numpy/numpy#9028 Out of curiosity, what does the result of |
|
A numpy array. It doesn't appear to go through the |
Example
-------
```python
In [1]: import dask.array as da, numpy as np
In [2]: x = np.arange(24).reshape((4, 6))
In [3]: np.sum(np.sin(x), axis=0)
Out[3]:
array([-1.56697566, 2.06850183, 3.80220828, 2.04018197, -1.59757823,
-3.76653238])
In [4]: d = da.arange(24, chunks=(4,)).reshape((4, 6))
In [5]: np.sum(np.sin(d), axis=0)
Out[5]: dask.array<sum-aggregate, shape=(6,), dtype=float64, chunksize=(6,)>
In [6]: _.compute()
Out[6]:
array([-1.56697566, 2.06850183, 3.80220828, 2.04018197, -1.59757823,
-3.76653238])
```
NumPy 1.13.0 is not yet compatible with pyarrow or pandas 0.20.1
|
OK, in order to add numpy 1.13.0 to the testing matrix I had to move around a couple things because some packages, notably on conda-forge, aren't yet compatible with the new version. In particular I made the following changes:
These latter two aren't necessarily related, but seem like decent ideas regardless. |
|
This now passes tests. I'm inclined to merge and release. @TomAugspurger are you comfortable with the lack of a deprecation cycle here if we bump the version by |
|
Yep, going forward seems like the right thing. My initial request for a warning was a warning in the release notes.
… On Jun 10, 2017, at 10:21 AM, Matthew Rocklin ***@***.***> wrote:
This now passes tests. I'm inclined to merge and release. @TomAugspurger are you comfortable with the lack of a deprecation cycle here if we bump the version by 0.X.0?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.
|
|
I plan to merge this in a few hours if there are no further comments |
| conda install -q -c conda-forge \ | ||
| distributed \ | ||
| cloudpickle \ | ||
| graphviz \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the c library? Why this change?
Edit: nvm, just saw your comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that your question remains valid. I was getting png failures from graphviz. I suspect that the defaults channel build is less-than-optimal again. I didn't want to deal with it just now though so punted to use conda-forge for the time being. I'll report upstream after the current build-storm passes if the problem persists.

This implements the
__array_ufunc__protocol for dask.array. Generally we just dispatch to the already-existing functions. I did this in haste and without properly reviewing the NEP/PR so I would not be surprised if I've missed something.Example
cc @shoyer @njsmith