Skip to content

Conversation

@mrocklin
Copy link
Member

@mrocklin mrocklin commented Jun 7, 2017

This implements the __array_ufunc__ protocol for dask.array. Generally we just dispatch to the already-existing functions. I did this in haste and without properly reviewing the NEP/PR so I would not be surprised if I've missed something.

Example

In [1]: import dask.array as da, numpy as np

In [2]: x = np.arange(24).reshape((4, 6))

In [3]: np.sum(np.sin(x), axis=0)
Out[3]:
array([-1.56697566,  2.06850183,  3.80220828,  2.04018197, -1.59757823,
       -3.76653238])

In [4]: d = da.arange(24, chunks=(4,)).reshape((4, 6))

In [5]: np.sum(np.sin(d), axis=0)
Out[5]: dask.array<sum-aggregate, shape=(6,), dtype=float64, chunksize=(6,)>

In [6]: _.compute()
Out[6]:
array([-1.56697566,  2.06850183,  3.80220828,  2.04018197, -1.59757823,
       -3.76653238])

cc @shoyer @njsmith

@mrocklin
Copy link
Member Author

mrocklin commented Jun 7, 2017

Now also supporting the out= parameter. Interestingly this fails for cumulative reductions, which seem to be assuming numpy arrays.

In [1]: import dask.array as da

In [2]: import numpy as np

In [3]: x = da.ones((15, 15), chunks=(5, 5))

In [4]: np.add(x, 100, out=x)

In [5]: empty = da.zeros((15,), chunks=(5,))

In [6]: np.sum(x, axis=0, out=empty)

In [7]: empty.visualize('dask.png')
Out[7]: <IPython.core.display.Image object>

In [8]: empty.compute()
Out[8]: 
array([ 1515.,  1515.,  1515.,  1515.,  1515.,  1515.,  1515.,  1515.,
        1515.,  1515.,  1515.,  1515.,  1515.,  1515.,  1515.])

dask

Copy link
Member

@shoyer shoyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, you probably realize this, but just to note that __array_ufunc__ is only used for NumPy ufunc method. It's not used for reductions like sum or cumsum, though those use an ad-hoc dispatch system based on calling methods.

from . import ufunc
da_ufunc = getattr(ufunc, numpy_ufunc.__name__)
if method == '__call__':
return da_ufunc(*inputs, **kwargs)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest writing just elemwise(numpy_ufunc, *inputs, **kwargs), once you've verified numpy_ufunc.signature is None (which means that isn't a gufunc). This would allow you to handle arbitrary ufuncs without dask wrappers.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've done this in most cases except those where the ufunc returns multiple outputs. In that case I'm reverting back to calling the dask.array version.

return da_ufunc.outer(*inputs, **kwargs)
else:
from ..base import compute
inputs2 = compute(*inputs)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is a good idea.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would you recommend? I've changed this to returning NotImplemented for now.

def __array_ufunc__(self, numpy_ufunc, method, *inputs, **kwargs):
import pdb; pdb.set_trace()
from . import ufunc
da_ufunc = getattr(ufunc, numpy_ufunc.__name__)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to return NotImplemented here for unrecognized types, e.g., something along these lines. Otherwise, dask will take precedence over everything else, even libraries like xarray.

(There are also some subtleties around output types and the out argument that I would recommend copying from that example, though possibly it would make sense to always return NotImplemented if out is set.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're now returning NotImplemented when we don't know how to handle the inputs.

Copy link
Member

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great. This will need a pretty big warning for users who (maybe unknowingly via a library) relied on numpy ufuncs evaluating dask arrays eagerly.

def __array_ufunc__(self, numpy_ufunc, method, *inputs, **kwargs):
out = kwargs.get('out', ())
for x in inputs + out:
if not isinstance(x, (np.ndarray, Number, Array)):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the desired behavior when interacting with, say xarray.DataArrays? I think let them decide how to handle the inputs (so NotImplemented is correct here).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's exactly why we need this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add the sparse array type that is allowed in chunks here?

return handle_out(out, result)


def handle_out(out, result):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This maybe could be implemented as a decorator, when functions have multiple return points (though I see you changed the one case where that happened). If elemwise was the only one that needed to be changed, then a regular function seems preferable.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm somewhat cautious about using decorators for internal things like this. I think that we're capable of tracking down the few corner cases here and I would be concerned about reducing the explicit-ness of the code. This is pretty subjective though and just a personal preference.

@mrocklin
Copy link
Member Author

mrocklin commented Jun 8, 2017

This will need a pretty big warning for users who (maybe unknowingly via a library) relied on numpy ufuncs evaluating dask arrays eagerly

Perhaps we can squeeze a FutureWarning about this behavior into the upcoming release.

def __array_ufunc__(self, numpy_ufunc, method, *inputs, **kwargs):
out = kwargs.get('out', ())
for x in inputs + out:
if not isinstance(x, (np.ndarray, Number, Array)):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's exactly why we need this.

if method == '__call__':
if numpy_ufunc.signature is not None:
return NotImplemented
if numpy_ufunc.__name__ in ('frexp', 'modf'): # multi-output
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should check numpy_ufunc.nout, so you can at least raise for other ufuncs with multiple outputs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, that's much cleaner

return elemwise(numpy_ufunc, *inputs, **kwargs)
elif method == 'outer':
from . import ufunc
da_ufunc = getattr(ufunc, numpy_ufunc.__name__)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slightly safer:

try:
    da_ufunc = getattr(ufunc, numpy_ufunc.__name)
except AttributeError:
    return NotImplemented

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

def __array_ufunc__(self, numpy_ufunc, method, *inputs, **kwargs):
out = kwargs.get('out', ())
for x in inputs + out:
if not isinstance(x, (np.ndarray, Number, Array)):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add the sparse array type that is allowed in chunks here?

@mrocklin
Copy link
Member Author

mrocklin commented Jun 8, 2017

This will need a pretty big warning for users who (maybe unknowingly via a library) relied on numpy ufuncs evaluating dask arrays eagerly

Perhaps we can squeeze a FutureWarning about this behavior into the upcoming release.

OK, this might be hard. The only thing that gets triggered on our end is the __array__ call. I don't think that we should warn generally on this method. I don't know how to test for "we are being called by a numpy ufunc".

@mrocklin
Copy link
Member Author

mrocklin commented Jun 8, 2017

If we can't meaningfully provide warnings then do we want this change in the upcoming release?

@TomAugspurger
Copy link
Member

TomAugspurger commented Jun 8, 2017

OK, this might be hard. The only thing that gets triggered on our end is the array call.

Correct me if I'm wrong, but could the next release include a simple definition for __array_ufunc__ that just emits a FutureWarning and then return NotImplemented so that Numpy takes over again? Or would that have poor interaction with other inputs that implement __array_ufunc__?

@mrocklin
Copy link
Member Author

mrocklin commented Jun 8, 2017

That sounds good. I was trying to solve this for older versions of numpy, but that's probably not the right way to think about this.

@shoyer
Copy link
Member

shoyer commented Jun 8, 2017

I think the way to provide warnings would be to implement a version of __array_ufunc__ that calls compute on each dask array argument and then calls the ufunc again.

But I'm not entirely sure this is worth the trouble -- possibly better to save this for a breaking release.

@mrocklin
Copy link
Member Author

mrocklin commented Jun 8, 2017

We could easily make the next release such a breaking release. I'm fine either way. I think that having a release with a warning is probably the mature thing to do. However I also don't know of anyone who intentionally uses numpy ufuncs on dask.arrays and there would be some community value in getting a project out that used the new NumPy 1.13.0 features.

@mrocklin
Copy link
Member Author

mrocklin commented Jun 8, 2017

OK, I'm inclined to use this as an opportunity to break a couple of other things in dask/distributed and issue a larger release.

@mrocklin
Copy link
Member Author

mrocklin commented Jun 9, 2017

Looks like we're no longer raising a NotImplementedError when using the matmul operator. I'm not quite sure why this is.

    def test_matmul():
        x = np.random.random((5, 5))
        y = np.random.random((5, 2))
        a = from_array(x, chunks=(1, 5))
        b = from_array(y, chunks=(5, 1))
        assert_eq(operator.matmul(a, b), a.dot(b))
        assert_eq(operator.matmul(a, b), operator.matmul(x, y))
        assert_eq(operator.matmul(a, y), operator.matmul(x, b))
        list_vec = list(range(1, 6))
        assert_eq(operator.matmul(list_vec, b), operator.matmul(list_vec, y))
        assert_eq(operator.matmul(x, list_vec), operator.matmul(a, list_vec))
        z = np.random.random((5, 5, 5))
        with pytest.raises(NotImplementedError):
            operator.matmul(a, z)
        c = from_array(z, chunks=(1, 5, 1))
        with pytest.raises(NotImplementedError):
>           operator.matmul(x, c)
E           Failed: DID NOT RAISE <class 'NotImplementedError'>

@shoyer
Copy link
Member

shoyer commented Jun 9, 2017

This looks like numpy/numpy#9028

Out of curiosity, what does the result of operator.matmul(x, c) look like?

@mrocklin
Copy link
Member Author

mrocklin commented Jun 9, 2017

A numpy array. It doesn't appear to go through the __array_ufunc__ code path.

mrocklin added 10 commits June 8, 2017 22:10
Example
-------

```python
In [1]: import dask.array as da, numpy as np

In [2]: x = np.arange(24).reshape((4, 6))

In [3]: np.sum(np.sin(x), axis=0)
Out[3]:
array([-1.56697566,  2.06850183,  3.80220828,  2.04018197, -1.59757823,
       -3.76653238])

In [4]: d = da.arange(24, chunks=(4,)).reshape((4, 6))

In [5]: np.sum(np.sin(d), axis=0)
Out[5]: dask.array<sum-aggregate, shape=(6,), dtype=float64, chunksize=(6,)>

In [6]: _.compute()
Out[6]:
array([-1.56697566,  2.06850183,  3.80220828,  2.04018197, -1.59757823,
       -3.76653238])
```
@mrocklin
Copy link
Member Author

OK, in order to add numpy 1.13.0 to the testing matrix I had to move around a couple things because some packages, notably on conda-forge, aren't yet compatible with the new version. In particular I made the following changes:

  1. Installing PyArrow from PyPI
  2. Avoid the new version of pytest for now, pinning pytest<=3.1.1
  3. Install graphviz from conda-forge

These latter two aren't necessarily related, but seem like decent ideas regardless.

@mrocklin
Copy link
Member Author

This now passes tests. I'm inclined to merge and release. @TomAugspurger are you comfortable with the lack of a deprecation cycle here if we bump the version by 0.X.0?

@TomAugspurger
Copy link
Member

TomAugspurger commented Jun 10, 2017 via email

@mrocklin
Copy link
Member Author

I plan to merge this in a few hours if there are no further comments

conda install -q -c conda-forge \
distributed \
cloudpickle \
graphviz \
Copy link
Member

@jcrist jcrist Jun 10, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the c library? Why this change?

Edit: nvm, just saw your comment.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that your question remains valid. I was getting png failures from graphviz. I suspect that the defaults channel build is less-than-optimal again. I didn't want to deal with it just now though so punted to use conda-forge for the time being. I'll report upstream after the current build-storm passes if the problem persists.

@mrocklin mrocklin merged commit 72a5f51 into dask:master Jun 11, 2017
@mrocklin mrocklin deleted the array_ufunc branch June 11, 2017 13:08
@sinhrks sinhrks added this to the 0.15.0 milestone Aug 30, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants