
Conversation

@pentschev
Member

@pentschev pentschev commented Mar 8, 2019

Resolves #4563.

  • Tests added / passed
  • Passes flake8 dask

@mrocklin
Member

mrocklin commented Mar 8, 2019

cc @shoyer @hameerabbasi

@pentschev
Member Author

Ok, I'm not really sure why that's failing only on newer Python/NumPy versions. Does anybody see something obvious I've missed?

@hameerabbasi
Contributor

NumPy 1.13 and 1.11 didn't have this protocol... So what you added was essentially dead code then.

@mrocklin
Member

mrocklin commented Mar 8, 2019

I'm surprised to see tests with sparse and cupy in this PR. From my perspective this is more about having Dask Arrays respond to NumPy functions, rather than having the blocks of a Dask array respond to NumPy functions.

So I would be more interested in checks like the following:

x = da.ones(...)
y = np.concatenate([x, x, x])
assert isinstance(y, da.Array)

Where we would also want to check things like np.linalg.qr (which you ran into with cupy) and any others that we know are sometimes problematic.
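The shape of such a check can be sketched without dask at all, using a dependency-free stand-in for the protocol. Everything here (`FakeDaskArray`, the hand-rolled `dispatch`, the toy `concatenate`) is illustrative and assumed, not real dask or NumPy code:

```python
# Dependency-free sketch of the NEP-18 dispatch the suggested test exercises.
# FakeDaskArray stands in for dask.array.Array; dispatch() loosely mimics
# what NumPy does internally before falling back to its own implementation.

def concatenate(arrays):
    """Stand-in for np.concatenate's default (non-dispatched) behavior."""
    out = []
    for a in arrays:
        out.extend(a)
    return out

def dispatch(func, arrays):
    """Offer the call to each argument type implementing the protocol."""
    types = tuple({type(a) for a in arrays})
    for a in arrays:
        impl = getattr(type(a), "__array_function__", None)
        if impl is not None:
            result = impl(a, func, types, (arrays,), {})
            if result is not NotImplemented:
                return result
    return func(arrays)  # plain fallback when nothing implements the protocol

class FakeDaskArray:
    def __init__(self, data):
        self.data = list(data)

    def __iter__(self):
        return iter(self.data)

    def __array_function__(self, func, types, args, kwargs):
        if not all(issubclass(t, FakeDaskArray) for t in types):
            return NotImplemented
        # Wrap the result so the "same array type comes back" assertion holds.
        return FakeDaskArray(func(*args, **kwargs))

x = FakeDaskArray([1, 2])
y = dispatch(concatenate, [x, x, x])
assert isinstance(y, FakeDaskArray)
assert y.data == [1, 2, 1, 2, 1, 2]
```

The two final assertions mirror the proposed test: the call goes through the protocol and the same "array" type comes back out.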

@pentschev
Member Author

NumPy 1.13 and 1.11 didn't have this protocol... So what you added was essentially dead code then.

Oh yes, that makes sense. I've been testing with upstream NumPy, that's why it passed here.

@pentschev
Member Author

@mrocklin it's good that you mention that, I'd totally forgotten about that case. But either way, are you suggesting we drop the Sparse and CuPy tests? I understand it may be a little early to have them properly set up due to different version requirements, but ultimately, isn't that what we want, to have Dask also working with other arrays?

@mrocklin
Member

mrocklin commented Mar 8, 2019

We do want those to work, yes, but I think that effort is somewhat orthogonal to the solution you've added here.

Or, in testing terms: you've got some great integration tests here; it would be good to see some unit tests as well that just test dask array's support of the __array_function__ protocol with normal NumPy arrays.

@pentschev
Member Author

Ok, so here's my suggestion: I'll add the tests you suggested (which I had really forgotten before) and skip the tests we currently have if NumPy < 1.17. Does that sound acceptable?

@mrocklin
Member

mrocklin commented Mar 8, 2019 via email

@pentschev
Member Author

Sure. More tests are always great (assuming they're decently fast)

Agreed. The current ones were tweaked to be fast; I didn't profile them exactly, but all 3 tests certainly took < 1 sec.

@pentschev
Member Author

Ah, now I realize why the tests are failing: before 1.17 (currently upstream), we have to set the environment variable NUMPY_EXPERIMENTAL_ARRAY_FUNCTION=1.

So the question here is: can we enable that in CI for newer versions of Python? Otherwise, I think the only alternative is to skip these tests until v1.17 is released.
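For reference, the opt-in could look like this in a CI script (a sketch; the variable must be exported before Python starts, since NumPy reads it when it is imported):

```shell
# Opt in to the experimental NEP-18 protocol on NumPy 1.16.x.
# (On 1.17+ this is expected to be on by default.)
export NUMPY_EXPERIMENTAL_ARRAY_FUNCTION=1
```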

@mrocklin
Member

mrocklin commented Mar 8, 2019 via email

@pentschev
Member Author

Alright, I added some __array_function__ unit tests; it's by no means an extensive list, but probably a good starting point for us.

Regarding the NumPy version, we need at least v1.16 (the tests are now skipped for older versions) and the NUMPY_EXPERIMENTAL_ARRAY_FUNCTION=1 variable exported, which I've now added. Is it possible for us also to test with NumPy v1.16.x?

Member

@mrocklin mrocklin left a comment

Ah, looks like I didn't submit this review. I repeat a bit of what @shoyer said, but there are some other comments as well.

if da_func is func:
return NotImplemented
if not all(issubclass(t, Array) for t in types):
return NotImplemented
Member

So in many cases a dask array function will work with NumPy objects (or presumably anything on which it can effectively call asarray). I suggest a test like np.tensordot(dask_array, numpy_array), and replacing the all above with an any.

Member

The __array_function__ protocol guarantees that it will only get called if dask.array.Array is in types. So if that's all you wanted to check, you could drop this entirely.

The problem is that you want to support operations involving some, but not all, other array types. For example, xarray.DataArray wraps dask arrays, not the other way around. So dask.Array.__array_function__ should return NotImplemented in that case, leaving it up to xarray.DataArray.__array_function__ to define the operation.

We basically need some protocol or registration system to mark arrays as "can be coerced into dask array", e.g., either

sparse.Array.__dask_chunk_compatible__ = True

or

dask.array.register_chunk_type(sparse.Array)
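The second option could be shaped roughly as below, sketched without dask. All names here (`register_chunk_type`, `is_valid_chunk_type`, `DaskLikeArray`, and the stand-in array classes) are assumptions for illustration, not real APIs:

```python
# Illustrative registry of "types that may live inside a dask-like array".
# Names are assumed for the sketch; this is not dask's implementation.

_CHUNK_TYPES = set()

def register_chunk_type(cls):
    """Mark cls as coercible into blocks of a dask-like array."""
    _CHUNK_TYPES.add(cls)
    return cls  # usable as a class decorator too

def is_valid_chunk_type(cls):
    return any(issubclass(cls, t) for t in _CHUNK_TYPES)

class DaskLikeArray:
    def __array_function__(self, func, types, args, kwargs):
        # Defer to wrappers (an xarray.DataArray-like type) by only handling
        # calls where every participating type is known to us.
        if not all(t is DaskLikeArray or is_valid_chunk_type(t) for t in types):
            return NotImplemented
        return "handled"  # placeholder for the real dispatch

class SparseLikeArray:   # plays the role of sparse.COO
    pass

class WrapperArray:      # plays the role of xarray.DataArray
    pass

register_chunk_type(SparseLikeArray)

a = DaskLikeArray()
assert a.__array_function__(None, (DaskLikeArray, SparseLikeArray), (), {}) == "handled"
assert a.__array_function__(None, (DaskLikeArray, WrapperArray), (), {}) is NotImplemented
```

With something like this, an unknown wrapper type makes dask step aside, leaving the wrapper's own __array_function__ to define the operation.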

Member

Hrm, that's an interesting problem. Maybe we try to turn all of the args into dask arrays with da.asarray and, if that works, proceed?

I'm not sure that this would produce the correct behavior for xarray.DataArray though. I genuinely don't know what the right behavior is there.
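The coerce-and-see idea can be sketched in a few lines; `asarray` here is a toy stand-in for da.asarray, and `DuckArray` is purely illustrative:

```python
# Toy sketch of "coerce everything, and proceed only if that works".

class DuckArray:
    def __init__(self, data):
        self.data = list(data)

def asarray(obj):
    """Stand-in for da.asarray: accepts DuckArray, lists, and tuples."""
    if isinstance(obj, DuckArray):
        return obj
    if isinstance(obj, (list, tuple)):
        return DuckArray(obj)
    raise TypeError(f"cannot coerce {type(obj).__name__}")

def try_handle(args):
    """Return coerced args, or NotImplemented to defer to another type."""
    try:
        return [asarray(a) for a in args]
    except TypeError:
        return NotImplemented

ok = try_handle([DuckArray([1]), [2, 3]])
assert all(isinstance(a, DuckArray) for a in ok)
assert try_handle([DuckArray([1]), object()]) is NotImplemented
```

The catch, as the comment above notes, is that a permissive asarray would also happily swallow wrapper types like xarray.DataArray, which is exactly the behavior in question.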

Contributor

I actually think my idea for mixins can help with this. Specifically, we could check for the specific mixins that Dask absolutely requires; I’m assuming that’d be the indexing mixin.

Member Author

Hrm, that's an interesting problem. Maybe we try to turn all of the args into dask arrays with da.asarray and, if that works, proceed?

I actually like this idea for its simplicity.

If we want to be completely general, we could use hasattr(t, ‘__array_function__’). But this would also include Pandas dataframes at some point I’m assuming...

I also like this idea in general, but I think it would be best to first agree on all the mixins we expect to have/support. Plus, if I understood correctly, it feels like this will require much more type checking, which may become complicated to handle.

Member

I think that that would only help to address the second point above. I'm not sure that it would solve it though. As a worst case scenario I suspect that a project like TensorFlow would not inherit from those classes. We could get NumPy, Dask, Sparse, and Xarray on board but I'm not sure about any of the others out there.

I personally think that duck typing is more likely to work across projects than explicit typing.

Mixins may still be a good idea for the reason you mention, you can easily implement a lot of NumPy-like functionality without too much effort, but I don't think that they're a good choice to test for Numpy-ness.

Contributor

I do agree that it's hard to get traction for mixins or add dependencies, but adding a protocol isn't hard.

While you're right that TensorFlow doesn't plan on implementing any of these protocols, it has always historically done its own thing, and it has its own distributed system. I suspect that with Google's historical "rewrite the world and do it better" philosophy, it'll be hard to get them to play along anyway.

In contrast, anyone specifically aiming to be a NumPy duck array won't have a problem implementing protocols, and that is what I'm planning to do, to allow mixins to implement methods and so forth based on the implementation of protocols.

Member Author

For the time being, I've used @shoyer's and @mrocklin's suggestions, replacing only that last condition in __array_function__ with:

if not all(issubclass(t, Array) for t in types):
    return NotImplemented

It seems to me that the mixin discussion has grown enough that we should move it to a Dask issue, discuss it there, and later review the current code. Does anybody disagree with my proposal?

Member Author

Since nobody opposed, I moved the discussion to #4583.

Member

I think that with @shoyer's statements below we can drop this conditional entirely. It should always pass.

The __array_function__ protocol guarantees that it will only get called if dask.array.Array is in types. So if that's all you wanted to check, you could drop this entirely.

@hameerabbasi
Contributor

Regarding the NumPy version, we need at least v1.16 (now skipped for older versions) and NUMPY_EXPERIMENTAL_ARRAY_FUNCTION=1 variable exported, which I now added. Is it possible for us to test also with NumPy v1.16.x?

You need to add that for 1.16, where it was experimental. For 1.17, though, the env var isn't needed currently, but people are proposing that we keep the "experimental" status of the NEP for one more version...

@pentschev
Member Author

You need to add that for 1.16, where it was experimental. For 1.17, though, the env var isn't needed currently, but people are proposing that we keep the "experimental" status of the NEP for one more version...

While I agree that features should be experimental until they can really be considered stable, I was really expecting that v1.17 would introduce this feature as stable. Without that, it's difficult to get other projects to even run unit tests for it, e.g., CuPy. I'm totally willing to help improve everything we can to try to get this into v1.17 as stable. Are there many major concerns about its maturity?

@hameerabbasi
Contributor

are there many major concerns about its maturity?

CuPy has been using it for a while (as short as whiles go)... So I think not. There might be features added but not removed.

Tests added:
* Submodules
* Functionality not implemented in Dask
@hameerabbasi
Contributor

One thing I must mention that I like about this implementation is that it specifically avoids a NumPy dependency or mapping.

@pentschev
Member Author

Oh, there's more in the ufunc tests: it fails specifically for i0, nan_to_num, and sinc, which are being tested in test_unary_ufunc. However, those three functions are defined just after the following comment in ufunc.py: "non-ufunc elementwise functions".

I fixed this issue in my latest commit. If anyone thinks this is wrong, please let me know.

I would say this is ready for another round of review now.

@jakirkham
Member

I fixed this issue in my latest commit. If anyone thinks this is wrong, please let me know.

That looks correct to me, @pentschev. Those are not really ufuncs, but they happened to more or less work for the purposes of the original tests. So it is more correct to test them separately. Thanks for correcting these.

Member

@jakirkham jakirkham left a comment

This seems fine to me. I added a small comment above about adjusting the tests a little, but don't view this as a blocker.

@pentschev
Member Author

@jakirkham thanks for the review!

dweights = da.from_array(weights, chunks=2)
e = da.bincount(d, weights=dweights, minlength=6)
assert_eq(e, np.bincount(x, weights=dweights, minlength=6))
assert_eq(e, np.bincount(x, weights=dweights.compute(), minlength=6))
Member

Why was this change needed?

Member Author

Because here the right argument of assert_eq() is used as the baseline, and since x is a NumPy array, it fails here:

    @wraps(np.bincount)
    def bincount(x, weights=None, minlength=None):
        if minlength is None:
            raise TypeError("Must specify minlength argument in da.bincount")
        assert x.ndim == 1
        if weights is not None:
>           assert weights.chunks == x.chunks
E           AttributeError: 'numpy.ndarray' object has no attribute 'chunks'

Therefore, to me it makes sense to make both arguments NumPy arrays and let it fall back to NumPy's implementation.

Contributor

I think the reverse should be done, with weights being coerced to da.Array with the same chunks as x.

Member

Ah I see. I think I agree with @pentschev here. The left argument, e, is testing the dask/dask situation while the right argument is testing the numpy/numpy situation.

Member Author

I think the reverse should be done, with weights being coerced to da.Array with the same chunks as x.

That would cause da.bincount() to be dispatched, and then we would be testing against e, which computes exactly that.


@pentschev
Member Author

I think that with @shoyer's statements below we can drop this conditional entirely. It should always pass.

The __array_function__ protocol guarantees that it will only get called if dask.array.Array is in types. So if that's all you wanted to check, you could drop this entirely.

I think I've left this in for a reason, though I can't remember it for sure. I think it was for the case of having arrays of different types. Wouldn't it be triggered if one array were a Dask array and another a CuPy one, for example?

@mrocklin
Member

Currently the code is

        if not any(issubclass(t, (Array, np.ndarray)) for t in types):
            return NotImplemented

But we know that any(issubclass(t, Array) for t in types) is true. If it wasn't then we wouldn't be in this function.

@pentschev
Member Author

@mrocklin I'm confused: are you saying we should now get rid of the entire condition, or just of Array, keeping np.ndarray?

@mrocklin
Member

I'm saying that I think we can remove the entire condition. If what @shoyer suggests is true then I don't think it will ever be triggered.

@pentschev
Member Author

I was thinking of other types, like CuPy and maybe Sparse. I thought I had seen one such case, but maybe I got confused. I'll remove it for now, and if need be, we can always add it back later.

@mrocklin
Member

Currently your condition will not trigger if any of the arguments is of dask.array.Array type.

The fact that Array.__array_function__ is triggered means that at least one of our arguments is of dask.array.Array type. It doesn't matter what the other arguments are.

I may not fully understand things though.
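The guarantee being discussed can be illustrated with a toy dispatcher (stand-ins only; this is not NumPy's actual machinery): the protocol method is only ever reached when its own type appears among the arguments, so a check for that type inside the method can never fail.

```python
# Toy illustration of the guarantee: __array_function__ is only consulted
# when the implementing type actually appears among the arguments.

class Array:
    def __array_function__(self, func, types, args, kwargs):
        # By construction, Array is always present in `types` here,
        # whatever the other argument types are.
        assert any(issubclass(t, Array) for t in types)
        return "handled"

def dispatch(func, args):
    """Loose stand-in for NumPy's overridable-function machinery."""
    types = tuple({type(a) for a in args})
    for a in args:
        method = getattr(type(a), "__array_function__", None)
        if method is not None:
            return method(a, func, types, args, {})
    return func(*args)  # default path: no argument implements the protocol

assert dispatch(max, (Array(), 1, 2.0)) == "handled"
assert dispatch(max, (1, 2, 3)) == 3  # protocol never consulted
```

In other words, an `any(issubclass(t, Array) for t in types)` check inside the method is always true, which is why the conditional can be dropped.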

@pentschev
Member Author

@mrocklin I don't either, but that's most likely right: since Dask can operate on different types, it should probably be capable of operating on mixed types too. For now I've removed this condition.

@mrocklin mrocklin merged commit 87f3a48 into dask:master Mar 15, 2019
@mrocklin
Member

Merging this in. Thanks for the effort here @pentschev and thank you @hameerabbasi, @jakirkham, and @shoyer for the review

@shoyer
Member

shoyer commented Mar 15, 2019

For testing, I suppose this is OK, but I do think it's a really bad idea for dask not to include any type checking inside __array_function__. This is basically guaranteed to result in bugs.

But let's discuss more in #4583.

@shoyer
Member

shoyer commented Mar 15, 2019

To be clear, thanks @pentschev and @mrocklin for pushing this forward. I'm really excited about the potential here!

I just want to get the design right. Given that we were involved in writing the NEP, I feel like we should try hard to be an exemplar implementation of __array_function__.

@mrocklin
Member

Ah, my apologies if I jumped the gun here. Happy to revert and reopen if desired.

@shoyer
Member

shoyer commented Mar 15, 2019

no need to revert, we can fix this forward

@pentschev
Member Author

Thanks for the reviews @mrocklin, @shoyer, @hameerabbasi, @jakirkham.

I absolutely agree with @shoyer that we should get the design right. I'm happy to help with whatever I can; I'm not sure I can help much with coming up with a design right now, since I'm not aware of many of the use cases and how they should work, but count on me later on for implementation and testing.

@jrbourbeau jrbourbeau mentioned this pull request Mar 17, 2019
@pentschev pentschev deleted the array_function_support branch April 17, 2019 08:12
jorge-pessoa pushed a commit to jorge-pessoa/dask that referenced this pull request May 14, 2019
Add __array_function__ support to Array class