-
-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Collection Protocol #8674
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Collection Protocol #8674
Conversation
fjetter
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the py.typed have any effect? IIUC https://www.python.org/dev/peps/pep-0561/#id13 says we'd need to package this file
|
Yes indeed I was missing the necessary change for it to have an effect (now added as package data in setup.py). |
50d3de8 to
ffb767d
Compare
ffb767d to
9cf3cc9
Compare
ian-r-rose
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for getting the ball rolling on this @douglasdavis , I'm quite excited to see this get in.
I think it's outside of the scope of this PR, but I'd love to start writing down protocols/aliases for things like "tasks" and "keys", which would help to further specify the type parameters for things like __dask_graph__() and __dask_keys__().
dask/typing.py
Outdated
|
|
||
| @runtime_checkable | ||
| class DaskCollection(Protocol): | ||
| def __dask_graph__(self) -> Mapping | None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the rationale behind this being None? Is there somewhere where None is a valid graph ("yes" is a perfectly fine answer, I'm just not aware of it)? Since a lot of places check for __dask_graph__() this means a lot of downstream logic around checking for None first
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was taking inspiration from the implementation of is_dask_collection checking for is not None instead of something like hasattr, but since you've pointed it out I feel that anyone implementing a collection would not have a None returning __dask_graph__.. So yea I'd agree it would make sense to just have Mapping here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed that the docs for __dask_graph__() also talk about None: https://docs.dask.org/en/stable/custom-collections.html#dask_graph__ . So the original typing doesn't seem wrong, it's just unfortunate that it would require downstream users to check for it all the time. Perhaps we can remove that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be in favor of removing the possibility of None. I can't think of a scenario where it would pop up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any further thoughts on whether we can remove None? I just took a look at doing that locally, and nothing seemed to blow up.
I also see many places where __dask_graph__() is used without None checks, so certainly usages in this codebase don't really consider it to be likely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if we are removing this, it's worth also deprecating/removing it in is_dask_collection().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah makes sense; I've updated is_dask_collection to use the "old" check by testing hasattr(x, __dask_graph__)
52ba882 to
fb3677c
Compare
967415d to
fcea3d4
Compare
fcea3d4 to
90724c5
Compare
|
@ian-r-rose thanks a lot the first round of comments. I've just pushed some updates
|
90724c5 to
782be8a
Compare
a71a5d2 to
75c90ff
Compare
4813edc to
c758eb1
Compare
d51996c to
b93503f
Compare
aada9fd to
9786950
Compare
18ff623 to
26b119b
Compare
| return x.__dask_graph__() is not None | ||
| except (AttributeError, TypeError): | ||
| return False | ||
| return hasattr(x, "__dask_graph__") and callable(x.__dask_graph__) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ian-r-rose there's a test in dask.array that setattrs __dask_graph__ to None:
dask/dask/array/tests/test_dispatch.py
Lines 227 to 236 in 7259f21
| def test_direct_deferral_wrapping_override(): | |
| """Directly test Dask defering to an upcast type and the ability to still wrap it.""" | |
| a = da.from_array(np.arange(4)) | |
| b = WrappedArray(np.arange(4)) | |
| assert a.__add__(b) is NotImplemented | |
| # Note: remove dask_graph to be able to wrap b in a dask array | |
| setattr(b, "__dask_graph__", None) | |
| res = a + da.from_array(b) | |
| assert isinstance(res, da.Array) | |
| assert_eq(res, 2 * np.arange(4), check_type=False) |
I wonder if it's in the scope of this PR to try to adopt a mechanism for "erasing" an existing __dask_graph__ without setting it to None (adding the callable check allows the test to stay as is).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To me, erasing an existing method on an instance is such specialized (and probably not recommended) behavior that we shouldn't try to capture it in the signature. I don't really even understand what this test is doing, to be honest.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea I also can't quite tell what is being accomplished with the test, and see the assignment of the method to None as very specialized. But anyway, I'm satisfied with this definition of is_dask_collection, tests look to be passing and I think it covers what the previous implementation was getting (it while removing the possibility of a None return from the method itself).
ian-r-rose
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @douglasdavis, I think this is close. If some of the self-referential callbacks become too sticky, I'd say we can defer them to follow-ups
| def __call__( | ||
| self, | ||
| dsk: Mapping, | ||
| *args: Any, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think some of the pyright issues that I saw were due to this *args. The implementation for most of the collections don't actually accept arbitrary *args, making the function signatures incompatible.
I'm actually not sure why mypy is okay with it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at this a bit more: it would be nice if we could type *args with a ParamSpec. However, I think ParamSpec is currently not expressive enough to handle keyword-only args: https://peps.python.org/pep-0612/#concatenating-keyword-parameters.
So I take the above back, I'm not sure we can do better than Any here at the moment.
|
@ian-r-rose I think I'm starting to be pretty content with the status of this PR. Thanks for all of your work on it with me. One thing that I think can improve, but IMO is worth addressing in a future PR, is |
ian-r-rose
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @douglasdavis, I'm happy where this sits right now. I have a couple of very minor nits, but I otherwise think this is ready to be taken for a spin.
What do you think @jrbourbeau and @jsignell?
| def is_dask_collection(x) -> bool: | ||
| """Returns ``True`` if ``x`` is a dask collection""" | ||
| try: | ||
| return x.__dask_graph__() is not None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a good change, but perhaps it should go through a deprecation cycle. What do you think @jrbourbeau?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this needs a deprecation cycle.
| def __call__( | ||
| self, | ||
| dsk: Mapping, | ||
| *args: Any, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at this a bit more: it would be nice if we could type *args with a ParamSpec. However, I think ParamSpec is currently not expressive enough to handle keyword-only args: https://peps.python.org/pep-0612/#concatenating-keyword-parameters.
So I take the above back, I'm not sure we can do better than Any here at the moment.
ian-r-rose
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for all your work on this @douglasdavis! I'm happy with this as it stands, my only outstanding question is whether the change to is_dask_collection deserves a deprecation cycle.
|
Thanks for your work on this @douglasdavis and @ian-r-rose I am having a read through now. |
baa5824 to
022285a
Compare
jsignell
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks really good to me.
b36f571 to
ad81e86
Compare
|
Woo! Thanks for your persistence @douglasdavis |
|
Seeing a test failure in Distributed, which appears may be related to this change. Details in PR ( dask/distributed#6299 ). |
[PEP 544](https://www.python.org/dev/peps/pep-0544/) introduces the `Protocol` class to the `typing` module in Python 3.8 (the soon be the minimum supported version, dask/community#213). Writing new Dask collections for [dask-awkward](https://github.com/ContinuumIO/dask-awkward/) has had me thinking about working on a `DaskCollection` protocol. I imagine the benefits to be: - usage with static type checkers - other activity in this area at - dask#8295 - dask#8706 - dask#8854 - Python supporting IDEs take advantage of typing - self-documenting; some improvements to [the custom collections page](https://docs.dask.org/en/latest/custom-collections.html) of the docs. The protocol docs can be autogenerated and added to that page. - purely opt-in feature The `typing.runtime_checkable` decorator allows use of `isinstance(x, DaskCollection)` in any code base that uses Dask collections; for example: ```python >>> from dask.typing import DaskCollection >>> import dask.array as da >>> x = da.zeros((10, 3)) >>> isinstance(x, DaskCollection) True ``` (though this is an order of magnitude slower than `dask.base.is_dask_collection` which only checks for `x.__dask_graph__() is not None`; static typing checking & built-in interface documentation are the core benefits IMO) Something else that came up in the brief discussion on a call last week was having `{Scheduler,Worker,Nanny}Plugin` protocols in `distributed`; and perhaps those are better places to start introducing protocols to Dask since on the user side typically more folks would write plugins than new collections.
PEP 544 introduces the
Protocolclass to thetypingmodule in Python 3.8 (the soon be the minimum supported version, dask/community#213). Writing new Dask collections for dask-awkward has had me thinking about working on aDaskCollectionprotocol. I imagine the benefits to be:The
typing.runtime_checkabledecorator allows use ofisinstance(x, DaskCollection)in any code basethat uses Dask collections; for example:
(though this is an order of magnitude slower than
dask.base.is_dask_collectionwhich only checks forx.__dask_graph__() is not None; static typing checking & built-in interface documentation are the core benefits IMO)Something else that came up in the brief discussion on a call last week was having
{Scheduler,Worker,Nanny}Pluginprotocols indistributed; and perhaps those are better places to start introducing protocols to Dask since on the user side typically more folks would write plugins than new collections.