Skip to content

Conversation

@douglasdavis
Copy link
Member

@douglasdavis douglasdavis commented Feb 7, 2022

PEP 544 introduces the Protocol class to the typing module in Python 3.8 (the soon be the minimum supported version, dask/community#213). Writing new Dask collections for dask-awkward has had me thinking about working on a DaskCollection protocol. I imagine the benefits to be:

The typing.runtime_checkable decorator allows use of isinstance(x, DaskCollection) in any code base
that uses Dask collections; for example:

>>> from dask.typing import DaskCollection
>>> import dask.array as da
>>> x = da.zeros((10, 3))
>>> isinstance(x, DaskCollection)
True

(though this is an order of magnitude slower than dask.base.is_dask_collection which only checks for x.__dask_graph__() is not None; static typing checking & built-in interface documentation are the core benefits IMO)

Something else that came up in the brief discussion on a call last week was having {Scheduler,Worker,Nanny}Plugin protocols in distributed; and perhaps those are better places to start introducing protocols to Dask since on the user side typically more folks would write plugins than new collections.

Copy link
Member

@fjetter fjetter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the py.typed have any effect? IIUC https://www.python.org/dev/peps/pep-0561/#id13 says we'd need to package this file

@douglasdavis
Copy link
Member Author

Yes indeed I was missing the necessary change for it to have an effect (now added as package data in setup.py).

@douglasdavis douglasdavis force-pushed the collection-protocol branch 3 times, most recently from 50d3de8 to ffb767d Compare February 14, 2022 21:03
Copy link
Collaborator

@ian-r-rose ian-r-rose left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for getting the ball rolling on this @douglasdavis , I'm quite excited to see this get in.

I think it's outside of the scope of this PR, but I'd love to start writing down protocols/aliases for things like "tasks" and "keys", which would help to further specify the type parameters for things like __dask_graph__() and __dask_keys__().

dask/typing.py Outdated

@runtime_checkable
class DaskCollection(Protocol):
def __dask_graph__(self) -> Mapping | None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the rationale behind this being None? Is there somewhere where None is a valid graph ("yes" is a perfectly fine answer, I'm just not aware of it)? Since a lot of places check for __dask_graph__() this means a lot of downstream logic around checking for None first

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was taking inspiration from the implementation of is_dask_collection checking for is not None instead of something like hasattr, but since you've pointed it out I feel that anyone implementing a collection would not have a None returning __dask_graph__.. So yea I'd agree it would make sense to just have Mapping here

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed that the docs for __dask_graph__() also talk about None: https://docs.dask.org/en/stable/custom-collections.html#dask_graph__ . So the original typing doesn't seem wrong, it's just unfortunate that it would require downstream users to check for it all the time. Perhaps we can remove that?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be in favor of removing the possibility of None. I can't think of a scenario where it would pop up.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any further thoughts on whether we can remove None? I just took a look at doing that locally, and nothing seemed to blow up.

I also see many places where __dask_graph__() is used without None checks, so certainly usages in this codebase don't really consider it to be likely.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if we are removing this, it's worth also deprecating/removing it in is_dask_collection().

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah makes sense; I've updated is_dask_collection to use the "old" check by testing hasattr(x, __dask_graph__)

@ian-r-rose ian-r-rose added the discussion Discussing a topic with no specific actions yet label Feb 17, 2022
@douglasdavis douglasdavis force-pushed the collection-protocol branch 2 times, most recently from 52ba882 to fb3677c Compare February 24, 2022 18:34
@douglasdavis douglasdavis force-pushed the collection-protocol branch 2 times, most recently from 967415d to fcea3d4 Compare March 1, 2022 16:41
@douglasdavis douglasdavis force-pushed the collection-protocol branch from fcea3d4 to 90724c5 Compare March 7, 2022 16:15
@douglasdavis
Copy link
Member Author

@ian-r-rose thanks a lot the first round of comments. I've just pushed some updates

  • added PostComputeCallable and PostPersistCallable protocols
  • Added a HLGDaskCollection protocol which inherits from DaskCollection but requires definition of a __dask_layers__ method.
  • also added some docstrings

@douglasdavis douglasdavis force-pushed the collection-protocol branch from 90724c5 to 782be8a Compare March 7, 2022 18:48
@douglasdavis douglasdavis force-pushed the collection-protocol branch 5 times, most recently from a71a5d2 to 75c90ff Compare March 15, 2022 15:22
@douglasdavis douglasdavis force-pushed the collection-protocol branch 2 times, most recently from 4813edc to c758eb1 Compare March 21, 2022 21:01
@douglasdavis douglasdavis force-pushed the collection-protocol branch 2 times, most recently from d51996c to b93503f Compare March 29, 2022 16:05
@douglasdavis douglasdavis force-pushed the collection-protocol branch 2 times, most recently from aada9fd to 9786950 Compare April 12, 2022 12:10
return x.__dask_graph__() is not None
except (AttributeError, TypeError):
return False
return hasattr(x, "__dask_graph__") and callable(x.__dask_graph__)
Copy link
Member Author

@douglasdavis douglasdavis Apr 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ian-r-rose there's a test in dask.array that setattrs __dask_graph__ to None:

def test_direct_deferral_wrapping_override():
"""Directly test Dask defering to an upcast type and the ability to still wrap it."""
a = da.from_array(np.arange(4))
b = WrappedArray(np.arange(4))
assert a.__add__(b) is NotImplemented
# Note: remove dask_graph to be able to wrap b in a dask array
setattr(b, "__dask_graph__", None)
res = a + da.from_array(b)
assert isinstance(res, da.Array)
assert_eq(res, 2 * np.arange(4), check_type=False)

I wonder if it's in the scope of this PR to try to adopt a mechanism for "erasing" an existing __dask_graph__ without setting it to None (adding the callable check allows the test to stay as is).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me, erasing an existing method on an instance is such specialized (and probably not recommended) behavior that we shouldn't try to capture it in the signature. I don't really even understand what this test is doing, to be honest.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea I also can't quite tell what is being accomplished with the test, and see the assignment of the method to None as very specialized. But anyway, I'm satisfied with this definition of is_dask_collection, tests look to be passing and I think it covers what the previous implementation was getting (it while removing the possibility of a None return from the method itself).

Copy link
Collaborator

@ian-r-rose ian-r-rose left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @douglasdavis, I think this is close. If some of the self-referential callbacks become too sticky, I'd say we can defer them to follow-ups

def __call__(
self,
dsk: Mapping,
*args: Any,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think some of the pyright issues that I saw were due to this *args. The implementation for most of the collections don't actually accept arbitrary *args, making the function signatures incompatible.

I'm actually not sure why mypy is okay with it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at this a bit more: it would be nice if we could type *args with a ParamSpec. However, I think ParamSpec is currently not expressive enough to handle keyword-only args: https://peps.python.org/pep-0612/#concatenating-keyword-parameters.

So I take the above back, I'm not sure we can do better than Any here at the moment.

@douglasdavis
Copy link
Member Author

douglasdavis commented Apr 27, 2022

@ian-r-rose I think I'm starting to be pretty content with the status of this PR. Thanks for all of your work on it with me. One thing that I think can improve, but IMO is worth addressing in a future PR, is __dask_optimize__. It's hinted to Any, even hinting it to Callable is causing issues because the core collections are using dask.context.globalmethod to define their __dask_optimize__ implementations. Perhaps a DaskOptimize protocol is something we can work on in the future.

Copy link
Collaborator

@ian-r-rose ian-r-rose left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @douglasdavis, I'm happy where this sits right now. I have a couple of very minor nits, but I otherwise think this is ready to be taken for a spin.

What do you think @jrbourbeau and @jsignell?

def is_dask_collection(x) -> bool:
"""Returns ``True`` if ``x`` is a dask collection"""
try:
return x.__dask_graph__() is not None
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a good change, but perhaps it should go through a deprecation cycle. What do you think @jrbourbeau?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this needs a deprecation cycle.

def __call__(
self,
dsk: Mapping,
*args: Any,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at this a bit more: it would be nice if we could type *args with a ParamSpec. However, I think ParamSpec is currently not expressive enough to handle keyword-only args: https://peps.python.org/pep-0612/#concatenating-keyword-parameters.

So I take the above back, I'm not sure we can do better than Any here at the moment.

Copy link
Collaborator

@ian-r-rose ian-r-rose left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all your work on this @douglasdavis! I'm happy with this as it stands, my only outstanding question is whether the change to is_dask_collection deserves a deprecation cycle.

@jsignell
Copy link
Member

Thanks for your work on this @douglasdavis and @ian-r-rose I am having a read through now.

@douglasdavis douglasdavis force-pushed the collection-protocol branch from baa5824 to 022285a Compare May 2, 2022 13:30
Copy link
Member

@jsignell jsignell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks really good to me.

@douglasdavis douglasdavis force-pushed the collection-protocol branch from b36f571 to ad81e86 Compare May 5, 2022 13:24
@jsignell jsignell merged commit 1e783d9 into dask:main May 6, 2022
@ian-r-rose
Copy link
Collaborator

Woo! Thanks for your persistence @douglasdavis

@jakirkham jakirkham mentioned this pull request May 6, 2022
2 tasks
@jakirkham
Copy link
Member

Seeing a test failure in Distributed, which appears may be related to this change. Details in PR ( dask/distributed#6299 ).

erayaslan pushed a commit to erayaslan/dask that referenced this pull request May 12, 2022
[PEP 544](https://www.python.org/dev/peps/pep-0544/) introduces the `Protocol` class to the `typing` module in Python 3.8 (the soon be the minimum supported version, dask/community#213). Writing new Dask collections for [dask-awkward](https://github.com/ContinuumIO/dask-awkward/) has had me thinking about working on a `DaskCollection` protocol. I imagine the benefits to be:

- usage with static type checkers
  - other activity in this area at
    - dask#8295 
    - dask#8706 
    -  dask#8854
  - Python supporting IDEs take advantage of typing
- self-documenting; some improvements to [the custom collections page](https://docs.dask.org/en/latest/custom-collections.html) of the docs. The protocol docs can be autogenerated and added to that page.
- purely opt-in feature

The `typing.runtime_checkable` decorator allows use of `isinstance(x, DaskCollection)` in any code base
that uses Dask collections; for example:

```python
>>> from dask.typing import DaskCollection
>>> import dask.array as da
>>> x = da.zeros((10, 3))
>>> isinstance(x, DaskCollection)
True
```
(though this is an order of magnitude slower than `dask.base.is_dask_collection` which only checks for `x.__dask_graph__() is not None`; static typing checking & built-in interface documentation are the core benefits IMO)

Something else that came up in the brief discussion on a call last week was having `{Scheduler,Worker,Nanny}Plugin` protocols in `distributed`; and perhaps those are better places to start introducing protocols to Dask since on the user side typically more folks would write plugins than new collections.
@douglasdavis douglasdavis deleted the collection-protocol branch May 17, 2022 22:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

dataframe discussion Discussing a topic with no specific actions yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants