Skip to content

Conversation

@jcrist
Copy link
Member

@jcrist jcrist commented May 18, 2020

Many of our IO routes start/end with Delayed objects. With fuse enabled,
we can more effectively optimize these graphs. Since using Delayed to
build large graphs is already taking a performance hit (the delayed api
is more ergonomic, but less efficient than raw graphs), the slower
optimization time should be negligible for users compared to other
overheads.

See #6219.

  • Tests added / passed
  • Passes black dask / flake8 dask

Many of our IO routes start/end with Delayed objects. With fuse enabled,
we can more effectively optimize these graphs. Since using `Delayed` to
build large graphs is already taking a performance hit (the delayed api
is more ergonomic, but less efficient than raw graphs), the slower
optimization time should be negligible for users compared to other
overheads.
@TomAugspurger
Copy link
Member

Aside from the performance hit of calling optimize (which I agree is of secondary important), the main question is whether we want to fuse them. I don't really have a good sense for that.

@jcrist
Copy link
Member Author

jcrist commented May 18, 2020

I can't see why we wouldn't? Dask in general is given full liberty about when and where to run tasks, fusing tasks IMO is part of this liberty.

If a user wants fusion not to happen, they can set optimization.fuse.enabled=False in the config to disable this temporarily/globally.

@mrocklin
Copy link
Member

mrocklin commented May 18, 2020 via email

@jcrist
Copy link
Member Author

jcrist commented May 18, 2020

One, perhaps silly reason to avoid fusing is for education. We often use
dask delayed to show basics around the task scheduler to users in
situations like talks and the tutorial. Fusing there would obfuscate
what's going on a bit.

Calling .visualize() doesn't optimize by default, so users won't see this when visualizing. They will see it for computation.

@mrocklin
Copy link
Member

mrocklin commented May 18, 2020 via email

@mrocklin
Copy link
Member

mrocklin commented May 18, 2020 via email

@jcrist
Copy link
Member Author

jcrist commented May 18, 2020

I can see the issue, but I'm not sure if it outweighs the benefits. You also might develop graphs that don't hit this (anything with incoming branches), or develop a new narrative ("dask was smart enough to fuse these into one task"). No strong thoughts.

@jakirkham
Copy link
Member

When I've taught people about Dask graphs, have used visualize for that instead of a computation though, but that might be a style/preference thing :)

I guess in the education case one could disable it in the notebook or hide that away in a config. Perhaps it is useful to teach about fusing though (off/on comparison)?

@TomAugspurger
Copy link
Member

The change here doesn't affect visualize() since that doesn't optimize by default.

@jsignell
Copy link
Member

For educational purposes, it is easy enough to include a dask config that turns the fuse off if that is deemed preferable. So I don't think that the educational reason is enough to justify not adding this feature if it is otherwise an improvement.

@jsignell
Copy link
Member

Are we still deciding whether this is a good idea or does this PR just need to pass CI?

@jcrist
Copy link
Member Author

jcrist commented May 29, 2020

I want to poke at the underlying issue (better fusion of tasks that convert to/from delayed, with store as a specific case) a bit more - hoping to get to this sometime next week.

Base automatically changed from master to main March 8, 2021 20:19
@jcrist
Copy link
Member Author

jcrist commented Oct 1, 2021

With the addition of high level graphs and other work since then, this PR no longer makes sense as is. Closing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants