Add `fuse` to delayed object optimization #6222

jcrist · 2020-05-18T16:03:56Z

Many of our IO routes start/end with `Delayed` objects. With fuse enabled,
we can more effectively optimize these graphs. Since using `Delayed` to
build large graphs already takes a performance hit (the Delayed API
is more ergonomic, but less efficient than raw graphs), the slower
optimization time should be negligible for users compared to other
overheads.

See #6219.

Tests added / passed
Passes black dask / flake8 dask

TomAugspurger · 2020-05-18T16:17:06Z

Aside from the performance hit of calling optimize (which I agree is of secondary importance), the main question is whether we want to fuse them. I don't really have a good sense for that.

jcrist · 2020-05-18T16:19:44Z

I can't see why we wouldn't? Dask in general is given full liberty about when and where to run tasks; fusing tasks IMO is part of this liberty.

If a user wants fusion not to happen, they can set `optimization.fuse.enabled=False` in the config to disable this temporarily/globally.

mrocklin · 2020-05-18T16:25:05Z

One, perhaps silly reason to avoid fusing is for education. We often use dask delayed to show basics around the task scheduler to users in situations like talks and the tutorial. Fusing there would obfuscate what's going on a bit.

jcrist · 2020-05-18T16:27:47Z

> One, perhaps silly reason to avoid fusing is for education. We often use
> dask delayed to show basics around the task scheduler to users in
> situations like talks and the tutorial. Fusing there would obfuscate
> what's going on a bit.

Calling .visualize() doesn't optimize by default, so users won't see this when visualizing. They will see it for computation.

mrocklin · 2020-05-18T16:33:23Z

I'm aware. At least when I give tutorials or demos I usually have the dashboard up. It's useful to see the inc/add/dec calls everywhere rather than having to explain what an inc-dec call is.

mrocklin · 2020-05-18T16:33:28Z

But again, this may be silly

jcrist · 2020-05-18T16:38:03Z

I can see the issue, but I'm not sure if it outweighs the benefits. You also might develop graphs that don't hit this (anything with incoming branches), or develop a new narrative ("dask was smart enough to fuse these into one task"). No strong thoughts.
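
For readers following along, here is a minimal sketch of the behaviour being discussed: a linear chain collapses into one task, while a task with several incoming branches stays separate. It is pure Python over a dict-style graph and is not dask's actual `fuse` implementation. The helpers `dependencies`, `substitute`, `fuse_linear` and `get` are hypothetical names made up for this illustration, and the sketch assumes keys are strings and tasks are `(func, *args)` tuples.

```python
def inc(x):
    return x + 1

def dec(x):
    return x - 1

def add(x, y):
    return x + y

def _is_task(obj):
    # A task is a tuple whose first element is callable: (func, *args).
    return isinstance(obj, tuple) and bool(obj) and callable(obj[0])

def dependencies(dsk, task):
    """List the graph keys that `task` reads, including nested tasks."""
    if _is_task(task):
        return [key for arg in task[1:] for key in dependencies(dsk, arg)]
    if isinstance(task, str) and task in dsk:
        return [task]
    return []

def substitute(task, key, replacement):
    """Return `task` with every reference to `key` replaced inline."""
    if _is_task(task):
        return (task[0],) + tuple(substitute(a, key, replacement) for a in task[1:])
    if isinstance(task, str) and task == key:
        return replacement
    return task

def fuse_linear(dsk, keep=()):
    """Inline a key into its consumer when it has exactly one consumer
    and that consumer has exactly one dependency (a linear chain)."""
    dsk = dict(dsk)
    changed = True
    while changed:
        changed = False
        dependents = {k: [] for k in dsk}
        for k, task in dsk.items():
            for dep in dependencies(dsk, task):
                dependents[dep].append(k)
        for child, parents in dependents.items():
            if child in keep or len(parents) != 1:
                continue
            parent = parents[0]
            if len(dependencies(dsk, dsk[parent])) != 1:
                continue  # parent has incoming branches: leave it alone
            dsk[parent] = substitute(dsk[parent], child, dsk[child])
            del dsk[child]
            changed = True
            break  # the graph changed, so recompute dependents
    return dsk

def get(dsk, key):
    """Tiny recursive evaluator, used only to check results are unchanged."""
    cache = {}
    def run(task):
        if _is_task(task):
            return task[0](*[run(a) for a in task[1:]])
        if isinstance(task, str) and task in dsk:
            if task not in cache:
                cache[task] = run(dsk[task])
            return cache[task]
        return task
    return run(key)

# Linear chain: x -> a -> b collapses into a single task.
chain = {"x": 1, "a": (inc, "x"), "b": (dec, "a")}
fused_chain = fuse_linear(chain)
print(sorted(fused_chain))      # ['b']
print(get(fused_chain, "b"))    # 1

# Branching graph: "c" has two incoming branches, so "a" and "b" survive.
branch = {"x": 1, "y": 2, "a": (inc, "x"), "b": (inc, "y"), "c": (add, "a", "b")}
fused_branch = fuse_linear(branch)
print(sorted(fused_branch))     # ['a', 'b', 'c']
print(get(fused_branch, "c"))   # 5
```

The sketch keeps the consumer's key after inlining rather than generating a combined name, so it does not reproduce the fused task names that would show up on the dashboard.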

jakirkham · 2020-05-20T22:12:24Z

When I've taught people about Dask graphs, I've used `visualize` for that instead of a computation, but that might be a style/preference thing :)

I guess in the education case one could disable it in the notebook or hide that away in a config. Perhaps it is useful to teach about fusing though (off/on comparison)?

TomAugspurger · 2020-05-21T11:12:57Z

The change here doesn't affect visualize() since that doesn't optimize by default.

jsignell · 2020-05-22T19:24:58Z

For educational purposes, it is easy enough to include a dask config that turns the fuse off if that is deemed preferable. So I don't think that the educational reason is enough to justify not adding this feature if it is otherwise an improvement.

jsignell · 2020-05-29T13:44:06Z

Are we still deciding whether this is a good idea or does this PR just need to pass CI?

jcrist · 2020-05-29T13:47:13Z

I want to poke at the underlying issue (better fusion of tasks that convert to/from delayed, with store as a specific case) a bit more - hoping to get to this sometime next week.

jcrist · 2021-10-01T18:33:00Z

With the addition of high level graphs and other work since then, this PR no longer makes sense as is. Closing.

jcrist mentioned this pull request May 18, 2020

Can't optimize HighLevelGraph (delayed object) #6219

Open

eriknw mentioned this pull request May 20, 2020

Inline zarr.Array in da.from_zarr #6203

Closed

jsignell mentioned this pull request May 29, 2020

dd.to_parquet() duplicates graph execution #6232

Closed

Base automatically changed from master to main March 8, 2021 20:19

jcrist closed this Oct 1, 2021

jcrist deleted the add-fuse-delayed branch October 1, 2021 18:33

twoertwein mentioned this pull request Dec 3, 2021

DOC: note that compute does not optimize delayed objects #8454

Closed

jrbourbeau mentioned this pull request Dec 3, 2021

Add fusion optimization for Delayed #8448

Open