
Conversation

@fjetter (Member) commented Feb 11, 2025

This is a prototype that wraps HLGs in an expression. It helps streamline our compute calls and the serialization path to the scheduler.
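Conceptually, the wrapper looks roughly like this; the class and method names below are illustrative only, not the actual implementation:

```python
from dask.highlevelgraph import HighLevelGraph


class WrappedHLG:
    """Illustrative sketch: hold a HighLevelGraph and only materialize it
    into a flat task dict when the graph is actually requested, so that
    optimization and serialization can happen later (e.g. on the scheduler)."""

    def __init__(self, hlg: HighLevelGraph, keys):
        self.hlg = hlg
        self._keys = keys

    def __dask_keys__(self):
        return self._keys

    def __dask_graph__(self):
        # Materialization of the low-level graph is deferred until this point
        return dict(self.hlg)
```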

So far I have only run a single test, using the simple threaded scheduler. There are likely still plenty of issues, and the new expression classes have to be cleaned up, etc.

The top-level compute implements the new path and includes some comments on how this will look when migrated to the client.

sibling dask/distributed#9008

Related PRs

Closes dask/distributed#7964
Closes dask/dask-expr#14

@fjetter fjetter marked this pull request as draft February 11, 2025 16:33
github-actions bot (Contributor) commented Feb 11, 2025

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

      9 files ±0        9 suites ±0    3h 25m 53s ⏱️ −10m 45s
 17 818 tests −5    16 603 ✅ +266     1 215 💤 ±0    0 ❌ −271
159 436 runs  −45  147 321 ✅ +2 193  12 115 💤 −3    0 ❌ −2 235

Results for commit 2d208a1. ± Comparison against base commit bd3a7d6.

This pull request removes 6 tests and adds 1 test. Note that renamed tests count towards both.

Removed:
dask.dataframe.dask_expr.io.tests.test_distributed ‑ test_parquet_distriuted[arrow]
dask.dataframe.dask_expr.tests.test_distributed ‑ test_compute_concatenates[False]
dask.dataframe.dask_expr.tests.test_distributed ‑ test_compute_concatenates[True]
dask.tests.test_base ‑ test_optimizations_keyword
dask.tests.test_delayed ‑ test_delayed_optimize
dask.tests.test_distributed ‑ test_get_scheduler_with_distributed_active

Added:
dask.dataframe.dask_expr.io.tests.test_distributed ‑ test_parquet_distributed[arrow]

♻️ This comment has been updated with the latest results.

  sut = db.from_sequence(seq, partition_size=9)
  li = list(random.choices(sut, k=10).compute())
- assert sut.map_partitions(len).compute() == (9, 1)
+ assert sut.map_partitions(len).compute() == [9, 1]
@fjetter (Member, Author) commented:

The bag output type has been a bit inconsistent so far, depending on whether the output is a single partition or not (which in itself is not trivial to determine for bags). Now, the output is always a list.
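For illustration, a small sketch of the behaviour described above, assuming the new semantics in which the result is always a list:

```python
import dask.bag as db

seq = list(range(10))

# Two partitions (9 + 1 elements): the per-partition lengths come back as a list
multi = db.from_sequence(seq, partition_size=9)
assert multi.map_partitions(len).compute() == [9, 1]

# A single partition also yields a list (the output type no longer depends on
# the partition count)
single = db.from_sequence(seq, npartitions=1)
assert single.map_partitions(len).compute() == [10]
```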

dask/base.py Outdated


- def collections_to_dsk(collections, optimize_graph=True, optimizations=(), **kwargs):
+ def collections_to_dsk(
@fjetter (Member, Author) commented:
My intention is to rename this new function. We'll have the option of keeping the old one for backwards compatibility if necessary (with a DeprecationWarning).

@fjetter (Member, Author) commented Feb 13, 2025 (comment text not shown).
A contributor commented:

Hi - I'm sure I've misunderstood :), but I just wondered what you mean by "just drop it". I rely on this function in one of my libraries (that is in the GitHub search you linked to). Thanks.

@fjetter (Member, Author) commented:

This query yields 32 hits and most of the hits are vendored code. I don't know where you are using it but I'm not willing to keep this function around for one library (or very few).

We're internally moving away from HighLevelGraph objects, so this function as it currently stands is useless, and I do not intend to maintain it.
The function that is replacing this will likely be internal.

What exactly are you doing with this?

@fjetter (Member, Author) commented:

If it helps, we can keep the code around for a while (and raise a deprecation warning). That's easy to do, but long term, usage of this function is likely not viable.
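A minimal sketch of what such a backwards-compatibility shim could look like; the internal replacement function below is a placeholder, not the actual new code path:

```python
import warnings


def _new_internal_path(collections, optimize_graph=True, optimizations=(), **kwargs):
    # Placeholder standing in for the new internal, expression-based code path.
    raise NotImplementedError("illustrative placeholder only")


def collections_to_dsk(collections, optimize_graph=True, optimizations=(), **kwargs):
    # Deprecated shim kept for a transition period: warn and delegate to the
    # new internal path so downstream users get a migration window.
    warnings.warn(
        "collections_to_dsk is deprecated and will be removed in a future "
        "release; the expression-based compute path replaces it.",
        FutureWarning,
        stacklevel=2,
    )
    return _new_internal_path(
        collections,
        optimize_graph=optimize_graph,
        optimizations=optimizations,
        **kwargs,
    )
```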

The contributor replied:

Hi,
Thanks for the quick response. It is, of course, fine to get rid of it :) I just wondered what the implications for my work would be ...

I use it to get at a dictionary view (let's call it dsk) of a task graph (usually non-optimised, but after culling) that I can easily loop through to inspect the data definitions, or to modify them. In the latter case I then convert the dictionary back to a Dask array (dx = da.Array(dsk, ...)) and can carry on.

Our primary use case for this is management of file locations. We need to a) know which files on disk contribute to a computed result, and b) modify file locations when the actual datasets have (or will have) moved since the file names were logged.

If I can replicate this functionality in a new framework, I'll be very happy :)

David
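For reference, a minimal sketch of the workflow described above, using the current collections_to_dsk; the inspection/rewriting step is only indicated, since it depends on how the tasks embed the file paths:

```python
import dask.array as da
from dask.base import collections_to_dsk

x = da.ones((4, 4), chunks=(2, 2))

# Dictionary view of the task graph (non-optimised here)
dsk = dict(collections_to_dsk([x], optimize_graph=False))

# Loop through the tasks to inspect or modify the data definitions,
# e.g. rewrite file paths embedded in the tasks (omitted here)
for key, task in dsk.items():
    pass

# Rebuild a dask array from the (possibly modified) dictionary and carry on
dx = da.Array(dsk, x.name, chunks=x.chunks, dtype=x.dtype)
result = dx.compute()
```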

@fjetter (Member, Author) commented Feb 27, 2025

I believe all the test cases are now working except for some distributed integration tests. CI does not seem to install the right version of the distributed branch, since they pass locally.
There are many smaller changes that make sense independently of the HLG pickle change; I will start breaking those out into separate PRs to put this one into a reviewable state.

@fjetter (Member, Author) commented Mar 20, 2025

While CI is green, there are a number of things that are only tested in distributed, so this isn't done yet.

@fjetter (Member, Author) commented Mar 20, 2025

On the distributed side, there are one (maybe two) issues left that relate to how futures are serialized and registered. I doubt this will have a major impact on the code here.
I'll give it another pass to see if there are smaller changes worth breaking out, but review can start already.

@fjetter (Member, Author) commented Mar 20, 2025

I have already summarized the high-level changes in the changelog. I recommend starting there to get an idea of the changes.

@fjetter (Member, Author) commented Mar 20, 2025

Also important in case somebody wants to test this: to get a meaningful benefit from this when working with arrays, we first have to rework da.store, since it still explicitly optimizes and materializes the graph (the only place in the code base that still handles this differently). While previous attempts to migrate this to blockwise failed, wrapping it up in these classes is simpler and possible. I already have an early prototype and will look into it once the final test cases are wrapped up.

(i.e. the Coiled benchmarks use to_zarr / store in almost all cases, so to measure the difference here the benchmarks have to be rewritten a bit. I could already confirm there that the graph is indeed not materialized locally, which saves transmission time and removes the relevant client-side memory constraints we had on the very large examples.)

@fjetter fjetter force-pushed the wrap_hlg_expr branch 2 times, most recently from 6c2aa8d to 26270e5 Compare March 21, 2025 09:56
@fjetter fjetter force-pushed the wrap_hlg_expr branch 4 times, most recently from a7e6df6 to 90a01d6 Compare March 21, 2025 11:42
  - jinja2
  - pip
  - pip:
      - git+https://github.com/dask/distributed
@fjetter (Member, Author) commented:

  • revert before merge

@hendrikmakait (Member) left a comment:

Thanks, @fjetter!

@fjetter fjetter merged commit b2a4a21 into dask:main Mar 24, 2025
19 of 22 checks passed
@fjetter fjetter deleted the wrap_hlg_expr branch March 31, 2025 09:26
HiromuHota added a commit to snorkel-ai/ray that referenced this pull request May 9, 2025


Development

Successfully merging this pull request may close these issues:

• Scheduler gather should warn or abort requests if data is too large (Distributed protocol)