Conversation

@TomAugspurger (Member) commented Oct 27, 2020

This adds an option to `da.from_array` to control how the array object
itself is included in the task graph. By default, there's no change:
there will be a single key whose value is the array object, and the tasks
producing each chunk of the array refer to the array by its key.

With `inline_array=True`, each chunk task in the task graph will instead have
the array object itself included in its values.

The same keyword is added to `from_zarr`, which uses `from_array` internally.

I think that this closes #6668. I don't think that we would automatically
switch the default to `True`, which would certainly close it.
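
For illustration (this sketch is not part of the PR itself), here is roughly how the keyword changes the graph for a plain NumPy-backed array; the exact key counts depend on the chunking and dask version:

import numpy as np
import dask.array as da

arr = np.arange(16).reshape(4, 4)

# Default: one extra key holds the array object, and each chunk task
# refers to it by that key.
x = da.from_array(arr, chunks=2, inline_array=False)

# Inlined: the array object is embedded directly in every chunk task,
# so there is no standalone key for the array itself.
y = da.from_array(arr, chunks=2, inline_array=True)

print(len(x.__dask_graph__()))  # expected 5: four chunk tasks plus the array key
print(len(y.__dask_graph__()))  # expected 4: just the chunk tasks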

@TomAugspurger (Member, Author):

cc @shoyer, if this is what you had in mind for the nicer API. Functionally, I think it's equivalent to the custom inline optimization; it just happens at graph construction time.

import dask
import dask.array

# Create four identical arrays on disk.
x = dask.array.zeros((12500, 10000), chunks=('10MB', -1))
dask.array.to_zarr(x, 'saved_x1.zarr', overwrite=True)
dask.array.to_zarr(x, 'saved_y1.zarr', overwrite=True)
dask.array.to_zarr(x, 'saved_x2.zarr', overwrite=True)
dask.array.to_zarr(x, 'saved_y2.zarr', overwrite=True)

# Read them back with the array object inlined into each chunk task.
x1 = dask.array.from_zarr('saved_x1.zarr', inline_array=True)
y1 = dask.array.from_zarr('saved_y1.zarr', inline_array=True)
x2 = dask.array.from_zarr('saved_x2.zarr', inline_array=True)
y2 = dask.array.from_zarr('saved_y2.zarr', inline_array=True)


def evaluate(x1, y1, x2, y2):
    u = dask.array.stack([x1, y1])
    v = dask.array.stack([x2, y2])
    components = [u, v, u ** 2 + v ** 2]
    return [
        abs(c[0] - c[1]).mean(axis=-1)
        for c in components
    ]


# Visualize the optimized graph for a slice of the data, colored by
# the static ordering dask assigns to each task.
n = 125 * 4
dask.visualize(evaluate(x1[:n], y1[:n], x2[:n], y2[:n]), optimize_graph=True,
               color="order", cmap="autumn", node_attr={"penwidth": "4"})

[Task graph visualization with inline_array=True, colored by ordering]

versus this task graph with `inline_array=False` (the default):

[Task graph visualization with inline_array=False, showing worse ordering]


I guess there may be one remaining task from #6668:

[@shoyer] If so, perhaps it would make sense to think about applying this optimization automatically when not using the distributed scheduler.

I'm still thinking through that.

@TomAugspurger (Member, Author):

I added some documentation on this. @mrocklin, IIRC, in the past you've had some hesitation about documenting these failures. I believe your objections were along the lines of:

  1. It's hard to generalize from a particular example and apply it to your problem.
  2. It's possible (likely?) that any specific ordering problem can be fixed, rendering the example obsolete. (I'm kicking around some ideas about ignoring common leaf nodes in ordering that might fix Stephan's issue.)

I agree with those downsides, but I still think this is worth including with some explicit caveats:

  1. This is an advanced topic that most people shouldn't hit.
  2. Even when you hit an ordering issue, it may not be the end of the world.
  3. Making changes to achieve "ideal" ordering may have other downsides that slow down the overall computation.

[Screenshot: the new "Ordering" page in the Dask documentation]

@mrocklin (Member) commented Oct 28, 2020 via email

@TomAugspurger (Member, Author):

From #6203 (comment):

I'm asking what the default value for inlining should be for the `from_zarr`
function. Should `from_zarr` inline or not inline? Are there things about a
zarr array that could help us better choose a default value?

For example, if the zarr array holds data in memory, then it should
probably not inline. If it points to data on disk then maybe it's ok? (Or
maybe not, and we should consider caching like Xarray does.)

Is it surprising that even with inlined arrays, we see about the same serialized size?

In [8]: import zarr
   ...: import numpy as np
   ...: import dask.array as da
   ...: from distributed.protocol import serialize
   ...:
   ...: a = zarr.ones((12_500, 10_000), chunks=(125, 10_000))
   ...: a[:] = np.random.random(size=a.shape)
   ...:
   ...: x1 = da.from_zarr(a)
   ...:
   ...: x2 = da.from_zarr(a, inline_array=True)
   ...:
   ...: len(x1.dask), len(x2.dask)
   ...:
   ...: len(serialize(x1)[1][0]) / len(serialize(x2)[1][0])

Out[8]: 0.9999993250547557

I would have expected the size of the inlined version to be about 100x, since there are 100 references to the array. But someone (dask? pickle?) is clever enough to handle that case.

So I guess it comes down to the cost of creating these Zarr objects (including their stores?).

@shoyer (Member) commented Oct 29, 2020

I would have expected the size of the inlined version to be about 100x, since there are 100 references to the array. But someone (dask? pickle?) is clever enough to handle that case.

Right, pickle is very clever. Each Python object it encounters only gets serialized once.
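
A minimal sketch of that behaviour (not from the thread): repeated references to one object add almost nothing to the pickled payload.

import pickle

import numpy as np

big = np.zeros(1_000_000)   # ~8 MB of float64 data
many_refs = [big] * 100     # 100 references to the same object

# pickle memoizes by object identity, so the repeated references become
# back-references rather than 100 full copies.
print(len(pickle.dumps(big)))        # roughly 8 MB
print(len(pickle.dumps(many_refs)))  # still roughly 8 MB, not ~800 MB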

It's interesting that dask distributed seems to do the same thing when serializing graphs. I was imagining there might be cases where tasks get individually shipped off to different processes (requiring the Zarr array to be re-serialized), but maybe not? Or maybe the overhead in these cases doesn't matter?

In that case, perhaps it is always the right answer to use `inline_array=True` with Zarr, and for other uses of `from_array` with serializable arrays (e.g., in xarray). The only reason we can't do it generically is that many on-disk array types, like those from h5py, can't be pickled.
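
As a purely hypothetical heuristic (not something this PR implements), inlining could be gated on whether the array survives a pickle round trip, which is exactly the property that h5py-backed arrays lack:

import pickle

def safe_to_inline(array_like):
    # Hypothetical helper, not part of dask: only inline objects that can
    # be pickled, since inlining embeds the object in every chunk task.
    try:
        pickle.loads(pickle.dumps(array_like))
        return True
    except Exception:
        return False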

@martindurant (Member):

I wonder how this compares to an explicit `client.scatter(broadcast=True)` for the array object, which only serialises and sends once, but you get copies everywhere. Of course, that breaks the graph/execute boundary, but I'm wondering about the conceptual model here.
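
For reference, a rough sketch of the scatter-based alternative being described, assuming a running distributed cluster and the zarr files from the example above:

import zarr
from distributed import Client

client = Client()  # assumes a distributed scheduler is available

# Serialize the zarr array once and send a copy to every worker up front.
z = zarr.open('saved_x1.zarr', mode='r')
[z_future] = client.scatter([z], broadcast=True)

# Any graph would then have to be built against z_future rather than z,
# which is the "graph/execute boundary" being broken.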

@TomAugspurger (Member, Author) commented Nov 5, 2020

Just a guess, but I think the main (only?) difference would be that `client.scatter(broadcast=True)` creates just one instance on the client (before the scatter) and many instances on the cluster. With this PR and `inline_array=True`, I think you get many instances on both the client and the cluster.

That said, `client.scatter` is only a solution for the distributed scheduler. This issue affects the local schedulers too, and the pain is felt most acutely there, where static ordering tends to be even more important for keeping memory use down.

@martindurant (Member):

Thanks for the thoughts @TomAugspurger. I would certainly not advocate for special workarounds that only apply to distributed, unless there was a really compelling reason.

@TomAugspurger (Member, Author) commented Nov 16, 2020

The main outstanding issue right now is the default for `from_zarr`: it's possible that `from_zarr(..., inline_array=True)` is strictly better (we already know it's not strictly better for the general `from_array`).

Any objections to just merging this as is and getting feedback from users on if/when `inline_array=True` is worse? I'll be sure to make the Pangeo community aware of this option, and will try to gather feedback.

@jrbourbeau (Member) left a comment:

Thanks @TomAugspurger! This looks good overall; I've just got a few small comments.

@jrbourbeau merged commit e1301de into dask:master Dec 19, 2020
@TomAugspurger (Member, Author):

Thanks James!

Successfully merging this pull request may close: Independently chunked computation loads entire zarr arrays into memory