Add inline_array option to from_array #6773
Conversation
This adds an option to `da.from_array` to control how the array object itself is included in the task graph. By default, there's no change: there will be a single key whose value is the array object, and the tasks producing each chunk of the array refer to the array by that key. With `inline_array=True`, each chunk task in the task graph will have the array itself included in its values. I think this closes dask#6668. I don't think we would automatically switch the default to True, though doing so would certainly close it.
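As a rough sketch of the difference (a toy NumPy example; the exact key counts depend on the chunking):

```python
import numpy as np
import dask.array as da

arr = np.ones((1000, 1000))

# Default: the original array sits under a single key, and each chunk
# task refers to it by that key.
x = da.from_array(arr, chunks=(250, 1000))

# Inlined: the array object itself is embedded in every chunk task, so
# there is no separate key holding the original array.
y = da.from_array(arr, chunks=(250, 1000), inline_array=True)

# The non-inlined graph has one extra key (the one holding `arr`).
print(len(x.dask), len(y.dask))
```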
|
cc @shoyer, if this is what you had in mind for the nicer API. Functionally, I think it's equivalent to the custom inline optimization; it just happens at graph construction time.

```python
import dask.array
import numpy as np
import dask

# create data on disk
n = 125 * 4
x = dask.array.zeros((12500, 10000), chunks=('10MB', -1))
dask.array.to_zarr(x, 'saved_x1.zarr', overwrite=True)
dask.array.to_zarr(x, 'saved_y1.zarr', overwrite=True)
dask.array.to_zarr(x, 'saved_x2.zarr', overwrite=True)
dask.array.to_zarr(x, 'saved_y2.zarr', overwrite=True)

x1 = dask.array.from_zarr('saved_x1.zarr', inline_array=True)
y1 = dask.array.from_zarr('saved_x2.zarr', inline_array=True)
x2 = dask.array.from_zarr('saved_y1.zarr', inline_array=True)
y2 = dask.array.from_zarr('saved_y2.zarr', inline_array=True)

def evaluate(x1, y1, x2, y2):
    u = dask.array.stack([x1, y1])
    v = dask.array.stack([x2, y2])
    components = [u, v, u ** 2 + v ** 2]
    return [
        abs(c[0] - c[1]).mean(axis=-1)
        for c in components
    ]

dask.visualize(evaluate(x1[:n], y1[:n], x2[:n], y2[:n]), optimize_graph=True,
               color="order", cmap="autumn", node_attr={"penwidth": "4"})
```

versus this task graph with […]. I guess there may be one remaining task from #6668; I'm still thinking through that. |
|
I added some documentation on this. @mrocklin IIRC, in the past you've had some hesitation about documenting these failures. I believe your objections were along the lines of

1. It's hard to generalize from a particular example and apply it to your problem.
2. It's possible (likely?) that any specific ordering problem can be fixed, rendering the example obsolete. (I'm kicking around some ideas about somehow ignoring common leaf nodes in ordering that might fix Stephan's issue.)

I agree with those downsides, but I still think this is worth including with some explicit caveats:

1. This is an advanced topic that most people shouldn't hit.
2. Even when you hit an ordering issue, it *may* not be the end of the world.
3. Making changes to achieve "ideal" ordering may have other downsides that slow down the overall computation time.

[image: FireShot Capture 013 - Ordering — Dask documentation]
|
|
Please feel free to override any previous objections that I may have had.
|
|
From #6203 (comment)

Is it surprising that even with inlined arrays, we see about the same serialized size?

```python
In [8]: import zarr
   ...: import numpy as np
   ...: import dask.array as da
   ...: from distributed.protocol import serialize
   ...:
   ...: a = zarr.ones((12_500, 10_000), chunks=(125, 10_000))
   ...: a[:] = np.random.random(size=a.shape)
   ...:
   ...: x1 = da.from_zarr(a)
   ...:
   ...: x2 = da.from_zarr(a, inline_array=True)
   ...:
   ...: len(x1.dask), len(x2.dask)
   ...:
   ...: len(serialize(x1)[1][0]) / len(serialize(x2)[1][0])
Out[8]: 0.9999993250547557
```

I would have expected the inlined version to be about 100x larger, since there are 100 references to the array. But someone (dask? pickle?) is clever enough to handle that case. So I guess it comes down to the cost of creating these Zarr objects (including their stores?). |
Right, pickle is very clever. Each Python object it encounters only gets serialized once. It's interesting that dask distributed seems to do the same thing when serializing graphs. I was imagining there might be cases where tasks get individually shipped off to different processes (requiring the Zarr array to be re-serialized), but maybe not? Or maybe the overhead in these cases doesn't matter? In that case, perhaps it is always the right answer to use `inline_array=True`?
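For example, plain `pickle` already de-duplicates repeated references within a single `dumps` call (a quick sketch, nothing dask-specific):

```python
import pickle
import numpy as np

arr = np.random.random((1000, 1000))    # ~8 MB of data

one_ref = pickle.dumps([arr])            # a single reference
many_refs = pickle.dumps([arr] * 100)    # 100 references to the same object

# pickle memoizes objects it has already serialized, so 100 references
# cost barely more than one.
print(len(many_refs) / len(one_ref))     # ~1.0, not ~100
```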
|
I wonder how this compares to an explicit `client.scatter(broadcast=True)` for the array object, which only serialises and sends once, but you get copies everywhere. Of course, that breaks the graph/execute boundary; but wondering about the conceptual model here.
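For concreteness, a rough sketch of what I mean (reusing one of the zarr stores from the example above; purely illustrative and separate from what this PR does):

```python
from dask.distributed import Client
import zarr

client = Client()
a = zarr.open("saved_x1.zarr")

# broadcast=True serializes `a` once and sends a copy to every worker.
future = client.scatter(a, broadcast=True)

# Tasks that receive `future` as an argument use the worker-local copy,
# rather than carrying the array object around in the graph itself.
chunk_sum = client.submit(lambda arr, i: arr[i * 125:(i + 1) * 125].sum(), future, 0)
print(chunk_sum.result())
```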
|
Just a guess, but I think the main (only?) difference would be that […]. That said, […]. |
|
Thanks for the thoughts @TomAugspurger. I would certainly not advocate for special workarounds that apply only to distributed, unless there was a really compelling reason. |
|
The main outstanding issue right now is the default for `inline_array`. Any objections to just merging this as is, and getting feedback from users on if / when `inline_array=True` helps? |
jrbourbeau left a comment:
Thanks @TomAugspurger! This looks good overall, I've just got a few small comments
Co-authored-by: James Bourbeau <[email protected]>
|
Thanks James! |



This adds an option to `da.from_array` to control how the array object itself is included in the task graph. By default, there's no change: there will be a single key whose value is the array object, and the tasks producing each chunk of the array refer to the array by that key. With `inline_array=True`, each chunk task in the task graph will have the array itself included in its values.

The same keyword is added to `from_zarr`, which uses `from_array` internally.

I think this closes #6668. I don't think we would automatically switch the default to True, though doing so would certainly close it.