Inline zarr.Array in da.from_zarr #6203
This improves task fusion.
| assert "auto" in str(info.value) | ||
|
|
||
|
|
||
| def test_from_zarr_unique_name(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved all the Zarr tests to their own file. Hope that's OK.
Review comment on:

```python
from dask.utils import tmpdir
```

These are just moved tests.
Review comment on:

```python
def test_from_zarry_inline_getter():
```

This is the only new test, asserting that we're using getter_inline in from_zarr.
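For reference, a minimal sketch of what that assertion could look like (my illustration, not the PR's actual test; the temporary-store setup is assumed):

```python
import zarr
import dask.array as da
from dask.array.core import getter_inline


def test_from_zarr_inline_getter(tmpdir):
    # Write a small zarr array to a temporary directory store.
    z = zarr.zeros((4, 4), chunks=(2, 2), store=str(tmpdir))
    arr = da.from_zarr(z)
    # With this PR, chunk tasks are built with getter_inline rather than
    # the default getter, so the optimizer later inlines the zarr.Array.
    task_functions = [
        t[0] for t in dict(arr.dask).values() if isinstance(t, tuple) and t
    ]
    assert getter_inline in task_functions
```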
This functionality has flopped back and forth a couple of times. I don't recall the PR/issue where it was brought up before, but I think the problem was that people were using data stores that were not as trivial to move around as Zarr files. Perhaps …

Thanks for investigating this, Tom. It seems like you are on the right track to solving some long-standing performance issues related to large distributed reads.

I didn't see much in the git blame. In the original implementation https://github.com/dask/dask/pull/3561/files @jakirkham didn't inline the Array. Perhaps it was HDF?

I'll try this out with some zarr datasets in cloud storage later today.
Sorry, I think the issue I'm thinking of focused on from_array.

In general though, I think that this will break if someone points a Dask array to a Zarr array that uses in-memory storage.

Maybe there was also an issue with storing or creating the object many times? Perhaps authentication was taking a while, or the serialized form of the task was large enough that people wanted to store it once rather than n times?

In general, treating the Zarr array separately seems somewhat cleaner. It would be good to see if there is a deeper issue that can be solved here. Especially if this is funded work, it would be good to frontload cost rather than have this change pop up later and have to be handled by volunteers.
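To make the in-memory concern concrete, a minimal sketch (my own illustration, assuming zarr's default in-memory store):

```python
import zarr
import dask.array as da

# A zarr array backed by zarr's default in-memory store: the data itself
# lives inside the store object rather than on disk.
z = zarr.zeros((1000, 1000), chunks=(100, 100))
arr = da.from_zarr(z)

# Held under a single "from-zarr-..." key, the store is serialized and
# shipped to workers once. Inlined into every getter task instead, each of
# the 100 chunk tasks would drag a copy of the whole store along with it.
print(len(arr.dask))  # ~100 chunk tasks plus the one key for the array
```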
Hmmm, unfortunately this doesn't look like an unambiguous win. Whether or not it's good to inline the getitem depends on the downstream operations. I'm going to explore why the presence of the root node causes issues with fusion in the first place.

My guess is I just did whatever was easiest to get us started :)
@TomAugspurger, I'm confused by what you say you want to do and what you did in this PR. To me, it looks like the Zarr array wasn't inlined. The getters were inlined. Please let me know if I misunderstood something. Also, I wouldn't expect …

Looking at the example from #5105 (comment):

```python
import dask
import dask.array as da

a = da.random.random((4, 4000, 4000), chunks=(1, 4000, 4000))
da.to_zarr(a, "/tmp/data/original.zarr", overwrite=True)
data = da.from_zarr("/tmp/data/original.zarr")
desired = data.rechunk({0: 4, 1: 2000, 2: 2000})

data2 = data.rechunk({1: 2000, 2: 2000})
data2.to_zarr("/tmp/data/temp.zarr", overwrite=True)
data3 = da.from_zarr("/tmp/data/temp.zarr")
data4 = data3.rechunk({0: 4})

with dask.config.set(**{"optimization.fuse.ave-width": 4}):
    display(data4.visualize(optimize_graph=True))
```

This PR gives (due to inlining, not fuse, so …

For this example, …

```diff
diff --git a/dask/array/core.py b/dask/array/core.py
index 4d38c63c..c4dda7b7 100644
--- a/dask/array/core.py
+++ b/dask/array/core.py
@@ -2861,7 +2861,11 @@ def from_zarr(
     chunks = chunks if chunks is not None else z.chunks
     if name is None:
         name = "from-zarr-" + tokenize(z, component, storage_options, chunks, **kwargs)
-    return from_array(z, chunks, name=name)
+    arr = from_array(z, chunks, name=name)
+    alias = 'alias-' + name
+    arr.dask.layers[name][alias] = arr.dask.layers[name][name]
+    arr.dask.layers[name][name] = alias
+    return arr


 def to_zarr(
```

As an observation, inlining is much more overzealous than fusing and more difficult to control. Is there a getter that isn't held by …
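If I read that diff right, the trick is to park the zarr.Array behind a second key, so the getters reference a trivial alias instead of the object itself. Roughly (key names made up):

```python
# Illustrative graph shape after the alias patch; not actual dask output.
graph = {
    "alias-from-zarr-abc": "<the zarr.Array object, stored once>",
    "from-zarr-abc": "alias-from-zarr-abc",  # alias task: key -> key
    ("from-zarr-abc", 0, 0): ("getter", "from-zarr-abc", "<slice>"),
}
# The optimizer can inline the cheap alias into getter tasks without
# copying the zarr.Array itself into every task.
```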
Answering my own question... here's a less hacky way (learned from this PR) to avoid both …

```diff
diff --git a/dask/array/core.py b/dask/array/core.py
index 4d38c63c..02b91d6a 100644
--- a/dask/array/core.py
+++ b/dask/array/core.py
@@ -135,6 +135,10 @@ def getter_inline(a, b, asarray=True, lock=None):
     return getter(a, b, asarray=asarray, lock=lock)


+def getter_nohold(a, b, asarray=True, lock=None):
+    return getter(a, b, asarray=asarray, lock=lock)
+
+
 from .optimization import optimize, fuse_slice
@@ -2861,7 +2865,7 @@ def from_zarr(
     chunks = chunks if chunks is not None else z.chunks
     if name is None:
         name = "from-zarr-" + tokenize(z, component, storage_options, chunks, **kwargs)
-    return from_array(z, chunks, name=name)
+    return from_array(z, chunks, name=name, getitem=getter_nohold)


 def to_zarr(
```
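If I follow, the new getter dodges both special cases in dask.array's optimizer: getter_inline is what triggers inlining, while the default getter is what causes the data node to be held in place. A quick, hypothetical way to confirm the patch took effect (assumes the patch above is applied and an on-disk zarr array already exists):

```python
import dask.array as da
from dask.array.core import getter_nohold  # exists only with the patch above

arr = da.from_zarr("original.zarr")  # any existing on-disk zarr array
# Every chunk task should now call getter_nohold, which the optimizer
# neither inlines (as it would getter_inline) nor holds (as it would getter).
tasks = dict(arr.dask).values()
assert any(isinstance(t, tuple) and t and t[0] is getter_nohold for t in tasks)
```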
FWIW, I may be abandoning this PR. It's not obviously better on all workloads. I also think that Jim's PR to fuse Delayed achieves what I wanted, but will need to double check.

Yes, I think that you're correct.

Which PR is that?

Must be #6222, and I'm curious how you expect that to help here. What workloads perform poorly with this PR?

Thanks Erik! Would also be curious to know. Though guessing these are some of the issues.

I'm a bit curious about this point, @TomAugspurger. Is there a workload where this would behave worse? If so, by how much?
Sorry for the delay in responding here. #6773 is making this a user-facing keyword.

```python
import dask
import dask.array as da

original = "original.zarr"
a = da.random.random(size=(4, 4000, 4000), chunks=(1, 4000, 4000))
a.to_zarr(original, overwrite=True)

b = da.from_zarr(original)
c = b.rechunk({1: 2000, 2: 2000})
with dask.config.set(**{"optimization.fuse.ave-width": 4}):
    c.visualize(optimize_graph=True, color="order", cmap="autumn", filename="bad")

b = da.from_zarr(original, inline_array=True)
c = b.rechunk({1: 2000, 2: 2000})
with dask.config.set(**{"optimization.fuse.ave-width": 4}):
    c.visualize(optimize_graph=True, color="order", cmap="autumn", filename="good")
```
Closed #6203.

Should from_zarr provide sane defaults based on the source of the Zarr array? Do we know what those sane defaults would be?

I might not be following the question. Could you please provide an example? Think I'm just missing context 😅
I'm asking what the default value for inlining should be for the from_zarr function. Should from_zarr inline or not inline? Are there things about a zarr array that could help us better choose a default value?

For example, if the zarr array holds data in memory, then it should probably not inline. If it points to data on-disk then maybe it's ok? (Or maybe not, and we should consider caching like Xarray does.)





Inlining means that we take a graph like

```python
{
    "zarr-array": (zarr.Array, ...),
    "arr-0-0": (getitem, "zarr-array", 0, 0),
    "arr-0-1": (getitem, "zarr-array", 0, 1),
}
```

and replace it with something like
In this case, we want to inline the zarr.Array to ensure that we get good task fusion later on. For reasons currently beyond me, the presence of a root node confuses our fusion algorithms. Compare the two:
On master: [task-graph screenshot]

On this PR: [task-graph screenshot]
Failure to fuse tends to cause excess communication and memory issues on the distributed scheduler. Compare the task graph panes in these performance reports:

- `c.to_zarr()` on master: https://gistcdn.rawgit.org/TomAugspurger/1ebffe84db72e630229efb3a4b9cd7d2/19f63c5b972332708ffec041927f0cd764a302a6/master.html
- `c.to_zarr()` on this PR: https://gistcdn.rawgit.org/TomAugspurger/1fae16981f23a8ff892f120da7dbdba4/c1101da0e41759d2f81642369ed48e1c910147f7/pr.html
xref #5105. There's one more issue with to_zarr that I want to track down. I'm seeing what looks like unnecessary data moving around, but need to verify.

cc @alimanfoo, @rabernat, and @rsignell-usgs since this has affected you.
Setting the fuse_ave_width to be greater than or equal to the rechunk factor (4 in this case) is quite important right now. I'm hopeful we'll be able to do this automatically in the future.