Conversation

@djhoese (Contributor) commented Oct 6, 2022

See #9533 for details. This reverts the logic implemented in #3754, which was a fix for #3751.

CC @DPeterK, who opened the original issue #3751.

The summary is that users shouldn't assume anything about a dask Array's memory allocation or usage. Once computed, an Array.copy may return the same block of allocated memory as its from_array input, or it may return a newly allocated block. Dask is free to use memory however it deems most efficient, or at least that's the idea. What should be consistent is that a no-op operation (which is how .copy is currently documented) does not add additional tasks to the dask graph, as the snippet below shows.

    import numpy as np
    import dask.array as da

    x = np.arange(10)
    y = da.from_array(x, chunks=5)
    y_c = y.copy()
    assert y.name == y_c.name  # .copy() added no new tasks to the graph

    y_c = y.copy()
    assert y is not y_c
    assert y.compute() is not y_c.compute()
    assert y.name == y_c.name
@djhoese (Contributor, Author) commented on the diff:

@ian-r-rose I just realized, maybe I should add an assert y is not y_c here too, to enforce/confirm/define the expected behavior of .copy: it makes a new instance of the Array object but keeps the same task graph.

@DPeterK commented Oct 7, 2022

Thanks @djhoese for the CC! Searching back through my memory (and the old issues referenced here), I think the problem was (and still remains, actually) that Iris expects that, after a copy of a Cube, the two arrays are different entities; that is, a is not b. Copying a Cube will eventually copy its data array, and Iris expects that these are not the same item in memory, which is important for most of the operations where Iris makes a copy of a Cube.

However, that is an Iris thing and not a dask thing... I guess what would help Iris, then, is a mechanism (perhaps this is just deepcopy?) that lets Iris make a new dask array in memory when copying a Cube.
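
(For illustration: one such mechanism, sketched below, is to add an explicit per-chunk copy task with map_blocks rather than relying on deepcopy - this is a suggestion, not an existing dask .copy contract.)

    import numpy as np
    import dask.array as da

    x = np.arange(10)
    y = da.from_array(x, chunks=5)

    # Add an explicit copy task per chunk, so the computed result is
    # guaranteed to live in freshly allocated memory.
    y_deep = y.map_blocks(np.copy)

    assert y_deep.name != y.name                      # a real task was added
    assert not np.shares_memory(y_deep.compute(), x)  # result is a new buffer

The copy stays lazy here: no memory is duplicated until compute() runs.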

@djhoese (Contributor, Author) commented Oct 7, 2022

> Copying a Cube will eventually copy its data array, and Iris expects that these are not the same item in memory, which is important for most of the operations where Iris makes a copy of a Cube.

The dask Array objects are different and the task graphs are the same; if a single-chunk numpy array is given to da.from_array, then dask's current threaded scheduler implementation will give you the original numpy array back when you call dask_arr.compute(). Could you shed more light on why Iris needs the memory to not be the same? Does it modify the numpy array under the hood in a map_blocks function or other similar task? Or does it modify the numpy array that it originally gave to dask?
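
(A minimal sketch of the behavior described above; the shared-memory assertion reflects dask's current single-chunk from_array behavior, not a documented guarantee.)

    import numpy as np
    import dask.array as da

    x = np.arange(10)
    y = da.from_array(x, chunks=-1)  # one chunk covering the whole array
    y_c = y.copy()

    assert y is not y_c        # distinct Array objects...
    assert y.name == y_c.name  # ...sharing an identical task graph

    # With a single chunk, the computed result can share (or even be) the
    # original buffer, so mutating it in place would also be visible via x.
    assert np.shares_memory(y_c.compute(), x)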

@djhoese (Contributor, Author) commented Oct 13, 2022

@DPeterK Any other comments on this?

@ian-r-rose Any chance this can get merged?

@ian-r-rose (Collaborator) left a comment:

I'm happy with this, thanks @djhoese. I, too, am curious about @DPeterK's use case that requires the underlying array data to be different pieces of memory, and whether we can address that in a way that is more dask-y. But I also suspect that the previous implementation, where copies only happened for single-partition arrays, wasn't really solving that use case either.

@ian-r-rose mentioned this pull request Oct 13, 2022
@djhoese (Contributor, Author) commented Oct 14, 2022

> But I also suspect that the previous implementation, where copies only happened for single-partition arrays, wasn't really solving that use case either.

If it is like my case, the single-partition/chunk case was only used in unit tests and is probably rare in real-world code.

@ian-r-rose (Collaborator) commented:

Thanks @djhoese!

@ian-r-rose ian-r-rose merged commit d6938ac into dask:main Oct 14, 2022
trexfeathers added a commit to trexfeathers/iris that referenced this pull request Oct 31, 2022
trexfeathers added a commit to trexfeathers/iris that referenced this pull request Nov 2, 2022
trexfeathers added a commit to trexfeathers/iris that referenced this pull request Nov 2, 2022
pp-mo pushed a commit to SciTools/iris that referenced this pull request Nov 4, 2022
* Replicate legacy dask behaviour pre dask/dask#9555.

* Revert "Replicate legacy dask behaviour pre dask/dask#9555."

This reverts commit 7363412.

* Accept shared NumPy arrays when copying certain Dask arrays - dask/dask#9555.

* Updated lock-files.

* What's New entry.

* Re-arrange What's New entry.
@rcomer (Contributor) commented Nov 5, 2022

> If it is like my case, the single-partition/chunk case was only used in unit tests and is probably rare in real-world code.

For info, this is basically the conclusion we came to in Iris. Now we wait to see if any user tells us otherwise!
SciTools/iris#5041

@DPeterK commented Nov 7, 2022

Oh sorry @djhoese - glad you went ahead and merged rather than waiting on me...

As @rcomer says, Iris had a test that expected this behaviour; indeed, this test was why I introduced the former behaviour into dask in the first place. The Iris devs think there was a reason we expected this behaviour in Iris - something like it being necessary for some math operations - but we can't remember precisely what the use case was, nor whether that use case is now legacy as well.

Otherwise I think this change is a positive step for dask, and I reckon Iris should just follow what dask does for array management. If we find there are user problems, we can deal with them if they arise...

@djhoese djhoese deleted the revert-dask-copy-new-task branch April 5, 2023 14:45
Development

Successfully merging this pull request may close these issues:

Array.copy is not a no-op
