-
-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Extend __setitem__ to more closely match numpy #7033
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
cc @shoyer do you know anyone who would be interested in reviewing this? |
dask/array/core.py
Outdated
| len(index), size, index | ||
| ) | ||
| ) | ||
| index = where(index)[0].compute_chunk_sizes() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This conversion (boolean dtype -> positive indices) seems like potentially not worth the trouble.
where(cond) which requires computation at graph construction time (to determine chunk sizes), is generally best avoided, unlike the three argument form where(cond, x, y) can done lazily,.
Also, it looks everything in this code path needs to get converted into a slice(), which seems very restrictive and thus unlikely to be useful. This is particularly surprising given that dask already supports x[index] = y for arbitrary multi-dimensional boolean index.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for looking at this, @shoyer - very much appreciated.
Only 1-d integer arrays that can be converted to slices (like [3, 2, 1], or [1, 3, 5]) are converted. I have found in the past that this is worth the effort because the numpy __setitem__ is much, much faster with slices, saving more time than we wasted on checking the 1-d array for regularity.
Other 1-d arrays indices (like [1, 2, 5]) remain as they are.
I will have a think about how we can process boolean indices without converting them to integers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will have a think about how we can process boolean indices without converting them to integers.
... at least, perhaps, until compute time ....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shoyer - apologies, I see now that you had spotted that even integer indices are converted to slices, which is indeed overly restrictive, and caused a bug (#7033 (comment)). This is bug is now fixed, and I'm thinking about the boolean indices issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The premature compute from the one-argument where has gone. The move from regular integer sequences to slices has moved to slicing.setitem and done chunkwise at compute time (c46ccb2)
So, I think there is very little work done now at the time of defining the assignment. The only significant thing that remains is a test for strict monotonicity of integer array indices - if such an index is a dask array then it is computed at at the time of the assignment definition. It is my hope, however, that we can, in the future, remove the need to this test, and allow other type of integer sequences (liket Array.__getitem__ and numpy do).
Illviljan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly grammar stuff from me at this point. I appreciate all the comments in the code.
Perhaps all the extra features being added is well worth it but I was wondering with all this new code is there any change in performance?
Co-authored-by: Illviljan <[email protected]>
Co-authored-by: Illviljan <[email protected]>
Co-authored-by: Illviljan <[email protected]>
Co-authored-by: Illviljan <[email protected]>
|
I have spotted a bug that causes, e.g. >>> x = np.ma.arange(60).reshape((6, 10))
>>> dx = da.from_array(x.copy(), chunks=(2, 2))
>>> dx[:, 0] = dx[:, 0]
ValueError: Can't broadcast data with shape (6,) across shape (6, 1)
>>> dx[:, 0] = dx[:, 0:1] # But this is OKThis is related to how the indices are parsed, and is probably a legacy of the code lifted from cf-python, which always retains a size 1 dimension from an integer valued index, whilst numpy and dask drops it. I'm looking into a fix for this so that |
|
02e1b43 fixes #7033 (comment) |
dask/array/core.py
Outdated
| from .routines import where # , count_nonzero | ||
| from .slicing import parse_assignment_indices, setitem |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The where function is not used in your parts right? Then that's an unnecessary import when isinstance(key, Array) = False. Perhaps move that row down to just before the try: y = where(key, value, self)?
Same idea with parse_assignment and setitem. Those are not needed when isinstance(key, Array) = True Although since .slicing has already been loaded at the top of the module so maybe parse_assignment_indices, setitem should just be defined at the top as well?
|
I have realised that the code I have written accesses every chunk, regardless of whether or not is being assigned to. For data on disk this is clearly not very efficient. However, since writing it I have learnt all about graphs and think it is possible to refactor the code away from using # Create graph that only includes the chunks being assigned
dsk = ...
graph = HighLevelGraph.from_collections(out, dsk, ...)
y = Array(graph, ...)
self._meta = y._meta
self.dask = y.dask
self.name = y.name
self._chunks = y.chunks
return selfThis will involve running some of the logic currently in Thanks for bearing with me - I should have been more clued up to some of the mechanics of __getitem__, first. |
|
OK - no surprise that I've run into similar ground as dealt with |
|
Oh dear - sorry! There was nothing particularly wrong with the |
|
It looks like this is ready and has just fallen off the radar. Is that correct? |
|
Hi - I was just about to ask where we were with it - thanks for keeping tabs. I think that I've responded to all of the review issues (for which thanks), so I hope that it is ready, too ... |
|
Hi @shoyer, @Illviljan - Thanks for all of your input so far - have you had a chance to look over the current state of this PR? David |
|
We talked about this in the maintainer meeting today and @jrbourbeau brought up that maybe we should merge this after the release on Friday so that there is some time for it to run in the xarray upstream tests. |
|
Ok merging now! Thanks @davidhassell |
|
Great - and thanks, all, for helping me with this |
black dask/flake8 daskFixes #7029