
Conversation

@davidhassell
Contributor

  • Tests added / passed
  • Passes black dask / flake8 dask

Fixes #7029

@davidhassell davidhassell changed the title from "Extend the __setitem__ to more closely match numpy" to "Extend __setitem__ to more closely match numpy" Jan 6, 2021
@max-sixty max-sixty mentioned this pull request Jan 6, 2021
@mrocklin
Member

mrocklin commented Jan 6, 2021

cc @shoyer do you know anyone who would be interested in reviewing this?

len(index), size, index
)
)
index = where(index)[0].compute_chunk_sizes()
Member

@shoyer shoyer Jan 6, 2021

This conversion (boolean dtype -> positive indices) seems like potentially not worth the trouble.

The one-argument form where(cond), which requires computation at graph construction time (to determine chunk sizes), is generally best avoided, unlike the three-argument form where(cond, x, y), which can be done lazily.

Also, it looks like everything in this code path needs to get converted into a slice(), which seems very restrictive and thus unlikely to be useful. This is particularly surprising given that dask already supports x[index] = y for an arbitrary multi-dimensional boolean index.
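
To make the distinction concrete, here is a small self-contained example (illustrative only, not code from this PR), assuming numpy and dask.array:

    import numpy as np
    import dask.array as da

    x = da.from_array(np.arange(12), chunks=4)
    cond = x % 3 == 0

    # One-argument where: the positional indices have unknown chunk sizes,
    # and resolving them (compute_chunk_sizes) computes while the graph is
    # still being constructed.
    idx = da.where(cond)[0]
    print(idx.chunks)                        # ((nan, nan, nan),)
    print(idx.compute_chunk_sizes().chunks)  # ((2, 1, 1),)

    # Three-argument where: fully lazy, chunk sizes known up front.
    y = da.where(cond, -1, x)
    print(y.chunks)                          # ((4, 4, 4),)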

Contributor Author

Thanks for looking at this, @shoyer - very much appreciated.

Only 1-d integer arrays that can be converted to slices (like [3, 2, 1], or [1, 3, 5]) are converted. I have found in the past that this is worth the effort because the numpy __setitem__ is much, much faster with slices, saving more time than is spent checking the 1-d array for regularity.

Other 1-d array indices (like [1, 2, 5]) remain as they are.
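
To illustrate the kind of regularity check being described, a minimal hypothetical sketch (not the code in this PR):

    import numpy as np

    def index_to_slice(index):
        # Convert a regularly spaced 1-d integer index (constant non-zero
        # step, e.g. [3, 2, 1] or [1, 3, 5]) to an equivalent slice, which
        # numpy's __setitem__ handles much faster; otherwise return it as-is.
        index = np.asanyarray(index)
        if index.ndim != 1 or index.size == 0:
            return index
        if index.size == 1:
            start = int(index[0])
            return slice(start, start + 1)
        steps = np.diff(index)
        step = int(steps[0])
        if step == 0 or not (steps == step).all():
            return index  # irregular spacing, e.g. [1, 2, 5]
        start = int(index[0])
        stop = int(index[-1]) + (1 if step > 0 else -1)
        if step < 0 and stop < 0:
            stop = None  # descending run that reaches index 0
        return slice(start, stop, step)

    print(index_to_slice([1, 3, 5]))  # slice(1, 6, 2)
    print(index_to_slice([3, 2, 1]))  # slice(3, 0, -1)
    print(index_to_slice([1, 2, 5]))  # unchanged: [1 2 5]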

I will have a think about how we can process boolean indices without converting them to integers.

Contributor Author

I will have a think about how we can process boolean indices without converting them to integers.

... at least, perhaps, until compute time ....

Contributor Author

@shoyer - apologies, I see now that you had spotted that even integer indices are converted to slices, which is indeed overly restrictive, and caused a bug (#7033 (comment)). This bug is now fixed, and I'm thinking about the boolean index issues.

Contributor Author

The premature compute from the one-argument where has gone. The conversion of regular integer sequences to slices has moved to slicing.setitem and is now done chunkwise at compute time (c46ccb2).

So, I think there is very little work done now at the time of defining the assignment. The only significant thing that remains is a test for strict monotonicity of integer array indices - if such an index is a dask array then it is computed at the time of the assignment definition. It is my hope, however, that we can, in the future, remove the need for this test and allow other types of integer sequences (like Array.__getitem__ and numpy do).
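
As an illustration of that check (a hypothetical helper, not the PR's code), and of why a dask-array index forces a compute at definition time:

    import numpy as np
    import dask.array as da

    def is_strictly_monotonic(index):
        # A dask-array index has to be computed before its values
        # can be inspected, hence the compute at definition time.
        if isinstance(index, da.Array):
            index = index.compute()
        steps = np.diff(index)
        return bool((steps > 0).all() or (steps < 0).all())

    print(is_strictly_monotonic(da.from_array(np.array([1, 3, 5]), chunks=2)))  # True
    print(is_strictly_monotonic(np.array([3, 1, 2])))                           # False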

Contributor

@Illviljan Illviljan left a comment

Mostly grammar stuff from me at this point. I appreciate all the comments in the code.

Perhaps all the extra features being added are well worth it, but I was wondering: with all this new code, is there any change in performance?

@davidhassell
Contributor Author

davidhassell commented Jan 10, 2021

I have spotted a bug that causes, e.g.

>>> import numpy as np
>>> import dask.array as da
>>> x = np.ma.arange(60).reshape((6, 10))
>>> dx = da.from_array(x.copy(), chunks=(2, 2))
>>> dx[:, 0] = dx[:, 0]
ValueError: Can't broadcast data with shape (6,) across shape (6, 1)
>>> dx[:, 0] = dx[:, 0:1]  # But this is OK

This is related to how the indices are parsed, and is probably a legacy of the code lifted from cf-python, which always retains a size-1 dimension from an integer-valued index, whilst numpy and dask drop it.

I'm looking into a fix for this so that dx[:, 0] = dx[:, 0] works as expected.
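
For reference, the shape difference behind the broadcast error is just standard numpy behaviour:

>>> x[:, 0].shape    # scalar integer index drops the axis
(6,)
>>> x[:, 0:1].shape  # length-1 slice keeps it
(6, 1)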

@davidhassell
Contributor Author

02e1b43 fixes #7033 (comment)

Comment on lines 1580 to 1581
from .routines import where # , count_nonzero
from .slicing import parse_assignment_indices, setitem
Contributor

The where function is not used in your parts, right? Then that's an unnecessary import when isinstance(key, Array) is False. Perhaps move that line down to just before the try: y = where(key, value, self)?

Same idea with parse_assignment_indices and setitem: those are not needed when isinstance(key, Array) is True. Although, since .slicing has already been imported at the top of the module, maybe parse_assignment_indices and setitem should just be imported at the top as well?
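
A rough structural sketch of that suggestion (bodies elided; the names mirror the snippet under review, not the actual dask source):

    def __setitem__(self, key, value):
        if isinstance(key, Array) and key.dtype == bool:
            # where is only needed on the boolean-mask branch
            from .routines import where

            y = where(key, value, self)
            ...
        else:
            # the slicing helpers are only needed on this branch
            from .slicing import parse_assignment_indices, setitem

            ...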

@davidhassell
Contributor Author

I have realised that the code I have written accesses every chunk, regardless of whether or not it is being assigned to. For data on disk this is clearly not very efficient.

However, since writing it I have learnt all about graphs and think it is possible to refactor the code away from using map_blocks and follow a similar approach to __getitem__:

            # Create graph that only includes the chunks being assigned 
            dsk = ...
            graph = HighLevelGraph.from_collections(out, dsk, ...)
            y = Array(graph, ...)

            self._meta = y._meta
            self.dask = y.dask
            self.name = y.name
            self._chunks = y.chunks
            return self

This will involve running some of the logic currently in array.slicing.setitem at assignment time rather than compute time, but that should be OK as it's just index wrangling, and doesn't need to start any of its own compute processes.

Thanks for bearing with me - I should have been more clued up on some of the mechanics of __getitem__ first.

@davidhassell
Contributor Author

OK - no surprise that I've run into similar ground to that dealt with by slice_with_bool_dask_array, i.e. we don't know the values of a dask array index before we've computed it. I'm working on it ... (and should be able to reuse a lot more of the __getitem__ functions, too, I hope)

@davidhassell
Contributor Author

Oh dear - sorry! There was nothing particularly wrong with the map_blocks approach, after all - the assigned chunks are still only accessed at compute time, so no problem. The other approach I suggested was a mistake (perhaps encouraged by my new understanding of graphs). I'll write some documentation to make amends.

@jsignell
Member

It looks like this is ready and has just fallen off the radar. Is that correct?

@davidhassell
Contributor Author

Hi - I was just about to ask where we were with it - thanks for keeping tabs. I think that I've responded to all of the review issues (for which thanks), so I hope that it is ready, too ...

@davidhassell
Contributor Author

Hi @shoyer, @Illviljan - Thanks for all of your input so far - have you had a chance to look over the current state of this PR? David

@jsignell
Member

jsignell commented Feb 2, 2021

We talked about this in the maintainer meeting today and @jrbourbeau brought up that maybe we should merge this after the release on Friday so that there is some time for it to run in the xarray upstream tests.

@jsignell
Member

Ok merging now! Thanks @davidhassell

@jsignell jsignell merged commit 3f7aa3f into dask:master Feb 10, 2021
@davidhassell
Contributor Author

Great - and thanks, all, for helping me with this

Development

Successfully merging this pull request may close these issues.

Proposal for a more fully featured __setitem__
