
Conversation

@davidhassell
Contributor

  • Tests added / passed
  • Passes black dask / flake8 dask

Fixes #7029

@davidhassell davidhassell changed the title from "Extend the __setitem__ to more closely match numpy" to "Extend __setitem__ to more closely match numpy" Jan 6, 2021
@max-sixty max-sixty mentioned this pull request Jan 6, 2021
@mrocklin
Member

mrocklin commented Jan 6, 2021

cc @shoyer do you know anyone who would be interested in reviewing this?

len(index), size, index
)
)
index = where(index)[0].compute_chunk_sizes()
Member

@shoyer shoyer Jan 6, 2021

This conversion (boolean dtype -> positive indices) seems like potentially not worth the trouble.

The one-argument form where(cond), which requires computation at graph construction time (to determine chunk sizes), is generally best avoided, unlike the three-argument form where(cond, x, y), which can be done lazily.

Also, it looks like everything in this code path needs to get converted into a slice(), which seems very restrictive and thus unlikely to be useful. This is particularly surprising given that dask already supports x[index] = y for an arbitrary multi-dimensional boolean index.
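
To make the distinction concrete, here is a small self-contained example (illustrative only, not code from this PR), assuming numpy and dask.array:

    import numpy as np
    import dask.array as da

    x = da.from_array(np.arange(12), chunks=4)
    cond = x % 3 == 0

    # One-argument where: the positional indices have unknown chunk sizes,
    # and resolving them (compute_chunk_sizes) computes while the graph is
    # still being constructed.
    idx = da.where(cond)[0]
    print(idx.chunks)                        # ((nan, nan, nan),)
    print(idx.compute_chunk_sizes().chunks)  # ((2, 1, 1),)

    # Three-argument where: fully lazy, chunk sizes known up front.
    y = da.where(cond, -1, x)
    print(y.chunks)                          # ((4, 4, 4),)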

Contributor Author

Thanks for looking at this, @shoyer - very much appreciated.

Only 1-d integer arrays that can be converted to slices (like [3, 2, 1], or [1, 3, 5]) are converted. I have found in the past that this is worth the effort because the numpy __setitem__ is much, much faster with slices, saving more time than is spent checking the 1-d array for regularity.

Other 1-d array indices (like [1, 2, 5]) remain as they are.
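
To illustrate the kind of regularity check being described, a minimal hypothetical sketch (not the code in this PR):

    import numpy as np

    def index_to_slice(index):
        # Convert a regularly spaced 1-d integer index (constant non-zero
        # step, e.g. [3, 2, 1] or [1, 3, 5]) to an equivalent slice, which
        # numpy's __setitem__ handles much faster; otherwise return it as-is.
        index = np.asanyarray(index)
        if index.ndim != 1 or index.size == 0:
            return index
        if index.size == 1:
            start = int(index[0])
            return slice(start, start + 1)
        steps = np.diff(index)
        step = int(steps[0])
        if step == 0 or not (steps == step).all():
            return index  # irregular spacing, e.g. [1, 2, 5]
        start = int(index[0])
        stop = int(index[-1]) + (1 if step > 0 else -1)
        if step < 0 and stop < 0:
            stop = None  # descending run that reaches index 0
        return slice(start, stop, step)

    print(index_to_slice([1, 3, 5]))  # slice(1, 6, 2)
    print(index_to_slice([3, 2, 1]))  # slice(3, 0, -1)
    print(index_to_slice([1, 2, 5]))  # unchanged: [1 2 5]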

I will have a think about how we can process boolean indices without converting them to integers.

Contributor Author

I will have a think about how we can process boolean indices without converting them to integers.

... at least, perhaps, until compute time ....

Contributor Author

@shoyer - apologies, I see now that you had spotted that even integer indices are converted to slices, which is indeed overly restrictive, and caused a bug (#7033 (comment)). This bug is now fixed, and I'm thinking about the boolean index issues.

Contributor Author

The premature compute from the one-argument where has gone. The conversion of regular integer sequences to slices has moved to slicing.setitem and is now done chunkwise at compute time (c46ccb2).

So, I think there is very little work done now at the time of defining the assignment. The only significant thing that remains is a test for strict monotonicity of integer array indices - if such an index is a dask array then it is computed at the time of the assignment definition. It is my hope, however, that we can, in the future, remove the need for this test and allow other types of integer sequences (like Array.__getitem__ and numpy do).
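
As an illustration of that check (a hypothetical helper, not the PR's code), and of why a dask-array index forces a compute at definition time:

    import numpy as np
    import dask.array as da

    def is_strictly_monotonic(index):
        # A dask-array index has to be computed before its values
        # can be inspected, hence the compute at definition time.
        if isinstance(index, da.Array):
            index = index.compute()
        steps = np.diff(index)
        return bool((steps > 0).all() or (steps < 0).all())

    print(is_strictly_monotonic(da.from_array(np.array([1, 3, 5]), chunks=2)))  # True
    print(is_strictly_monotonic(np.array([3, 1, 2])))                           # False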

Contributor

@Illviljan Illviljan left a comment

Mostly grammar stuff from me at this point. I appreciate all the comments in the code.

Perhaps all the extra features being added are well worth it, but I was wondering: with all this new code, is there any change in performance?

@davidhassell
Contributor Author

davidhassell commented Jan 10, 2021

I have spotted a bug that causes, e.g.

>>> import numpy as np
>>> import dask.array as da
>>> x = np.ma.arange(60).reshape((6, 10))
>>> dx = da.from_array(x.copy(), chunks=(2, 2))
>>> dx[:, 0] = dx[:, 0]
ValueError: Can't broadcast data with shape (6,) across shape (6, 1)
>>> dx[:, 0] = dx[:, 0:1]  # But this is OK

This is related to how the indices are parsed, and is probably a legacy of the code lifted from cf-python, which always retains a size-1 dimension from an integer-valued index, whilst numpy and dask drop it.

I'm looking into a fix for this so that dx[:, 0] = dx[:, 0] works as expected.
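
For reference, the shape difference behind the broadcast error is just standard numpy behaviour:

>>> x[:, 0].shape    # scalar integer index drops the axis
(6,)
>>> x[:, 0:1].shape  # length-1 slice keeps it
(6, 1)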

@davidhassell
Contributor Author

02e1b43 fixes #7033 (comment)

Comment on lines 1580 to 1581
from .routines import where # , count_nonzero
from .slicing import parse_assignment_indices, setitem
Contributor

The where function is not used in your parts, right? Then that's an unnecessary import when isinstance(key, Array) is False. Perhaps move that line down to just before the try: y = where(key, value, self)?

Same idea with parse_assignment_indices and setitem: those are not needed when isinstance(key, Array) is True. Although, since .slicing has already been imported at the top of the module, maybe parse_assignment_indices and setitem should just be imported at the top as well?
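
A rough structural sketch of that suggestion (bodies elided; the names mirror the snippet under review, not the actual dask source):

    def __setitem__(self, key, value):
        if isinstance(key, Array) and key.dtype == bool:
            # where is only needed on the boolean-mask branch
            from .routines import where

            y = where(key, value, self)
            ...
        else:
            # the slicing helpers are only needed on this branch
            from .slicing import parse_assignment_indices, setitem

            ...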

@davidhassell
Contributor Author

I have realised that the code I have written accesses every chunk, regardless of whether or not it is being assigned to. For data on disk this is clearly not very efficient.

However, since writing it I have learnt all about graphs and think it is possible to refactor the code away from using map_blocks and follow a similar approach to __getitem__:

            # Create graph that only includes the chunks being assigned 
            dsk = ...
            graph = HighLevelGraph.from_collections(out, dsk, ...)
            y = Array(graph, ...)

            self._meta = y._meta
            self.dask = y.dask
            self.name = y.name
            self._chunks = y.chunks
            return self

This will involve running some of the logic currently in array.slicing.setitem at assignment time rather than compute time, but that should be OK as it's just index wrangling, and doesn't need to start any of its own compute processes.

Thanks for bearing with me - I should have been more clued up on some of the mechanics of __getitem__ first.

@davidhassell
Contributor Author

OK - no surprise that I've run into similar ground to that dealt with by slice_with_bool_dask_array, i.e. we don't know the values of a dask array index before we've computed it. I'm working on it ... (and should be able to reuse a lot more of the __getitem__ functions, too, I hope)

@davidhassell
Contributor Author

Oh dear - sorry! There was nothing particularly wrong with the map_blocks approach, after all - the assigned chunks are still only accessed at compute time, so no problem. The other approach I suggested was a mistake (perhaps encouraged by my new understanding of graphs). I'll write some documentation to make amends.

@jsignell
Member

It looks like this is ready and has just fallen off the radar. Is that correct?

@davidhassell
Contributor Author

Hi - I was just about to ask where we were with it - thanks for keeping tabs. I think that I've responded to all of the review issues (for which thanks), so I hope that it is ready, too ...

@davidhassell
Contributor Author

Hi @shoyer, @Illviljan - Thanks for all of your input so far - have you had a chance to look over the current state of this PR? David

@jsignell
Member

jsignell commented Feb 2, 2021

We talked about this in the maintainer meeting today and @jrbourbeau brought up that maybe we should merge this after the release on Friday so that there is some time for it to run in the xarray upstream tests.

@jsignell
Member

Ok merging now! Thanks @davidhassell

@jsignell jsignell merged commit 3f7aa3f into dask:master Feb 10, 2021
@davidhassell
Contributor Author

Great - and thanks, all, for helping me with this

Development

Successfully merging this pull request may close these issues.

Proposal for a more fully featured __setitem__
