Skip to content

Conversation

@crusaderky
Copy link
Collaborator

Closes #7093

commit 2ceaf46
Author: crusaderky <[email protected]>
Date:   Thu Jan 28 18:53:42 2021 +0000

    polish

commit 2643032
Author: crusaderky <[email protected]>
Date:   Thu Jan 28 17:13:57 2021 +0000

    black

commit 1b29cd8
Author: crusaderky <[email protected]>
Date:   Thu Jan 28 17:13:45 2021 +0000

    bugfix

commit 0a50ceb
Merge: 8286a36 9bb586a
Author: crusaderky <[email protected]>
Date:   Thu Jan 28 16:48:34 2021 +0000

    Merge remote-tracking branch 'upstream/master' into graphsurgery

commit 8286a36
Author: crusaderky <[email protected]>
Date:   Thu Jan 28 16:48:12 2021 +0000

    bugfix

commit 802ece7
Author: crusaderky <[email protected]>
Date:   Thu Jan 28 16:22:47 2021 +0000

    bugfixes

commit 8d31b85
Author: crusaderky <[email protected]>
Date:   Thu Jan 28 15:38:33 2021 +0000

    Drop unused layers

commit 966d5a7
Author: crusaderky <[email protected]>
Date:   Thu Jan 28 15:20:42 2021 +0000

    bugfixes

commit afdc65d
Merge: 1b30c21 5694253
Author: crusaderky <[email protected]>
Date:   Tue Jan 26 19:00:35 2021 +0000

    Merge remote-tracking branch 'upstream/master' into graphsurgery

commit 1b30c21
Author: crusaderky <[email protected]>
Date:   Tue Jan 26 14:07:59 2021 +0000

    polish

commit 4d03052
Author: crusaderky <[email protected]>
Date:   Tue Jan 26 13:46:02 2021 +0000

    bind implementation

commit 741b0e4
Author: crusaderky <[email protected]>
Date:   Tue Jan 26 11:30:12 2021 +0000

    fix

commit 5005c60
Author: crusaderky <[email protected]>
Date:   Tue Jan 26 11:26:46 2021 +0000

    refactor

commit 050aac8
Merge: c68a459 fbbc271
Author: crusaderky <[email protected]>
Date:   Tue Jan 26 11:16:05 2021 +0000

    Merge remote-tracking branch 'upstream/master' into graphsurgery

commit c68a459
Author: crusaderky <[email protected]>
Date:   Mon Jan 25 21:57:43 2021 +0000

    bind implementation

commit 6589c18
Merge: 7ac6817 4e2d838
Author: crusaderky <[email protected]>
Date:   Mon Jan 25 21:24:05 2021 +0000

    Merge remote-tracking branch 'upstream/master' into graphsurgery

commit 7ac6817
Author: crusaderky <[email protected]>
Date:   Mon Jan 25 21:23:53 2021 +0000

    Rename choke -> checkpoint

commit deec2b3
Author: crusaderky <[email protected]>
Date:   Mon Jan 25 21:19:49 2021 +0000

    Rename graphsurgery -> graph_manipulation

commit 60e928d
Author: crusaderky <[email protected]>
Date:   Mon Jan 25 21:18:27 2021 +0000

    more bind implementation

commit fe17526
Author: crusaderky <[email protected]>
Date:   Mon Jan 25 13:55:54 2021 +0000

    implement bind (partial)

    d

commit 515f64f
Author: crusaderky <[email protected]>
Date:   Mon Jan 25 11:33:17 2021 +0000

    polish

commit 4046708
Merge: 16fbb10 5c96d82
Author: crusaderky <[email protected]>
Date:   Mon Jan 25 10:43:51 2021 +0000

    Merge remote-tracking branch 'dask/master' into graphsurgery

commit 16fbb10
Author: crusaderky <[email protected]>
Date:   Mon Jan 25 10:35:21 2021 +0000

    Implement name=

commit d0f0348
Author: crusaderky <[email protected]>
Date:   Fri Jan 22 18:15:34 2021 +0000

    postpersist name= argument

commit c1d695f
Author: crusaderky <[email protected]>
Date:   Fri Jan 22 18:15:17 2021 +0000

    bind implementation (incomplete)

commit 1ae20fe
Author: crusaderky <[email protected]>
Date:   Fri Jan 22 14:44:20 2021 +0000

    polish

commit a30f0ed
Author: crusaderky <[email protected]>
Date:   Fri Jan 22 14:34:25 2021 +0000

    polish

commit b979ea2
Merge: 793df1c 4395973
Author: crusaderky <[email protected]>
Date:   Fri Jan 22 14:05:49 2021 +0000

    Merge remote-tracking branch 'dask/master' into graphsurgery

commit 793df1c
Author: crusaderky <[email protected]>
Date:   Fri Jan 22 14:05:42 2021 +0000

    impl

commit 1f66b9a
Merge: ba43137 cb9fc26
Author: crusaderky <[email protected]>
Date:   Thu Jan 21 17:55:44 2021 +0000

    Merge remote-tracking branch 'dask/master' into graphsurgery

commit ba43137
Author: crusaderky <[email protected]>
Date:   Thu Jan 21 17:55:30 2021 +0000

    graphsurgery (incomplete)

commit 0ad5fe3
Author: crusaderky <[email protected]>
Date:   Thu Jan 21 14:04:59 2021 +0000

    O(1) access to keys in HighLevelGraph

commit 405205a
Merge: b4a7197 30c9a3a
Author: crusaderky <[email protected]>
Date:   Thu Jan 21 10:59:55 2021 +0000

    Merge remote-tracking branch 'dask/master' into graphsurgery

commit b4a7197
Author: crusaderky <[email protected]>
Date:   Wed Jan 20 21:00:48 2021 +0000

    drop prototype

    bugfix
More tests


More tests
@crusaderky crusaderky closed this Feb 1, 2021
@crusaderky crusaderky reopened this Feb 1, 2021
Copy link
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR and thoroughly testing / documenting the changes here @crusaderky!

cc @jcrist @eriknw @shoyer for visibility

return self._rebuild, ()

def _rebuild(self, dsk, name=None):
return type(self)(dsk, name or self.name, self.npartitions)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity, do we need to use type(self) here instead of Bag? I noticed we mostly hard-code the class type in other _rebuild methods

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For historical reasons, dask bag supports subclassing, so the type needs to be generic.

Comment on lines +606 to +610
if not isinstance(other, Tuple):
return NotImplemented # pragma: nocover
name = tokenize(self, other)
dsk = {(name, i): k for i, k in enumerate(self._keys + other._keys)}
return Tuple(merge(dsk, self._dask, other._dask), list(dsk))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for aligning this with the updated key standard

assert t2._dask == dict(zip("abcdef", range(1, 7)))
assert t2.compute() == (1, 2, 3, 4, 5, 6)
x2, y2, z2 = dask.persist(x, y, z)
assert t2._keys == t._keys
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like we're no longer testing t2._dask here

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was because I had to complicate how Tuple.__add__ works to make it compliant to the new protocol spec. I now partially reintegrated the test.

ddf2 = ddf1.persist()
assert isinstance(ddf2, dd.DataFrame)
assert len(ddf2.__dask_graph__()) == 2
dd._compat.tm.assert_frame_equal(ddf2.compute(), ddf1.compute())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We usually use dd.utils.assert_eq (and the array and bag equivalent) for testing that collections are equal to one another. The assert_eq utilities call compute internally and perform a few additional sanity checks

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

.. _`Stuart Berg`: https://github.com/stuarteberg
.. _`Guillaume Eynard-Bontemps`: https://github.com/guillaumeeb
.. _`Adam Beberg`: https://github.com/beberg
.. _`Johnnie Gray`: https://github.com/jcmgray
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching some changelog issues. The other updates look great, but I'm confused why this entry was removed.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

duplicated

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha, thanks for clarifying 👍

names : tuple
Tuple of names of the HighLevelGraph layers which contain all keys returned by
:meth:`__dask_keys__`.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just checking, should we include the name= keyword in __dask_postpersist__ here or in #7109?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here. I was convinced I added it already; fixed now.

@crusaderky
Copy link
Collaborator Author

@jrbourbeau implemented requested changes

Copy link
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking through the custom collection docs, we'll need to update the example at https://docs.dask.org/en/latest/custom-collections.html#example-dask-collection too.

As this touches the task graph spec, let's give a bit more time for feedback from others. But overall the changes here seem reasonable.

@crusaderky
Copy link
Collaborator Author

crusaderky added a commit to crusaderky/dask that referenced this pull request Feb 2, 2021
Copy link
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @crusaderky! This is in

@jrbourbeau jrbourbeau merged commit d096fde into dask:master Feb 3, 2021
@crusaderky crusaderky deleted the postpersist branch February 3, 2021 17:07

- a string, or
- a tuple of one or more elements, whose first element is a string and the remainder are
arbitrary hashables:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing the definition of a key here seems unnecessary, and the old definition remains true. In practice, keys are typically strings or tuples as you describe, but this isn't enforced from what I see. For example, perhaps somebody paranoid would prefer to build a dask graph using objects of a Key class that they create. I haven't seen this done, but the previous spec definition allows for this. Dask collections (including custom collections) are not the only way to create dask graphs.

Updating the specification for keys of collections (the bulk of this PR) is a good idea.

jakirkham pushed a commit that referenced this pull request Feb 26, 2021
* Add array_safe function supporting like= kwarg

* Support for like= in percentile

* Add CuPy tests for percentiles

* Fix array_like_safe for NumPy < 1.20

* Improve percentile perf

* Generically rebuild a collection with different keys (#7142)

* Add array_safe function supporting like= kwarg

* Support for like= in percentile

* Add CuPy tests for percentiles

* Fix array_like_safe for NumPy < 1.20

* Improve percentile perf

* Remove note comparing np/merge_sorted

* Update low-level graph spec to use any hashable for keys (#7163)

* Remove unused merge_sorted

Co-authored-by: Peter Andreas Entschev <[email protected]>
Co-authored-by: Ashwin Srinath <[email protected]>
Co-authored-by: crusaderky <[email protected]>
Co-authored-by: James Bourbeau <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Generically rebuild a collection with different keys

4 participants