-
-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Generically rebuild a collection with different keys #7142
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
commit 2ceaf46 Author: crusaderky <[email protected]> Date: Thu Jan 28 18:53:42 2021 +0000 polish commit 2643032 Author: crusaderky <[email protected]> Date: Thu Jan 28 17:13:57 2021 +0000 black commit 1b29cd8 Author: crusaderky <[email protected]> Date: Thu Jan 28 17:13:45 2021 +0000 bugfix commit 0a50ceb Merge: 8286a36 9bb586a Author: crusaderky <[email protected]> Date: Thu Jan 28 16:48:34 2021 +0000 Merge remote-tracking branch 'upstream/master' into graphsurgery commit 8286a36 Author: crusaderky <[email protected]> Date: Thu Jan 28 16:48:12 2021 +0000 bugfix commit 802ece7 Author: crusaderky <[email protected]> Date: Thu Jan 28 16:22:47 2021 +0000 bugfixes commit 8d31b85 Author: crusaderky <[email protected]> Date: Thu Jan 28 15:38:33 2021 +0000 Drop unused layers commit 966d5a7 Author: crusaderky <[email protected]> Date: Thu Jan 28 15:20:42 2021 +0000 bugfixes commit afdc65d Merge: 1b30c21 5694253 Author: crusaderky <[email protected]> Date: Tue Jan 26 19:00:35 2021 +0000 Merge remote-tracking branch 'upstream/master' into graphsurgery commit 1b30c21 Author: crusaderky <[email protected]> Date: Tue Jan 26 14:07:59 2021 +0000 polish commit 4d03052 Author: crusaderky <[email protected]> Date: Tue Jan 26 13:46:02 2021 +0000 bind implementation commit 741b0e4 Author: crusaderky <[email protected]> Date: Tue Jan 26 11:30:12 2021 +0000 fix commit 5005c60 Author: crusaderky <[email protected]> Date: Tue Jan 26 11:26:46 2021 +0000 refactor commit 050aac8 Merge: c68a459 fbbc271 Author: crusaderky <[email protected]> Date: Tue Jan 26 11:16:05 2021 +0000 Merge remote-tracking branch 'upstream/master' into graphsurgery commit c68a459 Author: crusaderky <[email protected]> Date: Mon Jan 25 21:57:43 2021 +0000 bind implementation commit 6589c18 Merge: 7ac6817 4e2d838 Author: crusaderky <[email protected]> Date: Mon Jan 25 21:24:05 2021 +0000 Merge remote-tracking branch 'upstream/master' into graphsurgery commit 7ac6817 Author: crusaderky <[email protected]> Date: Mon Jan 25 21:23:53 2021 +0000 Rename choke -> checkpoint commit deec2b3 Author: crusaderky <[email protected]> Date: Mon Jan 25 21:19:49 2021 +0000 Rename graphsurgery -> graph_manipulation commit 60e928d Author: crusaderky <[email protected]> Date: Mon Jan 25 21:18:27 2021 +0000 more bind implementation commit fe17526 Author: crusaderky <[email protected]> Date: Mon Jan 25 13:55:54 2021 +0000 implement bind (partial) d commit 515f64f Author: crusaderky <[email protected]> Date: Mon Jan 25 11:33:17 2021 +0000 polish commit 4046708 Merge: 16fbb10 5c96d82 Author: crusaderky <[email protected]> Date: Mon Jan 25 10:43:51 2021 +0000 Merge remote-tracking branch 'dask/master' into graphsurgery commit 16fbb10 Author: crusaderky <[email protected]> Date: Mon Jan 25 10:35:21 2021 +0000 Implement name= commit d0f0348 Author: crusaderky <[email protected]> Date: Fri Jan 22 18:15:34 2021 +0000 postpersist name= argument commit c1d695f Author: crusaderky <[email protected]> Date: Fri Jan 22 18:15:17 2021 +0000 bind implementation (incomplete) commit 1ae20fe Author: crusaderky <[email protected]> Date: Fri Jan 22 14:44:20 2021 +0000 polish commit a30f0ed Author: crusaderky <[email protected]> Date: Fri Jan 22 14:34:25 2021 +0000 polish commit b979ea2 Merge: 793df1c 4395973 Author: crusaderky <[email protected]> Date: Fri Jan 22 14:05:49 2021 +0000 Merge remote-tracking branch 'dask/master' into graphsurgery commit 793df1c Author: crusaderky <[email protected]> Date: Fri Jan 22 14:05:42 2021 +0000 impl commit 1f66b9a Merge: ba43137 cb9fc26 Author: crusaderky <[email protected]> Date: Thu Jan 21 17:55:44 2021 +0000 Merge remote-tracking branch 'dask/master' into graphsurgery commit ba43137 Author: crusaderky <[email protected]> Date: Thu Jan 21 17:55:30 2021 +0000 graphsurgery (incomplete) commit 0ad5fe3 Author: crusaderky <[email protected]> Date: Thu Jan 21 14:04:59 2021 +0000 O(1) access to keys in HighLevelGraph commit 405205a Merge: b4a7197 30c9a3a Author: crusaderky <[email protected]> Date: Thu Jan 21 10:59:55 2021 +0000 Merge remote-tracking branch 'dask/master' into graphsurgery commit b4a7197 Author: crusaderky <[email protected]> Date: Wed Jan 20 21:00:48 2021 +0000 drop prototype bugfix
jrbourbeau
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR and thoroughly testing / documenting the changes here @crusaderky!
| return self._rebuild, () | ||
|
|
||
| def _rebuild(self, dsk, name=None): | ||
| return type(self)(dsk, name or self.name, self.npartitions) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of curiosity, do we need to use type(self) here instead of Bag? I noticed we mostly hard-code the class type in other _rebuild methods
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For historical reasons, dask bag supports subclassing, so the type needs to be generic.
| if not isinstance(other, Tuple): | ||
| return NotImplemented # pragma: nocover | ||
| name = tokenize(self, other) | ||
| dsk = {(name, i): k for i, k in enumerate(self._keys + other._keys)} | ||
| return Tuple(merge(dsk, self._dask, other._dask), list(dsk)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for aligning this with the updated key standard
| assert t2._dask == dict(zip("abcdef", range(1, 7))) | ||
| assert t2.compute() == (1, 2, 3, 4, 5, 6) | ||
| x2, y2, z2 = dask.persist(x, y, z) | ||
| assert t2._keys == t._keys |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like we're no longer testing t2._dask here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was because I had to complicate how Tuple.__add__ works to make it compliant to the new protocol spec. I now partially reintegrated the test.
dask/tests/test_base.py
Outdated
| ddf2 = ddf1.persist() | ||
| assert isinstance(ddf2, dd.DataFrame) | ||
| assert len(ddf2.__dask_graph__()) == 2 | ||
| dd._compat.tm.assert_frame_equal(ddf2.compute(), ddf1.compute()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We usually use dd.utils.assert_eq (and the array and bag equivalent) for testing that collections are equal to one another. The assert_eq utilities call compute internally and perform a few additional sanity checks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
| .. _`Stuart Berg`: https://github.com/stuarteberg | ||
| .. _`Guillaume Eynard-Bontemps`: https://github.com/guillaumeeb | ||
| .. _`Adam Beberg`: https://github.com/beberg | ||
| .. _`Johnnie Gray`: https://github.com/jcmgray |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for catching some changelog issues. The other updates look great, but I'm confused why this entry was removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
duplicated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha, thanks for clarifying 👍
| names : tuple | ||
| Tuple of names of the HighLevelGraph layers which contain all keys returned by | ||
| :meth:`__dask_keys__`. | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just checking, should we include the name= keyword in __dask_postpersist__ here or in #7109?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here. I was convinced I added it already; fixed now.
Co-authored-by: James Bourbeau <[email protected]>
Co-authored-by: James Bourbeau <[email protected]>
Co-authored-by: James Bourbeau <[email protected]>
|
@jrbourbeau implemented requested changes |
jrbourbeau
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking through the custom collection docs, we'll need to update the example at https://docs.dask.org/en/latest/custom-collections.html#example-dask-collection too.
As this touches the task graph spec, let's give a bit more time for feedback from others. But overall the changes here seem reasonable.
56e62d6 to
fba0bd8
Compare
jrbourbeau
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @crusaderky! This is in
|
|
||
| - a string, or | ||
| - a tuple of one or more elements, whose first element is a string and the remainder are | ||
| arbitrary hashables: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changing the definition of a key here seems unnecessary, and the old definition remains true. In practice, keys are typically strings or tuples as you describe, but this isn't enforced from what I see. For example, perhaps somebody paranoid would prefer to build a dask graph using objects of a Key class that they create. I haven't seen this done, but the previous spec definition allows for this. Dask collections (including custom collections) are not the only way to create dask graphs.
Updating the specification for keys of collections (the bulk of this PR) is a good idea.
* Add array_safe function supporting like= kwarg * Support for like= in percentile * Add CuPy tests for percentiles * Fix array_like_safe for NumPy < 1.20 * Improve percentile perf * Generically rebuild a collection with different keys (#7142) * Add array_safe function supporting like= kwarg * Support for like= in percentile * Add CuPy tests for percentiles * Fix array_like_safe for NumPy < 1.20 * Improve percentile perf * Remove note comparing np/merge_sorted * Update low-level graph spec to use any hashable for keys (#7163) * Remove unused merge_sorted Co-authored-by: Peter Andreas Entschev <[email protected]> Co-authored-by: Ashwin Srinath <[email protected]> Co-authored-by: crusaderky <[email protected]> Co-authored-by: James Bourbeau <[email protected]>
Closes #7093