As the ongoing changes to tokenization are getting more complicated, I'm writing a meta-issue that maps them out.
High-level goals
- Ensure that `tokenize()` is idempotent: call it twice on the same object and get the same token.
- Ensure that `tokenize()` is deterministic: call it twice on identical objects, or on the same object after a serialization round-trip, and get the same token. This is limited to the same interpreter; determinism is not guaranteed across interpreters. (Both guarantees are illustrated in the sketch after this list.)
- Ensure that, when `tokenize()` can't return a deterministic result, there is a system for notifying the dask code (e.g. so that you don't raise after comparing two non-deterministic tokens).
- Robustly detect when "Reuse of keys in blockwise fusion can cause spurious KeyErrors on distributed cluster" #9888 happens, in order to mitigate its impact.
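A minimal sketch of what the two guarantees mean in practice, using the public `dask.base.tokenize` API (the pickle round-trip exercises the determinism goal; same interpreter only):

```python
import pickle

import numpy as np
from dask.base import tokenize

x = {"a": 1, "b": [1, 2, 3]}

# Idempotent: tokenizing the same object twice yields the same token
assert tokenize(x) == tokenize(x)

# Deterministic: an identical-but-distinct object yields the same token,
# as does the same object after a serialization round-trip
assert tokenize(x) == tokenize({"a": 1, "b": [1, 2, 3]})

a = np.arange(10)
assert tokenize(a) == tokenize(pickle.loads(pickle.dumps(a)))
```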
There are a handful of known objects that violate idempotency/determinism:
- `object()` is idempotent, but not deterministic (by choice, as it's normally used as a singleton).
- Objects that can't be serialized with cloudpickle are neither idempotent nor deterministic. Expect them to break spectacularly in dask_expr for sure, and probably in many other places going forward.
Notably, all callables (including lambdas) become deterministic.
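A hedged sketch of both behaviours. The `tokenize.ensure-deterministic` config flag is the notification mechanism referenced in the goals and in #10913; the exception type is assumed to be `RuntimeError`, as in recent dask releases:

```python
import cloudpickle
import dask
from dask.base import tokenize

# object(): idempotent (the token is tied to the instance) but not
# deterministic (a fresh instance gets a fresh token)
o = object()
assert tokenize(o) == tokenize(o)
assert tokenize(o) != tokenize(object())

# Opt in to an explicit error instead of a silently unstable token
with dask.config.set({"tokenize.ensure-deterministic": True}):
    try:
        tokenize(o)
    except RuntimeError:
        pass  # object() can't be deterministically hashed

# Callables, including lambdas, now survive a cloudpickle round-trip
f = lambda x: x + 1
assert tokenize(f) == tokenize(cloudpickle.loads(cloudpickle.dumps(f)))
```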
PRs
- Make tokenization more deterministic #10876
- Tokenize SubgraphCallable #10898
- Tweak sequence tokenization #10904
- These two must go in together:
  - 4a. Deterministic hashing for almost everything #10883
  - 4b. Remove lambda tokenization hack dask-expr#822
- Test numba tokenization #10896
- Remove redundant normalize_token variants #10884
- Override tokenize.ensure-deterministic config flag #10913
- Config toggle to disable blockwise fusion #10909
- tokenize: Don't call str() on dict values #10919
- Tweaks to update_graph (backport from #8185) distributed#8498
- Tokenization-related test tweaks (backport from #8185) distributed#8499
- Warn if tasks are submitted with identical keys but different `run_spec` distributed#8185
- Keep old dependencies on run_spec collision distributed#8512
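A toy sketch of the key collision the last two PRs address (in practice it typically arises from blockwise fusion reusing keys, #9888, rather than manual submission; the final assertion assumes the post-#8512 behavior of keeping the original run_spec):

```python
from operator import add

from distributed import Client

client = Client(processes=False)

# Submit two tasks that reuse the key "x" but carry different run_specs
f1 = client.submit(add, 1, 1, key="x")
f2 = client.submit(add, 2, 2, key="x")  # collision: same key, different spec

# With distributed#8185 the scheduler warns about the mismatch; with
# distributed#8512 it keeps the original run_spec and dependencies, so
# both futures resolve to the first task's result
assert f1.result() == f2.result() == 2
```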
Closes
- Non deterministic tokenization for empty numpy arrays after pickle roundtrip #10799 (reproducer after this list)
- GPU Tokenization #6718
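The first issue above came down to a pickle round-trip changing the token of size-zero numpy arrays. A minimal reproducer, now expected to pass:

```python
import pickle

import numpy as np
from dask.base import tokenize

a = np.empty((0,), dtype=float)
b = pickle.loads(pickle.dumps(a))

# Before the fix these tokens differed; the round-trip is now token-stable
assert tokenize(a) == tokenize(b)
```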
Superseded PRs
- add test that shows how lambda tokenization is broken dask-expr#765
- Ensure tokenize is consistent for pickle roundtrips #10808
Other actions
✔️ A/B tests show no impact whatsoever from the additional tokenization labour on the end-to-end workflows in coiled/benchmarks
✔️ A/B tests on dask-expr optimization show a 50-150 ms slowdown for production-sized TPCH queries, which IMHO is negligible