
Tokenization meta-issue #10905

@crusaderky

Description

As the ongoing changes in tokenization are getting more complicated, I'm writing a meta-issue that maps them out.

High level goals

  • Ensure that tokenize() is idempotent (calling it twice on the same object yields the same token)
  • Ensure that tokenize() is deterministic (calling it twice on identical objects, or on the same object after a serialization round-trip, yields the same token). This is limited to the same interpreter; determinism is not guaranteed across interpreters.
  • Ensure that, when tokenize() can't return a deterministic result, there is a system for notifying the dask code (e.g. so that it doesn't raise after comparing two non-deterministic tokens)
  • Robustly detect when #9888 (Reuse of keys in blockwise fusion can cause spurious KeyErrors on distributed cluster) happens, in order to mitigate its impact
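The first two properties can be sketched with a toy hash. This is a hypothetical stand-in (`toy_tokenize` is not dask's implementation, which normalizes objects recursively via `normalize_token`), but it illustrates what idempotency and determinism mean for a token:

```python
import hashlib
import pickle

def toy_tokenize(obj):
    # Hypothetical stand-in for dask.base.tokenize(): hash a pickled
    # representation. Real tokenization normalizes objects recursively
    # instead of relying on pickle.
    return hashlib.md5(pickle.dumps(obj)).hexdigest()

x = {"a": [1, 2, 3]}
# Idempotent: tokenizing the same object twice yields the same token.
assert toy_tokenize(x) == toy_tokenize(x)
# Deterministic: equal-but-distinct objects yield the same token
# (within a single interpreter).
assert toy_tokenize(x) == toy_tokenize({"a": [1, 2, 3]})
```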

There are a handful of known objects that violate idempotency/determinism:

  • object() is idempotent, but not deterministic (by choice, as it's normally used as a singleton).
  • Objects that can't be serialized with cloudpickle are neither idempotent nor deterministic. Expect them to break spectacularly in dask_expr, and probably in many other places going forward.

Notably, all callables (including lambdas) become deterministic.
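The object() case can be illustrated with an identity-based token. This is a toy sketch (the `identity_token` helper is hypothetical, not dask's mechanism), showing why a token tied to object identity is idempotent but not deterministic:

```python
def identity_token(obj):
    # Hypothetical helper: derives the token from object identity alone,
    # mirroring how a bare object() sentinel behaves under tokenization.
    return hex(id(obj))

sentinel = object()
# Idempotent: tokenizing the same instance twice yields the same token.
assert identity_token(sentinel) == identity_token(sentinel)

# Not deterministic: two equal-looking instances get different tokens.
# (Both must be kept alive, or CPython may reuse the same id.)
a, b = object(), object()
assert identity_token(a) != identity_token(b)
```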

PRs

  1. Make tokenization more deterministic #10876
  2. Tokenize SubgraphCallable #10898
  3. Tweak sequence tokenization #10904
  4. these two must go in together:
    4a. Deterministic hashing for almost everything #10883
    4b. Remove lambda tokenization hack dask-expr#822
  5. Test numba tokenization #10896
  6. Remove redundant normalize_token variants #10884
  7. Override tokenize.ensure-deterministic config flag #10913
  8. Config toggle to disable blockwise fusion #10909
  9. tokenize: Don't call str() on dict values #10919
  10. Tweaks to update_graph (backport from #8185) distributed#8498
  11. Tokenization-related test tweaks (backport from #8185) distributed#8499
  12. Warn if tasks are submitted with identical keys but different run_spec distributed#8185
  13. Keep old dependencies on run_spec collision distributed#8512
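For item 7, the escape hatch is a config key. Assuming the `tokenize.ensure-deterministic` flag lands as named in #10913, it can be toggled like any other dask config option (a usage sketch, requires dask installed):

```python
import dask
from dask.base import tokenize

# With ensure-deterministic set, tokenize() raises instead of silently
# falling back to a non-deterministic token for unhashable objects.
with dask.config.set({"tokenize.ensure-deterministic": True}):
    token = tokenize([1, 2, 3])  # fine: deterministic input
```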

Closes

Superseded PRs

Other actions

✔️ A/B tests show no impact whatsoever from the additional tokenization labour on the end-to-end workflows in coiled/benchmarks
✔️ A/B tests on dask-expr optimization show a 50-150 ms slowdown for production-sized TPCH queries, which IMHO is negligible
