Conversation

@crusaderky

@crusaderky crusaderky commented Jan 30, 2024

Cut-down variant of #10808, without the more problematic changes about normalize_object and normalize_callable

  • Thoroughly test tokenize idempotency and determinism (same output after a pickle roundtrip)
  • Tougher tests in general
  • Avoid ambiguity between collections
  • Deterministic tokenization for collections containing circular references
  • Better tokenization for dataclasses, numpy objects, and sparse matrices

This PR changes all tokens. Expect downstream tests that assert against hardcoded dask keys to fail.
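The bullet points above translate into properties that can be checked directly. A minimal sketch, assuming `dask.base.tokenize` is importable (an illustration, not the PR's actual tests):

```python
import pickle

from dask.base import tokenize

x = {"a": [1, 2, 3], "b": (4, 5)}
tok = tokenize(x)

# Idempotency: tokenizing the same object twice yields the same token.
assert tokenize(x) == tok
# Determinism: a pickle roundtrip must not change the token.
assert tokenize(pickle.loads(pickle.dumps(x))) == tok
# No ambiguity between collections: a list and a tuple with the same
# elements must tokenize differently.
assert tokenize([1, 2]) != tokenize((1, 2))
```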

@crusaderky crusaderky marked this pull request as draft January 30, 2024 15:32
@crusaderky crusaderky force-pushed the tokenize_without_object branch from f9248fa to 2866e5a Compare January 30, 2024 16:10
@crusaderky crusaderky self-assigned this Jan 30, 2024
@crusaderky crusaderky force-pushed the tokenize_without_object branch from 2866e5a to e1b6e48 Compare January 30, 2024 16:15
@github-actions

github-actions bot commented Jan 30, 2024

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

     15 files   ±0        15 suites  ±0    3h 21m 28s ⏱️ -50s
 12 987 tests   +1    12 058 ✅  +2       929 💤  ±0    0 ❌ -1
160 492 runs   +15   143 983 ✅ +32    16 509 💤 -16    0 ❌ -1

Results for commit 052d1cf. ± Comparison against base commit 97f47b4.

This pull request removes 13 and adds 14 tests. Note that renamed tests count towards both.
dask.tests.test_delayed ‑ test_name_consistent_across_instances
dask.tests.test_tokenize ‑ test_normalize_function_dataclass_field_no_repr
dask.tests.test_tokenize ‑ test_tokenize_datetime_date
dask.tests.test_tokenize ‑ test_tokenize_dense_sparse_array[bsr]
dask.tests.test_tokenize ‑ test_tokenize_dense_sparse_array[coo]
dask.tests.test_tokenize ‑ test_tokenize_dense_sparse_array[csc]
dask.tests.test_tokenize ‑ test_tokenize_dense_sparse_array[csr]
dask.tests.test_tokenize ‑ test_tokenize_dense_sparse_array[dia]
dask.tests.test_tokenize ‑ test_tokenize_dense_sparse_array[lil]
dask.tests.test_tokenize ‑ test_tokenize_function_cloudpickle
…
dask.tests.test_delayed ‑ test_deterministic_name
dask.tests.test_tokenize ‑ test_check_tokenize
dask.tests.test_tokenize ‑ test_empty_numpy_array
dask.tests.test_tokenize ‑ test_tokenize_callable_class
dask.tests.test_tokenize ‑ test_tokenize_circular_recursion
dask.tests.test_tokenize ‑ test_tokenize_dataclass_field_no_repr
dask.tests.test_tokenize ‑ test_tokenize_datetime_date[other0]
dask.tests.test_tokenize ‑ test_tokenize_datetime_date[other1]
dask.tests.test_tokenize ‑ test_tokenize_datetime_date[other2]
dask.tests.test_tokenize ‑ test_tokenize_local_functions
…

♻️ This comment has been updated with latest results.

@crusaderky crusaderky force-pushed the tokenize_without_object branch 5 times, most recently from f530dbb to bf08d1e Compare January 31, 2024 12:57
@crusaderky crusaderky changed the title from "Make more tokenizations deterministic" to "Make tokenization more deterministic" Jan 31, 2024
@crusaderky crusaderky force-pushed the tokenize_without_object branch 2 times, most recently from 690460e to 7184889 Compare January 31, 2024 16:37
Comment on lines +1080 to +1082
# This variable is recreated anew every time you call tokenize(). Note that this means
# that you could call tokenize() from inside tokenize() and they would be fully
# independent.
@crusaderky crusaderky Jan 31, 2024

dask-expr actually does this. A previous version was creating the seen dict in the outermost call to _normalize_seq_func instead of in tokenize, and that was making a test in dask-expr fail.
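For illustration, a self-contained sketch of the mechanism being discussed (this is not dask's implementation; all names are made up): the `seen` mapping is created anew per top-level call, so circular references are replaced by deterministic placeholders and a nested call shares no state with the outer one.

```python
import pickle
from hashlib import sha256

def mini_tokenize(obj, _seen=None):
    # A fresh `seen` dict per top-level call; recursive calls share it,
    # so a nested mini_tokenize() call is fully independent of the outer one.
    seen = {} if _seen is None else _seen
    if isinstance(obj, (list, tuple)):
        if id(obj) in seen:
            # Circular reference: emit a placeholder based on the order in
            # which the container was first encountered, which is
            # deterministic for structurally identical inputs.
            return ("__seen__", seen[id(obj)])
        seen[id(obj)] = len(seen)
        return (type(obj).__name__, tuple(mini_tokenize(o, seen) for o in obj))
    return sha256(pickle.dumps(obj)).hexdigest()

# Two structurally identical circular lists tokenize identically.
a = [1, 2]
a.append(a)
b = [1, 2]
b.append(b)
assert mini_tokenize(a) == mini_tokenize(b)
```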

@crusaderky crusaderky marked this pull request as ready for review January 31, 2024 17:06
@crusaderky crusaderky force-pushed the tokenize_without_object branch from 376c1db to 733a03d Compare February 1, 2024 14:53
crusaderky added a commit to fjetter/dask that referenced this pull request Feb 1, 2024
crusaderky added a commit to crusaderky/dask that referenced this pull request Feb 2, 2024
@crusaderky crusaderky force-pushed the tokenize_without_object branch from 733a03d to 6d9a7cb Compare February 2, 2024 12:14
crusaderky added a commit to crusaderky/dask that referenced this pull request Feb 5, 2024
@crusaderky crusaderky force-pushed the tokenize_without_object branch from 6d9a7cb to 25cf33f Compare February 5, 2024 10:05
crusaderky added a commit to crusaderky/dask that referenced this pull request Feb 5, 2024
@crusaderky crusaderky force-pushed the tokenize_without_object branch from 25cf33f to a190d9f Compare February 5, 2024 13:11
crusaderky added a commit to crusaderky/dask that referenced this pull request Feb 5, 2024
@crusaderky crusaderky force-pushed the tokenize_without_object branch from a190d9f to ca8a0ff Compare February 5, 2024 15:31
crusaderky added a commit to crusaderky/dask that referenced this pull request Feb 6, 2024
@crusaderky crusaderky force-pushed the tokenize_without_object branch from ca8a0ff to 0f07cdf Compare February 6, 2024 09:32
@hendrikmakait hendrikmakait self-requested a review February 6, 2024 10:41
function_cache.clear()


def tokenize_roundtrip(*args, idempotent=True, deterministic=None, copy=None, **kwargs):
@hendrikmakait (Member)

Why is this called _roundtrip? I don't see any actual "roundtripping" in there; am I missing something?

Also, this function does a lot under the hood; I wonder whether it would be easier to understand what's being tested if it were more explicit. That would result in significantly more test code, but personally I prefer DAMP tests over DRY ones, so I don't see this as much of a problem. If not, is there an easy way to test tokenize_roundtrip itself? Basically all our testing relies on it working, so a dedicated unit test would make me feel more comfortable.

@crusaderky crusaderky Feb 6, 2024

It tests tokenization after a pickle round-trip.
Not too happy about the name either. Should we just rename it to check_tokenize?
I'm very happy about how encapsulated it is, though. It surfaced a huge number of issues with the tokenization of the various object types.
Added unit tests for it.

@hendrikmakait hendrikmakait Feb 6, 2024

Not too happy about the name either. Should we just rename it to check_tokenize?

Yes, something like check_tokenize, checked_tokenize, asserting_tokenize would sound better to me. I'll let you choose!

I'm very happy about how DAMP it is though. It caused a huge amount of issues to crop up with tokenization of the various object types.

Don't get me wrong, it looks a lot better than what we had before! After renaming and adding unit tests for the function, this should be clear enough.

@crusaderky (Collaborator, Author)

All done!

@hendrikmakait

hendrikmakait commented Feb 6, 2024

CI on mindeps isn't happy with test_check_tokenize.

@crusaderky

CI on mindeps isn't happy with test_check_tokenize.

Should be fixed.

@hendrikmakait hendrikmakait left a comment

Thanks, @crusaderky! Consider the nits non-blocking and the comment just a clarification for my understanding.

register()
self._lazy.pop(toplevel, None)
return self.dispatch(cls) # recurse
try:
@hendrikmakait (Member)

Why are we switching the order here? What went wrong before?

@crusaderky crusaderky force-pushed the tokenize_without_object branch from 3019d13 to bcb2a3e Compare February 6, 2024 15:42
Commits:
  • Fix mindeps
  • Update dask/tests/test_tokenize.py (Co-authored-by: Hendrik Makait <[email protected]>)
  • Update dask/tests/test_tokenize.py (Co-authored-by: Hendrik Makait <[email protected]>)
  • nit
  • lint
@maartenbreddels

Expect downstream tests that assert vs. hardcoded dask keys to fail.

Hi,

This happened at vaexio/vaex#2331 indeed. Do you know if this was unavoidable, or can this be fixed?

Regards,

Maarten

@maartenbreddels

Also, are there plans to keep tokens stable in the future, or will dask not make this guarantee? (Our cache keys depend on it, and our CI will fail if they change.)

@crusaderky

@maartenbreddels I would expect token output to change infrequently (although it will change again in the next release after all PRs of #10905 get merged). As a rule of thumb, you should not rely on it to be stable across releases. In fact, it's not even guaranteed to be stable across interpreter restarts - it just happens to be in most use cases.

I advise tweaking your tests so that they don't check against hardcoded token output; instead, they should verify that two identical objects produce the same token and that two different objects produce different tokens.
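For example, the suggested test style looks roughly like this (assuming `dask.base.tokenize`; adapt to whatever your cache layer calls):

```python
from dask.base import tokenize

# Don't assert against a hardcoded token value; it may change between
# dask releases. Instead, assert relative properties that hold regardless
# of the exact token strings:
assert tokenize({"a": 1}) == tokenize({"a": 1})   # identical inputs, same token
assert tokenize({"a": 1}) != tokenize({"a": 2})   # different inputs, different tokens
```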

@maartenbreddels

In fact, it's not even guaranteed to be stable across interpreter restarts

Why is that? If you use dask, and persist something in your cluster, you'd want to have the same hashkey/fingerprint right? At least, this was my assumption, and therefore we build on top of dask's hashing feature.

@crusaderky

Why is that?

tokenize() now hashes pickle or cloudpickle output for unknown objects (including local functions).
cloudpickle does not guarantee cross-interpreter determinism in many edge cases.

Add to this that one may commonly run the Client on Windows or macOS while the scheduler and workers run on Linux, which introduces an extra layer of uncertainty: pickle/cloudpickle output should not be expected to be identical across OSes, nor across different versions of the 200+ packages that are typically deployed, unpinned, in a dask environment.
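A rough sketch of the fallback described above (a simplification; dask's actual normalization does considerably more):

```python
import hashlib

import cloudpickle

def fallback_token(obj):
    # Hash the cloudpickle byte stream. The bytes embed interpreter-specific
    # details such as code-object layout and closure contents, so the result
    # is stable within one session but is not guaranteed to match across
    # interpreter restarts, operating systems, or package versions.
    return hashlib.sha256(cloudpickle.dumps(obj)).hexdigest()

local_fn = lambda x: x + 1  # a local function with no stable import path
token = fallback_token(local_fn)
```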

If you use dask, and persist something in your cluster, you'd want to have the same hashkey/fingerprint right?

tokenize() is designed to generate unique graph keys during graph definition time, which happens on the client.
Starting from dask/distributed#8185, it is also going to be used on the scheduler to verify that the run_spec (the values of the dask graph) is identical when keys are identical.

It was never designed to produce a stable, cross-interpreter, cross-host, cross-OS, cross-version fingerprint of arbitrary data.
