Labels
bug: Something is broken
Description
An empty NumPy array that defines strides may not tokenize stably across pickle roundtrips.

```python
import numpy as np
arr = np.array([])
```

Note: this only reproduces when using the `np.array` constructor, not `np.ndarray`, since `np.ndarray` will initialize strides to `()`.

The array above is empty but defines a non-trivial `strides` attribute:

```python
>>> arr.strides
(0,)
```

which means that `normalize_token` returns a four-tuple whose last element is the strides:

```python
>>> from dask.base import normalize_token, tokenize
>>> normalize_token(arr)
('da39a3ee5e6b4b0d3255bfef95601890afd80709', dtype('int64'), (0,), (0,))
```
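For context on the first tuple element: the leading hex string matches the SHA-1 digest of zero bytes, i.e. a hash over the array's empty data buffer (shown here as a standalone check, independent of dask's internals):

```python
import hashlib

# SHA-1 over an empty byte string -- the well-known digest of no data,
# matching the first element of the normalize_token tuple above.
digest = hashlib.sha1(b"").hexdigest()
print(digest)  # da39a3ee5e6b4b0d3255bfef95601890afd80709
```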
A pickle roundtrip of this empty array now sets the strides to a non-zero value:

```python
>>> import pickle
>>> roundtrip = pickle.loads(pickle.dumps(arr))
>>> roundtrip.strides
(8,)
```

```python
# These should tokenize identically, but the assertion below passes:
>>> assert tokenize(arr) != tokenize(roundtrip)
```

which is causing our deterministic tokenization to break. This is a problem for graphs that are only generated on the scheduler, since the original keys on the client side may differ from those on the scheduler side.
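One possible mitigation, sketched below with a hypothetical `stable_strides` helper (this is not dask's actual fix), is to canonicalize strides during normalization whenever the array is empty, since strides of a zero-element array address no data:

```python
import pickle
import numpy as np

def stable_strides(arr):
    # Hypothetical helper: an empty array owns no elements, so its strides
    # carry no information about the data -- normalize them to zeros.
    return (0,) * arr.ndim if arr.size == 0 else arr.strides

arr = np.array([])
roundtrip = pickle.loads(pickle.dumps(arr))

# Regardless of what pickle does to the strides, the canonical form agrees.
print(stable_strides(arr), stable_strides(roundtrip))
```

Non-empty arrays keep their real strides, so this would only change tokenization for the degenerate empty case.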
Environment:

```
numpy    1.26.2   py310h30ee222_0     conda-forge
pandas   2.1.4    py310h6e3cc31_0     conda-forge
python   3.10.13  h2469fbe_0_cpython  conda-forge
```