
Non deterministic tokenization for empty numpy arrays after pickle roundtrip #10799

@fjetter

Description


An empty numpy array that defines strides may not tokenize stably across a pickle roundtrip:

import numpy as np
arr = np.array([])

Note: this only happens when using the np.array constructor, not np.ndarray, since np.ndarray will initialize strides to ()

The above numpy array is empty but still defines a strides attribute:

>>> arr.strides
(0,)

which means that normalize_token returns a four-tuple whose last element is the strides:

>>> from dask.base import normalize_token, tokenize
>>> normalize_token(arr)
('da39a3ee5e6b4b0d3255bfef95601890afd80709', dtype('int64'), (0,), (0,))

A pickle roundtrip of this empty array sets the strides to a non-zero value:

>>> import pickle
>>> pickle.loads(pickle.dumps(arr)).strides
(8,)

>>> roundtrip = pickle.loads(pickle.dumps(arr))
>>> # This should be True, but is not
>>> tokenize(arr) == tokenize(roundtrip)
False

which breaks our deterministic tokenization. This is a problem for graphs that are generated only on the scheduler, since the original keys on the client side may differ from those on the scheduler side.
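One possible mitigation, sketched below, is to drop strides from the normalized token whenever the array holds no elements, since strides carry no layout information for a zero-size array. The normalize_empty helper is hypothetical and is not dask's actual normalize_token; it only mirrors the (data, dtype, shape, strides) tuple shown above:

```python
import pickle

import numpy as np


def normalize_empty(arr):
    # Hypothetical normalization: keep the (data, dtype, shape, strides)
    # structure, but ignore strides when the array has zero elements.
    strides = arr.strides if arr.size else None
    return (arr.tobytes(), arr.dtype, arr.shape, strides)


arr = np.array([])
roundtrip = pickle.loads(pickle.dumps(arr))

# Both sides now normalize identically regardless of stride differences
assert normalize_empty(arr) == normalize_empty(roundtrip)

# Non-empty arrays with different contents still normalize differently
assert normalize_empty(np.array([1.0])) != normalize_empty(np.array([2.0]))
```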


Environment:

numpy                     1.26.2          py310h30ee222_0    conda-forge
pandas                    2.1.4           py310h6e3cc31_0    conda-forge
python                    3.10.13         h2469fbe_0_cpython    conda-forge
