Cloudpickle non-deterministic dump when file is innocuously modified #385

@richardwu

Cloudpickle seems to produce non-deterministic dumps when the source file is modified in ways that should not matter (e.g., formatting changes outside the pickled object's definition), whereas dill and pickle produce deterministic dumps.

For example, inserting a blank line anywhere after the definition of the pickled function foo produces a different hash on the first run after the edit, while successive runs then settle on the same hash:

import cloudpickle
import dill
import pickle

def foo():
    pass

def get_cpickle():
    return cloudpickle.dumps(foo)

def get_dill():
    return dill.dumps(foo)

def get_pickle():
    return pickle.dumps(foo)

if __name__ == '__main__':
    print('Cpickle:', hash(get_cpickle()))
    print('Dill:', hash(get_dill()))
    print('Pickle:', hash(get_pickle()))

Command:

PYTHONHASHSEED=1 python bad_pickle.py

First run:

Cpickle: -185195056977094428
Dill: 1827482599472099751
Pickle: -2221802750934099445

Second run:

Cpickle: 5072829361071368526
Dill: 1827482599472099751
Pickle: -2221802750934099445

Blank line inserted after print('Cpickle:', ...) (third run):

Cpickle: -185195056977094428
Dill: 1827482599472099751
Pickle: -2221802750934099445

Fourth run:

Cpickle: 5072829361071368526
Dill: 1827482599472099751
Pickle: -2221802750934099445
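
The hashes above come from the built-in hash(), which is why the command pins PYTHONHASHSEED. As a minimal sketch (not part of the original report), the same comparison could be made with a content digest that does not depend on the hash seed. stable_digest is a hypothetical helper name, and the tuple payload is only a stand-in so the sketch stays stdlib-only; in the script above one would pass the bytes returned by cloudpickle.dumps(foo), dill.dumps(foo) or pickle.dumps(foo).

```python
import hashlib
import pickle


def stable_digest(payload: bytes) -> str:
    """Return the SHA-256 hex digest of a pickled payload.

    Unlike the built-in hash(), the result does not depend on PYTHONHASHSEED,
    so digests from separate interpreter runs can be compared directly.
    """
    return hashlib.sha256(payload).hexdigest()


if __name__ == '__main__':
    # Stand-in payload (assumption): replace with cloudpickle.dumps(foo) etc.
    payload = pickle.dumps(('foo', 1))
    print('Digest:', stable_digest(payload))
```

Two runs that print the same digest produced byte-identical dumps; a differing digest confirms the non-determinism without relying on PYTHONHASHSEED.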

This was tested on the following versions:

Cloudpickle version: 1.2.2
Dill version: 0.2.7.1
Python version: 3.6.10 (default, Jan  1 2020, 00:00:00)

One possible explanation is that Cloudpickle's output also depends on some cached version of the source file (e.g., a .pyc) that only comes into play after the first run.
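
One way to narrow this down, sketched here under assumptions rather than taken from the report, is to disassemble the dumps from two runs and diff the opcode listings, which shows where the byte streams diverge. The sketch relies on cloudpickle output being an ordinary pickle stream that pickletools can read. disassemble and diff_dumps are hypothetical helper names, and the tuple payloads are stand-ins for the real dumps.

```python
import difflib
import io
import pickle
import pickletools


def disassemble(payload: bytes) -> list:
    """Return the pickletools opcode listing of a pickle stream as lines."""
    out = io.StringIO()
    pickletools.dis(payload, out=out)
    return out.getvalue().splitlines()


def diff_dumps(first: bytes, second: bytes) -> list:
    """Return a unified diff of the opcode listings of two pickle streams."""
    return list(difflib.unified_diff(
        disassemble(first), disassemble(second),
        fromfile='first_run', tofile='second_run', lineterm=''))


if __name__ == '__main__':
    # Stand-in payloads (assumption): in practice these would be the bytes
    # saved from two separate runs of the script above.
    a = pickle.dumps(('foo', 'bar'))
    b = pickle.dumps(('foo', 'baz'))
    print('\n'.join(diff_dumps(a, b)))
```

To apply this to the issue, each run could write cloudpickle.dumps(foo) to its own file, and the two files could then be read back and passed to diff_dumps. An empty diff means the dumps are identical.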

This is also somewhat related to #120.
