Skip to content

Making cloudpickle produce "consistent/deterministic" results. #120

@robertnishihara

Description

@robertnishihara

This question arose in the following context. I have multiple Python processes, and some classes are defined in each process. Sometimes class definitions are shipped from one process to another (using cloudpickle). Sometimes classes may be shipped multiple times or in multiple ways to a given process, and I'd like to deduplicate them based on the output of cloudpickle (that is, if cloudpickle.dumps(class1) == cloudpickle.dumps(class2), then the classes are the "same" and I can throw away one of them. This works, but there are way too many false negatives (that is, two classes really are the same (in some sense), but cloudpickle.dumps gives different results on the two classes.

Here's one example that sort of illustrates the issue (although there are a number of ways this kind of thing can arise).

Suppose I do the following.

import cloudpickle

class Foo1(object):
    def __init__(self):
        pass

serialized1 = cloudpickle.dumps(Foo1)
Foo2 = cloudpickle.loads(serialized1)
serialized2 = cloudpickle.dumps(Foo2)

assert serialized1 == serialized2  # This assertion fails.

I'd love for this kind of assertion to succeed. Does anyone know if this is achievable or what the main obstacles are?

Interestingly, if I iterate this a third time,

Foo3 = cloudpickle.loads(serialized2)
serialized3 = cloudpickle.dumps(Foo3)

assert serialized2 == serialized3  # This succeeds.

then the assert succeeds, so maybe it suffices to use cloudpickle.dumps(cloudpickle.loads(cloudpickle.dumps(cls))) to deduplicate classes (though this seems kind of insane, and I haven't tested this extensively). Would you expect this to work?

One thing that may be related/revealing is the outputs I get if I do something similar in an IPython interpreter (instead of a regular Python interpreter).

First copy and paste this block into IPython.

import cloudpickle

class Foo(object):
    def __init__(self):
        pass

serialized1 = cloudpickle.dumps(Foo)

Then copy and paste this block into IPython.

class Foo(object):
    def __init__(self):
        pass

serialized2 = cloudpickle.dumps(Foo)

Comparing serialized1 and serialized2 next to each other, they are

serialized1  # b'\x80\x02ccloudpickle.cloudpickle\n_rehydrate_skeleton_class\nq\x00(ccloudpickle.cloudpickle\n_builtin_type\nq\x01X\t\x00\x00\x00ClassTypeq\x02\x85q\x03Rq\x04X\x03\x00\x00\x00Fooq\x05c__builtin__\nobject\nq\x06\x85q\x07}q\x08X\x07\x00\x00\x00__doc__q\tNs\x87q\nRq\x0b}q\x0c(X\n\x00\x00\x00__module__q\rX\x08\x00\x00\x00__main__q\x0eX\x08\x00\x00\x00__init__q\x0fccloudpickle.cloudpickle\n_fill_function\nq\x10(ccloudpickle.cloudpickle\n_make_skel_func\nq\x11h\x01X\x08\x00\x00\x00CodeTypeq\x12\x85q\x13Rq\x14(K\x01K\x00K\x01K\x01KCc_codecs\nencode\nq\x15X\x04\x00\x00\x00d\x00S\x00q\x16X\x06\x00\x00\x00latin1q\x17\x86q\x18Rq\x19N\x85q\x1a)X\x04\x00\x00\x00selfq\x1b\x85q\x1cX\x1e\x00\x00\x00<ipython-input-1-d9b5c81388ae>q\x1dh\x0fK\x04h\x15X\x02\x00\x00\x00\x00\x01q\x1eh\x17\x86q\x1fRq ))tq!Rq"J\xff\xff\xff\xff}q#\x87q$Rq%}q&N}q\'NtRutR.'
serialized2  # b'\x80\x02ccloudpickle.cloudpickle\n_rehydrate_skeleton_class\nq\x00(ccloudpickle.cloudpickle\n_builtin_type\nq\x01X\t\x00\x00\x00ClassTypeq\x02\x85q\x03Rq\x04X\x03\x00\x00\x00Fooq\x05c__builtin__\nobject\nq\x06\x85q\x07}q\x08X\x07\x00\x00\x00__doc__q\tNs\x87q\nRq\x0b}q\x0c(X\n\x00\x00\x00__module__q\rX\x08\x00\x00\x00__main__q\x0eX\x08\x00\x00\x00__init__q\x0fccloudpickle.cloudpickle\n_fill_function\nq\x10(ccloudpickle.cloudpickle\n_make_skel_func\nq\x11h\x01X\x08\x00\x00\x00CodeTypeq\x12\x85q\x13Rq\x14(K\x01K\x00K\x01K\x01KCc_codecs\nencode\nq\x15X\x04\x00\x00\x00d\x00S\x00q\x16X\x06\x00\x00\x00latin1q\x17\x86q\x18Rq\x19N\x85q\x1a)X\x04\x00\x00\x00selfq\x1b\x85q\x1cX\x1e\x00\x00\x00<ipython-input-2-a08e1f07615d>q\x1dh\x0fK\x02h\x15X\x02\x00\x00\x00\x00\x01q\x1eh\x17\x86q\x1fRq ))tq!Rq"J\xff\xff\xff\xff}q#\x87q$Rq%}q&N}q\'NtRutR.'

They seem to be the same everywhere except that the first includes the string <ipython-input-1-d9b5c81388ae>q\x1dh\x0fK\x04h and the second includes the string <ipython-input-2-a08e1f07615d>q\x1dh\x0fK\x02h. Any idea where these strings come from or if it is possible to remove them?

cc @Wapaul1 @mehrdadn

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions