
pickle: pad numpy bytes to ensure correct alignment with memmap #570

Closed
aabadie wants to merge 7 commits into joblib:master from aabadie:fix_memmap_aligned

Conversation

@aabadie (Contributor) commented Nov 9, 2017

This is an attempt to fix #563.

The idea is to apply the same strategy that numpy uses, but inside the pickle file, since the array bytes are stored inline in the pickle stream.

For the moment it doesn't work (I'm not even sure my test function is valid...).
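
The alignment-padding arithmetic such a strategy relies on can be sketched as follows (a minimal illustration, not the PR's actual code; the helper name is made up):

```python
def padding_for_alignment(offset, alignment=8):
    """Number of padding bytes so that data starting at `offset` is aligned."""
    return (alignment - offset % alignment) % alignment

# An array payload starting at stream offset 19 needs 5 padding bytes to land
# on an 8-byte boundary; an offset of 16 is already aligned and needs none.
print(padding_for_alignment(19))  # 5
print(padding_for_alignment(16))  # 0
```

The hard part, as the rest of the thread shows, is knowing the exact `offset` of the array bytes inside the pickle stream.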

@aabadie (Contributor, Author) commented Nov 9, 2017

If we keep a similar strategy in the end, I'll also have to provide new test data (to test backward compatibility).

@codecov (bot) commented Nov 9, 2017

Codecov Report

Merging #570 into master will increase coverage by 0.05%.
The diff coverage is 96.15%.


@@            Coverage Diff            @@
##           master    #570      +/-   ##
=========================================
+ Coverage   94.24%   94.3%   +0.05%     
=========================================
  Files          39      39              
  Lines        5008    5059      +51     
=========================================
+ Hits         4720    4771      +51     
  Misses        288     288
Impacted Files Coverage Δ
joblib/test/test_numpy_pickle.py 99.06% <100%> (+0.01%) ⬆️
joblib/numpy_pickle.py 97.89% <95.23%> (-0.58%) ⬇️
joblib/_parallel_backends.py 93.45% <0%> (-1.41%) ⬇️
joblib/test/test_memory.py 99.36% <0%> (+0.42%) ⬆️
joblib/memory.py 94.13% <0%> (+0.73%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update efa87af...5f2ce2f.

@aabadie (Contributor, Author) commented Nov 13, 2017

@ogrisel, any idea why this approach doesn't work?

@aabadie (Contributor, Author) commented Nov 13, 2017

Ok, I think I made some progress: the dumps generated from the NumpyArrayWrapper also contain the pickle header, which is already taken into account by the position in the pickle stream. I have to subtract these pickle header and closing bytes from the computed position.

I computed 22 bytes to remove, but actually only 19 works. I still can't really explain why...
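
The header and closing bytes in question are ordinary pickle opcodes; a quick stdlib check (illustrative only, not the PR's code) shows the PROTO prefix and STOP suffix that surround every protocol-2 stream:

```python
import pickle

# Every protocol-2 pickle stream starts with a 2-byte PROTO opcode
# (b'\x80' plus the protocol number) and ends with a 1-byte STOP opcode.
# Any offset computed from a standalone dumps() of the wrapper therefore
# has to subtract these framing bytes.
s = pickle.dumps(0, protocol=2)
assert s[:2] == b"\x80\x02"  # PROTO 2
assert s[-1:] == b"."        # STOP
```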

And it fails on AppVeyor, maybe because it's a 32-bit version, or something like that.

@lesteve (Member) commented Nov 14, 2017

Hmmm it looks like Travis Python 2.7 is broken as well.

@aabadie (Contributor, Author) commented Nov 14, 2017

it looks like Travis Python 2.7 is broken as well.

Yes, I wanted to have a more complex test, just to be sure. There's something with the pickle protocol version, I think.

@aabadie (Contributor, Author) commented Nov 15, 2017

Hmmm it looks like Travis Python 2.7 is broken as well.

Not the case anymore!!! And Windows seems to work as well :)

I finally used another custom pickler to determine the actual size of the NumpyArrayWrapper in the stream. That was the easy part, because even with it things were still not working.

The main problem came from pickle, which performs an optimisation when the same object has to be pickled again: it keeps an internal memo and writes a back-reference in place of the full pickled object. I fixed that by calling self.clear_memo() after pickling the NumpyArrayWrapper.
It might not be the right solution and it probably has side effects (mainly on performance, I think).

The last trick was to use np.int8 instead of a basic Python int for the introduced alignment_padding attribute of the array wrapper. This is because pickle again performs an optimisation: the stream generated after setting this attribute was different from the stream generated with the default constructor value (0).
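
The value-dependent sizing that motivates a fixed-width type can be seen with plain ints in the stdlib (an illustration, not the PR's code): pickle chooses a 1-, 2-, or 4-byte integer opcode depending on the value, so setting the attribute can change the stream length after the size was computed.

```python
import pickle

# Small ints fit a 1-byte BININT1 argument, larger ones need a 2- or 4-byte
# opcode, so two streams differing only in an int attribute value can have
# different lengths. A fixed-width type avoids this.
assert len(pickle.dumps(0, protocol=2)) == len(pickle.dumps(255, protocol=2))
assert len(pickle.dumps(0, protocol=2)) < len(pickle.dumps(70000, protocol=2))
```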

I know I introduced some ugly hacks to solve the padding issue, but that's the best I could come up with (for the moment).

self.order = order
self.dtype = dtype
self.allow_mmap = allow_mmap
try:

@aabadie (Contributor, Author) commented on the diff:

Might not be the best place to do that. Good ideas are welcome

@aabadie aabadie changed the title from "pickle: pad numpy bytes to ensure correct alignment with memmap (WIP)" to "pickle: pad numpy bytes to ensure correct alignment with memmap" on Nov 15, 2017
@aabadie aabadie added the bug label Nov 16, 2017
@aabadie (Contributor, Author) commented Nov 16, 2017

I tried locally to play with different values of pickle protocol and it hardly fails with values different from the default ones for both python 2 & 3.

@lesteve (Member) commented Nov 16, 2017

it hardly fails with values different from the default ones for both python 2 & 3.

What do you mean by "hardly fails"?

for memmap in l_reloaded:
assert isinstance(memmap, np.memmap)
np.testing.assert_array_equal(arr, memmap)
assert memmap.ctypes.data % 8 == 0

@lesteve (Member) commented on the diff:

Can you check .flags.aligned as well?

@aabadie (Contributor, Author) replied:

done

@aabadie (Contributor, Author) commented Nov 16, 2017

What do you mean by "hardly fails"?

I mean I have a weird issue with python 3 and pickle.HIGHEST_PROTOCOL (4):


    @with_numpy
    def test_memmap_alignment_padding(tmpdir):
        # Test that memmaped arrays returned by numpy.load are correctly aligned
    
        fname = tmpdir.join('test.mmap').strpath
        arr = np.random.randn(2)
    
        l = [arr, arr, arr, arr]
    
        numpy_pickle.dump(l, fname, protocol=pickle.HIGHEST_PROTOCOL)
>       l_reloaded = numpy_pickle.load(fname, mmap_mode='r')

joblib/test/test_numpy_pickle.py:931: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
joblib/numpy_pickle.py:645: in load
    obj = _unpickle(fobj, filename, mmap_mode)
joblib/numpy_pickle.py:575: in _unpickle
    obj = unpickler.load()
../../../anaconda3/lib/python3.5/pickle.py:1039: in load
    dispatch[key[0]](self)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <joblib.numpy_pickle.NumpyUnpickler object at 0x7f7482ad2f28>

    def load_reduce(self):
        stack = self.stack
        args = stack.pop()
        func = stack[-1]
>       stack[-1] = func(*args)
E       TypeError: 'str' object is not callable

../../../anaconda3/lib/python3.5/pickle.py:1396: TypeError

The generated pickle is invalid but it's hard to understand why.

@lesteve (Member) commented Nov 17, 2017

I mean I have a weird issue with python 3 and pickle.HIGHEST_PROTOCOL (4):

Can you add a test that fails? Basically, a test that runs on all protocols looks like a good idea.

@aabadie (Contributor, Author) commented Nov 17, 2017

Just to give more details when playing with protocol versions:

  • python3 (protocols 0 to 4): memmap not aligned with protocol 2, pickle broken with protocol 4
  • python2 (protocols 0 to 2): memmap not aligned with protocol 0

So 3 cases fail.

@aabadie (Contributor, Author) commented Nov 17, 2017

@lesteve, just pushed an adapted version of the test. It should be possible to find the correct alignment for (python 2, protocol 0) and (python 3, protocol 2), but for (python 3, protocol 4), I have no idea...

@lesteve (Member) commented Nov 17, 2017

The main problem came from pickle, which performs an optimisation when the same object has to be pickled again: it keeps an internal memo and writes a back-reference in place of the full pickled object. I fixed that by calling self.clear_memo() after pickling the NumpyArrayWrapper.

Can you comment a bit more on this? I know about the pickle memoization but I don't get why clear_memo is needed.

@aabadie (Contributor, Author) commented Nov 17, 2017

Can you comment a bit more on this? I know about the pickle memoization but I don't get why clear_memo is needed.

This fixes the problem when you have multiple objects of the same type to pickle. The current test uses a list of numpy arrays and shows this problem.

The first time, pickle writes the full pickled stream of the object; the next times, pickle checks the memo for a reference to the same object id. If the object id is found in the memo, pickle writes the reference, so the resulting pickle is different => the size is different. Since we are comparing against a separately pickled object (using NumpyArrayWrapperPickler), the computed padding is wrong. If I clear the memo, I'm sure the object will never be found => always the same length.

Maybe I should just drop the numpy array wrapper's id from the memo instead of flushing it entirely.
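
The memoization effect described above can be reproduced with the stdlib alone (a minimal sketch; Wrapper is a stand-in for the array wrapper, not joblib code):

```python
import io
import pickle


class Wrapper:
    pass


w = Wrapper()
buf = io.BytesIO()
p = pickle.Pickler(buf, protocol=2)

p.dump(w)
first = buf.tell()

p.dump(w)  # object id found in the memo: only a back-reference is written
second = buf.tell() - first
assert second < first

p.clear_memo()
p.dump(w)  # memo cleared: the full stream is written again, same length as dump #1
third = buf.tell() - first - second
assert third == first
```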

@aabadie (Contributor, Author) commented Nov 17, 2017

@lesteve, do you think it's worth supporting pickle protocol < 2 for py2 and py3? If not, we can concentrate on py3 issues.

@lesteve (Member) commented Nov 17, 2017

py3 issues are probably more of a priority I would say.

I have to say I don't manage to follow everything... I can't wrap my head around why the memoization is a problem; it feels like it should "just work". Which object is memoized, the array_wrapper or the array?

@aabadie (Contributor, Author) commented Nov 17, 2017

Which object is memoized, the array_wrapper or the array?

Every kind of object is memoized, and that's the problem. One cannot just remove the wrapper from the memo since, for example, the np.dtype will be inserted into it during pickling of the wrapper itself!

@aabadie (Contributor, Author) commented Nov 17, 2017

I dumped the memo just to have a look at the content. In pickle.py there's specific save code that depends on the protocol. I'm trying to understand it better.

@lesteve (Member) commented Nov 17, 2017

Ah, but that's because of your solution with a separate pickler for computing the length of the array_wrapper that is going to be written, right? Can you not set the memo of the dumb pickler to the main one?

@aabadie (Contributor, Author) commented Nov 17, 2017

Can you not set the memo of the dumb pickler to the main one?

Do you mean sharing the same memo instance between the 2 ? I can try.

@lesteve (Member) commented Nov 17, 2017

Do you mean sharing the same memo instance between the 2 ? I can try.

Yep this is what I had in mind. Seems hacky but worth a try. There may be complications that need to be taken into account. You may need to remove the array wrapper from the memo after computing the expected length because the main pickler may use the memoized array_wrapper if you are unlucky and the offset is zero (or use -1 as the default padding offset).

@aabadie aabadie force-pushed the fix_memmap_aligned branch 3 times, most recently from ac0bcaf to adb7c4d on November 18, 2017 23:31
@aabadie (Contributor, Author) commented Nov 20, 2017

Seems hacky but worth a try

I tried to share the same memo instance but it was not working as expected (I don't remember exactly why, sorry).

You may need to remove the array wrapper from the memo after computing the expected length because the main pickler may use the memoized array_wrapper if you are unlucky and the offset is zero (or use -1 as the default padding offset).

I came up with another approach, which is to set a copy of the numpy pickler's memo on the wrapper pickler. This way I ensure the future pickled wrapper will have the same size.
But it still doesn't fully work with a list containing more than 2 arrays... The sizes are nearly the same, but something still changes between the two and I couldn't find what it is (yet).

Also added another case: using a dict of arrays. This kind of data doesn't work either...

The issue with protocol 4 is now fixed; it was related to the new framing mechanism.
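
For context, the protocol-4 framing that was throwing the offsets off can be observed directly (a stdlib sketch, not the PR's code): protocol 4 inserts a FRAME opcode with an 8-byte length argument right after the protocol header, which protocols 3 and below do not have, so every payload offset shifts.

```python
import pickle

s = pickle.dumps([1, 2, 3], protocol=4)
assert s[:2] == b"\x80\x04"  # PROTO 4
assert s[2:3] == b"\x95"     # FRAME opcode, followed by an 8-byte frame size

# No FRAME opcode at this position in a protocol-2 stream of the same object.
assert pickle.dumps([1, 2, 3], protocol=2)[2:3] != b"\x95"
```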

@lesteve (Member) commented Nov 20, 2017

You seem to be on the right track, keep pushing, sorry if I am not of much help on the low-level details ...

@aabadie (Contributor, Author) commented Mar 5, 2019

I don't think I'll work on this again and I don't think it's worth it, so I'm closing. Feel free to reopen or to provide a similar change in another PR.



Successfully merging this pull request may close these issues.

Alignment issue when passing arrays through mmap
