🐛 UnpicklingError when trying to load multiple objects from a file
Pickle allows dumping multiple self-contained objects into the same file and later loading them through subsequent reads. PyTorch's pickling mechanism behaves the same way when using an in-memory buffer like io.BytesIO, but raises an error when using regular files.
To Reproduce
This is the behavior of the standard pickle module:
import io
import torch
import pickle
b = open('/tmp/file.pt', 'wb')
for i in range(3):
    was_at = b.tell()
    pickle.dump(torch.ones(10), b)
    print(f'{i}: {was_at:04d}-{b.tell():04d} ({b.tell()-was_at:03d})')
b.close()

>>> 0: 0000-0427 (427)
>>> 1: 0427-0854 (427)
>>> 2: 0854-1281 (427)
i = 0
b = open('/tmp/file.pt', 'rb')
while True:
    try:
        was_at = b.tell()
        pickle.load(b)
        print(f'{i}: {was_at:04d}-{b.tell():04d} ({b.tell()-was_at:03d})')
        i += 1
    except EOFError:
        break
b.close()

>>> 0: 0000-0427 (427)
>>> 1: 0427-0854 (427)
>>> 2: 0854-1281 (427)
PyTorch works fine with io.BytesIO; I get the same behavior:
b = io.BytesIO()
for i in range(3):
    was_at = b.tell()
    torch.save(torch.ones(10), b)
    print(f'{i}: {was_at:04d}-{b.tell():04d} ({b.tell()-was_at:03d})')

>>> 0: 0000-0377 (377)
>>> 1: 0377-0754 (377)
>>> 2: 0754-1131 (377)
i = 0
b.seek(0)
while True:
    try:
        was_at = b.tell()
        torch.load(b)
        print(f'{i}: {was_at:04d}-{b.tell():04d} ({b.tell()-was_at:03d})')
        i += 1
    except EOFError:
        break

>>> 0: 0000-0377 (377)
>>> 1: 0377-0754 (377)
>>> 2: 0754-1131 (377)
However, an UnpicklingError is raised when using PyTorch's serialization methods on a regular file:
b = open('/tmp/file.pt', 'wb')
for i in range(3):
    was_at = b.tell()
    torch.save(torch.ones(10), b)
    print(f'{i}: {was_at:04d}-{b.tell():04d} ({b.tell()-was_at:03d})')
b.close()

>>> 0: 0000-0377 (377)
>>> 1: 0377-0754 (377)
>>> 2: 0754-1131 (377)
i = 0
b = open('/tmp/file.pt', 'rb')
while True:
    try:
        was_at = b.tell()
        torch.load(b)
        print(f'{i}: {was_at:04d}-{b.tell():04d} ({b.tell()-was_at:03d})')
        i += 1
    except EOFError:
        break
b.close()

>>> 0: 0000--425 (-425)
---------------------------------------------------------------------------
UnpicklingError Traceback (most recent call last)
<ipython-input-38-a8789bdba75a> in <module>
12 try:
13 was_at = b.tell()
---> 14 torch.load(b)
15 print(f'{i}: {was_at:04d}-{b.tell():04d} ({b.tell()-was_at:03d})')
16 i+=1
.../python3.7/site-packages/torch/serialization.py in load(f, map_location, pickle_module)
366 f = open(f, 'rb')
367 try:
--> 368 return _load(f, map_location, pickle_module)
369 finally:
370 if new_fd:
.../python3.7/site-packages/torch/serialization.py in _load(f, map_location, pickle_module)
530 f.seek(0)
531
--> 532 magic_number = pickle_module.load(f)
533 if magic_number != MAGIC_NUMBER:
534 raise RuntimeError("Invalid magic number; corrupt file?")
UnpicklingError: invalid load key, '\x0a'.
Note how the read location inside the file reported by b.tell() turns out to be negative: -425. The traceback also shows that torch.serialization._load calls f.seek(0) before reading the magic number, so the second load re-reads the stream from offset 0 instead of continuing from where the first object ended.
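Until this is fixed, a workaround consistent with the observations above is to read the whole file into an io.BytesIO and deserialize from there, since in-memory buffers behave correctly. This is a minimal sketch, not part of the report: load_all is a hypothetical helper, and pickle.load stands in for any loader that accepts a file-like object and advances it past one object (torch.load should be usable the same way, per the BytesIO experiment above).

```python
import io
import pickle


def load_all(path, load_fn=pickle.load):
    """Read the whole file into memory, then deserialize objects one
    after another until the buffer is exhausted.

    load_fn is any loader taking a file-like object, e.g. pickle.load
    or (assuming the BytesIO behavior shown above) torch.load.
    """
    with open(path, 'rb') as f:
        buf = io.BytesIO(f.read())
    end = buf.getbuffer().nbytes  # total number of bytes in the buffer
    objs = []
    while buf.tell() < end:
        objs.append(load_fn(buf))
    return objs
```

For torch, one would call `load_all('/tmp/file.pt', torch.load)`; the trade-off is that the entire file must fit in memory.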
Expected behavior
The serialization methods of pickle and PyTorch should behave in similar ways.
Environment
PyTorch version: 1.0.1.post2
Is debug build: No
CUDA used to build PyTorch: 10.0.130
OS: Ubuntu 18.04.2 LTS
GCC version: (Ubuntu 7.3.0-27ubuntu1~18.04) 7.3.0
CMake version: Could not collect
Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 10.1
GPU models and configuration: GPU 0: GeForce GTX 1050 Ti with Max-Q Design
Nvidia driver version: 418.43
cuDNN version: Could not collect
Versions of relevant libraries:
[pip] 19.0.3
[conda] pytorch 1.0.1 py3.7_cuda10.0.130_cudnn7.4.2_2 pytorch
Additional context
The main reason I want to serialize multiple objects individually rather than packing them into a list is that they represent inputs that may be created at different times by different processes, and that I still want to process in batch.
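For that append-from-multiple-processes use case, another option that sidesteps the seek(0) issue entirely is length-prefixed records: serialize each object into its own in-memory buffer, then append the payload size and the payload to the shared file. Each read then hands the loader a fresh BytesIO positioned at 0. A sketch under stated assumptions (append_record and read_records are hypothetical helpers; pickle stands in for torch.save/torch.load, which accept the same file-like objects):

```python
import io
import pickle
import struct


def append_record(path, obj, save_fn=pickle.dump):
    """Serialize obj into an in-memory buffer, then append it to path
    as a length-prefixed record (8-byte little-endian size + payload)."""
    buf = io.BytesIO()
    save_fn(obj, buf)
    payload = buf.getvalue()
    with open(path, 'ab') as f:
        f.write(struct.pack('<Q', len(payload)))
        f.write(payload)


def read_records(path, load_fn=pickle.load):
    """Yield each stored object. Every load_fn call gets a fresh
    BytesIO starting at position 0, so internal seek(0) calls are
    harmless."""
    with open(path, 'rb') as f:
        while True:
            header = f.read(8)
            if not header:
                break  # clean end of file
            (length,) = struct.unpack('<Q', header)
            yield load_fn(io.BytesIO(f.read(length)))
```

Appending whole records also keeps concurrent writers from interleaving partial objects, though on most platforms that still requires file locking or O_APPEND-style atomicity guarantees.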