UnpicklingError when trying to load multiple objects from a file #18436

@baldassarreFe

Description

🐛 UnpicklingError when trying to load multiple objects from a file

Pickle allows dumping multiple self-contained objects into the same file and loading them back later through subsequent reads. PyTorch's serialization has the same behavior when using an in-memory buffer like io.BytesIO, but raises an error when using a regular file.

To Reproduce

This is the behavior of the standard pickle module:

import io
import torch
import pickle

b = open('/tmp/file.pt', 'wb')
for i in range(3):
    was_at = b.tell()
    pickle.dump(torch.ones(10), b)
    print(f'{i}: {was_at:04d}-{b.tell():04d} ({b.tell()-was_at:03d})')
b.close()
>>> 0: 0000-0427 (427)
>>> 1: 0427-0854 (427)
>>> 2: 0854-1281 (427)

i = 0
b = open('/tmp/file.pt', 'rb')
while True:
    try:
        was_at = b.tell()
        pickle.load(b)
        print(f'{i}: {was_at:04d}-{b.tell():04d} ({b.tell()-was_at:03d})')
        i += 1
    except EOFError:
        break
b.close()
>>> 0: 0000-0427 (427)
>>> 1: 0427-0854 (427)
>>> 2: 0854-1281 (427)

PyTorch works fine with io.BytesIO; I get the same behavior:

b = io.BytesIO()
for i in range(3):
    was_at = b.tell()
    torch.save(torch.ones(10), b)
    print(f'{i}: {was_at:04d}-{b.tell():04d} ({b.tell()-was_at:03d})')
>>> 0: 0000-0377 (377)
>>> 1: 0377-0754 (377)
>>> 2: 0754-1131 (377)

i = 0
b.seek(0)
while True:
    try:
        was_at = b.tell()
        torch.load(b)
        print(f'{i}: {was_at:04d}-{b.tell():04d} ({b.tell()-was_at:03d})')
        i += 1
    except EOFError:
        break
>>> 0: 0000-0377 (377)
>>> 1: 0377-0754 (377)
>>> 2: 0754-1131 (377)

However, an UnpicklingError is raised when using PyTorch's serialization methods on a regular file:

b = open('/tmp/file.pt', 'wb')
for i in range(3):
    was_at = b.tell()
    torch.save(torch.ones(10), b)
    print(f'{i}: {was_at:04d}-{b.tell():04d} ({b.tell()-was_at:03d})')
b.close()
>>> 0: 0000-0377 (377)
>>> 1: 0377-0754 (377)
>>> 2: 0754-1131 (377)

i = 0
b = open('/tmp/file.pt', 'rb')
while True:
    try:
        was_at = b.tell()
        torch.load(b)
        print(f'{i}: {was_at:04d}-{b.tell():04d} ({b.tell()-was_at:03d})')
        i += 1
    except EOFError:
        break
b.close()
>>> 0: 0000--425 (-425)
---------------------------------------------------------------------------
UnpicklingError                           Traceback (most recent call last)
<ipython-input-38-a8789bdba75a> in <module>
   12     try:
   13         was_at = b.tell()
---> 14         torch.load(b)
   15         print(f'{i}: {was_at:04d}-{b.tell():04d} ({b.tell()-was_at:03d})')
   16         i+=1

.../python3.7/site-packages/torch/serialization.py in load(f, map_location, pickle_module)
  366         f = open(f, 'rb')
  367     try:
--> 368         return _load(f, map_location, pickle_module)
  369     finally:
  370         if new_fd:

.../python3.7/site-packages/torch/serialization.py in _load(f, map_location, pickle_module)
  530             f.seek(0)
  531 
--> 532     magic_number = pickle_module.load(f)
  533     if magic_number != MAGIC_NUMBER:
  534         raise RuntimeError("Invalid magic number; corrupt file?")

UnpicklingError: invalid load key, '\x0a'.

Note how the read position inside the file, as reported by b.tell(), ends up negative: -425.
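A workaround until this is fixed: slurp the whole file into an in-memory buffer and load from that, since loading repeatedly from io.BytesIO works as shown above. The sketch below uses pickle as a stand-in for torch.save/torch.load so it is self-contained; the load_all helper is a hypothetical name, and you would pass load=torch.load for real PyTorch files.

import io
import os
import pickle
import tempfile

def load_all(path, load=pickle.load):
    # Read the whole file into an in-memory buffer, then load
    # objects one by one until the buffer is exhausted.
    with open(path, 'rb') as f:
        buf = io.BytesIO(f.read())
    objects = []
    end = len(buf.getbuffer())
    while buf.tell() < end:
        objects.append(load(buf))
    return objects

# Demo: write three records (pickle stands in for torch.save here).
path = os.path.join(tempfile.mkdtemp(), 'file.pt')
with open(path, 'wb') as f:
    for i in range(3):
        pickle.dump([i] * 10, f)

objs = load_all(path)

This avoids handing the real file object to torch.load at all, at the cost of holding the entire file in memory.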

Expected behavior

The serialization methods of pickle and PyTorch should behave the same way: torch.load should read one object per call and leave the file position at the start of the next object.

Environment

PyTorch version: 1.0.1.post2
Is debug build: No
CUDA used to build PyTorch: 10.0.130

OS: Ubuntu 18.04.2 LTS
GCC version: (Ubuntu 7.3.0-27ubuntu1~18.04) 7.3.0
CMake version: Could not collect

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 10.1
GPU models and configuration: GPU 0: GeForce GTX 1050 Ti with Max-Q Design
Nvidia driver version: 418.43
cuDNN version: Could not collect

Versions of relevant libraries:
[pip] 19.0.3
[conda] pytorch                   1.0.1           py3.7_cuda10.0.130_cudnn7.4.2_2    pytorch

Additional context

The main reason I want to serialize multiple objects individually, rather than packing them in a list, is that they represent inputs that may be created at different times by different processes, but that I still want to process in batch.
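For that multi-writer use case, another stopgap that sidesteps the file-position issue entirely is to serialize each object to its own buffer and write it as a length-prefixed record; the reader then hands each payload to the loader as a fresh BytesIO. A minimal sketch, with pickle standing in for torch.save/torch.load and hypothetical helper names write_record/read_records:

import io
import pickle
import struct

def write_record(f, obj, save=pickle.dump):
    # Serialize obj to its own buffer, then write an 8-byte
    # little-endian length prefix followed by the payload.
    # Pass save=torch.save for PyTorch objects.
    buf = io.BytesIO()
    save(obj, buf)
    payload = buf.getvalue()
    f.write(struct.pack('<Q', len(payload)))
    f.write(payload)

def read_records(f, load=pickle.load):
    # Yield objects one record at a time; each payload is loaded
    # from a fresh BytesIO, so the loader never moves f's position.
    while True:
        header = f.read(8)
        if len(header) < 8:
            return
        (size,) = struct.unpack('<Q', header)
        yield load(io.BytesIO(f.read(size)))

f = io.BytesIO()
for i in range(3):
    write_record(f, {'step': i})
f.seek(0)
records = list(read_records(f))

The length prefix also makes it cheap to skip records without deserializing them, which helps when several processes append to the same log-style file.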

    Labels

    module: pickle, module: serialization, triaged
