file_descriptor sharing strategy may be leaking FDs, resulting in DataLoader causing RuntimeError: received 0 items of ancdata #973


Description

@jfsantos

Editorial note: If you are having this problem, try running torch.multiprocessing.set_sharing_strategy('file_system') right after your import of torch.
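
For reference, a minimal sketch of that workaround (nothing else about your code needs to change):

import torch
import torch.multiprocessing

# The default 'file_descriptor' strategy backs each tensor received from a
# worker with an open file descriptor, so descriptors can pile up. The
# 'file_system' strategy shares tensors through named files in shared
# memory instead, which avoids exhausting the per-process FD limit.
torch.multiprocessing.set_sharing_strategy('file_system')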


I am using a DataLoader in my code with a custom Dataset class, and it worked fine during training for several epochs. However, when testing my model, after a bit less than 1k iterations, I'm getting the following error:

RuntimeError                              Traceback (most recent call last)
/home/jfsantos/src/pytorch_models/test_model.py in <module>()
     82
     83 print('Generating samples...')
---> 84 for k, batch in tqdm(enumerate(test_loader)):
     85     f = G_test.audio_paths[k]
     86     spec, phase = spectrogram_from_file(f, window=window, step=step)

/home/jfsantos/anaconda3/envs/pytorch/lib/python3.5/site-packages/tqdm/_tqdm.py in __iter__(self)
    831 """, fp_write=getattr(self.fp, 'write', sys.stderr.write))
    832
--> 833             for obj in iterable:
    834                 yield obj
    835                 # Update and print the progressbar.

/home/jfsantos/anaconda3/envs/pytorch/lib/python3.5/site-packages/torch/utils/data/dataloader.py in __next__(self)
    166         while True:
    167             assert (not self.shutdown and self.batches_outstanding > 0)
--> 168             idx, batch = self.data_queue.get()
    169             self.batches_outstanding -= 1
    170             if idx != self.rcvd_idx:

/home/jfsantos/anaconda3/envs/pytorch/lib/python3.5/multiprocessing/queues.py in get(self)
    343             res = self._reader.recv_bytes()
    344         # unserialize the data after having released the lock
--> 345         return ForkingPickler.loads(res)
    346
    347     def put(self, obj):

/home/jfsantos/anaconda3/envs/pytorch/lib/python3.5/site-packages/torch/multiprocessing/reductions.py in rebuild_storage_fd(cls, df, size)
     68         fd = multiprocessing.reduction.rebuild_handle(df)
     69     else:
---> 70         fd = df.detach()
     71     try:
     72         storage = storage_from_cache(cls, fd_id(fd))

/home/jfsantos/anaconda3/envs/pytorch/lib/python3.5/multiprocessing/resource_sharer.py in detach(self)
     56             '''Get the fd.  This should only be called once.'''
     57             with _resource_sharer.get_connection(self._id) as conn:
---> 58                 return reduction.recv_handle(conn)
     59
     60

/home/jfsantos/anaconda3/envs/pytorch/lib/python3.5/multiprocessing/reduction.py in recv_handle(conn)
    179         '''Receive a handle over a local connection.'''
    180         with socket.fromfd(conn.fileno(), socket.AF_UNIX, socket.SOCK_STREAM) as s:
--> 181             return recvfds(s, 1)[0]
    182
    183     def DupFd(fd):

/home/jfsantos/anaconda3/envs/pytorch/lib/python3.5/multiprocessing/reduction.py in recvfds(sock, size)
    158             if len(ancdata) != 1:
    159                 raise RuntimeError('received %d items of ancdata' %
--> 160                                    len(ancdata))
    161             cmsg_level, cmsg_type, cmsg_data = ancdata[0]
    162             if (cmsg_level == socket.SOL_SOCKET and

RuntimeError: received 0 items of ancdata

However, if I just do idxs = [k for k, batch in tqdm(enumerate(test_loader))], I do not have this issue.
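
Editorial note: that difference is consistent with how the file_descriptor sharing strategy behaves. Each tensor received from a worker is backed by an open descriptor in the main process, and that descriptor stays open for as long as the tensor is alive. A hedged illustration of the two access patterns (test_loader as in the report; whether this is the exact trigger in the original loop is not confirmed here):

# Each batch becomes garbage after its iteration, so its backing
# descriptors are closed as the loop advances; FD usage stays roughly flat.
idxs = [k for k, batch in enumerate(test_loader)]

# Keeping every received batch alive keeps every backing descriptor open,
# so FD usage grows with the iteration count until the process limit is hit.
kept = [batch for batch in test_loader]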

I do not really know how to debug this, as my knowledge of how PyTorch handles this is currently very limited, but I could help given some instructions. Does anyone have an idea of where I could start?
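
Editorial note: one concrete starting point (a hedged sketch; /proc/self/fd is Linux-only) is to log the open-descriptor count of the main process while iterating and compare it against the process limit:

import os
import resource

def open_fds():
    # Each entry in /proc/self/fd corresponds to one open descriptor.
    return len(os.listdir('/proc/self/fd'))

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print('FD limit (soft/hard): %d/%d' % (soft, hard))

for k, batch in enumerate(test_loader):
    if k % 100 == 0:
        print('iteration %d: %d open FDs' % (k, open_fds()))

If the count climbs steadily toward the soft limit, the error above is plain descriptor exhaustion; raising the soft limit with resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard)) is another commonly used mitigation alongside the sharing-strategy change noted above.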

Labels

has workaround
high priority
module: crash (problem manifests as a hard crash, as opposed to a RuntimeError)
module: dataloader (related to torch.utils.data.DataLoader and Sampler)
triaged (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
