Description
When I use PyTorch to train a small network on a multi-user machine (Ubuntu 16.04, CUDA 8.0, Python 2.7), I intermittently hit an OSError. It occurs at unpredictable times: sometimes within the first epoch, sometimes after 4 epochs. The log is:
Traceback (most recent call last):
File "autoencodertrain.py", line 43, in <module>
data = dataloader.get_next_iter()
File "/home/pytorch/codes/gan/xgan/data/newdata_loader.py", line 45, in get_next_iter
dataB = self.dataLoaderB.__iter__().next()
File "/usr/local/lib/python2.7/dist-packages/torch/utils/data/dataloader.py", line 254, in __next__
idx, batch = self._get_batch()
File "/usr/local/lib/python2.7/dist-packages/torch/utils/data/dataloader.py", line 233, in _get_batch
return self.data_queue.get()
File "/usr/lib/python2.7/multiprocessing/queues.py", line 378, in get
return recv()
File "/usr/local/lib/python2.7/dist-packages/torch/multiprocessing/queue.py", line 22, in recv
return pickle.loads(buf)
File "/usr/lib/python2.7/pickle.py", line 1388, in loads
return Unpickler(file).load()
File "/usr/lib/python2.7/pickle.py", line 864, in load
dispatch[key](self)
File "/usr/lib/python2.7/pickle.py", line 1139, in load_reduce
value = func(*args)
File "/usr/local/lib/python2.7/dist-packages/torch/multiprocessing/reductions.py", line 68, in rebuild_storage_fd
fd = multiprocessing.reduction.rebuild_handle(df)
File "/usr/lib/python2.7/multiprocessing/reduction.py", line 170, in rebuild_handle
new_handle = recv_handle(conn)
File "/usr/lib/python2.7/multiprocessing/reduction.py", line 85, in recv_handle
return _multiprocessing.recvfd(conn.fileno())
OSError: [Errno 4] Interrupted system call
Can anyone help solve this problem, or give some hints on how to solve it?
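
For context on the last frame of the traceback: Python 2.7 does not automatically retry a system call that is interrupted by a signal (automatic retry only arrived with PEP 475 in Python 3.5), so an `EINTR` raised inside `recvfd` propagates as the `OSError` shown above. The following is only a minimal sketch of the generic retry-on-`EINTR` pattern, not a confirmed fix for this issue. The helper name `retry_on_eintr` is hypothetical and is not part of PyTorch or the standard library.

```python
import errno


def retry_on_eintr(func, *args, **kwargs):
    """Call func(*args, **kwargs), retrying while it fails with EINTR.

    Hypothetical helper for illustration only. Any error other than
    EINTR is re-raised unchanged.
    """
    while True:
        try:
            return func(*args, **kwargs)
        except (OSError, IOError) as exc:
            # EINTR means a signal arrived during the system call;
            # the call itself did not fail, so it is retried.
            if exc.errno != errno.EINTR:
                raise
```

A retry like this is only safe around the single interrupted call. Wrapping a higher-level call such as `self.data_queue.get()` is an assumption that may not hold here, because part of the message may already have been consumed when the interruption occurs.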