Conversation

@ssnl (Collaborator) commented Jun 13, 2019

Some DataLoader tests are flaky on Python 2 with the following error:

Jun 12 22:17:31 Traceback (most recent call last):
Jun 12 22:17:31   File "test_dataloader.py", line 798, in test_iterable_dataset
Jun 12 22:17:31     fetched = sorted([d.item() for d in dataloader_iter])
Jun 12 22:17:31   File "/opt/python/2.7.9/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 697, in __next__
Jun 12 22:17:31     idx, data = self._get_data()
Jun 12 22:17:31   File "/opt/python/2.7.9/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 664, in _get_data
Jun 12 22:17:31     success, data = self._try_get_data()
Jun 12 22:17:31   File "/opt/python/2.7.9/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 617, in _try_get_data
Jun 12 22:17:31     data = self.data_queue.get(timeout=timeout)
Jun 12 22:17:31   File "/opt/python/2.7.9/lib/python2.7/multiprocessing/queues.py", line 135, in get
Jun 12 22:17:31     res = self._recv()
Jun 12 22:17:31   File "/opt/python/2.7.9/lib/python2.7/site-packages/torch/multiprocessing/queue.py", line 22, in recv
Jun 12 22:17:31     return pickle.loads(buf)
Jun 12 22:17:31   File "/opt/python/2.7.9/lib/python2.7/pickle.py", line 1382, in loads
Jun 12 22:17:31     return Unpickler(file).load()
Jun 12 22:17:31   File "/opt/python/2.7.9/lib/python2.7/pickle.py", line 858, in load
Jun 12 22:17:31     dispatch[key](self)
Jun 12 22:17:31   File "/opt/python/2.7.9/lib/python2.7/pickle.py", line 1133, in load_reduce
Jun 12 22:17:31     value = func(*args)
Jun 12 22:17:31   File "/opt/python/2.7.9/lib/python2.7/site-packages/torch/multiprocessing/reductions.py", line 274, in rebuild_storage_fd
Jun 12 22:17:31     fd = multiprocessing.reduction.rebuild_handle(df)
Jun 12 22:17:31   File "/opt/python/2.7.9/lib/python2.7/multiprocessing/reduction.py", line 157, in rebuild_handle
Jun 12 22:17:31     new_handle = recv_handle(conn)
Jun 12 22:17:31   File "/opt/python/2.7.9/lib/python2.7/multiprocessing/reduction.py", line 83, in recv_handle
Jun 12 22:17:31     return _multiprocessing.recvfd(conn.fileno())
Jun 12 22:17:31 OSError: [Errno 4] Interrupted system call

Apparently, Python 2.7's recvfd calls recvmsg without retrying on EINTR: https://github.com/python/cpython/blob/2.7/Modules/_multiprocessing/multiprocessing.c#L174
So we should wrap the call in an outer retry loop that catches EINTR and tries again.
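
For illustration only, here is a minimal sketch of such a retry wrapper; the name retry_on_eintr and the exact call site are assumptions, not necessarily what this PR implements. On Python 2, a system call interrupted by a signal surfaces as EINTR instead of being retried automatically (PEP 475 only changed this in Python 3.5):

import errno

def retry_on_eintr(fn, *args, **kwargs):
    # Keep calling fn until it completes without being interrupted by a signal.
    while True:
        try:
            return fn(*args, **kwargs)
        except (OSError, IOError) as e:
            if e.errno != errno.EINTR:
                raise
            # EINTR: the call was interrupted before any data was transferred; retry.

# Illustrative (assumed) use at the failing call site in rebuild_storage_fd:
#   fd = retry_on_eintr(multiprocessing.reduction.rebuild_handle, df)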

@pytorchbot added the "module: multiprocessing" (Related to torch.multiprocessing) label Jun 13, 2019
@soumith (Contributor) commented Jun 13, 2019

@pytorchbot merge this please

@ssnl (Collaborator, Author) commented Jun 13, 2019

closes #4220

@ssnl (Collaborator, Author) commented Jun 13, 2019

@pytorchbot merge this please

@pytorchbot added the "merge-this-please" (Was marked for merge with @pytorchbot merge this please) label Jun 13, 2019
@facebook-github-bot (Contributor) left a comment

@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ezyang (Contributor) commented Jun 13, 2019

This is a very nice catch :>

@ssnl (Collaborator, Author) commented Jun 13, 2019

@ezyang did something go wrong during landing? :)

@gchanan (Contributor) commented Jun 13, 2019

@ssnl looks like a spurious failure, I'm retrying.

@gchanan (Contributor) commented Jun 13, 2019

@pytorchbot rebase this please

@ezyang (Contributor) commented Jun 14, 2019

For future reference: if the land fails, "rebase this please" on the GitHub side is usually not the right resolution, since it means rerunning all of the tests on the fbcode side. If the errors are spurious (as the land failure message suggests), the first thing to try is selecting a different land style; the second is "Rebase & Test" on the internal diff.

@facebook-github-bot (Contributor) left a comment

@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ezyang merged this pull request in bc62810.

@ssnl deleted the reduction_ENINTR branch June 14, 2019 16:44

Labels: merge-this-please, module: multiprocessing, open source
