Conversation

@ssnl (Collaborator) commented Jun 13, 2019

Some DataLoader tests are flaky on Python 2 with the following error:

Jun 12 22:17:31 Traceback (most recent call last):
Jun 12 22:17:31   File "test_dataloader.py", line 798, in test_iterable_dataset
Jun 12 22:17:31     fetched = sorted([d.item() for d in dataloader_iter])
Jun 12 22:17:31   File "/opt/python/2.7.9/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 697, in __next__
Jun 12 22:17:31     idx, data = self._get_data()
Jun 12 22:17:31   File "/opt/python/2.7.9/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 664, in _get_data
Jun 12 22:17:31     success, data = self._try_get_data()
Jun 12 22:17:31   File "/opt/python/2.7.9/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 617, in _try_get_data
Jun 12 22:17:31     data = self.data_queue.get(timeout=timeout)
Jun 12 22:17:31   File "/opt/python/2.7.9/lib/python2.7/multiprocessing/queues.py", line 135, in get
Jun 12 22:17:31     res = self._recv()
Jun 12 22:17:31   File "/opt/python/2.7.9/lib/python2.7/site-packages/torch/multiprocessing/queue.py", line 22, in recv
Jun 12 22:17:31     return pickle.loads(buf)
Jun 12 22:17:31   File "/opt/python/2.7.9/lib/python2.7/pickle.py", line 1382, in loads
Jun 12 22:17:31     return Unpickler(file).load()
Jun 12 22:17:31   File "/opt/python/2.7.9/lib/python2.7/pickle.py", line 858, in load
Jun 12 22:17:31     dispatch[key](self)
Jun 12 22:17:31   File "/opt/python/2.7.9/lib/python2.7/pickle.py", line 1133, in load_reduce
Jun 12 22:17:31     value = func(*args)
Jun 12 22:17:31   File "/opt/python/2.7.9/lib/python2.7/site-packages/torch/multiprocessing/reductions.py", line 274, in rebuild_storage_fd
Jun 12 22:17:31     fd = multiprocessing.reduction.rebuild_handle(df)
Jun 12 22:17:31   File "/opt/python/2.7.9/lib/python2.7/multiprocessing/reduction.py", line 157, in rebuild_handle
Jun 12 22:17:31     new_handle = recv_handle(conn)
Jun 12 22:17:31   File "/opt/python/2.7.9/lib/python2.7/multiprocessing/reduction.py", line 83, in recv_handle
Jun 12 22:17:31     return _multiprocessing.recvfd(conn.fileno())
Jun 12 22:17:31 OSError: [Errno 4] Interrupted system call

Apparently, Python 2.7's recvfd calls recvmsg without retrying on EINTR: https://github.com/python/cpython/blob/2.7/Modules/_multiprocessing/multiprocessing.c#L174
So we should wrap the call in an outer retry loop that catches EINTR and tries again.
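
For illustration only, here is a minimal sketch of such a retry wrapper; the name retry_on_eintr and the exact call site are assumptions, not necessarily what this PR implements. On Python 2, a system call interrupted by a signal surfaces as EINTR instead of being retried automatically (PEP 475 only changed this in Python 3.5):

import errno

def retry_on_eintr(fn, *args, **kwargs):
    # Keep calling fn until it completes without being interrupted by a signal.
    while True:
        try:
            return fn(*args, **kwargs)
        except (OSError, IOError) as e:
            if e.errno != errno.EINTR:
                raise
            # EINTR: the call was interrupted before any data was transferred; retry.

# Illustrative (assumed) use at the failing call site in rebuild_storage_fd:
#   fd = retry_on_eintr(multiprocessing.reduction.rebuild_handle, df)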

@pytorchbot added the "module: multiprocessing" (Related to torch.multiprocessing) label Jun 13, 2019
@soumith (Contributor) commented Jun 13, 2019

@pytorchbot merge this please

@ssnl (Collaborator, Author) commented Jun 13, 2019

closes #4220

@ssnl (Collaborator, Author) commented Jun 13, 2019

@pytorchbot merge this please

@pytorchbot added the "merge-this-please" (Was marked for merge with @pytorchbot merge this please) label Jun 13, 2019
@facebook-github-bot (Contributor) left a comment

@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ezyang (Contributor) commented Jun 13, 2019

This is a very nice catch :>

@ssnl (Collaborator, Author) commented Jun 13, 2019

@ezyang did something go wrong during landing? :)

@gchanan (Contributor) commented Jun 13, 2019

@ssnl looks like a spurious failure, I'm retrying.

@gchanan (Contributor) commented Jun 13, 2019

@pytorchbot rebase this please

@ezyang (Contributor) commented Jun 14, 2019

For future reference: if the land fails, "rebase this please" on the GitHub side is usually not the right resolution, since it means rerunning all of the tests on the fbcode side. If the errors are spurious (as the land failure message suggests), the first thing to try is selecting a different land style; the second is "Rebase & Test" on the internal diff.

@facebook-github-bot (Contributor) left a comment

@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ezyang merged this pull request in bc62810.

@ssnl deleted the reduction_ENINTR branch June 14, 2019 16:44

Labels: merge-this-please, module: multiprocessing, open source
