Skip to content

Conversation

@nairbv
Copy link
Collaborator

@nairbv nairbv commented May 2, 2019

test was disabled for being flaky, re-enabled in #19421 but still occasionally failing:

https://circleci.com/gh/pytorch/pytorch/1520165?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link

Apr 29 19:51:58 ======================================================================
Apr 29 19:51:58 FAIL: test_proper_exit (__main__.TestDataLoader)
Apr 29 19:51:58 There might be ConnectionResetError or leaked semaphore warning (due to dirty process exit), but they are all safe to ignore
Apr 29 19:51:58 ----------------------------------------------------------------------
Apr 29 19:51:58 Traceback (most recent call last):
Apr 29 19:51:58   File "/var/lib/jenkins/workspace/test/common_utils.py", line 129, in wrapper
Apr 29 19:51:58     fn(*args, **kwargs)
Apr 29 19:51:58   File "test_dataloader.py", line 847, in test_proper_exit
Apr 29 19:51:58     self.fail(fail_msg + ', and had exception {}'.format(loader_p.exception))
Apr 29 19:51:58 AssertionError: test_proper_exit with use_workers=True, pin_memory=False, hold_iter_reference=False, exit_method=worker_kill: loader process did not terminate, and had exception Traceback (most recent call last):
Apr 29 19:51:58   File "test_dataloader.py", line 227, in run
Apr 29 19:51:58     super(ErrorTrackingProcess, self).run()
Apr 29 19:51:58   File "/opt/python/2.7.9/lib/python2.7/multiprocessing/process.py", line 114, in run
Apr 29 19:51:58     self._target(*self._args, **self._kwargs)
Apr 29 19:51:58   File "test_dataloader.py", line 424, in _test_proper_exit
Apr 29 19:51:58     for i, _ in enumerate(it):
Apr 29 19:51:58   File "/opt/python/2.7.9/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 545, in __next__
Apr 29 19:51:58     idx, batch = self._get_batch()
Apr 29 19:51:58   File "/opt/python/2.7.9/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 522, in _get_batch
Apr 29 19:51:58     success, data = self._try_get_batch()
Apr 29 19:51:58   File "/opt/python/2.7.9/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 480, in _try_get_batch
Apr 29 19:51:58     data = self.data_queue.get(timeout=timeout)
Apr 29 19:51:58   File "/opt/python/2.7.9/lib/python2.7/multiprocessing/queues.py", line 135, in get
Apr 29 19:51:58     res = self._recv()
Apr 29 19:51:58   File "/opt/python/2.7.9/lib/python2.7/site-packages/torch/multiprocessing/queue.py", line 22, in recv
Apr 29 19:51:58     return pickle.loads(buf)
Apr 29 19:51:58   File "/opt/python/2.7.9/lib/python2.7/pickle.py", line 1382, in loads
Apr 29 19:51:58     return Unpickler(file).load()
Apr 29 19:51:58   File "/opt/python/2.7.9/lib/python2.7/pickle.py", line 858, in load
Apr 29 19:51:58     dispatch[key](self)
Apr 29 19:51:58   File "/opt/python/2.7.9/lib/python2.7/pickle.py", line 1133, in load_reduce
Apr 29 19:51:58     value = func(*args)
Apr 29 19:51:58   File "/opt/python/2.7.9/lib/python2.7/site-packages/torch/multiprocessing/reductions.py", line 274, in rebuild_storage_fd
Apr 29 19:51:58     fd = multiprocessing.reduction.rebuild_handle(df)
Apr 29 19:51:58   File "/opt/python/2.7.9/lib/python2.7/multiprocessing/reduction.py", line 155, in rebuild_handle
Apr 29 19:51:58     conn = Client(address, authkey=current_process().authkey)
Apr 29 19:51:58   File "/opt/python/2.7.9/lib/python2.7/multiprocessing/connection.py", line 169, in Client
Apr 29 19:51:58     c = SocketClient(address)
Apr 29 19:51:58   File "/opt/python/2.7.9/lib/python2.7/multiprocessing/connection.py", line 304, in SocketClient
Apr 29 19:51:58     s.connect(address)
Apr 29 19:51:58   File "/opt/python/2.7.9/lib/python2.7/socket.py", line 224, in meth
Apr 29 19:51:58     return getattr(self._sock,name)(*args)
Apr 29 19:51:58 error: [Errno 111] Connection refused

@pytorchbot pytorchbot added the module: dataloader Related to torch.utils.data.DataLoader and Sampler label May 2, 2019
@nairbv nairbv requested a review from ezyang May 2, 2019 15:37
@ssnl
Copy link
Collaborator

ssnl commented May 2, 2019

I'd rather we don't skip it because

  1. (mainly) CI log is the only place we can debug this.
  2. It is really a lot more reliable now.

@nairbv
Copy link
Collaborator Author

nairbv commented May 2, 2019

  1. It is really a lot more reliable now.

I think I've been seeing it fail at least once per day just in my own PRs. Unless I'm doing something wrong, that seems insufficiently reliable.

Here's another from today:

https://circleci.com/gh/pytorch/pytorch/1556949?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link

@ssnl
Copy link
Collaborator

ssnl commented May 2, 2019

Okay, are they all in py2? Can you only disable on py2?

@nairbv
Copy link
Collaborator Author

nairbv commented May 3, 2019

Okay, are they all in py2? Can you only disable on py2?

Will try running just in python3 and seeing how that goes

@ssnl
Copy link
Collaborator

ssnl commented May 3, 2019

Thanks! BTW, trying to reproduce locally have been a huge problem for me. So if you are trying to do that, be aware that it may just not fail.

@nairbv
Copy link
Collaborator Author

nairbv commented May 5, 2019

@pytorchbot retest this please

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nairbv is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@nairbv merged this pull request in 9005a2c.

facebook-github-bot pushed a commit that referenced this pull request May 8, 2019
…r_exit (#20172)

Summary:
cc nairbv
All failures I have seen are of this combination. So let's just disable it for all cases. After #20063 I find it failing for py3 once.
Pull Request resolved: #20172

Differential Revision: D15266527

Pulled By: nairbv

fbshipit-source-id: afb9389dfc54a0878d52975ffa37a0fd2aa3a735
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Merged module: dataloader Related to torch.utils.data.DataLoader and Sampler

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants