Conversation

@kostmo (Member) commented Feb 14, 2019

closes #17101

@pjh5 (Contributor) commented Feb 21, 2019

EDIT: I don't think JOB_BASE_NAME is being populated anywhere, but right now the test still runs only because we predicate on -z $JOB_BASE_NAME.

I think this can be fixed by changing https://github.com/pytorch/ossci-job-dsl/blob/master/src/jobs/pytorch.groovy#L1070 to "${buildEnvironment}-test${suffix}".

@kostmo (Member Author) commented Feb 22, 2019

I think that ${JOB_BASE_NAME} must currently still be nonempty. Take a look at the console output of the two "Test and Push" jobs from this PR's build.

In the console log of the pytorch-win-ws2016-cuda9-cudnn7-py3-test1 job, one can find:

22:50:00 + run_tests
22:50:00 + '[' -z pytorch-win-ws2016-cuda9-cudnn7-py3-test1 ']'
22:50:00 + [[ pytorch-win-ws2016-cuda9-cudnn7-py3-test1 == *-test ]]
22:50:00 + [[ pytorch-win-ws2016-cuda9-cudnn7-py3-test1 == *-test1 ]]

Similarly, in the console log of the pytorch-win-ws2016-cuda9-cudnn7-py3-test2 job, one can find:

22:56:45 + [[ pytorch-win-ws2016-cuda9-cudnn7-py3-test2 == *-test2 ]]

@pjh5 (Contributor) commented Feb 22, 2019

Can you tell where JOB_BASE_NAME is being populated from? It shouldn't exist anymore. All info should be encoded into BUILD_ENVIRONMENT.

@peterjc123 (Collaborator) commented Feb 23, 2019

@kostmo Would you please skip DataLoaderTest.ChunkDataSetGetBatch to see if it fixes the timeout issue? It seems that there is a deadlock under this test.
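
For reference, one low-touch way to skip a single GoogleTest case while it is under investigation is gtest's DISABLED_ name prefix (a sketch only, and a different mechanism than the #ifdef discussed later; the body below is a placeholder, not the real ChunkDataSetGetBatch test):

#include <gtest/gtest.h>

// Renaming a test with the DISABLED_ prefix makes gtest skip it by default;
// it can still be run explicitly with --gtest_also_run_disabled_tests.
TEST(DataLoaderTest, DISABLED_ChunkDataSetGetBatch) {
  // Placeholder: the real test exercises batch retrieval from a ChunkDataset.
}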

@peterjc123 (Collaborator)

Any updates on this PR? I believe we should fix this ASAP.

Skip DataLoaderTest.ChunkDataSetGetBatch due to possible deadlock

Closes #17101
ASSERT_TRUE(batch->target[0].allclose(torch::zeros(kBatchSize - 1)));
}

/*
Contributor (inline review comment):
There should be a comment saying why this is commented out, and an issue tracking when we can uncomment it again.

But IIUC, you only wanted to disable this on Windows, right? Then it should be macro'ed.

@kostmo (Member Author) replied:
Are there pre-existing macros for this, or did you just mean to surround with #ifdef?
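
For illustration, surrounding the test with a plain #ifdef could look like this (a sketch assuming the standard _WIN32 macro; a project-specific macro could be used instead, and the guard should reference the tracking issue):

#include <gtest/gtest.h>

// Compile the ChunkDataset test only on non-Windows platforms until the
// Windows deadlock is resolved; re-enable once the tracking issue is closed.
#ifndef _WIN32
TEST(DataLoaderTest, ChunkDataSetGetBatch) {
  // Original test body goes here unchanged.
}
#endif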

@ezyang (Contributor) left a review:

with comments

@kostmo (Member Author) commented Feb 27, 2019

Looks like disabling that test was not enough to fix it:

18:29:39 test_single_tensor (__main__.TestTensorDataset) ... ok
18:29:39 
18:29:39 ======================================================================
18:29:39 FAIL: test_proper_exit (__main__.TestDataLoader)
18:29:39 There might be ConnectionResetError or leaked semaphore warning (due to dirty process exit), but they are all safe to ignore
18:29:39 ----------------------------------------------------------------------
18:29:39 Traceback (most recent call last):
18:29:39   File "C:\Jenkins\workspace\pytorch-builds\pytorch-win-ws2016-cuda9-cudnn7-py3-test2\test\common_utils.py", line 120, in wrapper
18:29:39     fn(*args, **kwargs)
18:29:39   File "test_dataloader.py", line 757, in test_proper_exit
18:29:39     self.fail(fail_msg + ', and had exception {}'.format(loader_p.exception))
18:29:39 AssertionError: test_proper_exit with use_workers=True, pin_memory=True, hold_iter_reference=True, exit_method=None: loader process failed to setup within given time, and had exception Traceback (most recent call last):
18:29:39   File "C:\Jenkins\workspace\pytorch-builds\pytorch-win-ws2016-cuda9-cudnn7-py3-test2\test\test_dataloader.py", line 175, in run
18:29:39     super(ErrorTrackingProcess, self).run()
18:29:39   File "C:\Jenkins\Miniconda3\lib\multiprocessing\process.py", line 93, in run
18:29:39     self._target(*self._args, **self._kwargs)
18:29:39   File "C:\Jenkins\workspace\pytorch-builds\pytorch-win-ws2016-cuda9-cudnn7-py3-test2\test\test_dataloader.py", line 354, in _test_proper_exit
18:29:39     for i, _ in enumerate(it):
18:29:39   File "C:\Jenkins\workspace\pytorch-builds\pytorch-win-ws2016-cuda9-cudnn7-py3-test2\build\win_tmp\build\torch\utils\data\dataloader.py", line 545, in __next__
18:29:39     idx, batch = self._get_batch()
18:29:39   File "C:\Jenkins\workspace\pytorch-builds\pytorch-win-ws2016-cuda9-cudnn7-py3-test2\build\win_tmp\build\torch\utils\data\dataloader.py", line 517, in _get_batch
18:29:39     raise RuntimeError('Pin memory thread exited unexpectedly')
18:29:39 RuntimeError: Pin memory thread exited unexpectedly
18:29:39 
18:29:39 
18:29:39 ----------------------------------------------------------------------
18:29:39 Ran 51 tests in 73.087s
18:29:39 
18:29:39 FAILED (failures=1, skipped=2)
18:29:39 Traceback (most recent call last):
18:29:39   File "run_test.py", line 458, in <module>
18:29:39     main()
18:29:39   File "run_test.py", line 450, in main
18:29:39     raise RuntimeError(message)
18:29:39 RuntimeError: test_dataloader failed!

@peterjc123 (Collaborator)

@pytorchbot retest this please

@peterjc123 (Collaborator) commented Feb 28, 2019

@kostmo Looks like test_proper_exit is just a flaky test. It now gets stuck at DataLoaderTest.ChunkDataSetWithEmptyBatch instead. I think the ChunkDataSet implementation doesn't work well on Windows and causes a deadlock there. Could you please put #ifdef guards around all of the ChunkDataSet tests?

@peterjc123 (Collaborator)

Opened a new issue for the deadlock: #17609.

@peterjc123 (Collaborator)

@pytorchbot rebase this please

@peterjc123 (Collaborator)

@pytorchbot rebase this please

@peterjc123 (Collaborator)

@pytorchbot merge this please

@pytorchbot added the merge-this-please label on Mar 12, 2019
@facebook-github-bot (Contributor) left a comment:
@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

petrex pushed a commit to petrex/pytorch that referenced this pull request Mar 14, 2019
* upstream/master: (87 commits)
  Make Variable::set_data non-const; cosmetic fixes.
  remove warning for upsample code (pytorch#17921)
  Optimize TileOp (pytorch#17290)
  Optimize channel_stats_op (pytorch#16243)
  enable shape inference for elementwise operators (pytorch#17885)
  Remove remaining test jit expects redux (pytorch#17924)
  Handle Scalars Better (pytorch#17875)
  Fixed a formatting issue in doc comments (pytorch#17505)
  Add nbytes, itemsize, element_size to at::Tensor. (pytorch#17810)
  Fix lint in test_distributions.py
  Fix lint in test_jit.py
  Fix lint errors in test_autograd
  Added a few extra python bindings to help with walking the IR graph from Python (pytorch#17822)
  kthvalue consistency with sort in the presence of NaN (pytorch#17824)
  Fix minor grammatical mistakes in torch/nn/modules/loss.py (pytorch#17892)
  Remove (almost all) TensorOptions from native_functions.yaml (pytorch#17385)
  Restore full Windows tests (pytorch#17102)
  Prevent VS2017 from emitting ambiguous symbol errors (second time)
  Fix windows test hang (pytorch#17778)
  torch.btrifact for tensors with greater than 3 dimensions (pytorch#14964)
  ...
@ezyang added the merged label on Jun 25, 2019