@mruberry (Collaborator) commented Jul 24, 2018

This PR extends the existing type and shape metadata tracing and verification done in autograd with device information. This expansion of tracing is required for #8354, is likely useful in other scenarios, and is a healthy sanity check, just like type and shape tracing.

The precise changes are:

  • TypeAndShape is renamed to InputMetadata and now also includes device()
  • Creating an InputMetadata now requires only a tensor, and callers were updated to use this simpler invocation wherever possible
  • The gradient accumulator of a variable is now reset when set_data() is called if either the type or the device changes, and this reset now takes a lock to avoid contention with acquiring the gradient accumulator
  • Mismatched devices during backward() now throw a runtime error, just as mismatched type and shape do
  • (Bonus!) Two uninitialized pointers in THCReduce are now initialized (to nullptr) to prevent build warnings
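
As a rough illustration of the behavior listed above, here is a minimal pure-Python sketch. It is not the C++ code in this PR: FakeTensor, validate_grad, the from_tensor constructor, and this Variable class are hypothetical stand-ins, and the error message wording is an assumption. It only shows the idea of recording type, shape, and device from a tensor, rejecting a gradient whose device differs, and resetting the accumulator under a lock when set_data() changes type or device.

```python
import threading
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class FakeTensor:
    """Hypothetical stand-in for a tensor; only the checked fields exist."""
    dtype: str
    shape: Tuple[int, ...]
    device: str


@dataclass(frozen=True)
class InputMetadata:
    """Type, shape, and device recorded for one input of a function."""
    dtype: str
    shape: Tuple[int, ...]
    device: str

    @classmethod
    def from_tensor(cls, t: FakeTensor) -> "InputMetadata":
        # Built from the tensor alone, mirroring the simplified construction.
        return cls(t.dtype, t.shape, t.device)


def validate_grad(expected: InputMetadata, grad: FakeTensor) -> None:
    """Raise RuntimeError if grad disagrees with the recorded metadata."""
    actual = InputMetadata.from_tensor(grad)
    for name in ("dtype", "shape", "device"):
        want, got = getattr(expected, name), getattr(actual, name)
        if want != got:
            # Message wording is illustrative, not the real error text.
            raise RuntimeError(
                f"invalid gradient: expected {name} {want} but got {got}"
            )


class Variable:
    """Hypothetical variable whose accumulator depends on dtype and device."""

    def __init__(self, data: FakeTensor) -> None:
        self._lock = threading.Lock()
        self.data = data
        self._accumulator: Optional[object] = None

    def grad_accumulator(self) -> object:
        # Lazily create the accumulator while holding the lock.
        with self._lock:
            if self._accumulator is None:
                self._accumulator = object()
            return self._accumulator

    def set_data(self, new_data: FakeTensor) -> None:
        # The same lock guards the reset, so it cannot interleave with
        # a concurrent grad_accumulator() call.
        with self._lock:
            old = (self.data.dtype, self.data.device)
            new = (new_data.dtype, new_data.device)
            if old != new:
                self._accumulator = None
            self.data = new_data
```

In this sketch a shape-only change keeps the existing accumulator, while a change of dtype or device drops it so that the next grad_accumulator() call builds a fresh one.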

fyi @colesbury

@ezyang (Contributor) commented Jul 25, 2018

@pytorchbot retest this please

@ezyang (Contributor) commented Jul 25, 2018

CI failure looks real, I guess?

14:30:20 /var/lib/jenkins/workspace/torch/csrc/autograd/input_metadata.h:4:35: fatal error: torch/csrc/assertions.h: No such file or directory
14:30:20  #include "torch/csrc/assertions.h"
14:30:20                                    ^
14:30:20 compilation terminated.

@ezyang (Contributor) commented Jul 25, 2018

@pytorchbot retest this please

@ezyang (Contributor) commented Jul 25, 2018

@pytorchbot retest this please

@mruberry (Collaborator, Author) commented

The CI failure was real; I had to merge with master on a file that the web UI did not show. Submitting that earlier commit was my mistake. The current CI failures appear unrelated.

@mruberry (Collaborator, Author) commented

@pytorchbot retest this please.

@mruberry (Collaborator, Author) commented

The failure of all 3 rocm builds is worrying; running one last retest to see if it persists.

@pytorchbot retest this please.

@mruberry (Collaborator, Author) commented

5 failures:

pr/pytorch-linux-trust-pynightly

17:24:33 FATAL: command execution failed
17:24:33 java.nio.channels.ClosedChannelException

Seems unrelated.

pr/py2-clang3.8-rocmnightly-ubuntu16.04

17:42:06 CMake Error at caffe2/CMakeLists.txt:273 (set_target_properties):
17:42:06 set_target_properties called with incorrect number of arguments.

Seems unrelated.

pr/caffe2-py2-gcc5-ubuntu16.04-test

18:01:04 lib/python2.7/dist-packages/caffe2/python/operator_test/fc_operator_test.py::TestFcOperator::test_fc_transposed FAILED [ 80%]
18:01:05
18:01:05 =================================== FAILURES ===================================
18:01:05 ______________________ TestFcOperator.test_fc_transposed _______________________

Seems unrelated.

pr/caffe2-py2-cuda9.1-cudnn7-ubuntu16.04-test

18:07:46 Build timed out (after 45 minutes). Marking the build as failed.
18:07:46 Build was aborted

Seems unrelated.

pr/caffe2-py2-clang3.8-rocmnightly-ubuntu16.04-build

17:43:31 CMake Error at caffe2/CMakeLists.txt:273 (set_target_properties):
17:43:31 set_target_properties called with incorrect number of arguments.

Same issue as prior ROCm build.

@ezyang any idea on these set_target_properties issues?

@ailzhang (Contributor) commented

@pytorchbot retest this please

@ezyang (Contributor) commented Aug 1, 2018

No, it's very puzzling. Even more puzzling because you don't have any cmake changes.

@ezyang (Contributor) commented Aug 1, 2018

Oh, I know! The rocmnightly job is "stale": it is failing, but it's not a real failure; it's just that our CI is stupid and doesn't know to clear the old failures. So I think this PR is good to go.

@facebook-github-bot (Contributor) left a comment

ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ezyang (Contributor) commented Aug 6, 2018

Sorry, this merge conflicted before it could land. I fixed the merge conflict, rerunning tests...

@facebook-github-bot (Contributor) left a comment

ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Aug 7, 2018
Summary:
This PR extends the existing type and shape metadata tracing and verification done in autograd with device information. This expansion of tracing is required for #8354, is likely useful in other scenarios, and is a healthy sanity check, just like type and shape tracing.

The precise changes are:

- TypeAndShape -> InputMetadata, now includes device()
- Creating InputMetadata is simplified to just require a tensor, and callers were updated to use this simpler invocation wherever possible
- The gradient accumulator of a variable is now reset when set_data() is called if either the type or device changes, and this reset now locks to avoid contention with acquiring the gradient accumulator
- Mismatched devices during backward() will throw a runtime error, just like mismatched type and shape
- (Bonus!) Two uninitialized pointers in THCReduce are now initialized (to nullptr) to prevent build warnings

fyi colesbury
Pull Request resolved: pytorch/pytorch#9796

Reviewed By: goldsborough

Differential Revision: D9119325

Pulled By: ezyang

fbshipit-source-id: 76d1861b8d4f74db0575ff1f3bd965e18f9463de
PenghuiCheng pushed a commit to PenghuiCheng/pytorch that referenced this pull request Aug 10, 2018
goodlux pushed a commit to goodlux/pytorch that referenced this pull request Aug 15, 2018
@mruberry deleted the device_tracing branch September 25, 2018 16:40