@rohan-varma rohan-varma commented Apr 6, 2020

#30330 added support for aborting the RecvWork created by recvAnysource. That call only receives the preamble for the tensor, though; a separate call to pg_->recv() then receives the tensor itself over the wire. This PR makes that second call abortable from ::shutdown() as well, which can be used to avoid hangs during ungraceful shutdown.

Added an internal test case in ProcessGroupAgentTest to ensure that an appropriate error message is raised when this happens.
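
The control flow described above can be modeled roughly as follows. This is a Python sketch of the idea, not the C++ in `ProcessGroupAgent`: `FakeWork`, `Listener`, `listen_once`, and `demo` are hypothetical names, and the only assumption carried over from the description is that a listener blocks first on a preamble receive and then on a payload receive, and that shutdown must be able to abort whichever one is pending.

```python
import threading


class AbortedError(RuntimeError):
    """Raised by wait() when the pending receive was aborted."""


class FakeWork:
    """Hypothetical stand-in for a pending receive (not the real RecvWork)."""

    def __init__(self):
        self._done = threading.Event()
        self._aborted = False
        self._result = None

    def complete(self, result):
        self._result = result
        self._done.set()

    def abort(self):
        self._aborted = True
        self._done.set()

    def wait(self):
        # Blocks until the receive completes or is aborted.
        self._done.wait()
        if self._aborted:
            raise AbortedError("receive aborted")
        return self._result


class Listener:
    """Models a two-phase receive: preamble first, then the tensor payload."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending = None  # whichever receive is currently blocking
        self._shutdown = False
        self.error = None

    def _start(self, work):
        # Publish the pending work under the lock so shutdown() can reach it.
        # If shutdown already ran, abort right away instead of blocking.
        with self._lock:
            if self._shutdown:
                work.abort()
            self._pending = work
        return work

    def listen_once(self, preamble_work, payload_work):
        try:
            size = self._start(preamble_work).wait()  # phase 1: preamble
            data = self._start(payload_work).wait()   # phase 2: payload
            return size, data
        except AbortedError as exc:
            self.error = str(exc)
            return None

    def shutdown(self):
        # Abort whichever phase is pending, not only the preamble receive.
        with self._lock:
            self._shutdown = True
            if self._pending is not None:
                self._pending.abort()


def demo():
    """The preamble arrives, the payload never does, and shutdown unblocks the listener."""
    listener = Listener()
    preamble, payload = FakeWork(), FakeWork()
    thread = threading.Thread(target=listener.listen_once, args=(preamble, payload))
    thread.start()
    preamble.complete(4)
    listener.shutdown()
    thread.join(timeout=5)
    return thread.is_alive(), listener.error
```

If `shutdown()` only aborted the preamble receive, a listener that had already moved on to the payload receive would stay blocked; tracking the pending work for both phases is what lets the thread exit in this model.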

Differential Revision: D20632764

NOTE FOR REVIEWERS: This PR has internal Facebook-specific changes or comments; please review them on Phabricator!

rohan-varma added a commit that referenced this pull request Apr 6, 2020
…ssGroupAgent::listenLoop

Pull Request resolved: #36084

ghstack-source-id: 101604969

dr-ci bot commented Apr 6, 2020

💊 CircleCI build failures summary and remediations

As of commit 7df8436 (more details on the Dr. CI page):


  • 2/3 failures introduced in this PR

  • 1/3 broken upstream at merge base 7e84a30 from Apr 06 until Apr 07 (9 commits; 2e8f954 - 444073e)

    Please rebase on the viable/strict branch.

    If your commit is newer than viable/strict, you can try basing on an older, stable commit:

    git fetch https://github.com/pytorch/pytorch viable/strict
    git rebase --onto FETCH_HEAD $(git merge-base origin/master HEAD)
    

    If your commit is older than viable/strict:

    git fetch https://github.com/pytorch/pytorch viable/strict
    git rebase FETCH_HEAD
    

    Check out the recency history of this "viable master" tracking branch.


🕵️ 2 new failures recognized by patterns

The following build failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_xla_linux_xenial_py3_6_clang7_build (1/2)

Step: "Build"

Apr 07 19:15:07 torch_xla/csrc/aten_xla_type_default.cpp:6:10: fatal error: 'ATen/CPUGeneratorImpl.h' file not found
Apr 07 19:15:07 torch_xla/csrc/aten_xla_type_default.cpp:6:10: fatal error: 'ATen/CPUGeneratorImpl.h' file not found 
Apr 07 19:15:07 #include <ATen/CPUGeneratorImpl.h> 
Apr 07 19:15:07          ^~~~~~~~~~~~~~~~~~~~~~~~~ 
Apr 07 19:15:11 1 error generated. 

See CircleCI build pytorch_cpp_doc_push (2/2)

Step: "Doc Build and Push"

Apr 07 19:24:40 AttributeError: module 'sphinx.domains.c' has no attribute 'CMacroObject'
Apr 07 19:24:38     Found existing installation: Sphinx 3.0.0 
Apr 07 19:24:38     Uninstalling Sphinx-3.0.0: 
Apr 07 19:24:38       Successfully uninstalled Sphinx-3.0.0 
Apr 07 19:24:39 Successfully installed sphinx-2.4.4 
Apr 07 19:24:39 ++ make VERBOSE=1 html -j 
Apr 07 19:24:40 Running Sphinx v2.4.4 
Apr 07 19:24:40  
Apr 07 19:24:40 Exception occurred: 
Apr 07 19:24:40   File "/opt/conda/lib/python3.6/site-packages/breathe/renderer/sphinxrenderer.py", line 90, in DomainDirectiveFactory 
Apr 07 19:24:40     'define': (c.CMacroObject, 'macro'), 
Apr 07 19:24:40 AttributeError: module 'sphinx.domains.c' has no attribute 'CMacroObject' 
Apr 07 19:24:40 The full traceback has been saved in /tmp/sphinx-err-q_hw2_0r.log, if you want to report the issue to the developers. 
Apr 07 19:24:40 Please also report this if it was a user error, so that a better error message can be provided next time. 
Apr 07 19:24:40 A bug report can be filed in the tracker at <https://github.com/sphinx-doc/sphinx/issues>. Thanks! 
Apr 07 19:24:40 make: *** [html] Error 2 
Apr 07 19:24:40 Makefile:27: recipe for target 'html' failed 

🚧 1 upstream failure:

These were probably caused by upstream breakages:



@rohan-varma rohan-varma requested a review from jjlilley April 6, 2020 19:00
rohan-varma added a commit that referenced this pull request Apr 7, 2020
…ssGroupAgent::listenLoop

Pull Request resolved: #36084

ghstack-source-id: 101645227
rohan-varma added a commit that referenced this pull request Apr 7, 2020
…ssGroupAgent::listenLoop

Pull Request resolved: #36084

ghstack-source-id: 101689402
@facebook-github-bot

This pull request has been merged in fc5d658.

@facebook-github-bot facebook-github-bot deleted the gh/rohan-varma/107/head branch April 11, 2020 14:16
ashishfarmer pushed a commit to ashishfarmer/pytorch that referenced this pull request Apr 13, 2020
…ssGroupAgent::listenLoop (pytorch#36084)

Summary:
Pull Request resolved: pytorch#36084

pytorch#30330 added support to abort the call to a `RecvWork` created by `recvAnysource`, but there is an additional call to `pg_->recv()` to actually get the tensor sent over the wire (the previous call is the preamble for the tensor). This adds support to be able to abort this call as well in `::shutdown()`, which can be used to avoid hangs during ungraceful shutdown.

Added an internal test case in `ProcessGroupAgentTest` to ensure that an appropriate error message is raised when this happens.
ghstack-source-id: 101689402

Test Plan:
Added a test in `ProcessGroupAgentTest`. We also added a basic config in `FailingWaitProcessGroupGloo` that lets us control whether to abort the calls to `pg->recv()` and `pg->recvAnysource()`.
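
As a rough illustration of what such a config could look like, here is a minimal Python sketch of a test double with one switch per receive call. Everything in it is hypothetical: `AbortConfig`, `FakeProcessGroup`, `FakeRecvWork`, and `error_for` are invented for this example and are not the fields or API of the internal `FailingWaitProcessGroupGloo`, which this PR does not show.

```python
from dataclasses import dataclass


@dataclass
class AbortConfig:
    """Hypothetical switches; the real config's fields are not shown in this PR."""
    abort_recv: bool = False
    abort_recv_anysource: bool = False


class FakeRecvWork:
    """Pending receive whose wait() fails when the config asks for an abort."""

    def __init__(self, should_fail, label):
        self._should_fail = should_fail
        self._label = label

    def wait(self):
        if self._should_fail:
            raise RuntimeError(f"{self._label} aborted")
        return True


class FakeProcessGroup:
    """Test double that hands out works according to the config."""

    def __init__(self, config):
        self._config = config

    def recv(self):
        return FakeRecvWork(self._config.abort_recv, "recv")

    def recv_anysource(self):
        return FakeRecvWork(self._config.abort_recv_anysource, "recvAnysource")


def error_for(config):
    """Run one preamble + payload receive and return the error message, if any."""
    pg = FakeProcessGroup(config)
    try:
        pg.recv_anysource().wait()  # preamble
        pg.recv().wait()            # tensor payload
    except RuntimeError as exc:
        return str(exc)
    return None
```

Separate switches let a test exercise each failure path on its own, for example checking the error message when only the payload receive is aborted.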

Run test binary:
```
buck build mode/dev-nosan //caffe2/torch/fb/distributed/thriftRpcBackend/test:ProcessGroupAgentTest --keep-going
~/fbcode/buck-out/gen/caffe2/torch/fb/distributed/thriftRpcBackend/test/ProcessGroupAgentTest
```
P128567144

Differential Revision: D20632764

fbshipit-source-id: c0b3c391fd3e0ae711661ad99f309ee4d93f6582