-
Notifications
You must be signed in to change notification settings - Fork 26.3k
Change type of timeoutFutures_ key to time_point instead of duration #31078
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Make `ProcessGroupAgent::pollTimedOutRPCs` code more conventional. - Use `std::chrono::time_point` to represent `endTime` instead of `std::chrono::duration`. - Replace `std::condition_variable::wait_for(lock, entTime)` with `std::condition_variable::wait_until(lock, entTime)`. - Remove the unnecessary `::getRPCRemainingTime()`. Differential Revision: [D5705442](https://our.internmc.facebook.com/intern/diff/D5705442/) [ghstack-poisoned]
| using steady_clock_time_point = | ||
| std::chrono::time_point<std::chrono::steady_clock>; | ||
|
|
||
| static constexpr steady_clock_time_point kInfiniteTimeoutTimePoint = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this go in rpc_agent.h so that other RpcAgents can have access to them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so. This watch dog thread and using a const time to represent infinity are all specific to ProcessGroupAgent only.
Notice this is not the "Default RPC timeout".
| futureTimeoutCV_.wait(lock); | ||
| } else { | ||
| futureTimeoutCV_.wait_for(lock, sleepTime); | ||
| futureTimeoutCV_.wait_until(lock, minEndTime); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can get rid of the sleepTime variable now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. Let me update.
mrshenli
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Please get a stamp from @rohan-varma as well.
There is typo in the PR description entTime -> endTime. Please make sure the final commit message has the correct spelling. Thanks!
…f duration" Make `ProcessGroupAgent::pollTimedOutRPCs` code more conventional. - Use `std::chrono::time_point` to represent `endTime` instead of `std::chrono::duration`. - Replace `std::condition_variable::wait_for(lock, entTime)` with `std::condition_variable::wait_until(lock, entTime)`. - Remove the unnecessary `::getRPCRemainingTime()`. Differential Revision: [D5705442](https://our.internmc.facebook.com/intern/diff/D5705442/) [ghstack-poisoned]
|
@mrshenli Thanks! updated PR description. |
…f duration" Make `ProcessGroupAgent::pollTimedOutRPCs` code more conventional. - Use `std::chrono::time_point` to represent `endTime` instead of `std::chrono::duration`. - Replace `std::condition_variable::wait_for(lock, endTime)` with `std::condition_variable::wait_until(lock, endTime)`. - Remove the unnecessary `::getRPCRemainingTime()`. Differential Revision: [D5705442](https://our.internmc.facebook.com/intern/diff/D5705442/) [ghstack-poisoned]
…f duration" Make `ProcessGroupAgent::pollTimedOutRPCs` code more conventional. - Use `std::chrono::time_point` to represent `endTime` instead of `std::chrono::duration`. - Replace `std::condition_variable::wait_for(lock, endTime)` with `std::condition_variable::wait_until(lock, endTime)`. - Remove the unnecessary `::getRPCRemainingTime()`. Differential Revision: [D5705442](https://our.internmc.facebook.com/intern/diff/D5705442/) [ghstack-poisoned]
Pull Request resolved: #31078 Make `ProcessGroupAgent::pollTimedOutRPCs` code more conventional. - Use `std::chrono::time_point` to represent `endTime` instead of `std::chrono::duration`. - Replace `std::condition_variable::wait_for(lock, endTime)` with `std::condition_variable::wait_until(lock, endTime)`. - Remove the unnecessary `::getRPCRemainingTime()`. ghstack-source-id: 95316304 Differential Revision: [D5705442](https://our.internmc.facebook.com/intern/diff/D5705442/)
rohan-varma
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, please wait for tests to pass.
…f duration" Make `ProcessGroupAgent::pollTimedOutRPCs` code more conventional. - Use `std::chrono::time_point` to represent `endTime` instead of `std::chrono::duration`. - Replace `std::condition_variable::wait_for(lock, endTime)` with `std::condition_variable::wait_until(lock, endTime)`. - Remove the unnecessary `::getRPCRemainingTime()`. Differential Revision: [D5705442](https://our.internmc.facebook.com/intern/diff/D5705442/) [ghstack-poisoned]
| using steady_clock_time_point = | ||
| std::chrono::time_point<std::chrono::steady_clock>; | ||
|
|
||
| constexpr steady_clock_time_point kInfiniteTimeoutTimePoint = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do you need this in the .h?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jjlilley Wanted to put it in the class declaration.
Then I got this error in macos test
https://circleci.com/gh/pytorch/pytorch/3890383?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link
Dec 11 01:39:38 + python -c 'import torch'
Dec 11 01:39:38 Traceback (most recent call last):
Dec 11 01:39:38 File "<string>", line 1, in <module>
Dec 11 01:39:38 File "/Users/distiller/project/miniconda/envs/wheel_py36/lib/python3.6/site-packages/torch/__init__.py", line 81, in <module>
Dec 11 01:39:38 from torch._C import *
Dec 11 01:39:38 ImportError: dlopen(/Users/distiller/project/miniconda/envs/wheel_py36/lib/python3.6/site-packages/torch/_C.cpython-36m-darwin.so, 9): Symbol not found: __ZN5torch11distributed3rpc17ProcessGroupAgent25kInfiniteTimeoutTimePointE
Dec 11 01:39:38 Referenced from: /Users/distiller/project/miniconda/envs/wheel_py36/lib/python3.6/site-packages/torch/lib/libtorch_python.dylib
Dec 11 01:39:38 Expected in: flat namespace
Dec 11 01:39:38 in /Users/distiller/project/miniconda/envs/wheel_py36/lib/python3.6/site-packages/torch/lib/libtorch_python.dylib
Let me put it into process_group_agent.cpp
| namespace distributed { | ||
| namespace rpc { | ||
|
|
||
| using steady_clock_time_point = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can this be a class-scope using, rather than exposed throughout all distributed::rpc scope?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jjlilley Let me put it back under ProcessGroupAgent class declaration.
…f duration" Make `ProcessGroupAgent::pollTimedOutRPCs` code more conventional. - Use `std::chrono::time_point` to represent `endTime` instead of `std::chrono::duration`. - Replace `std::condition_variable::wait_for(lock, endTime)` with `std::condition_variable::wait_until(lock, endTime)`. - Remove the unnecessary `::getRPCRemainingTime()`. Differential Revision: [D5705442](https://our.internmc.facebook.com/intern/diff/D5705442/) [ghstack-poisoned]
jjlilley
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, pending circleci, thanks!
…f duration" Make `ProcessGroupAgent::pollTimedOutRPCs` code more conventional. - Use `std::chrono::time_point` to represent `endTime` instead of `std::chrono::duration`. - Replace `std::condition_variable::wait_for(lock, endTime)` with `std::condition_variable::wait_until(lock, endTime)`. - Remove the unnecessary `::getRPCRemainingTime()`. Differential Revision: [D5705442](https://our.internmc.facebook.com/intern/diff/D5705442/) [ghstack-poisoned]
…f duration" Make `ProcessGroupAgent::pollTimedOutRPCs` code more conventional. - Use `std::chrono::time_point` to represent `endTime` instead of `std::chrono::duration`. - Replace `std::condition_variable::wait_for(lock, endTime)` with `std::condition_variable::wait_until(lock, endTime)`. - Remove the unnecessary `::getRPCRemainingTime()`. Differential Revision: [D5705442](https://our.internmc.facebook.com/intern/diff/D5705442/) [ghstack-poisoned]
…f duration" Make `ProcessGroupAgent::pollTimedOutRPCs` code more conventional. - Use `std::chrono::time_point` to represent `endTime` instead of `std::chrono::duration`. - Replace `std::condition_variable::wait_for(lock, endTime)` with `std::condition_variable::wait_until(lock, endTime)`. - Remove the unnecessary `::getRPCRemainingTime()`. Differential Revision: [D5705442](https://our.internmc.facebook.com/intern/diff/D5705442/) [ghstack-poisoned]
Pull Request resolved: #31078 Make `ProcessGroupAgent::pollTimedOutRPCs` code more conventional. - Use `std::chrono::time_point` to represent `endTime` instead of `std::chrono::duration`. - Replace `std::condition_variable::wait_for(lock, endTime)` with `std::condition_variable::wait_until(lock, endTime)`. - Remove the unnecessary `::getRPCRemainingTime()`. ghstack-source-id: 95408482 Differential Revision: [D5705442](https://our.internmc.facebook.com/intern/diff/D5705442/)
CircleCI build failures summaryAs of commit 20c3823:
Detailed failure analysisOne may explore the probable reasons each build failed interactively on the Dr. CI website. 1 new failure recognized by patternsThe following build failures don't appear to be due to upstream breakage:
|
|
This pull request has been merged in a3ed350. |
…ytorch#31078) Summary: Pull Request resolved: pytorch#31078 Make `ProcessGroupAgent::pollTimedOutRPCs` code more conventional. - Use `std::chrono::time_point` to represent `endTime` instead of `std::chrono::duration`. - Replace `std::condition_variable::wait_for(lock, endTime)` with `std::condition_variable::wait_until(lock, endTime)`. - Remove the unnecessary `::getRPCRemainingTime()`. ghstack-source-id: 95408482 Test Plan: # Unit tests ``` buck test mode/dev-nosan //caffe2/test:rpc_fork -- test_rpc_timeouts buck-out/gen/caffe2/test/rpc_fork\#binary.par -r test_rpc_timeouts ``` ``` buck test mode/dev-nosan //caffe2/test:rpc_fork_thrift -- test_rpc_timeouts buck-out/gen/caffe2/test/rpc_fork_thrift\#binary.par -r test_rpc_timeouts ``` Differential Revision: D5705442 fbshipit-source-id: ba54b7bdb84bc02d05c22360b01290d044bbfcf5
Stack from ghstack:
Make
ProcessGroupAgent::pollTimedOutRPCscode more conventional.std::chrono::time_pointto representendTimeinstead ofstd::chrono::duration.std::condition_variable::wait_for(lock, endTime)withstd::condition_variable::wait_until(lock, endTime).::getRPCRemainingTime().Differential Revision: D5705442