Skip to content

Conversation

@xush6528
Copy link
Contributor

@xush6528 xush6528 commented Dec 10, 2019

Stack from ghstack:

Make ProcessGroupAgent::pollTimedOutRPCs code more conventional.

  • Use std::chrono::time_point to represent endTime instead of std::chrono::duration.
  • Replace std::condition_variable::wait_for(lock, endTime) with std::condition_variable::wait_until(lock, endTime).
  • Remove the unnecessary ::getRPCRemainingTime().

Differential Revision: D5705442

Make `ProcessGroupAgent::pollTimedOutRPCs` code more conventional.

- Use `std::chrono::time_point` to represent `endTime` instead of `std::chrono::duration`.
- Replace `std::condition_variable::wait_for(lock, entTime)` with `std::condition_variable::wait_until(lock, entTime)`.
- Remove the unnecessary `::getRPCRemainingTime()`.

Differential Revision: [D5705442](https://our.internmc.facebook.com/intern/diff/D5705442/)

[ghstack-poisoned]
using steady_clock_time_point =
std::chrono::time_point<std::chrono::steady_clock>;

static constexpr steady_clock_time_point kInfiniteTimeoutTimePoint =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this go in rpc_agent.h so that other RpcAgents can have access to them?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so. This watch dog thread and using a const time to represent infinity are all specific to ProcessGroupAgent only.

Notice this is not the "Default RPC timeout".

futureTimeoutCV_.wait(lock);
} else {
futureTimeoutCV_.wait_for(lock, sleepTime);
futureTimeoutCV_.wait_until(lock, minEndTime);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can get rid of the sleepTime variable now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. Let me update.

Copy link
Contributor

@mrshenli mrshenli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Please get a stamp from @rohan-varma as well.

There is typo in the PR description entTime -> endTime. Please make sure the final commit message has the correct spelling. Thanks!

…f duration"

Make `ProcessGroupAgent::pollTimedOutRPCs` code more conventional.

- Use `std::chrono::time_point` to represent `endTime` instead of `std::chrono::duration`.
- Replace `std::condition_variable::wait_for(lock, entTime)` with `std::condition_variable::wait_until(lock, entTime)`.
- Remove the unnecessary `::getRPCRemainingTime()`.

Differential Revision: [D5705442](https://our.internmc.facebook.com/intern/diff/D5705442/)

[ghstack-poisoned]
@xush6528
Copy link
Contributor Author

@mrshenli Thanks! updated PR description.

…f duration"


Make `ProcessGroupAgent::pollTimedOutRPCs` code more conventional.

- Use `std::chrono::time_point` to represent `endTime` instead of `std::chrono::duration`.
- Replace `std::condition_variable::wait_for(lock, endTime)` with `std::condition_variable::wait_until(lock, endTime)`.
- Remove the unnecessary `::getRPCRemainingTime()`.

Differential Revision: [D5705442](https://our.internmc.facebook.com/intern/diff/D5705442/)

[ghstack-poisoned]
…f duration"


Make `ProcessGroupAgent::pollTimedOutRPCs` code more conventional.

- Use `std::chrono::time_point` to represent `endTime` instead of `std::chrono::duration`.
- Replace `std::condition_variable::wait_for(lock, endTime)` with `std::condition_variable::wait_until(lock, endTime)`.
- Remove the unnecessary `::getRPCRemainingTime()`.

Differential Revision: [D5705442](https://our.internmc.facebook.com/intern/diff/D5705442/)

[ghstack-poisoned]
xush6528 added a commit that referenced this pull request Dec 10, 2019
Pull Request resolved: #31078

Make `ProcessGroupAgent::pollTimedOutRPCs` code more conventional.

- Use `std::chrono::time_point` to represent `endTime` instead of `std::chrono::duration`.
- Replace `std::condition_variable::wait_for(lock, endTime)` with `std::condition_variable::wait_until(lock, endTime)`.
- Remove the unnecessary `::getRPCRemainingTime()`.
ghstack-source-id: 95316304

Differential Revision: [D5705442](https://our.internmc.facebook.com/intern/diff/D5705442/)
@rohan-varma rohan-varma self-requested a review December 10, 2019 23:15
Copy link
Contributor

@rohan-varma rohan-varma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, please wait for tests to pass.

…f duration"


Make `ProcessGroupAgent::pollTimedOutRPCs` code more conventional.

- Use `std::chrono::time_point` to represent `endTime` instead of `std::chrono::duration`.
- Replace `std::condition_variable::wait_for(lock, endTime)` with `std::condition_variable::wait_until(lock, endTime)`.
- Remove the unnecessary `::getRPCRemainingTime()`.

Differential Revision: [D5705442](https://our.internmc.facebook.com/intern/diff/D5705442/)

[ghstack-poisoned]
using steady_clock_time_point =
std::chrono::time_point<std::chrono::steady_clock>;

constexpr steady_clock_time_point kInfiniteTimeoutTimePoint =

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do you need this in the .h?

Copy link
Contributor Author

@xush6528 xush6528 Dec 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jjlilley Wanted to put it in the class declaration.

Then I got this error in macos test
https://circleci.com/gh/pytorch/pytorch/3890383?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link

Dec 11 01:39:38 + python -c 'import torch'
Dec 11 01:39:38 Traceback (most recent call last):
Dec 11 01:39:38   File "<string>", line 1, in <module>
Dec 11 01:39:38   File "/Users/distiller/project/miniconda/envs/wheel_py36/lib/python3.6/site-packages/torch/__init__.py", line 81, in <module>
Dec 11 01:39:38     from torch._C import *
Dec 11 01:39:38 ImportError: dlopen(/Users/distiller/project/miniconda/envs/wheel_py36/lib/python3.6/site-packages/torch/_C.cpython-36m-darwin.so, 9): Symbol not found: __ZN5torch11distributed3rpc17ProcessGroupAgent25kInfiniteTimeoutTimePointE
Dec 11 01:39:38   Referenced from: /Users/distiller/project/miniconda/envs/wheel_py36/lib/python3.6/site-packages/torch/lib/libtorch_python.dylib
Dec 11 01:39:38   Expected in: flat namespace
Dec 11 01:39:38  in /Users/distiller/project/miniconda/envs/wheel_py36/lib/python3.6/site-packages/torch/lib/libtorch_python.dylib

Let me put it into process_group_agent.cpp

namespace distributed {
namespace rpc {

using steady_clock_time_point =

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this be a class-scope using, rather than exposed throughout all distributed::rpc scope?

Copy link
Contributor Author

@xush6528 xush6528 Dec 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jjlilley Let me put it back under ProcessGroupAgent class declaration.

…f duration"


Make `ProcessGroupAgent::pollTimedOutRPCs` code more conventional.

- Use `std::chrono::time_point` to represent `endTime` instead of `std::chrono::duration`.
- Replace `std::condition_variable::wait_for(lock, endTime)` with `std::condition_variable::wait_until(lock, endTime)`.
- Remove the unnecessary `::getRPCRemainingTime()`.

Differential Revision: [D5705442](https://our.internmc.facebook.com/intern/diff/D5705442/)

[ghstack-poisoned]
Copy link

@jjlilley jjlilley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, pending circleci, thanks!

…f duration"


Make `ProcessGroupAgent::pollTimedOutRPCs` code more conventional.

- Use `std::chrono::time_point` to represent `endTime` instead of `std::chrono::duration`.
- Replace `std::condition_variable::wait_for(lock, endTime)` with `std::condition_variable::wait_until(lock, endTime)`.
- Remove the unnecessary `::getRPCRemainingTime()`.

Differential Revision: [D5705442](https://our.internmc.facebook.com/intern/diff/D5705442/)

[ghstack-poisoned]
…f duration"


Make `ProcessGroupAgent::pollTimedOutRPCs` code more conventional.

- Use `std::chrono::time_point` to represent `endTime` instead of `std::chrono::duration`.
- Replace `std::condition_variable::wait_for(lock, endTime)` with `std::condition_variable::wait_until(lock, endTime)`.
- Remove the unnecessary `::getRPCRemainingTime()`.

Differential Revision: [D5705442](https://our.internmc.facebook.com/intern/diff/D5705442/)

[ghstack-poisoned]
…f duration"


Make `ProcessGroupAgent::pollTimedOutRPCs` code more conventional.

- Use `std::chrono::time_point` to represent `endTime` instead of `std::chrono::duration`.
- Replace `std::condition_variable::wait_for(lock, endTime)` with `std::condition_variable::wait_until(lock, endTime)`.
- Remove the unnecessary `::getRPCRemainingTime()`.

Differential Revision: [D5705442](https://our.internmc.facebook.com/intern/diff/D5705442/)

[ghstack-poisoned]
xush6528 added a commit that referenced this pull request Dec 12, 2019
Pull Request resolved: #31078

Make `ProcessGroupAgent::pollTimedOutRPCs` code more conventional.

- Use `std::chrono::time_point` to represent `endTime` instead of `std::chrono::duration`.
- Replace `std::condition_variable::wait_for(lock, endTime)` with `std::condition_variable::wait_until(lock, endTime)`.
- Remove the unnecessary `::getRPCRemainingTime()`.
ghstack-source-id: 95408482

Differential Revision: [D5705442](https://our.internmc.facebook.com/intern/diff/D5705442/)
@kostmo
Copy link
Member

kostmo commented Dec 12, 2019

CircleCI build failures summary

As of commit 20c3823:

  • 1/2 broken upstream at merge base e5a550c (see grid view)
    • You may want to rebase on the viable/strict branch (see its recency history):
      • If your commit is newer than viable/strict, you can try basing on an older, stable commit:
        git fetch viable/strict
        git rebase --onto viable/strict $(git merge-base origin/master HEAD)
        
      • If your commit is older than viable/strict:
        git fetch viable/strict
        git rebase viable/strict
        
  • 1/2 failures introduced in this PR

Detailed failure analysis

One may explore the probable reasons each build failed interactively on the Dr. CI website.

1 new failure recognized by patterns

The following build failures don't appear to be due to upstream breakage:

See CircleCI build pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-build (1/1)

Step: "pytorch android gradle build" (details)

sys	0m0.159s
++ docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -t -d -w /var/lib/jenkins 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c:405-20c3823db973579c386ceb6bcab53143cf9ccc00-android-x86_32
+ export id_x86_32=40c9aeb6161cf211c89cfc8d390a3bd6c7f3b436127ab1308d9bf866990f493a
+ id_x86_32=40c9aeb6161cf211c89cfc8d390a3bd6c7f3b436127ab1308d9bf866990f493a
+ export 'COMMAND=((echo "source ./workspace/env" && echo "sudo chown -R jenkins workspace") | docker exec -u jenkins -i "$id_x86_32" bash) 2>&1'
+ COMMAND='((echo "source ./workspace/env" && echo "sudo chown -R jenkins workspace") | docker exec -u jenkins -i "$id_x86_32" bash) 2>&1'
+ echo '((echo' '"source' './workspace/env"' '&&' echo '"sudo' chown -R jenkins 'workspace")' '|' docker exec -u jenkins -i '"$id_x86_32"' 'bash)' '2>&1'
+ unbuffer bash ./command.sh
+ ts
+ docker pull 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c:405-20c3823db973579c386ceb6bcab53143cf9ccc00-android-arm-v7a
Error response from daemon: manifest for 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c:405-20c3823db973579c386ceb6bcab53143cf9ccc00-android-arm-v7a not found

1 upstream failure recognized by patterns:

These builds matched patterns, but were probably caused by upstream breakages:


This comment was automatically generated by Dr. CI.
Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions on the GitHub issue tracker.

This comment has been revised 1 time.

@facebook-github-bot
Copy link
Contributor

This pull request has been merged in a3ed350.

@facebook-github-bot facebook-github-bot deleted the gh/xush6528/48/head branch December 15, 2019 15:16
wuhuikx pushed a commit to wuhuikx/pytorch that referenced this pull request Jan 30, 2020
…ytorch#31078)

Summary:
Pull Request resolved: pytorch#31078

Make `ProcessGroupAgent::pollTimedOutRPCs` code more conventional.

- Use `std::chrono::time_point` to represent `endTime` instead of `std::chrono::duration`.
- Replace `std::condition_variable::wait_for(lock, endTime)` with `std::condition_variable::wait_until(lock, endTime)`.
- Remove the unnecessary `::getRPCRemainingTime()`.
ghstack-source-id: 95408482

Test Plan:
# Unit tests

```
buck test mode/dev-nosan //caffe2/test:rpc_fork -- test_rpc_timeouts

buck-out/gen/caffe2/test/rpc_fork\#binary.par -r test_rpc_timeouts
```

```
buck test mode/dev-nosan //caffe2/test:rpc_fork_thrift -- test_rpc_timeouts

buck-out/gen/caffe2/test/rpc_fork_thrift\#binary.par -r test_rpc_timeouts
```

Differential Revision: D5705442

fbshipit-source-id: ba54b7bdb84bc02d05c22360b01290d044bbfcf5
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants