-
Notifications
You must be signed in to change notification settings - Fork 26.3k
[distributed] use timeout in connect function to prevent againstinfinite loop #26364
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…inst infinite loop Per #25769, we sometimes get an infinite loop when `TCPStore` calls `tcputil::connect`, and the server continually returns `ECONNRESET` or `ECONNREFUSED`. If a proper timeout is passed in, we guard against this by throwing an exception once the timeout has passed. Testing: Tested with modifying `TCPStore` to connect to an invalid port, thus getting `ECONNREFUSED`. If a valid timeout is passed in, the function correctly throws an exception. Steps below: 1) in TCPStore.cpp's constructor, replace the `connect` call with this line: `storeSocket_ = tcputil::connect(tcpStoreAddr_, 1, true, std::chrono::milliseconds(3000));` 2) Build the `TCPStoreTest` binary. 3) Run the binary. Expected output: ``` terminate called after throwing an instance of 'std::runtime_error' what(): Connecting to TCP store timed out. Aborted (core dumped) ``` Differential Revision: [D17430164](https://our.internmc.facebook.com/intern/diff/D17430164/) [ghstack-poisoned]
…inst infinite loop Per #25769, we sometimes get an infinite loop when `TCPStore` calls `tcputil::connect`, and the server continually returns `ECONNRESET` or `ECONNREFUSED`. If a proper timeout is passed in, we guard against this by throwing an exception once the timeout has passed. Testing: Tested with modifying `TCPStore` to connect to an invalid port, thus getting `ECONNREFUSED`. If a valid timeout is passed in, the function correctly throws an exception. Steps below: 1) in TCPStore.cpp's constructor, replace the `connect` call with this line: `storeSocket_ = tcputil::connect(tcpStoreAddr_, 1, true, std::chrono::milliseconds(3000));` 2) Build the `TCPStoreTest` binary. 3) Run the binary. Expected output: ``` terminate called after throwing an instance of 'std::runtime_error' what(): Connecting to TCP store timed out. Aborted (core dumped) ``` Differential Revision: [D17430164](https://our.internmc.facebook.com/intern/diff/D17430164/) ghstack-source-id: 90271154 Pull Request resolved: #26364
pietern
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The call to poll uses timeout.count() as its timeout. It should use the time remaining instead, because then it accounts for the time already spent. I'm thinking this should do the trick (I have the std::chrono docs in front of me now anyway):
const auto elapsed = std::chrono::high_resolution_clock::now() - start;
const auto remaining = std::chrono::duration_cast<std::chrono::milliseconds>(timeout) -
std::chrono::duration_cast<std::chrono::milliseconds>(elapsed);
const auto remainingMillis = std::max(0, left.count());| if (elapsedMilliseconds > timeout.count()) { | ||
| throw std::runtime_error("Connecting to TCP store timed out."); | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to take time_since_epoch. You can use the std::chrono::time_point and std::chrono::duration directly without bothering with the unit (the duration class takes care of the comparison correctly if the unit is different):
const auto start = std::chrono::high_resolution_clock::now();
// ...
const auto elapsed = std::chrono::high_resolution_clock::now() - start;
if (elapsed > timeout) {
// ...
}
torch/lib/c10d/Utils.cpp
Outdated
| /* | ||
| if a timeout is specified, check time elapsed to see if we need to | ||
| timeout. A timeout is specified if timeout != kNoTimeout. | ||
| */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment style is // pretty much everywhere. There are some blocks with /* xyz */ but those are typically limited to large chunks of documentation outside of a function. Here you can stick to //.
torch/lib/c10d/Utils.cpp
Outdated
|
|
||
| auto elapsedMilliseconds = nowMilliseconds - startMilliseconds; | ||
| if (elapsedMilliseconds > timeout.count()) { | ||
| throw std::runtime_error("Connecting to TCP store timed out."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a different timeout error thrown above, where poll(2) returns with a timeout. Please consolidate so there is only one message for connect timeout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. The errors are slightly different though, one is because poll(2) timed out on a single attempt, while this error can happen because we keep getting ECONNREFUSED or ECONNRESET after trying many times. Is it worth having different error messages to help with debugging?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's fine to be the same and that multiple add to the confusion here.
…prevent against infinite loop" infinite loop** Per #25769, we sometimes get an infinite loop when `TCPStore` calls `tcputil::connect`, and the server continually returns `ECONNRESET` or `ECONNREFUSED`. If a proper timeout is passed in, we guard against this by throwing an exception once the timeout has passed. Testing: Tested with modifying `TCPStore` to connect to an invalid port, thus getting `ECONNREFUSED`. If a valid timeout is passed in, the function correctly throws an exception. Steps below: 1) in TCPStore.cpp's constructor, replace the `connect` call with this line: `storeSocket_ = tcputil::connect(tcpStoreAddr_, 1, true, std::chrono::milliseconds(3000));` 2) Build the `TCPStoreTest` binary. 3) Run the binary. Expected output: ``` terminate called after throwing an instance of 'std::runtime_error' what(): Connecting to TCP store timed out. Aborted (core dumped) ``` We should probably add tests around this, but there are some issues. Namely, the `connect` call in `TCPStore.cpp` does not pass in a `timeout` param at all right now (https://github.com/pytorch/pytorch/blob/929764ac2ad0d7c26e3defa2d4c4d39f4933a7ef/torch/lib/c10d/TCPStore.cpp#L290). We would need to change it to pass this param in, and then call `serverTCPStore.setTimeout()` in the test. Differential Revision: [D17430164](https://our.internmc.facebook.com/intern/diff/D17430164/) [ghstack-poisoned]
…prevent against infinite loop" infinite loop** infinite loop** Per #25769, we sometimes get an infinite loop when `TCPStore` calls `tcputil::connect`, and the server continually returns `ECONNRESET` or `ECONNREFUSED`. If a proper timeout is passed in, we guard against this by throwing an exception once the timeout has passed. Testing: Tested with modifying `TCPStore` to connect to an invalid port, thus getting `ECONNREFUSED`. If a valid timeout is passed in, the function correctly throws an exception. Steps below: 1) in TCPStore.cpp's constructor, replace the `connect` call with this line: `storeSocket_ = tcputil::connect(tcpStoreAddr_, 1, true, std::chrono::milliseconds(3000));` 2) Build the `TCPStoreTest` binary. 3) Run the binary. Expected output: ``` terminate called after throwing an instance of 'std::runtime_error' what(): Connecting to TCP store timed out. Aborted (core dumped) ``` We should probably add tests around this, but there are some issues. Namely, the `connect` call in `TCPStore.cpp` does not pass in a `timeout` param at all right now (https://github.com/pytorch/pytorch/blob/929764ac2ad0d7c26e3defa2d4c4d39f4933a7ef/torch/lib/c10d/TCPStore.cpp#L290). We would need to change it to pass this param in, and then call `serverTCPStore.setTimeout()` in the test. Differential Revision: [D17430164](https://our.internmc.facebook.com/intern/diff/D17430164/) [ghstack-poisoned]
…prevent against infinite loop" infinite loop** Per #25769, we sometimes get an infinite loop when `TCPStore` calls `tcputil::connect`, and the server continually returns `ECONNRESET` or `ECONNREFUSED`. If a proper timeout is passed in, we guard against this by throwing an exception once the timeout has passed. Testing: Tested with modifying `TCPStore` to connect to an invalid port, thus getting `ECONNREFUSED`. If a valid timeout is passed in, the function correctly throws an exception. Steps below: 1) in TCPStore.cpp's constructor, replace the `connect` call with this line: `storeSocket_ = tcputil::connect(tcpStoreAddr_, 1, true, std::chrono::milliseconds(3000));` 2) Build the `TCPStoreTest` binary. 3) Run the binary. Expected output: ``` terminate called after throwing an instance of 'std::runtime_error' what(): connect() timed out Aborted (core dumped) ``` We should probably add tests around this, but there are some issues. Namely, the `connect` call in `TCPStore.cpp` does not pass in a `timeout` param at all right now (https://github.com/pytorch/pytorch/blob/929764ac2ad0d7c26e3defa2d4c4d39f4933a7ef/torch/lib/c10d/TCPStore.cpp#L290). We would need to change it to pass this param in, and then call `serverTCPStore.setTimeout()` in the test. Differential Revision: [D17430164](https://our.internmc.facebook.com/intern/diff/D17430164/) [ghstack-poisoned]
…prevent against infinite loop" infinite loop** infinite loop** Per #25769, we sometimes get an infinite loop when `TCPStore` calls `tcputil::connect`, and the server continually returns `ECONNRESET` or `ECONNREFUSED`. If a proper timeout is passed in, we guard against this by throwing an exception once the timeout has passed. Testing: Tested with modifying `TCPStore` to connect to an invalid port, thus getting `ECONNREFUSED`. If a valid timeout is passed in, the function correctly throws an exception. Steps below: 1) in TCPStore.cpp's constructor, replace the `connect` call with this line: `storeSocket_ = tcputil::connect(tcpStoreAddr_, 1, true, std::chrono::milliseconds(3000));` 2) Build the `TCPStoreTest` binary. 3) Run the binary. Expected output: ``` terminate called after throwing an instance of 'std::runtime_error' what(): connect() timed out Aborted (core dumped) ``` We should probably add tests around this, but there are some issues. Namely, the `connect` call in `TCPStore.cpp` does not pass in a `timeout` param at all right now (https://github.com/pytorch/pytorch/blob/929764ac2ad0d7c26e3defa2d4c4d39f4933a7ef/torch/lib/c10d/TCPStore.cpp#L290). We would need to change it to pass this param in, and then call `serverTCPStore.setTimeout()` in the test. Differential Revision: [D17430164](https://our.internmc.facebook.com/intern/diff/D17430164/) [ghstack-poisoned]
…inst infinite loop Pull Request resolved: #26364 Per #25769, we sometimes get an infinite loop when `TCPStore` calls `tcputil::connect`, and the server continually returns `ECONNRESET` or `ECONNREFUSED`. If a proper timeout is passed in, we guard against this by throwing an exception once the timeout has passed. Testing: Tested with modifying `TCPStore` to connect to an invalid port, thus getting `ECONNREFUSED`. If a valid timeout is passed in, the function correctly throws an exception. Steps below: 1) in TCPStore.cpp's constructor, replace the `connect` call with this line: `storeSocket_ = tcputil::connect(tcpStoreAddr_, 1, true, std::chrono::milliseconds(3000));` 2) Build the `TCPStoreTest` binary. 3) Run the binary. Expected output: ``` terminate called after throwing an instance of 'std::runtime_error' what(): Connecting to TCP store timed out. Aborted (core dumped) ``` ghstack-source-id: 90359554 Differential Revision: [D17430164](https://our.internmc.facebook.com/intern/diff/D17430164/)
Sounds good, we should be careful to respect if |
…prevent against infinite loop" infinite loop** infinite loop** infinite loop** Per #25769, we sometimes get an infinite loop when `TCPStore` calls `tcputil::connect`, and the server continually returns `ECONNRESET` or `ECONNREFUSED`. If a proper timeout is passed in, we guard against this by throwing an exception once the timeout has passed. Testing: Tested with modifying `TCPStore` to connect to an invalid port, thus getting `ECONNREFUSED`. If a valid timeout is passed in, the function correctly throws an exception. Steps below: 1) in TCPStore.cpp's constructor, replace the `connect` call with this line: `storeSocket_ = tcputil::connect(tcpStoreAddr_, 1, true, std::chrono::milliseconds(3000));` 2) Build the `TCPStoreTest` binary. 3) Run the binary. Expected output: ``` terminate called after throwing an instance of 'std::runtime_error' what(): connect() timed out Aborted (core dumped) ``` We should probably add tests around this, but there are some issues. Namely, the `connect` call in `TCPStore.cpp` does not pass in a `timeout` param at all right now (https://github.com/pytorch/pytorch/blob/929764ac2ad0d7c26e3defa2d4c4d39f4933a7ef/torch/lib/c10d/TCPStore.cpp#L290). We would need to change it to pass this param in, and then call `serverTCPStore.setTimeout()` in the test. Differential Revision: [D17430164](https://our.internmc.facebook.com/intern/diff/D17430164/) [ghstack-poisoned]
…inst infinite loop Pull Request resolved: #26364 Per #25769, we sometimes get an infinite loop when `TCPStore` calls `tcputil::connect`, and the server continually returns `ECONNRESET` or `ECONNREFUSED`. If a proper timeout is passed in, we guard against this by throwing an exception once the timeout has passed. Testing: Tested with modifying `TCPStore` to connect to an invalid port, thus getting `ECONNREFUSED`. If a valid timeout is passed in, the function correctly throws an exception. Steps below: 1) in TCPStore.cpp's constructor, replace the `connect` call with this line: `storeSocket_ = tcputil::connect(tcpStoreAddr_, 1, true, std::chrono::milliseconds(3000));` 2) Build the `TCPStoreTest` binary. 3) Run the binary. Expected output: ``` terminate called after throwing an instance of 'std::runtime_error' what(): Connecting to TCP store timed out. Aborted (core dumped) ``` ghstack-source-id: 90371475 Differential Revision: [D17430164](https://our.internmc.facebook.com/intern/diff/D17430164/)
torch/lib/c10d/Utils.cpp
Outdated
| const auto remaining = | ||
| std::chrono::duration_cast<std::chrono::milliseconds>(timeout) - | ||
| std::chrono::duration_cast<std::chrono::milliseconds>(elapsed); | ||
| pollTimeout = std::max(0, (int)remaining.count()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
have to cast this to int, or else the macOS compiler throws errors. ::poll also would truncate longs into int.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should cast to int64_t (or use 0LL in std::max()), also use static_cast<int64_t> instead of C style casts.
torch/lib/c10d/Utils.cpp
Outdated
| const auto elapsed = | ||
| std::chrono::high_resolution_clock::now() - start; | ||
| if (elapsed > timeout) { | ||
| throw std::runtime_error("connect() timed out"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error msg now reads the same as the one we throw when poll times out. Should we make this a const string somewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, good idea. We like to use kSomeName for globals and this one can be kept local only in an anonymous namespace (equivalent of static global), since it's only used in this file.
|
@pytorchbot retest this please |
torch/lib/c10d/Utils.cpp
Outdated
| pfd.events = POLLOUT; | ||
|
|
||
| int numReady = ::poll(&pfd, 1, timeout.count()); | ||
| int pollTimeout = timeout.count(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It reads a bit weird that you first set pollTimeout and then check the other variable. Could you instead assign pollTimeout = -1 to be clear that this is the default? This makes it clear(er) IMO that it'll have a negative value if timeout == kNoTimeout. As is, we could redefine kNoTimeout to be 0 and not be aware of any harm.
pietern
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor comment regarding robustness. Feel free to merge once that's taken care of. Thanks, @rohan-varma!
torch/lib/c10d/Utils.cpp
Outdated
| pfd.events = POLLOUT; | ||
|
|
||
| int numReady = ::poll(&pfd, 1, timeout.count()); | ||
| int pollTimeout = timeout.count(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Use int64_t since std::chrono::milliseconds can return a value larger than int.
| } | ||
| std::this_thread::sleep_for(std::chrono::seconds(1)); | ||
| anyRefused = false; | ||
| anyReset = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pietern Why do we continue retrying here if we have already exhausted all addresses?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The process that is supposed to start the TCP initialization daemon may not yet have started and then the host will reply with ECONNREFUSED. Somewhat similar, if the daemon has started, but 1000 processes connect at the same time (since they are typically all started at the same time), they may exhaust the listen backlog, and if full, the host replies with a RST packet (hence ECONNRESET). In both cases, we should retry until timeout.
torch/lib/c10d/Utils.cpp
Outdated
| const auto remaining = | ||
| std::chrono::duration_cast<std::chrono::milliseconds>(timeout) - | ||
| std::chrono::duration_cast<std::chrono::milliseconds>(elapsed); | ||
| pollTimeout = std::max(0, (int)remaining.count()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should cast to int64_t (or use 0LL in std::max()), also use static_cast<int64_t> instead of C style casts.
…prevent against infinite loop" infinite loop** Per #25769, we sometimes get an infinite loop when `TCPStore` calls `tcputil::connect`, and the server continually returns `ECONNRESET` or `ECONNREFUSED`. If a proper timeout is passed in, we guard against this by throwing an exception once the timeout has passed. Testing: Tested with modifying `TCPStore` to connect to an invalid port, thus getting `ECONNREFUSED`. If a valid timeout is passed in, the function correctly throws an exception. Steps below: 1) in TCPStore.cpp's constructor, replace the `connect` call with this line: `storeSocket_ = tcputil::connect(tcpStoreAddr_, 1, true, std::chrono::milliseconds(3000));` 2) Build the `TCPStoreTest` binary. 3) Run the binary. Expected output: ``` terminate called after throwing an instance of 'std::runtime_error' what(): connect() timed out Aborted (core dumped) ``` We should probably add tests around this, but there are some issues. Namely, the `connect` call in `TCPStore.cpp` does not pass in a `timeout` param at all right now (https://github.com/pytorch/pytorch/blob/929764ac2ad0d7c26e3defa2d4c4d39f4933a7ef/torch/lib/c10d/TCPStore.cpp#L290). We would need to change it to pass this param in, and then call `serverTCPStore.setTimeout()` in the test. Differential Revision: [D17430164](https://our.internmc.facebook.com/intern/diff/D17430164/) [ghstack-poisoned]
…prevent against infinite loop" infinite loop** Per #25769, we sometimes get an infinite loop when `TCPStore` calls `tcputil::connect`, and the server continually returns `ECONNRESET` or `ECONNREFUSED`. If a proper timeout is passed in, we guard against this by throwing an exception once the timeout has passed. Testing: Tested with modifying `TCPStore` to connect to an invalid port, thus getting `ECONNREFUSED`. If a valid timeout is passed in, the function correctly throws an exception. Steps below: 1) in TCPStore.cpp's constructor, replace the `connect` call with this line: `storeSocket_ = tcputil::connect(tcpStoreAddr_, 1, true, std::chrono::milliseconds(3000));` 2) Build the `TCPStoreTest` binary. 3) Run the binary. Expected output: ``` terminate called after throwing an instance of 'std::runtime_error' what(): connect() timed out Aborted (core dumped) ``` We should probably add tests around this, but there are some issues. Namely, the `connect` call in `TCPStore.cpp` does not pass in a `timeout` param at all right now (https://github.com/pytorch/pytorch/blob/929764ac2ad0d7c26e3defa2d4c4d39f4933a7ef/torch/lib/c10d/TCPStore.cpp#L290). We would need to change it to pass this param in, and then call `serverTCPStore.setTimeout()` in the test. Differential Revision: [D17430164](https://our.internmc.facebook.com/intern/diff/D17430164/) [ghstack-poisoned]
…prevent against infinite loop" infinite loop** infinite loop** Per #25769, we sometimes get an infinite loop when `TCPStore` calls `tcputil::connect`, and the server continually returns `ECONNRESET` or `ECONNREFUSED`. If a proper timeout is passed in, we guard against this by throwing an exception once the timeout has passed. Testing: Tested with modifying `TCPStore` to connect to an invalid port, thus getting `ECONNREFUSED`. If a valid timeout is passed in, the function correctly throws an exception. Steps below: 1) in TCPStore.cpp's constructor, replace the `connect` call with this line: `storeSocket_ = tcputil::connect(tcpStoreAddr_, 1, true, std::chrono::milliseconds(3000));` 2) Build the `TCPStoreTest` binary. 3) Run the binary. Expected output: ``` terminate called after throwing an instance of 'std::runtime_error' what(): connect() timed out Aborted (core dumped) ``` We should probably add tests around this, but there are some issues. Namely, the `connect` call in `TCPStore.cpp` does not pass in a `timeout` param at all right now (https://github.com/pytorch/pytorch/blob/929764ac2ad0d7c26e3defa2d4c4d39f4933a7ef/torch/lib/c10d/TCPStore.cpp#L290). We would need to change it to pass this param in, and then call `serverTCPStore.setTimeout()` in the test. Differential Revision: [D17430164](https://our.internmc.facebook.com/intern/diff/D17430164/) [ghstack-poisoned]
…gainstinfinite loop" infinite loop** infinite loop** infinite loop** Per #25769, we sometimes get an infinite loop when `TCPStore` calls `tcputil::connect`, and the server continually returns `ECONNRESET` or `ECONNREFUSED`. If a proper timeout is passed in, we guard against this by throwing an exception once the timeout has passed. Testing: Tested with modifying `TCPStore` to connect to an invalid port, thus getting `ECONNREFUSED`. If a valid timeout is passed in, the function correctly throws an exception. Steps below: 1) in TCPStore.cpp's constructor, replace the `connect` call with this line: `storeSocket_ = tcputil::connect(tcpStoreAddr_, 1, true, std::chrono::milliseconds(3000));` 2) Build the `TCPStoreTest` binary. 3) Run the binary. Expected output: ``` terminate called after throwing an instance of 'std::runtime_error' what(): connect() timed out Aborted (core dumped) ``` We should probably add tests around this, but there are some issues. Namely, the `connect` call in `TCPStore.cpp` does not pass in a `timeout` param at all right now (https://github.com/pytorch/pytorch/blob/929764ac2ad0d7c26e3defa2d4c4d39f4933a7ef/torch/lib/c10d/TCPStore.cpp#L290). We would need to change it to pass this param in, and then call `serverTCPStore.setTimeout()` in the test. Differential Revision: [D17430164](https://our.internmc.facebook.com/intern/diff/D17430164/) [ghstack-poisoned]
…inst infinite loop Pull Request resolved: #26364 Per #25769, we sometimes get an infinite loop when `TCPStore` calls `tcputil::connect`, and the server continually returns `ECONNRESET` or `ECONNREFUSED`. If a proper timeout is passed in, we guard against this by throwing an exception once the timeout has passed. Testing: Tested with modifying `TCPStore` to connect to an invalid port, thus getting `ECONNREFUSED`. If a valid timeout is passed in, the function correctly throws an exception. Steps below: 1) in TCPStore.cpp's constructor, replace the `connect` call with this line: `storeSocket_ = tcputil::connect(tcpStoreAddr_, 1, true, std::chrono::milliseconds(3000));` 2) Build the `TCPStoreTest` binary. 3) Run the binary. Expected output: ``` terminate called after throwing an instance of 'std::runtime_error' what(): Connecting to TCP store timed out. Aborted (core dumped) ``` ghstack-source-id: 90465058 Differential Revision: [D17430164](https://our.internmc.facebook.com/intern/diff/D17430164/)
| std::chrono::duration_cast<std::chrono::milliseconds>(timeout) - | ||
| std::chrono::duration_cast<std::chrono::milliseconds>(elapsed); | ||
| pollTimeout = std::max( | ||
| static_cast<int64_t>(0), static_cast<int64_t>(remaining.count())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is ugly, but the other variants (std::max(0L, ...), std::max(0LL, ...) all caused compiler errors
…gainstinfinite loop" infinite loop** infinite loop** infinite loop** Per #25769, we sometimes get an infinite loop when `TCPStore` calls `tcputil::connect`, and the server continually returns `ECONNRESET` or `ECONNREFUSED`. If a proper timeout is passed in, we guard against this by throwing an exception once the timeout has passed. Testing: Tested with modifying `TCPStore` to connect to an invalid port, thus getting `ECONNREFUSED`. If a valid timeout is passed in, the function correctly throws an exception. Steps below: 1) in TCPStore.cpp's constructor, replace the `connect` call with this line: `storeSocket_ = tcputil::connect(tcpStoreAddr_, 1, true, std::chrono::milliseconds(3000));` 2) Build the `TCPStoreTest` binary. 3) Run the binary. Expected output: ``` terminate called after throwing an instance of 'std::runtime_error' what(): connect() timed out Aborted (core dumped) ``` We should probably add tests around this, but there are some issues. Namely, the `connect` call in `TCPStore.cpp` does not pass in a `timeout` param at all right now (https://github.com/pytorch/pytorch/blob/929764ac2ad0d7c26e3defa2d4c4d39f4933a7ef/torch/lib/c10d/TCPStore.cpp#L290). We would need to change it to pass this param in, and then call `serverTCPStore.setTimeout()` in the test. Differential Revision: [D17430164](https://our.internmc.facebook.com/intern/diff/D17430164/) [ghstack-poisoned]
…inst infinite loop Pull Request resolved: #26364 Per #25769, we sometimes get an infinite loop when `TCPStore` calls `tcputil::connect`, and the server continually returns `ECONNRESET` or `ECONNREFUSED`. If a proper timeout is passed in, we guard against this by throwing an exception once the timeout has passed. Testing: Tested with modifying `TCPStore` to connect to an invalid port, thus getting `ECONNREFUSED`. If a valid timeout is passed in, the function correctly throws an exception. Steps below: 1) in TCPStore.cpp's constructor, replace the `connect` call with this line: `storeSocket_ = tcputil::connect(tcpStoreAddr_, 1, true, std::chrono::milliseconds(3000));` 2) Build the `TCPStoreTest` binary. 3) Run the binary. Expected output: ``` terminate called after throwing an instance of 'std::runtime_error' what(): Connecting to TCP store timed out. Aborted (core dumped) ``` ghstack-source-id: 90480086 Differential Revision: [D17430164](https://our.internmc.facebook.com/intern/diff/D17430164/)
|
This pull request has been merged in efd933d. |
Summary: Pull Request resolved: pytorch#26364 Per pytorch#25769, we sometimes get an infinite loop when `TCPStore` calls `tcputil::connect`, and the server continually returns `ECONNRESET` or `ECONNREFUSED`. If a proper timeout is passed in, we guard against this by throwing an exception once the timeout has passed. Testing: Tested with modifying `TCPStore` to connect to an invalid port, thus getting `ECONNREFUSED`. If a valid timeout is passed in, the function correctly throws an exception. Steps below: 1) in TCPStore.cpp's constructor, replace the `connect` call with this line: `storeSocket_ = tcputil::connect(tcpStoreAddr_, 1, true, std::chrono::milliseconds(3000));` 2) Build the `TCPStoreTest` binary. 3) Run the binary. Expected output: ``` terminate called after throwing an instance of 'std::runtime_error' what(): Connecting to TCP store timed out. Aborted (core dumped) ``` ghstack-source-id: 90480086 Test Plan: See above. Differential Revision: D17430164 fbshipit-source-id: 1482aca72fcc3ddb95ea25649ec057edda5d1934
Stack from ghstack:
infinite loop**
infinite loop**
infinite loop**
Per #25769, we sometimes get
an infinite loop when
TCPStorecallstcputil::connect, and the servercontinually returns
ECONNRESETorECONNREFUSED. If a proper timeout is passedin, we guard against this by throwing an exception once the timeout has passed.
Testing: Tested with modifying
TCPStoreto connect to an invalid port, thus gettingECONNREFUSED. If a valid timeout is passed in, the function correctly throws anexception. Steps below:
connectcall with this line:storeSocket_ = tcputil::connect(tcpStoreAddr_, 1, true, std::chrono::milliseconds(3000));TCPStoreTestbinary.We should probably add tests around this, but there are some issues. Namely, the
connectcall inTCPStore.cppdoes not pass in atimeoutparam at all right now (pytorch/torch/lib/c10d/TCPStore.cpp
Line 290 in 929764a
serverTCPStore.setTimeout()in the test.Differential Revision: D17430164