Skip to content

Conversation

@rohan-varma
Copy link
Contributor

@rohan-varma rohan-varma commented Sep 17, 2019

Stack from ghstack:

Per #25769, we sometimes get
an infinite loop when TCPStore calls tcputil::connect, and the server
continually returns ECONNRESET or ECONNREFUSED. If a proper timeout is passed
in, we guard against this by throwing an exception once the timeout has passed.

Testing: Tested with modifying TCPStore to connect to an invalid port, thus getting
ECONNREFUSED. If a valid timeout is passed in, the function correctly throws an
exception. Steps below:

  1. in TCPStore.cpp's constructor, replace the connect call with this line:
    storeSocket_ = tcputil::connect(tcpStoreAddr_, 1, true, std::chrono::milliseconds(3000));
  2. Build the TCPStoreTest binary.
  3. Run the binary. Expected output:
terminate called after throwing an instance of 'std::runtime_error'
  what():  connect() timed out
Aborted (core dumped)

We should probably add tests around this, but there are some issues. Namely, the connect call in TCPStore.cpp does not pass in a timeout param at all right now (

storeSocket_ = tcputil::connect(tcpStoreAddr_, tcpStorePort_);
). We would need to change it to pass this param in, and then call serverTCPStore.setTimeout() in the test.

Differential Revision: D17430164

…inst

infinite loop

Per #25769, we sometimes get
an infinite loop when `TCPStore` calls `tcputil::connect`, and the server
continually returns `ECONNRESET` or `ECONNREFUSED`. If a proper timeout is passed
in, we guard against this by throwing an exception once the timeout has passed.

Testing: Tested with modifying `TCPStore` to connect to an invalid port, thus getting
`ECONNREFUSED`. If a valid timeout is passed in, the function correctly throws an
exception. Steps below:
1) in TCPStore.cpp's constructor, replace the `connect` call with this line:
 `storeSocket_ = tcputil::connect(tcpStoreAddr_, 1, true, std::chrono::milliseconds(3000));`
2) Build the `TCPStoreTest` binary.
3) Run the binary. Expected output:

```
terminate called after throwing an instance of 'std::runtime_error'
  what():  Connecting to TCP store timed out.
Aborted (core dumped)
```

Differential Revision: [D17430164](https://our.internmc.facebook.com/intern/diff/D17430164/)

[ghstack-poisoned]
@pytorchbot pytorchbot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Sep 17, 2019
rohan-varma added a commit that referenced this pull request Sep 17, 2019
…inst

infinite loop

Per #25769, we sometimes get
an infinite loop when `TCPStore` calls `tcputil::connect`, and the server
continually returns `ECONNRESET` or `ECONNREFUSED`. If a proper timeout is passed
in, we guard against this by throwing an exception once the timeout has passed.

Testing: Tested with modifying `TCPStore` to connect to an invalid port, thus getting
`ECONNREFUSED`. If a valid timeout is passed in, the function correctly throws an
exception. Steps below:
1) in TCPStore.cpp's constructor, replace the `connect` call with this line:
 `storeSocket_ = tcputil::connect(tcpStoreAddr_, 1, true, std::chrono::milliseconds(3000));`
2) Build the `TCPStoreTest` binary.
3) Run the binary. Expected output:

```
terminate called after throwing an instance of 'std::runtime_error'
  what():  Connecting to TCP store timed out.
Aborted (core dumped)
```

Differential Revision: [D17430164](https://our.internmc.facebook.com/intern/diff/D17430164/)

ghstack-source-id: 90271154
Pull Request resolved: #26364
Copy link
Contributor

@pietern pietern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The call to poll uses timeout.count() as its timeout. It should use the time remaining instead, because then it accounts for the time already spent. I'm thinking this should do the trick (I have the std::chrono docs in front of me now anyway):

const auto elapsed = std::chrono::high_resolution_clock::now() - start;
const auto remaining = std::chrono::duration_cast<std::chrono::milliseconds>(timeout) -
  std::chrono::duration_cast<std::chrono::milliseconds>(elapsed);
const auto remainingMillis = std::max(0, left.count());

if (elapsedMilliseconds > timeout.count()) {
throw std::runtime_error("Connecting to TCP store timed out.");
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to take time_since_epoch. You can use the std::chrono::time_point and std::chrono::duration directly without bothering with the unit (the duration class takes care of the comparison correctly if the unit is different):

const auto start = std::chrono::high_resolution_clock::now();

// ...

const auto elapsed = std::chrono::high_resolution_clock::now() - start;
if (elapsed > timeout) {
  // ...
}

/*
if a timeout is specified, check time elapsed to see if we need to
timeout. A timeout is specified if timeout != kNoTimeout.
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment style is // pretty much everywhere. There are some blocks with /* xyz */ but those are typically limited to large chunks of documentation outside of a function. Here you can stick to //.


auto elapsedMilliseconds = nowMilliseconds - startMilliseconds;
if (elapsedMilliseconds > timeout.count()) {
throw std::runtime_error("Connecting to TCP store timed out.");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a different timeout error thrown above, where poll(2) returns with a timeout. Please consolidate so there is only one message for connect timeout.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. The errors are slightly different though, one is because poll(2) timed out on a single attempt, while this error can happen because we keep getting ECONNREFUSED or ECONNRESET after trying many times. Is it worth having different error messages to help with debugging?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fine to be the same and that multiple add to the confusion here.

…prevent against

infinite loop"

infinite loop**

Per #25769, we sometimes get
an infinite loop when `TCPStore` calls `tcputil::connect`, and the server
continually returns `ECONNRESET` or `ECONNREFUSED`. If a proper timeout is passed
in, we guard against this by throwing an exception once the timeout has passed.

Testing: Tested with modifying `TCPStore` to connect to an invalid port, thus getting
`ECONNREFUSED`. If a valid timeout is passed in, the function correctly throws an
exception. Steps below:
1) in TCPStore.cpp's constructor, replace the `connect` call with this line:
 `storeSocket_ = tcputil::connect(tcpStoreAddr_, 1, true, std::chrono::milliseconds(3000));`
2) Build the `TCPStoreTest` binary.
3) Run the binary. Expected output:

```
terminate called after throwing an instance of 'std::runtime_error'
  what():  Connecting to TCP store timed out.
Aborted (core dumped)
```

We should probably add tests around this, but there are some issues. Namely, the `connect` call in `TCPStore.cpp` does not pass in a `timeout` param at all right now (https://github.com/pytorch/pytorch/blob/929764ac2ad0d7c26e3defa2d4c4d39f4933a7ef/torch/lib/c10d/TCPStore.cpp#L290). We would need to change it to pass this param in, and then call `serverTCPStore.setTimeout()` in the test.

Differential Revision: [D17430164](https://our.internmc.facebook.com/intern/diff/D17430164/)

[ghstack-poisoned]
…prevent against

infinite loop"

infinite loop**
infinite loop**

Per #25769, we sometimes get
an infinite loop when `TCPStore` calls `tcputil::connect`, and the server
continually returns `ECONNRESET` or `ECONNREFUSED`. If a proper timeout is passed
in, we guard against this by throwing an exception once the timeout has passed.

Testing: Tested with modifying `TCPStore` to connect to an invalid port, thus getting
`ECONNREFUSED`. If a valid timeout is passed in, the function correctly throws an
exception. Steps below:
1) in TCPStore.cpp's constructor, replace the `connect` call with this line:
 `storeSocket_ = tcputil::connect(tcpStoreAddr_, 1, true, std::chrono::milliseconds(3000));`
2) Build the `TCPStoreTest` binary.
3) Run the binary. Expected output:

```
terminate called after throwing an instance of 'std::runtime_error'
  what():  Connecting to TCP store timed out.
Aborted (core dumped)
```

We should probably add tests around this, but there are some issues. Namely, the `connect` call in `TCPStore.cpp` does not pass in a `timeout` param at all right now (https://github.com/pytorch/pytorch/blob/929764ac2ad0d7c26e3defa2d4c4d39f4933a7ef/torch/lib/c10d/TCPStore.cpp#L290). We would need to change it to pass this param in, and then call `serverTCPStore.setTimeout()` in the test.

Differential Revision: [D17430164](https://our.internmc.facebook.com/intern/diff/D17430164/)

[ghstack-poisoned]
…prevent against

infinite loop"

infinite loop**

Per #25769, we sometimes get
an infinite loop when `TCPStore` calls `tcputil::connect`, and the server
continually returns `ECONNRESET` or `ECONNREFUSED`. If a proper timeout is passed
in, we guard against this by throwing an exception once the timeout has passed.

Testing: Tested with modifying `TCPStore` to connect to an invalid port, thus getting
`ECONNREFUSED`. If a valid timeout is passed in, the function correctly throws an
exception. Steps below:
1) in TCPStore.cpp's constructor, replace the `connect` call with this line:
 `storeSocket_ = tcputil::connect(tcpStoreAddr_, 1, true, std::chrono::milliseconds(3000));`
2) Build the `TCPStoreTest` binary.
3) Run the binary. Expected output:

```
terminate called after throwing an instance of 'std::runtime_error'
  what():  connect() timed out
Aborted (core dumped)
```

We should probably add tests around this, but there are some issues. Namely, the `connect` call in `TCPStore.cpp` does not pass in a `timeout` param at all right now (https://github.com/pytorch/pytorch/blob/929764ac2ad0d7c26e3defa2d4c4d39f4933a7ef/torch/lib/c10d/TCPStore.cpp#L290). We would need to change it to pass this param in, and then call `serverTCPStore.setTimeout()` in the test.

Differential Revision: [D17430164](https://our.internmc.facebook.com/intern/diff/D17430164/)

[ghstack-poisoned]
…prevent against

infinite loop"

infinite loop**
infinite loop**

Per #25769, we sometimes get
an infinite loop when `TCPStore` calls `tcputil::connect`, and the server
continually returns `ECONNRESET` or `ECONNREFUSED`. If a proper timeout is passed
in, we guard against this by throwing an exception once the timeout has passed.

Testing: Tested with modifying `TCPStore` to connect to an invalid port, thus getting
`ECONNREFUSED`. If a valid timeout is passed in, the function correctly throws an
exception. Steps below:
1) in TCPStore.cpp's constructor, replace the `connect` call with this line:
 `storeSocket_ = tcputil::connect(tcpStoreAddr_, 1, true, std::chrono::milliseconds(3000));`
2) Build the `TCPStoreTest` binary.
3) Run the binary. Expected output:

```
terminate called after throwing an instance of 'std::runtime_error'
  what():  connect() timed out
Aborted (core dumped)
```

We should probably add tests around this, but there are some issues. Namely, the `connect` call in `TCPStore.cpp` does not pass in a `timeout` param at all right now (https://github.com/pytorch/pytorch/blob/929764ac2ad0d7c26e3defa2d4c4d39f4933a7ef/torch/lib/c10d/TCPStore.cpp#L290). We would need to change it to pass this param in, and then call `serverTCPStore.setTimeout()` in the test.

Differential Revision: [D17430164](https://our.internmc.facebook.com/intern/diff/D17430164/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Sep 18, 2019
…inst

infinite loop

Pull Request resolved: #26364

Per #25769, we sometimes get
an infinite loop when `TCPStore` calls `tcputil::connect`, and the server
continually returns `ECONNRESET` or `ECONNREFUSED`. If a proper timeout is passed
in, we guard against this by throwing an exception once the timeout has passed.

Testing: Tested with modifying `TCPStore` to connect to an invalid port, thus getting
`ECONNREFUSED`. If a valid timeout is passed in, the function correctly throws an
exception. Steps below:
1) in TCPStore.cpp's constructor, replace the `connect` call with this line:
 `storeSocket_ = tcputil::connect(tcpStoreAddr_, 1, true, std::chrono::milliseconds(3000));`
2) Build the `TCPStoreTest` binary.
3) Run the binary. Expected output:

```
terminate called after throwing an instance of 'std::runtime_error'
  what():  Connecting to TCP store timed out.
Aborted (core dumped)
```
ghstack-source-id: 90359554

Differential Revision: [D17430164](https://our.internmc.facebook.com/intern/diff/D17430164/)
@rohan-varma
Copy link
Contributor Author

The call to poll uses timeout.count() as its timeout. It should use the time remaining instead, because then it accounts for the time already spent.

Sounds good, we should be careful to respect if timeout = kNoTimeout though. I'll condition the code on this, because if we don't, then if timeout = kNoTimeout (which is aliased to -1) we will actually pass in a timeout to 0 to poll, instead of -1, changing its semantics.

…prevent against

infinite loop"

infinite loop**
infinite loop**
infinite loop**

Per #25769, we sometimes get
an infinite loop when `TCPStore` calls `tcputil::connect`, and the server
continually returns `ECONNRESET` or `ECONNREFUSED`. If a proper timeout is passed
in, we guard against this by throwing an exception once the timeout has passed.

Testing: Tested with modifying `TCPStore` to connect to an invalid port, thus getting
`ECONNREFUSED`. If a valid timeout is passed in, the function correctly throws an
exception. Steps below:
1) in TCPStore.cpp's constructor, replace the `connect` call with this line:
 `storeSocket_ = tcputil::connect(tcpStoreAddr_, 1, true, std::chrono::milliseconds(3000));`
2) Build the `TCPStoreTest` binary.
3) Run the binary. Expected output:

```
terminate called after throwing an instance of 'std::runtime_error'
  what():  connect() timed out
Aborted (core dumped)
```

We should probably add tests around this, but there are some issues. Namely, the `connect` call in `TCPStore.cpp` does not pass in a `timeout` param at all right now (https://github.com/pytorch/pytorch/blob/929764ac2ad0d7c26e3defa2d4c4d39f4933a7ef/torch/lib/c10d/TCPStore.cpp#L290). We would need to change it to pass this param in, and then call `serverTCPStore.setTimeout()` in the test.

Differential Revision: [D17430164](https://our.internmc.facebook.com/intern/diff/D17430164/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Sep 18, 2019
…inst

infinite loop

Pull Request resolved: #26364

Per #25769, we sometimes get
an infinite loop when `TCPStore` calls `tcputil::connect`, and the server
continually returns `ECONNRESET` or `ECONNREFUSED`. If a proper timeout is passed
in, we guard against this by throwing an exception once the timeout has passed.

Testing: Tested with modifying `TCPStore` to connect to an invalid port, thus getting
`ECONNREFUSED`. If a valid timeout is passed in, the function correctly throws an
exception. Steps below:
1) in TCPStore.cpp's constructor, replace the `connect` call with this line:
 `storeSocket_ = tcputil::connect(tcpStoreAddr_, 1, true, std::chrono::milliseconds(3000));`
2) Build the `TCPStoreTest` binary.
3) Run the binary. Expected output:

```
terminate called after throwing an instance of 'std::runtime_error'
  what():  Connecting to TCP store timed out.
Aborted (core dumped)
```
ghstack-source-id: 90371475

Differential Revision: [D17430164](https://our.internmc.facebook.com/intern/diff/D17430164/)
const auto remaining =
std::chrono::duration_cast<std::chrono::milliseconds>(timeout) -
std::chrono::duration_cast<std::chrono::milliseconds>(elapsed);
pollTimeout = std::max(0, (int)remaining.count());
Copy link
Contributor Author

@rohan-varma rohan-varma Sep 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

have to cast this to int, or else the macOS compiler throws errors. ::poll also would truncate longs into int.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should cast to int64_t (or use 0LL in std::max()), also use static_cast<int64_t> instead of C style casts.

const auto elapsed =
std::chrono::high_resolution_clock::now() - start;
if (elapsed > timeout) {
throw std::runtime_error("connect() timed out");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error msg now reads the same as the one we throw when poll times out. Should we make this a const string somewhere?

Copy link
Contributor

@pietern pietern Sep 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, good idea. We like to use kSomeName for globals and this one can be kept local only in an anonymous namespace (equivalent of static global), since it's only used in this file.

@rohan-varma rohan-varma requested a review from pietern September 18, 2019 22:35
@rohan-varma
Copy link
Contributor Author

@pytorchbot retest this please

pfd.events = POLLOUT;

int numReady = ::poll(&pfd, 1, timeout.count());
int pollTimeout = timeout.count();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It reads a bit weird that you first set pollTimeout and then check the other variable. Could you instead assign pollTimeout = -1 to be clear that this is the default? This makes it clear(er) IMO that it'll have a negative value if timeout == kNoTimeout. As is, we could redefine kNoTimeout to be 0 and not be aware of any harm.

Copy link
Contributor

@pietern pietern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comment regarding robustness. Feel free to merge once that's taken care of. Thanks, @rohan-varma!

pfd.events = POLLOUT;

int numReady = ::poll(&pfd, 1, timeout.count());
int pollTimeout = timeout.count();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Use int64_t since std::chrono::milliseconds can return a value larger than int.

}
std::this_thread::sleep_for(std::chrono::seconds(1));
anyRefused = false;
anyReset = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pietern Why do we continue retrying here if we have already exhausted all addresses?

Copy link
Contributor

@pietern pietern Sep 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The process that is supposed to start the TCP initialization daemon may not yet have started and then the host will reply with ECONNREFUSED. Somewhat similar, if the daemon has started, but 1000 processes connect at the same time (since they are typically all started at the same time), they may exhaust the listen backlog, and if full, the host replies with a RST packet (hence ECONNRESET). In both cases, we should retry until timeout.

const auto remaining =
std::chrono::duration_cast<std::chrono::milliseconds>(timeout) -
std::chrono::duration_cast<std::chrono::milliseconds>(elapsed);
pollTimeout = std::max(0, (int)remaining.count());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should cast to int64_t (or use 0LL in std::max()), also use static_cast<int64_t> instead of C style casts.

…prevent against

infinite loop"

infinite loop**

Per #25769, we sometimes get
an infinite loop when `TCPStore` calls `tcputil::connect`, and the server
continually returns `ECONNRESET` or `ECONNREFUSED`. If a proper timeout is passed
in, we guard against this by throwing an exception once the timeout has passed.

Testing: Tested with modifying `TCPStore` to connect to an invalid port, thus getting
`ECONNREFUSED`. If a valid timeout is passed in, the function correctly throws an
exception. Steps below:
1) in TCPStore.cpp's constructor, replace the `connect` call with this line:
 `storeSocket_ = tcputil::connect(tcpStoreAddr_, 1, true, std::chrono::milliseconds(3000));`
2) Build the `TCPStoreTest` binary.
3) Run the binary. Expected output:

```
terminate called after throwing an instance of 'std::runtime_error'
  what():  connect() timed out
Aborted (core dumped)
```

We should probably add tests around this, but there are some issues. Namely, the `connect` call in `TCPStore.cpp` does not pass in a `timeout` param at all right now (https://github.com/pytorch/pytorch/blob/929764ac2ad0d7c26e3defa2d4c4d39f4933a7ef/torch/lib/c10d/TCPStore.cpp#L290). We would need to change it to pass this param in, and then call `serverTCPStore.setTimeout()` in the test.

Differential Revision: [D17430164](https://our.internmc.facebook.com/intern/diff/D17430164/)

[ghstack-poisoned]
…prevent against

infinite loop"

infinite loop**

Per #25769, we sometimes get
an infinite loop when `TCPStore` calls `tcputil::connect`, and the server
continually returns `ECONNRESET` or `ECONNREFUSED`. If a proper timeout is passed
in, we guard against this by throwing an exception once the timeout has passed.

Testing: Tested with modifying `TCPStore` to connect to an invalid port, thus getting
`ECONNREFUSED`. If a valid timeout is passed in, the function correctly throws an
exception. Steps below:
1) in TCPStore.cpp's constructor, replace the `connect` call with this line:
 `storeSocket_ = tcputil::connect(tcpStoreAddr_, 1, true, std::chrono::milliseconds(3000));`
2) Build the `TCPStoreTest` binary.
3) Run the binary. Expected output:

```
terminate called after throwing an instance of 'std::runtime_error'
  what():  connect() timed out
Aborted (core dumped)
```

We should probably add tests around this, but there are some issues. Namely, the `connect` call in `TCPStore.cpp` does not pass in a `timeout` param at all right now (https://github.com/pytorch/pytorch/blob/929764ac2ad0d7c26e3defa2d4c4d39f4933a7ef/torch/lib/c10d/TCPStore.cpp#L290). We would need to change it to pass this param in, and then call `serverTCPStore.setTimeout()` in the test.

Differential Revision: [D17430164](https://our.internmc.facebook.com/intern/diff/D17430164/)

[ghstack-poisoned]
…prevent against

infinite loop"

infinite loop**
infinite loop**

Per #25769, we sometimes get
an infinite loop when `TCPStore` calls `tcputil::connect`, and the server
continually returns `ECONNRESET` or `ECONNREFUSED`. If a proper timeout is passed
in, we guard against this by throwing an exception once the timeout has passed.

Testing: Tested with modifying `TCPStore` to connect to an invalid port, thus getting
`ECONNREFUSED`. If a valid timeout is passed in, the function correctly throws an
exception. Steps below:
1) in TCPStore.cpp's constructor, replace the `connect` call with this line:
 `storeSocket_ = tcputil::connect(tcpStoreAddr_, 1, true, std::chrono::milliseconds(3000));`
2) Build the `TCPStoreTest` binary.
3) Run the binary. Expected output:

```
terminate called after throwing an instance of 'std::runtime_error'
  what():  connect() timed out
Aborted (core dumped)
```

We should probably add tests around this, but there are some issues. Namely, the `connect` call in `TCPStore.cpp` does not pass in a `timeout` param at all right now (https://github.com/pytorch/pytorch/blob/929764ac2ad0d7c26e3defa2d4c4d39f4933a7ef/torch/lib/c10d/TCPStore.cpp#L290). We would need to change it to pass this param in, and then call `serverTCPStore.setTimeout()` in the test.

Differential Revision: [D17430164](https://our.internmc.facebook.com/intern/diff/D17430164/)

[ghstack-poisoned]
@rohan-varma rohan-varma changed the title [pytorch][distributed] use timeout in connect function to prevent against infinite loop [distributed] use timeout in connect function to prevent againstinfinite loop Sep 19, 2019
…gainstinfinite loop"

infinite loop**
infinite loop**
infinite loop**

Per #25769, we sometimes get
an infinite loop when `TCPStore` calls `tcputil::connect`, and the server
continually returns `ECONNRESET` or `ECONNREFUSED`. If a proper timeout is passed
in, we guard against this by throwing an exception once the timeout has passed.

Testing: Tested with modifying `TCPStore` to connect to an invalid port, thus getting
`ECONNREFUSED`. If a valid timeout is passed in, the function correctly throws an
exception. Steps below:
1) in TCPStore.cpp's constructor, replace the `connect` call with this line:
 `storeSocket_ = tcputil::connect(tcpStoreAddr_, 1, true, std::chrono::milliseconds(3000));`
2) Build the `TCPStoreTest` binary.
3) Run the binary. Expected output:

```
terminate called after throwing an instance of 'std::runtime_error'
  what():  connect() timed out
Aborted (core dumped)
```

We should probably add tests around this, but there are some issues. Namely, the `connect` call in `TCPStore.cpp` does not pass in a `timeout` param at all right now (https://github.com/pytorch/pytorch/blob/929764ac2ad0d7c26e3defa2d4c4d39f4933a7ef/torch/lib/c10d/TCPStore.cpp#L290). We would need to change it to pass this param in, and then call `serverTCPStore.setTimeout()` in the test.

Differential Revision: [D17430164](https://our.internmc.facebook.com/intern/diff/D17430164/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Sep 19, 2019
…inst

infinite loop

Pull Request resolved: #26364

Per #25769, we sometimes get
an infinite loop when `TCPStore` calls `tcputil::connect`, and the server
continually returns `ECONNRESET` or `ECONNREFUSED`. If a proper timeout is passed
in, we guard against this by throwing an exception once the timeout has passed.

Testing: Tested with modifying `TCPStore` to connect to an invalid port, thus getting
`ECONNREFUSED`. If a valid timeout is passed in, the function correctly throws an
exception. Steps below:
1) in TCPStore.cpp's constructor, replace the `connect` call with this line:
 `storeSocket_ = tcputil::connect(tcpStoreAddr_, 1, true, std::chrono::milliseconds(3000));`
2) Build the `TCPStoreTest` binary.
3) Run the binary. Expected output:

```
terminate called after throwing an instance of 'std::runtime_error'
  what():  Connecting to TCP store timed out.
Aborted (core dumped)
```
ghstack-source-id: 90465058

Differential Revision: [D17430164](https://our.internmc.facebook.com/intern/diff/D17430164/)
std::chrono::duration_cast<std::chrono::milliseconds>(timeout) -
std::chrono::duration_cast<std::chrono::milliseconds>(elapsed);
pollTimeout = std::max(
static_cast<int64_t>(0), static_cast<int64_t>(remaining.count()));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is ugly, but the other variants (std::max(0L, ...), std::max(0LL, ...) all caused compiler errors

…gainstinfinite loop"

infinite loop**
infinite loop**
infinite loop**

Per #25769, we sometimes get
an infinite loop when `TCPStore` calls `tcputil::connect`, and the server
continually returns `ECONNRESET` or `ECONNREFUSED`. If a proper timeout is passed
in, we guard against this by throwing an exception once the timeout has passed.

Testing: Tested with modifying `TCPStore` to connect to an invalid port, thus getting
`ECONNREFUSED`. If a valid timeout is passed in, the function correctly throws an
exception. Steps below:
1) in TCPStore.cpp's constructor, replace the `connect` call with this line:
 `storeSocket_ = tcputil::connect(tcpStoreAddr_, 1, true, std::chrono::milliseconds(3000));`
2) Build the `TCPStoreTest` binary.
3) Run the binary. Expected output:

```
terminate called after throwing an instance of 'std::runtime_error'
  what():  connect() timed out
Aborted (core dumped)
```

We should probably add tests around this, but there are some issues. Namely, the `connect` call in `TCPStore.cpp` does not pass in a `timeout` param at all right now (https://github.com/pytorch/pytorch/blob/929764ac2ad0d7c26e3defa2d4c4d39f4933a7ef/torch/lib/c10d/TCPStore.cpp#L290). We would need to change it to pass this param in, and then call `serverTCPStore.setTimeout()` in the test.

Differential Revision: [D17430164](https://our.internmc.facebook.com/intern/diff/D17430164/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Sep 20, 2019
…inst

infinite loop

Pull Request resolved: #26364

Per #25769, we sometimes get
an infinite loop when `TCPStore` calls `tcputil::connect`, and the server
continually returns `ECONNRESET` or `ECONNREFUSED`. If a proper timeout is passed
in, we guard against this by throwing an exception once the timeout has passed.

Testing: Tested with modifying `TCPStore` to connect to an invalid port, thus getting
`ECONNREFUSED`. If a valid timeout is passed in, the function correctly throws an
exception. Steps below:
1) in TCPStore.cpp's constructor, replace the `connect` call with this line:
 `storeSocket_ = tcputil::connect(tcpStoreAddr_, 1, true, std::chrono::milliseconds(3000));`
2) Build the `TCPStoreTest` binary.
3) Run the binary. Expected output:

```
terminate called after throwing an instance of 'std::runtime_error'
  what():  Connecting to TCP store timed out.
Aborted (core dumped)
```
ghstack-source-id: 90480086

Differential Revision: [D17430164](https://our.internmc.facebook.com/intern/diff/D17430164/)
@facebook-github-bot
Copy link
Contributor

This pull request has been merged in efd933d.

mingbowan pushed a commit to mingbowan/pytorch that referenced this pull request Sep 23, 2019
Summary:
Pull Request resolved: pytorch#26364

Per pytorch#25769, we sometimes get
an infinite loop when `TCPStore` calls `tcputil::connect`, and the server
continually returns `ECONNRESET` or `ECONNREFUSED`. If a proper timeout is passed
in, we guard against this by throwing an exception once the timeout has passed.

Testing: Tested with modifying `TCPStore` to connect to an invalid port, thus getting
`ECONNREFUSED`. If a valid timeout is passed in, the function correctly throws an
exception. Steps below:
1) in TCPStore.cpp's constructor, replace the `connect` call with this line:
 `storeSocket_ = tcputil::connect(tcpStoreAddr_, 1, true, std::chrono::milliseconds(3000));`
2) Build the `TCPStoreTest` binary.
3) Run the binary. Expected output:

```
terminate called after throwing an instance of 'std::runtime_error'
  what():  Connecting to TCP store timed out.
Aborted (core dumped)
```
ghstack-source-id: 90480086

Test Plan: See above.

Differential Revision: D17430164

fbshipit-source-id: 1482aca72fcc3ddb95ea25649ec057edda5d1934
@facebook-github-bot facebook-github-bot deleted the gh/rohan-varma/6/head branch October 28, 2019 22:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Merged oncall: distributed Add this issue/PR to distributed oncall triage queue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants