Skip to content

Conversation

@d4l3k
Copy link
Member

@d4l3k d4l3k commented Oct 15, 2024

This will retry connection timeout failures up to the timeout duration. Under heavy load the server may not be able to immediately accept the connection. In such a case we do want to retry the connection rather than fall back to ipv4 for the remaining of the connection timeout.

The connection timeout here is not the same as the c10d timeout which appears to be higher. We could adjust the linux timeout directly but using the c10d retry loop keeps things more consistent and gives us things like exponential backoff, logs, etc.

Example failure:

 socket.cpp:752] [c10d] The client socket has failed to connect to [...]:29400 (errno: 110 - Connection timed out).
 socket.cpp:752] [c10d] The IPv4 network addresses of (..., 29400) cannot be retrieved (gai error: -2 - Name or service not known).
... repeats ipv4 connection failure

From Linux man page: https://man7.org/linux/man-pages/man2/connect.2.html

ETIMEDOUT
              Timeout while attempting connection.  The server may be
              too busy to accept new connections.  Note that for IP
              sockets the timeout may be very long when syncookies are
              enabled on the server.

Test plan:

CI for backwards compatibility

cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @c-p-i-o

@d4l3k d4l3k requested review from c-p-i-o and rsdcastro October 15, 2024 18:12
@pytorch-bot
Copy link

pytorch-bot bot commented Oct 15, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/138003

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit cc338bd with merge base 0b7ef19 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category labels Oct 15, 2024
@d4l3k d4l3k requested a review from fduwjj October 15, 2024 18:12
Copy link
Contributor

@fduwjj fduwjj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sounds makes sense since we do have timeout check in the outside tryConnect.

@fduwjj fduwjj added the ciflow/trunk Trigger trunk jobs on your pull request label Oct 15, 2024
@d4l3k
Copy link
Member Author

d4l3k commented Oct 15, 2024

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

Copy link

@rsdcastro rsdcastro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks

@d4l3k d4l3k deleted the d4l3k/conn_timeout branch October 15, 2024 22:50
jackzhxng pushed a commit that referenced this pull request Oct 16, 2024
This will retry connection timeout failures up to the timeout duration. Under heavy load the server may not be able to immediately accept the connection. In such a case we do want to retry the connection rather than fall back to ipv4 for the remaining of the connection timeout.

The connection timeout here is not the same as the c10d timeout which appears to be higher. We could adjust the linux timeout directly but using the c10d retry loop keeps things more consistent and gives us things like exponential backoff, logs, etc.

Example failure:
```
 socket.cpp:752] [c10d] The client socket has failed to connect to [...]:29400 (errno: 110 - Connection timed out).
 socket.cpp:752] [c10d] The IPv4 network addresses of (..., 29400) cannot be retrieved (gai error: -2 - Name or service not known).
... repeats ipv4 connection failure
```

From Linux man page: https://man7.org/linux/man-pages/man2/connect.2.html
```
ETIMEDOUT
              Timeout while attempting connection.  The server may be
              too busy to accept new connections.  Note that for IP
              sockets the timeout may be very long when syncookies are
              enabled on the server.
```

Test plan:

CI for backwards compatibility

Pull Request resolved: #138003
Approved by: https://github.com/c-p-i-o, https://github.com/fduwjj, https://github.com/rsdcastro
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/trunk Trigger trunk jobs on your pull request Merged oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants