-
Notifications
You must be signed in to change notification settings - Fork 26.3k
[c10d] socket: retry connection timeout failures #138003
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/138003
Note: Links to docs will display an error until the docs builds have been completed. ✅ No FailuresAs of commit cc338bd with merge base 0b7ef19 ( This comment was automatically generated by Dr. CI and updates every 15 minutes. |
fduwjj
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sounds makes sense since we do have timeout check in the outside tryConnect.
|
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
rsdcastro
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks
This will retry connection timeout failures up to the timeout duration. Under heavy load the server may not be able to immediately accept the connection. In such a case we do want to retry the connection rather than fall back to ipv4 for the remaining of the connection timeout. The connection timeout here is not the same as the c10d timeout which appears to be higher. We could adjust the linux timeout directly but using the c10d retry loop keeps things more consistent and gives us things like exponential backoff, logs, etc. Example failure: ``` socket.cpp:752] [c10d] The client socket has failed to connect to [...]:29400 (errno: 110 - Connection timed out). socket.cpp:752] [c10d] The IPv4 network addresses of (..., 29400) cannot be retrieved (gai error: -2 - Name or service not known). ... repeats ipv4 connection failure ``` From Linux man page: https://man7.org/linux/man-pages/man2/connect.2.html ``` ETIMEDOUT Timeout while attempting connection. The server may be too busy to accept new connections. Note that for IP sockets the timeout may be very long when syncookies are enabled on the server. ``` Test plan: CI for backwards compatibility Pull Request resolved: #138003 Approved by: https://github.com/c-p-i-o, https://github.com/fduwjj, https://github.com/rsdcastro
This will retry connection timeout failures up to the timeout duration. Under heavy load the server may not be able to immediately accept the connection. In such a case we do want to retry the connection rather than fall back to ipv4 for the remaining of the connection timeout.
The connection timeout here is not the same as the c10d timeout which appears to be higher. We could adjust the linux timeout directly but using the c10d retry loop keeps things more consistent and gives us things like exponential backoff, logs, etc.
Example failure:
From Linux man page: https://man7.org/linux/man-pages/man2/connect.2.html
Test plan:
CI for backwards compatibility
cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @c-p-i-o