Description
🐛 Describe the bug
When training on nodes connected over IPv6, training proceeds correctly, but the following misleading message is emitted at log level error:
2024-05-22 06:25:20.552227-04:00 Error From <host>: [W socket.cpp:436] [c10d] The server socket cannot be initialized on [::]:22010 (errno: 97 - Address family not supported by protocol).
There's no good way I know of to filter these out, since they come from C++, and I don't want to ignore error messages.
The logic causing this is in socket.cpp: the code tries every address returned by getaddrinfo, logs an error for each one that fails, and still returns true if any of them succeeds. I think the fix is simple: move the error message into the failure case of tryListen(int family), so it is logged only when no address succeeds.
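To illustrate the idea, here is a minimal standalone sketch of deferring the messages until every candidate address has failed. It is not the actual socket.cpp code: Candidate, ListenResult, TryOne and tryListenDeferred are hypothetical names made up for this example, and the real bind/listen call is replaced by a callback so the logic can be exercised without a network.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for one address returned by getaddrinfo().
struct Candidate {
    std::string address;  // e.g. "[::]:22010"
};

// Hypothetical result type: warnings are filled in only on total failure.
struct ListenResult {
    bool ok;
    std::vector<std::string> warnings;
};

// Hypothetical callback: returns 0 on success, or an errno-style code on failure.
using TryOne = std::function<int(const Candidate&)>;

// Try each candidate in order. Failure messages are buffered and discarded
// as soon as one candidate succeeds, so a working fallback produces no noise.
ListenResult tryListenDeferred(const std::vector<Candidate>& candidates,
                               const TryOne& tryOne) {
    std::vector<std::string> pending;
    for (const auto& c : candidates) {
        const int err = tryOne(c);
        if (err == 0) {
            return {true, {}};  // success: drop any buffered messages
        }
        pending.push_back("The server socket cannot be initialized on " +
                          c.address + " (errno: " + std::to_string(err) + ").");
    }
    // Every candidate failed: now the messages are worth reporting.
    return {false, std::move(pending)};
}
```

With this shape, a failed attempt on [::] followed by a successful attempt on another address returns ok with no warnings, while the case where all addresses fail still reports each error.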
Versions
2.3.1 (and earlier)
cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k