Spurious "socket cannot be initialized" error messages #128998

@mwlon

🐛 Describe the bug

When using nodes connected by IPv6, training proceeds correctly, but it emits misleading messages like this one at log level error:

2024-05-22 06:25:20.552227-04:00 Error From <host>: [W socket.cpp:436] [c10d] The server socket cannot be initialized on [::]:22010 (errno: 97 - Address family not supported by protocol).

There's no good way I know of to filter these out, since they come from C++, and I don't want to ignore error messages.

The logic causing this is in socket.cpp: we try every address returned by getaddrinfo, logging an error for each one that fails, but returning true if any of them succeeds. I think the fix is simple: move the error message into the failure case of tryListen(int family) instead.
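
To make the proposed behavior concrete, here is a minimal Python sketch of the idea, not the actual socket.cpp code. It assumes the candidates come from getaddrinfo; the name `listen_on_first` and the logger name are hypothetical. Per-address failures are collected silently and only logged if no candidate can be bound.

```python
import logging
import socket

# Hypothetical logger name, used only for this sketch.
log = logging.getLogger("listen_sketch")


def listen_on_first(candidates):
    """Return a listening socket for the first usable candidate, else None.

    `candidates` is a list of tuples in the shape returned by
    socket.getaddrinfo(): (family, type, proto, canonname, sockaddr).
    Failures are remembered but only logged if every candidate fails.
    """
    failures = []
    for family, socktype, proto, _canonname, sockaddr in candidates:
        sock = None
        try:
            sock = socket.socket(family, socktype, proto)
            sock.bind(sockaddr)
            sock.listen()
            return sock  # success: earlier failures are not reported
        except OSError as exc:
            if sock is not None:
                sock.close()
            failures.append((sockaddr, exc))

    # Only reached when no candidate worked, so these messages are not spurious.
    for sockaddr, exc in failures:
        log.warning(
            "The server socket cannot be initialized on %s (%s)", sockaddr, exc
        )
    return None
```

A caller could pass in something like `socket.getaddrinfo(None, 22010, type=socket.SOCK_STREAM, flags=socket.AI_PASSIVE)`. With this structure, a host where the IPv6 candidate fails but the IPv4 candidate binds would produce no message at all.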

Versions

2.3.1 (and earlier)

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k

Labels

module: c10d (issues/PRs related to collective communications and process groups)
oncall: distributed (add this issue/PR to distributed oncall triage queue)
triaged (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
