Increase TCP listen queue size from 64 to 1024 #10268

adamlerer · 2018-08-06T18:33:08Z

Summary:
Running `torch.distributed.init_process_group` with more than ~64 processes fails with various errors, such as connection refused or connection reset by peer. After some digging, it looks like the root cause is that all workers have to connect to the master via TCP (both in Zeus init and in DataChannelTCP - look for `connect()`), and the listening socket only has a backlog of 64.
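To illustrate what the backlog controls, here is a minimal Python sketch of a listening socket with a configurable backlog. It is not the code touched by this PR; `open_listener` and `effective_backlog` are hypothetical helper names. One assumption worth stating: on Linux, `listen()` silently caps the requested backlog at the value in /proc/sys/net/core/somaxconn, so a requested 1024 only takes full effect if that setting is at least 1024.

```python
import socket

# Linux-only path (mentioned in the PR description); absent on other platforms.
SOMAXCONN_PATH = "/proc/sys/net/core/somaxconn"


def effective_backlog(requested: int) -> int:
    """Return the backlog Linux would actually apply for `requested`.

    listen() caps the value at net.core.somaxconn. If the file cannot be
    read (non-Linux platform), fall back to the requested value.
    """
    try:
        with open(SOMAXCONN_PATH) as f:
            return min(requested, int(f.read().strip()))
    except (OSError, ValueError):
        return requested


def open_listener(host: str = "127.0.0.1", port: int = 0,
                  backlog: int = 1024) -> socket.socket:
    """Hypothetical helper: bind a TCP socket and listen with `backlog`."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))
    # The backlog bounds how many pending connections may queue up
    # before the server calls accept().
    sock.listen(backlog)
    return sock
```

Comparing `effective_backlog(1024)` with 1024 on the master host is a quick way to check whether the system setting would undercut the larger value.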

I increased the backlog to 1024, which seems like enough for reasonable purposes (the hard limit is 65535 in /proc/sys/net/core/somaxconn). There's probably a more correct way to do this that involves retrying when the connection is refused.
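The retry approach mentioned above could look roughly like the following Python sketch. This is an assumption about what such a fix might involve, not what this PR implements; `connect_with_retry` and its parameters are hypothetical names chosen for illustration.

```python
import socket
import time


def connect_with_retry(host: str, port: int, attempts: int = 5,
                       initial_delay: float = 0.05,
                       max_delay: float = 2.0) -> socket.socket:
    """Hypothetical helper: connect to (host, port), retrying on failure.

    Retries on the two errors named in the PR description (connection
    refused, connection reset by peer) with exponential backoff, and
    re-raises the last error once `attempts` is exhausted.
    """
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return socket.create_connection((host, port), timeout=5.0)
        except (ConnectionRefusedError, ConnectionResetError):
            if attempt == attempts:
                raise
            time.sleep(delay)
            # Double the wait each time, up to max_delay.
            delay = min(delay * 2, max_delay)
    raise ValueError("attempts must be at least 1")
```

With backoff, workers that hit a full queue spread their reconnects over time instead of failing outright, which makes the exact backlog value less critical.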

Differential Revision: D9182216

Summary: Pull Request resolved: pytorch#10268
Differential Revision: D9182216
fbshipit-source-id: d57c37c20fc1fdbcc24b5064a29174ca580c43d6

Summary: Pull Request resolved: pytorch#10268
Reviewed By: soumith
Differential Revision: D9182216
fbshipit-source-id: 2f71c4995841db26c670cec344f1e3c7a80a7936

adamlerer requested review from apaszke, colesbury, ezyang, gchanan, soumith and zdevito as code owners August 6, 2018 18:33

adamlerer force-pushed the export-D9182216 branch from 96eb5f1 to f4d3240 Compare August 6, 2018 18:39

adamlerer force-pushed the export-D9182216 branch from f4d3240 to 6dbd552 Compare August 6, 2018 19:22

soumith approved these changes Aug 6, 2018

facebook-github-bot closed this in 18e2983 Aug 7, 2018

ezyang added the merged label Jun 26, 2019
