Skip to content

Conversation

@adamlerer
Copy link
Contributor

Summary:
Running torch.distributed.init_process_group fails with more than ~64 processes, with various errors like connection refused or connection reset by peer. After some digging, it looks like the root cause is that all workers have to connect to master via TCP (both in Zeus init and in DataChannelTCP - look for connect()), and the listening socket only has a backlog of 64.

I increased the backlog to 1024, that seems like enough for reasonable purposes (the hard limit is 65535 in /proc/sys/net/core/somaxconn). There's probably a more correct way to do this that involves retries when connection is refused.

Differential Revision: D9182216

Summary:
Pull Request resolved: pytorch#10268

Running torch.distributed.init_process_group fails with more than ~64 processes, with various errors like connection refused or connection reset by peer. After some digging, it looks like the root cause is that all workers have to connect to master via TCP (both in Zeus init and in DataChannelTCP - look for `connect()`), and the listening socket only has a backlog of 64.

I increased the backlog to 1024, that seems like enough for reasonable purposes (the hard limit is 65535 in /proc/sys/net/core/somaxconn). There's probably a more correct way to do this that involves retries when connection is refused.

Differential Revision: D9182216

fbshipit-source-id: d57c37c20fc1fdbcc24b5064a29174ca580c43d6
PenghuiCheng pushed a commit to PenghuiCheng/pytorch that referenced this pull request Aug 10, 2018
Summary:
Pull Request resolved: pytorch#10268

Running torch.distributed.init_process_group fails with more than ~64 processes, with various errors like connection refused or connection reset by peer. After some digging, it looks like the root cause is that all workers have to connect to master via TCP (both in Zeus init and in DataChannelTCP - look for `connect()`), and the listening socket only has a backlog of 64.

I increased the backlog to 1024, that seems like enough for reasonable purposes (the hard limit is 65535 in /proc/sys/net/core/somaxconn). There's probably a more correct way to do this that involves retries when connection is refused.

Reviewed By: soumith

Differential Revision: D9182216

fbshipit-source-id: 2f71c4995841db26c670cec344f1e3c7a80a7936
goodlux pushed a commit to goodlux/pytorch that referenced this pull request Aug 15, 2018
Summary:
Pull Request resolved: pytorch#10268

Running torch.distributed.init_process_group fails with more than ~64 processes, with various errors like connection refused or connection reset by peer. After some digging, it looks like the root cause is that all workers have to connect to master via TCP (both in Zeus init and in DataChannelTCP - look for `connect()`), and the listening socket only has a backlog of 64.

I increased the backlog to 1024, that seems like enough for reasonable purposes (the hard limit is 65535 in /proc/sys/net/core/somaxconn). There's probably a more correct way to do this that involves retries when connection is refused.

Reviewed By: soumith

Differential Revision: D9182216

fbshipit-source-id: 2f71c4995841db26c670cec344f1e3c7a80a7936
@ezyang ezyang added the merged label Jun 26, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants