Skip to content

Conversation

@d4l3k
Copy link
Member

@d4l3k d4l3k commented Jun 14, 2024

This adds better logging of errors to the socket and TCPStore classes.

All socket operations should now include the local and remote addresses and we actually log errors from the TCPStoreBackend::run as well as TCPStoreBackendUV which were previously INFO messages and not actually logged.

It also overhauls test_wait in test_store.py as it had a race condition causing it to be flaky.

Test plan:

python test/distributed/test_store.py

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang

@d4l3k d4l3k requested review from c-p-i-o and kurman June 14, 2024 01:13
@pytorch-bot
Copy link

pytorch-bot bot commented Jun 14, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/128673

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit c3fc85f with merge base 5a80d2d (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category labels Jun 14, 2024
@d4l3k d4l3k force-pushed the d4l3k/socket_logging branch from 520eb58 to c3fc85f Compare June 14, 2024 18:46
@d4l3k
Copy link
Member Author

d4l3k commented Jun 14, 2024

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Jun 14, 2024
@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@d4l3k d4l3k deleted the d4l3k/socket_logging branch June 14, 2024 23:08
Comment on lines +237 to 241
C10_THROW_ERROR(
DistNetworkError,
fmt::format(
"failed to format addr, unknown family={}", addr.ai_family));
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're seeing

failed to format addr, unknown family=0

in a job which would otherwise run fine if we disable logging by TORCH_CPP_LOG_LEVEL=WARNING.

This is probably a bug - we shall not assume we're formatting a VALID addrinfo. When connection is being established there might be invalid addrinfo to be logged.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have not identified exactly what causes the invalid addrinfo to appear, but I'm sending https://github.com/pytorch/pytorch/pull/137745/files and I believe it is a good fix regardless. In general, formatter should avoid crashing as much as possible.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/trunk Trigger trunk jobs on your pull request Merged oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants