[RPC] ProcessGroupAgent::enqueueSend and ProcessGroupAgent::enqueueRecv should handle exceptions appropriately #25516

@pritamdamania87

Description

ProcessGroupAgent::enqueueSend and ProcessGroupAgent::enqueueRecv don't handle exceptions correctly in the lambda functions they run in the thread pool. The thread pool swallows the exceptions, and the RPC ends up blocking forever because the future is never satisfied. For enqueueRecv, we should wrap the lambda body in a try-catch block and return MessageType::Exception. For enqueueSend, we can pass the future into the lambda, wrap the body in a try-catch block, and on an exception mark the future completed with MessageType::Exception.
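To make the proposed enqueueSend fix concrete, here is a minimal, self-contained sketch of the pattern. It does not use the real ProcessGroupAgent code: `Message`, `guardedTask`, and `sendOnWorker` are hypothetical stand-ins, the `MessageType` enum is a two-value illustration only, a plain `std::thread` stands in for the thread pool, and `std::promise`/`std::future` stand in for the RPC future.

```cpp
#include <exception>
#include <future>
#include <stdexcept>
#include <string>
#include <thread>
#include <utility>

// Hypothetical stand-ins for illustration; not the real RPC types.
enum class MessageType { Response, Exception };

struct Message {
  MessageType type;
  std::string payload;
};

// Runs `work` and always satisfies the promise: with a Response on success,
// or with an Exception message if `work` throws. Nothing escapes to the
// calling thread, so a waiter on the future can never block forever.
template <typename Work>
void guardedTask(std::promise<Message> promise, Work work) {
  try {
    promise.set_value(Message{MessageType::Response, work()});
  } catch (const std::exception& e) {
    promise.set_value(Message{MessageType::Exception, e.what()});
  } catch (...) {
    promise.set_value(Message{MessageType::Exception, "unknown error"});
  }
}

// Demo: hands the promise to a worker thread (standing in for the thread
// pool) and waits on the matching future, as an RPC caller would.
Message sendOnWorker(bool fail) {
  std::promise<Message> promise;
  std::future<Message> future = promise.get_future();
  std::thread worker([p = std::move(promise), fail]() mutable {
    guardedTask(std::move(p), [fail]() -> std::string {
      if (fail) {
        throw std::runtime_error("send failed");
      }
      return "ok";
    });
  });
  Message result = future.get();  // returns even when the work throws
  worker.join();
  return result;
}
```

In this sketch, `sendOnWorker(true)` returns a `Message` with type `MessageType::Exception` and payload `"send failed"` instead of hanging. The enqueueRecv case would presumably need only the try-catch half, returning the exception message directly rather than going through a future.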

cc @ezyang @gchanan @zou3519 @jerryzh168 @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @rohan-varma @xush6528

Metadata

Labels

- better-engineering (Relatively self-contained tasks for better engineering contributors)
- high priority
- module: rpc (Related to RPC, distributed autograd, RRef, and distributed optimizer)
- triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
