
Make Internal RRef messages idempotent #26116

@mrshenli


#25499 does not provide fault tolerance yet. Any failure in an internal RRef message (for example, a lost request or a lost acknowledgement) would corrupt RRef reference counting. We should make internal RRef messages idempotent and retry them on failure.
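To illustrate the idea, here is a minimal sketch of idempotent handling plus retry. It is not the actual RRef implementation: `Owner`, `send_with_retry`, `make_lossy_send`, and the message IDs are hypothetical names, and the transport is simulated in-process. The owner remembers which message IDs it has already applied, so a retried message does not change the reference count a second time.

```python
class Owner:
    """Hypothetical owner-side bookkeeping for a single RRef."""

    def __init__(self):
        self.ref_count = 0
        self.seen = set()  # IDs of messages that were already applied

    def handle(self, msg_id, delta):
        # Apply each message at most once; duplicates are acknowledged only.
        if msg_id not in self.seen:
            self.seen.add(msg_id)
            self.ref_count += delta
        return "ack"


def send_with_retry(send, msg_id, delta, max_attempts=5):
    """Resend the same message (same ID) until an acknowledgement arrives."""
    for _ in range(max_attempts):
        try:
            return send(msg_id, delta)
        except ConnectionError:
            continue  # safe to retry because the handler is idempotent
    raise ConnectionError(f"no ack for {msg_id} after {max_attempts} attempts")


def make_lossy_send(owner, lost_acks):
    """Simulated transport: the message is delivered, but the first
    `lost_acks` acknowledgements are dropped on the way back."""
    remaining = [lost_acks]

    def send(msg_id, delta):
        reply = owner.handle(msg_id, delta)
        if remaining[0] > 0:
            remaining[0] -= 1
            raise ConnectionError("ack lost")
        return reply

    return send


owner = Owner()
send = make_lossy_send(owner, lost_acks=2)
send_with_retry(send, msg_id="add-fork-1", delta=+1)  # delivered three times
send_with_retry(send, msg_id="del-fork-1", delta=-1)  # delivered once
```

Without the `seen` check, the two lost acknowledgements would cause the first message to be counted three times. A real implementation would also need to bound or garbage-collect the set of seen IDs, which this sketch leaves out.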

cc @ezyang @gchanan @zou3519 @jerryzh168 @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @xush6528

Metadata


Labels

better-engineering (Relatively self-contained tasks for better engineering contributors)
high priority
module: rpc (Related to RPC, distributed autograd, RRef, and distributed optimizer)
triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

