[pytorch] process_group_agent optimizations #29324
Conversation
This change adds a few optimizations to process_group_agent:
1) Don't add an extra tensor during serialization for the message id; instead, put it in the preamble tensor. This saves maybe 15% overhead for minimal-sized RPCs (see the first sketch after this description).
2) Add a payload-only fastpath (see the second sketch after this description).
a) For very tiny messages (e.g. acks, status updates), this reduces RPC overhead by roughly 50% by removing zip-file overhead and the related setup/copying in torch::load()/torch::save().
b) If we do end up with large non-tensor payloads, this saves us ~25% in benchmarks by avoiding copying in torch::save/load.
Differential Revision: [D18352261](https://our.internmc.facebook.com/intern/diff/D18352261/)
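To make the first optimization concrete, here is a minimal C++ sketch of the idea (the helper names and the exact three-field layout are an illustration, not the PR's actual code): the message id rides in the same small int64 preamble tensor that already precedes every send, so no extra tensor has to be serialized just to carry it.

```cpp
#include <cstdint>
#include <torch/torch.h>

// Hypothetical sketch: pack size, id, and type into the one preamble
// tensor that is sent before the body, instead of serializing the
// message id as an additional tensor alongside the user tensors.
torch::Tensor buildPreamble(int64_t payloadSize, int64_t messageId, int64_t messageType) {
  // One small int64 tensor per send: [payload size, message id, message type].
  return torch::tensor({payloadSize, messageId, messageType}, torch::kInt64);
}

void parsePreamble(const torch::Tensor& preamble,
                   int64_t& payloadSize, int64_t& messageId, int64_t& messageType) {
  // The receiver reads the id straight out of the preamble; no extra
  // tensor needs to be deserialized for it.
  auto acc = preamble.accessor<int64_t, 1>();
  payloadSize = acc[0];
  messageId = acc[1];
  messageType = acc[2];
}
```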
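And a hedged sketch of the payload-only fastpath from item 2 (the WireFormat enum and serializeBody helper are hypothetical names, not code from this PR): messages that carry no tensors skip torch::save and its zip container entirely, which is where the setup and copy cost for tiny messages comes from.

```cpp
#include <sstream>
#include <string>
#include <vector>
#include <torch/torch.h>

// Hypothetical wire-format tag; in the PR this kind of flag could ride
// in the preamble, making the mode effectively free to transmit.
enum class WireFormat : int64_t { kPayloadOnly = 0, kTorchSave = 1 };

std::string serializeBody(const std::vector<char>& payload,
                          const std::vector<torch::Tensor>& tensors,
                          WireFormat& formatOut) {
  if (tensors.empty()) {
    // Fastpath: no zip archive, no extra copies in torch::save —
    // ship the raw payload bytes as-is.
    formatOut = WireFormat::kPayloadOnly;
    return std::string(payload.begin(), payload.end());
  }
  // Slowpath: messages with tensors still go through torch::save.
  formatOut = WireFormat::kTorchSave;
  std::vector<torch::Tensor> toSave = tensors;
  // Carry the payload as a byte tensor next to the user tensors.
  toSave.push_back(torch::from_blob(
      const_cast<char*>(payload.data()),
      {static_cast<int64_t>(payload.size())},
      torch::kUInt8).clone());
  std::ostringstream oss;
  torch::save(toSave, oss);
  return oss.str();
}
```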
mrshenli left a comment
Question: if we switch from torch::save/torch::load to jit pickle/unpickle, do we still need to explicitly mark the serialization mode?
If we decide that there's a better pickling mechanism, we can always get rid of this. My main goal here was that, in case different sorts of payloads have more optimal encoding schemes, we have an out-of-band (and effectively free) mechanism to specify this. Right now, if we have a minimally-sized response from an RPC (e.g. an ACK) without any tensors, we expend more effort/bytes encoding it than we need to. Being able to alternate between torch::save / jit::pickle / payload-only / whatever, depending on the input, seems like it could help some cases (see the sketch below).
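For illustration, a sketch of what that out-of-band mechanism could enable (the enum and chooser below are hypothetical, not this PR's code): the mode tag travels in the preamble for free, so adopting jit::pickle later would mean adding one enum value and one dispatch case on each side, without touching the wire framing.

```cpp
#include <cstdint>
#include <vector>
#include <torch/torch.h>

// Hypothetical serialization-mode tag carried out-of-band in the preamble.
enum class SerializationMode : int64_t {
  kPayloadOnly = 0,  // raw bytes, no container (acks, status updates)
  kTorchSave = 1,    // zip container via torch::save/torch::load
  // kJitPickle = 2, // a possible future mode via torch::jit::pickle
};

SerializationMode chooseMode(const std::vector<char>& payload,
                             const std::vector<torch::Tensor>& tensors) {
  // Tensor-free messages take the cheap path; anything carrying
  // tensors falls back to the full container format.
  return tensors.empty() ? SerializationMode::kPayloadOnly
                         : SerializationMode::kTorchSave;
}
```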
Closing this PR for a bit; there's another approach that I'd like to take...