[pytorch] process_group_agent optimizations #29324
Conversation
This change adds a few optimizations to process_group_agent:
1) Don't add an extra tensor during serialization for the message id; instead, put it in the preamble tensor. This saves maybe 15% overhead for minimal-sized RPCs (see the first sketch after this description).
2) Add a payload-only fastpath (see the second sketch after this description).
a) For very tiny messages (e.g. acks, status updates), this reduces RPC overhead by roughly 50% by removing zip-file overhead and the related setup/copying in torch::load()/torch::save().
b) If we do end up with large non-tensor payloads, this saves us ~25% in benchmarks by avoiding copying in torch::save/load.
Differential Revision: [D18352261](https://our.internmc.facebook.com/intern/diff/D18352261/)
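To make the first optimization concrete, here is a minimal C++ sketch of the idea (the helper names and the exact three-field layout are an illustration, not the PR's actual code): the message id rides in the same small int64 preamble tensor that already precedes every send, so no extra tensor has to be serialized just to carry it.

```cpp
#include <cstdint>
#include <torch/torch.h>

// Hypothetical sketch: pack size, id, and type into the one preamble
// tensor that is sent before the body, instead of serializing the
// message id as an additional tensor alongside the user tensors.
torch::Tensor buildPreamble(int64_t payloadSize, int64_t messageId, int64_t messageType) {
  // One small int64 tensor per send: [payload size, message id, message type].
  return torch::tensor({payloadSize, messageId, messageType}, torch::kInt64);
}

void parsePreamble(const torch::Tensor& preamble,
                   int64_t& payloadSize, int64_t& messageId, int64_t& messageType) {
  // The receiver reads the id straight out of the preamble; no extra
  // tensor needs to be deserialized for it.
  auto acc = preamble.accessor<int64_t, 1>();
  payloadSize = acc[0];
  messageId = acc[1];
  messageType = acc[2];
}
```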
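And a hedged sketch of the payload-only fastpath from item 2 (the WireFormat enum and serializeBody helper are hypothetical names, not code from this PR): messages that carry no tensors skip torch::save and its zip container entirely, which is where the setup and copy cost for tiny messages comes from.

```cpp
#include <sstream>
#include <string>
#include <vector>
#include <torch/torch.h>

// Hypothetical wire-format tag; in the PR this kind of flag could ride
// in the preamble, making the mode effectively free to transmit.
enum class WireFormat : int64_t { kPayloadOnly = 0, kTorchSave = 1 };

std::string serializeBody(const std::vector<char>& payload,
                          const std::vector<torch::Tensor>& tensors,
                          WireFormat& formatOut) {
  if (tensors.empty()) {
    // Fastpath: no zip archive, no extra copies in torch::save —
    // ship the raw payload bytes as-is.
    formatOut = WireFormat::kPayloadOnly;
    return std::string(payload.begin(), payload.end());
  }
  // Slowpath: messages with tensors still go through torch::save.
  formatOut = WireFormat::kTorchSave;
  std::vector<torch::Tensor> toSave = tensors;
  // Carry the payload as a byte tensor next to the user tensors.
  toSave.push_back(torch::from_blob(
      const_cast<char*>(payload.data()),
      {static_cast<int64_t>(payload.size())},
      torch::kUInt8).clone());
  std::ostringstream oss;
  torch::save(toSave, oss);
  return oss.str();
}
```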
mrshenli left a comment
Question: if we switch from torch::save/torch::load to jit pickle/unpickle, do we still need to explicitly mark the serialization mode?
If we decide that there's a better pickling mechanism, we can always get rid of this. My main goal here was that, in case different sorts of payloads have more optimal encoding schemes, we have an out-of-band (and effectively free) mechanism to specify this. Right now, if we have a minimally-sized response from an RPC (e.g. an ACK) without any tensors, we expend more effort/bytes encoding it than we need to. Being able to alternate between torch::save / jit::pickle / payload-only / whatever, depending on the input, seems like it could help some cases (see the sketch below).
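For illustration, a sketch of what that out-of-band mechanism could enable (the enum and chooser below are hypothetical, not this PR's code): the mode tag travels in the preamble for free, so adopting jit::pickle later would mean adding one enum value and one dispatch case on each side, without touching the wire framing.

```cpp
#include <cstdint>
#include <vector>
#include <torch/torch.h>

// Hypothetical serialization-mode tag carried out-of-band in the preamble.
enum class SerializationMode : int64_t {
  kPayloadOnly = 0,  // raw bytes, no container (acks, status updates)
  kTorchSave = 1,    // zip container via torch::save/torch::load
  // kJitPickle = 2, // a possible future mode via torch::jit::pickle
};

SerializationMode chooseMode(const std::vector<char>& payload,
                             const std::vector<torch::Tensor>& tensors) {
  // Tensor-free messages take the cheap path; anything carrying
  // tensors falls back to the full container format.
  return tensors.empty() ? SerializationMode::kPayloadOnly
                         : SerializationMode::kTorchSave;
}
```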
Closing this PR for a bit; there's another approach that I'd like to take...