[pytorch] Pickler: convert std::stringstream cases. #29351
driazati left a comment:
The changes look fine, but I'm apprehensive about increasing the complexity of Device for something like this.
```cpp
  return type_ == DeviceType::CPU;
}

std::string str() const;
```
Adding a function (which is so similar to Device's operator<<) to something as central as Device for this case seems like overkill. I don't know if it's something we really need unless serialization time is a real bottleneck somewhere.
Thanks for the feedback.
Just checking: is the push-back related to an additional function in Device, or the duplicated code? The duplicated code aspect can be addressed by implementing operator<< in terms of str(), which I can do if that's the main objection.
fwiw, the benchmark I was looking at was torch::save() on a single 64-float tensor (as we might expect in a forward pass on a PS in DPER/etc), which is slower than it should be: 13µs total, or about 200ns amortized per float (most of it is fixed cost; the marginal cost beyond that is roughly 50ns/float).
Given the performance goals, I was hoping to cull easy wins like this from torch::save(). Generally std::stringstream has a high constructor cost (thanks to dynamic_cast<>/etc. internally), which shows up in the pprof profiles. Given how widespread and convenient std::stringstream is, it doesn't make sense to eliminate it altogether, but it probably shouldn't be in any of the "hot" paths.
I am starting to wonder if it might make sense for the ThriftRpcAgent et al. to bypass torch::save()/load() for some specific cases (e.g. contiguous float CPU-device tensors), given the Python-specific overhead, but mostly I'm trying to figure out if there are good ways to squeeze costs out of the current torch::save/load codepaths first. Anyway...
If you don't think this is architecturally appropriate, that's probably ok, though if there's an alternate way of expressing this logic that would be reasonable, I'm happy to change it. That said, we're running out of super low-hanging fruit for improving serialization speed while keeping the existing format (though some of the serialization benchmarks are 5x faster than a month ago, if you include the FB-only crc diff).
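For concreteness, the kind of stringstream-avoiding substitution described above looks roughly like the following sketch (the function names are illustrative, not the actual Pickler code):

```cpp
#include <sstream>
#include <string>

// Before: pays the std::stringstream constructor cost (locale setup,
// virtual machinery) on every call, plus a copy out of the rdbuf.
std::string formatKeySlow(const std::string& name, int index) {
  std::stringstream ss;
  ss << name << "/" << index;
  return ss.str();
}

// After: builds the string directly; no stream construction at all.
std::string formatKeyFast(const std::string& name, int index) {
  std::string out;
  out.reserve(name.size() + 12);  // rough upper bound for "/<index>"
  out += name;
  out += '/';
  out += std::to_string(index);
  return out;
}
```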
I think it is fine to just have operator<< for Device use str(); that way we do not have divergence. Just adding str() seems ok to me. It is disappointing that the idiomatic C++ string formatting is so slow.
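A minimal sketch of what that suggestion might look like (the Device shown here is pared down for illustration; the real c10::Device handles more device types and indices):

```cpp
#include <ostream>
#include <string>

// Hypothetical, simplified Device illustrating the suggestion:
// operator<< delegates to str(), so the two can never diverge.
struct Device {
  enum class Type { CPU, CUDA };
  Type type_;
  int index_;  // -1 means "no index"

  // Builds the string directly, with no std::stringstream involved.
  std::string str() const {
    std::string out = (type_ == Type::CPU) ? "cpu" : "cuda";
    if (index_ >= 0) {
      out.push_back(':');
      out += std::to_string(index_);
    }
    return out;
  }
};

// operator<< defined in terms of str(), as discussed above.
inline std::ostream& operator<<(std::ostream& os, const Device& d) {
  return os << d.str();
}
```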
zdevito left a comment:
Let's define operator<< in terms of str(), but after that this looks good.
Thanks for looking; uploading a version that uses device.str() for the relevant operator<<. Until this change and running microbenchmarks, I wasn't aware of the std::stringstream perf issues. It looks like a good chunk of the lag is related to C++ locale support (e.g. most of the basic_ostream constructor time is spent under basic_ios::_M_cache_locale), which is necessary to meet the C++ spec, but something we can sidestep, because these ops don't directly need locale support; they just want to concat some bits for another machine. We also have to effectively re-copy the bits out from the internal rdbuf, but for small strings the constructor is the dominating cost.
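A rough way to observe that constructor cost in isolation is a loop like the one below (a minimal sketch; absolute numbers vary by platform and standard library):

```cpp
#include <chrono>
#include <cstdio>
#include <sstream>

int main() {
  constexpr int kIters = 1'000'000;
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < kIters; ++i) {
    // Construction alone triggers the locale caching described above
    // (basic_ios::_M_cache_locale in libstdc++), before any insertion.
    std::stringstream ss;
    (void)ss;
  }
  auto elapsed = std::chrono::steady_clock::now() - start;
  auto ns =
      std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed).count();
  std::printf("%.1f ns per std::stringstream construction\n",
              static_cast<double>(ns) / kIters);
}
```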
This pull request has been merged in abf55eb.
Stack from ghstack:
When torch::save()ing a smallish tensor, we spend ~5% of the time
still in std::stringstream constructors.
This removes the last couple of cases. Benchmark shows ~5% improvement:
TorchSaveSmallTensor Pre: 13.12us
TorchSaveSmallTensor Post: 12.48us
Differential Revision: D18365066
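For reference, a standalone microbenchmark in this spirit might look like the sketch below. This is not the internal TorchSaveSmallTensor harness, just an approximation against the public C++ frontend API:

```cpp
#include <torch/torch.h>

#include <chrono>
#include <cstdio>
#include <sstream>

int main() {
  // A "smallish" tensor: 64 floats, as discussed above.
  auto t = torch::ones({64}, torch::kFloat32);

  constexpr int kIters = 10000;
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < kIters; ++i) {
    // Serialize into a fresh in-memory stream each iteration. Note the
    // per-iteration ostringstream construction itself contributes the
    // very overhead under discussion.
    std::ostringstream out;
    torch::save(t, out);
  }
  auto elapsed = std::chrono::steady_clock::now() - start;
  auto us =
      std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count();
  std::printf("%.2f us per torch::save of a 64-float tensor\n",
              static_cast<double>(us) / kIters);
}
```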