Conversation

@jjlilley commented Nov 7, 2019

Stack from ghstack:

When torch::save()ing a smallish tensor, we spend ~5% of the time
still in std::stringstream constructors.

This removes the last couple of cases. Benchmark shows ~5% improvement:
TorchSaveSmallTensor Pre: 13.12us
TorchSaveSmallTensor Post: 12.48us

Differential Revision: D18365066
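
A minimal sketch of the kind of harness behind these numbers (the actual TorchSaveSmallTensor benchmark isn't reproduced here; the tensor shape, iteration count, and use of torch::save's std::ostream overload are assumptions):

```cpp
// Hypothetical reconstruction of a TorchSaveSmallTensor-style harness;
// the real benchmark, tensor shape, and iteration count are assumptions.
#include <torch/torch.h>

#include <chrono>
#include <iostream>
#include <sstream>

int main() {
  const auto tensor = torch::randn({64});  // "smallish" tensor, per the description
  constexpr int kIters = 10000;

  const auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < kIters; ++i) {
    std::ostringstream out;
    torch::save(tensor, out);  // serialize into an in-memory stream
  }
  const auto end = std::chrono::steady_clock::now();

  const auto us =
      std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
  std::cout << "per-save: " << static_cast<double>(us) / kIters << "us\n";
}
```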

@jjlilley requested a review from apaszke as a code owner November 7, 2019 01:42
@facebook-github-bot added the oncall: jit label Nov 7, 2019
jjlilley pushed a commit that referenced this pull request Nov 7, 2019 (ghstack-source-id: 93422568, Pull Request resolved: #29351)
@jjlilley requested review from driazati and zdevito November 7, 2019 03:42
@driazati (Contributor) left a comment:
The changes look fine, but I'm apprehensive about increasing the complexity of Device for something like this.

return type_ == DeviceType::CPU;
}

std::string str() const;
Contributor:
Adding a function (which is so similar to Device's operator<<) to something as central as Device for this case seems like overkill; I don't know if it's something we really need unless serialization time is a real bottleneck somewhere.

Author (@jjlilley):

Thanks for the feedback.

Just checking: is the pushback about the additional function on Device, or about the duplicated code? The duplication can be addressed by implementing operator<< in terms of str(), which I can do if that's the main objection.

FWIW, the benchmark I was looking at was torch::save() on a single 64-float tensor (as we might expect in a forward pass on a PS in DPER, etc.), which is slower than it should be: ~13us total, or about 200ns amortized per float (most of it is fixed cost; the marginal cost beyond that is roughly 50ns per float).

Given the performance goals, I was hoping to cull easy wins like this from torch::save(). std::stringstream generally has a high constructor cost (thanks to dynamic_cast<>, etc. internally), which shows up in the pprof profiles. Given how widespread and convenient std::stringstream is, it doesn't make sense to eliminate it altogether, but it probably shouldn't be in any of the "hot" paths.
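
For a sense of scale, a standalone toy comparison along these lines (the numbers and exact pattern are illustrative, not from the PR's benchmark):

```cpp
// Toy comparison of std::stringstream construction + use vs. plain
// std::string concatenation; absolute numbers are platform-dependent.
#include <chrono>
#include <iostream>
#include <sstream>
#include <string>

int main() {
  constexpr int kIters = 1000000;

  const auto t0 = std::chrono::steady_clock::now();
  for (int i = 0; i < kIters; ++i) {
    std::ostringstream ss;  // constructor caches locale facets, etc.
    ss << "cpu:" << (i & 7);
    volatile std::size_t len = ss.str().size();  // keep the work observable
    (void)len;
  }
  const auto t1 = std::chrono::steady_clock::now();
  for (int i = 0; i < kIters; ++i) {
    std::string s = "cpu:";  // no stream, no locale machinery
    s.append(std::to_string(i & 7));
    volatile std::size_t len = s.size();
    (void)len;
  }
  const auto t2 = std::chrono::steady_clock::now();

  const auto per_iter_ns = [&](auto a, auto b) {
    return std::chrono::duration_cast<std::chrono::nanoseconds>(b - a).count() /
        kIters;
  };
  std::cout << "stringstream: " << per_iter_ns(t0, t1) << " ns/iter\n"
            << "string:       " << per_iter_ns(t1, t2) << " ns/iter\n";
}
```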

I am starting to wonder if it might make sense for ThriftRpcAgent et al. to bypass torch::save()/load() for some specific cases (e.g. contiguous float cpu-device tensors), given the Python-specific overhead, but mostly I'm trying to figure out whether there are good ways to squeeze costs out of the current torch::save()/load() codepaths first.
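
To make that bypass idea concrete, a rough sketch (hypothetical helper, not part of this PR):

```cpp
// Hypothetical fast path for the case above: a contiguous float32 CPU
// tensor copied out as raw bytes, bypassing torch::save() entirely.
// Not part of this PR; it ignores shape/dtype framing, versioning,
// and endianness, which a real wire format would need.
#include <torch/torch.h>

#include <cstring>
#include <vector>

std::vector<char> serialize_raw(const torch::Tensor& t) {
  TORCH_CHECK(
      t.device().is_cpu() && t.is_contiguous() &&
          t.scalar_type() == torch::kFloat,
      "fast path only handles contiguous float CPU tensors");
  std::vector<char> buf(t.numel() * sizeof(float));
  std::memcpy(buf.data(), t.data_ptr<float>(), buf.size());
  return buf;  // the caller would still need to ship t.sizes() separately
}
```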

If you don't think this is architecturally appropriate, that's probably OK, though if there's an alternate way of expressing this logic that would be reasonable, I'm happy to change it! That said, we're running out of super-low-hanging fruit for improving serialization speed while keeping the existing format (though some of the serialization benchmarks are 5x faster than a month ago, if you include the FB-only CRC diff).

Contributor:
I think it is fine to just have operator<< for Device use str(); that way we do not have divergence. Just adding str() seems OK to me. It is disappointing that the idiomatic C++ string formatting is so slow.
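
Something like the following pattern (a standalone mock for illustration; the real class is c10::Device):

```cpp
// Sketch of the pattern being agreed on: build the string once in str()
// and define operator<< in terms of it, so the two can never diverge.
// (Standalone mock; member names here are illustrative.)
#include <iostream>
#include <string>

struct Device {
  std::string type_name;  // e.g. "cpu", "cuda"
  int index = -1;         // -1 means "no index"

  bool has_index() const { return index >= 0; }

  std::string str() const {
    std::string s = type_name;
    if (has_index()) {
      s.push_back(':');
      s.append(std::to_string(index));
    }
    return s;
  }
};

// operator<< delegates to str(): one source of truth for formatting.
std::ostream& operator<<(std::ostream& stream, const Device& device) {
  return stream << device.str();
}

int main() {
  std::cout << Device{"cuda", 1} << "\n";  // prints "cuda:1"
}
```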

@zdevito (Contributor) left a comment:
Let's define operator<< in terms of str(), but after that this looks good.

@jjlilley (Author) commented Nov 7, 2019

Thanks for looking; uploading a version that uses device.str() for the relevant operator<<.

Until working on this and running microbenchmarks, I wasn't aware of std::stringstream's perf issues. It looks like a good chunk of the lag is related to C++ locale support (e.g. most of the basic_ostream constructor time is spent under basic_ios::_M_cache_locale), which is necessary to meet the C++ spec, but something we can sidestep here, because these ops don't directly need locale support; they just want to concat some bits for another machine.

We also have to effectively re-copy the bits out of the internal rdbuf, but for small strings the constructor is the dominant cost.
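
For hot paths where a stream is genuinely unavoidable, one common mitigation (not what this PR does; just an illustration) is to reuse a thread-local stream so the construction/locale cost is paid once:

```cpp
// Common mitigation when a stream is unavoidable on a hot path (not what
// this PR does): construct the stringstream once per thread and reset it
// between uses, paying the locale-caching cost only on first use.
#include <sstream>
#include <string>

std::string format_once(int index) {
  thread_local std::ostringstream ss;
  ss.str("");   // clear accumulated contents
  ss.clear();   // reset error/eof flags
  ss << "cpu:" << index;
  return ss.str();  // still copies out of the internal rdbuf
}
```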

jjlilley pushed a commit that referenced this pull request Nov 7, 2019 (ghstack-source-id: 93490404, Pull Request resolved: #29351)
jjlilley pushed a commit that referenced this pull request Nov 8, 2019 (ghstack-source-id: 93512172, Pull Request resolved: #29351)
jjlilley pushed a commit that referenced this pull request Nov 8, 2019 (ghstack-source-id: 93517928, Pull Request resolved: #29351)
@facebook-github-bot: This pull request has been merged in abf55eb.

@facebook-github-bot deleted the gh/jjlilley/22/head branch November 12, 2019 15:17