[pytorch] Pickler: convert std::stringstream cases. #29351
driazati left a comment:
The changes look fine, but I'm apprehensive about increasing the complexity of Device for something like this.
```cpp
  return type_ == DeviceType::CPU;
}

std::string str() const;
```
Adding a function (which is so similar to Device's operator<<) to something as central as Device for this case seems like overkill. I don't know if it's something we really need unless serialization time is a real bottleneck somewhere.
Thanks for the feedback.
Just checking: is the push-back related to an additional function in Device, or the duplicated code? The duplicated code aspect can be addressed by implementing operator<< in terms of str(), which I can do if that's the main objection.
fwiw, the benchmark I was looking at was torch::save() on a single 64-float tensor (as we might expect in a forward pass on a PS in DPER/etc), which is slower than it should be: 13µs total, or about 200ns amortized per float (most of it is fixed cost; the marginal cost beyond that is roughly 50ns/float).
Given the performance goals, I was hoping to cull easy wins like this from torch::save(). Generally std::stringstream has a high constructor cost (thanks to dynamic_cast<>/etc. internally), which shows up in the pprof profiles. Given how widespread and convenient std::stringstream is, it doesn't make sense to eliminate it altogether, but it probably shouldn't be in any of the "hot" paths.
I am starting to wonder if it might make sense for the ThriftRpcAgent et al. to bypass torch::save()/load() for some specific cases (e.g. contiguous float CPU-device tensors), given the Python-specific overhead, but mostly I'm trying to figure out if there are good ways to squeeze costs out of the current torch::save/load codepaths first. Anyway...
If you don't think this is architecturally appropriate, that's probably ok, though if there's an alternate way of expressing this logic that would be reasonable, I'm happy to change it. That said, we're running out of super low-hanging fruit for improving serialization speed while keeping the existing format (though some of the serialization benchmarks are 5x faster than a month ago, if you include the FB-only crc diff).
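For concreteness, the kind of stringstream-avoiding substitution described above looks roughly like the following sketch (the function names are illustrative, not the actual Pickler code):

```cpp
#include <sstream>
#include <string>

// Before: pays the std::stringstream constructor cost (locale setup,
// virtual machinery) on every call, plus a copy out of the rdbuf.
std::string formatKeySlow(const std::string& name, int index) {
  std::stringstream ss;
  ss << name << "/" << index;
  return ss.str();
}

// After: builds the string directly; no stream construction at all.
std::string formatKeyFast(const std::string& name, int index) {
  std::string out;
  out.reserve(name.size() + 12);  // rough upper bound for "/<index>"
  out += name;
  out += '/';
  out += std::to_string(index);
  return out;
}
```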
I think it is fine to just have operator<< for Device use str(); that way we do not have divergence. Just adding str() seems ok to me. It is disappointing that the idiomatic C++ string formatting is so slow.
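A minimal sketch of what that suggestion might look like (the Device shown here is pared down for illustration; the real c10::Device handles more device types and indices):

```cpp
#include <ostream>
#include <string>

// Hypothetical, simplified Device illustrating the suggestion:
// operator<< delegates to str(), so the two can never diverge.
struct Device {
  enum class Type { CPU, CUDA };
  Type type_;
  int index_;  // -1 means "no index"

  // Builds the string directly, with no std::stringstream involved.
  std::string str() const {
    std::string out = (type_ == Type::CPU) ? "cpu" : "cuda";
    if (index_ >= 0) {
      out.push_back(':');
      out += std::to_string(index_);
    }
    return out;
  }
};

// operator<< defined in terms of str(), as discussed above.
inline std::ostream& operator<<(std::ostream& os, const Device& d) {
  return os << d.str();
}
```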
zdevito left a comment:
Let's define operator<< in terms of str(), but after that this looks good.
Thanks for looking; uploading a version that uses device.str() for the relevant operator<<. Until this change and running microbenchmarks, I wasn't aware of the std::stringstream perf issues. It looks like a good chunk of the lag is related to C++ locale support (e.g. most of the basic_ostream constructor time is spent under basic_ios::_M_cache_locale), which is necessary to meet the C++ spec, but something we can sidestep, because these ops don't directly need locale support; they just want to concat some bits for another machine. We also have to effectively re-copy the bits out from the internal rdbuf, but for small strings the constructor is the dominating cost.
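A rough way to observe that constructor cost in isolation is a loop like the one below (a minimal sketch; absolute numbers vary by platform and standard library):

```cpp
#include <chrono>
#include <cstdio>
#include <sstream>

int main() {
  constexpr int kIters = 1'000'000;
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < kIters; ++i) {
    // Construction alone triggers the locale caching described above
    // (basic_ios::_M_cache_locale in libstdc++), before any insertion.
    std::stringstream ss;
    (void)ss;
  }
  auto elapsed = std::chrono::steady_clock::now() - start;
  auto ns =
      std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed).count();
  std::printf("%.1f ns per std::stringstream construction\n",
              static_cast<double>(ns) / kIters);
}
```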
This pull request has been merged in abf55eb.
Stack from ghstack:
When torch::save()ing a smallish tensor, we spend ~5% of the time
still in std::stringstream constructors.
This removes the last couple of cases. Benchmark shows ~5% improvement:
TorchSaveSmallTensor Pre: 13.12us
TorchSaveSmallTensor Post: 12.48us
Differential Revision: D18365066
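For reference, a standalone microbenchmark in this spirit might look like the sketch below. This is not the internal TorchSaveSmallTensor harness, just an approximation against the public C++ frontend API:

```cpp
#include <torch/torch.h>

#include <chrono>
#include <cstdio>
#include <sstream>

int main() {
  // A "smallish" tensor: 64 floats, as discussed above.
  auto t = torch::ones({64}, torch::kFloat32);

  constexpr int kIters = 10000;
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < kIters; ++i) {
    // Serialize into a fresh in-memory stream each iteration. Note the
    // per-iteration ostringstream construction itself contributes the
    // very overhead under discussion.
    std::ostringstream out;
    torch::save(t, out);
  }
  auto elapsed = std::chrono::steady_clock::now() - start;
  auto us =
      std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count();
  std::printf("%.2f us per torch::save of a 64-float tensor\n",
              static_cast<double>(us) / kIters);
}
```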