[pytorch] Improve float pickling speed. #28553
Conversation
This change improves double pickling in a 1M-double-list microbenchmark by roughly 40% (33 msec -> 20 msec). The main benefit is avoiding per-byte bounds checks: we now bounds-check 2 times rather than 9. Unpickling already does something reasonable, so it needs no change. FWIW, putting the swapping logic in a separate function/lambda consistently gave roughly 20% better results when microbenchmarking; looking at the objdump disassembly, GCC somehow generates better code when the logic is separated. Differential Revision: [D18089481](https://our.internmc.facebook.com/intern/diff/D18089481/)
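The "2 bounds checks rather than 9" claim refers to writing the one-byte opcode and the eight data bytes with nine separate bounds-checked pushes versus reserving space once and copying in bulk. A minimal sketch of that idea, assuming a simplified buffer (the `Buffer` type and `pushBinFloat` name are illustrative, not the actual pickler API; the host is assumed little endian):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical simplified buffer: instead of nine bounds-checked one-byte
// pushes (BINFLOAT opcode + 8 data bytes), grow the buffer once and copy
// the opcode and the byte-swapped double together.
struct Buffer {
  std::vector<char> data;

  void pushBinFloat(double value) {
    const char* src = reinterpret_cast<const char*>(&value);
    char swapped[8];
    // Pickle stores floats big endian; reverse the (little-endian) bytes.
    std::reverse_copy(src, src + 8, swapped);

    size_t old = data.size();
    data.resize(old + 9);            // one growth/bounds check, not nine
    data[old] = 'G';                 // 0x47, the pickle BINFLOAT opcode
    std::memcpy(&data[old + 1], swapped, 8);
  }
};
```

On a little-endian host, pushing `1.0` (bit pattern `0x3FF0000000000000`) yields the opcode byte followed by `3F F0 00 00 00 00 00 00`.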
torch/csrc/jit/pickler.cpp
Outdated
```cpp
// Python pickle format is big endian, swap.
static_assert(sizeof(double) == 8, "double length"); // NOLINT
auto swapEndian = [](double value) -> double {
```
Could you try with something like this? Compiler explorer says it gets compiled out to a single movbe instruction and I think it looks more sensible than the lambda. If it's not better, then could you add a comment to the lambda explaining why it's here.
```cpp
double flipped;
char* bytes = reinterpret_cast<char*>(&in);
char* out_bytes = reinterpret_cast<char*>(&flipped);
for (size_t i = 0; i < 8; ++i) {
  out_bytes[i] = bytes[8 - i - 1];
}
push<double>(flipped);
```
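For what it's worth, another way to get the single-instruction swap mentioned here is to round-trip the bits through an integer and use a compiler builtin. This is a hedged alternative sketch, not what the PR adopted; `swapDoubleBswap` is a made-up name, and `__builtin_bswap64` is GCC/Clang-specific (they lower it to a single `bswap`, or `movbe` where the target supports it):

```cpp
#include <cstdint>
#include <cstring>

// Alternative byte-swap sketch: copy the double's bits into a uint64_t,
// swap with a builtin, and copy back. The memcpy calls are optimized away.
inline double swapDoubleBswap(double value) {
  uint64_t bits;
  std::memcpy(&bits, &value, sizeof(bits));
  bits = __builtin_bswap64(bits);
  double flipped;
  std::memcpy(&flipped, &bits, sizeof(flipped));
  return flipped;
}
```

Swapping twice is the identity on the bit pattern, which makes this easy to sanity-check.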
Thanks for looking!
- Using `out_bytes[i] = bytes[8 - i - 1];` ends up unrolling/inlining to effectively the same code that `std::reverse_copy()` inlines to, with the same performance. I'm happy to use whichever seems clearer, but I chose `std::reverse_copy` because that's what the unpickler uses.
- The lambda was a curve ball that I didn't expect, and something to be careful of when using Compiler Explorer (e.g. in your example, did you include the downstream inlined read definition when looking at the generated code, or just the bit-flipping part? It apparently makes a big difference in how current GCC optimizes the basic blocks).
Not sure if there's an external version of "Paste" to use, but here's a dump from `objdump -d --source -C -r` of pickler.cpp.o in FB "Paste", for (1) the lambda vs (2) the non-lambda versions of the code: P119908107
Note how the lambda version is significantly better optimized; I have no idea why GCC does this in FB @mode/opt, but perhaps with the basic blocks separated it dives deeper there. The benchmark results are both significant and consistent, e.g. when running my "Pickle1MFloats" benchmark in fbcode it's roughly:
- 33 msec before this change
- 25 msec with the non-lambda version
- 20 msec with the lambda
The lambda is simply equivalent to a static function that gets inlined away by the compiler. I had an earlier version of this diff which had it as a separate function; I wasn't sure which way would be perceived as more stylistically "clean" in this codebase. (If we're trying to use less "new" C++ here, I can revert to a separate static function.)
So, I can add some comments/change things; my two main questions:
- Do you prefer `std::reverse_copy` or `bytes[8 - i - 1]`? (Same performance either way.)
- Do you prefer the lambda or a separate `static inline swapDouble()`? (Separating the swap into a different basic block makes a perf difference.)

Regardless, adding a one-line comment about the compiler optimizing the separate function better seems to make sense.
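The "separate static function" option from the second question might look like the following sketch, using `std::reverse_copy` as in the first option (this is illustrative, assuming the names discussed above, not the merged code):

```cpp
#include <algorithm>

// Sketch of the standalone-function variant of the byte swap. Keeping it
// out of the caller's body is what reportedly let GCC optimize it better.
static inline double swapDouble(double value) {
  static_assert(sizeof(double) == 8, "double length");
  double flipped;
  const char* src = reinterpret_cast<const char*>(&value);
  std::reverse_copy(src, src + sizeof(double),
                    reinterpret_cast<char*>(&flipped));
  return flipped;
}
```

As with any byte reversal, applying it twice returns the original value, which makes the function easy to unit-test.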
As defaults though, I'll upload a new revision with:
- `bytes[sizeof(double) - i - 1]`, to keep the LOC diff smaller
- a separate function rather than a lambda

The perf of this is the same, but it might be clearer to read.
Nice, thanks for digging into this. I think this is good to go then; I don't have a strong preference either way on anything, and it all seems self-contained enough.
Also, FYI, GitHub's gists are a decent alternative to Paste.
This pull request has been merged in e212543.
Summary: Pull Request resolved: pytorch#28553

This change improves double pickling in 1M double list microbenchmark by roughly 40% (33msec -> 20msec). The main benefit is avoiding per-byte bounds checks, so we only bounds-check 2 times rather than 9 times. Unpickle is already doing something reasonable, so no need to change. fwiw, putting the swapping logic in a separate func/lambda provided roughly 20% better results, consistently when microbenchmarking. Looking at the objdump disassembly, gcc somehow generates better code when it's separated.

ghstack-source-id: 92585739

Test Plan:
Benchmarks:
```
buck build mode/opt experimental/jeremyl/c2:SerializationBench
buck-out/opt/gen/experimental/jeremyl/c2/SerializationBench --bm_regex=.*Float.*
```
Correctness:
```
buck build mode/dev-nosan caffe2/test/...
```

Differential Revision: D18089481
fbshipit-source-id: a5f39e5d38c432893844241a7cce244831037e1f