[pytorch] Improve float pickling speed. #28553
Conversation
This change improves double pickling in a 1M-double-list microbenchmark by roughly 40% (33 msec -> 20 msec). The main benefit is avoiding per-byte bounds checks: we now bounds-check 2 times rather than 9. Unpickling already does something reasonable, so it needs no change. FWIW, putting the swapping logic in a separate function/lambda consistently gave roughly 20% better results when microbenchmarking; looking at the objdump disassembly, GCC somehow generates better code when the logic is separated. Differential Revision: [D18089481](https://our.internmc.facebook.com/intern/diff/D18089481/)
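The "2 bounds checks rather than 9" claim refers to writing the one-byte opcode and the eight data bytes with nine separate bounds-checked pushes versus reserving space once and copying in bulk. A minimal sketch of that idea, assuming a simplified buffer (the `Buffer` type and `pushBinFloat` name are illustrative, not the actual pickler API; the host is assumed little endian):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical simplified buffer: instead of nine bounds-checked one-byte
// pushes (BINFLOAT opcode + 8 data bytes), grow the buffer once and copy
// the opcode and the byte-swapped double together.
struct Buffer {
  std::vector<char> data;

  void pushBinFloat(double value) {
    const char* src = reinterpret_cast<const char*>(&value);
    char swapped[8];
    // Pickle stores floats big endian; reverse the (little-endian) bytes.
    std::reverse_copy(src, src + 8, swapped);

    size_t old = data.size();
    data.resize(old + 9);            // one growth/bounds check, not nine
    data[old] = 'G';                 // 0x47, the pickle BINFLOAT opcode
    std::memcpy(&data[old + 1], swapped, 8);
  }
};
```

On a little-endian host, pushing `1.0` (bit pattern `0x3FF0000000000000`) yields the opcode byte followed by `3F F0 00 00 00 00 00 00`.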
torch/csrc/jit/pickler.cpp
Outdated
```cpp
// Python pickle format is big endian, swap.
static_assert(sizeof(double) == 8, "double length"); // NOLINT
auto swapEndian = [](double value) -> double {
```
Could you try with something like this? Compiler explorer says it gets compiled out to a single movbe instruction and I think it looks more sensible than the lambda. If it's not better, then could you add a comment to the lambda explaining why it's here.
```cpp
double flipped;
char* bytes = reinterpret_cast<char*>(&in);
char* out_bytes = reinterpret_cast<char*>(&flipped);
for (size_t i = 0; i < 8; ++i) {
  out_bytes[i] = bytes[8 - i - 1];
}
push<double>(flipped);
```
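For what it's worth, another way to get the single-instruction swap mentioned here is to round-trip the bits through an integer and use a compiler builtin. This is a hedged alternative sketch, not what the PR adopted; `swapDoubleBswap` is a made-up name, and `__builtin_bswap64` is GCC/Clang-specific (they lower it to a single `bswap`, or `movbe` where the target supports it):

```cpp
#include <cstdint>
#include <cstring>

// Alternative byte-swap sketch: copy the double's bits into a uint64_t,
// swap with a builtin, and copy back. The memcpy calls are optimized away.
inline double swapDoubleBswap(double value) {
  uint64_t bits;
  std::memcpy(&bits, &value, sizeof(bits));
  bits = __builtin_bswap64(bits);
  double flipped;
  std::memcpy(&flipped, &bits, sizeof(flipped));
  return flipped;
}
```

Swapping twice is the identity on the bit pattern, which makes this easy to sanity-check.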
Thanks for looking!
- Using `out_bytes[i] = bytes[8 - i - 1];` ends up unrolling/inlining to effectively the same code that `std::reverse_copy()` inlines to, with the same performance. I'm happy to use whichever seems clearer, but I chose `std::reverse_copy` because that's what the unpickler uses.
- The lambda was a curve ball that I didn't expect, and something to be careful of when using Compiler Explorer (e.g. in your example, did you include the downstream inlined read definition when looking at the generated code, or just the bit-flipping part? It apparently makes a big difference in how current GCC optimizes the basic blocks).
Not sure if there's an external version of "Paste" to use, but here's a dump from `objdump -d --source -C -r` of pickler.cpp.o in FB "Paste", for (1) the lambda vs (2) the non-lambda versions of the code: P119908107
Note how the lambda version is significantly better optimized; I have no idea why GCC does this in FB @mode/opt, but perhaps with the basic blocks separated it dives deeper there. The benchmark results are both significant and consistent, e.g. when running my "Pickle1MFloats" benchmark in fbcode it's roughly:
- 33 msec before this change
- 25 msec with the non-lambda version
- 20 msec with the lambda
The lambda is simply equivalent to a static function that gets inlined away by the compiler. I had an earlier version of this diff which had it as a separate function; I wasn't sure which way would be perceived as more stylistically "clean" in this codebase. (If we're trying to use less "new" C++ here, I can revert to a separate static function.)
So, I can add some comments/change things; my two main questions:
- Do you prefer `std::reverse_copy` or `bytes[8 - i - 1]`? (Same performance either way.)
- Do you prefer the lambda or a separate `static inline swapDouble()`? (Separating the swap into a different basic block makes a perf difference.)

Regardless, adding a one-line comment about the compiler optimizing the separate function better seems to make sense.
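The "separate static function" option from the second question might look like the following sketch, using `std::reverse_copy` as in the first option (this is illustrative, assuming the names discussed above, not the merged code):

```cpp
#include <algorithm>

// Sketch of the standalone-function variant of the byte swap. Keeping it
// out of the caller's body is what reportedly let GCC optimize it better.
static inline double swapDouble(double value) {
  static_assert(sizeof(double) == 8, "double length");
  double flipped;
  const char* src = reinterpret_cast<const char*>(&value);
  std::reverse_copy(src, src + sizeof(double),
                    reinterpret_cast<char*>(&flipped));
  return flipped;
}
```

As with any byte reversal, applying it twice returns the original value, which makes the function easy to unit-test.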
As defaults though, I'll upload a new revision with:
- `bytes[sizeof(double) - i - 1]`, to keep the LOC diff smaller
- a separate function rather than a lambda

The perf of this is the same, but it might be clearer to read.
Nice, thanks for digging into this. I think this is good to go then; I don't have a strong preference either way on anything, and it all seems self-contained enough.
Also, FYI, GitHub's gists are a decent alternative to Paste.
This pull request has been merged in e212543.
Summary: Pull Request resolved: pytorch#28553

This change improves double pickling in 1M double list microbenchmark by roughly 40% (33msec -> 20msec). The main benefit is avoiding per-byte bounds checks, so we only bounds-check 2 times rather than 9 times. Unpickle is already doing something reasonable, so no need to change. fwiw, putting the swapping logic in a separate func/lambda provided roughly 20% better results, consistently when microbenchmarking. Looking at the objdump disassembly, gcc somehow generates better code when it's separated.

ghstack-source-id: 92585739

Test Plan:
Benchmarks:
```
buck build mode/opt experimental/jeremyl/c2:SerializationBench
buck-out/opt/gen/experimental/jeremyl/c2/SerializationBench --bm_regex=.*Float.*
```
Correctness:
```
buck build mode/dev-nosan caffe2/test/...
```

Differential Revision: D18089481
fbshipit-source-id: a5f39e5d38c432893844241a7cce244831037e1f