
Conversation

@jjlilley jjlilley commented Oct 23, 2019

Stack from ghstack:

This change improves double pickling in a 1M-double-list
microbenchmark by roughly 40% (33 msec -> 20 msec).

The main benefit is avoiding per-byte bounds checks, so
we only bounds-check 2 times rather than 9 times.

Unpickle already does something reasonable, so there is no need to change it.

FWIW, putting the swapping logic in a separate function/lambda consistently
provided roughly 20% better results when microbenchmarking. Looking at the
objdump disassembly, gcc somehow generates better code when it's separated.
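To make the bounds-check point concrete, here is a minimal illustrative sketch, not the actual pickler code (the buffer type and function name are hypothetical): the buffer grows once per block of doubles, rather than being bounds-checked for each of the 8 bytes of every value pushed.

#include <algorithm>
#include <cstddef>
#include <vector>

// Illustrative only: append n doubles in big-endian pickle order with a
// constant number of buffer size/bounds operations per block, instead of a
// bounds-checked push per byte. Assumes a little-endian host.
void pushDoubles(std::vector<char>& buf, const double* src, std::size_t n) {
  static_assert(sizeof(double) == 8, "double length");
  std::size_t old = buf.size();
  buf.resize(old + n * sizeof(double));  // single size/bounds adjustment
  char* dst = buf.data() + old;
  for (std::size_t i = 0; i < n; ++i) {
    const char* in = reinterpret_cast<const char*>(&src[i]);
    // Pickle floats are big endian; reverse the 8 bytes of each double.
    std::reverse_copy(in, in + sizeof(double), dst + i * sizeof(double));
  }
}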

Differential Revision: [D18089481](https://our.internmc.facebook.com/intern/diff/D18089481/)
@jjlilley jjlilley requested a review from apaszke as a code owner October 23, 2019 21:58
@facebook-github-bot facebook-github-bot added the "oncall: jit" label Oct 23, 2019
jjlilley pushed a commit that referenced this pull request Oct 23, 2019

ghstack-source-id: 92494909
Pull Request resolved: #28553
@jjlilley jjlilley requested review from driazati and zdevito October 24, 2019 00:43

// Python pickle format is big endian, swap.
static_assert(sizeof(double) == 8, "double length"); // NOLINT
auto swapEndian = [](double value) -> double {
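(The diff excerpt above is truncated at the lambda header. Based on the discussion below, where the author says he chose std::reverse_copy because that's what the unpickler uses, the full lambda plausibly looked something like this sketch; assumes <algorithm> is included:)

auto swapEndian = [](double value) -> double {
  double swapped;
  const char* in = reinterpret_cast<const char*>(&value);
  std::reverse_copy(in, in + sizeof(double), reinterpret_cast<char*>(&swapped));
  return swapped;
};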
@driazati driazati (Contributor) commented Oct 24, 2019
Could you try something like this? Compiler Explorer says it gets compiled down to a single movbe instruction, and I think it looks more sensible than the lambda. If it's not better, could you add a comment to the lambda explaining why it's there?

// Context (presumably): 'in' is the double being pickled, and push<double>
// appends the swapped value to the pickle output buffer.
double flipped;
char* bytes = reinterpret_cast<char*>(&in);
char* out_bytes = reinterpret_cast<char*>(&flipped);
for (size_t i = 0; i < 8; ++i) {
    out_bytes[i] = bytes[8 - i - 1];
}
push<double>(flipped);

@jjlilley jjlilley (Author)
Thanks for looking!

  1. Using 'out_bytes[i] = bytes[8 - i - 1];' ends up unrolling/inlining to effectively the same code that std::reverse_copy() inlines to, with the same performance. I'm happy to use whichever seems clearer, but I chose std::reverse_copy because that's what the unpickler uses.

  2. The lambda was a curveball that I didn't expect, and something to be careful about when using Compiler Explorer (e.g., in your example, did you put in the downstream inlined read definition when looking at the generated code, or just the bit-flipping part? It apparently makes a big difference in how current GCC optimizes the basic blocks).

I'm not sure if there's an external version of "paste" to use, but here's a dump from 'objdump -d --source -C -r' on pickle.cpp.o, in an FB "Paste", of (1) the lambda vs. (2) the non-lambda versions of the code: P119908107

Note how the lambda version is significantly better optimized; I have no idea why GCC does this in FB @mode/opt, but perhaps with the basic blocks separated it dives deeper there. The benchmark results are both significant and consistent, e.g., when running my "Pickle1MFloats" benchmark in fbcode, it's roughly:

  • 33 msec before this change
  • 25 msec with the non-lambda version
  • 20 msec with the lambda

The lambda is simply equivalent to a static function that gets inlined away by the compiler. I had an earlier version of this diff that used a separate function; I wasn't sure which way would be perceived as more stylistically "clean" in this codebase. (If we're trying to use less "new" C++ here, I can revert to a separate static function.)

So, I can add some comments/change things; my main questions:

  1. Do you prefer std::reverse_copy or bytes[8 - i - 1]? (Same performance either way; see the small equivalence sketch after this list.)
  2. Do you prefer the lambda or a separate 'static inline swapDouble()'? (Separating it into a different basic block makes a perf difference.)
  3. Regardless, adding a one-line comment about the compiler optimizing the separate function better seems to make sense.
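
(For reference, a small standalone sketch, not from the PR, showing that the two spellings in question 1 produce identical bytes:)

#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>

// Spelling 1: std::reverse_copy over the raw bytes.
double swapViaReverseCopy(double value) {
  double out;
  const char* in = reinterpret_cast<const char*>(&value);
  std::reverse_copy(in, in + sizeof(double), reinterpret_cast<char*>(&out));
  return out;
}

// Spelling 2: explicit indexed loop, out_bytes[i] = bytes[8 - i - 1].
double swapViaIndexing(double value) {
  double out;
  const char* in = reinterpret_cast<const char*>(&value);
  char* out_bytes = reinterpret_cast<char*>(&out);
  for (std::size_t i = 0; i < sizeof(double); ++i) {
    out_bytes[i] = in[sizeof(double) - i - 1];
  }
  return out;
}

int main() {
  double a = swapViaReverseCopy(3.14159);
  double b = swapViaIndexing(3.14159);
  // Compare the raw bytes; the swapped value need not be a meaningful double.
  assert(std::memcmp(&a, &b, sizeof(double)) == 0);
  return 0;
}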

@jjlilley jjlilley (Author)

As defaults, though, I'll upload a new revision with:

  1. bytes[sizeof(double) - i - 1], to keep the #LOC diff lower
  2. a separate function, rather than a lambda

The perf of this is the same, but it might be clearer to read.
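
(A sketch of what those two defaults combine to; this is a hypothetical shape, not necessarily the exact landed code:)

#include <cstddef>

// Separate function rather than a lambda: per the benchmarks above, gcc
// optimizes the swap better when it lives in its own basic block.
static inline double swapDouble(double value) {
  double flipped;
  const char* bytes = reinterpret_cast<const char*>(&value);
  char* out_bytes = reinterpret_cast<char*>(&flipped);
  for (std::size_t i = 0; i < sizeof(double); ++i) {
    out_bytes[i] = bytes[sizeof(double) - i - 1];
  }
  return flipped;
}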

@driazati driazati (Contributor)

Nice, thanks for digging into this. I think this is good to go, then; I don't have a strong preference either way on anything, and it all seems self-contained enough.

Also, FYI: GitHub's gists are a decent alternative to Paste.

jjlilley pushed a commit that referenced this pull request Oct 24, 2019
Pull Request resolved: #28553

ghstack-source-id: 92585739
@facebook-github-bot

This pull request has been merged in e212543.

@facebook-github-bot facebook-github-bot deleted the gh/jjlilley/14/head branch October 28, 2019 22:16
thiagocrepaldi pushed a commit to thiagocrepaldi/pytorch that referenced this pull request Feb 4, 2020
Summary:
Pull Request resolved: pytorch#28553

Test Plan:
  Benchmarks:
    buck build mode/opt experimental/jeremyl/c2:SerializationBench
    buck-out/opt/gen/experimental/jeremyl/c2/SerializationBench --bm_regex=.*Float.*
  Correctness:
    buck build mode/dev-nosan caffe2/test/...

Differential Revision: D18089481

fbshipit-source-id: a5f39e5d38c432893844241a7cce244831037e1f

Labels: Merged, oncall: jit

5 participants