ARROW-2649: [C++] Add GenerateBits() function to improve bitmap writing performance #2093
pitrou wants to merge 1 commit into apache:master
Conversation
Also a GenerateBitsUnrolled() for higher performance where warranted.

Benchmarks:
- GenerateBits is 1.8x faster than BitmapWriter
- GenerateBitsUnrolled is 2.9x faster than BitmapWriter
- BooleanBuilder is now 3x faster than with BitmapWriter (and around 9x faster than it was with SetBitTo calls)
Codecov Report
```diff
@@            Coverage Diff             @@
##           master    #2093      +/-   ##
==========================================
+ Coverage   86.35%   86.36%    +0.01%
==========================================
  Files         230      230
  Lines       40392    40452       +60
==========================================
+ Hits        34880    34937       +57
- Misses       5512     5515        +3
```
wesm left a comment:
+1, this is awesome! I'm OK to merge whenever you're satisfied with the code enough to remove the WIP
```cpp
bit_writer.Finish();
int64_t i = 0;
internal::GenerateBitsUnrolled(raw_data_, length_, length,
                               [values, &i]() -> bool { return values[i++] != 0; });
```
I have often wondered whether lambda functions have much overhead vs. inlined functions; is there a good reference on how the various compilers behave?
A bit of Googling suggests that in instances like this (where the type of the lambda is a template argument), the lambda will be inlined (https://www.quora.com/Are-C++-lambda-functions-always-inlined). If you passed a lambda into a function accepting a std::function of some kind, it wouldn't necessarily be.
Yes, apparently the recommended idiom is to let the callable argument be a template parameter so as to select a favorable specialization.
```cpp
int64_t remaining_bytes = remaining / 8;
while (remaining_bytes-- > 0) {
  current_byte = 0;
  current_byte = g() ? current_byte | 0x01 : current_byte;
```
Out of curiosity, would `current_byte = current_byte | (0x01 * static_cast<uint8_t>(g()))` have any better performance (by avoiding a branch)? I guess it's possible the compiler is doing some kind of optimization anyway.
I've tried it quickly and, while the BooleanBuilder benchmark isn't affected, the bit-util microbenchmark became 2x faster. I'm wondering whether, in this trivial case, the whole thing is SIMDed by the compiler. I should take a closer look.
(This is with gcc 4.9 on an AMD Ryzen.)
+1, merging this. We can do further performance explorations in follow-up patches.