ARROW-9029: [C++] Implement BitBlockCounter for much faster block popcounts of bitmaps #7346
wesm wants to merge 6 commits into apache:master from
Conversation
@cyb70289 since you've been doing bitmap stuff lately I would welcome your code review of this
emkornfield left a comment
Mostly changes around naming which I think will make the semantics of the class clearer and reading of its use easier.
This is likely to be a common pattern; you might consider exposing it via either callbacks on the class or an adapter.
Yes, I think this can be refined as it relates to kernel execution
cpp/src/arrow/util/bit_util.cc (Outdated)
will this have different semantics on big endian?
This shift_word helper is used in other functions -- the unit tests pass on big endian, it seems.
I added some more tests with random data
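For context, a shift_word-style helper typically stitches two adjacent 64-bit words together starting at a bit offset. The sketch below is hypothetical (the name `ShiftWord` and exact semantics of Arrow's internal helper are assumptions, not Arrow's code); it assumes the LSB-first bit numbering that Arrow validity bitmaps use:

```cpp
#include <cstdint>

// Hypothetical sketch of what a shift_word-style helper does: combine two
// adjacent 64-bit words so the result begins at bit `offset` of `current`,
// assuming LSB-first bit numbering.
static inline uint64_t ShiftWord(uint64_t current, uint64_t next, int64_t offset) {
  if (offset == 0) {
    return current;  // guard: `next << 64` would be undefined behavior
  }
  return (current >> offset) | (next << (64 - offset));
}
```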
Thanks for the comments, I think this is much better now
class ARROW_EXPORT BitBlockCounter {
 public:
  struct Block {
    int16_t length;
Did you check if this affects benchmarks? I think popcount returns ints, so there might be some minuscule advantage to keeping it at a larger size.
It didn't seem to affect them
emkornfield left a comment
Thanks, looks reasonable to me.
total_popcount += __builtin_popcountll(shift_word(current, next, offset_));
current = next;
next = load_word(bitmap_ + 32);
total_popcount += __builtin_popcountll(shift_word(current, next, offset_));
I'm okay with the implementation.
Just thinking whether it would be faster to drop "shift_word" with the steps below:
- simply accumulate the total popcount of 4 contiguous uint64 words
- mask the first byte with an offset bitmask, count its pops, and subtract them from the total
- mask the byte after the last uint64 with the offset bitmask, count its pops, and add them to the total
I think we can leave deeper optimization as a follow-up task, and I guess we can handle larger blocks more efficiently with SIMD.
Breaking large blocks into smaller chunks to leverage non-null processing is a great idea.
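A rough sketch of that suggestion, with hypothetical names (this is not Arrow's code): popcount four contiguous uint64 words loaded starting at the byte containing the offset, then fix up the two boundary bytes with masks, assuming LSB-first bit order:

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical sketch of the mask-based idea above (not Arrow's actual
// implementation): count four contiguous 64-bit words directly instead of
// shifting every word, then correct at the boundaries.
int64_t MaskedBlockPopcount(const uint8_t* bitmap, int64_t offset_bits) {
  const uint8_t* base = bitmap + offset_bits / 8;
  const int64_t bit_offset = offset_bits % 8;
  uint64_t words[4];
  std::memcpy(words, base, sizeof(words));  // byte-order-neutral for popcount
  int64_t total = 0;
  for (uint64_t w : words) {
    total += __builtin_popcountll(w);
  }
  if (bit_offset != 0) {
    const uint8_t low_mask = static_cast<uint8_t>((1u << bit_offset) - 1);
    // Subtract the bits that precede the offset in the first byte...
    total -= __builtin_popcountll(base[0] & low_mask);
    // ...and add the bits the 256-bit window extends into the next byte.
    total += __builtin_popcountll(base[32] & low_mask);
  }
  return total;
}
```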
This raises an interesting question: should this be tied in with the existing popcount methods, with the improvements applied in both places?
Sounds good to tackle this as a follow-up, but this is a good starting point in any case.
if (block.length == 0) {
  break;
}
if (block.length == block.popcount) {
So basically you don't need a popcount at all for this optimization. You just need to check whether all bits in the block are 1, or whether all bits in the block are 0. This should be trivial (and could perhaps be folded into VisitBits and/or VisitArrayDataInline).
I need the popcount in order to calculate the number of "on" elements for filters, but yes for this particular optimization between "all-on" and "not-all-on" you don't need the popcount.
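To illustrate the distinction with hypothetical helpers (these names are not Arrow's API): the all-set test needs only a comparison against an all-ones word, while computing the number of selected elements for a filter genuinely needs the count:

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical: "are all 64 bits set?" needs only a comparison...
bool WordAllSet(const uint8_t* bytes) {
  uint64_t w;
  std::memcpy(&w, bytes, sizeof(w));
  return w == ~uint64_t(0);
}

// ...whereas filters need the actual number of "on" bits.
int64_t WordPopcount(const uint8_t* bytes) {
  uint64_t w;
  std::memcpy(&w, bytes, sizeof(w));
  return __builtin_popcountll(w);
}
```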
Note, I would like to have a binary version of this also. I just opened https://issues.apache.org/jira/browse/ARROW-9034
The purpose of this class is to scan validity bitmaps in segments of 256 bits at a time (a "run") and return the number of true values using popcount hardware intrinsics. Processing code can then switch between nullable / non-nullable processing paths -- the non-nullable paths are often much faster as they don't have to branch or check individual bits.
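As an illustration of that pattern, here is a minimal, hypothetical block-counter sketch (not Arrow's actual implementation -- the real BitBlockCounter uses 64-bit word loads and hardware popcount rather than the clarity-first per-bit loop shown here):

```cpp
#include <algorithm>
#include <cstdint>

// Minimal, hypothetical sketch of the block-counting pattern: walk a
// validity bitmap 256 bits at a time and report each block's length and
// popcount so callers can branch between "all-valid" fast paths and
// bit-by-bit fallback paths.
class SimpleBlockCounter {
 public:
  struct Block {
    int16_t length;
    int16_t popcount;
  };

  SimpleBlockCounter(const uint8_t* bitmap, int64_t offset, int64_t length)
      : bitmap_(bitmap), position_(offset), end_(offset + length) {}

  Block NextBlock() {
    const int64_t kBlockBits = 256;
    const int64_t run = std::min(kBlockBits, end_ - position_);
    int16_t popcount = 0;
    // Naive per-bit count for clarity; a real implementation would use
    // word loads plus a popcount intrinsic.
    for (int64_t i = 0; i < run; ++i) {
      const int64_t bit = position_ + i;
      popcount += (bitmap_[bit / 8] >> (bit % 8)) & 1;
    }
    position_ += run;
    return Block{static_cast<int16_t>(run), popcount};
  }

 private:
  const uint8_t* bitmap_;
  int64_t position_;
  int64_t end_;
};
```

A consumer would loop on NextBlock(), taking a branch-free fast path when `block.length == block.popcount` (all valid), skipping the block when `popcount == 0`, and falling back to per-bit checks otherwise.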
In the benchmark I wrote here, this strategy starts to become faster than using BitmapReader naively at a null density somewhere between 2% and 10%. I implemented a naive "sum non-null values" algorithm using BitBlockCounter versus a similarly naive version that uses BitmapReader. The benchmark state parameter is the average number of array values for each null.
So we can see that the performance is around the same when 1 in 8 values is null, but when 1 out of 512 is null, the block-counter version is 3x faster. And the performance goes up from there, up to 50x faster on data that has nulls but not very many. In my experience, data with < 1% nulls is extremely common, much more so than data with 5% or more nulls. This is obviously a tradeoff but IMHO one worth making.
As a bonus, BitBlockCounter doesn't inline any code. The implementation of this can probably be improved, and the benchmark as well, so I welcome your collective help with this.