ARROW-9029: [C++] Implement BitBlockCounter for much faster block popcounts of bitmaps #7346
wesm wants to merge 6 commits into apache:master from
Conversation
@cyb70289 since you've been doing bitmap stuff lately I would welcome your code review of this
emkornfield left a comment
Mostly changes around naming which I think will make the semantics of the class clearer and reading of its use easier.
This is likely to be a common pattern; you might consider exposing it via either callbacks on the class or an adapter.
Yes, I think this can be refined as it relates to kernel execution
cpp/src/arrow/util/bit_util.cc (Outdated)
will this have different semantics on big endian?
This shift_word helper is used in other functions -- the unit tests pass on big endian, it seems.
I added some more tests with random data
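For context, a shift_word-style helper typically stitches two adjacent 64-bit words together starting at a bit offset. The sketch below is hypothetical (the name `ShiftWord` and exact semantics of Arrow's internal helper are assumptions, not Arrow's code); it assumes the LSB-first bit numbering that Arrow validity bitmaps use:

```cpp
#include <cstdint>

// Hypothetical sketch of what a shift_word-style helper does: combine two
// adjacent 64-bit words so the result begins at bit `offset` of `current`,
// assuming LSB-first bit numbering.
static inline uint64_t ShiftWord(uint64_t current, uint64_t next, int64_t offset) {
  if (offset == 0) {
    return current;  // guard: `next << 64` would be undefined behavior
  }
  return (current >> offset) | (next << (64 - offset));
}
```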
Thanks for the comments, I think this is much better now
class ARROW_EXPORT BitBlockCounter {
 public:
  struct Block {
    int16_t length;
Did you check if this affects benchmarks? I think popcount returns ints, so there might be some minuscule advantage to keeping it at a larger size.
It didn't seem to affect them
emkornfield left a comment
Thanks, looks reasonable to me.
total_popcount += __builtin_popcountll(shift_word(current, next, offset_));
current = next;
next = load_word(bitmap_ + 32);
total_popcount += __builtin_popcountll(shift_word(current, next, offset_));
I'm okay with the implementation.
Just thinking whether it would be faster to drop "shift_word" with the steps below:
- simply accumulate the total popcount of 4 contiguous uint64 words
- mask the first byte with an offset bitmask, count its pops, and subtract them from the total
- mask the byte after the last uint64 with the offset bitmask, count its pops, and add them to the total
I think we can leave deeper optimization as a follow-up task, and I guess we can handle larger blocks more efficiently with SIMD.
Breaking large blocks into smaller chunks to leverage non-null processing is a great idea.
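A rough sketch of that suggestion, with hypothetical names (this is not Arrow's code): popcount four contiguous uint64 words loaded starting at the byte containing the offset, then fix up the two boundary bytes with masks, assuming LSB-first bit order:

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical sketch of the mask-based idea above (not Arrow's actual
// implementation): count four contiguous 64-bit words directly instead of
// shifting every word, then correct at the boundaries.
int64_t MaskedBlockPopcount(const uint8_t* bitmap, int64_t offset_bits) {
  const uint8_t* base = bitmap + offset_bits / 8;
  const int64_t bit_offset = offset_bits % 8;
  uint64_t words[4];
  std::memcpy(words, base, sizeof(words));  // byte-order-neutral for popcount
  int64_t total = 0;
  for (uint64_t w : words) {
    total += __builtin_popcountll(w);
  }
  if (bit_offset != 0) {
    const uint8_t low_mask = static_cast<uint8_t>((1u << bit_offset) - 1);
    // Subtract the bits that precede the offset in the first byte...
    total -= __builtin_popcountll(base[0] & low_mask);
    // ...and add the bits the 256-bit window extends into the next byte.
    total += __builtin_popcountll(base[32] & low_mask);
  }
  return total;
}
```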
This raises an interesting question: should this be tied in with the existing popcount methods, with the improvements applied in both places?
Sounds good to tackle this as a follow-up, but this is a good starting point in any case.
if (block.length == 0) {
  break;
}
if (block.length == block.popcount) {
So basically you don't need a popcount at all for this optimization. You just need to check whether all bits in the block are 1, or whether all bits in the block are 0. This should be trivial (and could perhaps be folded into VisitBits and/or VisitArrayDataInline).
I need the popcount in order to calculate the number of "on" elements for filters, but yes for this particular optimization between "all-on" and "not-all-on" you don't need the popcount.
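To illustrate the distinction with hypothetical helpers (these names are not Arrow's API): the all-set test needs only a comparison against an all-ones word, while computing the number of selected elements for a filter genuinely needs the count:

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical: "are all 64 bits set?" needs only a comparison...
bool WordAllSet(const uint8_t* bytes) {
  uint64_t w;
  std::memcpy(&w, bytes, sizeof(w));
  return w == ~uint64_t(0);
}

// ...whereas filters need the actual number of "on" bits.
int64_t WordPopcount(const uint8_t* bytes) {
  uint64_t w;
  std::memcpy(&w, bytes, sizeof(w));
  return __builtin_popcountll(w);
}
```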
Note, I would like to have a binary version of this also. I just opened https://issues.apache.org/jira/browse/ARROW-9034
The purpose of this class is to scan validity bitmaps in segments of 256 bits at a time (a "run") and return the number of true values using popcount hardware intrinsics. Processing code can then switch between nullable / non-nullable processing paths -- the non-nullable paths are often much faster as they don't have to branch or check individual bits.
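As an illustration of that pattern, here is a minimal, hypothetical block-counter sketch (not Arrow's actual implementation -- the real BitBlockCounter uses 64-bit word loads and hardware popcount rather than the clarity-first per-bit loop shown here):

```cpp
#include <algorithm>
#include <cstdint>

// Minimal, hypothetical sketch of the block-counting pattern: walk a
// validity bitmap 256 bits at a time and report each block's length and
// popcount so callers can branch between "all-valid" fast paths and
// bit-by-bit fallback paths.
class SimpleBlockCounter {
 public:
  struct Block {
    int16_t length;
    int16_t popcount;
  };

  SimpleBlockCounter(const uint8_t* bitmap, int64_t offset, int64_t length)
      : bitmap_(bitmap), position_(offset), end_(offset + length) {}

  Block NextBlock() {
    const int64_t kBlockBits = 256;
    const int64_t run = std::min(kBlockBits, end_ - position_);
    int16_t popcount = 0;
    // Naive per-bit count for clarity; a real implementation would use
    // word loads plus a popcount intrinsic.
    for (int64_t i = 0; i < run; ++i) {
      const int64_t bit = position_ + i;
      popcount += (bitmap_[bit / 8] >> (bit % 8)) & 1;
    }
    position_ += run;
    return Block{static_cast<int16_t>(run), popcount};
  }

 private:
  const uint8_t* bitmap_;
  int64_t position_;
  int64_t end_;
};
```

A consumer would loop on NextBlock(), taking a branch-free fast path when `block.length == block.popcount` (all valid), skipping the block when `popcount == 0`, and falling back to per-bit checks otherwise.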
In the benchmark I wrote here, this strategy starts to become faster than using BitmapReader naively at a null density somewhere between 2% and 10%. I implemented a naive "sum non-null values" algorithm using BitBlockCounter versus a similarly naive version that uses BitmapReader. The benchmark state parameter is the average number of array values for each null.
So we can see that the performance is around the same when 1 in 8 values is null, but when 1 out of 512 is null, the block-counter version is 3x faster. And the performance goes up from there, up to 50x faster on data that has nulls but not very many. In my experience, data with < 1% nulls is extremely common, much more so than data with 5% or more nulls. This is obviously a tradeoff but IMHO one worth making.
As a bonus, BitBlockCounter doesn't inline any code. The implementation of this can probably be improved, and the benchmark as well, so I welcome your collective help with this.