ARROW-10058: [C++] Improve repeated levels conversion without BMI2 by pitrou · Pull Request #8320 · apache/arrow

pitrou · 2020-10-01T17:49:53Z

Use a lookup table to emulate PEXT 5 bits at a time.
Remove the slow scalar path.
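
As a rough illustration of the idea in this description, here is a minimal sketch of emulating PEXT with a lookup table indexed by 5 bits of mask and 5 bits of value. It is not the code from this PR: the namespace, `ChunkTable`, `ExtractBitsBitByBit` and `ExtractBitsWithTable` are hypothetical names chosen for the example, and the table is built at runtime here purely for brevity.

```cpp
#include <cassert>
#include <cstdint>

namespace pext_sketch {  // hypothetical namespace, not part of Arrow

// Reference definition of PEXT: gather the bits of `value` selected by
// `mask` and pack them contiguously into the low bits of the result.
inline uint64_t ExtractBitsBitByBit(uint64_t value, uint64_t mask) {
  uint64_t out = 0;
  int out_pos = 0;
  for (int i = 0; i < 64 && (mask >> i) != 0; ++i) {
    if ((mask >> i) & 1) {
      out |= ((value >> i) & 1) << out_pos;
      ++out_pos;
    }
  }
  return out;
}

// Precomputed results for every (5-bit mask, 5-bit value) pair, plus the
// popcount of every 5-bit mask: 32 * 32 + 32 bytes in total.
struct ChunkTable {
  uint8_t extracted[32][32];
  uint8_t popcount[32];

  ChunkTable() {
    for (uint32_t mask = 0; mask < 32; ++mask) {
      int count = 0;
      for (int bit = 0; bit < 5; ++bit) {
        count += static_cast<int>((mask >> bit) & 1);
      }
      popcount[mask] = static_cast<uint8_t>(count);
      for (uint32_t value = 0; value < 32; ++value) {
        extracted[mask][value] =
            static_cast<uint8_t>(ExtractBitsBitByBit(value, mask));
      }
    }
  }
};

// Table-driven PEXT: consume mask and value 5 bits at a time, appending
// each chunk's extracted bits at the current output position.
inline uint64_t ExtractBitsWithTable(uint64_t value, uint64_t mask) {
  static const ChunkTable table;  // built once, on first use
  uint64_t out = 0;
  int out_pos = 0;
  while (mask != 0) {
    const uint32_t mask_chunk = static_cast<uint32_t>(mask & 31);
    const uint32_t value_chunk = static_cast<uint32_t>(value & 31);
    assert(out_pos < 64);  // holds because unprocessed mask bits remain
    out |= static_cast<uint64_t>(table.extracted[mask_chunk][value_chunk])
           << out_pos;
    out_pos += table.popcount[mask_chunk];
    mask >>= 5;
    value >>= 5;
  }
  return out;
}

}  // namespace pext_sketch
```

With 5-bit chunks a 64-bit mask needs at most 13 lookups, and the loop exits early once the remaining mask bits are all zero.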

pitrou · 2020-10-01T17:50:33Z

See JIRA issue for benchmarks. Would be nice to have benchmarks on other machines. @emkornfield

pitrou · 2020-10-01T17:55:04Z

I also notice that we call internal::GreaterThanBitmap for each 64 levels, which always goes through the dynamic dispatch indirection (meaning two function calls, I think). We could call GreaterThanBitmapImpl but that requires compiling a specialized version of level_conversion_inc.h for AVX2, otherwise we lose performance.
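
To make the "two function calls" concern concrete, here is a minimal sketch of a dispatched comparison-to-bitmap helper. It only illustrates the pattern and is not Arrow's actual code: the namespace, `GreaterThanBitmapScalar`, `ResolveGreaterThan` and `GreaterThanBitmapDispatched` are hypothetical names, and the resolver always returns the scalar version where a real one would inspect CPU features.

```cpp
#include <cstdint>

namespace dispatch_sketch {  // hypothetical namespace, not part of Arrow

// Scalar kernel: bit i of the result is set when levels[i] > rhs.
// Assumes num_levels <= 64 so that every level maps to one bit.
inline uint64_t GreaterThanBitmapScalar(const int16_t* levels,
                                        int64_t num_levels, int16_t rhs) {
  uint64_t bitmap = 0;
  for (int64_t i = 0; i < num_levels; ++i) {
    bitmap |= static_cast<uint64_t>(levels[i] > rhs) << i;
  }
  return bitmap;
}

using GreaterThanFn = uint64_t (*)(const int16_t*, int64_t, int16_t);

// Stand-in for runtime CPU detection (assumption: a real resolver would
// return an AVX2 build of the kernel when the CPU supports it).
inline GreaterThanFn ResolveGreaterThan() { return &GreaterThanBitmapScalar; }

// Dispatched entry point: one call into this wrapper, then a second,
// indirect call through the function pointer, for every batch of levels.
inline uint64_t GreaterThanBitmapDispatched(const int16_t* levels,
                                            int64_t num_levels, int16_t rhs) {
  static const GreaterThanFn fn = ResolveGreaterThan();
  return fn(levels, num_levels, rhs);
}

}  // namespace dispatch_sketch
```

Calling the kernel directly lets the compiler inline it into the caller, but the caller then has to be compiled once per instruction set, which is the trade-off described above.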

github-actions · 2020-10-01T18:06:40Z

https://issues.apache.org/jira/browse/ARROW-10058

emkornfield · 2020-10-02T09:06:01Z

@pitrou I'm devoting most of my bandwidth to trying to finish up the Parquet read component this week; is it ok if I take a closer look next week (hopefully with enough time before an RC is cut)?

emkornfield · 2020-10-02T09:07:45Z

> I also notice that we call internal::GreaterThanBitmap for each 64 levels, which always goes through the dynamic dispatch indirection (meaning two function calls, I think). We could call GreaterThanBitmapImpl but that requires compiling a specialized version of level_conversion_inc.h for AVX2, otherwise we lose performance.

Yeah, it isn't ideal. There may be a better factoring in there, but it seemed hard to do while keeping the BMI2-specific instructions isolated. I guess if this isn't too much slower than BMI2 on Intel we could potentially collapse everything, but I would not expect that to be the case.

pitrou · 2020-10-02T09:59:30Z

> is it ok if I take a closer look next week

No problem.

> but I would not expect that to be the case.

Right. The emulation is probably much slower.

pitrou · 2020-10-06T18:18:56Z

Updated benchmarks on AMD Ryzen:

                         benchmark         baseline         contender  change %                                                                                                                                           counters
0     BM_ReadListOfStructColumn/50  392.881 MiB/sec   564.029 MiB/sec    43.562     {'run_name': 'BM_ReadListOfStructColumn/50', 'run_type': 'iteration', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 23}
10            BM_ReadListColumn/50  485.560 MiB/sec   675.023 MiB/sec    39.019             {'run_name': 'BM_ReadListColumn/50', 'run_type': 'iteration', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 42}
7     BM_ReadStructOfListColumn/50  341.782 MiB/sec   462.097 MiB/sec    35.202     {'run_name': 'BM_ReadStructOfListColumn/50', 'run_type': 'iteration', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 20}
3       BM_ReadListOfListColumn/50  447.657 MiB/sec   566.594 MiB/sec    26.569       {'run_name': 'BM_ReadListOfListColumn/50', 'run_type': 'iteration', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 39}
23            BM_ReadListColumn/99    1.168 GiB/sec     1.365 GiB/sec    16.883            {'run_name': 'BM_ReadListColumn/99', 'run_type': 'iteration', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 102}
4     BM_ReadListOfStructColumn/99  975.429 MiB/sec     1.095 GiB/sec    14.925     {'run_name': 'BM_ReadListOfStructColumn/99', 'run_type': 'iteration', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 56}
9     BM_ReadStructOfListColumn/99  798.058 MiB/sec   896.789 MiB/sec    12.371     {'run_name': 'BM_ReadStructOfListColumn/99', 'run_type': 'iteration', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 46}
22      BM_ReadListOfListColumn/99    1.050 GiB/sec     1.168 GiB/sec    11.159       {'run_name': 'BM_ReadListOfListColumn/99', 'run_type': 'iteration', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 94}
1      BM_ReadListOfStructColumn/1  654.576 MiB/sec   725.676 MiB/sec    10.862      {'run_name': 'BM_ReadListOfStructColumn/1', 'run_type': 'iteration', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 38}
19             BM_ReadListColumn/1  919.949 MiB/sec  1005.740 MiB/sec     9.326              {'run_name': 'BM_ReadListColumn/1', 'run_type': 'iteration', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 81}
11     BM_ReadListOfStructColumn/0  835.259 MiB/sec   908.920 MiB/sec     8.819      {'run_name': 'BM_ReadListOfStructColumn/0', 'run_type': 'iteration', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 49}
17     BM_ReadStructOfListColumn/1  605.129 MiB/sec   649.556 MiB/sec     7.342      {'run_name': 'BM_ReadStructOfListColumn/1', 'run_type': 'iteration', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 35}
8              BM_ReadListColumn/0    1.067 GiB/sec     1.145 GiB/sec     7.334              {'run_name': 'BM_ReadListColumn/0', 'run_type': 'iteration', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 92}
5      BM_ReadStructOfListColumn/0  700.157 MiB/sec   740.414 MiB/sec     5.750      {'run_name': 'BM_ReadStructOfListColumn/0', 'run_type': 'iteration', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 41}
6        BM_ReadListOfListColumn/0  929.109 MiB/sec   966.896 MiB/sec     4.067        {'run_name': 'BM_ReadListOfListColumn/0', 'run_type': 'iteration', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 82}
14  BM_ReadStructOfStructColumn/50    1.537 GiB/sec     1.595 GiB/sec     3.772   {'run_name': 'BM_ReadStructOfStructColumn/50', 'run_type': 'iteration', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 45}
13          BM_ReadStructColumn/99    4.211 GiB/sec     4.330 GiB/sec     2.835          {'run_name': 'BM_ReadStructColumn/99', 'run_type': 'iteration', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 252}
20          BM_ReadStructColumn/50    1.155 GiB/sec     1.187 GiB/sec     2.755           {'run_name': 'BM_ReadStructColumn/50', 'run_type': 'iteration', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 69}
15   BM_ReadStructOfStructColumn/1    1.802 GiB/sec     1.849 GiB/sec     2.566    {'run_name': 'BM_ReadStructOfStructColumn/1', 'run_type': 'iteration', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 53}
12           BM_ReadStructColumn/1    1.798 GiB/sec     1.843 GiB/sec     2.521           {'run_name': 'BM_ReadStructColumn/1', 'run_type': 'iteration', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 110}
2   BM_ReadStructOfStructColumn/99    3.464 GiB/sec     3.530 GiB/sec     1.898  {'run_name': 'BM_ReadStructOfStructColumn/99', 'run_type': 'iteration', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 100}
16   BM_ReadStructOfStructColumn/0    6.021 GiB/sec     6.065 GiB/sec     0.724   {'run_name': 'BM_ReadStructOfStructColumn/0', 'run_type': 'iteration', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 179}
21           BM_ReadStructColumn/0    6.821 GiB/sec     6.812 GiB/sec    -0.137           {'run_name': 'BM_ReadStructColumn/0', 'run_type': 'iteration', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 410}
18       BM_ReadListOfListColumn/1  805.644 MiB/sec   801.480 MiB/sec    -0.517        {'run_name': 'BM_ReadListOfListColumn/1', 'run_type': 'iteration', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 70}

emkornfield · 2020-10-06T18:50:45Z

Sorry, some personal issues came up. I hope to have time tonight to review this and other Parquet-related CLs.

pitrou · 2020-10-06T19:04:50Z

For the record, if I profile BM_ReadStructOfListColumn/50, I get the following hot spots (in cycles spent):

~19% in DefRepLevelsToListInfo
~15% in DelimitRecords
~11% in BitRunReader::NextRun
~10% in SpacedExpand
~6% in DictionaryConverter<int>::Copy
~5% in PathWriteContext::AppendRepLevels

And ExtractBitsSoftware (the PEXT emulation) only takes ~1.1%, which seems good enough for now.

emkornfield · 2020-10-07T04:46:06Z

+1. Thanks.

pitrou marked this pull request as ready for review October 5, 2020 15:52

pitrou mentioned this pull request Oct 5, 2020

ARROW-10120: [C++] Add two-level nested Parquet read to Arrow benchmarks #8342

Closed

ARROW-10058: [C++] Improve repeated levels conversion without BMI2

482797c

Use a lookup table to emulate PEXT 5 bits at a time. Remove the slow scalar path.

pitrou force-pushed the ARROW-10058-faster-sw-pext branch from cd01f19 to 482797c October 6, 2020 18:19

emkornfield closed this in e9a12fa Oct 7, 2020

pitrou deleted the ARROW-10058-faster-sw-pext branch October 7, 2020 09:13

asfimport mentioned this pull request Aug 2, 2021

[C++] Investigate performance of LevelsToBitmap without BMI2 #26079

Closed

pitrou wants to merge 1 commit into apache:master from pitrou:ARROW-10058-faster-sw-pext
