ARROW-8928: [C++] Add microbenchmarks to help measure ExecBatchIterator overhead #9280
wesm wants to merge 6 commits into apache:master from
Conversation
"items" is a bit ambiguous in this benchmark, but I would expect something else than the number of iterations. Perhaps the number of arrays yielded in the inner loop above?
I added a comment explaining that iterations-per-second gives an easier interpretation of the input-splitting overhead (so 300 iterations/second would mean 3.33 ms of input-splitting overhead for each use).
Some updated performance numbers (gcc 9.3 locally on x86):

The way to read this is that breaking a batch with 1M elements into batches of size 1024 for finer-grained parallel processing costs about 2900 microseconds. On this same machine, I have:

This seems problematic if we wish to enable array expression evaluation on smaller batch sizes to keep more data in CPU caches. I'll bring this up on the mailing list to see what people think.
These are only preliminary benchmarks, but they may help in examining microperformance overhead related to `ExecBatch` and its implementation (as a `vector<Datum>`). It may be desirable to devise an "array reference" data structure with few or no heap-allocated members and no `shared_ptr` interactions required to obtain memory addresses and other array information.

On my test machine (macOS, i9-9880H @ 2.3 GHz), I see about 472 CPU cycles of per-field overhead for each `ExecBatch` produced. These benchmarks take a record batch with 1M rows and 10 columns/fields and iterate through the rows in smaller `ExecBatch`es of the indicated sizes.
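A minimal sketch of what such an "array reference" structure might look like; the `ArrayRef` name and fields are hypothetical illustrations, not Arrow's actual API:

```cpp
#include <cstdint>

// Hypothetical non-owning "array reference": raw buffer pointers plus
// length/offset, with no shared_ptr ownership, so obtaining memory
// addresses in a hot loop involves no atomic refcount traffic and no
// heap allocation.
struct ArrayRef {
  const uint8_t* validity;  // may be null when there are no nulls
  const uint8_t* data;      // primary data buffer
  int64_t length;
  int64_t offset;
};

// Slicing is pure pointer/offset arithmetic: no allocations, no refcounts.
inline ArrayRef Slice(const ArrayRef& a, int64_t off, int64_t len) {
  return ArrayRef{a.validity, a.data, len, a.offset + off};
}
```

The design point is that iterating a large batch in small chunks then only touches trivially copyable stack values, which is exactly the per-field cost this PR's benchmarks try to isolate.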
So for the 1024-row case, it takes 2,055,369 ns to iterate through all of the batches. That seems a bit expensive to me (?). I suspect we can do better, while also improving compilation times and reducing generated code size, by using simpler data structures in our compute internals.