Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
The runtime of GenericInProgressArray is almost entirely dominated by copying strings to new buffers. It copies to new buffers to avoid accumulating a large number of buffers that each have only a small number of rows pointing at them.
It already has several optimizations that avoid the copy, or make it cheaper, when possible.
The coalesce kernel has special logic to recycle string view buffers when they are not used much (TODO link)
I have an as-yet-unproven thesis that we could speed up the coalesce kernel by special-casing the situation where the underlying buffer is the same.
The high-level idea is that when reading from Parquet, the same string buffer is used for several batches, so if the coalesce kernel detected this, it might be able to avoid some copies. I intend to use the coalesce kernel to make Parquet reading faster.
Describe the solution you'd like
Make the coalesce kernel faster, as measured by benchmarks.
Describe alternatives you've considered
The first thing I would do is check, using an actual Parquet benchmark, that the same Buffers are used for multiple RecordBatches that come out of the reader:
cargo bench --features=arrow,async --bench arrow_reader_clickbench
If that is the case, I would then write a benchmark that replicates the pattern (e.g. create a record batch with 32K rows, then slice it up and send it in 8K-row chunks).
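To make the slicing pattern concrete, here is a minimal std-only sketch. It does not use the arrow-rs API: `StringColumn`, `make_column`, and `slice_into_chunks` are hypothetical names, and an `Arc<Vec<u8>>` stands in for an Arrow buffer. It only illustrates the assumption being tested, namely that every chunk sliced from one large batch still points at the same underlying buffer.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a string column: one shared data buffer plus the
/// (offset, length) of each row's bytes within it. Not an arrow-rs type.
#[derive(Clone)]
struct StringColumn {
    buffer: Arc<Vec<u8>>,
    rows: Vec<(usize, usize)>,
}

/// Build a column of `num_rows` short strings that all live in a single buffer.
fn make_column(num_rows: usize) -> StringColumn {
    let mut data = Vec::new();
    let mut rows = Vec::with_capacity(num_rows);
    for i in 0..num_rows {
        let s = format!("row-{i}");
        rows.push((data.len(), s.len()));
        data.extend_from_slice(s.as_bytes());
    }
    StringColumn { buffer: Arc::new(data), rows }
}

/// Split the column into chunks of at most `chunk_rows` rows.
/// Each chunk clones the `Arc`, so the string bytes are shared, not copied.
fn slice_into_chunks(col: &StringColumn, chunk_rows: usize) -> Vec<StringColumn> {
    col.rows
        .chunks(chunk_rows)
        .map(|rows| StringColumn {
            buffer: Arc::clone(&col.buffer),
            rows: rows.to_vec(),
        })
        .collect()
}
```

Calling `slice_into_chunks(&make_column(32 * 1024), 8 * 1024)` yields four chunks, and `Arc::ptr_eq` confirms that each one shares the original buffer. A real benchmark would feed the equivalent sliced RecordBatches into the coalesce kernel.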
Then I would try to optimize it. For example, check pointer equality and delay the string copies until the kernel sees a new buffer pointer.
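A rough sketch of the "delay the copies until a new buffer pointer appears" idea, again using only std and not the arrow-rs API. `DeferredCoalescer` and its fields are hypothetical, and `Arc::ptr_eq` stands in for comparing the data pointers of the underlying Arrow buffers. While consecutive inputs share a buffer, the sketch only records byte ranges (merging adjacent ones); it performs the actual copy once the buffer changes or on an explicit flush.

```rust
use std::sync::Arc;

/// Hypothetical model of a coalescer that defers copying while consecutive
/// inputs share one source buffer. Not an arrow-rs type.
struct DeferredCoalescer {
    /// Source buffer that the ranges in `pending` refer to, if any.
    current: Option<Arc<Vec<u8>>>,
    /// Byte ranges (start, end) in `current` not yet copied; adjacent ranges are merged.
    pending: Vec<(usize, usize)>,
    /// Compacted output bytes.
    output: Vec<u8>,
    /// Number of bulk copies performed, tracked only for illustration.
    copies: usize,
}

impl DeferredCoalescer {
    fn new() -> Self {
        Self { current: None, pending: Vec::new(), output: Vec::new(), copies: 0 }
    }

    /// Queue `range` of `buffer`. Bytes are copied only when the buffer pointer changes.
    fn push(&mut self, buffer: &Arc<Vec<u8>>, range: (usize, usize)) {
        let same = self.current.as_ref().map_or(false, |cur| Arc::ptr_eq(cur, buffer));
        if !same {
            // New buffer pointer: copy everything queued for the previous buffer.
            self.flush();
            self.current = Some(Arc::clone(buffer));
        }
        // Extend the last pending range if the new one starts where it ends.
        if let Some(last) = self.pending.last_mut() {
            if last.1 == range.0 {
                last.1 = range.1;
                return;
            }
        }
        self.pending.push(range);
    }

    /// Copy all pending ranges into `output`, one bulk copy per merged range.
    fn flush(&mut self) {
        if let Some(buf) = self.current.take() {
            for (start, end) in self.pending.drain(..) {
                self.output.extend_from_slice(&buf[start..end]);
                self.copies += 1;
            }
        }
    }
}
```

In this model, pushing two adjacent ranges from the same buffer and then one range from a different buffer results in two bulk copies instead of three. Whether the same saving carries over to the real kernel is exactly what the proposed benchmark would need to show.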
Additional context