[coalesce] Special case BatchCoalescer / GenericInProgressArray when multiple batches are pushed in with the same buffer #7765

@alamb

Description

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

The cost of GenericInProgressArray is dominated almost entirely by copying strings to new buffers. It copies to new buffers to avoid accumulating a large number of buffers, each of which has only a small number of rows pointing at it.

It has several optimizations to avoid this copy, or make it cheaper, when possible:
The coalesce kernel has special logic to recycle string view buffers when they are not used much (TODO link)

I have an as-yet-unproven thesis that we could speed up the coalesce kernel by special-casing the situation where the underlying buffer is the same.

The high-level idea is that when reading from Parquet, the same string buffer will be used for several batches, so if the coalesce kernel detected this, it might be able to avoid some copies. I intend to use the coalesce kernel to make Parquet reading faster.

Describe the solution you'd like

Make the coalesce kernel benchmarks faster

Describe alternatives you've considered

The first thing I would do is check, with an actual Parquet benchmark, whether the same Buffers are used for multiple RecordBatches that come out of the reader:

cargo bench --features=arrow,async --bench arrow_reader_clickbench

If that is the case, I would then make a benchmark that replicates the pattern (e.g. create a record batch with 32K rows, then slice it up and send it in 8K-row chunks).
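The input pattern described above can be sketched without any arrow-rs dependency. This is a minimal std-only model, not the arrow-rs API: `StringColumn`, `generate`, `slice`, and `chunked` are hypothetical names, and an `Arc<Vec<u8>>` stands in for an Arrow `Buffer`. It only shows that slicing one 32K-row column into 8K-row chunks yields chunks that all point at the same underlying allocation.

```rust
use std::sync::Arc;

/// Hypothetical model of a string column (not an arrow-rs type): every row
/// is an (offset, len) view into one shared, reference-counted data buffer.
#[derive(Clone, Debug)]
pub struct StringColumn {
    pub buffer: Arc<Vec<u8>>,
    pub views: Vec<(usize, usize)>,
}

impl StringColumn {
    /// Build a column of `num_rows` strings "row-0", "row-1", ... in one buffer.
    pub fn generate(num_rows: usize) -> Self {
        let mut data = Vec::new();
        let mut views = Vec::with_capacity(num_rows);
        for i in 0..num_rows {
            let s = format!("row-{}", i);
            views.push((data.len(), s.len()));
            data.extend_from_slice(s.as_bytes());
        }
        Self { buffer: Arc::new(data), views }
    }

    /// Slice without copying string bytes: the Arc is cloned (a refcount
    /// bump) and only the views for the selected rows are copied.
    pub fn slice(&self, offset: usize, len: usize) -> Self {
        Self {
            buffer: Arc::clone(&self.buffer),
            views: self.views[offset..offset + len].to_vec(),
        }
    }

    /// Read one row back as a string slice.
    pub fn value(&self, row: usize) -> &str {
        let (off, len) = self.views[row];
        std::str::from_utf8(&self.buffer[off..off + len]).expect("valid UTF-8")
    }
}

/// Split one large column into fixed-size chunks that all share its buffer.
pub fn chunked(column: &StringColumn, chunk_rows: usize) -> Vec<StringColumn> {
    assert!(chunk_rows > 0, "chunk_rows must be non-zero");
    (0..column.views.len())
        .step_by(chunk_rows)
        .map(|start| {
            let len = chunk_rows.min(column.views.len() - start);
            column.slice(start, len)
        })
        .collect()
}
```

Calling `chunked(&StringColumn::generate(32 * 1024), 8 * 1024)` produces four chunks for which `Arc::ptr_eq` on the `buffer` field holds pairwise. A real benchmark would push the equivalent arrow-rs slices into the coalesce kernel instead.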

Then I would try to optimize it. For example, check pointer equality and delay the string copies until the kernel sees a new buffer pointer.
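One possible reading of the pointer-equality idea, again as a std-only sketch rather than arrow-rs code: `Chunk`, `DeferredCoalescer`, and the `copy_passes` counter are hypothetical, and `Arc::ptr_eq` stands in for comparing Arrow buffer pointers. While consecutive chunks share a buffer, only their views are recorded. The bytes are copied in a single pass once a different buffer shows up or the caller flushes.

```rust
use std::sync::Arc;

/// Hypothetical input chunk: rows are (offset, len) views into a shared buffer.
pub struct Chunk {
    pub buffer: Arc<Vec<u8>>,
    pub views: Vec<(usize, usize)>,
}

/// Hypothetical coalescer that defers string copies while consecutive
/// chunks keep pointing at the same underlying buffer.
#[derive(Default)]
pub struct DeferredCoalescer {
    /// Buffer shared by the pending (not yet copied) views, if any.
    pending_buffer: Option<Arc<Vec<u8>>>,
    pending_views: Vec<(usize, usize)>,
    /// Compacted output: string bytes plus (offset, len) views into them.
    pub out_data: Vec<u8>,
    pub out_views: Vec<(usize, usize)>,
    /// Number of copy passes performed, kept for illustration only.
    pub copy_passes: usize,
}

impl DeferredCoalescer {
    pub fn push(&mut self, chunk: &Chunk) {
        // Pointer equality: is this the same allocation as the pending run?
        let same_buffer = self
            .pending_buffer
            .as_ref()
            .map_or(false, |b| Arc::ptr_eq(b, &chunk.buffer));
        if !same_buffer {
            // New buffer pointer: copy out everything deferred so far.
            self.flush();
            self.pending_buffer = Some(Arc::clone(&chunk.buffer));
        }
        // Record the views only; no string bytes are copied here.
        self.pending_views.extend_from_slice(&chunk.views);
    }

    /// Copy the pending run into the output in one pass.
    pub fn flush(&mut self) {
        if let Some(buffer) = self.pending_buffer.take() {
            if !self.pending_views.is_empty() {
                self.copy_passes += 1;
            }
            for (off, len) in self.pending_views.drain(..) {
                self.out_views.push((self.out_data.len(), len));
                self.out_data.extend_from_slice(&buffer[off..off + len]);
            }
        }
    }
}
```

Deferring by itself copies the same bytes, just later. The potential win is that at flush time the coalescer knows how much of the buffer the whole run references, so it could, for instance, keep the buffer instead of copying when most of it is in use. Whether that pays off is exactly what the benchmark would need to show.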

Additional context

Metadata

Labels: enhancement (any new improvement worthy of an entry in the changelog)
