[C++][Parquet] RecordReader does not correctly reserve memory for BYTE_ARRAY and FLBA #47012

@pitrou

Describe the enhancement requested

When reading a Parquet leaf column as Arrow, we presize the Arrow builder so as to avoid spurious reallocations during incremental Parquet decoding calls.

However, the Reserve method on RecordReader only reserves space correctly for physical types other than BYTE_ARRAY and FIXED_LEN_BYTE_ARRAY (FLBA).

As a result, some of our micro-benchmarks spend a significant amount of time reallocating data in the ArrayBuilder. For example, here is a flamegraph of BM_ReadColumnPlain<false,Float16LogicalType>/null_probability:-1:

[Image: flamegraph]
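
To illustrate the effect, here is a minimal, self-contained sketch. It is not the actual RecordReader or Arrow builder code. ToyBinaryBuilder and CountDataReallocations are hypothetical names invented for this example, and the Reserve/ReserveData split is only loosely modeled on how a binary builder keeps separate offsets and data buffers. The sketch assumes 2-byte values, which is the width of a Float16 stored as FLBA, and counts how often the data buffer has to grow when only the number of values is reserved compared with when the value bytes are reserved as well.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for a binary builder: one buffer of offsets
// (one entry per value) and one buffer of concatenated value bytes.
struct ToyBinaryBuilder {
  std::vector<int32_t> offsets{0};
  std::vector<uint8_t> data;
  int data_reallocations = 0;

  // Reserve room for `n` more values. This only touches the offsets buffer.
  void Reserve(std::size_t n) { offsets.reserve(offsets.size() + n); }

  // Reserve room for `nbytes` more bytes of value data.
  void ReserveData(std::size_t nbytes) { data.reserve(data.size() + nbytes); }

  void Append(const std::string& value) {
    const std::size_t old_capacity = data.capacity();
    data.insert(data.end(), value.begin(), value.end());
    // A capacity change means the data buffer was reallocated and copied.
    if (data.capacity() != old_capacity) ++data_reallocations;
    offsets.push_back(static_cast<int32_t>(data.size()));
  }
};

// Appends `num_values` two-byte values and returns how many times the data
// buffer was reallocated while appending.
int CountDataReallocations(std::size_t num_values, bool reserve_data) {
  ToyBinaryBuilder builder;
  builder.Reserve(num_values);  // always done: presize the value count
  if (reserve_data) {
    builder.ReserveData(num_values * 2);  // assumption: 2 bytes per value
  }
  const std::string value(2, '\x01');
  for (std::size_t i = 0; i < num_values; ++i) {
    builder.Append(value);
  }
  return builder.data_reallocations;
}
```

Calling CountDataReallocations(10000, true) reports no reallocations during the append loop, while CountDataReallocations(10000, false) reports at least one, because the data buffer grows incrementally. The exact count in the second case depends on the standard library's growth policy.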

Component(s)

C++, Parquet
