[C++][Parquet] Fast Random Rowgroup Reads #39676

@corwinjoy

Describe the enhancement requested

  • Background:
    For Parquet files that have a large number of rowgroups and columns, reading the full file metadata is prohibitively expensive when you only want a sample from a table. (Our customers use Parquet files via Arrow that contain more than 10k columns and thousands of rowgroups.) For the case where you only want to read a few rowgroups and/or columns, we would like to have a fast random access reader.

  • Idea:
    Read only the minimal metadata from the Parquet file needed to establish the columns and column types. Require that the file contain an OffsetIndex section, and use the offset index to directly access the required data pages and columns. Preliminary work indicates that, with the existing Parquet format, this can give a 2x to 3x speedup even with a modest number of columns and rowgroups. With some minor Parquet format changes, I believe this could be 100x faster.

  • Related Work:
    There has been some similar work in this direction, but I think it is more at the interface level than direct performance tuning:
    [C++][Parquet] Support read by row ranges #39392
    [C++][Parquet] support passing a RowRange to RecordBatchReader #38865
    Jira: Selective reading of rows for parquet file

There is also a previous discussion of this, with additional benchmarks:
#38149

Having a fast random access reader would also be beneficial for fast reading of a file with predicate pushdowns or other applications where specific rows and columns are desired.
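
To make the page-lookup step of the idea above concrete, here is a minimal sketch, not the proposed implementation. It uses only the standard library and no Arrow or Parquet APIs. `PageLocation` mirrors the three fields of an OffsetIndex entry in the Parquet format; `ByteRange` and `PagesForRows` are hypothetical names introduced for illustration, and the sketch assumes the offset index for one column chunk has already been decoded.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Mirrors the fields of a PageLocation entry in the Parquet OffsetIndex.
struct PageLocation {
  int64_t offset;                // file offset of the page
  int32_t compressed_page_size;  // bytes to read, including the page header
  int64_t first_row_index;       // first row of the page, relative to the rowgroup
};

// Hypothetical helper type: one contiguous byte range to fetch from the file.
struct ByteRange {
  int64_t offset;
  int64_t length;
};

// Hypothetical helper: returns the byte ranges of the pages overlapping rows
// [first_row, last_row] (inclusive, relative to the rowgroup).
// Assumes `pages` is sorted by first_row_index, as in an OffsetIndex.
// The caller is expected to clamp the row range to the rowgroup's row count,
// since the last page's end row is not stored in the index.
std::vector<ByteRange> PagesForRows(const std::vector<PageLocation>& pages,
                                    int64_t first_row, int64_t last_row) {
  std::vector<ByteRange> out;
  if (pages.empty() || first_row > last_row) return out;

  // Find the first page that starts after first_row; the page just before it
  // is the one that contains first_row.
  auto it = std::upper_bound(
      pages.begin(), pages.end(), first_row,
      [](int64_t row, const PageLocation& p) { return row < p.first_row_index; });
  if (it != pages.begin()) --it;

  // Collect every page that starts at or before last_row.
  for (; it != pages.end() && it->first_row_index <= last_row; ++it) {
    out.push_back({it->offset, it->compressed_page_size});
  }
  return out;
}
```

With byte ranges like these in hand, a reader could issue only the I/O needed for the selected rows and columns instead of scanning whole column chunks. Whether adjacent ranges should be coalesced into a single read is a separate design choice not covered here.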

Component(s)

C++
