[C++][Parquet] Fast Random Rowgroup Reads

### Describe the enhancement requested

- Background:
For Parquet files that have a large number of rowgroups and columns, reading the full file metadata is prohibitively expensive when you just want a sample from a table. (Our customers use Parquet files, via Arrow, that contain more than 10k columns and thousands of rowgroups.) For the case where you just want to read a few rowgroups and/or columns, we would like to have a fast random access reader.

- Idea:
Read only the minimal metadata from the Parquet file needed to establish the columns and column types. Require that the file contain an [OffsetIndex](https://github.com/apache/parquet-format/blob/master/PageIndex.md) section, and use the offset index to directly access the required data pages and columns. Preliminary work indicates that this can give a 2x to 3x speedup with even a modest number of columns and rowgroups, using the existing Parquet format. With some minor Parquet format changes, I believe this could be 100x faster.
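
To illustrate the lookup step, here is a minimal sketch of how page locations from an offset index could be turned into byte ranges for a requested row range. It does not use the Arrow or Parquet C++ APIs: `PageLocation` is a simplified stand-in for the struct of the same name in the Parquet format, and `ByteRange` and `PagesForRows` are hypothetical names introduced only for this example. It assumes the page locations are sorted by `first_row_index` and ignores dictionary pages.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified stand-in for the PageLocation entries stored in an OffsetIndex.
struct PageLocation {
  int64_t offset;                // file offset of the page
  int32_t compressed_page_size;  // size of the page on disk, in bytes
  int64_t first_row_index;       // first row in the page, relative to the rowgroup
};

// Hypothetical helper type: a contiguous region of the file to read.
struct ByteRange {
  int64_t offset;
  int64_t length;
};

// Returns the byte ranges of all pages that overlap rows
// [first_row, last_row] (inclusive, relative to the rowgroup).
// `num_rows` is the total row count of the rowgroup, needed to bound
// the last page. Pages that are adjacent on disk are merged into one range.
std::vector<ByteRange> PagesForRows(const std::vector<PageLocation>& pages,
                                    int64_t num_rows, int64_t first_row,
                                    int64_t last_row) {
  assert(first_row <= last_row);
  std::vector<ByteRange> ranges;
  for (std::size_t i = 0; i < pages.size(); ++i) {
    const int64_t page_first = pages[i].first_row_index;
    // A page ends one row before the next page starts.
    const int64_t page_last =
        (i + 1 < pages.size() ? pages[i + 1].first_row_index : num_rows) - 1;
    if (page_last < first_row || page_first > last_row) continue;

    if (!ranges.empty() &&
        ranges.back().offset + ranges.back().length == pages[i].offset) {
      // Contiguous with the previous selected page: extend that read.
      ranges.back().length += pages[i].compressed_page_size;
    } else {
      ranges.push_back({pages[i].offset, pages[i].compressed_page_size});
    }
  }
  return ranges;
}
```

With this approach, a reader only needs the offset index of the requested column chunks, not the full per-column metadata of every rowgroup, before issuing its reads. Merging adjacent pages keeps the number of I/O requests small when the selected rows span several pages.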

- Related Work:
There has been some similar work in this direction, but I think it operates more at the interface level than on direct performance tuning:
https://github.com/apache/arrow/issues/39392
https://github.com/apache/arrow/issues/38865
[Jira: Selective reading of rows for parquet file](https://issues.apache.org/jira/browse/ARROW-13517)

And a previous discussion around this with additional benchmarks:
https://github.com/apache/arrow/issues/38149

Having a fast random access reader would also be beneficial for fast reading of a file with predicate pushdowns or other applications where specific rows and columns are desired.


### Component(s)

C++
