Push down predicates to Parquet in-memory row group fetches #7348

@ethe

Description

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

So far, the Parquet Arrow reader provides two kinds of predicate mechanisms:

  • Row selection: Offers select and skip methods based on row offsets, which can be pushed down to in-memory row group fetches.
  • Row filter: Only applies to record batches that have been read into memory and cannot be pushed down to in-memory row group fetches. Therefore, it cannot be used to skip fetching column chunks / pages that do not match the filter conditions.

Although the Parquet format's statistics include min/max values, and an optionally enabled sparse index can be used to accelerate random reads and avoid unnecessary disk fetches, the row selection mechanism only supports operations based on row offsets. There is no API that lets users declare filter conditions that can be pushed down into the fetch path, and skipping column chunks / pages that do not match the filter conditions based on the index has not been implemented.
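To make the gap concrete, here is a minimal, standalone sketch of how per-page min/max statistics could be turned into select/skip runs before any page is fetched. `PageStats`, `Run`, and `prune_pages` are hypothetical names for illustration only and are not types or functions from the parquet crate; the sketch also assumes a single `i64` column and an inclusive range condition.

```rust
/// Illustrative stand-in for per-page statistics (not a parquet crate type).
#[derive(Debug, Clone, Copy)]
pub struct PageStats {
    pub min: i64,
    pub max: i64,
    pub row_count: usize,
}

/// A run of consecutive rows to read or to skip, mirroring the
/// select/skip idea behind row selection (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Run {
    pub row_count: usize,
    pub skip: bool,
}

/// Convert page statistics into select/skip runs for the inclusive range
/// [lo, hi]. A page is skipped only when its [min, max] interval cannot
/// overlap the range, so the result is conservative: selected pages may
/// still contain rows that do not match.
pub fn prune_pages(pages: &[PageStats], lo: i64, hi: i64) -> Vec<Run> {
    let mut runs: Vec<Run> = Vec::new();
    for page in pages {
        if page.row_count == 0 {
            continue;
        }
        let skip = page.max < lo || page.min > hi;
        // Merge with the previous run when it has the same select/skip kind.
        let merged = match runs.last_mut() {
            Some(last) if last.skip == skip => {
                last.row_count += page.row_count;
                true
            }
            _ => false,
        };
        if !merged {
            runs.push(Run { row_count: page.row_count, skip });
        }
    }
    runs
}
```

Because the runs are expressed purely in row offsets, output of this shape could in principle be handed to the existing row selection path; the missing piece described above is the step that derives it from the index.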

Describe the solution you'd like
Add a third kind of method alongside selection and filter. The new method would let users specify an exact match on a column's value or a range of values, and would use the index during in-memory row group fetches. This would avoid reading column chunks / pages that do not meet the filter conditions and improve random read efficiency.
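As a rough illustration of what such a declaration could look like, the sketch below models an exact-match or range predicate and uses it to decide which page byte ranges still need to be fetched. Everything here is an assumption made for illustration: `ValuePredicate`, `PageMeta`, and `ranges_to_fetch` do not exist in the parquet crate, the column is assumed to be `i64`, and pages are assumed to be listed in ascending offset order.

```rust
use std::ops::{Range, RangeInclusive};

/// Hypothetical value predicate: an exact match or an inclusive range.
#[derive(Debug, Clone, PartialEq)]
pub enum ValuePredicate {
    Eq(i64),
    Between(RangeInclusive<i64>),
}

impl ValuePredicate {
    /// Conservative check against min/max statistics. Returns false only
    /// when no value in [min, max] can satisfy the predicate.
    pub fn may_match(&self, min: i64, max: i64) -> bool {
        match self {
            ValuePredicate::Eq(v) => min <= *v && *v <= max,
            ValuePredicate::Between(r) => {
                !r.is_empty() && *r.start() <= max && *r.end() >= min
            }
        }
    }
}

/// Hypothetical page descriptor: byte location in the file plus statistics.
#[derive(Debug, Clone, Copy)]
pub struct PageMeta {
    pub offset: u64,
    pub length: u64,
    pub min: i64,
    pub max: i64,
}

/// Byte ranges that still need fetching for `pred`, with directly adjacent
/// ranges merged so that neighbouring pages become one request.
pub fn ranges_to_fetch(pages: &[PageMeta], pred: &ValuePredicate) -> Vec<Range<u64>> {
    let mut out: Vec<Range<u64>> = Vec::new();
    for p in pages.iter().filter(|p| pred.may_match(p.min, p.max)) {
        let end = p.offset + p.length;
        // Extend the previous range when this page starts where it ends.
        let extended = match out.last_mut() {
            Some(last) if last.end == p.offset => {
                last.end = end;
                true
            }
            _ => false,
        };
        if !extended {
            out.push(p.offset..end);
        }
    }
    out
}
```

Keeping the predicate declarative (a value or a range rather than an arbitrary closure) is what would allow it to be evaluated against statistics alone, before any column data is read.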

Are there alternatives?
Probably not: changing either row selection or row filter to support value matching would introduce a breaking change.

Metadata

Labels: documentation, enhancement, parquet