Skip to content

Changes to ParquetRecordBatchStream to support row filtering in DataFusion #2270

@thinkharderdev

Description

@thinkharderdev

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

The current (non-public) API for skipping records has some shortcomings when trying to implement predicate pushdown from DataFusion. It assumes that we have the entire VecDequeue<RowSelection> for an entire row group when construction the ParquetRecordBatchReader which makes doing streaming filtering a challenge.

Describe the solution you'd like

I wrote a prototype of row-level filtering in DataFusion that required a few changes to the API here. Two main things:

  1. Add a method to ParquetRecordBatchStream
pub fn next_selection(
        &mut self,
        selection: &mut VecDeque<RowSelection>,
    ) -> Option<ArrowResult<RecordBatch>>

which mimics poll_next but applies a selection to the next batch.

  1. Add a method to ParquetRecordBatchReader
pub fn next_selection(
        &mut self,
        selection: &mut VecDeque<RowSelection>,
    ) -> Option<ArrowResult<RecordBatch>>

which mimics next but applies a selection and, importantly, emits a single batch containing all selected rows in selection. The current implementation just emits the next RowSelection that selects and stops.

See https://github.com/coralogix/arrow-datafusion/blob/row-filtering/datafusion/core/src/physical_plan/file_format/parquet.rs for the (very hacky) protoype in DataFusion built on https://github.com/coralogix/arrow-rs/tree/row-filtering.

Describe alternatives you've considered

Additional context

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementAny new improvement worthy of a entry in the changelogparquetChanges to the parquet crate

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions