Is your feature request related to a problem or challenge? Please describe what you are trying to do.
The current (non-public) API for skipping records has some shortcomings when trying to implement predicate pushdown from DataFusion. It assumes that we have the entire VecDequeue<RowSelection> for an entire row group when construction the ParquetRecordBatchReader which makes doing streaming filtering a challenge.
Describe the solution you'd like
I wrote a prototype of row-level filtering in DataFusion that required a few changes to the API here. Two main things:
- Add a method to
ParquetRecordBatchStream
pub fn next_selection(
&mut self,
selection: &mut VecDeque<RowSelection>,
) -> Option<ArrowResult<RecordBatch>>
which mimics poll_next but applies a selection to the next batch.
- Add a method to
ParquetRecordBatchReader
pub fn next_selection(
&mut self,
selection: &mut VecDeque<RowSelection>,
) -> Option<ArrowResult<RecordBatch>>
which mimics next but applies a selection and, importantly, emits a single batch containing all selected rows in selection. The current implementation just emits the next RowSelection that selects and stops.
See https://github.com/coralogix/arrow-datafusion/blob/row-filtering/datafusion/core/src/physical_plan/file_format/parquet.rs for the (very hacky) protoype in DataFusion built on https://github.com/coralogix/arrow-rs/tree/row-filtering.
Describe alternatives you've considered
Additional context