[C++][Parquet] support passing a RowRange to RecordBatchReader  #38865

@binmahone

Describe the enhancement requested

Currently the GetRecordBatchReader API accepts row_group_indices and column_indices. It would be nice to extend the API with one more parameter: a row_ranges value indicating the subset of rows to retrieve. Given row_ranges, RecordBatchReader can skip unnecessary pages (by comparing row_ranges against the page index, when one exists) as well as unwanted rows within the pages it does read.

  • original:
  ::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices,
                                       const std::vector<int>& column_indices,
                                       std::shared_ptr<::arrow::RecordBatchReader>* out);
  • proposal:
  ::arrow::Status GetRecordBatchReader(
      const std::vector<int>& row_group_indices, const std::vector<int>& column_indices,
      const std::shared_ptr<std::map<int, RowRangesPtr>>& row_ranges_map,  // one row_ranges per row group
      std::shared_ptr<::arrow::RecordBatchReader>* out);

API clients can query the page index or another kind of index (e.g. an external secondary index) to construct the row_ranges.
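
As a rough illustration of how a client might derive row_ranges from page-level information, here is a minimal sketch. It assumes the client already knows the first-row index of each page in a row group (the Parquet offset index records this) and has decided, by whatever means, which pages may contain wanted rows. `RowRange` and `RowRangesFromPages` are hypothetical names made up for this example; they are not part of the Arrow or Parquet API, and the actual RowRanges type in the proposal may look different.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the proposed row ranges type: a closed interval
// [from, to] of row indices, relative to the start of one row group.
struct RowRange {
  int64_t from;
  int64_t to;
};

// Hypothetical helper: turns per-page match flags into merged row ranges.
//   page_first_row[i]  : index of the first row of page i within the row group
//   page_matches[i]    : whether page i may contain wanted rows
//   row_group_num_rows : total number of rows in the row group
std::vector<RowRange> RowRangesFromPages(const std::vector<int64_t>& page_first_row,
                                         const std::vector<bool>& page_matches,
                                         int64_t row_group_num_rows) {
  assert(page_first_row.size() == page_matches.size());
  std::vector<RowRange> ranges;
  for (std::size_t i = 0; i < page_first_row.size(); ++i) {
    if (!page_matches[i]) continue;  // this page can be skipped entirely
    const int64_t from = page_first_row[i];
    // A page ends right before the next page starts, or at the end of the row group.
    const int64_t next_start =
        (i + 1 < page_first_row.size()) ? page_first_row[i + 1] : row_group_num_rows;
    const int64_t to = next_start - 1;
    if (!ranges.empty() && ranges.back().to + 1 == from) {
      ranges.back().to = to;  // adjacent to the previous range: extend it
    } else {
      ranges.push_back({from, to});
    }
  }
  return ranges;
}
```

For example, with pages starting at rows 0, 100, 200 and 300 in a 350-row row group, and only the third page ruled out, this yields the two ranges [0, 199] and [300, 349]. A secondary index could produce finer ranges than whole pages; the reader would then also drop unwanted rows inside the pages it reads.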

Component(s)

C++
