Enable serialized_reader read specific Page by passing row ranges. #1977
Ted-Jiang wants to merge 1 commit into apache:master from
Conversation
For now this only supports primitive_reader.
parquet/src/arrow/async_reader.rs
Will add support in InMemoryReader.
parquet/src/file/metadata.rs
Need a 3-level vec: row_group -> column chunk -> page
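A hedged sketch of the three-level nesting described above: row group -> column chunk -> page. `PageIdx`, `sample_index`, and `lookup_first_row` are hypothetical stand-ins, not the real parquet-rs types.

```rust
// Hypothetical stand-in for the real page index entry type.
#[derive(Debug, Clone, PartialEq)]
struct PageIdx {
    first_row: i64,
}

// index[row_group][column_chunk][page]
type FileIndex = Vec<Vec<Vec<PageIdx>>>;

fn sample_index() -> FileIndex {
    // One row group with two column chunks. Note the chunks have
    // different page boundaries, which is why three levels are needed.
    vec![vec![
        vec![PageIdx { first_row: 0 }, PageIdx { first_row: 100 }],
        vec![PageIdx { first_row: 0 }, PageIdx { first_row: 150 }],
    ]]
}

fn lookup_first_row(index: &FileIndex, rg: usize, col: usize, page: usize) -> i64 {
    index[rg][col][page].first_row
}

fn main() {
    // Row group 0, column chunk 1, page 1 starts at row 150.
    assert_eq!(lookup_first_row(&sample_index(), 0, 1, 1), 150);
}
```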
I'm confused about the design of the new API described above. I think the column index reader should be a function of the parquet reader (parquet-rs); anyone who calls the parquet reader should get the benefit of this optimization when using a filter. From your implementation, I find the user needs to call the lower-level API and use the column index to calculate the
Yes, I think the row ranges are internal to
Yes, I agree it's better to keep it private. I think for now we make it pub; after full support for predicate pushdown, we will do it like in Java.
parquet/src/column/reader.rs
```rust
    }

    pub(crate) fn set_row_ranges(&mut self, row_ranges: RowRanges) {
        self.selected_row_ranges = Some(row_ranges);
```
We need these row_ranges for row alignment in the future.
```rust
        Ok(Box::new(page_reader))
    }

    fn get_column_page_reader_with_offset_index(
```
Because of the lack of test data in the parquet-testing subproject, I will add an end-to-end test after adding a test file there.
We can add a unit test using the writer API.
#1935 has been merged; the parquet file now contains the column index by default.
```rust
let iterator = FilePageIterator::new(column_index, Arc::clone(self))?;

fn column_chunks(
    &self,
    i: usize,
```
Suggested change:
```diff
-    i: usize,
+    column_index: usize,
```
```rust
let mut columns_indexes = vec![];
let mut offset_indexes = vec![];
for rg in &filtered_row_groups {
    let c = index_reader::read_columns_indexes(&chunk_reader, rg.columns())?;
```
If a schema has col1, col2, col3, ..., col8, and we just need col1 and col3, do we need to load the other, unneeded index data?
I've had a quick review; unfortunately I think this is missing a key detail. In particular, the arrow reader must read the same records from each of its columns. As written, this simply skips reading pruned pages from columns. There is no relationship between the page boundaries across columns within a parquet file, and therefore this will return different rows for each of the columns.
As described in #1791 (review), you will need to extract the row selection in addition to the page selection, and push this into RecordReader and ColumnValueDecoder. This will also make the API clearer, as we aren't going behind their back and skipping pages at the block level.
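To illustrate the point about page boundaries differing across columns, here is a hedged sketch (the function and the half-open `(start, end)` row-range representation are hypothetical, not the parquet-rs API): if each column independently keeps only its own surviving pages, the rows the reader can actually return in lockstep are the intersection of the columns' covered ranges.

```rust
// Intersect two sorted sets of half-open row ranges. This models the
// shared row selection that must be pushed down to every column so all
// columns yield the same records.
fn intersect(a: &[(usize, usize)], b: &[(usize, usize)]) -> Vec<(usize, usize)> {
    let mut out = Vec::new();
    for &(s1, e1) in a {
        for &(s2, e2) in b {
            let s = s1.max(s2);
            let e = e1.min(e2);
            if s < e {
                out.push((s, e));
            }
        }
    }
    out
}

fn main() {
    // col1's surviving pages cover rows [0, 100); col2's cover
    // [0, 50) and [80, 100) because its page boundaries differ.
    let col1 = [(0, 100)];
    let col2 = [(0, 50), (80, 100)];
    // The reader must restrict BOTH columns to the intersection.
    assert_eq!(intersect(&col1, &col2), vec![(0, 50), (80, 100)]);
}
```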
parquet/src/arrow/arrow_reader.rs
Suggested change:
```diff
-        self.skip_arrow_metadata = skip_arrow_metadata;
-        self
+        {
+            skip_arrow_metadata,
+            ..self
+        }
```
And same below
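The suggestion uses Rust's struct-update syntax: instead of mutating a field on `self` and returning it, the builder method constructs a new value with `..self`. A minimal sketch, where `Options` and `with_skip_arrow_metadata` are stand-ins, not the real `ArrowReaderOptions` API:

```rust
#[derive(Debug, Clone, PartialEq)]
struct Options {
    skip_arrow_metadata: bool,
    batch_size: usize,
}

impl Options {
    // Builder-style setter using struct-update syntax: the named field
    // is overridden, `..self` carries over every other field unchanged.
    fn with_skip_arrow_metadata(self, skip_arrow_metadata: bool) -> Self {
        Self {
            skip_arrow_metadata,
            ..self
        }
    }
}

fn main() {
    let opts = Options { skip_arrow_metadata: false, batch_size: 1024 };
    let opts = opts.with_skip_arrow_metadata(true);
    assert!(opts.skip_arrow_metadata);
    // Untouched fields are preserved.
    assert_eq!(opts.batch_size, 1024);
}
```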
parquet/src/arrow/arrow_reader.rs
Suggested change:
```diff
-        if self.options.selected_rows.is_some() {
-            let ranges = &self.options.selected_rows.as_ref().unwrap().clone();
+        if let Some(ranges) = self.options.selected_rows.as_ref()
```
parquet/src/arrow/arrow_reader.rs
```rust
    batch_size: usize,
) -> Result<Self::RecordReader>;

fn get_record_reader_by_columns_and_row_ranges(
```
Do we need this, or is `ArrowReaderOptions` sufficient?
```diff
     row_groups: Vec<RowGroupMetaData>,
-    page_indexes: Option<Vec<Index>>,
     offset_indexes: Option<Vec<Vec<PageLocation>>>,
+    page_indexes: Option<Vec<Vec<Index>>>,
```
Suggested change:
```diff
-    page_indexes: Option<Vec<Vec<Index>>>,
+    /// Page index for all pages in each column chunk
+    page_indexes: Option<Vec<Vec<Index>>>,
```
Or something like that, same for the below
parquet/src/file/reader.rs
```rust
/// get a serially readable slice of the current reader
/// This should fail if the slice exceeds the current bounds
fn get_multi_range_read(
```
As discussed on #1955, I'm not a fan of this; I would much rather the page reader read pages than skip byte ranges behind its back.
It also changes the semantics of how a column chunk is read, as it now buffers in memory an extra time.
```rust
for (start, length) in start_list.into_iter().zip(length_list.into_iter()) {
    combine_vec.extend(self.slice(start..start + length).to_vec());
}
let reader = Bytes::copy_from_slice(combine_vec.as_slice()).reader();
```
This adds an additional copy of all the page bytes, which is definitely not ideal...
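A std-only illustration of the reviewer's concern (the real code uses the `bytes` crate; `multi_range_read` and the `(start, length)` range representation are hypothetical): stitching several byte ranges into one contiguous buffer necessarily copies every selected byte a second time.

```rust
// Combine multiple (start, length) ranges of `data` into one owned
// buffer. Each selected byte is copied out of `data` into `combined`,
// so the selected pages exist in memory twice.
fn multi_range_read(data: &[u8], ranges: &[(usize, usize)]) -> Vec<u8> {
    let mut combined = Vec::new();
    for &(start, len) in ranges {
        combined.extend_from_slice(&data[start..start + len]); // copy!
    }
    combined
}

fn main() {
    let data: Vec<u8> = (0u8..10).collect();
    let out = multi_range_read(&data, &[(0, 2), (5, 3)]);
    assert_eq!(out, vec![0, 1, 5, 6, 7]);
    // `out` owns its bytes: a second copy of the page data now exists.
}
```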
```rust
// read from the parquet file, before the footer.
offset_index: Vec<PageLocation>,

// used to keep the needed page indexes.
```
We pass all the PageLocations and RowRanges into this struct, then do the filter logic.
If we have 5 pages and, in try_new, we filter out 2 pages, we keep the index numbers of the remaining 3 pages in this index_map for the final calculate_offset_range.
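The filter described above can be sketched as follows. This is a hedged, simplified model: `filter_pages` and the `(start, end)` half-open row-range representation of pages and RowRanges are hypothetical, not the actual struct's fields.

```rust
// Keep the index numbers of pages whose row range overlaps any
// selected row range; these survive into the index_map used later
// to calculate the byte offset ranges to read.
fn filter_pages(page_rows: &[(usize, usize)], selected: &[(usize, usize)]) -> Vec<usize> {
    let mut kept = Vec::new();
    for (i, &(ps, pe)) in page_rows.iter().enumerate() {
        // Half-open overlap test: [ps, pe) intersects [ss, se).
        if selected.iter().any(|&(ss, se)| ps < se && ss < pe) {
            kept.push(i);
        }
    }
    kept
}

fn main() {
    // 5 pages of 20 rows each; the selection covers rows [0, 60).
    let pages = [(0, 20), (20, 40), (40, 60), (60, 80), (80, 100)];
    let kept = filter_pages(&pages, &[(0, 60)]);
    // 2 pages filtered out, 3 page index numbers kept.
    assert_eq!(kept, vec![0, 1, 2]);
}
```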
Thanks @tustvold, you are right. Maybe I made the title confusing 😭. As you mentioned in #1791 (review):
This PR is only about the
As above, we need to pass the
I think small incremental PRs are a good approach. However, I have concerns with this specific PR.
I wonder if a plan of attack akin to the following might work:
Currently it feels like we're adding the high-level functionality before the necessary lower-level functionality exists, and this means low-level details, like the page delimiting, leak out of the high-level APIs. Edit: I'll try and stub out some APIs for what I mean over the next couple of days. This will also help me validate that my mental model checks out 😅
Got it, I will delete the
Tomorrow I will start with
After reading the code, I think if we want to skip pages in
I got worried about where I should pass the column index info.
Give me a day or so and I'll get a PR up with some stuff stubbed out; I think this exercise will help us both 😄
Sure! This really needs some time! 💪
Marking as a draft, as I think the approach in #1998 is what we will take forward.
@Ted-Jiang Can this be closed now?
Which issue does this PR close?
Closes #1976.
Rationale for this change
Partially supports #1792.
If we want to use the page index to get row ranges: first use `SerializedFileReader` to get the page index info, then use this index to get `row_ranges`, like below.
Finally, we can pass the `row_ranges` to the new API to read the parquet file (datafusion reads this way, but without `row_ranges`).
What changes are included in this PR?
One example: if we read col1 and col2 and apply a filter, to get the result we need to read row_ranges [20, 80].
For col1:
we need all data from page1, page2, page3.
For col2:
after this PR, we will filter out page2 and keep page0, page1:
as for page1: we need all of its data.
as for page0: we need part of its row_range (row alignment needed, TODO).
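The worked example above can be sketched in code. The page boundaries below are made up to match the described outcome (they are not from the PR), and pages are modeled as hypothetical half-open `(start, end)` row ranges with the selection being rows [20, 80).

```rust
// Return the index numbers of pages whose row range overlaps the
// selection. Pages that only partially overlap still need row
// alignment afterwards, as noted in the description.
fn overlapping_pages(pages: &[(usize, usize)], sel: (usize, usize)) -> Vec<usize> {
    pages
        .iter()
        .enumerate()
        .filter(|&(_, &(s, e))| s < sel.1 && sel.0 < e)
        .map(|(i, _)| i)
        .collect()
}

fn main() {
    let sel = (20, 80);
    // col1: all three pages overlap [20, 80), so all data is needed.
    let col1 = [(0, 30), (30, 60), (60, 100)];
    assert_eq!(overlapping_pages(&col1, sel), vec![0, 1, 2]);
    // col2: page2 ([80, 100)) does not overlap and is filtered out;
    // page1 is fully inside the selection, page0 only partially
    // overlaps, so part of its row_range must still be trimmed.
    let col2 = [(0, 40), (40, 80), (80, 100)];
    assert_eq!(overlapping_pages(&col2, sel), vec![0, 1]);
}
```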
Are there any user-facing changes?