feat:Add function for row alignment with page mask#1791
feat:Add function for row alignment with page mask#1791tustvold merged 13 commits intoapache:masterfrom
Conversation
Codecov Report
@@ Coverage Diff @@
## master #1791 +/- ##
==========================================
+ Coverage 83.39% 83.42% +0.02%
==========================================
Files 198 199 +1
Lines 56142 56427 +285
==========================================
+ Hits 46821 47075 +254
- Misses 9321 9352 +31
Continue to review full report at Codecov.
|
Co-authored-by: Liang-Chi Hsieh <[email protected]>
|
Thank you, I will review this tomorrow (GMT) |
| } | ||
|
|
||
| /// Return the row ranges `Vec(start, len)` of all the selected pages | ||
| pub fn compute_row_ranges( |
There was a problem hiding this comment.
How will this be used?
Do you need to make something like with_predicate in ReadOptionsBuilder to take a closure for filtering pages based on the ranges?
There was a problem hiding this comment.
Yes, basically add a closure called page_predicates in ReadOptionsBuilder like
Box<dyn FnMut(&[pageIndex], &[pageLocation], usize) -> &[bool]>
to generate mask then call this function compute_row_ranges .
I'm thinking about doing this at the end for testing with datafusion.
tustvold
left a comment
There was a problem hiding this comment.
Looking good, left some high level comments. Will review more thoroughly tomorrow
| num_rows: i64, | ||
| total_byte_size: i64, | ||
| schema_descr: SchemaDescPtr, | ||
| // Todo add filter result -> row range |
There was a problem hiding this comment.
The more I think about this the more I wonder whether the metadata structs are the right place to put the index information. They're parsed and interpreted separately from the main metadata, and so I think it makes sense for them to be stored separately?
There was a problem hiding this comment.
Yes, you are right. page index stored in file-meta level.
My thought is read less pageIndex after rowgroup filter
arrow-rs/parquet/src/file/serialized_reader.rs
Lines 211 to 224 in be38829
arrow-rs/parquet/src/file/serialized_reader.rs
Lines 246 to 249 in be38829
So i want to read index here and insert it into RowGroupMetaData.
It was just a simple idea at first, maybe we can find a better way in the process of implementation
There was a problem hiding this comment.
page index stored in file-meta level.
It isn't even file-meta level, it isn't part of the footer but stored as separate pages 😅
It was just a simple idea at first, maybe we can find a better way in the process of implementation
Provided we take care to ensure we keep things pub(crate) so we don't break APIs, this seems like a good strategy 👍
There was a problem hiding this comment.
It isn't even file-meta level, it isn't part of the footer but stored as separate pages 😅
yes separately from RowGroup, before the footer !😂
There was a problem hiding this comment.
I think this looks good, I am, however, a bit confused as to how it will be used... I had expected something like the following.
For each not-pruned row group:
- Use
Indexto identify covering set of rows based on predicates - Pass row selection down to RecordReader
- Add a
skip_next_pagetoPageReader - Add a
skip_valuestoColumnValueDecoder - Have RecordReader use a combination of the above to skip pages and rows during decode based on the row selection
This would also naturally extend to #1191
I'm not entirely sure where the datastructure added in this PR would fit into this?
Edit: is the intention to use this for the first step?
Co-authored-by: Raphael Taylor-Davies <[email protected]>
Basically, yes ! 👍
First step contains:
@tustvold I will follow up in PageReader |
Co-authored-by: Raphael Taylor-Davies <[email protected]>
Which issue does this PR close?
Closes #1790.
Rationale for this change
For now row group filter in datafusion pass a closure to arrow-rs
https://github.com/apache/arrow-datafusion/blob/585bc3a629b92ea7a86ebfe8bf762dbef4155710/datafusion/core/src/physical_plan/file_format/parquet.rs#L559-L562
So for page filter in datafusion, define filter_predicate
datafusion will send a
mask(&[bool]) to arrow-rs,then use mask call
compute_row_rangesto constructRowRanges: row ranges in a row-group (one col) if col is sorted vec size will be 1.For multi filter combine:
if there are two filters use
andconnect,useRowRanges::intersectionto get the final rowRange; two filters useorconnect,useRowRanges::unionto get the final rowRange.After this PR: i will working on
column_page_reader, enable read specific row ranges record.What changes are included in this PR?
Are there any user-facing changes?