Support reading PageIndex from parquet metadata, prepare for skipping pages at reading #1762
tustvold merged 13 commits into apache:master
Conversation
Codecov Report

```
@@            Coverage Diff             @@
##           master    #1762      +/-   ##
==========================================
+ Coverage   83.27%   83.40%   +0.12%
==========================================
  Files         195      198       +3
  Lines       55896    56145     +249
==========================================
+ Hits        46549    46825     +276
+ Misses       9347     9320      -27
```

Continue to review the full report at Codecov.
```rust
fn test_page_index_reader() {
    let test_file = get_test_file("data_index_bloom_encoding_stats.parquet");
```
Will add more tests after adding the file to parquet-testing.
Awesome, will review tomorrow 😀
tustvold left a comment:

This is a really nice start, awesome work, I've left a few thoughts 😄
parquet/src/file/page_index/index.rs (outdated)

```rust
} else {
    let min = min.as_slice();
    let max = max.as_slice();
    (Some(from_ne_slice::<T>(min)), Some(from_ne_slice::<T>(max)))
```
I can't find any documentation on what the endianness is supposed to be, but I suspect little endian. Using native endian feels off?
I found Int96 only supports from_ne_slice:
arrow-rs/parquet/src/data_type.rs
Lines 1197 to 1214 in 90cf78c
I will add more tests in a future PR after generating test data.
Aah - should be easy enough to fix FWIW
Thanks, fixed in d8bd4ce.
Could you tell me where I should have found that from_le_bytes is the one to use? Do all Java apps use little-endian for deserialization?
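To illustrate the point under discussion: decoding with `from_le_bytes` gives the same value on every platform, while native-endian decoding only happens to agree on little-endian hosts. A minimal sketch (the `decode_i32_le` helper is illustrative, not the crate's code):

```rust
// Decode a 4-byte little-endian value, as Parquet's plain encoding stores it.
// On a big-endian machine, from_ne_bytes would misread the same bytes.
fn decode_i32_le(bytes: &[u8]) -> i32 {
    let arr: [u8; 4] = bytes.try_into().expect("need exactly 4 bytes");
    i32::from_le_bytes(arr)
}

fn main() {
    // 1 encoded little-endian: least significant byte first.
    let bytes = [0x01, 0x00, 0x00, 0x00];
    assert_eq!(decode_i32_le(&bytes), 1);
}
```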
Co-authored-by: Raphael Taylor-Davies <[email protected]>
```rust
#[derive(Debug, Clone, PartialEq)]
pub enum Index {
    BOOLEAN(BooleanIndex),
```
CamelCase is generally preferred to shouty case
I think it should align with `Type`:
Lines 45 to 53 in 90cf78c
Fair, I mean to change that but it has been low on my priority list 😅
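The naming trade-off discussed above can be sketched side by side; both enums are hypothetical stand-ins, not the crate's types. Shouty-case variants need a lint allowance, while CamelCase is what rustc expects by default:

```rust
// Shouty-case variants (matching the existing `Type` enum style) trigger
// the non_camel_case_types lint unless explicitly allowed.
#[allow(non_camel_case_types)]
#[derive(Debug)]
pub enum ShoutyIndex {
    BOOLEAN,
}

// The lint-clean CamelCase form the reviewer suggests.
#[derive(Debug)]
pub enum CamelIndex {
    Boolean,
}

fn main() {
    assert_eq!(format!("{:?}", ShoutyIndex::BOOLEAN), "BOOLEAN");
    assert_eq!(format!("{:?}", CamelIndex::Boolean), "Boolean");
}
```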
```rust
/// Creates file reader from a Parquet file with page Index.
/// Returns error if Parquet file does not exist or is corrupt.
pub fn new_with_page_index(chunk_reader: R) -> Result<Self> {
```
I see you've removed the new_with_options approach; I think I would prefer you revert that and remove this additional constructor instead. With this, I'm not sure how you would set ReadOptions and also enable the page index.
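For context, an options-based constructor lets page-index reading compose with other read options instead of needing a dedicated constructor. A minimal sketch of what the reviewer suggests; the names (`ReadOptions`, `ReadOptionsBuilder`, `with_page_index`) mirror the discussion but are illustrative, not the crate's actual API:

```rust
// Hypothetical options struct: flags travel together, so adding a new
// option later does not require another constructor.
#[derive(Default, Debug)]
pub struct ReadOptions {
    pub enable_page_index: bool,
    // other read options (e.g. row-group predicates) would live here too
}

impl ReadOptions {
    pub fn builder() -> ReadOptionsBuilder {
        ReadOptionsBuilder::default()
    }
}

#[derive(Default)]
pub struct ReadOptionsBuilder {
    enable_page_index: bool,
}

impl ReadOptionsBuilder {
    // Enable reading the page index alongside the footer metadata.
    pub fn with_page_index(mut self) -> Self {
        self.enable_page_index = true;
        self
    }

    pub fn build(self) -> ReadOptions {
        ReadOptions {
            enable_page_index: self.enable_page_index,
        }
    }
}

fn main() {
    let opts = ReadOptions::builder().with_page_index().build();
    assert!(opts.enable_page_index);
}
```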
```rust
let mut data = vec![0; length];
reader.read_exact(&mut data)?;

let mut start = 0;
```
In read_pages_locations you simply reuse the same cursor to continue where the previous one left off; here you instead explicitly slice the data buffer and feed this into TCompactInputProtocol. I couldn't see a particular reason why there were two different approaches to reading the same "style" of data.
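The two styles being contrasted can be sketched with a toy fixed-size record standing in for a thrift message (the record format and function names are illustrative, not the crate's code):

```rust
use std::io::{Cursor, Read};

// Toy stand-in for a thrift message: a 4-byte little-endian integer.
fn read_record(reader: &mut impl Read) -> std::io::Result<i32> {
    let mut buf = [0u8; 4];
    reader.read_exact(&mut buf)?;
    Ok(i32::from_le_bytes(buf))
}

// Style 1: one cursor, each decode continues where the last one stopped,
// relying on the messages being self-delimiting.
fn decode_with_cursor(data: &[u8], count: usize) -> Vec<i32> {
    let mut cursor = Cursor::new(data);
    (0..count).map(|_| read_record(&mut cursor).unwrap()).collect()
}

// Style 2: slice the buffer by explicit lengths and decode each slice.
fn decode_with_slices(data: &[u8], lengths: &[usize]) -> Vec<i32> {
    let mut start = 0;
    lengths
        .iter()
        .map(|&len| {
            let v = read_record(&mut &data[start..start + len]).unwrap();
            start += len;
            v
        })
        .collect()
}

fn main() {
    let data: Vec<u8> = [1i32, 2, 3].iter().flat_map(|v| v.to_le_bytes()).collect();
    // Both styles decode the same bytes to the same values.
    assert_eq!(decode_with_cursor(&data, 3), vec![1, 2, 3]);
    assert_eq!(decode_with_slices(&data, &[4, 4, 4]), vec![1, 2, 3]);
}
```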
tustvold left a comment:

I think this is a good step forward, and can be iterated on in subsequent PRs. Thank you 😀
parquet/src/data_type.rs (outdated)

```diff
-fn from_le_bytes(_bs: Self::Buffer) -> Self {
-    unimplemented!()
+fn from_le_bytes(bs: Self::Buffer) -> Self {
+    Self::from_ne_bytes(bs)
```
This will be incorrect on any platform that isn't little endian...
I am not familiar with this; is 2b5264b the right fix? How should I test it?
It looks correct to me, it is a bit funky because Int96 is internally represented as a [u32; 3] with the least significant u32 first (i.e. little endian).
So on a big endian machine, you need to decode to big endian u32, which are then stored in a little endian array 🤯
In practice Int96 is deprecated, and I'm not sure there are any big endian platforms supported by this crate, but good to be thorough 👍.
> how should I test this
It should be covered by the existing tests, I'll create a ticket for clarifying how this is being handled crate-wide
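The layout described above (three u32 words with the least significant word first) can be sketched as follows; `int96_from_le_bytes` is a hypothetical helper, not the crate's API:

```rust
// Int96 as a [u32; 3] with the least significant word first.
// Decoding each word with from_le_bytes yields the same array on
// both big- and little-endian hosts.
fn int96_from_le_bytes(bytes: [u8; 12]) -> [u32; 3] {
    let mut out = [0u32; 3];
    for (i, chunk) in bytes.chunks_exact(4).enumerate() {
        out[i] = u32::from_le_bytes(chunk.try_into().unwrap());
    }
    out
}

fn main() {
    let mut bytes = [0u8; 12];
    bytes[0] = 1; // least significant byte of the least significant word
    assert_eq!(int96_from_le_bytes(bytes), [1, 0, 0]);
}
```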
```rust
let mut data = vec![0; length];
reader.read_exact(&mut data)?;

let mut start = 0;
```
But surely the thrift decoder will only read what it needs, i.e. regardless of the variable length nature of the messages it will only read the corresponding bytes? It's not a big deal, I just think we should consistently either rely on the data to be "self-delimiting" or use the lengths to decode slices
I intend to merge this once CI clears
alamb left a comment:

I only skimmed this PR but it looks really nice -- thank you @Ted-Jiang and @tustvold
🎉
Which issue does this PR close?

Closes #1761.

Rationale for this change

Get this info in memory so we can apply page-level filters in the future.

What changes are included in this PR?

Add an option to read the page index in parquet/src/file/serialized_reader.rs.

Are there any user-facing changes?

In parquet-testing, only data_index_bloom_encoding_stats.parquet has one row group with a page index. I will generate a test file based on alltypes_plain.parquet (this file does not contain any page index) in the parquet-testing repo, and support multiple row groups in the next PR.
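To show where this is headed, the page-level filtering this PR prepares for can be sketched with per-page min/max statistics: pages whose range cannot contain the predicate value are skipped entirely. The `PageStats` type and `pages_to_read` function are illustrative assumptions, not the crate's API:

```rust
// Hypothetical per-page statistics, as would be read from the column index.
#[derive(Debug)]
struct PageStats {
    min: i32,
    max: i32,
}

// Return indices of pages that might contain `value`; all others can be
// skipped without decoding their data.
fn pages_to_read(stats: &[PageStats], value: i32) -> Vec<usize> {
    stats
        .iter()
        .enumerate()
        .filter(|(_, s)| s.min <= value && value <= s.max)
        .map(|(i, _)| i)
        .collect()
}

fn main() {
    let stats = vec![
        PageStats { min: 0, max: 9 },
        PageStats { min: 10, max: 19 },
        PageStats { min: 20, max: 29 },
    ];
    // Only the second page can contain 15; the other two are skipped.
    assert_eq!(pages_to_read(&stats, 15), vec![1]);
}
```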