Use ParquetMetaDataReader to load page indexes in SerializedFileReader::new_with_options #6506
Merged
alamb merged 1 commit into apache:master on Oct 7, 2024
etseidl
commented
Oct 3, 2024
  #[test]
  fn test_file_reader_filter_row_groups_and_range() -> Result<()> {
-     let test_file = get_test_file("alltypes_plain.parquet");
+     let test_file = get_test_file("alltypes_tiny_pages.parquet");
Contributor
Author
Switch to a test file that contains page indexes.
etseidl
commented
Oct 3, 2024
    let mut metadata = metadata_builder.build();

    // If page indexes are desired, build them with the filtered set of row groups
Contributor
Author
The key here is that we're only pulling page indexes for row groups that survived the predicate filtering above.
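The ordering described in this comment (filter first, then load indexes only for the survivors) can be sketched roughly as below. This is a simplified illustration and not the code from the PR: `RowGroupMeta`, `filter_row_groups`, `page_index_targets`, and the `enable_page_index` flag are hypothetical stand-ins rather than the parquet crate's types or API.

```rust
/// Hypothetical stand-in for row group metadata (not the parquet crate's type).
#[derive(Debug, Clone, PartialEq)]
struct RowGroupMeta {
    ordinal: usize,
    num_rows: u64,
}

/// Step 1: keep only the row groups that the predicate accepts.
fn filter_row_groups<F>(row_groups: &[RowGroupMeta], keep: F) -> Vec<RowGroupMeta>
where
    F: Fn(&RowGroupMeta) -> bool,
{
    row_groups.iter().filter(|rg| keep(*rg)).cloned().collect()
}

/// Step 2: list the row groups whose page indexes would be loaded.
/// Only the filtered set is considered, so nothing is read and then discarded.
fn page_index_targets(filtered: &[RowGroupMeta], enable_page_index: bool) -> Vec<usize> {
    if !enable_page_index {
        return Vec::new();
    }
    filtered.iter().map(|rg| rg.ordinal).collect()
}
```

With three row groups and a predicate that rejects the second, only ordinals 0 and 2 would have their page indexes loaded.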
mbrobbel
approved these changes
Oct 4, 2024
alamb
approved these changes
Oct 6, 2024
        metadata.row_group(0).column(0).data_page_offset()
    );

    // read non-contiguous row groups
Contributor
This is great -- thank you for adding these tests for predicate
    metadata_builder = metadata_builder
        .set_column_index(Some(columns_indexes))
        .set_offset_index(Some(offset_indexes));
    let mut reader =
Contributor
this makes sense to switch the metadata.
Contributor
Thanks again @etseidl
Which issue does this PR close?
Closes #6491.
Rationale for this change
Reduce the file I/O needed to read in the page indexes.
What changes are included in this PR?
If page indexes are needed, create a new ParquetMetaDataReader using the filtered set of row groups. This will do a single read to get the bytes necessary to instantiate the needed page index structures, rather than two reads per row group. An alternative approach would be to read all the needed structures up front and then prune them, but that would mean reading page indexes that will just be thrown away.
Are there any user-facing changes?
No
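To illustrate the "single read" idea from the description, here is a minimal sketch that computes one byte range covering every column index and offset index of the surviving row groups. It assumes each column chunk records where its two index structures live in the file. `ChunkIndexRanges` and `covering_range` are hypothetical names for this example, not the parquet crate's API, and the actual reader may compute its range differently.

```rust
use std::ops::Range;

/// Hypothetical per-column-chunk locations of the page index structures,
/// expressed as byte ranges in the file. `None` means the index is absent.
struct ChunkIndexRanges {
    column_index: Option<Range<u64>>,
    offset_index: Option<Range<u64>>,
}

/// Compute the smallest single byte range that covers every column index and
/// offset index of the given (already filtered) row groups.
/// Returns `None` when no chunk has any page index.
fn covering_range(row_groups: &[Vec<ChunkIndexRanges>]) -> Option<Range<u64>> {
    let mut covering: Option<Range<u64>> = None;
    for chunk in row_groups.iter().flatten() {
        // Visit whichever of the two index ranges are present for this chunk.
        for r in chunk.column_index.iter().chain(chunk.offset_index.iter()) {
            covering = Some(match covering {
                None => r.clone(),
                Some(c) => c.start.min(r.start)..c.end.max(r.end),
            });
        }
    }
    covering
}
```

One fetch of the returned range replaces the two reads per row group mentioned above. In this sketch, if the surviving row groups are non-contiguous, the covering range may also span index bytes that belong to row groups that were filtered out.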