Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
As @etseidl pointed out in https://github.com/apache/arrow-rs/pull/6466/files#r1778966728:
> This should be a bit more efficient since read_page_indexes will fetch the necessary bytes from the file in a single read, rather than 2 reads per row group.
We can use the new ParquetMetaDataReader API to read the page indexes more efficiently (with fewer I/Os, for example).
However, when I tried to implement it, I found what appears to be a subtle bug: the predicates would have been ignored (https://github.com/apache/arrow-rs/pull/6466/files#r1783526090), yet no tests failed.
Describe the solution you'd like
I would like to:
- Reduce the I/Os needed to read page indexes in SerializedFileReader::new_with_options, and clean up the code to use the new ParquetMetaDataReader
- Add test coverage for reader predicates and page index
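To illustrate the I/O reduction, here is a minimal, standard-library-only sketch of the idea of fetching every row group's index bytes in one read and slicing the result, instead of issuing two reads per row group. It is not the parquet crate's implementation: `IndexRanges`, `CountingReader`, `covering_range`, and `read_all_indexes` are hypothetical names invented for this example, and it assumes the index structures sit close together in the file so that one covering range does not over-fetch much.

```rust
use std::io::{Read, Seek, SeekFrom};
use std::ops::Range;

/// Hypothetical stand-in for the byte ranges of one row group's
/// column index and offset index (not a type from the parquet crate).
struct IndexRanges {
    column_index: Range<u64>,
    offset_index: Range<u64>,
}

/// Hypothetical wrapper that counts how many range reads are issued.
struct CountingReader<R> {
    inner: R,
    reads: usize,
}

impl<R: Read + Seek> CountingReader<R> {
    /// Seek to `range.start` and read exactly `range.end - range.start` bytes.
    fn read_range(&mut self, range: &Range<u64>) -> std::io::Result<Vec<u8>> {
        self.reads += 1;
        self.inner.seek(SeekFrom::Start(range.start))?;
        let mut buf = vec![0u8; (range.end - range.start) as usize];
        self.inner.read_exact(&mut buf)?;
        Ok(buf)
    }
}

/// Smallest single range covering every index range, or `None` if
/// there are no row groups.
fn covering_range(groups: &[IndexRanges]) -> Option<Range<u64>> {
    groups
        .iter()
        .flat_map(|g| [g.column_index.clone(), g.offset_index.clone()])
        .reduce(|a, b| a.start.min(b.start)..a.end.max(b.end))
}

/// Issue one read for the covering range, then slice out the
/// (column index, offset index) bytes for each row group.
fn read_all_indexes<R: Read + Seek>(
    reader: &mut CountingReader<R>,
    groups: &[IndexRanges],
) -> std::io::Result<Vec<(Vec<u8>, Vec<u8>)>> {
    let cover = match covering_range(groups) {
        Some(cover) => cover,
        None => return Ok(Vec::new()),
    };
    let bytes = reader.read_range(&cover)?;
    // Translate an absolute file range into an offset within `bytes`.
    let slice = |r: &Range<u64>| {
        bytes[(r.start - cover.start) as usize..(r.end - cover.start) as usize].to_vec()
    };
    Ok(groups
        .iter()
        .map(|g| (slice(&g.column_index), slice(&g.offset_index)))
        .collect())
}
```

With two row groups, the per-row-group approach would call `read_range` four times, while `read_all_indexes` calls it once. The trade-off is that any gap between the index ranges is fetched too.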
Describe alternatives you've considered
Leave as is.
Additional context