Use Parquet OffsetIndex to prune IO with RowSelection#2473
Use Parquet OffsetIndex to prune IO with RowSelection#2473tustvold merged 8 commits intoapache:masterfrom
Conversation
There was a problem hiding this comment.
Pretty sure this was a bug
There was a problem hiding this comment.
Yup, I remember fixing it in a PR that got abandoned at some point
|
Thank you for this, I'll review first thing tomorrow. I like that you've found a way to allow sharing the page muxing logic in SerializedPageReader 👍 |
tustvold
left a comment
There was a problem hiding this comment.
Looking good, will review again once rebased, but mostly minor nits
object_store/src/local.rs
Outdated
There was a problem hiding this comment.
Sorry, this test fails on my machine because of a permissions issue. Meant to revert before submitting.
There was a problem hiding this comment.
This doesn't appear to be being used, and so I think can go. I've been trying to avoid exposing the internal layout of this type externally
There was a problem hiding this comment.
Could we get a test where the final PageLocation is selected?
There was a problem hiding this comment.
| pub fn page_mask( | |
| pub(crate) fn page_mask( |
I don't think this likely to be useful outside the crate
There was a problem hiding this comment.
It seems strange to me that this method would return Vec<Range<usize>> when it is called page_mask, and the caller clearly already has &[PageLocation] that can easily be combined with the mask...
There was a problem hiding this comment.
The idea was to just do it it one shot to avoid iterating over the locations again to get the ranges, but perhaps it's better to avoid overloading
Edit: Looking at this again, the mask was part of a previous design that is no longer relevant, so I think we can just rename this and only return the ranges.
parquet/src/arrow/async_reader.rs
Outdated
There was a problem hiding this comment.
A comment explaining what these are would go a long way
parquet/src/arrow/async_reader.rs
Outdated
There was a problem hiding this comment.
The line above allows offset to be greater than start, but this won't return the correct slice in such a case?
parquet/src/arrow/async_reader.rs
Outdated
There was a problem hiding this comment.
Perhaps we should do an exact match? I think this should work?
There was a problem hiding this comment.
I wasn't sure whether there is ever a case in which we fetch some subset of the page. Thinking about it more I don't believe that would ever be a valid use case.
parquet/src/arrow/async_reader.rs
Outdated
There was a problem hiding this comment.
As data is sorted, you could consider https://doc.rust-lang.org/std/primitive.slice.html#method.binary_search or friends
parquet/src/arrow/async_reader.rs
Outdated
There was a problem hiding this comment.
It is worth noting this will currently represent a performance regression, as avoided a copy - https://github.com/apache/arrow-rs/pull/2473/files#diff-f6b1a106d47a16504d4a16d57a6632872ddf596f337ac0640a13523dccc2d4d4L615
I will add a get_bytes method to ChunkReader to avoid this
90d57fa to
b28ea09
Compare
|
Docs appear to be failing |
|
Benchmark runs are scheduled for baseline = 42e9531 and contender = 2185ce2. 2185ce2 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
Which issue does this PR close?
Closes #2426.
Rationale for this change
When we have a
RowSelectionand anOffsetIndexwe can reduce IO by fetching only the pages selected.This also builds on #2464 to remove
InMemoryColumnChunkand unify everything to useSerializedPageReaderWhat changes are included in this PR?
We can represent pre-fetched column chunks as either a "dense" encoding (just
Bytes) or a "sparse" encoding which contains only the pages relevant to a givenRowSelection.Also remove
InMemoryColumnChunkto help unify the sync and async parquet paths.Are there any user-facing changes?