-
Notifications
You must be signed in to change notification settings - Fork 1.1k
PoC: Add Predicate Pushdown to Parquet Reader for Optimized Query Performance #7360
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @ndemir (and @ethe ) -- I think the idea in this PR of evaluating predicates on the metadata is a good one and important for performance
Instead of adding a new API, would you be willign to make the code in this PR into an example of how to apply predicates to parquet metadata?
I think it is possible to implement this feature without modifing the parquet reader and using the currently available APIs , for example by filtering the row groups via ArrowReaderBuilder::with_row_groups and pages with ArrowReaderBuilder::with_row_selection
This is what DataFusion does, for example -- you can read more about how it works in https://datafusion.apache.org/blog/2025/03/20/parquet-pruning/
It looks to me like PredicatePushdown contains a basic expression evaluator -- but to use this a consuming library would have to translate their predicates into this predicate format and would be restricted to the predicates supported by this library
That being said, as you show here it is non trivial to implement row group / page filtering.
| } | ||
| } | ||
|
|
||
| /// Convert a Parquet statistic to an Arrow array |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have tried to implement this in third-party libs, but arrow-rs lacks enough public APIs (for example, users can not construct
That's what I want to point out, this demand is general enough to lots of users, but it is not that easy to be realized, and also exposes lots of internal details, if parquet contains a first-party BTW, I'd love to contribute to, if we agree on a solution. |
You can certainly access and use Sbbf outside the parquet crate, for example Datafusion does to to prune out row groups and data pages here: What is the use case for constructing
Yes indeed it is not trivial to implement a fast parquet reader integrated with a query engine
What do you mean by "TableProvider" ? If you are using DataFusion already, perhaps you can use the built in parquet reader ( |
|
My biggest concern here is adding more code to maintain as part of this crate that may not be widely used |
I can not find the way to get
I do not use datafusion (not yet), if there is a first-party scan method of parquet async reader with prediction/projection/limitation pushdown, that is what I need. I'd like to say
Both Chroma(@HammadB) and Tonbo run into this issue. |
Perhaps you can propose an API to do so (perhaps on ParquetMetadataReader)?
I agree implementing a table provider like interface in the parquet crate is likely not a good idea
If the issue is that the public API of the parquet-rs crate doesn't allow you to implement pushdowns I agree we should extend the API to address whatever you are having trouble doing If the issue is that it is complex to implement parquet predicate pushdown, I am not sure that is a great fit for this crate because the details of implementing predicate pushdown vary significantly from system to system. For example
It isn't clear to me where to draw the line between predicate evaluation and a full query engine. Maybe you and @HammadB can create some other crate (parquet-predicate-pushdown) implementing the specific pushdown APIs that you need. |
|
@alamb I agree with the desire to separate these things, thanks for explaining. |
|
Thanks @HammadB -- I tried to improve the documentation here too, let me know what you think: |
Which issue does this PR close?
This is NOT closing an issue yet.
I am opening this PR to illustrate how the #7348 can be solved. I also added some details in #7348.
Rationale for this change
I run tests arrow::arrow_reader::tests::test_predicate_pushdown_vs_row_filter and I can see the performance increase clearly.
TEST.1
=== PERFORMANCE COMPARISON ===
TEST.2
=== PERFORMANCE COMPARISON ===
What changes are included in this PR?
In this PR, we have a) filtering out the row groups that we do not need b) then creating the predicates and adding them to RowFilter
Are there any user-facing changes?
Yes, we will have, once the full implementation is completed.
This PR is just to show a possible solution for #7348