
Conversation

@ndemir

@ndemir ndemir commented Mar 30, 2025

Which issue does this PR close?

This is NOT closing an issue yet.
I am opening this PR to illustrate how #7348 can be solved. I have also added some details in #7348.

Rationale for this change

I ran the test arrow::arrow_reader::tests::test_predicate_pushdown_vs_row_filter and can see the performance increase clearly.

TEST.1

=== PERFORMANCE COMPARISON ===

Filter type          Time          Row count
Row Filter           21.144189ms   590
Predicate Pushdown   15.067947ms   590
Improvement          1.40x

TEST.2

=== PERFORMANCE COMPARISON ===

Filter type          Time          Row count
Row Filter           26.991091ms   80
Predicate Pushdown   7.604246ms    80
Improvement          3.55x

What changes are included in this PR?

In this PR, we a) filter out the row groups that we do not need, and b) then create the predicates and add them to the RowFilter.
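
A rough sketch of these two steps against the crate's existing public API (not the actual code in this PR); the column index 0 and the threshold 100 are placeholders:

```rust
use std::fs::File;

use arrow::array::{Int64Array, RecordBatch};
use arrow::compute::kernels::cmp::gt;
use parquet::arrow::arrow_reader::{
    ArrowPredicateFn, ParquetRecordBatchReaderBuilder, RowFilter,
};
use parquet::arrow::ProjectionMask;
use parquet::file::statistics::Statistics;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let builder = ParquetRecordBatchReaderBuilder::try_new(File::open("data.parquet")?)?;

    // (a) keep only row groups whose statistics show column 0 can exceed the threshold
    let keep: Vec<usize> = builder
        .metadata()
        .row_groups()
        .iter()
        .enumerate()
        .filter(|(_, rg)| match rg.column(0).statistics() {
            Some(Statistics::Int64(s)) => s.max_opt().map_or(true, |max| *max > 100),
            _ => true, // no usable statistics: cannot prune, keep the row group
        })
        .map(|(i, _)| i)
        .collect();

    // (b) evaluate the same predicate exactly, row by row, via a RowFilter
    let mask = ProjectionMask::leaves(builder.parquet_schema(), [0]);
    let predicate = ArrowPredicateFn::new(mask, |batch: RecordBatch| {
        gt(batch.column(0), &Int64Array::new_scalar(100))
    });

    let reader = builder
        .with_row_groups(keep)
        .with_row_filter(RowFilter::new(vec![Box::new(predicate)]))
        .build()?;

    for batch in reader {
        println!("{} matching rows", batch?.num_rows());
    }
    Ok(())
}
```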

Are there any user-facing changes?

Yes, there will be, once the full implementation is completed.
This PR is just to show a possible solution for #7348

@github-actions github-actions bot added the parquet label (Changes to the parquet crate) Mar 30, 2025
Contributor

@alamb alamb left a comment


Thank you @ndemir (and @ethe) -- I think the idea in this PR of evaluating predicates on the metadata is a good one and important for performance.

Instead of adding a new API, would you be willing to make the code in this PR into an example of how to apply predicates to parquet metadata?

I think it is possible to implement this feature without modifying the parquet reader, using the currently available APIs, for example by filtering the row groups via ArrowReaderBuilder::with_row_groups and pages with ArrowReaderBuilder::with_row_selection
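
For example, a minimal sketch of this approach, where the row group indices and the skip/select counts are placeholders standing in for whatever a metadata evaluation would produce:

```rust
use std::fs::File;

use parquet::arrow::arrow_reader::{
    ParquetRecordBatchReaderBuilder, RowSelection, RowSelector,
};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let builder = ParquetRecordBatchReaderBuilder::try_new(File::open("data.parquet")?)?;

    // Row groups to keep, e.g. decided by evaluating the predicate against
    // the row group statistics in builder.metadata() (indices are placeholders).
    let keep_row_groups = vec![0, 2];

    // Within the kept row groups, skip/select runs of rows, e.g. decided from
    // the page index (the counts below are placeholders).
    let selection = RowSelection::from(vec![
        RowSelector::skip(1_000),
        RowSelector::select(500),
    ]);

    let reader = builder
        .with_row_groups(keep_row_groups)
        .with_row_selection(selection)
        .build()?;

    for batch in reader {
        println!("read {} rows", batch?.num_rows());
    }
    Ok(())
}
```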

This is what DataFusion does, for example -- you can read more about how it works in https://datafusion.apache.org/blog/2025/03/20/parquet-pruning/

It looks to me like PredicatePushdown contains a basic expression evaluator -- but to use this, a consuming library would have to translate its predicates into this predicate format and would be restricted to the predicates supported by this library.

That being said, as you show here, it is non-trivial to implement row group / page filtering.

@ethe
Contributor

ethe commented Mar 31, 2025

I think it is possible to implement this feature without modifying the parquet reader, using the currently available APIs

I have tried to implement this in third-party libs, but arrow-rs lacks enough public APIs (for example, users cannot construct Sbbf outside of parquet), and the related APIs are not convenient enough to use publicly at the moment.

That being said, as you show here, it is non-trivial to implement row group / page filtering.

That's what I want to point out: this need is common to lots of users, but it is not that easy to realize and it also exposes lots of internal details. If parquet contained a first-party TableProvider implementation, that would be good for me.

BTW, I'd love to contribute too, if we agree on a solution.

@alamb
Contributor

alamb commented Mar 31, 2025

I think it is possible to implement this feature without modifying the parquet reader, using the currently available APIs

I have tried to implement this in third-party libs, but arrow-rs lacks enough public APIs (for example, users cannot construct Sbbf outside of parquet), and the related APIs are not convenient enough to use publicly at the moment.

You can certainly access and use Sbbf outside the parquet crate; for example, DataFusion does so to prune out row groups and data pages here:

https://github.com/apache/datafusion/blob/6d5e00ad3f8e53f7252cb1d3c72a6c7f28c1aed6/datafusion/datasource-parquet/src/row_group_filter.rs#L236-L235

What is the use case for constructing Sbbf? I think it would be fine to make that public in the crate.
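
For reference, a minimal sketch of probing an existing bloom filter on the synchronous read path (assuming the file has a bloom filter written for leaf column 0; the value 42 is a placeholder):

```rust
use std::fs::File;

use parquet::file::properties::ReaderProperties;
use parquet::file::reader::{FileReader, SerializedFileReader};
use parquet::file::serialized_reader::ReadOptionsBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Bloom filters are not loaded by default; opt in via the reader properties.
    let options = ReadOptionsBuilder::new()
        .with_reader_properties(
            ReaderProperties::builder()
                .set_read_bloom_filter(true)
                .build(),
        )
        .build();
    let reader =
        SerializedFileReader::new_with_options(File::open("data.parquet")?, options)?;

    for i in 0..reader.metadata().num_row_groups() {
        let row_group = reader.get_row_group(i)?;
        // None if no bloom filter was written for this column chunk.
        if let Some(sbbf) = row_group.get_column_bloom_filter(0) {
            // A negative answer is definitive: the value is absent from this
            // row group, so it can be skipped entirely.
            if !sbbf.check(&42_i64) {
                println!("row group {i} cannot contain 42; prune it");
            }
        }
    }
    Ok(())
}
```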

That being said, as you show here, it is non-trivial to implement row group / page filtering.

That's what I want to point out: this need is common to lots of users, but it is not that easy to realize and it also exposes lots of internal details.

Yes indeed it is not trivial to implement a fast parquet reader integrated with a query engine

If parquet contained a first-party TableProvider implementation, that would be good for me.

What do you mean by "TableProvider"? If you are using DataFusion already, perhaps you can use the built-in parquet reader (ListingTableProvider) that already has all these optimizations.

@alamb
Contributor

alamb commented Mar 31, 2025

My biggest concern here is adding more code to maintain as part of this crate that may not be widely used

@ethe
Contributor

ethe commented Mar 31, 2025

What is the use case for constructing Sbbf?

I cannot find a way to get Sbbf instances in the async read path of the parquet crate; this only works with SerializedRowGroupReader, but it is synchronous, so I have to construct it manually from bytes::Bytes.

If you are using DataFusion already

I do not use DataFusion (not yet). If there were a first-party scan method on the parquet async reader with predicate/projection/limit pushdown, that is what I need. I'd say TableProvider provides similar semantics to such an API, but I'm not sure it is the best choice for a first-party implementation in parquet.

My biggest concern here is adding more code to maintain as part of this crate that may not be widely used

Both Chroma (@HammadB) and Tonbo run into this issue.

@alamb
Contributor

alamb commented Mar 31, 2025

What is the use case for constructing Sbbf?

I cannot find a way to get Sbbf instances in the async read path of the parquet crate; this only works with SerializedRowGroupReader, but it is synchronous, so I have to construct it manually from bytes::Bytes.

Perhaps you can propose an API to do so (perhaps on ParquetMetadataReader)?

I do not use DataFusion (not yet). If there were a first-party scan method on the parquet async reader with predicate/projection/limit pushdown, that is what I need. I'd say TableProvider provides similar semantics to such an API, but I'm not sure it is the best choice for a first-party implementation in parquet.

I agree implementing a TableProvider-like interface in the parquet crate is likely not a good idea.
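
For what it is worth, projection and limit pushdown are already exposed on the async ParquetRecordBatchStreamBuilder; a rough sketch, with a placeholder file name and column indices:

```rust
use futures::TryStreamExt;
use parquet::arrow::{ParquetRecordBatchStreamBuilder, ProjectionMask};
use tokio::fs::File;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("data.parquet").await?;
    let builder = ParquetRecordBatchStreamBuilder::new(file).await?;

    // Projection pushdown: decode only leaf columns 0 and 2 (placeholder indices).
    let projection = ProjectionMask::leaves(builder.parquet_schema(), [0, 2]);

    // Limit pushdown: stop reading once 1_000 rows have been selected.
    let stream = builder
        .with_projection(projection)
        .with_limit(1_000)
        .build()?;

    let batches: Vec<_> = stream.try_collect().await?;
    println!("read {} batches", batches.len());
    Ok(())
}
```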

My biggest concern here is adding more code to maintain as part of this crate that may not be widely used

Chroma (@HammadB) and Tonbo both run into this issue.

If the issue is that the public API of the parquet-rs crate doesn't allow you to implement pushdowns, I agree we should extend the API to address whatever you are having trouble doing.

If the issue is that it is complex to implement parquet predicate pushdown, I am not sure that is a great fit for this crate, because the details of implementing predicate pushdown vary significantly from system to system. For example:

  1. What predicates are supported (do you support predicates like prefix matching, user-defined functions, etc.)?
  2. How do you evaluate predicates when there are multiple files (with potentially different but compatible schemas)?
  3. How do you evaluate predicates using information from an external metadata catalog (e.g. Iceberg or similar)?
  4. How do you interleave fetching metadata, evaluating predicates, and scanning files?

It isn't clear to me where to draw the line between predicate evaluation and a full query engine.

Maybe you and @HammadB can create some other crate (parquet-predicate-pushdown) implementing the specific pushdown APIs that you need.

@ndemir ndemir marked this pull request as draft March 31, 2025 18:00
@HammadB

HammadB commented Apr 1, 2025

@alamb I agree with the desire to separate these things, thanks for explaining.

@alamb
Contributor

alamb commented Apr 1, 2025

Thanks @HammadB -- I tried to improve the documentation here too, let me know what you think:

@alamb alamb closed this in #7370 Apr 4, 2025