
Conversation

@ndemir

@ndemir ndemir commented Mar 30, 2025

Which issue does this PR close?

This is NOT closing an issue yet.
I am opening this PR to illustrate how #7348 can be solved. I have also added some details in #7348.

Rationale for this change

I ran the test arrow::arrow_reader::tests::test_predicate_pushdown_vs_row_filter and can see the performance increase clearly.

TEST.1

=== PERFORMANCE COMPARISON ===

Filter type          Time          Row count
Row Filter           21.144189ms   590
Predicate Pushdown   15.067947ms   590
Improvement          1.40x

TEST.2

=== PERFORMANCE COMPARISON ===

Filter type          Time          Row count
Row Filter           26.991091ms   80
Predicate Pushdown   7.604246ms    80
Improvement          3.55x

What changes are included in this PR?

In this PR, we a) filter out the row groups that we do not need, and b) then create the predicates and add them to the RowFilter.
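
A rough sketch of these two steps against the crate's existing public API (not the actual code in this PR); the column index 0 and the threshold 100 are placeholders:

```rust
use std::fs::File;

use arrow::array::{Int64Array, RecordBatch};
use arrow::compute::kernels::cmp::gt;
use parquet::arrow::arrow_reader::{
    ArrowPredicateFn, ParquetRecordBatchReaderBuilder, RowFilter,
};
use parquet::arrow::ProjectionMask;
use parquet::file::statistics::Statistics;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let builder = ParquetRecordBatchReaderBuilder::try_new(File::open("data.parquet")?)?;

    // (a) keep only row groups whose statistics show column 0 can exceed the threshold
    let keep: Vec<usize> = builder
        .metadata()
        .row_groups()
        .iter()
        .enumerate()
        .filter(|(_, rg)| match rg.column(0).statistics() {
            Some(Statistics::Int64(s)) => s.max_opt().map_or(true, |max| *max > 100),
            _ => true, // no usable statistics: cannot prune, keep the row group
        })
        .map(|(i, _)| i)
        .collect();

    // (b) evaluate the same predicate exactly, row by row, via a RowFilter
    let mask = ProjectionMask::leaves(builder.parquet_schema(), [0]);
    let predicate = ArrowPredicateFn::new(mask, |batch: RecordBatch| {
        gt(batch.column(0), &Int64Array::new_scalar(100))
    });

    let reader = builder
        .with_row_groups(keep)
        .with_row_filter(RowFilter::new(vec![Box::new(predicate)]))
        .build()?;

    for batch in reader {
        println!("{} matching rows", batch?.num_rows());
    }
    Ok(())
}
```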

Are there any user-facing changes?

Yes, there will be, once the full implementation is completed.
This PR is just to show a possible solution for #7348

@github-actions github-actions bot added the parquet label (Changes to the parquet crate) Mar 30, 2025
Contributor

@alamb alamb left a comment


Thank you @ndemir (and @ethe) -- I think the idea in this PR of evaluating predicates on the metadata is a good one and important for performance.

Instead of adding a new API, would you be willing to make the code in this PR into an example of how to apply predicates to parquet metadata?

I think it is possible to implement this feature without modifying the parquet reader, using the currently available APIs, for example by filtering the row groups via ArrowReaderBuilder::with_row_groups and pages with ArrowReaderBuilder::with_row_selection
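
For example, a minimal sketch of this approach, where the row group indices and the skip/select counts are placeholders standing in for whatever a metadata evaluation would produce:

```rust
use std::fs::File;

use parquet::arrow::arrow_reader::{
    ParquetRecordBatchReaderBuilder, RowSelection, RowSelector,
};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let builder = ParquetRecordBatchReaderBuilder::try_new(File::open("data.parquet")?)?;

    // Row groups to keep, e.g. decided by evaluating the predicate against
    // the row group statistics in builder.metadata() (indices are placeholders).
    let keep_row_groups = vec![0, 2];

    // Within the kept row groups, skip/select runs of rows, e.g. decided from
    // the page index (the counts below are placeholders).
    let selection = RowSelection::from(vec![
        RowSelector::skip(1_000),
        RowSelector::select(500),
    ]);

    let reader = builder
        .with_row_groups(keep_row_groups)
        .with_row_selection(selection)
        .build()?;

    for batch in reader {
        println!("read {} rows", batch?.num_rows());
    }
    Ok(())
}
```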

This is what DataFusion does, for example -- you can read more about how it works in https://datafusion.apache.org/blog/2025/03/20/parquet-pruning/

It looks to me like PredicatePushdown contains a basic expression evaluator -- but to use this, a consuming library would have to translate its predicates into this predicate format and would be restricted to the predicates supported by this library.

That being said, as you show here, it is non-trivial to implement row group / page filtering.

@ethe
Contributor

ethe commented Mar 31, 2025

I think it is possible to implement this feature without modifying the parquet reader, using the currently available APIs

I have tried to implement this in third-party libs, but arrow-rs lacks enough public APIs (for example, users cannot construct Sbbf outside of parquet), and the related APIs are not convenient enough to use publicly at the moment.

That being said, as you show here, it is non-trivial to implement row group / page filtering.

That's what I want to point out: this need is common to lots of users, but it is not that easy to realize and it also exposes lots of internal details. If parquet contained a first-party TableProvider implementation, that would be good for me.

BTW, I'd love to contribute too, if we agree on a solution.

@alamb
Contributor

alamb commented Mar 31, 2025

I think it is possible to implement this feature without modifying the parquet reader, using the currently available APIs

I have tried to implement this in third-party libs, but arrow-rs lacks enough public APIs (for example, users cannot construct Sbbf outside of parquet), and the related APIs are not convenient enough to use publicly at the moment.

You can certainly access and use Sbbf outside the parquet crate; for example, DataFusion does so to prune out row groups and data pages here:

https://github.com/apache/datafusion/blob/6d5e00ad3f8e53f7252cb1d3c72a6c7f28c1aed6/datafusion/datasource-parquet/src/row_group_filter.rs#L236-L235

What is the use case for constructing Sbbf? I think it would be fine to make that public in the crate.
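
For reference, a minimal sketch of probing an existing bloom filter on the synchronous read path (assuming the file has a bloom filter written for leaf column 0; the value 42 is a placeholder):

```rust
use std::fs::File;

use parquet::file::properties::ReaderProperties;
use parquet::file::reader::{FileReader, SerializedFileReader};
use parquet::file::serialized_reader::ReadOptionsBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Bloom filters are not loaded by default; opt in via the reader properties.
    let options = ReadOptionsBuilder::new()
        .with_reader_properties(
            ReaderProperties::builder()
                .set_read_bloom_filter(true)
                .build(),
        )
        .build();
    let reader =
        SerializedFileReader::new_with_options(File::open("data.parquet")?, options)?;

    for i in 0..reader.metadata().num_row_groups() {
        let row_group = reader.get_row_group(i)?;
        // None if no bloom filter was written for this column chunk.
        if let Some(sbbf) = row_group.get_column_bloom_filter(0) {
            // A negative answer is definitive: the value is absent from this
            // row group, so it can be skipped entirely.
            if !sbbf.check(&42_i64) {
                println!("row group {i} cannot contain 42; prune it");
            }
        }
    }
    Ok(())
}
```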

That being said, as you show here, it is non-trivial to implement row group / page filtering.

That's what I want to point out: this need is common to lots of users, but it is not that easy to realize and it also exposes lots of internal details.

Yes indeed it is not trivial to implement a fast parquet reader integrated with a query engine

If parquet contained a first-party TableProvider implementation, that would be good for me.

What do you mean by "TableProvider"? If you are using DataFusion already, perhaps you can use the built-in parquet reader (ListingTableProvider) that already has all these optimizations.

@alamb
Contributor

alamb commented Mar 31, 2025

My biggest concern here is adding more code to maintain as part of this crate that may not be widely used

@ethe
Contributor

ethe commented Mar 31, 2025

What is the use case for constructing Sbbf?

I cannot find a way to get Sbbf instances in the async read path of the parquet crate; this only works with SerializedRowGroupReader, but it is synchronous, so I have to construct it manually from bytes::Bytes.

If you are using DataFusion already

I do not use DataFusion (not yet). If there were a first-party scan method on the parquet async reader with predicate/projection/limit pushdown, that is what I need. I'd say TableProvider provides similar semantics to such an API, but I'm not sure it is the best choice for a first-party implementation in parquet.

My biggest concern here is adding more code to maintain as part of this crate that may not be widely used

Both Chroma (@HammadB) and Tonbo run into this issue.

@alamb
Contributor

alamb commented Mar 31, 2025

What is the use case for constructing Sbbf?

I cannot find a way to get Sbbf instances in the async read path of the parquet crate; this only works with SerializedRowGroupReader, but it is synchronous, so I have to construct it manually from bytes::Bytes.

Perhaps you can propose an API to do so (perhaps on ParquetMetadataReader)?

I do not use DataFusion (not yet). If there were a first-party scan method on the parquet async reader with predicate/projection/limit pushdown, that is what I need. I'd say TableProvider provides similar semantics to such an API, but I'm not sure it is the best choice for a first-party implementation in parquet.

I agree implementing a TableProvider-like interface in the parquet crate is likely not a good idea.
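
For what it is worth, projection and limit pushdown are already exposed on the async ParquetRecordBatchStreamBuilder; a rough sketch, with a placeholder file name and column indices:

```rust
use futures::TryStreamExt;
use parquet::arrow::{ParquetRecordBatchStreamBuilder, ProjectionMask};
use tokio::fs::File;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("data.parquet").await?;
    let builder = ParquetRecordBatchStreamBuilder::new(file).await?;

    // Projection pushdown: decode only leaf columns 0 and 2 (placeholder indices).
    let projection = ProjectionMask::leaves(builder.parquet_schema(), [0, 2]);

    // Limit pushdown: stop reading once 1_000 rows have been selected.
    let stream = builder
        .with_projection(projection)
        .with_limit(1_000)
        .build()?;

    let batches: Vec<_> = stream.try_collect().await?;
    println!("read {} batches", batches.len());
    Ok(())
}
```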

My biggest concern here is adding more code to maintain as part of this crate that may not be widely used

Chroma (@HammadB) and Tonbo both run into this issue.

If the issue is that the public API of the parquet-rs crate doesn't allow you to implement pushdowns, I agree we should extend the API to address whatever you are having trouble doing.

If the issue is that it is complex to implement parquet predicate pushdown, I am not sure that is a great fit for this crate, because the details of implementing predicate pushdown vary significantly from system to system. For example:

  1. What predicates are supported (do you support predicates like prefix matching, user-defined functions, etc.)?
  2. How do you evaluate predicates when there are multiple files (with potentially different but compatible schemas)?
  3. How do you evaluate predicates using information from an external metadata catalog (e.g. Iceberg or similar)?
  4. How do you interleave fetching metadata, evaluating predicates, and scanning files?

It isn't clear to me where to draw the line between predicate evaluation and a full query engine.

Maybe you and @HammadB can create some other crate (parquet-predicate-pushdown) implementing the specific pushdown APIs that you need.

@ndemir ndemir marked this pull request as draft March 31, 2025 18:00
@HammadB

HammadB commented Apr 1, 2025

@alamb I agree with the desire to separate these things, thanks for explaining.

@alamb
Contributor

alamb commented Apr 1, 2025

Thanks @HammadB -- I tried to improve the documentation here too, let me know what you think:

@alamb alamb closed this in #7370 Apr 4, 2025