feat: add method for async read bloom filter #4917
tustvold left a comment
This looks good, left some minor comments, but I think all this needs is a test
    let buffer = self
        .input
        .0
        .get_bytes(offset..offset + SBBF_HEADER_SIZE_ESTIMATE)
There is a new bloom_filter_length that may be present and would avoid needing to guess here
Thanks, I checked the bloom_filter module and updated this part.
    let bitset = self
        .input
        .0
        .get_bytes(bitset_offset..bitset_offset + length)
I think it would be ideal if we could avoid this extra roundtrip in the common case, by fetching enough data in the first call
The first call is used to parse bloom_filter_length, and the second call is used to parse bloom_filter_data. We can save one call if we already know the bloom_filter_length. Thanks, I updated this. Can you help review again?
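The round-trip saving discussed in this thread can be illustrated with a small, self-contained sketch. It is not the code from this PR: `CountingReader`, `read_bitset`, and the fixed 4-byte toy header are hypothetical stand-ins (the real Parquet bloom filter header is Thrift-encoded and variable in size), and `total_len` plays the role of `bloom_filter_length`, assumed here to cover header plus bitset.

```rust
use std::cell::Cell;
use std::ops::Range;

/// Size of the toy header: a little-endian u32 holding the bitset length.
/// (Assumption for illustration; the real header is Thrift-encoded.)
const TOY_HEADER_LEN: usize = 4;

/// Hypothetical stand-in for a remote file; counts ranged requests.
struct CountingReader {
    data: Vec<u8>,
    requests: Cell<usize>,
}

impl CountingReader {
    fn new(data: Vec<u8>) -> Self {
        Self { data, requests: Cell::new(0) }
    }

    /// Each call models one round trip to storage.
    fn get_bytes(&self, range: Range<usize>) -> &[u8] {
        self.requests.set(self.requests.get() + 1);
        &self.data[range]
    }
}

/// Decode the bitset length from the toy header.
fn bitset_len(header: &[u8]) -> usize {
    let mut raw = [0u8; TOY_HEADER_LEN];
    raw.copy_from_slice(&header[..TOY_HEADER_LEN]);
    u32::from_le_bytes(raw) as usize
}

/// Read the bitset stored at `offset`. `total_len` stands in for
/// `bloom_filter_length` (header plus bitset) and may be absent.
fn read_bitset(reader: &CountingReader, offset: usize, total_len: Option<usize>) -> Vec<u8> {
    match total_len {
        // Length known: one request covers both header and bitset.
        Some(total) => {
            let buf = reader.get_bytes(offset..offset + total);
            let n = bitset_len(buf);
            buf[TOY_HEADER_LEN..TOY_HEADER_LEN + n].to_vec()
        }
        // Length unknown: fetch the header first, then the bitset.
        None => {
            let header = reader.get_bytes(offset..offset + TOY_HEADER_LEN);
            let n = bitset_len(header);
            let start = offset + TOY_HEADER_LEN;
            reader.get_bytes(start..start + n).to_vec()
        }
    }
}
```

In this toy model the known-length path issues one request and the fallback path issues two, which is the difference the review comment is asking to exploit in the common case.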
@tustvold Sure, I will try to add two test cases:

@tustvold Can I create two test parquet files and commit them to https://github.com/apache/parquet-testing/ ?

You could, but I don't have merge rights there so it may take some time. A quicker option might be to use an existing file for 1., and to write a file to a

@tustvold Thanks, I will use

Would you mind taking a look at
tustvold left a comment
Looks good to me, thank you
Which issue does this PR close?
Implements #3851
We want to filter row_groups in DataFusion, but there is no async API for reading the bloom filter.
What changes are included in this PR?
Implemented a function get_row_group_column_bloom_filter for ParquetRecordBatchStreamBuilder to support reading the bloom filter outside arrow.
Are there any user-facing changes?
Adds a function get_row_group_column_bloom_filter in ParquetRecordBatchStreamBuilder
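
To make the motivation concrete (pruning row groups with bloom filters), here is a minimal sketch of the idea. It does not use the parquet crate: `ToyBloom` is a hypothetical, simplified filter (not Parquet's split-block format), and `candidate_row_groups` is an illustrative helper that assumes a row group without a filter is modeled as `None` and must always be read.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Toy bloom filter for illustration (NOT Parquet's split-block format).
struct ToyBloom {
    bits: Vec<bool>,
}

impl ToyBloom {
    fn new(num_bits: usize) -> Self {
        Self { bits: vec![false; num_bits] }
    }

    /// Derive two bit positions by hashing the value with two seeds.
    fn positions<T: Hash>(&self, value: &T) -> [usize; 2] {
        let mut out = [0usize; 2];
        for (seed, slot) in out.iter_mut().enumerate() {
            let mut hasher = DefaultHasher::new();
            (seed as u64).hash(&mut hasher);
            value.hash(&mut hasher);
            *slot = (hasher.finish() % self.bits.len() as u64) as usize;
        }
        out
    }

    fn insert<T: Hash>(&mut self, value: &T) {
        for p in self.positions(value) {
            self.bits[p] = true;
        }
    }

    /// `false` means definitely absent; `true` means possibly present.
    fn check<T: Hash>(&self, value: &T) -> bool {
        self.positions(value).iter().all(|&p| self.bits[p])
    }
}

/// Return the indices of row groups that may contain `value`.
/// A row group with no filter cannot be ruled out, so it is kept.
fn candidate_row_groups<T: Hash>(filters: &[Option<ToyBloom>], value: &T) -> Vec<usize> {
    let mut keep = Vec::new();
    for (i, filter) in filters.iter().enumerate() {
        let may_contain = match filter {
            Some(f) => f.check(value),
            None => true,
        };
        if may_contain {
            keep.push(i);
        }
    }
    keep
}
```

A bloom filter can only prove absence, so the helper skips a row group only when its filter reports that the value is definitely not there; false positives cost an unnecessary read but never a wrong result.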