**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
There are use cases where one wants to search a large amount of Parquet data for a relatively small number of rows. For example, if you have distributed tracing data stored as Parquet files and want to find the data for a particular trace.
In general, the pattern is a "needle in a haystack" query -- specifically, a very selective predicate (one that passes only a few rows) on a high-cardinality column (many distinct values).
The Rust parquet crate has fairly advanced support for row group pruning, page-level indexes, and filter pushdown. These techniques are quite effective when the data is sorted and large contiguous ranges of rows can be skipped.
However, needle-in-a-haystack queries still often require substantial amounts of CPU and I/O.
One challenge is that typical high-cardinality columns, such as IDs, often (by design) span the entire range of values of the data type.
For example, even in the best case, when the data is "optimally sorted" by `id` within a row group, min/max statistics cannot help skip row groups or pages. Instead, the entire column must be decoded to search for a particular value:
```text
┌──────────────────────────┐         WHERE
│            id            │  ┌───── id = 54322342343
├──────────────────────────┤  │
│       00000000000        │  │
├──────────────────────────┤  │      Selective predicate on a
│       00054542543        │  │      high cardinality column
├──────────────────────────┤  │
│           ...            │  │
├──────────────────────────┤  │
│        ??????????        │◀─┘
├──────────────────────────┤         Cannot rule out ranges
│           ...            │         using min/max values
├──────────────────────────┤
│       99545435432        │
├──────────────────────────┤
│       99999999999        │
└──────────────────────────┘
  High cardinality column:
    many distinct values
         (sorted)

┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐
│  min: 00000000000    │
│  max: 99999999999    │
│      Metadata        │
└ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
```
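To make the failure mode concrete, here is a minimal sketch of range-based pruning (this is not the parquet crate's actual pruning code; the struct and function names are made up for illustration). A row group can be skipped only when the predicate value falls outside its `[min, max]` range, which never happens for a column spanning the full value domain:

```rust
/// Hypothetical per-row-group statistics, mirroring the min/max values
/// Parquet stores in column chunk metadata (names are illustrative only).
struct RowGroupStats {
    min: u64,
    max: u64,
}

/// For `WHERE id = value`, min/max pruning can skip a row group only when
/// the value lies entirely outside the [min, max] range.
fn can_skip(stats: &RowGroupStats, value: u64) -> bool {
    value < stats.min || value > stats.max
}

fn main() {
    // An id column that (by design) spans the full range of the domain:
    let stats = RowGroupStats { min: 0, max: 99_999_999_999 };

    // The needle from the example predicate `id = 54322342343`:
    let needle: u64 = 54_322_342_343;

    // min <= needle <= max, so the statistics cannot rule the group out,
    // even if the needle is not actually present in the data:
    assert!(!can_skip(&stats, needle));

    // Only values outside the stored range could ever be skipped:
    assert!(can_skip(&stats, 100_000_000_000));
}
```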
**Describe the solution you'd like**
The Parquet file format has support for bloom filters: https://github.com/apache/parquet-format/blob/master/BloomFilter.md
A bloom filter is a space-efficient probabilistic structure that can quickly determine that a value is *not* in a set. So for a Parquet file with bloom filters for `id` in the metadata, an entire row group can be skipped if the id is not present:
```text
┌──────────────────────────┐         WHERE
│            id            │ ─ ─ ─ ─ id = 54322342343
├──────────────────────────┤ │
│       00000000000        │         Can quickly check if
├──────────────────────────┤ │       the value 54322342343
│       00054542543        │         is not present by
├──────────────────────────┤ │       consulting the Bloom
│           ...            │         Filter
├──────────────────────────┤ │
│        ??????????        │
├──────────────────────────┤ │
│           ...            │
├──────────────────────────┤ │
│       99545435432        │
├──────────────────────────┤ │
│       99999999999        │
└──────────────────────────┘ │
  High cardinality column:
    many distinct values     │
         (sorted)
                             │
┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐ │
│                          │ │
│  bloom_filter: ....  ◀ ─ ─ ┘
│                          │
│  min: 00000000000        │
│  max: 99999999999        │
│                          │
│         Metadata         │
└ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
```
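The skip decision can be sketched with a toy bloom filter. Note this is a simplified illustration, not the Parquet format's actual split-block bloom filter (which uses xxHash64 over 32-byte blocks); the point is the key property: a lookup may return a false positive but never a false negative, so a `false` answer means the value is definitely absent and the row group can be skipped.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// A toy bloom filter: a bit array plus k hash functions derived by
/// double hashing. Parquet's real filter is a split-block bloom filter,
/// but the skip logic works the same way.
struct BloomFilter {
    bits: Vec<bool>,
    num_hashes: u64,
}

impl BloomFilter {
    fn new(num_bits: usize, num_hashes: u64) -> Self {
        BloomFilter { bits: vec![false; num_bits], num_hashes }
    }

    /// Derive `num_hashes` bit positions from two seeded hashes.
    fn bit_positions<T: Hash>(&self, value: &T) -> Vec<usize> {
        let hash = |seed: u64| {
            let mut h = DefaultHasher::new();
            seed.hash(&mut h);
            value.hash(&mut h);
            h.finish()
        };
        let (h1, h2) = (hash(0), hash(1));
        let n = self.bits.len() as u64;
        (0..self.num_hashes)
            .map(|i| (h1.wrapping_add(i.wrapping_mul(h2)) % n) as usize)
            .collect()
    }

    fn insert<T: Hash>(&mut self, value: &T) {
        for i in self.bit_positions(value) {
            self.bits[i] = true;
        }
    }

    /// `false` means the value is definitely not present (no false
    /// negatives), so the corresponding row group can be skipped entirely.
    fn might_contain<T: Hash>(&self, value: &T) -> bool {
        self.bit_positions(value).iter().all(|&i| self.bits[i])
    }
}

fn main() {
    let mut filter = BloomFilter::new(1024, 4);
    for id in [54_322_342_343u64, 123, 99_999_999_999] {
        filter.insert(&id);
    }
    // Inserted values are always reported as possibly present:
    assert!(filter.might_contain(&54_322_342_343u64));
    // An absent value will (with high probability) report false,
    // allowing the reader to skip the row group:
    println!("might contain 42: {}", filter.might_contain(&42u64));
}
```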
I would like the parquet crate to:
- support optionally writing Parquet bloom filters into the metadata
- support using Parquet bloom filters during reads, to make "needle in a haystack" queries go quickly by skipping entire row groups when the value is not in the bloom filter
The format support is here: https://docs.rs/parquet/latest/parquet/format/struct.BloomFilterHeader.html?search=Bloom
**Describe alternatives you've considered**
**Additional context**
There is some code for Parquet bloom filters in https://github.com/jorgecarleitao/parquet2/tree/main/src/bloom_filter from @jorgecarleitao. I am not sure how mature it is, but perhaps we can use or repurpose some of it.