Skip to content

Bloom filters for high-cardinality column lookups #113

@bluestreak01

Description

@bluestreak01

Summary

Bloom filters for both native and Parquet storage to efficiently search high-cardinality columns (order IDs, user IDs, transaction references) across unlimited time ranges without scanning all data.

The Problem

-- Find one order across 5 years of data
SELECT * FROM trades WHERE order_id = 'ORD-2024-ABC123';
  • High-cardinality columns can't use traditional indexes efficiently
  • Without filtering, must scan all partitions
  • Needle-in-haystack queries become prohibitively expensive

The Solution

Query: WHERE order_id = 'ORD-2024-ABC123'

Partition 2024-Q1: Bloom says "not here"     → SKIP ✓
Partition 2024-Q2: Bloom says "not here"     → SKIP ✓
Partition 2024-Q3: Bloom says "maybe here"   → SCAN
Partition 2024-Q4: Bloom says "not here"     → SKIP ✓
...

How Bloom Filters Work

Property Description
No false negatives If filter says "not here", value is guaranteed absent
Rare false positives Occasionally says "maybe" when value isn't present (tunable)
Compact Minimal storage overhead per partition/row group
Fast O(k) lookups where k is number of hash functions

Features

  • Native storage support — Bloom filter per partition for indexed columns
  • Parquet support — Bloom filter per row group, compatible with Parquet spec
  • Configurable — Tune false positive rate vs storage trade-off
  • Automatic maintenance — Filters built during ingestion and compaction

Benefits

  • Skip 99%+ of data — Only scan partitions that might contain the value
  • Unlimited time range — Search across years without proportional cost
  • High-cardinality friendly — Works where traditional indexes don't scale

Use Cases

  • Find specific order/trade by ID across historical data
  • Lookup user activity across all time
  • Audit trails and compliance queries
  • Any needle-in-haystack search pattern

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    Status

    Done

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions