parquet::ArrowWriter should allow writing Bloom filters before the end of the file #5859

@progval

Description

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
https://parquet.apache.org/docs/file-format/bloomfilter/#file-format mentions two ways to lay out Bloom filters: either write each Bloom filter after its row group, or write all Bloom filters at the end of the file. In both cases, pointers to the Bloom filters are then written in the footer.

The parquet crate writes all Bloom filters at the end of the file, but computes them as each row group is written. This means every Bloom filter has to be kept in memory for as long as the file is being written, which can take significant space: in my use case, ~4TB of Bloom filters while writing a 20TB table.

Describe the solution you'd like
Either switch to the other layout (interleaved with row groups), provide an option to switch between the two, or allow users to flush when they want to.
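To make the difference between the two layouts concrete, here is a toy model of how many Bloom filter bytes stay buffered under each one. It is only a sketch: `FilterLayout` and `peak_buffered_bytes` are hypothetical names made up for illustration, not types or functions from the parquet crate, and the model ignores the small per-filter offsets that the footer needs in either layout.

```rust
/// Hypothetical layout switch for illustration; not a parquet crate type.
#[derive(Clone, Copy, PartialEq, Debug)]
pub enum FilterLayout {
    /// Write each row group's Bloom filters right after that row group.
    AfterRowGroup,
    /// Hold every Bloom filter until the footer is about to be written.
    EndOfFile,
}

/// Returns the peak number of Bloom filter bytes held in memory while
/// writing row groups whose filters have the given sizes (in bytes).
pub fn peak_buffered_bytes(filter_sizes: &[u64], layout: FilterLayout) -> u64 {
    let mut buffered = 0u64;
    let mut peak = 0u64;
    for &size in filter_sizes {
        // The filter is built while its row group is being written,
        // so it is in memory at least until the row group is complete.
        buffered += size;
        peak = peak.max(buffered);
        if layout == FilterLayout::AfterRowGroup {
            // Interleaved layout: the filter is flushed to the file now,
            // so nothing carries over to the next row group.
            buffered = 0;
        }
    }
    peak
}
```

In this model the end-of-file layout peaks at the sum of all filter sizes, while the interleaved layout peaks at the largest single row group's filters, which is why the interleaved layout (or an explicit flush) would bound memory regardless of file size.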

Describe alternatives you've considered
Expecting users to close the ArrowWriter from time to time and open a new one. This would mitigate the memory usage, but not remove it entirely. When many files are written in parallel from multiple threads, the threads would also need to stagger their re-openings in order to avoid spikes in RAM.
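As a rough illustration of why rotating writers only mitigates the problem, the sketch below models a single thread that starts a new file every `row_groups_per_file` row groups. The function name and parameter are hypothetical, and it assumes the current end-of-file layout, where a file's filters are released only when that file is closed.

```rust
/// Peak Bloom filter bytes buffered by one writer thread when it closes
/// the current file and opens a new one every `row_groups_per_file`
/// row groups. Hypothetical helper for illustration only.
pub fn peak_with_rotation(filter_sizes: &[u64], row_groups_per_file: usize) -> u64 {
    assert!(row_groups_per_file > 0, "a file needs at least one row group");
    filter_sizes
        // Each chunk is the set of row groups that end up in one file.
        .chunks(row_groups_per_file)
        // All of a file's filters are in memory just before it is closed.
        .map(|file| file.iter().sum::<u64>())
        // The worst file determines the peak for this thread.
        .max()
        .unwrap_or(0)
}
```

The peak shrinks as files get shorter, but it is still the total filter size of the largest file rather than of a single row group. With several threads following the same cadence, their per-file peaks can also coincide, which is the RAM spike that staggering would be meant to avoid.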

Metadata

    Labels

    enhancement: Any new improvement worthy of an entry in the changelog
    parquet: Changes to the parquet crate
