
Improve memory overhead of parquet dictionary encoder #5828

@alamb

Description


Is your feature request related to a problem or challenge? Please describe what you are trying to do.
As part of #5770, @XiangpengHao has been creating parquet files with large numbers of columns. He was not able to create a file with 10,000 columns and 1M rows per row group, where every value was the same floating point value (42.0), due to running out of memory.

The parquet encoder appears to require at minimum 8 bytes of memory per value, regardless of the actual value.

This is substantial when writing large numbers of columns. For example, writing 10,000 columns in 1M-row row groups requires 80 GB of memory (8 × 10,000 × 1,000,000 = 80,000,000,000 bytes).
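The arithmetic above can be sketched as a back-of-envelope estimator (a minimal sketch; the 8-bytes-per-value constant is the observation from this issue, and the function name is hypothetical):

```rust
// Estimated bytes buffered by the writer for dictionary indices,
// assuming a fixed 8 bytes per value as observed in this issue.
const BYTES_PER_VALUE: u64 = 8;

fn buffered_index_bytes(num_columns: u64, rows_per_row_group: u64) -> u64 {
    num_columns * rows_per_row_group * BYTES_PER_VALUE
}

fn main() {
    let bytes = buffered_index_bytes(10_000, 1_000_000);
    // 80,000,000,000 bytes, i.e. 80 GB for a single row group
    println!("{} bytes (~{} GB)", bytes, bytes / 1_000_000_000);
}
```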

His initial analysis showed that significant memory consumption comes from the dictionary encoder's indices: https://github.com/apache/arrow-rs/blob/master/parquet/src/encodings/encoding/dict_encoder.rs#L80, where each value consumes 8 bytes of memory. This is evidenced by the observation that changing from f64 to f32 does not reduce memory usage.

To put it another way, the dictionary indices for an entire row group are kept in memory, which takes row_count * num_columns * 8 bytes, regardless of the actual values.
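A minimal sketch of why index memory is decoupled from the value type (this is a hypothetical simplification, not the actual dict_encoder.rs; `DictEncoder`, `put_f64`, and `buffered_index_bytes` are illustrative names). Every written value buffers one fixed-width index for the whole row group, so f32 vs f64 makes no difference, and a column of a single repeated value still buffers one index per row:

```rust
use std::collections::HashMap;

// Hypothetical simplification of a dictionary encoder.
struct DictEncoder {
    // unique values -> dictionary slot (keyed by the value's bit pattern)
    dictionary: HashMap<u64, u64>,
    // one fixed-width index buffered per written value, for the whole row group
    indices: Vec<u64>,
}

impl DictEncoder {
    fn new() -> Self {
        Self { dictionary: HashMap::new(), indices: Vec::new() }
    }

    fn put_f64(&mut self, v: f64) {
        let next = self.dictionary.len() as u64;
        let idx = *self.dictionary.entry(v.to_bits()).or_insert(next);
        // 8 bytes buffered here regardless of how wide the value itself is
        self.indices.push(idx);
    }

    fn buffered_index_bytes(&self) -> usize {
        self.indices.len() * std::mem::size_of::<u64>()
    }
}

fn main() {
    let mut enc = DictEncoder::new();
    for _ in 0..1_000_000 {
        enc.put_f64(42.0);
    }
    // The dictionary holds a single entry, yet the index buffer still
    // grows linearly with row count: 8 MB for one column of one row group.
    println!("dictionary entries: {}", enc.dictionary.len());
    println!("buffered index bytes: {}", enc.buffered_index_bytes());
}
```

Multiply that per-column index buffer by 10,000 columns and you arrive at the 80 GB figure above.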

Describe the solution you'd like

Some way to write large / wide parquet files with less memory required

Describe alternatives you've considered

@XiangpengHao reported he tried to disable dictionary encoding and directly use RLE encoding, but RLE only supports boolean values: https://github.com/apache/arrow-rs/blob/master/parquet/src/encodings/encoding/mod.rs#L196

Additional context
Given we are simply encoding the same value over and over again, maybe this is not an important case to optimize.

However, the same thing might apply for very sparse columns (e.g. if you are encoding 1M values and all but two of them are NULL 🤔 )

Metadata
Labels

enhancement (Any new improvement worthy of an entry in the changelog), parquet (Changes to the parquet crate)
