-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
While looking into parquet loading performance, I noticed that the definition levels for simple nullable columns often contain very short rle runs. These seem to mess with branch prediction when decoding and decrease performance.
I managed to reproduce this in a benchmark by creating a custom RleEncoder for a bitwidth of 1, which only writes bitpacked runs. With 90% of bits being set, these are the results:
rle_decoder/decode_bitpacked
time: [121.83 µs 121.98 µs 122.13 µs]
thrpt: [8.5857 Gelem/s 8.5966 Gelem/s 8.6068 Gelem/s]
rle_decoder/decode_hybrid
time: [661.33 µs 662.31 µs 663.56 µs]
thrpt: [1.5802 Gelem/s 1.5832 Gelem/s 1.5856 Gelem/s]
Note the decoder in this benchmark is the same for both cases, the only difference is how the levels are serialized.
I also hacked some statistics into the decoder. For the above benchmark these are the distributions of rle vs bitpacked encoded runs. On average, the rle runs are only 16 elements long, and the bitpacked only encoding even seems to be shorter:
[parquet/benches/rle.rs:292:9] bitpacked_data.len() = 135169
[parquet/benches/rle.rs:292:9] &bitpacked_stats = HybridRleStats {
num_values: 1048576,
num_rle_values: 0,
num_rle_runs: 0,
num_bitpacked_values: 1048576,
num_bitpacked_runs: 2048,
}
[parquet/benches/rle.rs:296:9] hybrid_data.len() = 157050
[parquet/benches/rle.rs:296:9] &hybrid_stats = HybridRleStats {
num_values: 1048576,
num_rle_values: 501915,
num_rle_runs: 29541,
num_bitpacked_values: 546661,
num_bitpacked_runs: 29541,
}
Describe the solution you'd like
I'd like the RleEncoder to be smarter about switching between bitpacked and rle runs. Maybe this is a one-line change in encodings/rle.rs, but I find the logic there a bit hard to follow. Alternatives could be a specialized implementation for single-bit levels, or making this configurable for the user.