Description
The documentation for the Parquet BIT_PACKED encoding says:
For compatibility reasons, this implementation packs values from the most significant bit to the least significant bit, which is not the same as the RLE/bit-packing hybrid.
This is followed by an example that clearly differs from the one given for the RLE encoding. The RLE documentation in turn says:
The bit-packing here is done in a different order than the one in the deprecated bit-packing encoding
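To make the two documented layouts concrete, here is a minimal Python sketch. The function names are hypothetical and the code is independent of arrow-rs and parquet2; it only follows the wording of the documentation, packing the values 0 through 7 at a bit width of 3 (the example the documentation uses) and assuming every value fits in `bit_width` bits.

```python
def pack_lsb_first(values, bit_width):
    """RLE/bit-packing hybrid order: each value fills the lowest free bits first."""
    acc = 0
    for i, v in enumerate(values):
        acc |= v << (i * bit_width)
    n_bytes = (len(values) * bit_width + 7) // 8
    return acc.to_bytes(n_bytes, "little")


def pack_msb_first(values, bit_width):
    """Deprecated BIT_PACKED order: values are concatenated starting at the most significant bit."""
    acc = 0
    for v in values:
        acc = (acc << bit_width) | v
    total_bits = len(values) * bit_width
    n_bytes = (total_bits + 7) // 8
    # Left-align so that any padding bits land at the end of the last byte.
    acc <<= n_bytes * 8 - total_bits
    return acc.to_bytes(n_bytes, "big")


values = list(range(8))
print(pack_lsb_first(values, 3).hex())  # 88c6fa
print(pack_msb_first(values, 3).hex())  # 053977
```

The two byte sequences match the bit patterns shown in the two documentation examples, and they differ at the byte level.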
However, in the arrow-rs/parquet code base, both encodings appear to use the same `BitReader::get_batch` implementation: the BIT_PACKED decoder calls it directly, while the RLE decoder reaches it indirectly via `RleDecoder::get_batch`. I think parquet2 reuses its bit-packing logic in a similar way.
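If the documentation is accurate, a single LSB-first unpacking routine cannot decode both layouts. The sketch below uses a hypothetical Python helper (it is not the arrow-rs `BitReader`) and the byte sequences from the two documentation examples for the values 0 through 7 at a bit width of 3.

```python
def unpack_lsb_first(data, bit_width, count):
    """Read `count` values of `bit_width` bits each, lowest bits first."""
    acc = int.from_bytes(data, "little")
    mask = (1 << bit_width) - 1
    return [(acc >> (i * bit_width)) & mask for i in range(count)]


hybrid_bytes = bytes([0x88, 0xC6, 0xFA])      # RLE/bit-packing hybrid example
bit_packed_bytes = bytes([0x05, 0x39, 0x77])  # deprecated BIT_PACKED example

print(unpack_lsb_first(hybrid_bytes, 3, 8))      # [0, 1, 2, 3, 4, 5, 6, 7]
print(unpack_lsb_first(bit_packed_bytes, 3, 8))  # [5, 0, 4, 4, 3, 6, 5, 3]
```

Under that reading, a shared reader would return wrong values for BIT_PACKED data whenever the values are not all identical bit patterns in both orders, so a test file containing such data would be expected to expose the difference.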
As far as I know, both Rust Parquet implementations pass the integration test suite, so I see several possible explanations for this discrepancy:
- The documentation is wrong, perhaps confusing bit order with little-/big-endian byte order.
- The Rust code is wrong, but the BIT_PACKED encoding is not used in practice, not even in the test suite.
- The difference only shows up on big-endian machines (I don't think this can be the case, since the examples show bytes).