Confused about bit order in parquet BIT_PACKED encoding #5338

@jhorstmann

The documentation for the parquet BIT_PACKED encoding says:

For compatibility reasons, this implementation packs values from the most significant bit to the least significant bit, which is not the same as the RLE/bit-packing hybrid.

This is followed by an example that is clearly different from the example for the RLE encoding. The documentation for the RLE encoding also says:

The bit-packing here is done in a different order than the one in the deprecated bit-packing encoding
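To make the two documented orders concrete, here is a minimal sketch that packs the values 0 through 7 at a bit width of 3 both ways. The function names `pack_lsb_first` and `pack_msb_first` are hypothetical helpers written for this illustration and are not part of arrow-rs or parquet2. The expected bytes are the ones I read from the examples in the encoding documentation.

```rust
/// Hypothetical helper: pack values starting at the least significant bit
/// of each byte, the order documented for the RLE/bit-packing hybrid.
fn pack_lsb_first(values: &[u8], bit_width: usize) -> Vec<u8> {
    let mut out = vec![0u8; (values.len() * bit_width + 7) / 8];
    let mut pos = 0; // absolute bit position in the output stream
    for &v in values {
        for i in 0..bit_width {
            if (v >> i) & 1 == 1 {
                out[pos / 8] |= 1 << (pos % 8);
            }
            pos += 1;
        }
    }
    out
}

/// Hypothetical helper: pack values starting at the most significant bit
/// of each byte, the order documented for the deprecated BIT_PACKED encoding.
fn pack_msb_first(values: &[u8], bit_width: usize) -> Vec<u8> {
    let mut out = vec![0u8; (values.len() * bit_width + 7) / 8];
    let mut pos = 0;
    for &v in values {
        // Emit the value's highest bit first.
        for i in (0..bit_width).rev() {
            if (v >> i) & 1 == 1 {
                out[pos / 8] |= 0x80 >> (pos % 8);
            }
            pos += 1;
        }
    }
    out
}

fn main() {
    let values: Vec<u8> = (0..8).collect();
    println!("{:02X?}", pack_lsb_first(&values, 3)); // prints [88, C6, FA]
    println!("{:02X?}", pack_msb_first(&values, 3)); // prints [05, 39, 77]
}
```

The two outputs share no bytes, so a decoder written for one order cannot read the other unchanged.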

However, in the arrow-rs/parquet code base, I see that both encodings use the same BitReader::get_batch implementation. BIT_PACKED uses it directly, while RLE uses it indirectly via RleDecoder::get_batch. I think parquet2 reuses its bit-packing logic in a similar way.

As far as I know, both Rust parquet implementations pass the integration test suite, so I see several possible explanations for this discrepancy:

  • The documentation is wrong, maybe confusing bit order with little-/big-endian byte order
  • The Rust code is wrong, but the BIT_PACKED encoding is not used in practice, not even in the test suite
  • The difference only shows up on big-endian machines (I don't think this can be the case, since the examples show bytes)
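One way to narrow these options down is to feed the example bytes from the documentation to a single LSB-first reader and see which input round-trips. The sketch below does that with a hypothetical `unpack_lsb_first` helper; it is a stand-in written for this issue, not the actual BitReader::get_batch, and the byte arrays are the documented examples as I read them.

```rust
/// Hypothetical helper: read `count` values of `bit_width` bits each,
/// consuming every byte from its least significant bit upward.
fn unpack_lsb_first(bytes: &[u8], bit_width: usize, count: usize) -> Vec<u8> {
    let mut out = Vec::with_capacity(count);
    let mut pos = 0; // absolute bit position in the input stream
    for _ in 0..count {
        let mut v = 0u8;
        for i in 0..bit_width {
            let bit = (bytes[pos / 8] >> (pos % 8)) & 1;
            v |= bit << i;
            pos += 1;
        }
        out.push(v);
    }
    out
}

fn main() {
    // Documented example bytes for the values 0..=7 at bit width 3.
    let rle_example = [0x88u8, 0xC6, 0xFA];
    let bit_packed_example = [0x05u8, 0x39, 0x77];
    let expected: Vec<u8> = (0..8).collect();

    // The RLE/bit-packing hybrid example round-trips with an LSB-first reader.
    assert_eq!(unpack_lsb_first(&rle_example, 3, 8), expected);
    // The deprecated BIT_PACKED example does not.
    assert_ne!(unpack_lsb_first(&bit_packed_example, 3, 8), expected);
    println!("{:?}", unpack_lsb_first(&bit_packed_example, 3, 8));
}
```

If a shared reader behaves like this sketch, then data written in the documented BIT_PACKED order would decode incorrectly, which would point towards the second option unless the documentation itself is wrong.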
