ARROW-1011: [FORMAT] fix typo and mistakes in Layout.md#673
ARROW-1011: [FORMAT] fix typo and mistakes in Layout.md#673cloud-fan wants to merge 1 commit intoapache:masterfrom
Conversation
| | 00001011 | 0 (padding) | unspecified | | ||
| |Byte 0 (validity bitmap) | Bytes 1-63 | | ||
| |-------------------------|-----------------------| | ||
| | 00001011 | 0 (padding) | |
There was a problem hiding this comment.
IIRC this was a deliberate choice so that one can rely upon accurate null counts using a byte-wise hardware popcount instruction. Though I'm not sure how safe that is in general since we can do:
array->Slice(0, 1)
which would yield a slice of length 1, whose additional bytes should not be inspected in algorithms. In an IPC setting, the padding to byte 63 should be all zeros (see e.g. https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/writer.cc#L220), so I think this is fine. We will want to make sure in the IPC writers that we do not write possibly unspecified in-memory bytes from the validity bitmap, instead truncating the buffer to the effective byte range and writing zero padding. It might be worth adding some language to this document about this.
Per our prior discussion about padding -- since this document is about what data gets written to stream or file format, it may be good to be more rigid about zero-ing out the padding bytes that go on the wire. This places a slight additional burden on the writers, though
There was a problem hiding this comment.
I think ideally we only need to use zero padding for null bitmap in a word(8 bytes) boundary, which means the previous code is accurate.
However, to make the rule simpler, I think it also makes sense to require all the padding bytes to be 0 for null bitmap. This is also what we specified in the "Null bitmaps" section:
A 1 (set bit) for index j indicates that the value is not null, while a 0 (bit not set) indicates that it is null. Bitmaps are to be initialized to be all unset at allocation time (this includes padding).
|
Can you open a JIRA? The build failure is unrelated here |
|
We use the "ARROW-1011:" style for the PR titles in our merge script, if you could change that. Thanks! |
wesm
left a comment
There was a problem hiding this comment.
+1. I will comment in the other JIRAs about padding when I can to clarify my concerns there
according to the specification in the `Null bitmaps` section: > Bitmaps are to be initialized to be all unset at allocation time (this includes padding). null bitmaps should always be padded with 0 Author: Wenchen Fan <[email protected]> Closes apache#673 from cloud-fan/first and squashes the following commits: 8566f7c [Wenchen Fan] fix typo and mistakes in Layout.md
according to the specification in the
Null bitmapssection:null bitmaps should always be padded with 0