55 changes: 55 additions & 0 deletions format/puffin-spec.md
@@ -123,6 +123,61 @@ The blob metadata for this blob may include the following properties:

- `ndv`: estimate of number of distinct values, derived from the sketch.

#### `deletion-vector-v1` blob type

A serialized delete vector (bitmap) that represents the positions of rows in a
file that are deleted. A set bit at position P indicates that the row at
position P is deleted.

**Contributor:** Suggested change: "A serialized delete vector (bitmap)…" →
"A serialized deletion vector (bitmap)…".

The vector supports positive 64-bit positions (the most significant bit must be
0), but is optimized for cases where most positions fit in 32 bits by using a
collection of 32-bit Roaring bitmaps. 64-bit positions are divided into a
32-bit "key" using the most significant 4 bytes and a 32-bit sub-position using
the least significant 4 bytes. For each key in the set of positions, a 32-bit
Roaring bitmap is maintained to store a set of 32-bit sub-positions for that
key.

To test whether a certain position is set, its most significant 4 bytes (the
key) are used to find a 32-bit bitmap, and the least significant 4 bytes (the
sub-position) are tested for inclusion in that bitmap. If no bitmap is found
for the key, the position is not set.
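As a non-normative sketch of the key/sub-position scheme (the function names are illustrative, and plain Python sets stand in for the per-key 32-bit Roaring bitmaps):

```python
def split_position(pos: int) -> tuple[int, int]:
    """Split a 64-bit position into a 32-bit key (high 4 bytes)
    and a 32-bit sub-position (low 4 bytes)."""
    assert 0 <= pos < (1 << 63), "positions must be positive 64-bit values"
    return pos >> 32, pos & 0xFFFFFFFF

def mark_deleted(bitmaps: dict[int, set[int]], pos: int) -> None:
    key, sub = split_position(pos)
    # Maintain one "bitmap" (here, a set) of sub-positions per key.
    bitmaps.setdefault(key, set()).add(sub)

def is_deleted(bitmaps: dict[int, set[int]], pos: int) -> bool:
    key, sub = split_position(pos)
    bitmap = bitmaps.get(key)            # find the 32-bit bitmap for the key
    return bitmap is not None and sub in bitmap
```

A real implementation would use 32-bit Roaring bitmaps rather than sets, but the lookup logic is the same.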

The serialized blob contains:
* Combined length of the vector and magic bytes stored as 4 bytes, big-endian
* A 4-byte magic sequence, `D1 D3 39 64`
* The vector, serialized as described below
* A CRC-32 checksum of the magic bytes and serialized vector as 4 bytes, big-endian

**Contributor:** Why is a magic number here if this is already versioned by the blob type?

**Contributor:** I see this was discussed in #11238 (comment).

IMO, trying to converge the specs by taking slightly questionable designs from Delta Lake and propagating them to Iceberg seems like not a great choice. It makes more sense to me to call the existing Delta format legacy and to converge on a representation that both can share. The core payload seems easy enough to transfer, recomputing the CRC as new data is written.

**@aokolnychyi (Oct 11, 2024):** If we were to do it from scratch, I personally would skip the length field and use a consistent byte order for the magic number, bitmap, and checksum. The magic number is needed as a sanity check because readers will jump directly to an offset without reading the Puffin footer; checking the magic number ensures we are reading what we expect without an extra request to the footer. Using portable serialization for 64-bit Roaring bitmaps is the only reasonable option, so that wouldn't change either. The CRC helps catch bit flips if the underlying storage becomes faulty, but I wish we had used little-endian, as for the bitmap and magic number.

It is still a question for the Iceberg community whether not including the length and using a consistent byte order is significant enough to break compatibility with the existing Delta spec. This layout is a bit tricky to understand, I will admit that.

@emkornfield, what are your thoughts?

**Contributor:**

> Therefore, checking the magic number ensures we are reading what we expect without an extra request to the footer.

I'm a little skeptical of the value here, given we are already labelling this content elsewhere. The internal consistency of the data structures needs to be vetted anyway, and it is unlikely we would get valid structures if the offset is incorrect. The CRC-32 provides a further sanity check if the structures do appear to be valid. It's 4 bytes, so at the end of the day there are likely bigger fish to fry.

> It is still a question for the Iceberg community whether not including the length and using a consistent byte order is significant enough to break compatibility with the existing Delta spec.

My (non-binding) vote would be -0.1 on simply copying and pasting the Delta Lake spec. Given the relatively small differences, it seems on the surface that the Delta Lake spec could be revised to converge with the Iceberg spec without replicating a technical decision. As long as the spec is clear enough, the likelihood that these differences have any real impact is probably minimal. The CRC version probably has the most impact (I don't have any stats, but I assume this would be small).

**Contributor:** Is the choice of little-endian meant to match most processors' architectures, to avoid byte swapping on read?

Agreed that it would be nice to use consistent byte ordering (little-endian). Parquet and ORC seem to use little-endian to encode integers too.

I don't fully understand the compatibility part with Delta Lake. It is not as if Delta Lake would use the Puffin file directly, right? I assume some conversion is still needed. Those metadata fields are cheap to convert between byte orders if needed, as long as we keep the big bitmap vector compatible with the same portable serialization.

**Contributor:** I believe the intent here is to be byte-for-byte compatible with Delta Lake deletion vectors, so that both can be read given just a tuple of (file URI, offset, length); no rewrites would be necessary.

**Contributor:**

> I'm a little skeptical of the value here, given we are already labelling this content elsewhere. The internal consistency of the data structures needs to be vetted anyway, and it is unlikely we would get valid structures if the offset is incorrect. The CRC-32 provides a further sanity check if the structures do appear to be valid.

The CRC is computed on the bytes, so it can only catch random bit flips. I do think checking the magic number is a reasonable sanity check that is fairly common in the industry. I'd be curious to hear alternatives.

**Contributor:**

> The CRC is computed on the bytes, so it can only catch random bit flips.

I'd flip this on its head: I'd like to understand what use cases we expect the magic number to sanity-check that would otherwise fail. The CRC doesn't just detect bit flips; it also detects accidental truncation or enlargement of the range (e.g. if you leave off the last byte, it is very unlikely the CRC check would pass). If the argument is that the CRC will not be checked on most reads, then the internal consistency of the vectors provides another sanity check (I think it would be quite hard for randomly sampled bytes to parse fully, and implementations should be careful when parsing here).

> I do think checking the magic number is a reasonable sanity check that is fairly common in the industry.

I'm not saying it is not useful in general, and even in this case it might be, but magic numbers are typically a file-level concept so that readers can determine whether the file type is something they can in fact read (and disambiguate between multiple types of binary-formatted data). In this case the information is redundant with the information in the footer. For direct reads based on manifest data, a reasonable design option would be to introduce a new content type value reflecting `deletion-vector-v1`, which would allow readers to fail earlier if the type is not recognized.

As another example, theta sketches in Puffin don't include a magic number (and as far as I can tell the library itself doesn't add one during serialization).

Ultimately it is just 4 bytes, so it might not pay to argue too much about it, especially if the community agrees the compatibility is worth it.

**Contributor:** While I don't think the magic number check is critical, I do believe it is beneficial. If things start to fail, we would want to have as much helpful information as possible. Having the magic number allows us to cross-check the serialization format without reading the footer and may help debug issues with offsets. It will also be useful if we add more serialization formats in the future. I agree it is unlikely we would be able to successfully deserialize the rest of the content if the offset is invalid, but still: if we end up in that situation, it would mean there was an ugly bug, and having more metadata will only help. Overall, it does seem like a reasonable sanity check to me, similar to the magic numbers in zstd and gzip.

We once had to debug bit flips while reading manifests. There was no easy way to prove that we didn't corrupt the files and that it was a faulty disk. The CRC check would catch those and save a ton of time.

I'd propose keeping the magic number and CRC independently of whether we decide to follow the Delta Lake blob layout. The length field and byte orders are the controversial parts; there is no merit in those beyond compatibility with Delta Lake.

**@emkornfield (Oct 14, 2024):**

> While I don't think the magic number check is critical, I do believe it is beneficial. If things start to fail, we would want to have as much helpful information as possible.

There is obviously a balance here (for instance, we aren't adding a magic footer, nor are we using error-correcting codes). Like I said, probably not worth debating too much, as it is 4 bytes.

> I'd propose keeping the magic number and CRC independently of whether we decide to follow the Delta Lake blob layout.

I agree a digest is important; I think the main question is which one. We should probably limit that discussion to the other thread. I also think it was probably a miss not to include this at the Puffin format level, or at least directly on the theta sketch.

The position vector is serialized using the Roaring bitmap
["portable" format][roaring-bitmap-portable-serialization]. This representation
consists of:

* The number of 32-bit Roaring bitmaps, serialized as 8 bytes, little-endian
* For each 32-bit Roaring bitmap, ordered by unsigned comparison of the 32-bit keys:
- The key stored as 4 bytes, little-endian
- A [32-bit Roaring bitmap][roaring-bitmap-general-layout]

Note that the length and CRC fields are stored big-endian, while the Roaring
bitmap format uses little-endian values. Big-endian values were chosen for
compatibility with existing deletion vectors in Delta tables.
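As a non-normative sketch of the layout described above (the function names are illustrative; the inner 32-bit Roaring bitmap bytes are assumed to come from a Roaring library and are treated as opaque here):

```python
import struct
import zlib

# Magic sequence for deletion-vector-v1 blobs, per the spec above.
MAGIC = bytes([0xD1, 0xD3, 0x39, 0x64])

def serialize_portable_64(bitmaps: dict[int, bytes]) -> bytes:
    """Outer layout of the 64-bit Roaring "portable" format: an 8-byte
    little-endian bitmap count, then (key, 32-bit bitmap) pairs ordered
    by unsigned comparison of the 32-bit keys."""
    out = struct.pack("<Q", len(bitmaps))
    for key in sorted(bitmaps):
        out += struct.pack("<I", key) + bitmaps[key]
    return out

def frame_blob(vector_bytes: bytes) -> bytes:
    """Wrap a serialized vector in the blob framing: big-endian combined
    length of magic + vector, the magic bytes, the vector, then a
    big-endian CRC-32 of magic + vector."""
    body = MAGIC + vector_bytes
    return struct.pack(">I", len(body)) + body + struct.pack(">I", zlib.crc32(body))

def unframe_blob(blob: bytes) -> bytes:
    """Validate the framing and return the serialized vector bytes."""
    (length,) = struct.unpack_from(">I", blob, 0)
    body = blob[4:4 + length]
    if body[:4] != MAGIC:
        raise ValueError("bad magic")
    (crc,) = struct.unpack_from(">I", blob, 4 + length)
    if zlib.crc32(body) != crc:
        raise ValueError("CRC mismatch")
    return body[4:]
```

The sketch makes the mixed byte order visible: `<Q`/`<I` (little-endian) inside the bitmap serialization, `>I` (big-endian) for the length and CRC framing.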

The blob's `properties` must:

* Include `referenced-data-file`, the location of the data file the delete
vector applies to; must be equal to the data file's `location` in table
  metadata
* Include `cardinality`, the number of deleted rows (set positions) in the
  delete vector
* Omit `compression-codec`; `deletion-vector-v1` is not compressed

**Contributor** (on lines +166 to +167): Suggested change: "the data file the delete vector applies to" → "the data file the deletion vector applies to".

**Author:** These are synonyms and I don't think that this is confusing. Is there a concern that you have about using both terms?

**Contributor** (on `cardinality`): We should document that this should be equal to the number of rows listed in the manifest.

**Author:** This spec is independent of manifests, so I don't want to add that here. I also think that it's clear in the table spec that `record_count` for a delete vector is its cardinality.

Snapshot ID and sequence number are not known at the time the Puffin file is
created. `snapshot-id` and `sequence-number` must be set to -1 in blob metadata
for Puffin v1.
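As an illustration of the requirements above (not normative text; the field layout and file path are hypothetical examples, shown here as a Python dict):

```python
# Hypothetical blob metadata entry for a deletion-vector-v1 blob in a
# Puffin v1 footer.
blob_metadata = {
    "type": "deletion-vector-v1",
    "snapshot-id": -1,        # not known when the Puffin file is created
    "sequence-number": -1,    # likewise, must be -1 for Puffin v1
    "properties": {
        # Must equal the data file's `location` in table metadata.
        "referenced-data-file": "s3://bucket/db/tbl/data/00000-0-data.parquet",
        # Number of deleted rows (set positions) in the delete vector.
        "cardinality": "123",
        # `compression-codec` is omitted: deletion-vector-v1 is not compressed.
    },
}
```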


[roaring-bitmap-portable-serialization]: https://github.com/RoaringBitmap/RoaringFormatSpec?tab=readme-ov-file#extension-for-64-bit-implementations
[roaring-bitmap-general-layout]: https://github.com/RoaringBitmap/RoaringFormatSpec?tab=readme-ov-file#general-layout

### Compression codecs

The data can also be uncompressed. If it is compressed, the codec should be one of