# Puffin: Add delete-vector-v1 blob type #11238
The blob metadata for this blob may include the following properties:

- `ndv`: estimate of number of distinct values, derived from the sketch.
#### `deletion-vector-v1` blob type
A serialized delete vector (bitmap) that represents the positions of rows in a
file that are deleted. A set bit at position P indicates that the row at
position P is deleted.
The vector supports positive 64-bit positions (the most significant bit must be
0), but is optimized for cases where most positions fit in 32 bits by using a
collection of 32-bit Roaring bitmaps. A 64-bit position is divided into a
32-bit "key" formed from the most significant 4 bytes and a 32-bit
sub-position formed from the least significant 4 bytes. For each key in the
set of positions, a 32-bit Roaring bitmap is maintained to store the set of
32-bit sub-positions for that key.
To test whether a certain position is set, its most significant 4 bytes (the
key) are used to find a 32-bit bitmap, and the least significant 4 bytes (the
sub-position) are tested for inclusion in that bitmap. If no bitmap is found
for the key, the position is not set.
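The key/sub-position scheme above can be sketched in a few lines. This is an illustrative model only, not the Iceberg implementation: plain Python sets stand in for 32-bit Roaring bitmaps, and the names `split_position` and `DeleteVector` are hypothetical.

```python
def split_position(pos: int) -> tuple[int, int]:
    """Split a positive 64-bit position into a 32-bit key and sub-position."""
    if not 0 <= pos < (1 << 63):
        raise ValueError("the most significant bit of the position must be 0")
    return pos >> 32, pos & 0xFFFFFFFF


class DeleteVector:
    def __init__(self) -> None:
        # key -> set of 32-bit sub-positions (stand-in for a Roaring bitmap)
        self.bitmaps: dict[int, set[int]] = {}

    def add(self, pos: int) -> None:
        key, sub = split_position(pos)
        self.bitmaps.setdefault(key, set()).add(sub)

    def contains(self, pos: int) -> bool:
        key, sub = split_position(pos)
        bitmap = self.bitmaps.get(key)
        # If no bitmap exists for the key, the position is not set.
        return bitmap is not None and sub in bitmap
```

Note that a position with key 0 and a position with key 5 can share the same sub-position value without colliding, since each key owns its own bitmap.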
The serialized blob contains:

* Combined length of the vector and magic bytes stored as 4 bytes, big-endian
* A 4-byte magic sequence, `D1 D3 39 64`
> **Contributor:** Why is a magic number here if this is already versioned by the blob type?
>
> **Contributor:** I see this was discussed in #11238 (comment). IMO, trying to converge the specs by taking slightly questionable designs from Delta Lake and propagating them to Iceberg seems like not a great choice. It makes more sense to me to call the existing Delta format legacy and converge onto a representation that both can share. It seems the core payload is easy enough to transfer, recomputing the CRC as new data is written.
>
> **Contributor:** If we were to do it from scratch, I personally would skip the length field and use a consistent byte order for the magic number, bitmap, and checksum. The magic number is needed as a sanity check because readers will jump directly to an offset without reading the Puffin footer, so checking the magic number ensures we are reading what we expect without an extra request to the footer. Using portable serialization for 64-bit Roaring bitmaps is the only reasonable option, so that wouldn't change either. The CRC helps catch bit flips if the underlying storage becomes faulty, but I wish we used little-endian as for the bitmap and magic number. It is still a question for the Iceberg community whether not including the length and using a consistent byte order is significant enough to break compatibility with the existing Delta spec. This layout is a bit tricky to understand, I will admit that. @emkornfield, what are your thoughts?
>
> **Contributor:** I'm a little skeptical of the value here, given we are already labelling this content elsewhere. The internal consistency of the data structures needs to be vetted anyway, and the structures are unlikely to parse as valid if the offset is incorrect. The CRC-32 provides a further sanity check if the structures do appear to be valid. It's 4 bytes, so at the end of the day there are likely bigger fish to fry. My (non-binding) vote would be -0.1 on simply copying and pasting the Delta Lake spec. Given the relatively small differences, on the surface it seems the Delta Lake spec could be revised to converge with the Iceberg spec without replicating a technical decision. As long as the spec is clear enough, the likelihood that these differences have any real impact is probably minimal. The CRC version probably has the most impact (I don't have any stats, but I assume the effect would be small).
>
> **Contributor:** Is the choice of little-endian to match most processors' architectures and avoid byte swapping on read? I agree it would be nice to use consistent byte ordering (little-endian); Parquet and ORC seem to use little-endian to encode integers too. I don't fully understand the compatibility argument with Delta Lake: it is not as if Delta Lake would use the Puffin file directly, right? I assume some conversion is still needed, and those meta fields are cheap to convert between byte orders, as long as we keep the big bitmap vector compatible with the same portable serialization.
>
> **Contributor:** I believe the intent here is to be byte-for-byte compatible with Delta Lake deletion vectors so both can be read easily given a tuple of (file URI, offset, length), so no rewrites would be necessary.
>
> **Contributor:** The CRC is computed on the bytes, so it can only catch random bit flips. I do think checking the magic number is a reasonable sanity check that is fairly common in the industry. I'd be curious to hear alternatives.
>
> **Contributor:** I'd flip this on its head: what use cases do we expect it to sanity check that would otherwise fail? The CRC doesn't just detect bit flips; it also detects accidental truncation or enlargement of the range (e.g. if you leave off the last byte, it is very unlikely the CRC check would pass). If the argument is that the CRC will not be checked on most reads, then the consistency of the vectors provides another sanity check (it would be quite hard for randomly sampled bytes to parse fully, and implementations should be careful in parsing here). I'm not saying it is not useful in general, and even in this case it might be, but magic numbers are typically a file-level concept so that readers can determine whether the file type is something they can in fact read (and disambiguate between multiple types of binary-formatted data). In this case the information is redundant with the information in the footer. For direct reads based on manifest data, a reasonable design option would be to introduce a new content type value reflecting `deletion-vector-v1`, which would allow readers to fail earlier if the type is not recognized. As another example, theta sketches in Puffin don't include a magic number (and as far as I can tell the library by itself doesn't add one for serialization). Ultimately it is just 4 bytes, so it might not pay to argue too much about it, especially if the community agrees that the compatibility is worth it.
>
> **Contributor:** While I don't think the magic number check is critical, I do believe it is beneficial. If things start to fail, we would want as much helpful information as possible. Having the magic number allows us to cross-check the serialization format without reading the footer and may help debug issues with offsets. It will also be useful if we add more serialization formats in the future. I agree it is unlikely we will successfully deserialize the rest of the content if the offset is invalid, but still: if we end up in that situation, it would mean there was an ugly bug, and having more metadata will only help. Overall, it does seem like a reasonable sanity check to me, similar to the magic numbers in zstd and gzip. We once had to debug issues with bit flips while reading manifests; there was no easy way to prove that we didn't corrupt the files and that it was a faulty disk. The CRC check would catch those and save a ton of time. I'd propose keeping the magic number and CRC independently of whether we decide to follow the Delta Lake blob layout. The length and byte orders are the controversial parts; there is no merit in those beyond compatibility with Delta Lake.
>
> **Contributor:** There is obviously a balance here (for instance, we aren't adding a magic footer, nor are we using error-correcting codes). Like I said, probably not worth debating too much, as it is 4 bytes. I agree a digest is important; I think the main question is which one. We should probably limit that discussion to the other thread. I think it was probably a miss not to include this at the Puffin format level, or at least directly on the theta sketch.
* The vector, serialized as described below
* A CRC-32 checksum of the magic bytes and serialized vector as 4 bytes, big-endian
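The framing above (length, magic, vector, checksum) can be sketched with the standard library. This is an illustrative sketch, not the Iceberg implementation; the helper names are hypothetical, and `vector_bytes` is assumed to already hold the Roaring "portable" serialization described below.

```python
import struct
import zlib

MAGIC = bytes([0xD1, 0xD3, 0x39, 0x64])

def frame_delete_vector(vector_bytes: bytes) -> bytes:
    # Length covers the magic bytes plus the vector, stored as 4 bytes big-endian.
    length = struct.pack(">i", len(MAGIC) + len(vector_bytes))
    # CRC-32 covers the magic bytes and the vector, also stored big-endian.
    crc = struct.pack(">I", zlib.crc32(MAGIC + vector_bytes))
    return length + MAGIC + vector_bytes + crc

def unframe_delete_vector(blob: bytes) -> bytes:
    """Check the framing and return the serialized vector bytes."""
    (length,) = struct.unpack(">i", blob[:4])
    if blob[4:8] != MAGIC:
        raise ValueError("bad magic sequence")
    body = blob[4:4 + length]                       # magic + vector
    (crc,) = struct.unpack(">I", blob[4 + length:8 + length])
    if zlib.crc32(body) != crc:
        raise ValueError("CRC-32 mismatch")
    return body[4:]                                 # drop the magic, keep the vector
```

Because the length field excludes itself and the trailing checksum, the total blob size is always `4 + length + 4` bytes.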
The position vector is serialized using the Roaring bitmap
["portable" format][roaring-bitmap-portable-serialization]. This representation
consists of:
* The number of 32-bit Roaring bitmaps, serialized as 8 bytes, little-endian
* For each 32-bit Roaring bitmap, ordered by unsigned comparison of the 32-bit keys:
  - The key stored as 4 bytes, little-endian
  - A [32-bit Roaring bitmap][roaring-bitmap-general-layout]

Note that the length and CRC fields are stored big-endian, while the Roaring
bitmap format uses little-endian values. Big-endian values were chosen for
compatibility with existing deletion vectors in Delta tables.
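The outer structure of the portable layout can be sketched as follows. This is a sketch under the assumption that each 32-bit Roaring bitmap has already been serialized elsewhere; the inner 32-bit format is not reimplemented here, and `serialize_portable64` is a hypothetical name.

```python
import struct

def serialize_portable64(bitmaps32: dict[int, bytes]) -> bytes:
    """Emit the outer structure of the Roaring 64-bit "portable" layout.

    `bitmaps32` maps each 32-bit key to the already-serialized bytes of
    its 32-bit Roaring bitmap.
    """
    out = [struct.pack("<Q", len(bitmaps32))]   # bitmap count, 8 bytes little-endian
    for key in sorted(bitmaps32):               # ascending (unsigned) key order
        out.append(struct.pack("<I", key))      # key, 4 bytes little-endian
        out.append(bitmaps32[key])              # 32-bit Roaring bitmap bytes
    return b"".join(out)
```

Sorting the keys ascending matches the required unsigned comparison as long as keys are kept as non-negative integers below 2^32.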
The blob's `properties` must:
* Include `referenced-data-file`, the location of the data file the delete
  vector applies to; must be equal to the data file's `location` in table
  metadata

  > **Contributor:** [Suggested change]
  >
  > **Contributor (Author):** These are synonyms and I don't think that this is confusing. Is there a concern that you have about using both terms?

* Include `cardinality`, the number of deleted rows (set positions) in the
  delete vector

  > **Contributor:** We should document that this should be equal to the number of rows listed in the manifest.
  >
  > **Contributor (Author):** This spec is independent of manifests, so I don't want to add that here. I also think that it's clear in the table spec that …
* Omit `compression-codec`; `deletion-vector-v1` is not compressed
Snapshot ID and sequence number are not known at the time the Puffin file is
created. `snapshot-id` and `sequence-number` must be set to -1 in blob metadata
for Puffin v1.
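Putting the metadata requirements together, a blob entry for a deletion vector might look like the following. This is an illustrative sketch assuming the standard Puffin blob metadata fields; the file path, `offset`, and `length` values are made up, and `cardinality` is a string because Puffin properties form a string map.

```json
{
  "type": "deletion-vector-v1",
  "fields": [],
  "snapshot-id": -1,
  "sequence-number": -1,
  "offset": 4,
  "length": 24,
  "properties": {
    "referenced-data-file": "s3://bucket/db/table/data/00000-0-data.parquet",
    "cardinality": "2"
  }
}
```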
[roaring-bitmap-portable-serialization]: https://github.com/RoaringBitmap/RoaringFormatSpec?tab=readme-ov-file#extension-for-64-bit-implementations
[roaring-bitmap-general-layout]: https://github.com/RoaringBitmap/RoaringFormatSpec?tab=readme-ov-file#general-layout
### Compression codecs

The data can also be uncompressed. If it is compressed, the codec should be one of