55 changes: 55 additions & 0 deletions format/puffin-spec.md
@@ -123,6 +123,61 @@ The blob metadata for this blob may include the following properties:

- `ndv`: estimate of number of distinct values, derived from the sketch.

#### `deletion-vector-v1` blob type

A serialized delete vector (bitmap) that represents the positions of rows in a
file that are deleted. A set bit at position P indicates that the row at
position P is deleted.

**Contributor:** Suggested change: "A serialized delete vector (bitmap)…" →
"A serialized deletion vector (bitmap)…".

The vector supports positive 64-bit positions (the most significant bit must be
0), but is optimized for cases where most positions fit in 32 bits by using a
collection of 32-bit Roaring bitmaps. 64-bit positions are divided into a
32-bit "key" using the most significant 4 bytes and a 32-bit sub-position using
the least significant 4 bytes. For each key in the set of positions, a 32-bit
Roaring bitmap is maintained to store a set of 32-bit sub-positions for that
key.

To test whether a certain position is set, its most significant 4 bytes (the
key) are used to find a 32-bit bitmap, and the least significant 4 bytes (the
sub-position) are tested for inclusion in that bitmap. If no bitmap is found
for the key, the position is not set.
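As a non-normative sketch of the key/sub-position scheme (the function names are illustrative, and plain Python sets stand in for the per-key 32-bit Roaring bitmaps):

```python
def split_position(pos: int) -> tuple[int, int]:
    """Split a 64-bit position into a 32-bit key (high 4 bytes)
    and a 32-bit sub-position (low 4 bytes)."""
    assert 0 <= pos < (1 << 63), "positions must be positive 64-bit values"
    return pos >> 32, pos & 0xFFFFFFFF

def mark_deleted(bitmaps: dict[int, set[int]], pos: int) -> None:
    key, sub = split_position(pos)
    # Maintain one "bitmap" (here, a set) of sub-positions per key.
    bitmaps.setdefault(key, set()).add(sub)

def is_deleted(bitmaps: dict[int, set[int]], pos: int) -> bool:
    key, sub = split_position(pos)
    bitmap = bitmaps.get(key)            # find the 32-bit bitmap for the key
    return bitmap is not None and sub in bitmap
```

A real implementation would use 32-bit Roaring bitmaps rather than sets, but the lookup logic is the same.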

The serialized blob contains:
* Combined length of the vector and magic bytes stored as 4 bytes, big-endian
* A 4-byte magic sequence, `D1 D3 39 64`
* The vector, serialized as described below
* A CRC-32 checksum of the magic bytes and serialized vector as 4 bytes, big-endian

**Contributor:** Why is a magic number here if this is already versioned by the blob type?

**Contributor:** I see this was discussed in #11238 (comment).

IMO, trying to converge the specs by taking slightly questionable designs from Delta Lake and propagating them to Iceberg seems like not a great choice. It makes more sense to me to call the existing Delta format legacy and to converge on a representation that both can share. The core payload seems easy enough to transfer, recomputing the CRC as new data is written.

**@aokolnychyi (Oct 11, 2024):** If we were to do it from scratch, I personally would skip the length field and use a consistent byte order for the magic number, bitmap, and checksum. The magic number is needed as a sanity check because readers will jump directly to an offset without reading the Puffin footer; checking the magic number ensures we are reading what we expect without an extra request to the footer. Using portable serialization for 64-bit Roaring bitmaps is the only reasonable option, so that wouldn't change either. The CRC helps catch bit flips if the underlying storage becomes faulty, but I wish we had used little-endian, as for the bitmap and magic number.

It is still a question for the Iceberg community whether not including the length and using a consistent byte order is significant enough to break compatibility with the existing Delta spec. This layout is a bit tricky to understand, I will admit that.

@emkornfield, what are your thoughts?

**Contributor:**

> Therefore, checking the magic number ensures we are reading what we expect without an extra request to the footer.

I'm a little skeptical of the value here, given we are already labelling this content elsewhere. The internal consistency of the data structures needs to be vetted anyway, and it is unlikely we would get valid structures if the offset is incorrect. The CRC-32 provides a further sanity check if the structures do appear to be valid. It's 4 bytes, so at the end of the day there are likely bigger fish to fry.

> It is still a question for the Iceberg community whether not including the length and using a consistent byte order is significant enough to break compatibility with the existing Delta spec.

My (non-binding) vote would be -0.1 on simply copying and pasting the Delta Lake spec. Given the relatively small differences, it seems on the surface that the Delta Lake spec could be revised to converge with the Iceberg spec without replicating a technical decision. As long as the spec is clear enough, the likelihood that these differences have any real impact is probably minimal. The CRC version probably has the most impact (I don't have any stats, but I assume this would be small).

**Contributor:** Is the choice of little-endian meant to match most processors' architectures, to avoid byte swapping on read?

Agreed that it would be nice to use consistent byte ordering (little-endian). Parquet and ORC seem to use little-endian to encode integers too.

I don't fully understand the compatibility part with Delta Lake. It is not as if Delta Lake would use the Puffin file directly, right? I assume some conversion is still needed. Those metadata fields are cheap to convert between byte orders if needed, as long as we keep the big bitmap vector compatible with the same portable serialization.

**Contributor:** I believe the intent here is to be byte-for-byte compatible with Delta Lake deletion vectors, so that both can be read given just a tuple of (file URI, offset, length); no rewrites would be necessary.

**Contributor:**

> I'm a little skeptical of the value here, given we are already labelling this content elsewhere. The internal consistency of the data structures needs to be vetted anyway, and it is unlikely we would get valid structures if the offset is incorrect. The CRC-32 provides a further sanity check if the structures do appear to be valid.

The CRC is computed on the bytes, so it can only catch random bit flips. I do think checking the magic number is a reasonable sanity check that is fairly common in the industry. I'd be curious to hear alternatives.

**Contributor:**

> The CRC is computed on the bytes, so it can only catch random bit flips.

I'd flip this on its head: I'd like to understand what use cases we expect the magic number to sanity-check that would otherwise fail. The CRC doesn't just detect bit flips; it also detects accidental truncation or enlargement of the range (e.g. if you leave off the last byte, it is very unlikely the CRC check would pass). If the argument is that the CRC will not be checked on most reads, then the internal consistency of the vectors provides another sanity check (I think it would be quite hard for randomly sampled bytes to parse fully, and implementations should be careful when parsing here).

> I do think checking the magic number is a reasonable sanity check that is fairly common in the industry.

I'm not saying it is not useful in general, and even in this case it might be, but magic numbers are typically a file-level concept so that readers can determine whether the file type is something they can in fact read (and disambiguate between multiple types of binary-formatted data). In this case the information is redundant with the information in the footer. For direct reads based on manifest data, a reasonable design option would be to introduce a new content type value reflecting `deletion-vector-v1`, which would allow readers to fail earlier if the type is not recognized.

As another example, theta sketches in Puffin don't include a magic number (and as far as I can tell the library itself doesn't add one during serialization).

Ultimately it is just 4 bytes, so it might not pay to argue too much about it, especially if the community agrees the compatibility is worth it.

**Contributor:** While I don't think the magic number check is critical, I do believe it is beneficial. If things start to fail, we would want to have as much helpful information as possible. Having the magic number allows us to cross-check the serialization format without reading the footer and may help debug issues with offsets. It will also be useful if we add more serialization formats in the future. I agree it is unlikely we would be able to successfully deserialize the rest of the content if the offset is invalid, but still: if we end up in that situation, it would mean there was an ugly bug, and having more metadata will only help. Overall, it does seem like a reasonable sanity check to me, similar to the magic numbers in zstd and gzip.

We once had to debug bit flips while reading manifests. There was no easy way to prove that we didn't corrupt the files and that it was a faulty disk. The CRC check would catch those and save a ton of time.

I'd propose keeping the magic number and CRC independently of whether we decide to follow the Delta Lake blob layout. The length field and byte orders are the controversial parts; there is no merit in those beyond compatibility with Delta Lake.

**@emkornfield (Oct 14, 2024):**

> While I don't think the magic number check is critical, I do believe it is beneficial. If things start to fail, we would want to have as much helpful information as possible.

There is obviously a balance here (for instance, we aren't adding a magic footer, nor are we using error-correcting codes). Like I said, probably not worth debating too much, as it is 4 bytes.

> I'd propose keeping the magic number and CRC independently of whether we decide to follow the Delta Lake blob layout.

I agree a digest is important; I think the main question is which one. We should probably limit that discussion to the other thread. I also think it was probably a miss not to include this at the Puffin format level, or at least directly on the theta sketch.

The position vector is serialized using the Roaring bitmap
["portable" format][roaring-bitmap-portable-serialization]. This representation
consists of:

* The number of 32-bit Roaring bitmaps, serialized as 8 bytes, little-endian
* For each 32-bit Roaring bitmap, ordered by unsigned comparison of the 32-bit keys:
- The key stored as 4 bytes, little-endian
- A [32-bit Roaring bitmap][roaring-bitmap-general-layout]

Note that the length and CRC fields are stored big-endian, while the Roaring
bitmap format uses little-endian values. Big-endian values were chosen for
compatibility with existing deletion vectors in Delta tables.
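As a non-normative sketch of the layout described above (the function names are illustrative; the inner 32-bit Roaring bitmap bytes are assumed to come from a Roaring library and are treated as opaque here):

```python
import struct
import zlib

# Magic sequence for deletion-vector-v1 blobs, per the spec above.
MAGIC = bytes([0xD1, 0xD3, 0x39, 0x64])

def serialize_portable_64(bitmaps: dict[int, bytes]) -> bytes:
    """Outer layout of the 64-bit Roaring "portable" format: an 8-byte
    little-endian bitmap count, then (key, 32-bit bitmap) pairs ordered
    by unsigned comparison of the 32-bit keys."""
    out = struct.pack("<Q", len(bitmaps))
    for key in sorted(bitmaps):
        out += struct.pack("<I", key) + bitmaps[key]
    return out

def frame_blob(vector_bytes: bytes) -> bytes:
    """Wrap a serialized vector in the blob framing: big-endian combined
    length of magic + vector, the magic bytes, the vector, then a
    big-endian CRC-32 of magic + vector."""
    body = MAGIC + vector_bytes
    return struct.pack(">I", len(body)) + body + struct.pack(">I", zlib.crc32(body))

def unframe_blob(blob: bytes) -> bytes:
    """Validate the framing and return the serialized vector bytes."""
    (length,) = struct.unpack_from(">I", blob, 0)
    body = blob[4:4 + length]
    if body[:4] != MAGIC:
        raise ValueError("bad magic")
    (crc,) = struct.unpack_from(">I", blob, 4 + length)
    if zlib.crc32(body) != crc:
        raise ValueError("CRC mismatch")
    return body[4:]
```

The sketch makes the mixed byte order visible: `<Q`/`<I` (little-endian) inside the bitmap serialization, `>I` (big-endian) for the length and CRC framing.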

The blob's `properties` must:

* Include `referenced-data-file`, the location of the data file the delete
vector applies to; must be equal to the data file's `location` in table
  metadata
* Include `cardinality`, the number of deleted rows (set positions) in the
  delete vector
* Omit `compression-codec`; `deletion-vector-v1` is not compressed

**Contributor** (on lines +166 to +167): Suggested change: "the data file the delete vector applies to" → "the data file the deletion vector applies to".

**Author:** These are synonyms and I don't think that this is confusing. Is there a concern that you have about using both terms?

**Contributor** (on `cardinality`): We should document that this should be equal to the number of rows listed in the manifest.

**Author:** This spec is independent of manifests, so I don't want to add that here. I also think that it's clear in the table spec that `record_count` for a delete vector is its cardinality.

Snapshot ID and sequence number are not known at the time the Puffin file is
created. `snapshot-id` and `sequence-number` must be set to -1 in blob metadata
for Puffin v1.
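As an illustration of the requirements above (not normative text; the field layout and file path are hypothetical examples, shown here as a Python dict):

```python
# Hypothetical blob metadata entry for a deletion-vector-v1 blob in a
# Puffin v1 footer.
blob_metadata = {
    "type": "deletion-vector-v1",
    "snapshot-id": -1,        # not known when the Puffin file is created
    "sequence-number": -1,    # likewise, must be -1 for Puffin v1
    "properties": {
        # Must equal the data file's `location` in table metadata.
        "referenced-data-file": "s3://bucket/db/tbl/data/00000-0-data.parquet",
        # Number of deleted rows (set positions) in the delete vector.
        "cardinality": "123",
        # `compression-codec` is omitted: deletion-vector-v1 is not compressed.
    },
}
```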


[roaring-bitmap-portable-serialization]: https://github.com/RoaringBitmap/RoaringFormatSpec?tab=readme-ov-file#extension-for-64-bit-implementations
[roaring-bitmap-general-layout]: https://github.com/RoaringBitmap/RoaringFormatSpec?tab=readme-ov-file#general-layout

### Compression codecs

The data can also be uncompressed. If it is compressed, the codec should be one of