Finish Adding Default Implementations for V1→V2#1579
Conversation
Fixes apache#1576 ## Summary This PR implements `Default` trait implementations for `DataContentType` and `ManifestContentType` to enable proper V1→V2 manifest projection as specified in the Apache Iceberg format specification. As part of solving ManifestList projection, I did a full audit of ManifestData, ManifestList, ManifestEntry and Snapshot files. All five file types are now covered for V1 to V2 projection. ## Changes Made - Added `Default` for `DataContentType`**: Returns `DataContentType::Data` (value 0) for V1 compatibility - Added `Default` for `ManifestContentType`**: Returns `ManifestContentType::Data` (value 0) for V1 compatibility ### New Tests - ManifestFile V1→V2 projection test**: Validates `ManifestFileV1::try_into()` properly sets defaults for missing V2 fields - ManifestEntry V1→V2 projection test**: Validates `ManifestEntryV1::try_into()` correctly applies sequence number defaults - DataFile V1→V2 projection test**: Validates `DataFileSerde::try_into()` handles V1 field defaults correctly - Snapshot V1→V2 projection tests**: Two tests covering both scenarios (with/without optional summary field) ## Technical Details ### V1→V2 Field Mapping According to the Iceberg spec, when reading V1 manifests as V2, the following defaults are applied: | Component | Field | V1 Value | V2 Default | |-----------|-------|----------|------------| | ManifestFile | `content` | absent | `0` (Data) | | ManifestFile | `sequence_number` | absent | `0` | | ManifestFile | `min_sequence_number` | absent | `0` | | ManifestEntry | `sequence_number` | absent | `0` | | ManifestEntry | `file_sequence_number` | absent | `0` | | DataFile | `content` | absent | `0` (Data) | | DataFile | `equality_ids` | absent | `[]` (empty) | | Snapshot | `sequence_number` | absent | `0` | ### Existing Infrastructure The V1→V2 projection was already working through existing conversion logic: - `ManifestFileV1::try_into()` - converts V1 manifest list entries to V2 - `ManifestEntryV1::try_into()` - converts V1 manifest entries to V2 - `DataFileSerde` with `#[serde(default)]` - handles V1 data file defaults - `SnapshotV1::try_into()` - converts V1 snapshots to V2 This PR adds the missing `Default` implementations and comprehensive tests to validate the behavior. ## Testing ## All new tests pass and validate: - ✅ V1→V2 conversion applies correct defaults for missing fields - ✅ Existing V1 fields are preserved during conversion - ✅ Different scenarios (with/without optional fields) work correctly - ✅ Round-trip compatibility is maintained
|
Working on trying to figure out the failing check. Can't recreate on my Linux computer, but seems to be failing on my Mac. Will send a new commit once I have resolved the issue. |
liurenjie1024
left a comment
There was a problem hiding this comment.
Thanks @Kurtiscwright for this pr, LGTM!
|
Thnks for working on this @Kurtiscwright. The changes make sense to me, but I would also expect seting more |
Thank you for taking the time to review the PR @liurenjie1024 and @Fokko. My understanding (could be completely incorrect) is that with the annotations of #default where the schema is declared always provides that value unless overridden. For example, whenever ManifestContentType is called you always get a value of Documentation: https://doc.rust-lang.org/std/default/trait.Default.html |
Hi, @Fokko I'm also confusing about this question. Are you talking about scanning table with fields that has |
|
Thanks @liurenjie1024 for pinging me here. I think I did not a great job at explaining my original intent at the issue. This PR is great, and allows converting the V1 structs into V2, which I think is also valuable. My goal was to have an Avro reader that both can read V1 and V2, and read them into a V2 struct right away. This way we don't have to go through the hassle of looking up the format-version in the Avro metadata, and then pick the right reader. For example, if the I'm okay merging this one, and then we can close the gap in a separate PR. WDYT? |
|
Thanks for working on this @Kurtiscwright, and thanks @liurenjie1024 for the review. I'll do another pass on #2004 later today to identify the missing parts 👍 |
|
Thank you both for the reviews. I started a slack thread to continue the discussion around a 1 Avro reader to rule them all task. I would like to pick it up, at minimum adding more unit tests to make sure it works. |
## Which issue does this PR close? - Closes apache#1576 ## What changes are included in this PR? This PR implements `Default` trait implementations for `DataContentType` and `ManifestContentType` to enable proper V1→V2 manifest projection as specified in the Apache Iceberg format specification. As part of solving ManifestList projection, I did a full audit of ManifestData, ManifestList, ManifestEntry and Snapshot files. All five file types are now covered for V1 to V2 projection. ## Changes Made - Added `Default` for `DataContentType`**: Returns `DataContentType::Data` (value 0) for V1 compatibility - Added `Default` for `ManifestContentType`**: Returns `ManifestContentType::Data` (value 0) for V1 compatibility ## Are these changes tested? - ManifestFile V1→V2 projection test**: Validates `ManifestFileV1::try_into()` properly sets defaults for missing V2 fields - ManifestEntry V1→V2 projection test**: Validates `ManifestEntryV1::try_into()` correctly applies sequence number defaults - DataFile V1→V2 projection test**: Validates `DataFileSerde::try_into()` handles V1 field defaults correctly - Snapshot V1→V2 projection tests**: Two tests covering both scenarios (with/without optional summary field)
Which issue does this PR close?
What changes are included in this PR?
This PR implements
Defaulttrait implementations forDataContentTypeandManifestContentTypeto enable proper V1→V2 manifest projection as specified in the Apache Iceberg format specification. As part of solving ManifestList projection, I did a full audit of ManifestData, ManifestList, ManifestEntry and Snapshot files. All five file types are now covered for V1 to V2 projection.Changes Made
DefaultforDataContentType**: ReturnsDataContentType::Data(value 0) for V1 compatibilityDefaultforManifestContentType**: ReturnsManifestContentType::Data(value 0) for V1 compatibilityAre these changes tested?
ManifestFileV1::try_into()properly sets defaults for missing V2 fieldsManifestEntryV1::try_into()correctly applies sequence number defaultsDataFileSerde::try_into()handles V1 field defaults correctly