feat: TableMetadata Statistic Files#799
Conversation
| #[serde(with = "_serde_set_statistics")] | ||
| SetStatistics { | ||
| /// File containing the statistics | ||
| statistics: StatisticsFile, | ||
| }, |
There was a problem hiding this comment.
In IRC and Java SetStatistics has an additional field snapshot_id (link).
This field is redundant with StatisticsFile.snapshot_id and is only used as an assertion in Java.
I removed the redundancy for rust and will start a discussion on the mailing List how to handle this.
As we still need to be spec compliant, we need custom serializer / deserializer.
There was a problem hiding this comment.
Thanks for raising this on the dev-list: https://lists.apache.org/thread/6fnrp73wokfqlt5vormpjyjmtvl29ll1
| /// even if the refs map is null. | ||
| pub(crate) refs: HashMap<String, SnapshotReference>, | ||
| /// Mapping of snapshot ids to statistics files. | ||
| pub(crate) statistics: HashMap<i64, StatisticsFile>, |
There was a problem hiding this comment.
Can we ever have more than one StatisticsFile for a snapshot_id?
In java this is modeled as a mapping snapshot_id : Vec<StatisticsFile>, however I couldn't find a way to get more than one element into the Vec for a snapshot_id other than deserializing.
https://github.com/apache/iceberg/blob/540d6a6251e31b232fe6ed2413680621454d107a/core/src/main/java/org/apache/iceberg/TableMetadata.java#L1310-L1320
fc26394 to
f1c166c
Compare
| pub fn set_statistics(mut self, statistics: StatisticsFile) -> Self { | ||
| self.metadata | ||
| .statistics | ||
| .insert(statistics.snapshot_id, statistics.clone()); |
There was a problem hiding this comment.
Do we want to check if a snapshot-id exists?
There was a problem hiding this comment.
I thought about this as well, but then followed java.
Currently statistics and snapshots are quite separate from each other. If we implement your check (which I like), I think we should eventually also implement:
- Upon deserialization discard statistics that belong to nonexistant snapshots
- When a snapshot is removed delete the statistics for it as well
This would result in snapshots for statistics not missing. It is unclear however what should happen to the puffin files in these cases. We would have coherent metadata, but probably also orphan files.
Do we know why the check is not there in Java?
Adds
StatisticFileandPartitionStatisticsFileto spec, builder and REST TableUpdate.By far most of the code are tests. I hope that the size of the PR is OK.