Add partition summaries in SnapshotSummary builder #1367
aokolnychyi merged 4 commits into apache:master
@aokolnychyi, you might be interested in this. I think we've talked in the past about adding partition-level summaries to snapshot metadata.
Let me take a look at this now.
Do we have an estimate on the impact this will have on the size of metadata files? Will it make sense to enable/configure this in table properties? For example, we could configure the max number of partitions there. Technically, one could issue a query against metadata tables to find out which partitions were affected by a specific snapshot. That's what we use internally to find what partitions to compact.
Can we also add a test for new params?
Also, what about disabling this by default to keep the existing behavior? I feel like it may add some pressure when we have thousands of snapshots in the metadata file. |
I agree. I'll update this to have a threshold for partition summaries that defaults to 0. |
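For illustration, the threshold idea could look roughly like the sketch below. This is not the code in this PR: the class name, the method, the `maxChangedPartitions` parameter, and the `partitions.` key prefix are all hypothetical. It only shows that a limit of 0 keeps the existing behavior, because no per-partition entries are ever written.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: names are illustrative and not taken from Iceberg.
final class PartitionSummaryLimitSketch {
  private PartitionSummaryLimitSketch() {}

  /**
   * Builds a snapshot summary, adding per-partition entries only when the number of
   * changed partitions is within maxChangedPartitions. A limit of 0 disables them.
   */
  static Map<String, String> buildSummary(
      Map<String, String> snapshotMetrics,
      Map<String, String> partitionSummaries,
      int maxChangedPartitions) {
    Map<String, String> summary = new LinkedHashMap<>(snapshotMetrics);
    int changed = partitionSummaries.size();
    if (changed > 0 && changed <= maxChangedPartitions) {
      // Assumed key layout: one summary entry per changed partition.
      partitionSummaries.forEach(
          (partition, metrics) -> summary.put("partitions." + partition, metrics));
    }
    return summary;
  }
}
```

With a limit of 0 the result contains only the snapshot-level metrics, so metadata files do not grow for tables that never opt in.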
Force-pushed from 8700a38 to 8bb3c48.
    case DELETES:
      this.addedDeleteFiles += manifest.addedFilesCount();
      this.removedDeleteFiles += manifest.deletedFilesCount();
      this.trustSizeAndDeleteCounts = false;
Is this because we don't know how many records an equality delete removes?
Actually, no. It is because we don't have row counts for the number of equality and positional deletes in delete manifests.
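A minimal sketch of that reasoning, under assumptions: when a whole delete manifest is added, only file counts are available, so any row-level delete counter can no longer be trusted and is left out of the summary. The class, the `addedDeleteFile` method, and the `added-delete-rows` key are hypothetical names used only for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only; names and summary keys are assumptions, not Iceberg's code.
final class ManifestMetricsSketch {
  enum Content { DATA, DELETES }

  private int addedDataFiles = 0;
  private int removedDataFiles = 0;
  private int addedDeleteFiles = 0;
  private int removedDeleteFiles = 0;
  private long addedDeleteRows = 0L;
  private boolean trustDeleteRowCounts = true;

  // A single delete file carries its own row count, so the counter stays accurate.
  void addedDeleteFile(long rowCount) {
    addedDeleteFiles += 1;
    addedDeleteRows += rowCount;
  }

  // A whole manifest only exposes file counts; delete row counts become unknown.
  void addedManifest(Content content, int addedFiles, int deletedFiles) {
    switch (content) {
      case DATA:
        addedDataFiles += addedFiles;
        removedDataFiles += deletedFiles;
        break;
      case DELETES:
        addedDeleteFiles += addedFiles;
        removedDeleteFiles += deletedFiles;
        trustDeleteRowCounts = false;
        break;
    }
  }

  // Emits file counts always, and the row-level counter only while it is trusted.
  Map<String, String> summary() {
    Map<String, String> out = new LinkedHashMap<>();
    out.put("added-data-files", String.valueOf(addedDataFiles));
    out.put("removed-data-files", String.valueOf(removedDataFiles));
    out.put("added-delete-files", String.valueOf(addedDeleteFiles));
    out.put("removed-delete-files", String.valueOf(removedDeleteFiles));
    if (trustDeleteRowCounts) {
      out.put("added-delete-rows", String.valueOf(addedDeleteRows));
    }
    return out;
  }
}
```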
    @@ -35,6 +36,8 @@ public class SnapshotSummary {
      public static final String ADDED_RECORDS_PROP = "added-records";
      public static final String DELETED_RECORDS_PROP = "deleted-records";
We still interpret this as the number of records in data files that were removed?
This adds partition-level summaries to snapshot summaries.
Summary counters are refactored into a private UpdateMetrics class that is used to track both partition and snapshot metrics. Metrics for changed partitions are added to snapshot summaries by merging the metrics into a string that can be parsed back into a map.
If manifests are appended in a snapshot, some of the summary information is not valid. This keeps track of when the summary metrics are valid and only adds valid metrics.
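As a rough illustration of "a string that can be parsed back into a map", the sketch below joins metric entries as `key=value` pairs separated by commas and splits them again. The class name, the method names, and the exact separators are assumptions for this example and may differ from what the PR does; it also assumes keys and values never contain `,` or `=`.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical codec; separators and names are assumptions for illustration.
final class PartitionMetricsCodecSketch {
  private PartitionMetricsCodecSketch() {}

  // Joins entries as key=value pairs separated by commas, in map iteration order.
  static String encode(Map<String, String> metrics) {
    StringJoiner joiner = new StringJoiner(",");
    metrics.forEach((key, value) -> joiner.add(key + "=" + value));
    return joiner.toString();
  }

  // Splits the encoded string back into an insertion-ordered map.
  static Map<String, String> decode(String encoded) {
    Map<String, String> metrics = new LinkedHashMap<>();
    if (encoded.isEmpty()) {
      return metrics;
    }
    for (String pair : encoded.split(",")) {
      int idx = pair.indexOf('=');
      if (idx < 0) {
        throw new IllegalArgumentException("Malformed metric entry: " + pair);
      }
      metrics.put(pair.substring(0, idx), pair.substring(idx + 1));
    }
    return metrics;
  }
}
```

For example, a map with `added-data-files` set to `1` and `added-records` set to `10` (inserted in that order into a `LinkedHashMap`) encodes to `added-data-files=1,added-records=10`, and decoding that string returns an equal map.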