Skip to content

Add partition summaries in SnapshotSummary builder#1367

Merged
aokolnychyi merged 4 commits intoapache:masterfrom
rdblue:add-partition-level-summaries
Oct 6, 2020
Merged

Add partition summaries in SnapshotSummary builder#1367
aokolnychyi merged 4 commits intoapache:masterfrom
rdblue:add-partition-level-summaries

Conversation

@rdblue
Copy link
Contributor

@rdblue rdblue commented Aug 22, 2020

This adds partition-level summaries to snapshot summaries.

Summary counters are refactored into a private UpdateMetrics class that is used to track both partition and snapshot metrics. Metrics for changed partitions are added to snapshot summaries by merging the metrics into a string that can be parsed back into a map.

If manifests are appended in a snapshot, some of the summary information is not valid. This keeps track of when the summary metrics are valid and only adds valid metrics.

@rdblue
Copy link
Contributor Author

rdblue commented Aug 23, 2020

@aokolnychyi, you might be interested in this. I think we've talked in the past about adding partition-level summaries to snapshot metadata.

@aokolnychyi
Copy link
Contributor

Let me take a look at this now.

@aokolnychyi
Copy link
Contributor

Do we have an estimate on the impact this will have on the size of metadata files? Will it make sense to enable/configure this in table properties? For example, we could configure the max number of partitions there. Technically, one could issue a query against metadata tables to find out which partitions were affected by a specific snapshot. That's what we use internally to find what partitions to compact.

@aokolnychyi
Copy link
Contributor

Can we also add a test for new params?

@aokolnychyi
Copy link
Contributor

Also, what about disabling this by default to keep the existing behavior? I feel like it may add some pressure when we have thousands of snapshots in the metadata file.

@rdblue
Copy link
Contributor Author

rdblue commented Aug 25, 2020

What about disabling this by default to keep the existing behavior? I feel like it may add some pressure when we have thousands of snapshots in the metadata file.

I agree. I'll update this to have a threshold for partition summaries that defaults to 0.

@rdblue rdblue force-pushed the add-partition-level-summaries branch from 8700a38 to 8bb3c48 Compare October 2, 2020 20:49
case DELETES:
this.addedDeleteFiles += manifest.addedFilesCount();
this.removedDeleteFiles += manifest.deletedFilesCount();
this.trustSizeAndDeleteCounts = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this because we don't know how many records an equality delete removes?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, no. It is because we don't have row counts for the number of equality and positional deletes in delete manifests.

@@ -35,6 +36,8 @@ public class SnapshotSummary {
public static final String ADDED_RECORDS_PROP = "added-records";
public static final String DELETED_RECORDS_PROP = "deleted-records";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We still interpret this as the number of records in data files that were removed?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants