Add Snapshot logic and Summary generation#61
Conversation
| removed_pos_deletes: int | ||
| added_eq_deletes: int | ||
| removed_eq_deletes: int | ||
|
|
There was a problem hiding this comment.
In Java Implementation I saw a flag named trustSizeAndDeleteCounts, which is set to false when we add a DELETES manifest. Based on my understanding, the purpose of this flag is to let us skip reporting size and delete counts related metrics when we add one or more DELETES manifest since we do not know the exact number of rows deleted in the manifest.
Do we want to add the flag in this PR or in the future?
There was a problem hiding this comment.
I was hoping to get some insights from @rdblue on this one. When we Operation.APPEND a table we can add existing DELETE manifests.
There was a problem hiding this comment.
The flag is needed because we don't have the correct counts for the eq and pos deletes in delete manifests. I don't think that we need to add whole manifests in Python so I'd skip it.
There was a problem hiding this comment.
Two things that I like to avoid; complexity and trust issues!
pyiceberg/table/snapshots.py
Outdated
| ADDED_RECORDS = 'added-records' | ||
| DELETED_DATA_FILES = 'deleted-data-files' | ||
| DELETED_RECORDS = 'deleted-records' | ||
| EQUALITY_DELETE_FILES = 'added-equality-delete-files' |
There was a problem hiding this comment.
Should this have the ADDED_ prefix like the others?
| ADDED_POSITION_DELETE_FILES = f'{ADDED_POSITION_DELETES}-files' | ||
| ADDED_RECORDS = 'added-records' | ||
| DELETED_DATA_FILES = 'deleted-data-files' | ||
| DELETED_RECORDS = 'deleted-records' |
There was a problem hiding this comment.
DELETED_ properties look correct.
| ADDED_DELETE_FILES = 'added-delete-files' | ||
| ADDED_EQUALITY_DELETES = 'added-equality-deletes' | ||
| ADDED_FILE_SIZE = 'added-files-size' | ||
| ADDED_POSITION_DELETES = 'added-position-deletes' |
pyiceberg/table/snapshots.py
Outdated
|
|
||
| def __getitem__(self, __key: str) -> Optional[Any]: # type: ignore | ||
| """Return a key as it is a map.""" | ||
| if __key == 'operation': |
There was a problem hiding this comment.
I meant that we have a constant defined and don't need to embed the string. Not a big deal.
pyiceberg/table/snapshots.py
Outdated
| properties: Dict[str, str] = {} | ||
| set_when_positive(properties, self.added_size, ADDED_FILE_SIZE) | ||
| set_when_positive(properties, self.removed_size, REMOVED_FILE_SIZE) | ||
| set_when_positive(properties, self.added_files, ADDED_DATA_FILES) |
There was a problem hiding this comment.
Minor: it would be better to use self.added_data_files and self.removed_data_files since those are the properties that we're tracking.
There was a problem hiding this comment.
That's not a minor, thanks!
pyiceberg/table/snapshots.py
Outdated
| return summary | ||
|
|
||
|
|
||
| def _merge_snapshot_summaries( |
There was a problem hiding this comment.
Is this really a merge? To me, a merge is like adding two summaries together. This is actually updating from a previous snapshot summary.
There was a problem hiding this comment.
Changed it to _update_snapshot_summaries
| ADDED_RECORDS = 'added-records' | ||
| DELETED_DATA_FILES = 'deleted-data-files' | ||
| DELETED_RECORDS = 'deleted-records' | ||
| ADDED_EQUALITY_DELETE_FILES = 'added-equality-delete-files' |
There was a problem hiding this comment.
Odd that this is here rather than with the other ADDED_ properties.
|
Thanks, @Fokko! Looks great. |

This is a slimmed down version of Java, but I'm not sure if we need everything on the Python side.
I left out a lot of the stuff on the Java side because: