Add Snapshot logic and Summary generation by Fokko · Pull Request #61 · apache/iceberg-python

Fokko · 2023-10-12T21:26:03Z

This is a slimmed-down version of the Java implementation, but I'm not sure if we need everything on the Python side.

I left out a lot of the stuff on the Java side because:

The Java side is rather complex, and I'm not sure if we want to port it one-to-one.
I'm not sure if we're going to write delete files anytime soon, because I would expect people to pull in a dataframe, mangle it, and then write it back (overwrite operation).

HonahX

This is great, @Fokko! It is much simpler than the one in Java. I have some questions related to the design and concepts. Thanks in advance for your help!

HonahX · 2023-10-25T03:59:10Z

pyiceberg/table/snapshots.py

+    removed_pos_deletes: int
+    added_eq_deletes: int
+    removed_eq_deletes: int
+


In the Java implementation I saw a flag named trustSizeAndDeleteCounts, which is set to false when we add a DELETES manifest. As I understand it, the flag lets us skip reporting the size and delete-count metrics when one or more DELETES manifests are added, since we do not know the exact number of rows deleted in those manifests.

Do we want to add the flag in this PR or in the future?

ref: apache/iceberg#1367 (comment)

I was hoping to get some insights from @rdblue on this one. When we Operation.APPEND a table we can add existing DELETE manifests.

The flag is needed because we don't have the correct counts for the eq and pos deletes in delete manifests. I don't think that we need to add whole manifests in Python so I'd skip it.

Two things that I like to avoid: complexity and trust issues!
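To make the flag discussed above concrete, here is a minimal sketch of how such a guard could look in Python. The class `SummaryCollector`, its methods, and the field `trust_size_and_delete_counts` are hypothetical names chosen for illustration; this is neither the code in this PR nor the Java implementation, only the behavior as described in the comments above.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class SummaryCollector:
    """Hypothetical collector, for illustration only."""

    added_size: int = 0
    added_pos_deletes: int = 0
    added_eq_deletes: int = 0
    # Flips to False once a whole DELETES manifest is added, because the
    # row-level delete counts inside that manifest are not known.
    trust_size_and_delete_counts: bool = True

    def add_delete_file(self, size: int, pos_deletes: int = 0, eq_deletes: int = 0) -> None:
        # Adding a single file: its size and delete counts are known exactly.
        self.added_size += size
        self.added_pos_deletes += pos_deletes
        self.added_eq_deletes += eq_deletes

    def add_deletes_manifest(self) -> None:
        # Adding an existing manifest as a whole: counts can no longer be trusted.
        self.trust_size_and_delete_counts = False

    def build(self) -> Dict[str, str]:
        properties: Dict[str, str] = {}
        if not self.trust_size_and_delete_counts:
            # Skip the size and delete-count metrics rather than report wrong numbers.
            return properties
        if self.added_size > 0:
            properties['added-files-size'] = str(self.added_size)
        if self.added_pos_deletes > 0:
            properties['added-position-deletes'] = str(self.added_pos_deletes)
        if self.added_eq_deletes > 0:
            properties['added-equality-deletes'] = str(self.added_eq_deletes)
        return properties
```

If whole manifests are never added from Python, the flag never flips, which is the argument for leaving it out.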

rdblue · 2023-12-05T21:53:38Z

pyiceberg/table/snapshots.py

+ADDED_RECORDS = 'added-records'
+DELETED_DATA_FILES = 'deleted-data-files'
+DELETED_RECORDS = 'deleted-records'
+EQUALITY_DELETE_FILES = 'added-equality-delete-files'


Should this have the ADDED_ prefix like the others?

rdblue · 2023-12-05T21:55:33Z

pyiceberg/table/snapshots.py

+ADDED_POSITION_DELETE_FILES = f'{ADDED_POSITION_DELETES}-files'
+ADDED_RECORDS = 'added-records'
+DELETED_DATA_FILES = 'deleted-data-files'
+DELETED_RECORDS = 'deleted-records'


DELETED_ properties look correct.

rdblue · 2023-12-05T21:58:11Z

pyiceberg/table/snapshots.py

+ADDED_DELETE_FILES = 'added-delete-files'
+ADDED_EQUALITY_DELETES = 'added-equality-deletes'
+ADDED_FILE_SIZE = 'added-files-size'
+ADDED_POSITION_DELETES = 'added-position-deletes'


These first 5 look correct.

rdblue · 2023-12-06T00:02:28Z

pyiceberg/table/snapshots.py


+    def __getitem__(self, __key: str) -> Optional[Any]:  # type: ignore
+        """Return a key as it is a map."""
+        if __key == 'operation':


Should this be OPERATION?

It seems to be lower-case here:

I can make it case-insensitive

I meant that we have a constant defined and don't need to embed the string. Not a big deal.
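The suggestion above, sketched on a simplified stand-in class: compare against a module-level constant instead of embedding the string literal. The `Summary` class here is a hypothetical reduction for illustration, and the constant value `'operation'` is taken from the diff; the real class in the PR may be structured differently.

```python
from typing import Any, Dict, Optional

OPERATION = 'operation'


class Summary:
    """Simplified stand-in: an operation plus free-form string properties."""

    def __init__(self, operation: str, additional_properties: Optional[Dict[str, str]] = None) -> None:
        self.operation = operation
        self.additional_properties: Dict[str, str] = dict(additional_properties or {})

    def __getitem__(self, key: str) -> Optional[Any]:
        """Return a value by key, as if the summary were a map."""
        if key == OPERATION:  # use the constant, not the literal 'operation'
            return self.operation
        return self.additional_properties.get(key)
```

For example, `Summary('append', {'added-records': '10'})[OPERATION]` returns `'append'`, and an unknown key returns `None`.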

rdblue · 2023-12-06T00:11:06Z

pyiceberg/table/snapshots.py

+        properties: Dict[str, str] = {}
+        set_when_positive(properties, self.added_size, ADDED_FILE_SIZE)
+        set_when_positive(properties, self.removed_size, REMOVED_FILE_SIZE)
+        set_when_positive(properties, self.added_files, ADDED_DATA_FILES)


Minor: it would be better to use self.added_data_files and self.removed_data_files since those are the properties that we're tracking.

That's not minor, thanks!
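For readers following along, a minimal sketch of what a helper with the call shape shown in the diff could look like. The body of `set_when_positive` is inferred from its name and call sites, not copied from the PR. `ADDED_FILE_SIZE` matches the diff; the values of `REMOVED_FILE_SIZE` and `ADDED_DATA_FILES` and the wrapper `build_properties` are assumptions for illustration.

```python
from typing import Dict

ADDED_FILE_SIZE = 'added-files-size'      # value as shown in the diff
REMOVED_FILE_SIZE = 'removed-files-size'  # assumed value
ADDED_DATA_FILES = 'added-data-files'     # assumed value


def set_when_positive(properties: Dict[str, str], num: int, property_name: str) -> None:
    # Only record a metric when it is greater than zero, so that
    # empty counters do not clutter the summary.
    if num > 0:
        properties[property_name] = str(num)


def build_properties(added_size: int, removed_size: int, added_data_files: int) -> Dict[str, str]:
    # Hypothetical wrapper: each counter is written under the property
    # name that matches what it tracks.
    properties: Dict[str, str] = {}
    set_when_positive(properties, added_size, ADDED_FILE_SIZE)
    set_when_positive(properties, removed_size, REMOVED_FILE_SIZE)
    set_when_positive(properties, added_data_files, ADDED_DATA_FILES)
    return properties
```

Naming the counter after the property it feeds (`added_data_files` for `ADDED_DATA_FILES`) keeps the two from drifting apart, which is the point of the review comment.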

rdblue · 2023-12-06T00:21:11Z

pyiceberg/table/snapshots.py

+    return summary
+
+
+def _merge_snapshot_summaries(


Is this really a merge? To me, a merge is like adding two summaries together. This is actually updating from a previous snapshot summary.

Changed it to _update_snapshot_summaries
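The distinction between merging and updating can be sketched as follows: the new summary's totals are derived from the previous snapshot's totals plus this snapshot's deltas. The function name `update_summary_from_previous` and its signature are hypothetical, and only the record count is handled here; the property names `added-records` and `deleted-records` appear in the diff, while `total-records` is assumed.

```python
from typing import Dict, Optional

TOTAL_RECORDS = 'total-records'  # assumed property name
ADDED_RECORDS = 'added-records'
DELETED_RECORDS = 'deleted-records'


def update_summary_from_previous(
    summary: Dict[str, str], previous: Optional[Dict[str, str]] = None
) -> Dict[str, str]:
    # Start from the previous snapshot's running total (0 for the first snapshot).
    previous = previous or {}
    total = int(previous.get(TOTAL_RECORDS, '0'))
    # Apply this snapshot's deltas.
    total += int(summary.get(ADDED_RECORDS, '0'))
    total -= int(summary.get(DELETED_RECORDS, '0'))
    # Return a copy so the caller's summary is left untouched.
    updated = dict(summary)
    updated[TOTAL_RECORDS] = str(total)
    return updated
```

This is not symmetric in its two arguments, which is why "update" describes it better than "merge".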

rdblue

Looks very close, but a couple of property names are incorrect. Thanks, @Fokko!

rdblue · 2023-12-07T18:59:30Z

pyiceberg/table/snapshots.py

+ADDED_RECORDS = 'added-records'
+DELETED_DATA_FILES = 'deleted-data-files'
+DELETED_RECORDS = 'deleted-records'
+ADDED_EQUALITY_DELETE_FILES = 'added-equality-delete-files'


Odd that this is here rather than with the other ADDED_ properties.

rdblue · 2023-12-07T21:44:00Z

Thanks, @Fokko! Looks great.

Add Snapshot logic and Summary generation

50575a8

Fokko added this to the PyIceberg 0.6.0 release milestone Oct 13, 2023

Fokko added 3 commits October 13, 2023 23:33

Cleanup

580c824

Merge branch 'main' of github.com:apache/iceberg-python into fd-snapshots

760c0d4

Refactor it a bit

3dba41a

HonahX reviewed Oct 25, 2023

Fokko added 2 commits October 25, 2023 13:34

Merge branch 'main' of github.com:apache/iceberg-python into fd-snapshots

3309129

Comments

12c4699