Python: Implement Hive create and load table #5447
Conversation
Adds the missing parts of creating and loading Hive tables, including reading/writing the table metadata. Tested against a local metastore (with map/list/struct types).
python/pyiceberg/catalog/hive.py (outdated)

```python
    return f"struct<{', '.join(field_results)}>"


def field(self, field: NestedField, field_result: str) -> str:
    return f"{field.name}: {field_result}"
```
I'm not sure that there can be a space here. The Java converter doesn't have a space and I seem to remember Hive expecting a certain format in some cases.
Ahh, great catch! It was writing fine, but when reading the table it would throw an error:
```
Error: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Error: type expected at the position 83 of 'string:int:boolean:array<string>:map<string,map<string,int>>:array<struct<latitude: float,longitude: float>>:struct<name: string,age: int>' but ' float' is found. (state=08S01,code=1)
```
Blame it on validate on read.
After removing all the spaces, it works fine:
+-------------------------------+----------------------------------------------------+----------------------------------------------------+
| col_name | data_type | comment |
+-------------------------------+----------------------------------------------------+----------------------------------------------------+
| # col_name | data_type | comment |
| foo | string | |
| bar | int | |
| baz | boolean | |
| qux | array<string> | |
| quux | map<string,map<string,int>> | |
| location | array<struct<latitude:float,longitude:float>> | |
| person | struct<name:string,age:int> | |
| | NULL | NULL |
| # Detailed Table Information | NULL | NULL |
| Database: | nyc | NULL |
| OwnerType: | USER | NULL |
| Owner: | fokkodriesprong | NULL |
| CreateTime: | Mon Aug 08 11:58:50 UTC 2022 | NULL |
| LastAccessTime: | Mon Aug 08 11:58:50 UTC 2022 | NULL |
| Retention: | 0 | NULL |
| Location: | file:/Users/fokkodriesprong/Desktop/docker-spark-iceberg/wh/nyc.db/complex | NULL |
| Table Type: | EXTERNAL_TABLE | NULL |
| Table Parameters: | NULL | NULL |
| | EXTERNAL | TRUE |
| | metadata_location | file:/Users/fokkodriesprong/Desktop/docker-spark-iceberg/wh/nyc.db/complex/metadata/3400b5e4-d454-4788-b269-18ea6a4beac0.metadata.json |
| | numFiles | 0 |
| | table_type | ICEBERG |
| | totalSize | 0 |
| | transient_lastDdlTime | 1659959930 |
| | NULL | NULL |
| # Storage Information | NULL | NULL |
| SerDe Library: | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe | NULL |
| InputFormat: | org.apache.hadoop.mapred.FileInputFormat | NULL |
| OutputFormat: | org.apache.hadoop.mapred.FileOutputFormat | NULL |
| Compressed: | No | NULL |
| Num Buckets: | 0 | NULL |
| Bucket Columns: | [] | NULL |
| Sort Columns: | [] | NULL |
+-------------------------------+----------------------------------------------------+----------------------------------------------------+
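The no-spaces rule discussed above can be sketched as follows. This is an illustrative reconstruction, not pyiceberg's actual visitor API: the `Field` dataclass and `struct_to_hive` name are assumptions, but the output format matches the working table output shown.

```python
# Sketch of the fix: join field name and type with a bare colon and no spaces,
# e.g. "latitude:float", because Hive's type-string parser rejects " float".
from dataclasses import dataclass
from typing import List


@dataclass
class Field:
    name: str
    hive_type: str  # already-converted Hive type string


def struct_to_hive(fields: List[Field]) -> str:
    # No space after the colon or after the comma separator
    return f"struct<{','.join(f'{f.name}:{f.hive_type}' for f in fields)}>"


print(struct_to_hive([Field("latitude", "float"), Field("longitude", "float")]))
# struct<latitude:float,longitude:float>
```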
python/pyiceberg/catalog/hive.py (outdated)

```python
if database_location := database_properties.get(LOCATION):
    database_location = database_location.rstrip("/")
    return f"{database_location}/{table_name}/"
raise ValueError("Cannot determine location from warehouse, please provide an explicit location")
```
This should respect the warehouse property used to construct the catalog, rather than failing with a ValueError.
Nice, added it, including a test
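The fallback being discussed can be sketched roughly like this. The function name, the property keys, and the `{database}.db/{table}` layout are assumptions for illustration (the layout matches the warehouse path visible in the table output above), not the exact pyiceberg code.

```python
# Sketch: prefer an explicit database location, then fall back to the
# catalog's "warehouse" property, and only raise when neither is set.
from typing import Dict

LOCATION = "location"
WAREHOUSE = "warehouse"


def resolve_default_location(
    catalog_properties: Dict[str, str],
    database_properties: Dict[str, str],
    database_name: str,
    table_name: str,
) -> str:
    # 1. Explicit location set on the database wins
    if database_location := database_properties.get(LOCATION):
        return f"{database_location.rstrip('/')}/{table_name}"
    # 2. Otherwise derive a path from the catalog's warehouse property
    if warehouse := catalog_properties.get(WAREHOUSE):
        return f"{warehouse.rstrip('/')}/{database_name}.db/{table_name}"
    # 3. Only now do we give up (hypothetical error message)
    raise ValueError("Cannot determine location, please provide an explicit location")


print(resolve_default_location({WAREHOUSE: "s3://bucket/wh"}, {}, "nyc", "taxis"))
# s3://bucket/wh/nyc.db/taxis
```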
```python
metadata_location = f"{location}metadata/{uuid.uuid4()}.metadata.json"
metadata = TableMetadataV2(
    location=location,
    schemas=[schema],
```
The schema field IDs and partition field IDs need to be reassigned to ensure that they are consistent because we don't trust that users pass them in correctly. Passing them directly is okay for this PR, but we should not release until we have reassignment done.
Good one, we do have validators on them when we initialize the Pydantic models, but there is no re-assignment being done. Do you want to set them to zero? We could also set the schema_id by default to zero?
Here's what we do in Java: https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/TableMetadata.java#L88-L133
When the table metadata builder adds a schema, spec, or order, the builder will check whether it is an existing order and find out the ID to assign. We should do something similar here.
This probably isn't something to do in this PR (since this is focused on Hive) but introducing the builder would probably be a good idea. That's how we also accumulate the change set that we pass to the REST catalog when committing table changes. I'd say let's move forward with what you have here (except passing the IDs rather than hard-coding to 0) and we can add reassignment later.
Thanks for pointing out the Java code. I like it and agree that it is best to do this in a separate PR.
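The reassignment deferred to a later PR could look roughly like this: walk the schema depth-first and hand out fresh, sequential field IDs instead of trusting user-supplied ones. The `SimpleField` type and `reassign_ids` helper are purely illustrative stand-ins; the real implementation would use pyiceberg's `NestedField` and visitor machinery, mirroring the Java `TableMetadata` builder linked above.

```python
# Sketch: never trust incoming field IDs; assign fresh sequential IDs
# to every field, including fields nested inside structs.
import itertools
from dataclasses import dataclass, field
from typing import List


@dataclass
class SimpleField:
    name: str
    field_id: int
    children: List["SimpleField"] = field(default_factory=list)


def reassign_ids(fields: List[SimpleField]) -> List[SimpleField]:
    counter = itertools.count(1)

    def visit(f: SimpleField) -> SimpleField:
        # Assign the parent a fresh ID, then recurse into nested fields
        return SimpleField(f.name, next(counter), [visit(c) for c in f.children])

    return [visit(f) for f in fields]


# User passed in arbitrary (untrusted) IDs: 99, 7, 42
schema = [SimpleField("id", 99), SimpleField("person", 7, [SimpleField("name", 42)])]
print([(f.name, f.field_id) for f in reassign_ids(schema)])
```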
python/pyiceberg/catalog/hive.py (outdated)

```python
metadata = FromInputFile.table_metadata(file)
return Table(identifier=(table.dbName, table.tableName), metadata=metadata, metadata_location=metadata_location)


def _write_metadata(self, metadata: TableMetadataV2, io: FileIO, metadata_path: str):
```
Why does this accept metadata v2 instead of either v1 or v2? I don't think that we want to assume that we will only write v2 metadata. We can't upgrade tables automatically.
I've updated the signature. We don't have any logic for updating a table yet; this is only used for creating new tables, and I think it makes sense to only allow V2 tables (or at least push the user in that direction).
python/pyiceberg/catalog/hive.py (outdated)

```python
    if partition_spec.fields
    else DEFAULT_LAST_PARTITION_ID,
)
io = load_file_io({**self.properties})
```
If we're always using self.properties then I think it makes sense to have an io instance for the catalog. That can be used here rather than creating a new one per table create.
I actually think that we should mix in the properties from the table as well. That way you can override them at the table level.
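The merge being proposed is a simple layered dict, with table properties winning over catalog properties. The property keys used in the example are hypothetical; `load_file_io` here is a stand-in for pyiceberg's loader, which takes the merged dict.

```python
# Sketch: layer table properties over catalog properties before loading the
# FileIO, so a table can override e.g. an endpoint configured on the catalog.
from typing import Dict


def io_properties(catalog_properties: Dict[str, str], table_properties: Dict[str, str]) -> Dict[str, str]:
    # Later keys win: table-level settings override catalog-level ones
    return {**catalog_properties, **table_properties}


merged = io_properties(
    {"s3.endpoint": "https://catalog-default", "s3.region": "us-east-1"},
    {"s3.endpoint": "https://table-override"},
)
print(merged)
# {'s3.endpoint': 'https://table-override', 's3.region': 'us-east-1'}
```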
python/pyiceberg/catalog/hive.py (outdated)

```python
        raise TableAlreadyExistsError(f"Table {database_name}.{table_name} already exists") from e
    return self._convert_hive_into_iceberg(hive_table)
```

```python
    return self._convert_hive_into_iceberg(hive_table, io, metadata_location)
```
Why pass metadata_location here? Since it is set on the table, I think it would make sense to always pull it from table parameters. That way we never create a situation where we've forgotten to set the table parameter, but successfully returned a table instance. Or one where we've forgotten to update it and returned an updated table.
I like that a lot. I think that the metadata_location was still there for historical reasons. Thanks!
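Pulling the location from the table parameters, as suggested, could look like this. The `metadata_location` parameter key matches the table output earlier in the thread; the plain dict stands in for the Thrift `Table.parameters` field, and the helper name and error message are assumptions.

```python
# Sketch: always read metadata_location from the Hive table parameters, so we
# can never return a Table instance whose parameter was forgotten or stale.
from typing import Dict

METADATA_LOCATION = "metadata_location"


def metadata_location_from(parameters: Dict[str, str]) -> str:
    if location := parameters.get(METADATA_LOCATION):
        return location
    # Hypothetical error: the table is not a valid Iceberg table
    raise ValueError(f"Table is missing the {METADATA_LOCATION} parameter")


params = {"table_type": "ICEBERG", METADATA_LOCATION: "s3://wh/nyc.db/taxis/metadata/00000.metadata.json"}
print(metadata_location_from(params))
```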
```python
catalog._client.__enter__().get_table.assert_called_with(dbname="default", tbl_name="table")
table = catalog.load_table(("default", "new_tabl2e"))


catalog._client.__enter__().get_table.assert_called_with(dbname="default", tbl_name="new_tabl2e")
```
Did you intend to have 2 in the table name?
@Fokko, looks like this is ready except that it conflicts with the order-by PR that was just merged. I tried to fix it, but I didn't get it right.

Awesome @rdblue, just resolved the pre-commit error 👍🏻

Thanks, @Fokko! Great to have this working.