Python: Reassign schema/partition-spec/sort-order ids by Fokko · Pull Request #5627 · apache/iceberg

Fokko · 2022-08-24T12:13:54Z

When creating a new schema.

Also created a type alias called TableMetadata that replaces the Union[TableMetadataV1, TableMetadataV2] annotation.

Resolves #5468
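
For readers unfamiliar with the alias mentioned above, here is a minimal sketch of the idea. The two dataclasses are hypothetical stand-ins, not the real pyiceberg models, and `describe` is an illustrative helper that does not appear in the PR.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class TableMetadataV1:  # stand-in, not the real pyiceberg model
    format_version: int = 1


@dataclass
class TableMetadataV2:  # stand-in, not the real pyiceberg model
    format_version: int = 2


# One alias replaces the repeated Union[...] annotation in signatures.
TableMetadata = Union[TableMetadataV1, TableMetadataV2]


def describe(metadata: TableMetadata) -> str:
    """Hypothetical helper that accepts either metadata version."""
    return f"format version {metadata.format_version}"
```

Signatures that previously spelled out the full Union can then use the shorter name without changing what type checkers accept.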


python/pyiceberg/schema.py

python/pyiceberg/table/metadata.py

rdblue · 2022-08-24T18:50:02Z

python/pyiceberg/table/partitioning.py

+def assign_fresh_partition_spec_ids(spec: PartitionSpec, schema: Schema) -> PartitionSpec:
+    partition_fields = []
+    for pos, field in enumerate(spec.fields):
+        schema_field = schema.find_field(field.name)


This is the partition field name, not a schema field name. The schema field must be looked up by source_id. This method needs both the original schema and the fresh schema. The original schema is used to get field names and then the fresh schema is used to look up the new source ID.

Great catch! 👍🏻

@Fokko, looks like this hasn't been fixed yet, so I'm reopening the thread.

Sorry, that slipped through somehow
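
The two-step lookup requested in this thread could be sketched roughly as follows. This is not the code from the PR: `Column`, `PartitionField`, and `remap_partition_fields` are hypothetical stand-ins, and the starting partition field ID of 1000 is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Column:  # hypothetical stand-in for a schema field
    field_id: int
    name: str


@dataclass(frozen=True)
class PartitionField:  # hypothetical stand-in for a partition field
    source_id: int  # ID of the schema column the partition is derived from
    field_id: int   # ID of the partition field itself
    name: str       # partition field name (e.g. "ts_day"), not a column name


def remap_partition_fields(
    fields: List[PartitionField],
    old_columns: List[Column],
    fresh_columns: List[Column],
    first_partition_id: int = 1000,  # assumed starting ID for this sketch
) -> List[PartitionField]:
    # Step 1: the original schema maps source_id -> column name.
    old_name_by_id: Dict[int, str] = {c.field_id: c.name for c in old_columns}
    # Step 2: the fresh schema maps column name -> new column ID.
    fresh_id_by_name: Dict[str, int] = {c.name: c.field_id for c in fresh_columns}

    remapped = []
    for pos, field in enumerate(fields):
        column_name = old_name_by_id.get(field.source_id)
        if column_name is None:
            raise ValueError(f"Could not find source column: {field.source_id}")
        fresh_source_id = fresh_id_by_name.get(column_name)
        if fresh_source_id is None:
            raise ValueError(f"Could not find column in fresh schema: {column_name}")
        remapped.append(PartitionField(fresh_source_id, first_partition_id + pos, field.name))
    return remapped
```

Looking the column up by the partition field's own name, as the original hunk does, would fail for a field such as "ts_day", because no schema column carries that name.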

python/pyiceberg/table/sorting.py


python/pyiceberg/table/metadata.py

rdblue · 2022-08-28T22:04:11Z

python/pyiceberg/schema.py

        """Visit a PrimitiveType"""


+class PreOrderSchemaVisitor(Generic[T], ABC):


Is this pre-order? In Java we called it CustomOrder because you can choose when to visit children by accessing the callable.

It is a pre-order traversal, since we start at the root and then move to the leaves. In-order is a bit less intuitive since this is not a binary tree. You could also do a reverse in-order, but I'm not sure we need that. We can call it CustomOrder if you have a strong preference, but I think pre-order is the most logical way of using this visitor.
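
To make the naming question concrete, here is a small sketch of a visitor that receives its children as callables. `Node`, `visit`, `pre_order`, and `post_order` are hypothetical names used only for illustration; they are not the pyiceberg classes under review.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:  # hypothetical tree node
    name: str
    children: List["Node"] = field(default_factory=list)


Thunk = Callable[[], None]


def visit(node: Node, on_node: Callable[[Node, List[Thunk]], None]) -> None:
    """Pass the node plus one thunk per child; a child is visited only when its thunk is called."""
    thunks = [lambda child=child: visit(child, on_node) for child in node.children]
    on_node(node, thunks)


def pre_order(tree: Node) -> List[str]:
    seen: List[str] = []

    def on_node(node: Node, thunks: List[Thunk]) -> None:
        seen.append(node.name)  # record the node first
        for thunk in thunks:    # then descend into the children
            thunk()

    visit(tree, on_node)
    return seen


def post_order(tree: Node) -> List[str]:
    seen: List[str] = []

    def on_node(node: Node, thunks: List[Thunk]) -> None:
        for thunk in thunks:    # descend first
            thunk()
        seen.append(node.name)  # record the node afterwards

    visit(tree, on_node)
    return seen
```

The same `visit` driver yields either order depending on when the callback invokes the thunks, which is the property behind the CustomOrder name; using it root-first gives the pre-order behavior described in the reply.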

rdblue · 2022-08-28T22:08:33Z

python/pyiceberg/schema.py

+        return next(self.counter)
+
+    def schema(self, schema: Schema, struct_result: Callable[[], StructType]) -> Schema:
+        return Schema(*struct_result().fields, identifier_field_ids=schema.identifier_field_ids)


Shouldn't this re-map the identifier field IDs since it is returning a new schema?

Yes, we do that in the function itself:

def assign_fresh_schema_ids(schema: Schema) -> Schema:
    """Traverses the schema, and sets new IDs"""
    schema_struct = pre_order_visit(schema.as_struct(), _SetFreshIDs())

    fresh_identifier_field_ids = []
    new_schema = Schema(*schema_struct.fields)
    for field_id in schema.identifier_field_ids:
        original_field_name = schema.find_column_name(field_id)
        if original_field_name is None:
            raise ValueError(f"Could not find field: {field_id}")
        fresh_field = new_schema.find_field(original_field_name)
        if fresh_field is None:
            raise ValueError(f"Could not lookup field in new schema: {original_field_name}")
        fresh_identifier_field_ids.append(fresh_field.field_id)

    return new_schema.copy(update={"identifier_field_ids": fresh_identifier_field_ids})

This is because we first want to know all of the fresh IDs before remapping the identifier field IDs.

Ah, I've refactored this because we need to build a map anyway 👍🏻
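
The thread does not show the refactored code, so the following is only a guess at the shape of the idea: record an old-ID to new-ID map while assigning, then translate the identifier field IDs through it. `assign_fresh_ids` and its flat list of field IDs are assumptions made for this sketch.

```python
from itertools import count
from typing import Dict, List, Tuple


def assign_fresh_ids(
    field_ids: List[int], identifier_field_ids: List[int]
) -> Tuple[Dict[int, int], List[int]]:
    """Hypothetical helper: assign new IDs starting at 1 and remap identifier IDs."""
    counter = count(1)
    # Record old -> new while assigning, so no name-based lookup is needed later.
    old_to_new = {old_id: next(counter) for old_id in field_ids}

    fresh_identifier_ids = []
    for old_id in identifier_field_ids:
        if old_id not in old_to_new:
            raise ValueError(f"Could not find field: {old_id}")
        fresh_identifier_ids.append(old_to_new[old_id])
    return old_to_new, fresh_identifier_ids
```

With such a map available, the remapping no longer needs to round-trip through column names as the quoted function does.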

rdblue · 2022-08-28T22:11:03Z

python/pyiceberg/schema.py

+
+    def field(self, field: NestedField, field_result: Callable[[], IcebergType]) -> IcebergType:
+        return NestedField(
+            field_id=self._get_and_increment(), name=field.name, field_type=field_result(), required=field.required, doc=field.doc


This is going to visit children before visiting the next field. If you're trying to match the behavior of assignment in Java, you'd need to increment the counter for each field and then visit children.

Missed that one, thanks! Just updated the code and tests
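
The ordering difference raised here can be illustrated with a small sketch. `Field`, `ids_children_first`, and `ids_level_first` are hypothetical names, and the second function is only an approximation of the Java behavior as the review comment describes it.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Dict, List


@dataclass
class Field:  # hypothetical stand-in for a nested field
    name: str
    children: List["Field"] = field(default_factory=list)


def ids_children_first(fields: List[Field]) -> Dict[str, int]:
    """Take an ID for a field, then descend into it before moving to the next field."""
    counter = count(1)
    ids: Dict[str, int] = {}

    def walk(current: List[Field], prefix: str) -> None:
        for f in current:
            path = prefix + f.name
            ids[path] = next(counter)
            walk(f.children, path + ".")

    walk(fields, "")
    return ids


def ids_level_first(fields: List[Field]) -> Dict[str, int]:
    """Take IDs for every field of a struct first, then descend into each field."""
    counter = count(1)
    ids: Dict[str, int] = {}

    def walk(current: List[Field], prefix: str) -> None:
        for f in current:
            ids[prefix + f.name] = next(counter)
        for f in current:
            walk(f.children, prefix + f.name + ".")

    walk(fields, "")
    return ids
```

For a struct with a nested field `a` (children `x` and `y`) followed by `b`, the first function gives `b` the ID 4, while the second gives `b` the ID 2 and numbers the nested fields afterwards.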

python/pyiceberg/schema.py

rdblue · 2022-08-28T22:13:11Z

python/pyiceberg/table/metadata.py

+) -> TableMetadata:
+    fresh_schema = assign_fresh_schema_ids(schema)
+    fresh_partition_spec = assign_fresh_partition_spec_ids(partition_spec, fresh_schema)
+    fresh_sort_order = assign_fresh_sort_order_ids(sort_order, schema, fresh_schema)


Do the "fresh" methods always reset schema_id, spec_id, and order_id?

Only when you create TableMetadata out of them, that is, when creating a new table. And the ID is reset if it isn't 1.


Fokko

I'll move this forward. Let me know if there is anything that you would like to see changed. There are two follow-ups that I'd like to do:

Remove the pre-validators, because they are confusing and error-prone
Smooth out the API for the docs

Python: Reassign schema/partition-spec/sort-order ids

178659c
