Python: Reassign schema/partition-spec/sort-order ids #5627
Python: Reassign schema/partition-spec/sort-order ids #5627Fokko merged 7 commits intoapache:masterfrom
Conversation
When creating a new schema Resolves apache#5468
| def assign_fresh_partition_spec_ids(spec: PartitionSpec, schema: Schema) -> PartitionSpec: | ||
| partition_fields = [] | ||
| for pos, field in enumerate(spec.fields): | ||
| schema_field = schema.find_field(field.name) |
There was a problem hiding this comment.
This is the partition field name, not a schema field name. The schema field must be looked up by source_id. This method needs both the original schema and the fresh schema. The original schema is used to get field names and then the fresh schema is used to look up the new source ID.
There was a problem hiding this comment.
@Fokko, looks like this hasn't been fixed yet, so I'm reopening the thread.
There was a problem hiding this comment.
Sorry, that slipped through somehow
dc0dbc5 to
2df8cee
Compare
| """Visit a PrimitiveType""" | ||
|
|
||
|
|
||
| class PreOrderSchemaVisitor(Generic[T], ABC): |
There was a problem hiding this comment.
Is this pre-order? In Java we called it CustomOrder because you can choose when to visit children by accessing the callable.
There was a problem hiding this comment.
It is pre-order traversal since we start at the root and then move to the leaves. In order is a bit less intuitive since it is not a binary tree. You could also do a reverse in-order, but not sure if we need that. We can also call it CustomOrder if you have a strong preference, but I think pre-order is the most logical way of using this visitor.
python/pyiceberg/schema.py
Outdated
| return next(self.counter) | ||
|
|
||
| def schema(self, schema: Schema, struct_result: Callable[[], StructType]) -> Schema: | ||
| return Schema(*struct_result().fields, identifier_field_ids=schema.identifier_field_ids) |
There was a problem hiding this comment.
Shouldn't this re-map the identifier field IDs since it is returning a new schema?
There was a problem hiding this comment.
Yes, we do that in the function itself:
def assign_fresh_schema_ids(schema: Schema) -> Schema:
"""Traverses the schema, and sets new IDs"""
schema_struct = pre_order_visit(schema.as_struct(), _SetFreshIDs())
fresh_identifier_field_ids = []
new_schema = Schema(*schema_struct.fields)
for field_id in schema.identifier_field_ids:
original_field_name = schema.find_column_name(field_id)
if original_field_name is None:
raise ValueError(f"Could not find field: {field_id}")
fresh_field = new_schema.find_field(original_field_name)
if fresh_field is None:
raise ValueError(f"Could not lookup field in new schema: {original_field_name}")
fresh_identifier_field_ids.append(fresh_field.field_id)
return new_schema.copy(update={"identifier_field_ids": fresh_identifier_field_ids})This is because we first want to know all the IDs
There was a problem hiding this comment.
Ah, I've refactored this because we need to build a map anyway 👍🏻
python/pyiceberg/schema.py
Outdated
|
|
||
| def field(self, field: NestedField, field_result: Callable[[], IcebergType]) -> IcebergType: | ||
| return NestedField( | ||
| field_id=self._get_and_increment(), name=field.name, field_type=field_result(), required=field.required, doc=field.doc |
There was a problem hiding this comment.
This is going to visit children before visiting the next field. If you're trying to match the behavior of assignment in Java, you'd need to increment the counter for each field and then visit children.
There was a problem hiding this comment.
Missed that one, thanks! Just updated the code and tests
| ) -> TableMetadata: | ||
| fresh_schema = assign_fresh_schema_ids(schema) | ||
| fresh_partition_spec = assign_fresh_partition_spec_ids(partition_spec, fresh_schema) | ||
| fresh_sort_order = assign_fresh_sort_order_ids(sort_order, schema, fresh_schema) |
There was a problem hiding this comment.
Do the "fresh" methods always reset schema_id, spec_id, and order_id?
There was a problem hiding this comment.
Only when you create TableMetadata out of it (when creating a new table). And it resets if it isn't 1.
ad95028 to
bee1c81
Compare
Fokko
left a comment
There was a problem hiding this comment.
I'll move this forward. Let me know if there is anything that you would like to see changed. There are two followups that I'd like to do:
- Remove the pre-validators because they are confusing and error prone
- Smooth out the API for the docs
When creating a new schema.
Also created a type alias called
TableMetadatathat replaces theUnion[TableMetadataV1, TableMetadataV2]annotation.Resolves #5468