Fix Parquet with special characters in field names.#601
Fix Parquet with special characters in field names.#601rdblue merged 8 commits intoapache:masterfrom
Conversation
| return copy; | ||
| } | ||
|
|
||
| public static String makeCompatibleName(String name) { |
There was a problem hiding this comment.
Nit: do you want to use this method in org.apache.iceberg.avro.TypeToSchema#struct() api?
Currently the code is
boolean isValidFieldName = AvroSchemaUtil.validAvroName(origFieldName);
String fieldName = isValidFieldName ? origFieldName : AvroSchemaUtil.sanitize(origFieldName);
There was a problem hiding this comment.
I considered it, but the isValidFieldName is reused to add the original name as a property, so I think it's fine as it is.
This also fixes Avro's special character handling.
|
@rdsr, please have another look. I added tests to iceberg-data and ended up needing to fix a couple of things:
I think the second fix also addresses the case introduced by #207 when the Avro names don't match because the shouldn't be projected. Next, we should be able to fix Avro reads by using a similar pattern to iceberg-data, but one that produces Avro generics. |
|
Thanks @rdblue . I'll have a look over the weekend |
| 0, | ||
| Comparators.charSequences().compare("test", (CharSequence) full.getField("data%0"))); | ||
|
|
||
| Record projected = writeAndRead("full_projection", schema, schema.select("data%0"), record); |
There was a problem hiding this comment.
nit: consider changing the desc.
| } | ||
|
|
||
| private static <T> T visitArray(Type type, Schema array, AvroSchemaWithTypeVisitor<T> visitor) { | ||
| if (array.getLogicalType() instanceof LogicalMap) { |
There was a problem hiding this comment.
Is it worth while to also check for AvroSchemaUtil.isKeyValueSchema(array.getElementType())?
There was a problem hiding this comment.
No, but I think it would be good to check whether the Iceberg type is a map.
| return visit(iSchema.asStruct(), schema, visitor); | ||
| } | ||
|
|
||
| public static <T> T visit(Type iType, Schema schema, AvroSchemaWithTypeVisitor<T> visitor) { |
There was a problem hiding this comment.
When could the expected type be null? I see that we are traversing NULL branch of the union, is it because of that? Also, I'm not clear on why do we need to visit the NULL branch of the union
There was a problem hiding this comment.
The Iceberg type might be null if the Avro type has no corresponding field. For example, if we drop a column from an Iceberg schema and read an older data file, that column will not be in the read schema, but will be in file schemas.
|
Merged. Thanks for reviewing, @Parth-Brahmbhatt and @rdsr! |
This uses Avro's name sanitization methods to ensure that field names are compatible with Parquet. The names stored in each file don't actually matter because Iceberg uses field IDs.