Update indexing to handle nested lists#627
Conversation
|
FYI @rdsr |
63e4c14 to
76a773b
Compare
|
Thanks @rdblue I'll take a look |
|
Looks like it still not work for list of list. |
|
@chenjunjiedada, what do you mean? This works to index the test case you posted. |
| @Override | ||
| public Map<String, Integer> field(Types.NestedField field, Map<String, Integer> fieldResult) { | ||
| public Map<String, Integer> field(Types.NestedField field, Supplier<Map<String, Integer>> fieldResult) { | ||
| withName(field.name(), fieldResult::get); |
There was a problem hiding this comment.
Do we need to add some explanation here and other places where withName is called?
There was a problem hiding this comment.
I think it's pretty clear from the definition of withName that this is updating the parent names before getting the child results.
| addField(field.name(), field.fieldId()); | ||
| } | ||
|
|
||
| if (map.valueType().isStructType()) { |
There was a problem hiding this comment.
This will end with asymmetrical names but it does solve the problem, really smart way.
There was a problem hiding this comment.
Yeah, I tried adding both variants, but it fails when we create a BiMap from the index due to duplicate values (list.field and list.element.field with the same ID). We can fix this later by adding these as aliases, but we'd have to store it separately so I thought it would be better to start with the simple solution here.
| public Map<String, Integer> map(Types.MapType map, | ||
| Supplier<Map<String, Integer>> keyResult, | ||
| Supplier<Map<String, Integer>> valueResult) { | ||
| withName("key", keyResult::get); |
There was a problem hiding this comment.
Can we add a document to explain this is an in-order way IIUC?
There was a problem hiding this comment.
This is actually post-order because children are visited before updating for this node. But for this use, order doesn't matter. I don't think it makes sense to state that a certain order is used when it can be done with other orders.
There was a problem hiding this comment.
My understanding is that the map function is like accessing a binary tree, you have left child keyResult and right child valueResult. It visits the left child firstly with withName, and the node itself, then the right child.
| @Override | ||
| public Map<String, Integer> struct(Types.StructType struct, List<Map<String, Integer>> fieldResults) { | ||
| public Map<String, Integer> struct(Types.StructType struct, Iterable<Map<String, Integer>> fieldResults) { | ||
| // iterate through the fields to update the index for each one, use size to avoid errorprone failure |
There was a problem hiding this comment.
We may need some explanation here? what is error-prone failure?
There was a problem hiding this comment.
Errorprone is a static analysis checker that we run.
|
@chenjunjiedada, the problem in that commit was that not all of the visitor methods return a non-null map, so calling |
|
LGTM, +1 |
|
Thanks, @chenjunjiedada! I'll merge this. Can you update #619 to use this validation instead? |
This updates how
Schemafields are indexed by name.Previously, the visit method maintained a list of parent field names in each visitor, and the
IndexByNamevisitor used these fields to create qualified field names. But the visitor did not add "element", "key", and "value" parent names, which resulted in duplicate names when indexing, for example, a list of lists.The purpose of omitting "element", "key", and "value" from parent field names was to avoid forcing users to handle unnamed structures. For example, leaving out "element" from names for
points: struct<x double, y double>results in fields "points.x" and "points.y" (and the nested struct as "points.element") instead of "points.element.x" and "points.element.y".This updates how element and value names are skipped, by only skipping the names if a map value or list element is a nested struct. That way, a list of lists will correctly add an "element" level when processing the outer list's element. This also updates indexing so that "key" is always used so that key and value fields will not conflict.
This also moves the parent field names into
IndexByNamebecause they are only used in that class, and changes it to be aCustomOrderSchemaVisitor.