Skip to content

Update indexing to handle nested lists#627

Merged
rdblue merged 4 commits intoapache:masterfrom
rdblue:fix-schema-name-indexes
Nov 13, 2019
Merged

Update indexing to handle nested lists#627
rdblue merged 4 commits intoapache:masterfrom
rdblue:fix-schema-name-indexes

Conversation

@rdblue
Copy link
Contributor

@rdblue rdblue commented Nov 11, 2019

This updates how Schema fields are indexed by name.

Previously, the visit method maintained a list of parent field names in each visitor, and the IndexByName visitor used these fields to create qualified field names. But the visitor did not add "element", "key", and "value" parent names, which resulted in duplicate names when indexing, for example, a list of lists.

The purpose of omitting "element", "key", and "value" from parent field names was to avoid forcing users to handle unnamed structures. For example, leaving out "element" from names for points: struct<x double, y double> results in fields "points.x" and "points.y" (and the nested struct as "points.element") instead of "points.element.x" and "points.element.y".

This updates how element and value names are skipped, by only skipping the names if a map value or list element is a nested struct. That way, a list of lists will correctly add an "element" level when processing the outer list's element. This also updates indexing so that "key" is always used so that key and value fields will not conflict.

This also moves the parent field names into IndexByName because they are only used in that class, and changes it to be a CustomOrderSchemaVisitor.

@rdblue rdblue mentioned this pull request Nov 11, 2019
@rdblue
Copy link
Contributor Author

rdblue commented Nov 11, 2019

FYI @rdsr

@rdblue rdblue force-pushed the fix-schema-name-indexes branch from 63e4c14 to 76a773b Compare November 11, 2019 00:13
@rdsr
Copy link
Contributor

rdsr commented Nov 11, 2019

Thanks @rdblue I'll take a look

@chenjunjiedada
Copy link
Collaborator

Looks like it still not work for list of list.

@rdblue
Copy link
Contributor Author

rdblue commented Nov 11, 2019

@chenjunjiedada, what do you mean? This works to index the test case you posted.

@Override
public Map<String, Integer> field(Types.NestedField field, Map<String, Integer> fieldResult) {
public Map<String, Integer> field(Types.NestedField field, Supplier<Map<String, Integer>> fieldResult) {
withName(field.name(), fieldResult::get);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to add some explanation here and other places where withName is called?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's pretty clear from the definition of withName that this is updating the parent names before getting the child results.

addField(field.name(), field.fieldId());
}

if (map.valueType().isStructType()) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will end with asymmetrical names but it does solve the problem, really smart way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I tried adding both variants, but it fails when we create a BiMap from the index due to duplicate values (list.field and list.element.field with the same ID). We can fix this later by adding these as aliases, but we'd have to store it separately so I thought it would be better to start with the simple solution here.

public Map<String, Integer> map(Types.MapType map,
Supplier<Map<String, Integer>> keyResult,
Supplier<Map<String, Integer>> valueResult) {
withName("key", keyResult::get);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a document to explain this is an in-order way IIUC?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually post-order because children are visited before updating for this node. But for this use, order doesn't matter. I don't think it makes sense to state that a certain order is used when it can be done with other orders.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that the map function is like accessing a binary tree, you have left child keyResult and right child valueResult. It visits the left child firstly with withName, and the node itself, then the right child.

@Override
public Map<String, Integer> struct(Types.StructType struct, List<Map<String, Integer>> fieldResults) {
public Map<String, Integer> struct(Types.StructType struct, Iterable<Map<String, Integer>> fieldResults) {
// iterate through the fields to update the index for each one, use size to avoid errorprone failure
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may need some explanation here? what is error-prone failure?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Errorprone is a static analysis checker that we run.

@chenjunjiedada
Copy link
Collaborator

@rdblue, in commit ed3e3cb, the unit test in case of list of list case gets NPE, while it works with the latest commit. I 'm not sure what had been fixed since both of them is lazy initialization.

@rdblue
Copy link
Contributor Author

rdblue commented Nov 13, 2019

@chenjunjiedada, the problem in that commit was that not all of the visitor methods return a non-null map, so calling size on child results could fail. All we need to do is to iterate through the list to index the children, which is why we now use Lists.newArrayList. Since we also have to do something to consume the output, we call size on that list.

@chenjunjiedada
Copy link
Collaborator

LGTM, +1

@rdblue
Copy link
Contributor Author

rdblue commented Nov 13, 2019

Thanks, @chenjunjiedada! I'll merge this. Can you update #619 to use this validation instead?

@rdblue rdblue merged commit d9ffdef into apache:master Nov 13, 2019
rdblue added a commit to rdblue/iceberg that referenced this pull request Jan 20, 2020
fbertsch pushed a commit to fbertsch/iceberg that referenced this pull request Jan 19, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants