Update indexing to handle nested lists by rdblue · Pull Request #627 · apache/iceberg

rdblue · 2019-11-11T00:11:48Z

This updates how Schema fields are indexed by name.

Previously, the visit method maintained a list of parent field names in each visitor, and the IndexByName visitor used these fields to create qualified field names. But the visitor did not add "element", "key", and "value" parent names, which resulted in duplicate names when indexing, for example, a list of lists.

The purpose of omitting "element", "key", and "value" from parent field names was to avoid forcing users to handle unnamed structures. For example, leaving out "element" from the names for a list of structs such as points: list<struct<x double, y double>> results in fields "points.x" and "points.y" (and the nested struct as "points.element") instead of "points.element.x" and "points.element.y".

This updates how element and value names are skipped, by only skipping the names if a map value or list element is a nested struct. That way, a list of lists will correctly add an "element" level when processing the outer list's element. This also updates indexing so that "key" is always used so that key and value fields will not conflict.

This also moves the parent field names into IndexByName because they are only used in that class, and changes it to be a CustomOrderSchemaVisitor.
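The naming rule described above can be sketched in plain Java. This is a minimal illustration, not the code from this PR: the `Kind`, `Field`, and `Type` classes, the `struct`/`list`/`map` helpers, and the `indexByName` method are hypothetical stand-ins for Iceberg's schema types and the IndexByName visitor. It assumes the element and value names are skipped only when the child is a struct, and that "key" is always kept.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class NameIndexSketch {
  // Hypothetical, simplified type model; these are NOT Iceberg's Types classes.
  enum Kind { PRIMITIVE, STRUCT, LIST, MAP }

  static final class Field {
    final int id;
    final String name;
    final Type type;

    Field(int id, String name, Type type) {
      this.id = id;
      this.name = name;
      this.type = type;
    }
  }

  static final class Type {
    final Kind kind;
    // Struct fields, or [element] for a list, or [key, value] for a map.
    final List<Field> fields;

    Type(Kind kind, Field... fields) {
      this.kind = kind;
      this.fields = Arrays.asList(fields);
    }
  }

  static final Type DOUBLE = new Type(Kind.PRIMITIVE);

  static Type struct(Field... fields) {
    return new Type(Kind.STRUCT, fields);
  }

  static Type list(int elementId, Type elementType) {
    return new Type(Kind.LIST, new Field(elementId, "element", elementType));
  }

  static Type map(int keyId, Type keyType, int valueId, Type valueType) {
    return new Type(Kind.MAP,
        new Field(keyId, "key", keyType), new Field(valueId, "value", valueType));
  }

  private final Deque<String> parents = new ArrayDeque<>();
  private final Map<String, Integer> index = new LinkedHashMap<>();

  static Map<String, Integer> indexByName(Type schema) {
    NameIndexSketch sketch = new NameIndexSketch();
    sketch.visit(schema);
    return sketch.index;
  }

  private void visit(Type type) {
    for (Field field : type.fields) {
      addField(field.name, field.id);
    }
    for (Field field : type.fields) {
      // Skip "element" and "value" only when the child is a struct; never skip "key".
      boolean skip = type.kind != Kind.STRUCT
          && !field.name.equals("key")
          && field.type.kind == Kind.STRUCT;
      if (!skip) {
        parents.addLast(field.name);
      }
      visit(field.type);
      if (!skip) {
        parents.removeLast();
      }
    }
  }

  private void addField(String name, int id) {
    parents.addLast(name);
    String fullName = String.join(".", parents);
    parents.removeLast();
    if (index.put(fullName, id) != null) {
      throw new IllegalStateException("Duplicate field name: " + fullName);
    }
  }

  // points: list<struct<x, y>>, matrix: list<list<double>>, locations: map<struct<a>, struct<a>>
  static Map<String, Integer> example() {
    return indexByName(struct(
        new Field(1, "points", list(2, struct(new Field(3, "x", DOUBLE), new Field(4, "y", DOUBLE)))),
        new Field(5, "matrix", list(6, list(7, DOUBLE))),
        new Field(8, "locations", map(
            9, struct(new Field(11, "a", DOUBLE)),
            10, struct(new Field(12, "a", DOUBLE))))));
  }
}
```

In this sketch, `example()` indexes the inner list of `matrix` as "matrix.element.element", so it no longer collides with "matrix.element", while the struct fields of `points` stay at "points.x" and "points.y". The map's key field is indexed as "locations.key.a" and its value field as "locations.a", so the two do not conflict.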

rdblue · 2019-11-11T00:13:04Z

FYI @rdsr

rdsr · 2019-11-11T05:05:50Z

Thanks @rdblue I'll take a look

chenjunjiedada · 2019-11-11T17:07:08Z

Looks like it still does not work for a list of lists.

rdblue · 2019-11-11T22:02:28Z

@chenjunjiedada, what do you mean? This works to index the test case you posted.

chenjunjiedada · 2019-11-12T15:52:16Z

api/src/main/java/org/apache/iceberg/types/IndexByName.java

  @Override
-  public Map<String, Integer> field(Types.NestedField field, Map<String, Integer> fieldResult) {
+  public Map<String, Integer> field(Types.NestedField field, Supplier<Map<String, Integer>> fieldResult) {
+    withName(field.name(), fieldResult::get);


Do we need to add some explanation here and other places where withName is called?

I think it's pretty clear from the definition of withName that this is updating the parent names before getting the child results.
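For readers without the full diff, here is a rough sketch of what a helper like withName could look like. The definition in the PR is not shown in this thread, so everything below is an assumption: the `WithNameSketch` class, the `fieldNames` deque, the `currentPath` method, and the use of `Supplier` as the parameter type are all hypothetical. The idea is only that the name is pushed before the child result is requested and popped afterwards.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Supplier;

public class WithNameSketch {
  // Names of the enclosing fields, outermost first (hypothetical field).
  private final Deque<String> fieldNames = new ArrayDeque<>();

  // Push the name, evaluate the lazy child result, then always pop the name.
  <T> T withName(String name, Supplier<T> childResult) {
    fieldNames.addLast(name);
    try {
      return childResult.get();
    } finally {
      fieldNames.removeLast();
    }
  }

  // The dotted path of the current parents, e.g. "outer.inner".
  String currentPath() {
    return String.join(".", fieldNames);
  }

  // Shows that the child sees the updated parent names while it runs.
  static String demo() {
    WithNameSketch sketch = new WithNameSketch();
    Supplier<String> inner = () -> sketch.withName("inner", sketch::currentPath);
    return sketch.withName("outer", inner);
  }
}
```

Because the child result is a lazy supplier, the child is only visited inside `withName`, after the parent name has been added; once the call returns, the path is restored.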

chenjunjiedada · 2019-11-12T16:21:05Z

api/src/main/java/org/apache/iceberg/types/IndexByName.java

      addField(field.name(), field.fieldId());
    }
+
+    if (map.valueType().isStructType()) {


This will end up with asymmetrical names, but it does solve the problem. Really smart approach.

Yeah, I tried adding both variants, but it fails when we create a BiMap from the index due to duplicate values (list.field and list.element.field with the same ID). We can fix this later by adding these as aliases, but we'd have to store it separately so I thought it would be better to start with the simple solution here.
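The duplicate-value failure can be reproduced without any library. The sketch below is an illustration only: `InvertIndexSketch` and `invert` are hypothetical names, and the plain `HashMap` inversion stands in for the BiMap mentioned above. It assumes both name variants map to the same field ID.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class InvertIndexSketch {
  // Invert name -> id into id -> name, rejecting ids that appear under two names.
  static Map<Integer, String> invert(Map<String, Integer> nameToId) {
    Map<Integer, String> idToName = new HashMap<>();
    for (Map.Entry<String, Integer> entry : nameToId.entrySet()) {
      String previous = idToName.put(entry.getValue(), entry.getKey());
      if (previous != null) {
        throw new IllegalArgumentException(
            "ID " + entry.getValue() + " has two names: " + previous + " and " + entry.getKey());
      }
    }
    return idToName;
  }

  // Returns true if indexing both name variants for one ID breaks the inversion.
  static boolean bothVariantsFail() {
    Map<String, Integer> index = new LinkedHashMap<>();
    index.put("list.field", 3);
    index.put("list.element.field", 3); // same ID under a second name
    try {
      invert(index);
      return false;
    } catch (IllegalArgumentException e) {
      return true;
    }
  }
}
```

A one-to-one mapping between names and IDs cannot hold both variants, which is why aliases would need to be stored separately from the main index.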

chenjunjiedada · 2019-11-12T16:23:07Z

api/src/main/java/org/apache/iceberg/types/IndexByName.java

+  public Map<String, Integer> map(Types.MapType map,
+                                  Supplier<Map<String, Integer>> keyResult,
+                                  Supplier<Map<String, Integer>> valueResult) {
+    withName("key", keyResult::get);


Can we add documentation to explain that this is an in-order traversal, IIUC?

This is actually post-order because children are visited before updating for this node. But for this use, order doesn't matter. I don't think it makes sense to state that a certain order is used when it can be done with other orders.

My understanding is that the map function is like traversing a binary tree: you have a left child, keyResult, and a right child, valueResult. It visits the left child first using withName, then the node itself, then the right child.

chenjunjiedada · 2019-11-12T16:25:00Z

api/src/main/java/org/apache/iceberg/types/IndexByName.java

  @Override
-  public Map<String, Integer> struct(Types.StructType struct, List<Map<String, Integer>> fieldResults) {
+  public Map<String, Integer> struct(Types.StructType struct, Iterable<Map<String, Integer>> fieldResults) {
+    // iterate through the fields to update the index for each one, use size to avoid errorprone failure


We may need some explanation here. What is an errorprone failure?

Errorprone is a static analysis checker that we run.

chenjunjiedada · 2019-11-13T00:00:13Z

@rdblue, in commit ed3e3cb, the unit test for the list-of-lists case gets an NPE, while it works with the latest commit. I'm not sure what was fixed, since both of them use lazy initialization.

rdblue · 2019-11-13T01:09:13Z

@chenjunjiedada, the problem in that commit was that not all of the visitor methods return a non-null map, so calling size on child results could fail. All we need to do is to iterate through the list to index the children, which is why we now use Lists.newArrayList. Since we also have to do something to consume the output, we call size on that list.
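A small sketch of the point being made here, under stated assumptions: `LazyResultsSketch`, `lazy`, and `consume` are hypothetical names, and a plain `ArrayList` copy stands in for the `Lists.newArrayList` call mentioned above. It assumes child results are produced lazily during iteration and that some of them may be null.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

public class LazyResultsSketch {
  // A lazy Iterable: each child result is computed only when the iterator reaches it.
  static <T> Iterable<T> lazy(List<Supplier<T>> suppliers) {
    return () -> {
      Iterator<Supplier<T>> it = suppliers.iterator();
      return new Iterator<T>() {
        @Override
        public boolean hasNext() {
          return it.hasNext();
        }

        @Override
        public T next() {
          return it.next().get();
        }
      };
    };
  }

  // Force every child to be visited by copying the results into a list.
  // The copy tolerates null results; only the list's own size is read.
  static <T> int consume(Iterable<T> results) {
    List<T> copy = new ArrayList<>();
    results.forEach(copy::add);
    return copy.size();
  }

  // Returns the names of the children that were visited, in order.
  static List<String> demo() {
    List<String> visited = new ArrayList<>();
    List<Supplier<Map<String, Integer>>> children = new ArrayList<>();
    children.add(() -> {
      visited.add("a");
      return null; // a visitor method that returns no map
    });
    children.add(() -> {
      visited.add("b");
      return null;
    });

    Iterable<Map<String, Integer>> results = lazy(children);
    if (!visited.isEmpty()) {
      throw new IllegalStateException("children should not be visited before iteration");
    }
    consume(results); // calling size() on each child result instead would throw an NPE
    return visited;
  }
}
```

Calling `size()` on the copied list uses the output without dereferencing any individual child result, so null child results are harmless.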

chenjunjiedada · 2019-11-13T02:54:58Z

LGTM, +1

rdblue · 2019-11-13T03:41:06Z

Thanks, @chenjunjiedada! I'll merge this. Can you update #619 to use this validation instead?

Update indexing to handle nested lists.

76a773b

rdblue mentioned this pull request Nov 11, 2019

Add schema validator #619

Merged

rdblue force-pushed the fix-schema-name-indexes branch from 63e4c14 to 76a773b November 11, 2019 00:13

rdblue added 2 commits November 10, 2019 16:21

Fix checkstyle problems.

5470c8c

Avoid Errorprone problem.

ed3e3cb

Fix another errorprone false positive.

da80d82

chenjunjiedada reviewed Nov 12, 2019

rdblue merged commit d9ffdef into apache:master Nov 13, 2019

rdblue added a commit to rdblue/iceberg that referenced this pull request Jan 20, 2020

Update indexing to handle nested lists (apache#627)

12f7440

fbertsch pushed a commit to fbertsch/iceberg that referenced this pull request Jan 19, 2026

NETFLIX-BUILD: Add CloneTable spark procedure (apache#627)

bc5e7a4

3 participants
