Skip to content

Add schema validator#619

Merged
rdblue merged 5 commits intoapache:masterfrom
chenjunjiedada:schema
Nov 20, 2019
Merged

Add schema validator#619
rdblue merged 5 commits intoapache:masterfrom
chenjunjiedada:schema

Conversation

@chenjunjiedada
Copy link
Collaborator

This fixes #598.


if (invalidFields.size() > 0) {
throw new ValidationException("Invalid Schema, please check fields: " +
Joiner.on(",").join(invalidFields.iterator()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: missing a space in the separator after ,.

@rdblue
Copy link
Contributor

rdblue commented Nov 9, 2019

I think this is actually easier to implement than the version here. Instead of running a new validator class, I think you can add this validation to the existing IndexByName class. Just change addField to include the validation:

  private void addField(String name, int fieldId) {
    String fullName = name;
    if (!fieldNames().isEmpty()) {
      fullName = DOT.join(DOT.join(fieldNames().descendingIterator()), name);
    }

    Integer existingFieldId = nameToId.put(fullName, fieldId);
    if (existingFieldId != null) {
      throw new ValidationException(
          "Invalid schema: multiple fields for name %s: %s and %s", fullName, existingFieldId, fieldId);
    }
  }

Then all you'd need to do is to access lazyNameToId() in the constructor.

@chenjunjiedada
Copy link
Collaborator Author

@rdblue, Good idea. Let me update it

@chenjunjiedada
Copy link
Collaborator Author

chenjunjiedada commented Nov 9, 2019

@rdblue, sorry I forgot that this visitor is different from IndexByName visitor since it directly returns null in list and map function. That is because we don't push filedNames in case of List and Map when visiting the schema, so the testMixedTypes test fails if we apply validation in IndexByName.

For example, when we define a mixed typed schema such as:

    Schema schema = new Schema(
        required(1, "list_of_lists",
            ListType.ofOptional(2, ListType.ofOptional(3, StructType.of(
                required(4, "id", LongType.get()))
            ))));

It throws exception like:
Invalid schema: multiple fields for name list_of_lists.element: 3 and 2 org.apache.iceberg.exceptions.ValidationException: Invalid schema: multiple fields for name list_of_lists.element: 3 and 2

The map returned by TypeUtil.indexByName actually cannot express the concept of list of list.

@rdblue
Copy link
Contributor

rdblue commented Nov 10, 2019

@chenjunjiedada, ignoring the problem when the name is "element" seems to solve that case:

    Integer existingFieldId = nameToId.put(fullName, fieldId);
    if (existingFieldId != null && !"element".equals(name)) {
      throw new ValidationException(
          "Invalid schema: multiple fields for name %s: %s and %s", fullName, existingFieldId, fieldId);
    }

I think that's a reasonable fix for now. I think we should also go back and fix how these indexes are produced. The field names are only used in IndexByName, so I think we can fix this.

@rdblue
Copy link
Contributor

rdblue commented Nov 11, 2019

@chenjunjiedada, I submitted a fix for the indexing problem with the ValidationException in #627. Can you take a look?

@chenjunjiedada
Copy link
Collaborator Author

Sure, will take a look.

this.aliasToId = aliases != null ? ImmutableBiMap.copyOf(aliases) : null;

// validate the schema through IndexByName visitor
TypeUtil.indexByName(struct);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As long as indexByName is being called, we may as well capture the result and use it for the name to ID map. Can you update these constructors to call lazyNameToId instead?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since lazyNameToId has one more check for uniqueness for value, some unit tests failed due to they don't have unique field ID such as testMixedTypes. And I think the lazy initialization is not lazy if we put it in a constructor since NameToId is always null when constructing. So can we just call IndexByName in constructor? It would be an easy way.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we fix those tests by reassigning new IDs to types? I think that would be a better solution.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean update scheme definition in unit tests by reassigning new IDs? For example, there is a constant field SUPPORTED_PRIMITIVES in testMixedTypes that is used by other nested fields, do we need to remove the constant and use raw definition with new IDs? Or do you mean to use reassignIds from TypeUtils?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I mean to use TypeUtil.assignFreshIds.

)))
);

Schema schema = new Schema(TypeUtil.assignFreshIds(structType, this::assignNewId).asStructType().fields());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should be able to use (new AtomicInteger(0))::incrementAndGet instead. There is no need to create global state in the test cases.

@chenjunjiedada
Copy link
Collaborator Author

@rdblue , ready for merge?

@rdblue rdblue merged commit 8b41ee5 into apache:master Nov 20, 2019
rdblue pushed a commit to rdblue/iceberg that referenced this pull request Jul 21, 2020
sunchao pushed a commit to sunchao/iceberg that referenced this pull request May 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support recognizing the column represented with an array of fields

2 participants