API, Core: Add geospatial predicates and bounding box literals#14101
API, Core: Add geospatial predicates and bounding box literals#14101Kontinuation wants to merge 13 commits intoapache:mainfrom
Conversation
|
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions. |
|
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time. |
560a29d to
fa54b84
Compare
|
@stevenzwu @szehon-ho @huaxingao @wgtmac @hsiang-c Would you please take a look at this PR? |
| return new Literals.DecimalLiteral(value); | ||
| } | ||
|
|
||
| static Literal<ByteBuffer> of(BoundingBox value) { |
There was a problem hiding this comment.
curious if BoundingBox is the only geo literal we are going to have.
There was a problem hiding this comment.
BoundingBox is the only geo literal we plan to add in the series of PRs. I think the current design is open to adding more types of geospatial literals.
There was a problem hiding this comment.
This is probably the first case in Iceberg where the literal value type (BoundingBox) is different with the field value (ByteBuffer). I have been wondering if the return type should be Literal<BoundingBox>.
There was a problem hiding this comment.
BoundingBoxLiteral implements Literal<ByteBuffer> to have compatible type with the Java types of iceberg geospatial types, so the return type was defined as Literal<ByteBuffer>.
There was a problem hiding this comment.
I don't think the Java type needs to match. I think we just need to handle this in UnboundPredicate.bind.
There was a problem hiding this comment.
Do you suggest that we hold Literal<BoundingBox> in the literals field of UnboundPredicate<ByteBuffer> by violating the type signature of List<Literal<T>> literals? I think we can still make it compile, but I just want to make sure that it is the right way to do it.
There was a problem hiding this comment.
I'm saying that BoundingBoxLiteral should be Literal<BoundingBox> and not Literal<ByteBuffer>. We don't need the term and literal types to match later on.
|
|
||
| @Override | ||
| protected Type.TypeID typeId() { | ||
| return Type.TypeID.GEOMETRY; |
There was a problem hiding this comment.
this seems a bit weird to me. bounding box literal can only be GEOMETRY type?
Maybe we should revisit a previous discussion.
#12667 (comment)
Maybe the BoundingBox needs to include the geo type?
There was a problem hiding this comment.
Yes, this is indeed strange, although this implementation works in this case since typeId() is only used by BaseLiteral.toByteBuffer for serializing the value of the literal.
I found 2 ways to fix this weirdness:
- Making
BoundingBoxLiteraldirectly implementinginterface Literal, but not extends fromBaseLiteral. TheLiteralinterface does not have thetypeIdmethod we don't have to come up with a confusing implementation. - Adding geo type to
BoundingBox. Check that the arguments passed to spatial predicates must have the same geo type.
I am working towards approach 2 and see if everything fits perfectly after revising the BoundingBox definition.
There was a problem hiding this comment.
I decided to switch to approach 1 and remove the typeId() method from BoundingBoxLiteral. Here are the reasons:
- If
BoundingBoxcarries typeId, it should also carry the full fledged type information including CRS and algorithm. Otherwise it would be strange that only the geospatial type ID is explicitly specified while other type attributes inherit from the type of the field. - Iceberg needs to check that the type of the field and the bounding box matches with each other when binding and evaluating spatial predicates. The CRS and algorithm also need to match.
- Once the query engine checked that the geospatial field is supported by the engine and decided to generate a bounding box for pushing down a spatial predicate, it needs extract the geospatial type attributes from the Iceberg schema to construct the
BoundingBox. The CRS attribute is tightly coupled with the iceberg table since it is the name of the table property defining the full CRS in PROJJSON, so there's a high chance that the engine will extract field type in iceberg table schema directly use it to construct the bounding box. Inferring the type of BoundingBox according to the context would eliminate the need for this extra step. - The untyped BoundingBox looks concise in the REST Catalog SPEC: REST: Add spatial expressions to REST Catalog OpenAPI Spec #14856. Adding type information to the bounding box would make the data model more cluttered.
There was a problem hiding this comment.
I agree with the reasoning. Let's go with approach 1 for now.
api/src/main/java/org/apache/iceberg/expressions/ExpressionUtil.java
Outdated
Show resolved
Hide resolved
api/src/main/java/org/apache/iceberg/expressions/ExpressionVisitors.java
Show resolved
Hide resolved
core/src/test/java/org/apache/iceberg/expressions/TestExpressionParser.java
Outdated
Show resolved
Hide resolved
c320c60 to
5bb1a6b
Compare
api/src/main/java/org/apache/iceberg/expressions/UnboundPredicate.java
Outdated
Show resolved
Hide resolved
5bb1a6b to
2a56c2b
Compare
2a56c2b to
6a4a1b3
Compare
api/src/main/java/org/apache/iceberg/expressions/UnboundPredicate.java
Outdated
Show resolved
Hide resolved
api/src/main/java/org/apache/iceberg/expressions/Evaluator.java
Outdated
Show resolved
Hide resolved
api/src/test/java/org/apache/iceberg/types/TestConversions.java
Outdated
Show resolved
Hide resolved
api/src/test/java/org/apache/iceberg/expressions/TestEvaluator.java
Outdated
Show resolved
Hide resolved
api/src/test/java/org/apache/iceberg/expressions/TestExpressionUtil.java
Outdated
Show resolved
Hide resolved
api/src/test/java/org/apache/iceberg/expressions/TestExpressionUtil.java
Outdated
Show resolved
Hide resolved
api/src/test/java/org/apache/iceberg/expressions/TestPredicateBinding.java
Outdated
Show resolved
Hide resolved
core/src/main/java/org/apache/iceberg/expressions/ExpressionParser.java
Outdated
Show resolved
Hide resolved
core/src/main/java/org/apache/iceberg/expressions/ExpressionParser.java
Outdated
Show resolved
Hide resolved
api/src/main/java/org/apache/iceberg/expressions/UnboundPredicate.java
Outdated
Show resolved
Hide resolved
| return new Literals.DecimalLiteral(value); | ||
| } | ||
|
|
||
| static Literal<ByteBuffer> of(BoundingBox value) { |
There was a problem hiding this comment.
This is probably the first case in Iceberg where the literal value type (BoundingBox) is different with the field value (ByteBuffer). I have been wondering if the return type should be Literal<BoundingBox>.
api/src/test/java/org/apache/iceberg/expressions/TestExpressionUtil.java
Outdated
Show resolved
Hide resolved
4bb3e84 to
fc86d0c
Compare
|
@stevenzwu @huaxingao @wgtmac Hi folks, happy new year! Can you please take a look at this PR again? |
|
Sorry—I didn’t notice this PR is scheduled for discussion in the next community sync. Let’s wait and follow up after the sync. |
|
@Kontinuation please resolve the merge conflict |
c6698fb to
f55e6e2
Compare
| return new UnboundPredicate<>(Expression.Operation.NOT_STARTS_WITH, expr, value); | ||
| } | ||
|
|
||
| public static UnboundPredicate<ByteBuffer> stIntersects(String name, BoundingBox value) { |
There was a problem hiding this comment.
Why is this a BoundingBox rather than passing values directly? Should we have a version that accepts xmin, xmax, ymin, ymax for simple cases?
There was a problem hiding this comment.
GeospatialBound also have additionally optional dimensions z and m. Hence, I thought BoundingBox(GeospatialBound min, GeospatialBound max) is probably good here.
If we want to provide some convenient constructor, should we do it in BoundingBox?
public static BoundingBox from(xmin, xmax, ymin, ymax)
There was a problem hiding this comment.
I think I'm fine either way. We could include an override here that produces a bounding box. I agree with not having 6 and 8 argument versions of the method. And we will need to consider what to name the factory methods in BoundingBox because it can be ambiguous whether you're passing z or m through a 6-arg factory method.
| } | ||
|
|
||
| public static UnboundPredicate<ByteBuffer> stIntersects( | ||
| UnboundTerm<ByteBuffer> expr, BoundingBox value) { |
There was a problem hiding this comment.
What unbound term is this intended to accept? I don't think that we have anything to pass here other than refs and if that's the case we should remove this. We don't add anything to the public API that isn't actually needed.
|
|
||
| public <T> R stIntersects(BoundReference<T> ref, Literal<ByteBuffer> lit) { | ||
| throw new UnsupportedOperationException( | ||
| "stIntersects expression is not supported by the visitor"); |
There was a problem hiding this comment.
No need for "by the visitor" because this is clear from the stack trace.
There was a problem hiding this comment.
Other methods such as startsWith, in and notIn also includes "by the visitor" in the error message. Would you like to have a separate PR to fix these error messages as well?
There was a problem hiding this comment.
Yes, and I don't think it is necessary to modify those. But we don't need this and try to keep things direct and minimal.
| "notStartsWith expression is not supported by the visitor"); | ||
| } | ||
|
|
||
| public <T> R stIntersects(BoundReference<T> ref, Literal<ByteBuffer> lit) { |
There was a problem hiding this comment.
Why does this only work with ByteBuffer? Wouldn't it be better to use BoundingBox?
My guess is that when the expression is bound, the literal gets coerced and the Java type we used for the geo types is ByteBuffer? I think we need to address this so that we don't deserialize this for every evaluation. This visitor is called in a tight loop and this would lose a lot of performance.
There was a problem hiding this comment.
I think that the right signature for this is stIntersects(BoundReference<T>, BoundingBox) or maybe with BoundReference<ByteBuffer>. The use of T in both BoundReference and Literal is because we want comparisons to be able to operate without requiring casting. That way, ref.comparator() is a Comparator<T> and both ref.accessor().get(row) and lit.value() are as well.
This is correct that we don't need T to match, but it's awkward to get a BoundReference<T> and a concrete Literal. Since we are going to need to trust binding to geo types, we can update the visitor to pass the BoundingBox. I'm not sure that we want to wrap BoundingBox in Literal, but I'm learning against doing it. I think we should just pass the right types here. We can unwrap the literal from the BoundingBox in the visitor and the cast will guarantee the type is correct, rather than putting the responsibility on the implementations.
| return notStartsWith((BoundReference<T>) pred.term(), literalPred.literal()); | ||
| case ST_INTERSECTS: | ||
| return stIntersects( | ||
| (BoundReference<T>) pred.term(), (Literal<ByteBuffer>) literalPred.literal()); |
There was a problem hiding this comment.
This unchecked cast is suspicious and should be fixed. The solution is probably to fix the literal type from binding.
There was a problem hiding this comment.
I think we still need to do an unchecked cast to cast literalPred.literal from Literal<T> to Literal<BoundingBox>, even if we define BoundingBoxLiteral implements Literal<BoundingBox>. Is this cast acceptable and look less problematic?
There was a problem hiding this comment.
Yes, we may still need to do an unchecked cast. This is an odd situation where the literal and term don't produce the same thing. I would rather have an unchecked cast here and pass BoundingBox than try to make everything work as ByteBuffer, where the Java type matches but the actual values are completely different.
| } | ||
| } | ||
|
|
||
| static class BoundingBoxLiteral implements Literal<ByteBuffer> { |
There was a problem hiding this comment.
I think this should be Literal<BoundingBox> since that's what we need in the evaluators.
There was a problem hiding this comment.
Implementing Literal<ByteBuffer> was for not breaking the type constraint of List<Literal<T>> literals in UnboundPredicate<ByteBuffer> and Literal<T> literal of BoundLiteralPredicate<ByteBuffer>. Do you suggest that we break the constraint for spatial predicates involving bounding box literals, or do a major refactoring by implementing new variants of predicates?
There was a problem hiding this comment.
I don't think this requires breaking constraints for UnboundPredicate. Those constraints are not violated:
Tis a type held by a literal, in this caseBoundingBoxLiteral<T>is a literal that holds that type, which isLiteral<BoundingBox>implemented byBoundingBoxLiteralList<Literal<T>>is a holder in case the unbound expression has multiple literals. It won't in this case.
| this.value = value.toByteBuffer(); | ||
| } | ||
|
|
||
| BoundingBoxLiteral(ByteBuffer value) { |
There was a problem hiding this comment.
I don't think this is needed. Conversion from Literal<ByteBuffer> should be done in BinaryLiteral and from Literal<byte[]> in FixedLiteral.
| } | ||
|
|
||
| @Override | ||
| public Comparator<ByteBuffer> comparator() { |
There was a problem hiding this comment.
This doesn't quite fit since we don't expect the types to align. I think this should throw an exception because there isn't a comparator.
| } | ||
|
|
||
| BoundingBoxLiteral that = (BoundingBoxLiteral) other; | ||
| return comparator().compare(value(), that.value()) == 0; |
There was a problem hiding this comment.
If this works, then the comparator is incorrect for values. This is comparing serialized bounding boxes.
| @Override | ||
| Object readResolve() throws ObjectStreamException { | ||
| return new Literals.BoundingBoxLiteral( | ||
| ByteBuffer.wrap(bytes()).order(ByteOrder.LITTLE_ENDIAN)); |
There was a problem hiding this comment.
By setting order, this changes the buffer that was serialized. I think that the expression code should leave the byte order alone and the BoundingBox serialization and deserialization should handle it. That serialization should also be duplicating the ByteBuffer anyway because it should not be consuming or changing the incoming buffer.
| case GEOGRAPHY: | ||
| // GEOMETRY and GEOGRAPHY values are represented as byte buffers. They are stored as WKB | ||
| // (Well-Known Binary) format in Iceberg, and geospatial bounding boxes are stored as | ||
| // serialized byte buffers. Return the byte buffer as is. |
There was a problem hiding this comment.
This should not be used for BoundingBox. This is for values, not for what you may compare values against.
This PR is based on #12667. 2 spatial predicates were added to operate on geospatial bounds:
ST_INTERSECTS: Test if 2 geospatial bounds intersects with each otherST_DISJOINT: The negation ofST_INTERSECTS.A new literal type
BoundingBoxLiteralis added for representing a geospatial query window or geospatial bounds in metrics or Parquet stats.