Skip to content

API, Core: Add geospatial predicates and bounding box literals#14101

Open
Kontinuation wants to merge 13 commits intoapache:mainfrom
Kontinuation:pr-2-geo-expr
Open

API, Core: Add geospatial predicates and bounding box literals#14101
Kontinuation wants to merge 13 commits intoapache:mainfrom
Kontinuation:pr-2-geo-expr

Conversation

@Kontinuation
Copy link
Member

This PR is based on #12667. 2 spatial predicates were added to operate on geospatial bounds:

  • ST_INTERSECTS: Test if 2 geospatial bounds intersects with each other
  • ST_DISJOINT: The negation of ST_INTERSECTS.

A new literal type BoundingBoxLiteral is added for representing a geospatial query window or geospatial bounds in metrics or Parquet stats.

@github-actions
Copy link

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Oct 19, 2025
@github-actions
Copy link

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this Oct 27, 2025
@stevenzwu stevenzwu reopened this Nov 10, 2025
@github-actions github-actions bot removed the stale label Nov 11, 2025
@Kontinuation Kontinuation force-pushed the pr-2-geo-expr branch 2 times, most recently from 560a29d to fa54b84 Compare November 18, 2025 08:55
@Kontinuation Kontinuation changed the title [WIP] API, Core: Add geospatial predicates and bounding box literals API, Core: Add geospatial predicates and bounding box literals Nov 18, 2025
@Kontinuation Kontinuation marked this pull request as ready for review November 18, 2025 16:20
@jiayuasu
Copy link
Member

@stevenzwu @szehon-ho @huaxingao @wgtmac @hsiang-c Would you please take a look at this PR?

return new Literals.DecimalLiteral(value);
}

static Literal<ByteBuffer> of(BoundingBox value) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

curious if BoundingBox is the only geo literal we are going to have.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BoundingBox is the only geo literal we plan to add in the series of PRs. I think the current design is open to adding more types of geospatial literals.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably the first case in Iceberg where the literal value type (BoundingBox) is different with the field value (ByteBuffer). I have been wondering if the return type should be Literal<BoundingBox>.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BoundingBoxLiteral implements Literal<ByteBuffer> to have compatible type with the Java types of iceberg geospatial types, so the return type was defined as Literal<ByteBuffer>.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the Java type needs to match. I think we just need to handle this in UnboundPredicate.bind.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you suggest that we hold Literal<BoundingBox> in the literals field of UnboundPredicate<ByteBuffer> by violating the type signature of List<Literal<T>> literals? I think we can still make it compile, but I just want to make sure that it is the right way to do it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm saying that BoundingBoxLiteral should be Literal<BoundingBox> and not Literal<ByteBuffer>. We don't need the term and literal types to match later on.


@Override
protected Type.TypeID typeId() {
return Type.TypeID.GEOMETRY;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems a bit weird to me. bounding box literal can only be GEOMETRY type?

Maybe we should revisit a previous discussion.
#12667 (comment)

Maybe the BoundingBox needs to include the geo type?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is indeed strange, although this implementation works in this case since typeId() is only used by BaseLiteral.toByteBuffer for serializing the value of the literal.

I found 2 ways to fix this weirdness:

  1. Making BoundingBoxLiteral directly implementing interface Literal, but not extends from BaseLiteral. The Literal interface does not have the typeId method we don't have to come up with a confusing implementation.
  2. Adding geo type to BoundingBox. Check that the arguments passed to spatial predicates must have the same geo type.

I am working towards approach 2 and see if everything fits perfectly after revising the BoundingBox definition.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I decided to switch to approach 1 and remove the typeId() method from BoundingBoxLiteral. Here are the reasons:

  1. If BoundingBox carries typeId, it should also carry the full fledged type information including CRS and algorithm. Otherwise it would be strange that only the geospatial type ID is explicitly specified while other type attributes inherit from the type of the field.
  2. Iceberg needs to check that the type of the field and the bounding box matches with each other when binding and evaluating spatial predicates. The CRS and algorithm also need to match.
  3. Once the query engine checked that the geospatial field is supported by the engine and decided to generate a bounding box for pushing down a spatial predicate, it needs extract the geospatial type attributes from the Iceberg schema to construct the BoundingBox. The CRS attribute is tightly coupled with the iceberg table since it is the name of the table property defining the full CRS in PROJJSON, so there's a high chance that the engine will extract field type in iceberg table schema directly use it to construct the bounding box. Inferring the type of BoundingBox according to the context would eliminate the need for this extra step.
  4. The untyped BoundingBox looks concise in the REST Catalog SPEC: REST: Add spatial expressions to REST Catalog OpenAPI Spec #14856. Adding type information to the bounding box would make the data model more cluttered.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the reasoning. Let's go with approach 1 for now.

@Kontinuation Kontinuation force-pushed the pr-2-geo-expr branch 2 times, most recently from c320c60 to 5bb1a6b Compare December 11, 2025 13:28
return new Literals.DecimalLiteral(value);
}

static Literal<ByteBuffer> of(BoundingBox value) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably the first case in Iceberg where the literal value type (BoundingBox) is different with the field value (ByteBuffer). I have been wondering if the return type should be Literal<BoundingBox>.

@jiayuasu
Copy link
Member

@stevenzwu @huaxingao @wgtmac Hi folks, happy new year! Can you please take a look at this PR again?

@huaxingao
Copy link
Contributor

Thanks all. This PR looks in good shape and I’d like to merge it soon. I’ll wait until Wednesday for any additional feedback from @hsiang-c @wgtmac @rdblue; if there are no further comments by then, I’ll go ahead and merge.

@huaxingao
Copy link
Contributor

Sorry—I didn’t notice this PR is scheduled for discussion in the next community sync. Let’s wait and follow up after the sync.

@stevenzwu
Copy link
Contributor

@Kontinuation please resolve the merge conflict

return new UnboundPredicate<>(Expression.Operation.NOT_STARTS_WITH, expr, value);
}

public static UnboundPredicate<ByteBuffer> stIntersects(String name, BoundingBox value) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this a BoundingBox rather than passing values directly? Should we have a version that accepts xmin, xmax, ymin, ymax for simple cases?

Copy link
Contributor

@stevenzwu stevenzwu Feb 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GeospatialBound also have additionally optional dimensions z and m. Hence, I thought BoundingBox(GeospatialBound min, GeospatialBound max) is probably good here.

If we want to provide some convenient constructor, should we do it in BoundingBox?

public static BoundingBox from(xmin, xmax, ymin, ymax)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm fine either way. We could include an override here that produces a bounding box. I agree with not having 6 and 8 argument versions of the method. And we will need to consider what to name the factory methods in BoundingBox because it can be ambiguous whether you're passing z or m through a 6-arg factory method.

}

public static UnboundPredicate<ByteBuffer> stIntersects(
UnboundTerm<ByteBuffer> expr, BoundingBox value) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What unbound term is this intended to accept? I don't think that we have anything to pass here other than refs and if that's the case we should remove this. We don't add anything to the public API that isn't actually needed.


public <T> R stIntersects(BoundReference<T> ref, Literal<ByteBuffer> lit) {
throw new UnsupportedOperationException(
"stIntersects expression is not supported by the visitor");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for "by the visitor" because this is clear from the stack trace.

Copy link
Member Author

@Kontinuation Kontinuation Feb 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other methods such as startsWith, in and notIn also includes "by the visitor" in the error message. Would you like to have a separate PR to fix these error messages as well?

public <T> R in(BoundReference<T> ref, Set<T> literalSet) {
throw new UnsupportedOperationException("In expression is not supported by the visitor");
}
public <T> R notIn(BoundReference<T> ref, Set<T> literalSet) {
throw new UnsupportedOperationException("notIn expression is not supported by the visitor");
}
public <T> R startsWith(BoundReference<T> ref, Literal<T> lit) {
throw new UnsupportedOperationException(
"startsWith expression is not supported by the visitor");
}
public <T> R notStartsWith(BoundReference<T> ref, Literal<T> lit) {
throw new UnsupportedOperationException(
"notStartsWith expression is not supported by the visitor");
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, and I don't think it is necessary to modify those. But we don't need this and try to keep things direct and minimal.

"notStartsWith expression is not supported by the visitor");
}

public <T> R stIntersects(BoundReference<T> ref, Literal<ByteBuffer> lit) {
Copy link
Contributor

@rdblue rdblue Feb 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this only work with ByteBuffer? Wouldn't it be better to use BoundingBox?

My guess is that when the expression is bound, the literal gets coerced and the Java type we used for the geo types is ByteBuffer? I think we need to address this so that we don't deserialize this for every evaluation. This visitor is called in a tight loop and this would lose a lot of performance.

Copy link
Contributor

@rdblue rdblue Feb 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that the right signature for this is stIntersects(BoundReference<T>, BoundingBox) or maybe with BoundReference<ByteBuffer>. The use of T in both BoundReference and Literal is because we want comparisons to be able to operate without requiring casting. That way, ref.comparator() is a Comparator<T> and both ref.accessor().get(row) and lit.value() are as well.

This is correct that we don't need T to match, but it's awkward to get a BoundReference<T> and a concrete Literal. Since we are going to need to trust binding to geo types, we can update the visitor to pass the BoundingBox. I'm not sure that we want to wrap BoundingBox in Literal, but I'm learning against doing it. I think we should just pass the right types here. We can unwrap the literal from the BoundingBox in the visitor and the cast will guarantee the type is correct, rather than putting the responsibility on the implementations.

return notStartsWith((BoundReference<T>) pred.term(), literalPred.literal());
case ST_INTERSECTS:
return stIntersects(
(BoundReference<T>) pred.term(), (Literal<ByteBuffer>) literalPred.literal());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This unchecked cast is suspicious and should be fixed. The solution is probably to fix the literal type from binding.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we still need to do an unchecked cast to cast literalPred.literal from Literal<T> to Literal<BoundingBox>, even if we define BoundingBoxLiteral implements Literal<BoundingBox>. Is this cast acceptable and look less problematic?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we may still need to do an unchecked cast. This is an odd situation where the literal and term don't produce the same thing. I would rather have an unchecked cast here and pass BoundingBox than try to make everything work as ByteBuffer, where the Java type matches but the actual values are completely different.

}
}

static class BoundingBoxLiteral implements Literal<ByteBuffer> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be Literal<BoundingBox> since that's what we need in the evaluators.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implementing Literal<ByteBuffer> was for not breaking the type constraint of List<Literal<T>> literals in UnboundPredicate<ByteBuffer> and Literal<T> literal of BoundLiteralPredicate<ByteBuffer>. Do you suggest that we break the constraint for spatial predicates involving bounding box literals, or do a major refactoring by implementing new variants of predicates?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this requires breaking constraints for UnboundPredicate. Those constraints are not violated:

  • T is a type held by a literal, in this case BoundingBox
  • Literal<T> is a literal that holds that type, which is Literal<BoundingBox> implemented by BoundingBoxLiteral
  • List<Literal<T>> is a holder in case the unbound expression has multiple literals. It won't in this case.

this.value = value.toByteBuffer();
}

BoundingBoxLiteral(ByteBuffer value) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is needed. Conversion from Literal<ByteBuffer> should be done in BinaryLiteral and from Literal<byte[]> in FixedLiteral.

}

@Override
public Comparator<ByteBuffer> comparator() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't quite fit since we don't expect the types to align. I think this should throw an exception because there isn't a comparator.

}

BoundingBoxLiteral that = (BoundingBoxLiteral) other;
return comparator().compare(value(), that.value()) == 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this works, then the comparator is incorrect for values. This is comparing serialized bounding boxes.

@Override
Object readResolve() throws ObjectStreamException {
return new Literals.BoundingBoxLiteral(
ByteBuffer.wrap(bytes()).order(ByteOrder.LITTLE_ENDIAN));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By setting order, this changes the buffer that was serialized. I think that the expression code should leave the byte order alone and the BoundingBox serialization and deserialization should handle it. That serialization should also be duplicating the ByteBuffer anyway because it should not be consuming or changing the incoming buffer.

case GEOGRAPHY:
// GEOMETRY and GEOGRAPHY values are represented as byte buffers. They are stored as WKB
// (Well-Known Binary) format in Iceberg, and geospatial bounding boxes are stored as
// serialized byte buffers. Return the byte buffer as is.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should not be used for BoundingBox. This is for values, not for what you may compare values against.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants