-
Notifications
You must be signed in to change notification settings - Fork 2.5k
[CALCITE-6372] Support ASOF joins #3883
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
The CI failures are not fundamental, I will try to fix them today |
26a1aa6 to
d4ec7aa
Compare
|
I am very interested in this PR and I may need to spend some time reviewing it. |
|
Go ahead |
| double innerRowCount = left * right * selectivity; | ||
| switch (join.getJoinType()) { | ||
| case ASOF: | ||
| case LEFT_ASOF: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is ASOF similar to inner join or left join?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this implementation we have both ASOF and LEFT ASOF.
ASOF itself is an INNER join.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that LEFT_ASOF returns left. Would it be more accurate for asof to return (left * selectivity)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not familiar with this part of the code, I see that other places use 1 - selectivity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding is that for a regular inner join, selectivity means the probability of the condition being true in the result, and 1 - selectivity is because for an outer join we need to add the unmatched rows on the null side.
For asof join, we should consider how many matches there are against the left table, and we also could to consider the matchCondition, which is more complicated. Perhaps this optimization can be discussed in another issue.
| import java.util.List; | ||
| import java.util.Set; | ||
|
|
||
| /** Implementation of {@link LogicalAsofJoin} in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor: Implementation of {@link Join} not LogicalAsofJoin
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well, this doesn't implement a generic join, it only implements an ASOF join.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should consider introducing a new AsofJoin class? Both LogicalAsofJoin and EnumerableAsofJoin can extends this class
| condition, | ||
| matchCondition); | ||
| } else { | ||
| if (type == JoinType.ASOF && type != JoinType.LEFT_ASOF) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The condition is type == JoinType.ASOF || type == JoinType.LEFT_ASOF ?
| /** Implementation of {@link LogicalAsofJoin} in | ||
| * {@link EnumerableConvention enumerable calling convention}. */ | ||
| public class EnumerableAsofJoin extends Join implements EnumerableRel { | ||
| // This implementation is based on {@link EnumerableHashJoin} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This implementation is based on {@link EnumerableHashJoin},
Can this class directly extends EnumerableHashJoin?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think there would be much code reuse.
| @Override public EnumerableAsofJoin copy(RelTraitSet traitSet, RexNode condition, | ||
| RelNode left, RelNode right, JoinRelType joinType, | ||
| boolean semiJoinDone) { | ||
| // This method does not know about the matchCondition, so it should not be called |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it necessary to support the copy of changing matchCondition?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Certainly this method cannot be right, so it's better to catch attempts to call it.
| // match comparator | ||
| .append(matchPredicate) | ||
| // timestamp comparator | ||
| .append(timestampComparator) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the comparison limited to timestamp only?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"timestamp" is just a name for a pair of columns from each side that are being compared.
It has nothing to do with the TIMESTAMP data type.
The MATCH_CONDITION is restricted to one comparison between a pair of columns from either side. That's enforced by the validator, and part of the spec.
|
|
||
| /** | ||
| * An ASOF JOIN operation combines rows from two tables based on comparable timestamp values. | ||
| * For each row in the left table, the join finds a single row in the right table that has the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any confusion in this description? From the description, it may be understood as left join.
| if (joinCondition != null) { | ||
| switch (join.getConditionType()) { | ||
| case ON: | ||
| writer.keyword("ON"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this restriction need to be processed in sql Parser.jj?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is, right now only ON is supported in the parser.
I plan to add USING later.
| * @param right Right input | ||
| * @param hints Hints | ||
| * @param condition Join condition | ||
| * @param matchCondition ASOF join match condition |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing joinType
| ImmutableList<RelDataTypeField> systemFieldList) { | ||
| super(cluster, traitSet, hints, left, right, condition, ImmutableSet.of(), joinType); | ||
| this.systemFieldList = requireNonNull(systemFieldList, "systemFieldList"); | ||
| this.matchCondition = matchCondition; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
limit requireNonNull(matchCondition)?
| case FULL: | ||
| return JoinType.FULL; | ||
| case ASOF: | ||
| return JoinType.ASOF; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing LEFT_ASOF case?
|
I will review this PR this weekend, sorry for the delay |
|
@YiwenWu thank you for the review, I have pushed a commit with fixes for problems you found. |
| import java.util.List; | ||
| import java.util.Set; | ||
|
|
||
| /** Implementation of {@link LogicalAsofJoin} in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should consider introducing a new AsofJoin class? Both LogicalAsofJoin and EnumerableAsofJoin can extends this class
| + join.getRight().getRowType().getFieldCount(); | ||
| final RexNode conditionExpr = join.getCondition(); | ||
| final RexNode matchConditionExpr = (join instanceof LogicalAsofJoin) | ||
| ? ((LogicalAsofJoin) join).getMatchCondition() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Directly limit ((LogicalAsofJoin) join).getMatchCondition() always not null?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you suggest an Objects.requireNonNull? I am not sure what you would gain from that.
| joinType = JoinType() | ||
| e2 = TableRef1(ExprContext.ACCEPT_QUERY_OR_JOIN) | ||
| ( | ||
| [ <MATCH_CONDITION> matchCondition = Expression(ExprContext.ACCEPT_SUB_QUERY) ] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(1) This grammar does not require parentheses, but accepts them, I think it's more general this way. We can add them if desired.
(2) The grammar accepts any expression, but the validator will reject complex expressions, accepting only expressions that contain a single comparison. This would be very hard to do using the grammar
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(1). This is a minor issue. The current implementation is based on Snowflake syntax, so keeping it consistent can help more users avoid confusion and learning new syntax.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This implementation accepts the Snowflake syntax, so users familiar with that syntax don't have to learn anything new. I think in general it's ok to be put as few constraints as possible on the programs you accept (the Jon Postel principle: https://en.wikipedia.org/wiki/Robustness_principle).
Also, please note that the Snowflake construct is deemed experimental in their documentation, so I don't think we should strive too much to adhere to it, maybe they will change their syntax too. (I hope they will, and, who knows, maybe we can even influence them?)
|
I have added a base |
YiwenWu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
| matchCondition = expand(matchCondition, joinScope); | ||
| join.setOperand(6, matchCondition); | ||
| validateWhereOrOn(joinScope, matchCondition, "MATCH_CONDITION"); | ||
| ConjunctionOfEqualities conj = new ConjunctionOfEqualities(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be better to replace RelOptUtil.conjunctions() and match equal here.
Maybe the bool filed type affects like bool_column3 = (bool_column1 and bool_column2). WDYT
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Of course, this is a rare case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"and' is not the same as "=" for booleans, since "and" returns "false" for "false = false".
But maybe I misunderstand your question.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm guessing that the checks here refer to snowflake's specification, i.e. "ON clause for ASOF JOIN must contain conjunctions of equality conditions only. Each side of an equality condition must only refer to either the left table or the right table. (S.STATE = P.STATE) OR (S.LOCATION = P.LOCATION) is invalid."
We expect that condition needs to satisfy
l.c1 = r.c1 and l.c2 = r.c2
But if you just check if sqlcall is equal / and, may a case like the following would also pass, if the types of c1 and c2 would both be boolean
(l.c1 and r.c1 ) and (l.c2 and r.c2 )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I will add a test case to make sure this doesn't happen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately I can't reuse RelOptUtil because that works on RexNode, and here I only have SqlNode objects.
I will improve the ConjunctionOfEqualities visitor.
|
Thank you @suibianwanwank, I hope I have fixed the problems you highlighted. |
|
Does anyone else plan to review this PR or can I merge it? |
|
I have tagged this with "will merge soon", please untag if you plan to review or disagree. |
suibianwanwank
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
| } | ||
| if (kind == SqlKind.AND) { | ||
| List<SqlNode> operands = call.getOperandList(); | ||
| for (SqlNode operand : operands) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that there might still be an issue here. If the condition is a.id = b.id AND a.id2 = b.id2 AND a.id3 = b.id3, this might not pass as expected because the conditions in SqlNode are not in the same operandList.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In Calcite AND can have N operands and I believe that there is a pass prior to this which will make sure that a sequence of AND operators is flattened into a single operator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It look like that is only true for RexNodes, so you are right, I will have to handle this case too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have fixed this case and added a test. Great observation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good✅
Signed-off-by: Mihai Budiu <[email protected]>
Signed-off-by: Mihai Budiu <[email protected]>
Signed-off-by: Mihai Budiu <[email protected]>
Signed-off-by: Mihai Budiu <[email protected]>
|
| e2 = TableRef1(ExprContext.ACCEPT_QUERY_OR_JOIN) | ||
| ( | ||
| [ <MATCH_CONDITION> matchCondition = Expression(ExprContext.ACCEPT_SUB_QUERY) ] | ||
| <ON> { on = JoinConditionType.ON.symbol(getPos()); } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mihaibudiu this means that the ON clause is mandatory for ASOF joins as well, is that intentional? Snowflake's ASOF JOIN only requires the MATCH_CONDITION, with ON / USING being optional - https://docs.snowflake.com/en/sql-reference/constructs/asof-join#syntax.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess that it is mandatory with this implementation
And you can't just use 'true' either, since it requires an equijoin
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And you can't just use 'true' either, since it requires an equijoin
Actually, this does seem to work because the ConjunctionOfEqualities class in SqlValidatorImpl doesn't override the literal visit method of SqlShuttle so ConjunctionOfEqualities.illegal retains its default value of false for a join condition like ON true.




This PR is probably easiest to review commit by commit. The 4 commits correspond to various Calcite program representations.