-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-22608][SQL] add new API to CodeGeneration.splitExpressions() #19821
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #84191 has finished for PR 19821 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
keep it unchanged?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch, thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To make it consistent, how about changing it to
def splitExpressions(row: String, arguments: Seq[(String, String)]): String = {There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. In addition to that, how about this since many caller passes INPUT_ROW?
def splitExpressions(row: String, arguments: Seq[(String, String)] = ("InternalRow", INPUT_ROW)): String = {
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let the caller also provides ctx.INPUT_ROW?
Change it to
arguments: Seq[(String, String)]There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that it is good from the view of consistency. I have one question in my mind.
If we use the same argument name arguments, is it possible to for developer to distinguish this splitExpressions from the below (rich) splitExpressions when they want to pass only three arguments expressions, funcName, and arguments?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we combine these two functions? since the next one already provides the default values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now, we have three splitExpressions in this PR.
splitExpression(row, expressions)splitExpression(expressions, funcName, arguments)splitExpression(expressions, funcName, arguments, returnType, ...)
It is hard to combine 2. and 3. since 2. takes care of INPUT_ROW and currentVars while 3. does not take care of them.
Are you suggesting me to combine 1. and 2. which take care of INPUT_ROW and currentVars?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you check all the callers of case 3 to ensure we do not miss checking INPUT_ROW and currentVars for any of them?
In addition, case 1 and 2 can be easily combined.
I think we need a different name for case 1 and 2. How about splitExpressionsOnInputRow?
cc @cloud-fan
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. I will check it tonight.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I confirmed there are some cases that do not require to check INPUT_ROW and currentVars.
- access fields in struct
UnsafeJoiner- comparison for ordering
I will try to merge cases 1 and 2. If a different name is required, I will use splitExpressionsOnInputRow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we avoid using return?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I followed to use return like the above splitExpressions. Is it better for this place to write as follows?
if (....) {
expressions.mkString("\n")
} else {
splitExpressions(...)
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If-else looks better. Thanks!
|
is it really worth? seems not used in many places and eventually the if-else will be removed after we make |
|
I have no strong preference. |
|
@kiszk Can you fix the conflict? now we can add a middle-advanced version: |
|
Sure, I have resolved the conflict in my environment. I will commit soon. |
|
|
||
| /** | ||
| * Splits the generated code of expressions into multiple functions, because function has | ||
| * 64kb code size limit in JVM. This version takes care of INPUT_ROW and currentVars |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit
Similar to [[splitExpressions(expressions: Seq[String])]], but has customized function name and extra arguments.
|
LGTM, can you remove |
|
Test build #84293 has finished for PR 19821 at commit
|
|
Test build #84298 has finished for PR 19821 at commit
|
|
thanks, merging to master! |
|
LGTM |
What changes were proposed in this pull request?
This PR adds a new API to
CodeGenenerator.splitExpressionsince since severalCodeGenenerator.splitExpressionare used withctx.INPUT_ROWto avoid code duplication.How was this patch tested?
Used existing test suits