Skip to content

Conversation

@kiszk
Copy link
Member

@kiszk kiszk commented Nov 26, 2017

What changes were proposed in this pull request?

This PR adds a new API to CodeGenenerator.splitExpression since since several CodeGenenerator.splitExpression are used with ctx.INPUT_ROW to avoid code duplication.

How was this patch tested?

Used existing test suits

@SparkQA
Copy link

SparkQA commented Nov 26, 2017

Test build #84191 has finished for PR 19821 at commit 7b6526a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk kiszk changed the title [WIP][SPARK-22608][SQL] add new API to CodeGeneration.splitExpressions() [SPARK-22608][SQL] add new API to CodeGeneration.splitExpressions() Nov 26, 2017
@kiszk kiszk changed the title [SPARK-22608][SQL] add new API to CodeGeneration.splitExpressions() [WIP][SPARK-22608][SQL] add new API to CodeGeneration.splitExpressions() Nov 26, 2017
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

keep it unchanged?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, thanks.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make it consistent, how about changing it to

 def splitExpressions(row: String, arguments: Seq[(String, String)]): String = {

Copy link
Member Author

@kiszk kiszk Nov 27, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. In addition to that, how about this since many caller passes INPUT_ROW?

def splitExpressions(row: String, arguments: Seq[(String, String)] = ("InternalRow", INPUT_ROW)): String = {

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let the caller also provides ctx.INPUT_ROW?
Change it to

arguments: Seq[(String, String)]

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that it is good from the view of consistency. I have one question in my mind.

If we use the same argument name arguments, is it possible to for developer to distinguish this splitExpressions from the below (rich) splitExpressions when they want to pass only three arguments expressions, funcName, and arguments?

Copy link
Member

@gatorsmile gatorsmile Nov 27, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we combine these two functions? since the next one already provides the default values.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now, we have three splitExpressions in this PR.

  1. splitExpression(row, expressions)
  2. splitExpression(expressions, funcName, arguments)
  3. splitExpression(expressions, funcName, arguments, returnType, ...)

It is hard to combine 2. and 3. since 2. takes care of INPUT_ROW and currentVars while 3. does not take care of them.
Are you suggesting me to combine 1. and 2. which take care of INPUT_ROW and currentVars?

Copy link
Member

@gatorsmile gatorsmile Nov 27, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you check all the callers of case 3 to ensure we do not miss checking INPUT_ROW and currentVars for any of them?

In addition, case 1 and 2 can be easily combined.

I think we need a different name for case 1 and 2. How about splitExpressionsOnInputRow?

cc @cloud-fan

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. I will check it tonight.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I confirmed there are some cases that do not require to check INPUT_ROW and currentVars.

  • access fields in struct
  • UnsafeJoiner
  • comparison for ordering

I will try to merge cases 1 and 2. If a different name is required, I will use splitExpressionsOnInputRow.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we avoid using return?

Copy link
Member Author

@kiszk kiszk Nov 27, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I followed to use return like the above splitExpressions. Is it better for this place to write as follows?

if (....) {
  expressions.mkString("\n")
} else {
  splitExpressions(...)
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If-else looks better. Thanks!

@cloud-fan
Copy link
Contributor

is it really worth? seems not used in many places and eventually the if-else will be removed after we make splitExpression work with whole stage codegen

@kiszk
Copy link
Member Author

kiszk commented Nov 28, 2017

I have no strong preference.
@gatorsmile WDYT?

@cloud-fan
Copy link
Contributor

@kiszk Can you fix the conflict? now we can add a middle-advanced version:

def splitExpressions(
    expressions: Seq[String],
    funcName: String,
    extraArguments: Seq[(String, String)])

@kiszk
Copy link
Member Author

kiszk commented Nov 29, 2017

Sure, I have resolved the conflict in my environment. I will commit soon.


/**
* Splits the generated code of expressions into multiple functions, because function has
* 64kb code size limit in JVM. This version takes care of INPUT_ROW and currentVars
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit

Similar to [[splitExpressions(expressions: Seq[String])]], but has customized function name and extra arguments.

@cloud-fan
Copy link
Contributor

LGTM, can you remove WIP in PR title?

@SparkQA
Copy link

SparkQA commented Nov 29, 2017

Test build #84293 has finished for PR 19821 at commit 5332f12.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk kiszk changed the title [WIP][SPARK-22608][SQL] add new API to CodeGeneration.splitExpressions() [SPARK-22608][SQL] add new API to CodeGeneration.splitExpressions() Nov 29, 2017
@SparkQA
Copy link

SparkQA commented Nov 29, 2017

Test build #84298 has finished for PR 19821 at commit 0a218fc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@gatorsmile
Copy link
Member

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants