[SPARK-41434][CONNECT][PYTHON] Initial LambdaFunction implementation
#39068
Conversation
reviewers can refer to the implementation in PySpark
PySpark invokes `UnresolvedNamedLambdaVariable.freshVarName` in the JVM to get a unique variable name:

```scala
object UnresolvedNamedLambdaVariable {
  // Counter to ensure lambda variable names are unique
  private val nextVarNameId = new AtomicInteger(0)

  def freshVarName(name: String): String = {
    s"${name}_${nextVarNameId.getAndIncrement()}"
  }
}
```
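The same counter-based naming idea can be sketched in plain Python (a hypothetical helper for illustration, not the actual client code): a module-level counter guarantees that each generated variable name is unique within the process.

```python
from itertools import count

# Hypothetical Python counterpart of freshVarName: a shared counter
# appended to the base name makes every generated name unique.
_next_var_name_id = count(0)

def fresh_var_name(name: str) -> str:
    return f"{name}_{next(_next_var_name_id)}"
```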
I think, from looking at this the last time, that the reason for the variables is mostly to create unique aliases, not to actually reference them in the plan. The expression itself must be a boolean expression, and the lambda function we pass in is really just an expression transformation.
The failed
Can we see if we can do without this one? This is very much an implementation detail.
will remove it.
Can we just make this an unresolved attribute? And do the heavy lifting in the connect planner?
Unless there is special resolution in Catalyst for this type, +1 to reusing unresolved attribute
Can someone give me a good example of what the variable is supposed to do?
They are literally the argument names in a lambda. For example, in `x => x + 1`, the `UnresolvedNamedLambdaVariable` is `x`.
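The role of the variable can be illustrated with a tiny plain-Python sketch (no Spark dependency; the class and names are hypothetical): the client calls the user's lambda with placeholder objects bound to named variables, and what comes back is an expression tree that references those variables rather than a computed value.

```python
# Hypothetical stand-in for the lambda-variable expression.
class NamedLambdaVariable:
    def __init__(self, name: str):
        self.name = name

    def __add__(self, other):
        # Build a tiny expression node instead of computing a value.
        return ("add", self, other)

f = lambda x: x + 1                # user lambda, as in `x => x + 1`
var = NamedLambdaVariable("x_0")   # the `x` in the lambda
body = f(var)                      # expression tree with `var` as the argument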
I am on the fence about adding this message. We could go the CaseWhen route here as well. The only argument against that are the arguments themselves :).
What would the lambda function look like using the CaseWhen route?
ok, that would be simpler
> What would the lambda function look like using the CaseWhen route?

this one: #38956. We build the expressions in the Connect planner rather than letting FunctionRegistry do it.
actually, we can remove ExpressionString, UnresolvedStar, UnresolvedRegex, and Cast in the same way.
But now I'm not sure whether that is the correct way.
I think this is tricky because it's hard to provide clear guidance on what the right approach is. Generally, the downside of the unresolved function approach is that you're using a magic value that someone has to understand. This knowledge is now embedded in the client and cannot be inferred when looking at the protos.
will add LambdaFunction back in the protos
I think this is tricky because it's hard to provide clear guidance on what the right approach is. Generally, the downside of the unresolved function approach is that you're using a magic value that someone has to understand. This knowledge is now embedded in the client and cannot be inferred when looking at the protos.
Agreed, that is what I feel. We should avoid abusing unresolved functions.
If all arguments must be UnresolvedNamedLambdaVariables, why not define this as

```proto
repeated UnresolvedNamedLambdaVariable arguments = 2;
```
I was hitting something weird when using UnresolvedNamedLambdaVariables here
On the Scala side, a LambdaFunction actually accepts NamedExpressions as arguments; in PySpark, an argument is always an UnresolvedNamedLambdaVariable.
Here we use Expression instead of UnresolvedAttribute. A benefit I can imagine is that if we need to support more types of Expression, we do not need to change the proto.
Shall we simply use `repeated string ...`? If we think about the SQL API `exists(array_col, c -> ...)`, what we need to provide here is the argument name list, which is really just a list of strings.
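To make the point concrete, here is a minimal sketch of what the client would need to ship for `exists(array_col, c -> c > 0)` under the `repeated string` design. The dict is an illustrative stand-in for the proto message, not the real schema:

```python
# Hypothetical payload builder: the lambda message only needs the body
# expression plus the list of argument names.
def lambda_payload(body: str, arg_names):
    return {"function": body, "arguments": list(arg_names)}

payload = lambda_payload("c > 0", ["c"])
```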
In general I'm in favor of repeated string since it's clearer about the intention. I think this function is a good example where we need to put guidance on where the correct resolution of function arguments happens.
yes, that's simpler. At least `repeated string` is enough for the existing PySpark implementation. will update
grundprinzip left a comment
First pass.
Force-pushed from 432286c to 91b290c
This branch deserves its own function, please.
+1 will add it back
There are two interesting issues here:
- When someone submits to the API an expression that does not transform into UnresolvedExpression, this would throw a weird error message about the name parts, but actually it's the type that does not match.
- Why the restriction to single-part names? Is this a Spark limitation?
1. The error message also mentions UnresolvedAttribute.
2. The existing implementation in PySpark only uses single-part names.
python/pyspark/sql/connect/column.py
Adding these assertions here is helpful in the Python client, but the server side does not do the same assertion. If we drop the assertion on ColumnReference, what would happen on the server?
Is the analysis exception not better than the Python assertion?
we also check UnresolvedAttribute on the server side
one line?
done
Force-pushed from f8771a9 to d6aea83
the existing PySpark implementation only uses a single name part, so we check it on the server side
since we support up to 3 parameters, shall we add an assert here to avoid future mistakes?
@cloud-fan Here is the validation on the number of parameters
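The kind of client-side check being discussed can be sketched as follows (a hypothetical helper, mirroring the point above that higher-order functions accept at most three lambda parameters):

```python
import inspect

# Hypothetical arity check: reject lambdas with an unsupported
# number of parameters before building the proto message.
def check_arity(f):
    n = len(inspect.signature(f).parameters)
    if not 1 <= n <= 3:
        raise ValueError(f"lambda must take 1 to 3 arguments, got {n}")
    return n
```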
grundprinzip left a comment
Looks good from my side.
Do we get an analysis exception if more than three arguments are submitted?
There is a validation on the client side already, but I guess we need that on the server side as well?
ok, let me add it in server side
amaliujia left a comment
LGTM
IIUC, we use Column with UnresolvedAttribute to invoke the function, then on the server side we replace UnresolvedAttribute with LambdaVariable.
Shall we use Column with LambdaVariable to invoke the function on the client side directly?
the first commit used UnresolvedNamedLambdaVariable:

```python
arg_cols: List[Column] = []
for arg in arg_names[: len(parameters)]:
    # TODO: How to make sure lambda variable names are unique? RPC for increasing ID?
    _uuid = str(uuid.uuid4()).replace("-", "_")
    arg_cols.append(Column(UnresolvedNamedLambdaVariable([f"{arg}_{_uuid}"])))
result = f(*arg_cols)
if not isinstance(result, Column):
    raise ValueError(f"Callable {f} should return Column, got {type(result)}")
return LambdaFunction(result._expr, [arg._expr for arg in arg_cols])
```

then switched to UnresolvedAttribute according to the comments #39068 (comment)
I don't think this can simplify the client-side implementation, but it does save one proto message. At least we should document this in the proto message LambdaFunction: the function body should use UnresolvedAttribute as arguments to build the query plan.
done
Force-pushed from 568458f to ab49f07
Merged to master.

thank you all for the reviews!
…bda functions

### What changes were proposed in this pull request?
1. #39068 reused the `UnresolvedAttribute` for the `UnresolvedNamedLambdaVariable`, but then `Column('x')` and `UnresolvedNamedLambdaVariable('x')` are mixed up in `lambda x: x + cdf.x` (since we use `x/y/z` as argument names); this PR adds the `UnresolvedNamedLambdaVariable` back to distinguish between `Column('x')` and `UnresolvedNamedLambdaVariable('x')`.
2. The `freshVarName` logic in PySpark was added in #32523 to address a similar issue in PySpark's lambda functions; this PR adds a similar function in the Python client to avoid rewriting the function expression on the server side, which is unnecessary and error-prone.

### Why are the changes needed?
Before this PR, nested lambda functions did not work properly.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Enabled UT and added UT.

Closes #39619 from zhengruifeng/connect_fix_nested_lambda.
Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
### What changes were proposed in this pull request?
There are 11 lambda functions; this PR adds the basic support for `LambdaFunction` and adds the `exists` function.

### Why are the changes needed?
For API coverage.

### Does this PR introduce any user-facing change?
Yes, new API.

### How was this patch tested?
Added UT.