Conversation

@HyukjinKwon
Member

What changes were proposed in this pull request?

This PR proposes to use an existing active Spark session instead of SparkSession.getOrCreate in pandas API on Spark.
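
A minimal sketch of the pattern being adopted (assuming PySpark 3.0+, where SparkSession.getActiveSession is available; the actual change lives in pandas-on-Spark's session utilities):

from pyspark.sql import SparkSession

# Prefer the session already active on the current thread; fall back to
# creating one only when none exists, so user-set configurations are kept.
spark = SparkSession.getActiveSession()
if spark is None:
    spark = SparkSession.builder.getOrCreate()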

Why are the changes needed?

Because pandas API on Spark currently attempts to create a new session via SparkSession.getOrCreate even when a session is already active, it shows warnings about configurations not taking effect, as below:

>>> ps.range(10)
21/12/14 16:12:58 WARN SparkSession: Using an existing SparkSession; the static sql configurations will not take effect.
21/12/14 16:12:58 WARN SparkSession: Using an existing SparkSession; some spark core configurations may not take effect.
21/12/14 16:12:58 WARN SparkSession: Using an existing SparkSession; the static sql configurations will not take effect.
21/12/14 16:12:58 WARN SparkSession: Using an existing SparkSession; some spark core configurations may not take effect.
21/12/14 16:12:58 WARN SparkSession: Using an existing SparkSession; the static sql configurations will not take effect.
21/12/14 16:12:58 WARN SparkSession: Using an existing SparkSession; some spark core configurations may not take effect.
21/12/14 16:12:58 WARN SparkSession: Using an existing SparkSession; the static sql configurations will not take effect.
21/12/14 16:12:58 WARN SparkSession: Using an existing SparkSession; some spark core configurations may not take effect.
...
   id
0   0
1   1
2   2
3   3
4   4
5   5
6   6
7   7
8   8
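
For context on these warnings: when a session already exists, SparkSession.builder.getOrCreate returns it, and any static SQL configuration passed to the builder is ignored. A small illustration (spark.sql.warehouse.dir is used here only as an arbitrary static config):

from pyspark.sql import SparkSession

first = SparkSession.builder.getOrCreate()
# A session already exists, so getOrCreate returns it and logs the
# "Using an existing SparkSession" warnings; the static config is ignored.
second = SparkSession.builder.config("spark.sql.warehouse.dir", "/tmp/wh").getOrCreate()
assert first is second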

Does this PR introduce any user-facing change?

Yes. After this PR, it explicitly uses the active Spark session and does not show such warnings:

>>> import pyspark.pandas as ps
>>> ps.range(10)
...
   id
0   0
1   1
2   2
3   3

How was this patch tested?

Manually tested as below:

import pyspark.pandas as ps
ps.range(10)

@HyukjinKwon
Member Author

cc @xinrong-databricks and @ueshin FYI

@SparkQA

SparkQA commented Dec 14, 2021

Test build #146171 has finished for PR 34893 at commit 39acf99.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

builder = builder.config(key, value)
# Currently, pandas-on-Spark is dependent on such join due to 'compute.ops_on_diff_frames'
# configuration. This is needed with Spark 3.0+.
builder.config("spark.sql.analyzer.failAmbiguousSelfJoin", False)
Member Author

In fact, we fixed this bug in the master branch, so we don't need to set this anymore:

builder.config("spark.sql.analyzer.failAmbiguousSelfJoin", False)

if is_testing():
    builder.config("spark.executor.allowSparkContext", False)
Member Author

And this will be set separately when the SparkContext is created, not here.

Member Author

And in fact, this was set when we ran tests with pytest back when it was in the Koalas repo.

Member

Actually, this was added for our tests to check that our code doesn't create a SparkContext in executors. But we can remove it anyway because the default value is False now.
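
For context, a sketch of the behavior that test guard checks (hypothetical task shape; with spark.executor.allowSparkContext at its default of false, the commented line would fail):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

def bad_task(_):
    # This runs on an executor; creating a SparkContext there is disallowed
    # when spark.executor.allowSparkContext is false (the default).
    SparkContext.getOrCreate()

# sc.parallelize([1]).foreach(bad_task)  # would raise an error on the executor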

@SparkQA

SparkQA commented Dec 14, 2021

Test build #146180 has finished for PR 34893 at commit ce0cd8b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

-def default_session(conf: Optional[Dict[str, Any]] = None) -> SparkSession:
-    if conf is None:
-        conf = dict()
+def default_session() -> SparkSession:
Member

I'm not sure we can remove the conf argument here?
I guess we should show a deprecation warning if it's not None for now and remove it in the future?
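
A hedged sketch of that suggestion (not what was merged; the PR removes the argument outright), assuming FutureWarning for the deprecation:

import warnings
from typing import Any, Dict, Optional
from pyspark.sql import SparkSession

def default_session(conf: Optional[Dict[str, Any]] = None) -> SparkSession:
    if conf is not None:
        # Keep accepting `conf` for now, but warn that it no longer applies.
        warnings.warn(
            "The `conf` parameter is deprecated and will be removed; "
            "configure the active SparkSession directly instead.",
            FutureWarning,
        )
    spark = SparkSession.getActiveSession()
    return spark if spark is not None else SparkSession.builder.getOrCreate()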

Member Author

I think this isn't a documented API, so it should be fine.

Member

I'd just leave it to you.

Member

@ueshin ueshin left a comment

LGTM.

@HyukjinKwon
Member Author

Merged to master.
