Conversation

@HyukjinKwon
Member

What changes were proposed in this pull request?

This PR proposes to use an existing active Spark session instead of SparkSession.getOrCreate in pandas API on Spark.
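
A minimal sketch of the pattern being adopted (assuming PySpark 3.0+, where SparkSession.getActiveSession is available; the actual change lives in pandas-on-Spark's session utilities):

from pyspark.sql import SparkSession

# Prefer the session already active on the current thread; fall back to
# creating one only when none exists, so user-set configurations are kept.
spark = SparkSession.getActiveSession()
if spark is None:
    spark = SparkSession.builder.getOrCreate()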

Why are the changes needed?

Because pandas API on Spark currently attempts to create a new session via SparkSession.getOrCreate even when a session is already active, it shows warnings about configurations not taking effect, as below:

>>> ps.range(10)
21/12/14 16:12:58 WARN SparkSession: Using an existing SparkSession; the static sql configurations will not take effect.
21/12/14 16:12:58 WARN SparkSession: Using an existing SparkSession; some spark core configurations may not take effect.
21/12/14 16:12:58 WARN SparkSession: Using an existing SparkSession; the static sql configurations will not take effect.
21/12/14 16:12:58 WARN SparkSession: Using an existing SparkSession; some spark core configurations may not take effect.
21/12/14 16:12:58 WARN SparkSession: Using an existing SparkSession; the static sql configurations will not take effect.
21/12/14 16:12:58 WARN SparkSession: Using an existing SparkSession; some spark core configurations may not take effect.
21/12/14 16:12:58 WARN SparkSession: Using an existing SparkSession; the static sql configurations will not take effect.
21/12/14 16:12:58 WARN SparkSession: Using an existing SparkSession; some spark core configurations may not take effect.
...
   id
0   0
1   1
2   2
3   3
4   4
5   5
6   6
7   7
8   8
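
For context on these warnings: when a session already exists, SparkSession.builder.getOrCreate returns it, and any static SQL configuration passed to the builder is ignored. A small illustration (spark.sql.warehouse.dir is used here only as an arbitrary static config):

from pyspark.sql import SparkSession

first = SparkSession.builder.getOrCreate()
# A session already exists, so getOrCreate returns it and logs the
# "Using an existing SparkSession" warnings; the static config is ignored.
second = SparkSession.builder.config("spark.sql.warehouse.dir", "/tmp/wh").getOrCreate()
assert first is second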

Does this PR introduce any user-facing change?

Yes. After this PR, it explicitly uses the active Spark session and does not show such warnings:

>>> import pyspark.pandas as ps
>>> ps.range(10)
...
   id
0   0
1   1
2   2
3   3

How was this patch tested?

Manually tested as below:

import pyspark.pandas as ps
ps.range(10)

@HyukjinKwon
Member Author

cc @xinrong-databricks and @ueshin FYI

@SparkQA

SparkQA commented Dec 14, 2021

Test build #146171 has finished for PR 34893 at commit 39acf99.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

builder = builder.config(key, value)
# Currently, pandas-on-Spark is dependent on such join due to 'compute.ops_on_diff_frames'
# configuration. This is needed with Spark 3.0+.
builder.config("spark.sql.analyzer.failAmbiguousSelfJoin", False)
Member Author

In fact, we fixed this bug in the master branch, so we don't need to set this anymore:

builder.config("spark.sql.analyzer.failAmbiguousSelfJoin", False)

if is_testing():
    builder.config("spark.executor.allowSparkContext", False)
Member Author

And this will be set separately when the SparkContext is created, not here.

Member Author

And in fact, this was set when we ran tests with pytest back when it was in the Koalas repo.

Member

Actually, this was added for our tests to check that our code doesn't create a SparkContext in executors. But we can remove it anyway because the default value is False now.
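
For context, a sketch of the behavior that test guard checks (hypothetical task shape; with spark.executor.allowSparkContext at its default of false, the commented line would fail):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

def bad_task(_):
    # This runs on an executor; creating a SparkContext there is disallowed
    # when spark.executor.allowSparkContext is false (the default).
    SparkContext.getOrCreate()

# sc.parallelize([1]).foreach(bad_task)  # would raise an error on the executor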

@SparkQA

SparkQA commented Dec 14, 2021

Test build #146180 has finished for PR 34893 at commit ce0cd8b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

-def default_session(conf: Optional[Dict[str, Any]] = None) -> SparkSession:
-    if conf is None:
-        conf = dict()
+def default_session() -> SparkSession:
Member

I'm not sure we can remove the conf argument here?
I guess we should show a deprecation warning if it's not None for now and remove it in the future?
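
A hedged sketch of that suggestion (not what was merged; the PR removes the argument outright), assuming FutureWarning for the deprecation:

import warnings
from typing import Any, Dict, Optional
from pyspark.sql import SparkSession

def default_session(conf: Optional[Dict[str, Any]] = None) -> SparkSession:
    if conf is not None:
        # Keep accepting `conf` for now, but warn that it no longer applies.
        warnings.warn(
            "The `conf` parameter is deprecated and will be removed; "
            "configure the active SparkSession directly instead.",
            FutureWarning,
        )
    spark = SparkSession.getActiveSession()
    return spark if spark is not None else SparkSession.builder.getOrCreate()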

Member Author

I think this isn't a documented API, so it should be fine.

Member

I'd just leave it to you.

Member

@ueshin ueshin left a comment

LGTM.

@HyukjinKwon
Member Author

Merged to master.
