[SPARK-37291][PYTHON][SQL] PySpark init SparkSession should copy conf to sharedState #34559
Conversation
dongjoon-hyun left a comment:
PySpark GitHub Action jobs seem to complain. Could you take a look at that, @AngersZhuuuu ?
I mis-wrote `sharedState ==`; fixed. Also updated the PR description to make it clearer.
@dongjoon-hyun I tested it a lot since I'm not familiar with the Python code. Now I think it can be reviewed.
Thank you for the updates, @AngersZhuuuu.
Gentle ping @dongjoon-hyun @HyukjinKwon. A unit test has been added to confirm this change.
```python
    sparkContext: SparkContext,
    jsparkSession: Optional[JavaObject] = None,
    options: Optional[Dict[str, Any]] = None,
):
```
This doesn't seem to be a breaking change in Python, right? What do you think, @HyukjinKwon?
Yeah, I don't think this is breaking. Let me double-check closely by tomorrow EOD, but from a cursory look it seems fine.
dongjoon-hyun left a comment:
+1, LGTM. Merged to master.
```python
    ):
        jsparkSession = self._jvm.SparkSession.getDefaultSession().get()
    else:
        jsparkSession = self._jvm.SparkSession(self._jsc.sc())
```
LGTM with a couple of nits, @AngersZhuuuu:

- Can we actually leverage the existing constructor on `SparkSession` to pass the initial options instead of setting them manually? Here, unlike Scala, it always initiates `sharedState`. I think it's best to keep the code paths matched (see the sketch after this list).
- Another nit: it's always preferred to use fewer Py4J calls, since each one exposes potential flakiness.
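(A minimal sketch of the first nit, assuming a Scala-side `SparkSession` constructor overload that accepts the initial options, and that Py4J's dict auto-conversion applies to `options`:)

```python
# One Py4J call instead of a per-key conf().set(...) loop, which also
# addresses the second nit about minimizing Py4J round-trips.
jsparkSession = self._jvm.SparkSession(self._jsc.sc(), options)
```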
> Can we actually leverage the existing constructor on `SparkSession` to pass the initial options instead of setting them manually? Here, unlike Scala, it initiates `sharedState` always. I think it's best to keep the code paths matched.
Yea, will try this.
Thanks!
Create a new one or a followup?
Yup, let's create a followup PR.
What changes were proposed in this pull request?
When a PySpark script is written like the sketch below, it builds a session without Hive support: an existing SparkContext is reused, and the SparkSession is created from it without the builder's options.
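A minimal sketch of such a script (assumed for illustration; `enableHiveSupport()` stands in for any builder `config()` call, e.g. setting the catalog implementation):

```python
from pyspark import SparkContext
from pyspark.sql import SparkSession

sc = SparkContext()  # a SparkContext is created first

# The builder reuses the existing SparkContext; before this fix the
# option below was applied to an already-initialized SessionState
# instead of SharedState, so Hive support never took effect.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
```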
This causes us to lose configuration added via `config()`, such as the catalog implementation.

In the Scala `SparkSession` class, a `SparkSession` is created with a `SparkContext` plus the option configurations; the options are passed to `SharedState`, and `SharedState`'s conf is then used to create the `SessionState`. In PySpark, however, the options are not passed to `SharedState` but to the `SessionState`, and by that time the `SessionState` has already been initialized, so it won't support Hive.

In this PR, the option configurations are passed to `SharedState` when the `SparkSession` is first initialized; then, when the `SessionState` is initialized, these options are passed to it as well (see the sketch below).
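A hedged sketch of that change inside PySpark's `SparkSession.__init__`, assuming the options are copied onto `SharedState`'s conf before the `SessionState` is created (the branch structure follows the diff excerpt above; the exact merged code may differ):

```python
# Sketch only, not the exact merged code.
if jsparkSession is None:
    if self._jvm.SparkSession.getDefaultSession().isDefined():
        # Reuse the JVM-side default session if one already exists.
        jsparkSession = self._jvm.SparkSession.getDefaultSession().get()
    else:
        jsparkSession = self._jvm.SparkSession(self._jsc.sc())
        # Copy the builder options into SharedState's conf *before*
        # SessionState is created, so the SessionState is built with
        # them (SharedState.conf as a JVM SparkConf is an assumption).
        if options is not None:
            for key, value in options.items():
                jsparkSession.sharedState().conf().set(key, str(value))
```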
Why are the changes needed?
Avoid losing configuration when building a SparkSession in PySpark.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Manually tested and added a unit test.