Skip to content

Conversation

@AngersZhuuuu
Copy link
Contributor

@AngersZhuuuu AngersZhuuuu commented Nov 11, 2021

What changes were proposed in this pull request?

When use write pyspark script like

conf = SparkConf().setAppName("test")
sc = SparkContext(conf = conf)
session = SparkSession().build().enableHiveSupport().getOrCreate()

It will build a session without hive support since we use a existed SparkContext and we create SparkSession use

SparkSession(sc)

This cause we loss configuration added by config() such as catalog implement.

In scala class SparkSession, we create SparkSession with SparkContext and option configurations and will pass option configurations to SharedState then use SharedState's conf create SessionState, but in pyspark, we won't pass options configuration to SharedState, but pass to SessionState, but this time SessionState has been initialized. So it won't support hive.

In this pr, I pass option configurations to SharedState when first init SparkSession, then when init SessionState, this options will be passed to SessionState too.

Why are the changes needed?

Avoid loss configuration when build SparkSession in pyspark

Does this PR introduce any user-facing change?

No

How was this patch tested?

Manuel tested & added UT

@AngersZhuuuu
Copy link
Contributor Author

ping @dongjoon-hyun @HyukjinKwon

@SparkQA
Copy link

SparkQA commented Nov 11, 2021

Test build #145114 has finished for PR 34559 at commit 54e3a0c.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 11, 2021

@SparkQA
Copy link

SparkQA commented Nov 11, 2021

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PySpark GitHub Action jobs seem to complain. Could you take a look at that, @AngersZhuuuu ?

@AngersZhuuuu
Copy link
Contributor Author

AngersZhuuuu commented Nov 12, 2021

PySpark GitHub Action jobs seem to complain. Could you take a look at that, @AngersZhuuuu ?

mis write sharedState ==. Also update the pr desc make it more clear

@SparkQA
Copy link

SparkQA commented Nov 12, 2021

Test build #145129 has finished for PR 34559 at commit 1518b2d.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 12, 2021

@SparkQA
Copy link

SparkQA commented Nov 12, 2021

Test build #145131 has finished for PR 34559 at commit b5d995f.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 12, 2021

Test build #145134 has finished for PR 34559 at commit 05c65a4.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 12, 2021

Test build #145136 has finished for PR 34559 at commit e599f57.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 12, 2021

@SparkQA
Copy link

SparkQA commented Nov 12, 2021

@AngersZhuuuu AngersZhuuuu marked this pull request as draft November 12, 2021 04:17
@SparkQA
Copy link

SparkQA commented Nov 12, 2021

@SparkQA
Copy link

SparkQA commented Nov 12, 2021

Test build #145140 has finished for PR 34559 at commit ee3e2b3.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AngersZhuuuu AngersZhuuuu marked this pull request as ready for review November 12, 2021 04:25
@SparkQA
Copy link

SparkQA commented Nov 12, 2021

Test build #145141 has finished for PR 34559 at commit a71c903.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 12, 2021

@SparkQA
Copy link

SparkQA commented Nov 12, 2021

Test build #145143 has finished for PR 34559 at commit 7f1e668.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 12, 2021

@SparkQA
Copy link

SparkQA commented Nov 12, 2021

@SparkQA
Copy link

SparkQA commented Nov 12, 2021

@AngersZhuuuu
Copy link
Contributor Author

@dongjoon-hyun tested a lot since not familiar with python code. Now I think can be reviewed

@dongjoon-hyun
Copy link
Member

Thank you for updates, @AngersZhuuuu .

@SparkQA
Copy link

SparkQA commented Nov 12, 2021

@SparkQA
Copy link

SparkQA commented Nov 12, 2021

@SparkQA
Copy link

SparkQA commented Nov 12, 2021

@SparkQA
Copy link

SparkQA commented Nov 12, 2021

@SparkQA
Copy link

SparkQA commented Nov 12, 2021

Test build #145148 has finished for PR 34559 at commit aafb6a1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 12, 2021

@SparkQA
Copy link

SparkQA commented Nov 12, 2021

@SparkQA
Copy link

SparkQA commented Nov 12, 2021

Test build #145172 has finished for PR 34559 at commit 38b96bf.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AngersZhuuuu
Copy link
Contributor Author

gentle ping @dongjoon-hyun @HyukjinKwon Unit test added to confirm this change.

@SparkQA
Copy link

SparkQA commented Nov 12, 2021

Test build #145174 has finished for PR 34559 at commit 1185d1f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 12, 2021

@SparkQA
Copy link

SparkQA commented Nov 12, 2021

sparkContext: SparkContext,
jsparkSession: Optional[JavaObject] = None,
options: Optional[Dict[str, Any]] = None,
):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be not a breaking change in Python, right? How do you think about this, @HyukjinKwon ?

Copy link
Member

@HyukjinKwon HyukjinKwon Nov 14, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I don't think this is breaking. Let me double check closely by tmr EOD but from a cursory look it seems fine.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-37291][SQL][PYSPARK] PySpark init SparkSession should copy conf to sharedState [SPARK-37291][SQL][PYTHON] PySpark init SparkSession should copy conf to sharedState Nov 13, 2021
@dongjoon-hyun dongjoon-hyun changed the title [SPARK-37291][SQL][PYTHON] PySpark init SparkSession should copy conf to sharedState [SPARK-37291][PYTHON] PySpark init SparkSession should copy conf to sharedState Nov 13, 2021
@dongjoon-hyun dongjoon-hyun changed the title [SPARK-37291][PYTHON] PySpark init SparkSession should copy conf to sharedState [SPARK-37291][PYTHON][SQL] PySpark init SparkSession should copy conf to sharedState Nov 13, 2021
Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. Merged to master.

):
jsparkSession = self._jvm.SparkSession.getDefaultSession().get()
else:
jsparkSession = self._jvm.SparkSession(self._jsc.sc())
Copy link
Member

@HyukjinKwon HyukjinKwon Nov 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with a couple of nits: @AngersZhuuuu,

  • Can we actually leverage existing constructor on SparkSession to pass the initial options instead of setting it manually? Here unlike Scala, it initiates sharedState always. I think it's best to keep the code path matched.
  • Another nit is that: It's always preferred to use less Py4J connections which exposes potential flakiness.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we actually leverage existing constructor on SparkSession to pass the initial options instead of setting it manually? Here unlike Scala, it initiates sharedState always. I think it's best to keep the code path matched.

Yea, will try this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Create a new one or a followup?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, let's create a followup PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants