[SPARK-37291][PYTHON][SQL] PySpark init SparkSession should copy conf to sharedState #34559
Conversation
dongjoon-hyun left a comment:
PySpark GitHub Action jobs seem to complain. Could you take a look at that, @AngersZhuuuu ?
I mis-wrote `sharedState ==`; fixed. Also updated the PR description to make it clearer.
@dongjoon-hyun I tested it a lot since I'm not familiar with the Python code. Now I think it can be reviewed.
Thank you for the updates, @AngersZhuuuu.
Gentle ping @dongjoon-hyun @HyukjinKwon. A unit test has been added to confirm this change.
```python
    sparkContext: SparkContext,
    jsparkSession: Optional[JavaObject] = None,
    options: Optional[Dict[str, Any]] = None,
):
```
This doesn't seem to be a breaking change in Python, right? What do you think, @HyukjinKwon?
Yeah, I don't think this is breaking. Let me double-check closely by tomorrow EOD, but from a cursory look it seems fine.
dongjoon-hyun left a comment:
+1, LGTM. Merged to master.
```python
    ):
        jsparkSession = self._jvm.SparkSession.getDefaultSession().get()
    else:
        jsparkSession = self._jvm.SparkSession(self._jsc.sc())
```
LGTM with a couple of nits, @AngersZhuuuu:

- Can we actually leverage the existing constructor on `SparkSession` to pass the initial options instead of setting them manually? Here, unlike Scala, it always initiates `sharedState`. I think it's best to keep the code paths matched (see the sketch after this list).
- Another nit: it's always preferred to use fewer Py4J calls, since each one exposes potential flakiness.
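(A minimal sketch of the first nit, assuming a Scala-side `SparkSession` constructor overload that accepts the initial options, and that Py4J's dict auto-conversion applies to `options`:)

```python
# One Py4J call instead of a per-key conf().set(...) loop, which also
# addresses the second nit about minimizing Py4J round-trips.
jsparkSession = self._jvm.SparkSession(self._jsc.sc(), options)
```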
> Can we actually leverage the existing constructor on `SparkSession` to pass the initial options instead of setting them manually? Here, unlike Scala, it initiates `sharedState` always. I think it's best to keep the code paths matched.
Yea, will try this.
Thanks!
Create a new one or a followup?
Yup, let's create a followup PR.
What changes were proposed in this pull request?
When a PySpark script is written like the sketch below, it builds a session without Hive support: an existing SparkContext is reused, and the SparkSession is created from it without the builder's options.
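A minimal sketch of such a script (assumed for illustration; `enableHiveSupport()` stands in for any builder `config()` call, e.g. setting the catalog implementation):

```python
from pyspark import SparkContext
from pyspark.sql import SparkSession

sc = SparkContext()  # a SparkContext is created first

# The builder reuses the existing SparkContext; before this fix the
# option below was applied to an already-initialized SessionState
# instead of SharedState, so Hive support never took effect.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
```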
This causes us to lose configuration added via `config()`, such as the catalog implementation.

In the Scala `SparkSession` class, a `SparkSession` is created with a `SparkContext` plus the option configurations; the options are passed to `SharedState`, and `SharedState`'s conf is then used to create the `SessionState`. In PySpark, however, the options are not passed to `SharedState` but to the `SessionState`, and by that time the `SessionState` has already been initialized, so it won't support Hive.

In this PR, the option configurations are passed to `SharedState` when the `SparkSession` is first initialized; then, when the `SessionState` is initialized, these options are passed to it as well (see the sketch below).
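A hedged sketch of that change inside PySpark's `SparkSession.__init__`, assuming the options are copied onto `SharedState`'s conf before the `SessionState` is created (the branch structure follows the diff excerpt above; the exact merged code may differ):

```python
# Sketch only, not the exact merged code.
if jsparkSession is None:
    if self._jvm.SparkSession.getDefaultSession().isDefined():
        # Reuse the JVM-side default session if one already exists.
        jsparkSession = self._jvm.SparkSession.getDefaultSession().get()
    else:
        jsparkSession = self._jvm.SparkSession(self._jsc.sc())
        # Copy the builder options into SharedState's conf *before*
        # SessionState is created, so the SessionState is built with
        # them (SharedState.conf as a JVM SparkConf is an assumption).
        if options is not None:
            for key, value in options.items():
                jsparkSession.sharedState().conf().set(key, str(value))
```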
Why are the changes needed?
Avoid losing configuration when building a SparkSession in PySpark.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Manually tested and added a unit test.