[SPARK-37504][PYTHON] Pyspark create SparkSession with existed session should not pass static conf #34757
Conversation
```python
    jsparkSession = self._jvm.SparkSession.getDefaultSession().get()
    self._jvm.SparkSession.applyModifiableSettings(jsparkSession, options)
else:
    jsparkSession = self._jvm.SparkSession(self._jsc.sc(), options)
```
Shall we add a short comment here noting that this is the case where static configurations can still be set?
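For context, the suggestion amounts to something like the following; the inline comment wording is illustrative, not the exact text that was committed:

```python
# Illustrative sketch only -- the comment wording below is an assumption.
else:
    # No existing (default) JVM session to reuse, so a brand-new session is
    # created here and static configurations in `options` can still take effect.
    jsparkSession = self._jvm.SparkSession(self._jsc.sc(), options)
```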
Co-authored-by: Hyukjin Kwon <[email protected]>
HyukjinKwon left a comment:
Looks good otherwise
@AngersZhuuuu some tests look like they are failing. Mind taking a look, please?
Seems …
You can add `getattr(getattr(session._jvm, "SparkSession$"), "MODULE$").applyModifiableSettings(...)`. You can check how to access this in Java first, FWIW.
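For readers unfamiliar with this Py4J pattern: a Scala `object` compiles to a class whose name ends in `$`, with the singleton instance held in a static `MODULE$` field, so methods defined on the `SparkSession` companion object have to be reached through those names. A minimal sketch (the session handle and options passed to `applyModifiableSettings` are placeholders, not the exact arguments used in `session.py`):

```python
# Hedged sketch: calling a method defined on Scala's `object SparkSession`
# from PySpark via Py4J. Plain `jvm.SparkSession` resolves the class, so the
# companion object must be reached via the `SparkSession$` class name and its
# static `MODULE$` singleton field.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
jvm = spark._jvm

companion = getattr(getattr(jvm, "SparkSession$"), "MODULE$")

# Placeholder call for illustration -- `jsession` and `joptions` stand in for
# the Java session handle and the options map used in pyspark/sql/session.py.
# companion.applyModifiableSettings(jsession, joptions)
print(companion)  # JavaObject wrapping the SparkSession companion object
```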
Thanks for your help. New knowledge for me.
Merged to master.
```python
    and not self._jvm.SparkSession.getDefaultSession().get().sparkContext().isStopped()
):
    jsparkSession = self._jvm.SparkSession.getDefaultSession().get()
    getattr(getattr(self._jvm, "SparkSession$"), "MODULE$").applyModifiableSettings(
```
@AngersZhuuuu, this actually shows a lot of new warnings (see also #34893). Another reproducer:
```bash
./bin/pyspark --conf spark.executor.memory=8g --conf spark.driver.memory=8g
```
```python
>>> from pyspark.sql.functions import udf
>>> udf(lambda x: x)("a")
21/12/15 14:03:15 WARN SparkSession: Using an existing SparkSession; the static sql configurations will not take effect.
Column<'<lambda>(a)'>
```

There are more places to fix like this:
```
ml/util.py: self._sparkSession = SparkSession.builder.getOrCreate()
sql/column.py: spark = SparkSession.builder.getOrCreate()
sql/context.py: sparkSession = SparkSession.builder.getOrCreate()
sql/readwriter.py: spark = SparkSession.builder.getOrCreate()
sql/readwriter.py: spark = SparkSession.builder.getOrCreate()
sql/session.py: return SparkSession.builder.getOrCreate()
sql/session.py: return SparkSession.builder.getOrCreate()
sql/streaming.py: spark = SparkSession.builder.getOrCreate()
sql/streaming.py: spark = SparkSession.builder.getOrCreate()
sql/udf.py: spark = SparkSession.builder.getOrCreate()
```
If we can't make it in Spark 3.3, I think it may just be safer to revert #34757, #34732 and #34559 for now, because each patch here introduces either:
- Unexpected propagation of static SQL configurations, or
- Too many warnings

Separately, I still feel 8424f55 is inefficient. We don't know which configurations don't take effect, or which configuration it keeps complaining about (see the example above). We should probably at least print out the keys or lower the log level.
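As an aside, and not part of any patch in this thread: until the log message is improved, one way to see which requested configurations actually took effect is to compare them against the live session. A rough sketch with made-up configuration keys and values:

```python
# Rough illustration (the keys/values here are made up): compare what was
# requested against what the existing session actually reports, since the
# current warning does not say which keys were ignored.
from pyspark.sql import SparkSession

requested = {
    "spark.sql.warehouse.dir": "/tmp/my-warehouse",   # static SQL conf
    "spark.sql.shuffle.partitions": "16",             # runtime SQL conf
}

builder = SparkSession.builder
for key, value in requested.items():
    builder = builder.config(key, value)
spark = builder.getOrCreate()

for key, value in requested.items():
    actual = spark.conf.get(key, None)
    if actual != value:
        print(f"{key} did not take effect (requested {value!r}, actual {actual!r})")
```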
@xinrong-databricks actually this is mostly Python-side code. Are you interested in creating a follow-up?
> We should probably at least print out the keys or lower the log level.

+1
Thank you for the heads-up, @HyukjinKwon.
Certainly, I will fix it and keep you updated. Thanks!
Made a PR at #35001 👍
…y set in SparkSession.builder.getOrCreate
### What changes were proposed in this pull request?
This PR proposes to show ignored configurations and hide the warnings for configurations that are already set when invoking `SparkSession.builder.getOrCreate`.
### Why are the changes needed?
Currently, `SparkSession.builder.getOrCreate()` is too noisy even when duplicate configurations are set. Users cannot easily tell which configurations need to be fixed. See the example below:
```bash
./bin/spark-shell --conf spark.abc=abc
```
```scala
import org.apache.spark.sql.SparkSession
spark.sparkContext.setLogLevel("DEBUG")
SparkSession.builder.config("spark.abc", "abc").getOrCreate
```
```
21:04:01.670 [main] WARN org.apache.spark.sql.SparkSession - Using an existing SparkSession; some spark core configurations may not take effect.
```
It is straightforward when there are only a few configurations, but it is difficult for users to figure out which ones are affected when there are many, especially when these configurations are defined in a property file such as 'spark-defaults.conf' maintained separately by system admins in production.
See also #34757 (comment).
### Does this PR introduce _any_ user-facing change?
Yes.
1. Show ignored configurations in debug level logs:
```bash
./bin/spark-shell --conf spark.abc=abc
```
```scala
import org.apache.spark.sql.SparkSession
spark.sparkContext.setLogLevel("DEBUG")
SparkSession.builder
.config("spark.sql.warehouse.dir", "2")
.config("spark.abc", "abcb")
.config("spark.abcd", "abcb4")
.getOrCreate
```
**Before:**
```
21:13:28.360 [main] WARN org.apache.spark.sql.SparkSession - Using an existing SparkSession; the static sql configurations will not take effect.
21:13:28.360 [main] WARN org.apache.spark.sql.SparkSession - Using an existing SparkSession; some spark core configurations may not take effect.
```
**After**:
```
20:34:30.619 [main] WARN org.apache.spark.sql.SparkSession - Using an existing Spark session; only runtime SQL configurations will take effect.
20:34:30.622 [main] DEBUG org.apache.spark.sql.SparkSession - Ignored static SQL configurations:
spark.sql.warehouse.dir=2
20:34:30.623 [main] DEBUG org.apache.spark.sql.SparkSession - Configurations that might not take effect:
spark.abcd=abcb4
spark.abc=abcb
```
2. Do not issue a warning for a configuration that was already explicitly set (with the same value) before, and hide it from the log:
```bash
./bin/spark-shell --conf spark.abc=abc
```
```scala
import org.apache.spark.sql.SparkSession
spark.sparkContext.setLogLevel("DEBUG")
SparkSession.builder.config("spark.abc", "abc").getOrCreate // **Ignore** warnings because it's already set in --conf
SparkSession.builder.config("spark.abc.new", "abc").getOrCreate // **Show** warnings for only configuration newly set.
SparkSession.builder.config("spark.abc.new", "abc").getOrCreate // **Ignore** warnings because it's set ^.
```
**Before**:
```
21:13:56.183 [main] WARN org.apache.spark.sql.SparkSession - Using an existing SparkSession; some spark core configurations may not take effect.
21:13:56.356 [main] WARN org.apache.spark.sql.SparkSession - Using an existing SparkSession; some spark core configurations may not take effect.
21:13:56.476 [main] WARN org.apache.spark.sql.SparkSession - Using an existing SparkSession; some spark core configurations may not take effect.
```
**After:**
```
20:36:36.251 [main] WARN org.apache.spark.sql.SparkSession - Using an existing Spark session; only runtime SQL configurations will take effect.
20:36:36.253 [main] DEBUG org.apache.spark.sql.SparkSession - Configurations that might not take effect:
spark.abc.new=abc
```
3. Do not issue a warning for runtime SQL configurations, and do not list them in the debug log either:
```bash
./bin/spark-shell
```
```scala
import org.apache.spark.sql.SparkSession
spark.sparkContext.setLogLevel("DEBUG")
SparkSession.builder.config("spark.sql.ansi.enabled", "true").getOrCreate // **Ignore** warnings for runtime SQL configurations
SparkSession.builder.config("spark.buffer.size", "1234").getOrCreate // **Show** warnings for Spark core configuration
SparkSession.builder.config("spark.sql.source.specific", "abc").getOrCreate // **Show** warnings for custom runtime options
SparkSession.builder.config("spark.sql.warehouse.dir", "xyz").getOrCreate // **Show** warnings for static SQL configurations
```
**Before**:
```
11:11:40.846 [main] WARN org.apache.spark.sql.SparkSession - Using an existing SparkSession; some spark core configurations may not take effect.
11:11:41.037 [main] WARN org.apache.spark.sql.SparkSession - Using an existing SparkSession; some spark core configurations may not take effect.
11:11:41.167 [main] WARN org.apache.spark.sql.SparkSession - Using an existing SparkSession; some spark core configurations may not take effect.
11:11:41.318 [main] WARN org.apache.spark.sql.SparkSession - Using an existing SparkSession; the static sql configurations will not take effect.
```
**After**:
```
10:39:54.870 [main] WARN org.apache.spark.sql.SparkSession - Using an existing Spark session; only runtime SQL configurations will take effect.
10:39:54.872 [main] DEBUG org.apache.spark.sql.SparkSession - Configurations that might not take effect:
spark.buffer.size=1234
10:39:54.988 [main] WARN org.apache.spark.sql.SparkSession - Using an existing Spark session; only runtime SQL configurations will take effect.
10:39:54.988 [main] DEBUG org.apache.spark.sql.SparkSession - Configurations that might not take effect:
spark.sql.source.specific=abc
10:39:55.107 [main] WARN org.apache.spark.sql.SparkSession - Using an existing Spark session; only runtime SQL configurations will take effect.
10:39:55.108 [main] DEBUG org.apache.spark.sql.SparkSession - Ignored static SQL configurations:
spark.sql.warehouse.dir=xyz
```
Note that there is no behaviour change on session state initialization when configurations are not set. For example:
```scala
import org.apache.spark.sql.SparkSession
spark.sparkContext.setLogLevel("DEBUG")
SparkSession.builder.getOrCreate
```
After this PR, however, session state initialization can be triggered when static SQL configurations are set; previously it was not. This does not introduce a user-facing change or a bug, but it is worth noting.
For runtime SQL configurations, the session state initialization in this code path was introduced at #15295.
### How was this patch tested?
It was manually tested as shown above.
Closes #35001 from HyukjinKwon/SPARK-37727.
Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
What changes were proposed in this pull request?
In current PySpark, the `getOrCreate` path in `python/pyspark/sql/session.py` passes all of the builder options to the created or existing SparkSession, whereas the Scala code path only applies non-static SQL configurations to an existing session.
This PR makes the PySpark behavior consistent with Scala.
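For reference, a condensed sketch of the existing-session branch after this change, assembled from the diff hunks quoted earlier in this thread; the surrounding control flow and names outside those hunks are assumptions rather than the exact source:

```python
# Condensed sketch of SparkSession.Builder.getOrCreate in
# python/pyspark/sql/session.py after this change (assembled from the diff
# hunks quoted above; surrounding details are assumptions).
default = self._jvm.SparkSession.getDefaultSession()
if default.isDefined() and not default.get().sparkContext().isStopped():
    # Reuse the existing JVM session; only modifiable (non-static) SQL
    # configurations are applied, matching the Scala getOrCreate behavior.
    jsparkSession = default.get()
    getattr(getattr(self._jvm, "SparkSession$"), "MODULE$").applyModifiableSettings(
        jsparkSession, options
    )
else:
    # No usable existing session: a new JVM session is created, so static
    # configurations in `options` can still take effect.
    jsparkSession = self._jvm.SparkSession(self._jsc.sc(), options)
```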
Why are the changes needed?
To keep behavior consistent between PySpark and Scala: when initializing a SparkSession and an existing session is present, only non-static SQL configurations should be overwritten.
Does this PR introduce any user-facing change?
Yes. Users can no longer overwrite static SQL configurations when using PySpark with an existing SparkSession.
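To illustrate the resulting behavior (a hedged sketch; the configuration keys and values below are chosen for demonstration and are not taken from the PR):

```python
# Illustrative sketch: with an existing SparkSession, getOrCreate now applies
# only runtime SQL configurations; static ones are ignored with a warning.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark2 = (
    SparkSession.builder
    .config("spark.sql.warehouse.dir", "/tmp/other-warehouse")  # static SQL conf: ignored
    .config("spark.sql.shuffle.partitions", "16")               # runtime SQL conf: applied
    .getOrCreate()
)

assert spark is spark2  # the existing session is reused
print(spark2.conf.get("spark.sql.shuffle.partitions"))  # 16
```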
How was this patch tested?
Modified UT.