Conversation

@karuppayya (Contributor)

What changes were proposed in this pull request?

This change enables shuffle cleanup mode configuration in regular Spark SQL execution.

Why are the changes needed?

Currently, the shuffle cleanup mode configuration only works in Spark Connect but is ignored in regular SQL execution.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added unit tests.

Was this patch authored or co-authored using generative AI tooling?

No

@HyukjinKwon HyukjinKwon changed the title [SPARK-52777] Enable shuffle cleanup mode configuration in Spark SQL [SPARK-52777][SQL] Enable shuffle cleanup mode configuration in Spark SQL Jul 14, 2025
@karuppayya (Contributor Author)

@dongjoon-hyun @sunchao Could you help review this change?

@dongjoon-hyun (Member) left a comment

Could you make the CI happy, please, @karuppayya?

cc @peter-toth

@karuppayya (Contributor Author) commented Jul 17, 2025

[info] - SPARK-33551: Do not use AQE shuffle read for repartition *** FAILED *** (78 milliseconds)
[info]   scala.Predef.refArrayOps[org.apache.spark.Partition](parts).forall(((x$1: org.apache.spark.Partition) => rdd.preferredLocations(x$1).nonEmpty)) was false (AdaptiveQueryExecSuite.scala:205)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
[info]   at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
[info]   at org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
[info]   at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
[info]   at org.apache.spark.sql.execution.adaptive.AdaptiveQueryExecSuite.$anonfun$checkNumLocalShuffleReads$1(AdaptiveQueryExecSuite.scala:205)
[info]   at scala.collection.immutable.List.foreach(List.scala:334)
[info]   at org.apache.spark.sql.execution.adaptive.AdaptiveQueryExecSuite.checkNumLocalShuffleReads(AdaptiveQueryExecSuite.scala:202)
[info]   at org.apache.spark.sql.execution.adaptive.AdaptiveQueryExecSuite.checkBHJ$1(AdaptiveQueryExecSuite.scala:1828)
[info]   at org.apache.spark.sql.execution.adaptive.AdaptiveQueryExecSuite.$anonfun$new$191(AdaptiveQueryExecSuite.scala:1888)
[info]   at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
[info]   at org.apache.spark.sql.catalyst.SQLConfHelper.withSQLConf(SQLConfHelper.scala:56)
[info]   at org.apache.spark.sql.catalyst.SQLConfHelper.withSQLConf$(SQLConfHelper.scala:38)
[info]   at org.apache.spark.sql.execution.adaptive.AdaptiveQueryExecSuite.org$apache$spark$sql$test$SQLTestUtilsBase$$super$withSQLConf(AdaptiveQueryExecSuite.scala:60)
[info]   at org.apache.spark.sql.test.SQLTestUtilsBase.withSQLConf(SQLTestUtils.scala:253)
[info]   at org.apache.spark.sql.test.SQLTestUtilsBase.withSQLConf$(SQLTestUtils.scala:251)
[info]   at org.apache.spark.sql.execution.adaptive.AdaptiveQueryExecSuite.withSQLConf(AdaptiveQueryExecSuite.scala:60)
[info]   at org.apache.spark.sql.execution.adaptive.AdaptiveQueryExecSuite.$anonfun$new$190(AdaptiveQueryExecSuite.scala:1884)

The failure seems unrelated (and the test runs fine locally), but let me rebase onto master and retrigger the CI.

@karuppayya (Contributor Author)

cc @bozhang2820 and @cloud-fan, since this was introduced in #45930

@karuppayya (Contributor Author) commented Jul 18, 2025

The test failure was from my change.
It looks like the default for SHUFFLE_DEPENDENCY_FILE_CLEANUP_ENABLED and SHUFFLE_DEPENDENCY_SKIP_MIGRATION_ENABLED is Utils.isTesting,
which was forcing cleanup before the test assertions ran.
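
A minimal sketch of how a suite could pin these flags off so that shuffle files survive until the assertions run (conf names as discussed in this thread; withSQLConf is the test helper visible in the stack trace above):

    // Sketch: keep shuffle files around for the scope of a test so that
    // assertions can inspect shuffle state before any eager cleanup runs.
    withSQLConf(
      SQLConf.SHUFFLE_DEPENDENCY_FILE_CLEANUP_ENABLED.key -> "false",
      SQLConf.SHUFFLE_DEPENDENCY_SKIP_MIGRATION_ENABLED.key -> "false") {
      // ... run the query and assert on shuffle reads/locations here ...
    }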

@karuppayya (Contributor Author) commented Jul 21, 2025

As of today, my change affects only tests that have adaptive execution enabled. The following test failed in CI (specific to adaptive):

SPARK-35695: get observable metrics with adaptive execution by callback

But I have another PR that does the shuffle cleanup for non-adaptive paths.
So the conf has to be set at the top level, not specifically for adaptive execution, which is why I added it to beforeAll.

Why did the test fail?
org.apache.spark.sql.util.DataFrameCallbackSuite#validateObservedMetrics calls collect() on the same df twice.
With shuffle cleanup (enabled by default in tests), this results in recomputation, which double-counts the metrics (via CollectMetricsExec), and the assertions fail.
(This is potentially an issue with org.apache.spark.sql.execution.CollectMetricsExec#collectedMetrics? The accumulator needs to be reset in this case; otherwise the metrics will be inconsistent depending on the expression used (e.g. sum). cc @dongjoon-hyun for your thoughts, which I can take up in a subsequent PR if needed.)
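
To make the failure mode concrete, a hedged repro sketch (values illustrative; assumes the observed metrics are read back through the usual listener path):

    // Sketch: collect() twice on an observed DataFrame whose plan contains
    // a shuffle. If the shuffle files are cleaned up after the first
    // collect(), the second collect() recomputes the stages below the
    // shuffle, re-running CollectMetricsExec and doubling the accumulator.
    import org.apache.spark.sql.functions.sum
    import spark.implicits._

    val observed = spark.range(100)
      .observe("metrics", sum($"id").as("total")) // CollectMetricsExec node
      .repartition(4)                             // introduces a shuffle
    observed.collect() // first run: total = 4950
    observed.collect() // after eager cleanup: recompute, total reported as 9900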

@violet-nspct

Should we add unit tests to cover the following scenarios for the handlePlan method in SparkConnectPlanExecution.scala?

  // Execution Paths
  test("execution paths properly handle shuffle cleanup modes") {
    // Tests different execution paths with cleanup modes
    // Covers: SparkSession.execute scenarios
  }

  // Query Interruption
  test("query interruption handles cleanup modes correctly") {
    // Tests interruption behavior
    // Covers: Query cancellation and cleanup
  }

  // Configuration
  test("handlePlan should use correct shuffle cleanup mode from configuration") {
    // Tests configuration combinations
    // Covers: Configuration handling
  }

  // Consistency
  test("shuffle cleanup mode should be consistent between Spark Connect and regular SQL") {
    // Tests consistency across execution methods
    // Covers: Cross-execution consistency
  }

  // Long-running queries with configuration changes
  test("long running queries handle configuration changes correctly") {
    // Tests long-running queries with dynamic configuration
    // Covers: Both runtime and configuration aspects
  }

  // Edge case handling
  test("handles edge cases with different cleanup modes") {
    // Tests various edge cases from both plans
    // Covers: Comprehensive edge case handling
  }

@karuppayya (Contributor Author)

@violet-nspct the shuffle cleanup modes were added in #45930; this is a change to make them configurable for SQL query execution.

A contributor commented:

We should access the conf from the passed-in sparkSession. One idea:

    ...
    val specifiedShuffleCleanupMode: Option[ShuffleCleanupMode] = None) extends Logging {
  ...
  // Resolve lazily: fall back to the session conf when no mode was specified.
  def shuffleCleanupMode = specifiedShuffleCleanupMode.getOrElse(
    determineShuffleCleanupMode(sparkSession.sessionState.conf))
}

@cloud-fan (Contributor)

The reason why the config applies to Spark Connect only: with classic Spark SQL, users can hold a DataFrame instance forever, and we can never clean up the shuffle because the df instance still needs to read it when executed.
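
To make the hazard concrete, a minimal sketch (illustrative code, not from the PR):

    // Sketch: in classic SQL, a long-lived DataFrame instance re-reads its
    // shuffle output every time it is executed.
    val df = spark.range(1000).repartition(8).selectExpr("sum(id) AS s")
    df.collect() // shuffle files written, then read
    // ... the user holds on to `df`, e.g. across notebook cells ...
    df.collect() // re-reads the same shuffle; if the files were eagerly
                 // cleaned up, the stages have to be retried and recomputed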

@karuppayya (Contributor Author)

Thanks @cloud-fan for the comments. I think it would still be beneficial in normal execution.
For example, in notebooks the reference could remain active forever, delaying cleanup.
In such cases, making this configurable lets users clean up shuffle data and allows for aggressive executor downscaling. (This also prevents org.apache.spark.storage.FallbackStorage from having to copy shuffle blocks to remote storage when not needed.)

Also, this would be opt-in (per SQL execution) with a default of org.apache.spark.sql.execution.DoNotCleanup. Users control when aggressive cleanup is appropriate, maintaining backward compatibility.
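
For illustration, opting in could look like the following (using the conf key that eventually landed in this PR; table names are hypothetical):

    // Sketch: enable eager cleanup only around executions where aggressive
    // executor downscaling matters, then switch it back off.
    spark.conf.set("spark.sql.classic.shuffleDependency.fileCleanup.enabled", "true")
    spark.sql("INSERT OVERWRITE TABLE sink SELECT key, count(*) FROM src GROUP BY key")
    // shuffle files for the finished execution are removed eagerly
    spark.conf.set("spark.sql.classic.shuffleDependency.fileCleanup.enabled", "false")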

@cloud-fan (Contributor)

Then at least we should have separate configs for classic and Connect, as it's much safer to enable it for Connect.

@karuppayya (Contributor Author)

@cloud-fan it looks like the Connect-specific configs are prefixed with spark.sql.connect.*.
In this case the config is named spark.sql.shuffleDependency.fileCleanup.enabled (without the connect prefix) even though it's specific to Connect. Can you advise on how we should name this new config?

@cloud-fan (Contributor)

How about we add spark.sql.connect.* and spark.sql.classic.* configs to control the Connect and classic behavior separately? We then deprecate spark.sql.shuffleDependency.fileCleanup.enabled and let the spark.sql.connect.* config fall back to it for backward compatibility.
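
A sketch of what that could look like with the config builder's fallback mechanism (assuming buildConf(...).fallbackConf(...) as used elsewhere in SQLConf; the name and version below are illustrative):

    // Sketch: a Connect-scoped config that falls back to the legacy key
    // for backward compatibility.
    val CONNECT_SHUFFLE_DEPENDENCY_FILE_CLEANUP_ENABLED =
      buildConf("spark.sql.connect.shuffleDependency.fileCleanup.enabled")
        .doc("When enabled, shuffle files will be cleaned up at the end of " +
          "Spark Connect SQL executions. Falls back to " +
          "spark.sql.shuffleDependency.fileCleanup.enabled when unset.")
        .version("4.1.0")
        .fallbackConf(SHUFFLE_DEPENDENCY_FILE_CLEANUP_ENABLED)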

@karuppayya (Contributor Author)

@cloud-fan I have addressed the comments. Can you please take a look?

withSQLConf(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true",
  SQLConf.SHUFFLE_PARTITIONS.key -> "5",
  SQLConf.CLASSIC_SHUFFLE_DEPENDENCY_FILE_CLEANUP_ENABLED.key -> "false") {
A contributor commented:

Can we leave a code comment to explain it?

@karuppayya (Contributor Author) commented Aug 20, 2025

I have added a comment; let me know if that sounds OK. I will then resolve this conversation.

@karuppayya (Contributor Author) commented Aug 20, 2025

I reran the workflow that was failing and it passed on the second attempt.
I think it's good to be merged.
Edit: the PR status updated and all checks turned green after the retry passed.

@karuppayya
Copy link
Contributor Author

@cloud-fan Can we have this merged, if you don't have any other comments?

@karuppayya
Copy link
Contributor Author

@cloud-fan Can you please help merge this?

@cloud-fan (Contributor)

thanks, merging to master!

@cloud-fan cloud-fan closed this in 5337a57 Aug 27, 2025
.createWithDefault(Utils.isTesting)

val CLASSIC_SHUFFLE_DEPENDENCY_FILE_CLEANUP_ENABLED =
  buildConf("spark.sql.classic.shuffleDependency.fileCleanup.enabled")
A contributor commented:

@karuppayya shall we also add configs for the Thrift server? I think it's another entry point where we can safely enable file cleanup.

@karuppayya (Contributor Author) replied:

+1, thanks for bringing this up.
I will have this handled in a follow-up PR soon.

val CLASSIC_SHUFFLE_DEPENDENCY_FILE_CLEANUP_ENABLED =
  buildConf("spark.sql.classic.shuffleDependency.fileCleanup.enabled")
    .doc("When enabled, shuffle files will be cleaned up at the end of classic " +
      "SQL executions.")
A contributor commented:

Can we also mention the caveats? Eager shuffle cleanup will trigger stage retries if users repeatedly execute the same DataFrame instance.
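
One possible doc wording capturing that caveat (a sketch; the final text is whatever was committed in the PR):

    .doc("When enabled, shuffle files will be cleaned up at the end of classic " +
      "SQL executions. Note: eager cleanup can trigger stage retries when the " +
      "same DataFrame instance is executed repeatedly, since the removed " +
      "shuffle output has to be recomputed.")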

@karuppayya (Contributor Author) replied:

I will have this added in this PR, if that's OK.

@cloud-fan (Contributor) commented Aug 28, 2025

yes please!

cloud-fan pushed a commit that referenced this pull request Nov 27, 2025
### What changes were proposed in this pull request?
We have the ability to clean up shuffle files via spark.sql.classic.shuffleDependency.fileCleanup.enabled.
This change honors that config in the Thrift server and cleans up shuffle files there as well.
Related PR comment [here](#51458 (comment))

### Why are the changes needed?
This brings the Thrift server behavior on par with the other modes of SQL execution (classic, Connect).

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
NA

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #52213 from karuppayya/SPARK-53469.

Authored-by: Karuppayya Rajendran <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>