[SPARK-52065][SQL] Produce another plan tree with output columns (name, data type, nullability) in plan change logging #50852

HeartSaVioR · 2025-05-10T09:49:01Z

What changes were proposed in this pull request?

We propose adding another tree string that focuses on output columns, showing each column's data type and nullability. The plan change logger will show it alongside the existing plan tree string.

For example, here is one sample of plan change logging output:

=== Applying Rule org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan ===
!HashAggregate(keys=[id#334L], functions=[count(1)], output=[id#334L, count#335L])              AdaptiveSparkPlan isFinalPlan=false
!+- HashAggregate(keys=[id#334L], functions=[partial_count(1)], output=[id#334L, count#339L])   +- HashAggregate(keys=[id#334L], functions=[count(1)], output=[id#334L, count#335L])
!   +- Range (0, 1, step=1, splits=2)                                                              +- HashAggregate(keys=[id#334L], functions=[partial_count(1)], output=[id#334L, count#339L])
!                                                                                                     +- Range (0, 1, step=1, splits=2)

Output Information:
!HashAggregate <output=id#334L[nullable=false], count#335L[nullable=false]>      AdaptiveSparkPlan <output=id#334L[nullable=false], count#335L[nullable=false]>
!+- HashAggregate <output=id#334L[nullable=false], count#339L[nullable=false]>   +- HashAggregate <output=id#334L[nullable=false], count#335L[nullable=false]>
!   +- Range <output=id#334L[nullable=false]>                                       +- HashAggregate <output=id#334L[nullable=false], count#339L[nullable=false]>
!                                                                                      +- Range <output=id#334L[nullable=false]>

In some cases it is not even feasible to evaluate the output of a node (e.g. a Project with a Star expression). In that case we simply print <output='Unresolved'>, since the failure is mostly due to UnresolvedException.

For example,

=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions ===
!'Aggregate [id#334L], [id#334L, 'count(1) AS count#335]   Aggregate [id#334L], [id#334L, count(1) AS count#335L]
 +- Range (0, 1, step=1, splits=Some(2))                   +- Range (0, 1, step=1, splits=Some(2))

Output Information:
!Aggregate <output='Unresolved'>             Aggregate <output=id#334L[nullable=false], count#335L[nullable=false]>
 +- Range <output=id#334L[nullable=false]>   +- Range <output=id#334L[nullable=false]>
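
The per-node string shown above can be sketched roughly as follows. This is a minimal, self-contained model rather than Spark's Catalyst code: `Attr`, `Node`, `UnresolvedOutput`, and `nodeWithOutputColumns` are hypothetical names, and the truncation suffix is an assumed format.

```scala
// Hypothetical, simplified model; these are not Spark's Catalyst classes.
object OutputColumnsSketch {
  final case class Attr(name: String, nullable: Boolean)

  // Stand-in for the exception raised when an unresolved node is asked for its output.
  final class UnresolvedOutput extends RuntimeException("output is not resolved")

  trait Node {
    def nodeName: String
    def output: Seq[Attr] // may throw UnresolvedOutput
  }

  final case class Resolved(nodeName: String, attrs: Seq[Attr]) extends Node {
    def output: Seq[Attr] = attrs
  }

  final case class Unresolved(nodeName: String) extends Node {
    def output: Seq[Attr] = throw new UnresolvedOutput
  }

  // Render "<nodeName> <output=col[nullable=...], ...>", showing at most maxColumns columns.
  def nodeWithOutputColumns(node: Node, maxColumns: Int): String =
    try {
      val all = node.output
      val shown = all.take(maxColumns).map(a => s"${a.name}[nullable=${a.nullable}]")
      // Assumed truncation format for illustration only.
      val suffix =
        if (all.length > maxColumns) s" ... ${all.length - maxColumns} more fields" else ""
      s"${node.nodeName} <output=${shown.mkString(", ")}${suffix}>"
    } catch {
      // Give up on the columns and print only the node name with a marker.
      case _: UnresolvedOutput => s"${node.nodeName} <output='Unresolved'>"
    }
}
```

Catching the exception per node keeps one unresolved operator from hiding the output columns of the rest of the tree.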

Why are the changes needed?

We recently ran into a very tricky issue (a nullability change broke a stateful operator) that required adding custom debug logging to plan change logging. The root cause was the lack of visibility into output columns, especially their nullability, in the plan's tree string.

Ideally, we would not have two different tree strings and would simply extend the existing one. In many cases, however, the current tree string is already so long that we had to restrict the number of fields shown, so we think it is better to have a separate plan tree for this.

Does this PR introduce any user-facing change?

Yes, when users set the SQL config for the plan change logger's log level to a level that is visible in their log4j2 config. The change is at least opt-in rather than opt-out.
(If we changed the existing tree string instead, the output would change in many places.)

How was this patch tested?

Modified UT to cover the change.

Was this patch authored or co-authored using generative AI tooling?

No.

HeartSaVioR · 2025-05-10T09:49:42Z

cc. @cloud-fan PTAL, thanks!

HeartSaVioR · 2025-05-14T04:50:07Z

@cloud-fan Friendly reminder. I know you are busy with the release, so this is just in case you have time.

HeartSaVioR · 2025-05-21T08:59:02Z

cc. @cloud-fan Friendly reminder to see your availability.

cloud-fan · 2025-05-21T09:51:45Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/rules/RuleExecutor.scala

             |${MDC(QUERY_PLAN, sideBySide(oldPlan.treeString, newPlan.treeString).mkString("\n"))}
+             |
+             |Output Information:
+             |${MDC(QUERY_PLAN, newPlan.treeStringWithOutputColumns)}


treeString is also a public method, can we just call it with printOutputColumns = true?

Instead of a new section with the new plan only, shall we change sideBySide(oldPlan.treeString, newPlan.treeString) to use the new string with output?

I think it's too long. verbose is not an optional param, so we need to specify both verbose and printOutputColumns. I'm OK with it if the length does not matter. Let me change it, and I'll revert if you find it too long.

I'll make a change to do sideBySide here. Great suggestion!

cloud-fan · 2025-05-21T09:53:51Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala

  def innerChildren: Seq[TreeNode[_]] = Seq.empty

+  def nodeWithOutputColumnsString(maxColumns: Int): String = {
+    throw new UnsupportedOperationException("TreeNode does not have output columns")


To be conservative, shall we just call simpleString here?
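
As a rough illustration of this suggestion, the generic node type could fall back to its plain string while plan-like nodes override the method. The types below (`Tree`, `Expr`, `Plan`) are hypothetical stand-ins, not Spark's `TreeNode` hierarchy, and the thread does not confirm that the merged code looks like this.

```scala
object DefaultNodeStringSketch {
  // Hypothetical stand-in for a generic tree node; not Spark's TreeNode.
  trait Tree {
    def nodeName: String
    def simpleString(maxFields: Int): String = nodeName
    // Conservative default: reuse simpleString rather than throwing.
    def nodeWithOutputColumnsString(maxColumns: Int): String = simpleString(maxColumns)
  }

  // A node with no notion of output columns keeps the default.
  final case class Expr(nodeName: String) extends Tree

  // A plan-like node overrides the method to append its (pre-rendered) columns.
  final case class Plan(nodeName: String, columns: Seq[String]) extends Tree {
    override def nodeWithOutputColumnsString(maxColumns: Int): String =
      columns.take(maxColumns).mkString(s"$nodeName <output=", ", ", ">")
  }
}
```

With a default like this, calling the method on a node that has no output columns degrades to the ordinary node string instead of failing while a log message is being built.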

HeartSaVioR · 2025-05-21T13:50:10Z

Thanks @cloud-fan for your review! I've addressed your comments. Please take another look when you have time. Thanks!

cloud-fan · 2025-05-22T14:46:36Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/rules/RuleExecutor.scala

             |${MDC(QUERY_PLAN, sideBySide(oldPlan.treeString, newPlan.treeString).mkString("\n"))}
+             |
+             |Output Information:
+             |${MDC(QUERY_PLAN, sideBySide(oldPlan.treeString(verbose = false, printOutputColumns = true), newPlan.treeString(verbose = false, printOutputColumns = true)))}


nit:

    val oldPlanStringWithOutput = ...
    val newPlanStringWithOutput
    log"""
       ...
       ... sideBySide(oldPlanStringWithOutput, newPlanStringWithOutput)

Can we update the PR description to reflect this change?

Good point. Didn't realize the message() method itself is lazily evaluated.

Also updated the PR description.
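
The point about lazy evaluation can be illustrated with a by-name parameter. This is only a sketch under assumptions: `logAt`, `expensiveTreeString`, and `logPlanChange` are hypothetical names, not Spark's logging API.

```scala
object LazyLogSketch {
  // Counts how many times the "expensive" string is actually built.
  var evaluations = 0

  // Hypothetical logger: the message is a by-name parameter, so it is built only when enabled.
  def logAt(enabled: Boolean)(message: => String): Option[String] =
    if (enabled) Some(message) else None

  def expensiveTreeString(label: String): String = {
    evaluations += 1
    s"$label plan"
  }

  def logPlanChange(enabled: Boolean): Option[String] =
    logAt(enabled) {
      // These vals live inside the by-name block, so they cost nothing when logging is off.
      val oldPlanString = expensiveTreeString("old")
      val newPlanString = expensiveTreeString("new")
      s"$oldPlanString -> $newPlanString"
    }
}
```

Because the block is evaluated only when the level is enabled, pulling the tree strings into local vals improves readability without adding cost to the disabled path.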

HeartSaVioR · 2025-05-25T22:53:13Z

@cloud-fan I've addressed your feedback. PTAL, thanks!

cloud-fan · 2025-05-29T14:22:27Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/rules/RuleExecutor.scala

+          // scalastyle:off line.size.limit
          log"""
             |=== Result of Batch ${MDC(BATCH_NAME, batchName)} ===
             |${MDC(QUERY_PLAN, sideBySide(oldPlan.treeString, newPlan.treeString).mkString("\n"))}


What if we append the output info in the verbose treeString? Is the diff readable?

Sorry I was pulled into other things.

I'm not sure I'm following your suggestion, actually. Do you suggest printing two trees when we call treeString with verbose = true? Or do you suggest adding output columns to each node's verboseString?

Either way, each has its own issue. For the former, sideBySide won't line up if the optimization removes some nodes. (In some cases we would need to compare lines diagonally.) For the latter, we would print up to 50 (25 * 2) columns, which isn't easy to read.
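
The alignment concern can be seen with a small line-by-line comparison in the spirit of the sideBySide helper. The function below is a simplified re-implementation for illustration, not Spark's code; the object name and the example plans are made up.

```scala
object SideBySideSketch {
  // Pair up lines by position and mark each row "!" when the two sides differ.
  def sideBySide(left: Seq[String], right: Seq[String]): Seq[String] = {
    val width = (left.map(_.length) :+ 0).max
    val rows = math.max(left.size, right.size)
    left.padTo(rows, "").zip(right.padTo(rows, "")).map { case (l, r) =>
      val marker = if (l == r) " " else "!"
      marker + l + (" " * (width - l.length + 3)) + r
    }
  }
}

// Removing the Filter node shifts every later line up by one position,
// so the Range lines no longer sit on the same row.
val before = Seq("Project", "+- Filter", "   +- Range")
val after = Seq("Project", "+- Range")
SideBySideSketch.sideBySide(before, after).foreach(println)
```

Because rows are matched purely by position, a node that survives the rule but moves to a different depth has to be compared diagonally by the reader.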

cloud-fan · 2025-06-16T03:06:52Z

thanks, merging to master!

HeartSaVioR · 2025-06-16T05:22:07Z

@cloud-fan Thanks for reviewing and merging!

see apache/spark#50852

| Cause | Type | Category | Description | Affected Files |
|-------|------|----------|-------------|----------------|
| - | Feat | Feature | Introduce Spark41Shims and update build configuration to support Spark 4.1. | pom.xml, shims/pom.xml, shims/spark41/pom.xml, shims/spark41/.../META-INF/services/org.apache.gluten.sql.shims.SparkShimProvider, shims/spark41/.../spark41/Spark41Shims.scala, shims/spark41/.../spark41/SparkShimProvider.scala |
| [#51477](apache/spark#51477) | Fix | Compatibility | Use class name instead of class object for streaming call detection to ensure Spark 4.1 compatibility. | gluten-core/.../caller/CallerInfo.scala |
| [#50852](apache/spark#50852) | Fix | Compatibility | Add printOutputColumns parameter to generateTreeString methods | shims/spark41/.../GenerateTreeStringShim.scala |
| [#51775](apache/spark#51775) | Fix | Compatibility | Remove unused MDC import in FileSourceScanExecShim.scala | shims/spark41/.../FileSourceScanExecShim.scala |
| [#51979](apache/spark#51979) | Fix | Compatibility | Add missing StoragePartitionJoinParams import in BatchScanExecShim and AbstractBatchScanExec | shims/spark41/.../v2/AbstractBatchScanExec.scala, shims/spark41/.../v2/BatchScanExecShim.scala |
| [#51302](apache/spark#51302) | Fix | Compatibility | Remove TimeAdd from ExpressionConverter and ExpressionMappings for test | gluten-substrait/.../ExpressionConverter.scala, gluten-substrait/.../ExpressionMappings.scala |
| [#50598](apache/spark#50598) | Fix | Compatibility | Adapt to QueryExecution.createSparkPlan interface change | gluten-substrait/.../GlutenImplicits.scala, shims/spark*/.../QueryExecutionShim.scala |
| [#52599](apache/spark#52599) | Fix | Compatibility | Adapt to DataSourceV2Relation interface change | backends-velox/.../ArrowConvertorRule.scala, shims/spark*/.../v2/DataSourceV2RelationShim.scala |
| [#52384](apache/spark#52384) | Fix | Compatibility | Using new interface of ParquetFooterReader | backends-velox/.../ParquetMetadataUtils.scala, gluten-ut/spark40/.../parquet/GlutenParquetRowIndexSuite.scala, shims/spark*/.../parquet/ParquetFooterReaderShim.scala |
| [#52509](apache/spark#52509) | Fix | Build | Update Scala version to 2.13.17 in pom.xml to fix `java.lang.NoSuchMethodError: 'java.lang.String scala.util.hashing.MurmurHash3$.caseClassHash$default$2()'` | pom.xml |
| - | Fix | Test | Refactor Spark version checks in VeloxHashJoinSuite to improve readability and maintainability | backends-velox/.../VeloxHashJoinSuite.scala |
| [#50849](apache/spark#50849) | Fix | Test | Fix MiscOperatorSuite to support OneRowRelationExec plan Spark 4.1 | backends-velox/.../MiscOperatorSuite.scala |
| [#52723](apache/spark#52723) | Fix | Compatibility | Add GeographyVal and GeometryVal support in ColumnarArrayShim | shims/spark41/.../vectorized/ColumnarArrayShim.java |
| [#48470](apache/spark#48470) | 4.1.0 | Exclude | Exclude split test in VeloxStringFunctionsSuite | backends-velox/.../VeloxStringFunctionsSuite.scala |
| [#51259](apache/spark#51259) | 4.1.0 | Exclude | Only Run ArrowEvalPythonExecSuite tests up to Spark 4.0, we need update ci python to 3.10 | backends-velox/.../python/ArrowEvalPythonExecSuite.scala |

HeartSaVioR added 4 commits May 9, 2025 13:06

[WIP] Change the plan change log to also produce output columns with nullability

365598c

Compilation is fixed

e243f9d

fix AQE test failure

7a61566

modified test to cover the change

94c62c4

github-actions bot added the SQL label May 10, 2025

cloud-fan reviewed May 21, 2025

first pass reviews

fee4a2b

cloud-fan reviewed May 22, 2025

feedback

63dbb95

HeartSaVioR requested a review from cloud-fan May 29, 2025 12:54

cloud-fan reviewed May 29, 2025

cloud-fan closed this in 5d0b2f4 Jun 16, 2025

baibaichen added a commit to apache/incubator-gluten that referenced this pull request Dec 31, 2025

[Fix] Add printOutputColumns parameter to generateTreeString methods

d7d2c80

see apache/spark#50852

baibaichen added a commit to baibaichen/gluten that referenced this pull request Dec 31, 2025

[Fix] Add printOutputColumns parameter to generateTreeString methods

dbe40e0

see apache/spark#50852

baibaichen added a commit to baibaichen/gluten that referenced this pull request Jan 4, 2026

[Fix] Add printOutputColumns parameter to generateTreeString methods

fec7adf

see apache/spark#50852

baibaichen added a commit to baibaichen/gluten that referenced this pull request Jan 4, 2026

[Fix] Add printOutputColumns parameter to generateTreeString methods

ea32ac2

see apache/spark#50852

baibaichen mentioned this pull request Jan 5, 2026

[GLUTEN-11346][CORE][VL] Add Spark 4.1 Shim Layer apache/incubator-gluten#11347

Merged
