Conversation

@ulysses-you
Contributor

@ulysses-you ulysses-you commented Jan 17, 2023

What changes were proposed in this pull request?

This PR adds a new abstract class `ReusableQueryStageExec`, which is the parent of `ShuffleQueryStageExec` and `BroadcastQueryStageExec`.

It also adds a new query stage, `TableCacheQueryStageExec`, which wraps `InMemoryTableScanExec`.

`InMemoryTableScanExec` has several issues in AQE:

  1. The first access to the cached plan is tricky. Currently, we cannot preserve its output partitioning and ordering, because we plan the query with an un-materialized cached plan.
  2. Plan info is not updated in AQE when the final stage includes `InMemoryTableScan`, which breaks the UI.
  3. Metrics are not propagated in AQE, since the `InMemoryTableScanExec` execution id does not map to the current query execution.
  4. The cached plan misses many optimizations in the AQE framework (`AQEOptimizer`), even though we ideally know its accurate statistics.
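For illustration, the wrapping idea can be sketched as below. These are reduced stand-ins, not the actual Spark classes; `wrapTableCache` and the simplified `materialize`/`doMaterialize` methods are hypothetical names used only to show the materialize-through-a-stage flow.

```scala
// Reduced stand-ins for the real Spark classes, for illustration only.
abstract class SparkPlan

class InMemoryTableScanExec extends SparkPlan {
  var materialized = false
  // In real Spark this triggers the job that populates the cache.
  def materialize(): Unit = { materialized = true }
}

// The proposed stage wrapper: AQE drives materialization through the stage,
// so it can afterwards read accurate runtime statistics from the cache.
class TableCacheQueryStageExec(val scan: InMemoryTableScanExec) extends SparkPlan {
  def isMaterialized: Boolean = scan.materialized
  def doMaterialize(): Unit = scan.materialize()
}

// AQE plan preparation would wrap every in-memory scan it encounters.
def wrapTableCache(plan: SparkPlan): SparkPlan = plan match {
  case i: InMemoryTableScanExec => new TableCacheQueryStageExec(i)
  case other                    => other
}
```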

Why are the changes needed?

Fix the issues above.

Does this PR introduce any user-facing change?

Yes, it fixes the issues above.

How was this patch tested?

Added tests and tested manually:

```
val df = spark.sql("select * from t1 join t2 on t1.c1 = t2.c1").cache
df.groupBy($"t1.c1").agg(max($"t2.c2")).collect
```

(screenshot omitted)

@maryannxue
Contributor

maryannxue commented Jan 18, 2023

Can you list all the issues we currently have for InMemoryTableScanExec in AQE? We need to justify the need to expand the semantics of QueryStage here. And also what assumptions we might be breaking if we did so.

@maryannxue
Contributor

maryannxue commented Jan 18, 2023

Another way to achieve this is instead of wrapping, we can have a common interface (e.g., Materializable, or BlockingDependency) for QueryStageExec and InMemoryTableScanExec so that other call sites that match QueryStageExec will not be affected. Also QueryStageExec currently strictly maps to a Spark Job/Stage in AQE, so we need to be careful here.
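The suggested common interface could look roughly like this. The sketch below is illustrative only: `Materializable`, `StubQueryStage`, and `allReady` are hypothetical names, not the final Spark API.

```scala
import scala.concurrent.Future

// Hypothetical common interface for nodes that AQE must materialize before
// planning can continue (query stages, in-memory table scans, ...).
trait Materializable {
  def materialize(): Future[Any]
  def isMaterialized: Boolean
}

// A query-stage-like implementer; real stages submit shuffle/broadcast jobs.
class StubQueryStage extends Materializable {
  private var done = false
  def materialize(): Future[Any] = {
    done = true
    Future.successful(())
  }
  def isMaterialized: Boolean = done
}

// Call sites that only need "blocking dependency" semantics can then match
// on Materializable instead of QueryStageExec.
def allReady(deps: Seq[Materializable]): Boolean = deps.forall(_.isMaterialized)
```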

Contributor

We do not need to and cannot cancel(). Similarly, "reuse" and "getRuntimeStats" are meaningless for SQL cache.

Contributor Author

  • cancel: shall we keep it the same as other query stages, i.e. do nothing if it's materialized, otherwise cancel it?
  • reuse: yeah, it's meaningless
  • getRuntimeStats: if it's materialized, then its statistics are accurate, so we can mark it as isRuntime, right?

Member

@dongjoon-hyun dongjoon-hyun left a comment

Could you reply to @maryannxue 's comment, @ulysses-you ?

@ulysses-you ulysses-you changed the title [SPARK-42101][SQL] Wrap InMemoryTableScanExec with QueryStage [SPARK-42101][SQL] Introduce Materializable and MaterializableQueryStage for AQE framework Jan 30, 2023
@ulysses-you
Contributor Author

@maryannxue @dongjoon-hyun sorry for the late response. I'm working on this PR now!

@ulysses-you
Contributor Author

ulysses-you commented Jan 30, 2023

@maryannxue `Materializable` sounds good to me; I updated this PR with it. Do you have time to take another look? Thank you. Also cc @cloud-fan @viirya @dongjoon-hyun

@ulysses-you ulysses-you force-pushed the cache-query-stage branch 2 times, most recently from 42e36c9 to b9bc6e7 Compare February 28, 2023 12:14
Contributor

why do we consider relation.isMaterialized?

Contributor Author

If the `InMemoryTableScanExec` is materialized, we should not run a new job again. Adding the check avoids the AQE framework calling `doMaterialize`.
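The check described above can be sketched as follows (simplified stand-ins; `CachedRelation` and `materializeIfNeeded` are illustrative names, not the real Spark API):

```scala
// Simplified sketch of the guard: if the cached relation was already
// materialized by an earlier query, AQE must not submit a new job for it.
class CachedRelation {
  var isMaterialized = false
  var jobsSubmitted = 0
}

def materializeIfNeeded(relation: CachedRelation): Unit = {
  if (!relation.isMaterialized) {
    relation.jobsSubmitted += 1 // in Spark this would submit the caching job
    relation.isMaterialized = true
  }
}
```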

Contributor

according to the comment, !isSubquery is sufficient?

Contributor Author

A cached plan inside an `AdaptiveSparkPlan` is not a subquery, but we still cannot update its plan, since that would overwrite the whole query plan.

E.g., we cannot update the plan for query execution 0. Instead, we should update its metrics.

```
          ...
           |
  AdaptiveSparkPlanExec (query execution 0, no execution id)
           |
  InMemoryTableScanExec
           |
          ...
           |
  AdaptiveSparkPlanExec (query execution 1, execution id 0)
```

@ulysses-you
Contributor Author

@cloud-fan it seems the main concern is: shall we make `Materializable` a first-class citizen in the AQE framework? If so, then every place in the AQE framework should use `Materializable` instead of query stage.

In fact, I'm not sure that is fine. A lot of names in the code depend on "query stage", e.g., `queryStagePreparationRules` -> `materializablePreparationRules`, and these are actually developer APIs.

@ulysses-you ulysses-you force-pushed the cache-query-stage branch 2 times, most recently from 8fe51c0 to 1aa33b7 Compare March 2, 2023 03:50
}
}

case i: InMemoryTableScanExec =>
Contributor

question: if the table cache is already materialized (second access of the cache), do we still need to wrap it with TableCacheQueryStage?

Contributor Author

@ulysses-you ulysses-you Mar 10, 2023

`TableCacheQueryStage` provides a base framework for runtime statistics, so I think wrapping it is more suitable for the AQE framework, e.g., marking `isRuntime = true` in `Statistics`.
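The runtime-statistics point can be illustrated like this. This is a simplified `Statistics` and a hypothetical `cacheStats` helper, not Spark's actual classes:

```scala
// Simplified stats holder; Spark's Statistics carries more fields.
case class Statistics(sizeInBytes: BigInt, rowCount: Option[BigInt], isRuntime: Boolean)

// Once the cache is materialized, the numbers come from actual execution,
// so the stage can flag them as runtime-accurate for AQE optimizer rules.
def cacheStats(materialized: Boolean, size: BigInt, rows: BigInt): Statistics =
  Statistics(size, if (materialized) Some(rows) else None, isRuntime = materialized)
```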


private lazy val shouldUpdatePlan: Boolean = {
// Only the root `AdaptiveSparkPlanExec` of the main query that triggers this query execution
// should update UI.
Contributor

@cloud-fan cloud-fan Mar 10, 2023

Suggested change
// should update UI.
// need to do a final plan update for the UI.

batch
val cached = cb.mapPartitionsInternal { it =>
new Iterator[CachedBatch] {
TaskContext.get().addTaskCompletionListener[Unit](_ => {
Contributor

we can register this listener before returning the wrapping iterator.

Contributor Author

oh, somehow I put the code here...

* This method is only used by AQE, which executes the actual cached RDD without filtering and
* without row/columnar serialization.
*/
def executeCache(): RDD[CachedBatch] = {
Contributor

baseCacheRDD?

Contributor

it doesn't execute anything and the current name is confusing.

Contributor

@cloud-fan cloud-fan left a comment

looks pretty good, only some nit comments.

*/
private def onUpdatePlan(executionId: Long, newSubPlans: Seq[SparkPlan]): Unit = {
if (isSubquery) {
if (!needFinalPlanUpdate) {
Contributor

@cloud-fan cloud-fan Mar 10, 2023

sorry I was wrong, the previous name is better. But we should update the comment here. It's not only for subquery.

Contributor Author

Combined the comments into the definition of `shouldUpdatePlan`.

// last UI update in `getFinalPhysicalPlan`, so we need to update UI here again to make sure
// the newly generated nodes of those subqueries are updated.
if (!isSubquery && currentPhysicalPlan.exists(_.subqueries.nonEmpty)) {
if (shouldUpdatePlan && currentPhysicalPlan.exists(_.subqueries.nonEmpty)) {
Contributor Author

`shouldUpdatePlan` is not strictly required here, since we already check it inside `onUpdatePlan`. The reason to leave it is to fast-skip `currentPhysicalPlan.exists(_.subqueries.nonEmpty)`.

// of the new plan nodes, so that it can track the valid accumulator updates later
// and display SQL metrics correctly.
// 2. If the `QueryExecution` does not match the current execution ID, it means the execution
// ID belongs to another (parent) query, and we should not call update UI in this query.
Contributor

shall we mention that this can happen with table cache?

Contributor Author

sure, added

Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM.

cc @viirya , too

…e/AdaptiveSparkPlanExec.scala

Co-authored-by: Wenchen Fan <[email protected]>
@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in b397832 Mar 13, 2023
@ulysses-you ulysses-you deleted the cache-query-stage branch March 13, 2023 04:13
cloud-fan added a commit that referenced this pull request Mar 13, 2023
### What changes were proposed in this pull request?

This is a followup of #39624 .

`TableCacheQueryStageExec.cancel` is a no-op, so we can move `def cancel` out of `QueryStageExec`. Due to this movement, I renamed `ReusableQueryStageExec` to `ExchangeQueryStageExec`.

### Why are the changes needed?

Type safety.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

Closes #40399 from cloud-fan/follow.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
cloud-fan added a commit that referenced this pull request Mar 24, 2023
…eterialized consistent

### What changes were proposed in this pull request?

This is a followup of #39624 . `QueryStageExec.isMaterialized` should only return true if `resultOption` is assigned. Having this inconsistency could be a potential bug.

### Why are the changes needed?

fix potential bug

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

Closes #40522 from cloud-fan/follow.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
HyukjinKwon pushed a commit that referenced this pull request Oct 5, 2023
…ageLevel.NONE on Dataset

### What changes were proposed in this pull request?
Support for `InMemoryTableScanExec` in AQE was added in #39624, but that patch contained a bug when a Dataset is persisted using `StorageLevel.NONE`. Before that patch, a query like:
```
import org.apache.spark.storage.StorageLevel
spark.createDataset(Seq(1, 2)).persist(StorageLevel.NONE).count()
```
would correctly return 2. But after that patch it incorrectly returns 0. This is because AQE incorrectly concludes, based on the runtime statistics collected here:
https://github.com/apache/spark/blob/eac5a8c7e6da94bb27e926fc9a681aed6582f7d3/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala#L294
that the input is empty. The problem is that the action that should make sure the statistics are collected here
https://github.com/apache/spark/blob/eac5a8c7e6da94bb27e926fc9a681aed6582f7d3/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/QueryStageExec.scala#L285-L291
never uses the iterator; and with `StorageLevel.NONE`, persisting will also not use the iterator, so we never gather the correct statistics.

The fix proposed in the patch just makes calling `persist` with `StorageLevel.NONE` a no-op. Changing the action so that it always drains the iterator would also work, but that seems like unnecessary work in most normal circumstances.
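The no-op behavior can be sketched as follows. `MiniDataset` and the cut-down `StorageLevel` are reduced stand-ins for the real Spark API, used only to show the shape of the fix:

```scala
// Minimal stand-ins; in Spark, persist(StorageLevel.NONE) used to register
// a cache entry whose build never consumed the input, yielding empty stats.
sealed trait StorageLevel
object StorageLevel {
  case object NONE extends StorageLevel
  case object MEMORY_AND_DISK extends StorageLevel
}

class MiniDataset(data: Seq[Int]) {
  var cachedLevel: Option[StorageLevel] = None
  def persist(level: StorageLevel): MiniDataset = {
    // The fix: treat StorageLevel.NONE as a no-op instead of caching.
    if (level != StorageLevel.NONE) cachedLevel = Some(level)
    this
  }
  def count(): Long = data.size.toLong
}
```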

### Why are the changes needed?
The current code has a correctness issue.

### Does this PR introduce _any_ user-facing change?
Yes, fixes the correctness issue.

### How was this patch tested?
New and existing unit tests.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43213 from eejbyfeldt/SPARK-45386-branch-3.5.

Authored-by: Emil Ejbyfeldt <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
turboFei pushed a commit to turboFei/spark that referenced this pull request Nov 6, 2025