Skip to content

Conversation

@eejbyfeldt
Copy link
Contributor

What changes were proposed in this pull request?

Support for InMememoryTableScanExec in AQE was added in #39624, but this patch contained a bug when a Dataset is persisted using StorageLevel.NONE. Before that patch a query like:

import org.apache.spark.storage.StorageLevel
spark.createDataset(Seq(1, 2)).persist(StorageLevel.NONE).count()

would correctly return 2. But after that patch it incorrectly returns 0. This is because AQE incorrectly determines based on the runtime statistics that are collected here:


that the input is empty. The problem is that the action that should make sure the statistics are collected here
sparkContext.submitJob(
rdd,
(_: Iterator[CachedBatch]) => (),
(0 until rdd.getNumPartitions).toSeq,
(_: Int, _: Unit) => (),
()
)

never use the iterator and when we have StorageLevel.NONE the persisting will also not use the iterator and we will not gather the correct statistics.

The proposed fix in the patch just make calling persist with StorageLevel.NONE a no-op. Changing the action since it always "emptied" the iterator would also work but seems like that would be unnecessary work in a lot of normal circumstances.

Why are the changes needed?

The current code has a correctness issue.

Does this PR introduce any user-facing change?

Yes, fixes the correctness issue.

How was this patch tested?

New and existing unit tests.

Was this patch authored or co-authored using generative AI tooling?

No

…evel.NONE on Dataset (apache#43188)

* SPARK-45386: Fix correctness issue with StorageLevel.NONE

* Move to CacheManager

* Add comment
Copy link
Contributor

@WeichenXu123 WeichenXu123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-45386][SQL]: Fix correctness issue with persist using StorageLevel.NONE on Dataset for branch 3.5 [SPARK-45386][SQL][3.5] Fix correctness issue with persist using StorageLevel.NONE on Dataset Oct 4, 2023
Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM (Pending CIs). Thank you, @eejbyfeldt .

@HyukjinKwon
Copy link
Member

The test failure was due to 2a9dd2b. I reverted it out of branch-3.5.522af69

@HyukjinKwon
Copy link
Member

Merged to branch-3.5.

HyukjinKwon pushed a commit that referenced this pull request Oct 5, 2023
…ageLevel.NONE on Dataset

### What changes were proposed in this pull request?
Support for InMememoryTableScanExec in AQE was added in #39624, but this patch contained a bug when a Dataset is persisted using `StorageLevel.NONE`. Before that patch a query like:
```
import org.apache.spark.storage.StorageLevel
spark.createDataset(Seq(1, 2)).persist(StorageLevel.NONE).count()
```
would correctly return 2. But after that patch it incorrectly returns 0. This is because AQE incorrectly determines based on the runtime statistics that are collected here:
https://github.com/apache/spark/blob/eac5a8c7e6da94bb27e926fc9a681aed6582f7d3/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala#L294
that the input is empty. The problem is that the action that should make sure the statistics are collected here
https://github.com/apache/spark/blob/eac5a8c7e6da94bb27e926fc9a681aed6582f7d3/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/QueryStageExec.scala#L285-L291
never use the iterator and when we have `StorageLevel.NONE` the persisting will also not use the iterator and we will not gather the correct statistics.

The proposed fix in the patch just make calling persist with StorageLevel.NONE a no-op. Changing the action since it always "emptied" the iterator would also work but seems like that would be unnecessary work in a lot of normal circumstances.

### Why are the changes needed?
The current code has a correctness issue.

### Does this PR introduce _any_ user-facing change?
Yes, fixes the correctness issue.

### How was this patch tested?
New and existing unit tests.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43213 from eejbyfeldt/SPARK-45386-branch-3.5.

Authored-by: Emil Ejbyfeldt <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
@HyukjinKwon HyukjinKwon closed this Oct 5, 2023
turboFei pushed a commit to turboFei/spark that referenced this pull request Nov 6, 2025
…ageLevel.NONE on Dataset

Support for InMememoryTableScanExec in AQE was added in apache#39624, but this patch contained a bug when a Dataset is persisted using `StorageLevel.NONE`. Before that patch a query like:
```
import org.apache.spark.storage.StorageLevel
spark.createDataset(Seq(1, 2)).persist(StorageLevel.NONE).count()
```
would correctly return 2. But after that patch it incorrectly returns 0. This is because AQE incorrectly determines based on the runtime statistics that are collected here:
https://github.com/apache/spark/blob/eac5a8c7e6da94bb27e926fc9a681aed6582f7d3/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala#L294
that the input is empty. The problem is that the action that should make sure the statistics are collected here
https://github.com/apache/spark/blob/eac5a8c7e6da94bb27e926fc9a681aed6582f7d3/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/QueryStageExec.scala#L285-L291
never use the iterator and when we have `StorageLevel.NONE` the persisting will also not use the iterator and we will not gather the correct statistics.

The proposed fix in the patch just make calling persist with StorageLevel.NONE a no-op. Changing the action since it always "emptied" the iterator would also work but seems like that would be unnecessary work in a lot of normal circumstances.

The current code has a correctness issue.

Yes, fixes the correctness issue.

New and existing unit tests.

No

Closes apache#43213 from eejbyfeldt/SPARK-45386-branch-3.5.

Authored-by: Emil Ejbyfeldt <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants