-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-45386][SQL]: Fix correctness issue with persist using StorageLevel.NONE on Dataset #43188
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
srowen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks OK. I think this needs to go into 3.5 too?
sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala
Show resolved
Hide resolved
Yes, this fix is also needed in the 3.5 branch. |
|
@eejbyfeldt Spark package releasing pipeline has some issue recently, once it is fixed I will release new version graphframe for spark 3.5 |
|
@WeichenXu123 this commit should also go into the 3.5 branch since also affected by the correctness bug. Also based on the commit message of a0c9ab6 it did not look like you merged it using the merge_spark_pr.py script? |
Yes, my fault :) We should use merge_spark_pr.py to merge it. I will backport it to spark 3.5. |
|
@eejbyfeldt I found there's some conflicts when I cherry-pick this commit a0c9ab6 to spark 3.5 could you file a separate PR against spark 3.5 ? Thanks! |
…evel.NONE on Dataset (apache#43188) * SPARK-45386: Fix correctness issue with StorageLevel.NONE * Move to CacheManager * Add comment
|
|
Thank you, @eejbyfeldt and all. To @WeichenXu123 . |
|
Wondering if there is a way to disable that "squash and merge" button @dongjoon-hyun :-) |
Sure. :) |
What changes were proposed in this pull request?
Support for InMememoryTableScanExec in AQE was added in #39624, but this patch contained a bug when a Dataset is persisted using
StorageLevel.NONE. Before that patch a query like:would correctly return 2. But after that patch it incorrectly returns 0. This is because AQE incorrectly determines based on the runtime statistics that are collected here:
spark/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala
Line 294 in eac5a8c
that the input is empty. The problem is that the action that should make sure the statistics are collected here
spark/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/QueryStageExec.scala
Lines 285 to 291 in eac5a8c
never use the iterator and when we have
StorageLevel.NONEthe persisting will also not use the iterator and we will not gather the correct statistics.The proposed fix in the patch just make calling persist with StorageLevel.NONE a no-op. Changing the action since it always "emptied" the iterator would also work but seems like that would be unnecessary work in a lot of normal circumstances.
Why are the changes needed?
The current code has a correctness issue.
Does this PR introduce any user-facing change?
Yes, fixes the correctness issue.
How was this patch tested?
New and existing unit tests.
Was this patch authored or co-authored using generative AI tooling?
No