Conversation

@eejbyfeldt
Contributor

@eejbyfeldt eejbyfeldt commented Sep 30, 2023

What changes were proposed in this pull request?

Support for InMemoryTableScanExec in AQE was added in #39624, but that patch contained a bug when a Dataset is persisted using StorageLevel.NONE. Before that patch, a query like:

import org.apache.spark.storage.StorageLevel
spark.createDataset(Seq(1, 2)).persist(StorageLevel.NONE).count()

would correctly return 2, but after that patch it incorrectly returns 0. This is because AQE incorrectly determines, based on the runtime statistics collected here, that the input is empty. The problem is that the action that should make sure the statistics are collected here:
sparkContext.submitJob(
  rdd,
  (_: Iterator[CachedBatch]) => (),
  (0 until rdd.getNumPartitions).toSeq,
  (_: Int, _: Unit) => (),
  ()
)

never uses the iterator, and when StorageLevel.NONE is used, persisting will not use the iterator either, so the correct statistics are never gathered.

The fix proposed in this patch simply makes calling persist with StorageLevel.NONE a no-op. Changing the action so that it always consumes ("empties") the iterator would also work, but that seems like unnecessary work in a lot of normal circumstances.
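The mechanism can be sketched without Spark. The following plain-Scala illustration is hypothetical: observedRows, countingIterator, and the ignoring action are stand-ins for the runtime-statistics accumulator, the cached-batch iterator, and the submitJob action quoted above. The point is only that a count tracked as a side effect of iterator consumption is never recorded by an action that ignores its iterator:

```scala
object IteratorStatsSketch {
  def main(args: Array[String]): Unit = {
    // Stand-in for the runtime-statistics accumulator AQE reads.
    var observedRows = 0L

    // Iterator that counts rows as a side effect of being consumed,
    // analogous to how the cache-materialization path tracks statistics.
    def countingIterator(data: Seq[Int]): Iterator[Int] =
      data.iterator.map { row => observedRows += 1; row }

    // Action analogous to the submitJob above: it receives the iterator
    // but never consumes it, so the counting side effect never fires.
    val ignore: Iterator[Int] => Unit = _ => ()
    ignore(countingIterator(Seq(1, 2)))
    println(s"after ignoring action: $observedRows")   // 0 -> looks "empty"

    // Consuming the iterator (what persisting with a real storage level
    // effectively does) records the count correctly.
    countingIterator(Seq(1, 2)).foreach(_ => ())
    println(s"after consuming action: $observedRows")  // 2
  }
}
```

With StorageLevel.NONE neither the action nor the (skipped) caching consumes the iterator, which is why AQE sees zero rows; making such a persist a no-op avoids relying on those statistics at all.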

Why are the changes needed?

The current code has a correctness issue.

Does this PR introduce any user-facing change?

Yes, fixes the correctness issue.

How was this patch tested?

New and existing unit tests.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label Sep 30, 2023
@eejbyfeldt eejbyfeldt changed the title SPARK-45386: Fix correctness issue with StorageLevel.NONE on Dataset SPARK-45386: Fix correctness issue with persist using StorageLevel.NONE on Dataset Sep 30, 2023
@eejbyfeldt eejbyfeldt changed the title SPARK-45386: Fix correctness issue with persist using StorageLevel.NONE on Dataset [SPARK-45386][SQL]: Fix correctness issue with persist using StorageLevel.NONE on Dataset Sep 30, 2023
Member

@srowen srowen left a comment


Looks OK. I think this needs to go into 3.5 too?

@eejbyfeldt
Contributor Author

Looks OK. I think this needs to go into 3.5 too?

Yes, this fix is also needed in the 3.5 branch.

@WeichenXu123 WeichenXu123 merged commit a0c9ab6 into apache:master Oct 2, 2023
@WeichenXu123
Contributor

@eejbyfeldt The Spark package releasing pipeline has had some issues recently; once it is fixed I will release a new GraphFrames version for Spark 3.5.

@eejbyfeldt
Contributor Author

@WeichenXu123 this commit should also go into the 3.5 branch, since it is also affected by the correctness bug. Also, based on the commit message of a0c9ab6, it does not look like you merged it using the merge_spark_pr.py script?

@WeichenXu123
Contributor

@WeichenXu123 this commit should also go into the 3.5 branch, since it is also affected by the correctness bug. Also, based on the commit message of a0c9ab6, it does not look like you merged it using the merge_spark_pr.py script?

Yes, my fault :) We should use merge_spark_pr.py to merge it.

I will backport it to spark 3.5.

@WeichenXu123
Contributor

@eejbyfeldt I found there are some conflicts when I cherry-pick commit a0c9ab6 to Spark 3.5.

Could you file a separate PR against Spark 3.5? Thanks!

eejbyfeldt pushed a commit to eejbyfeldt/spark that referenced this pull request Oct 4, 2023
…evel.NONE on Dataset (apache#43188)

* SPARK-45386: Fix correctness issue with StorageLevel.NONE

* Move to CacheManager

* Add comment
@eejbyfeldt
Contributor Author

@eejbyfeldt I found there are some conflicts when I cherry-pick commit a0c9ab6 to Spark 3.5.

Could you file a separate PR against Spark 3.5? Thanks!

#43213

@dongjoon-hyun
Member

dongjoon-hyun commented Oct 4, 2023

Thank you, @eejbyfeldt and all.

To @WeichenXu123 .
Please use our merge script. It has much more features to help Apache Spark committers. 😄

@mridulm
Contributor

mridulm commented Oct 4, 2023

Wondering if there is a way to disable that "squash and merge" button, @dongjoon-hyun :-)

@WeichenXu123
Contributor

Thank you, @eejbyfeldt and all.

To @WeichenXu123 . Please use our merge script. It has much more features to help Apache Spark committers. 😄

Sure. :)
