Spark 3.4 : Use correct statistics file in SparkScan::estimateStatistics(Snapshot)#12647
Conversation
|
This is a clean backport of my fix. |
|
Yes thanks @wypoon for this fix, and for reviewing it. |
| if (!files.isEmpty()) { | ||
| List<BlobMetadata> metadataList = (files.get(0)).blobMetadata(); | ||
| Optional<StatisticsFile> file = | ||
| files.stream().filter(f -> f.snapshotId() == snapshot.snapshotId()).findFirst(); |
There was a problem hiding this comment.
Could we have multiple files with the same snapshotId? Or findAny is enough?
There was a problem hiding this comment.
As per java implementation, it is always a single stats file per snapshot id.
I followed similar design while working on partition stats too.
There was a problem hiding this comment.
So we could have a theoretical performance boost from findAny
There was a problem hiding this comment.
The spec doesn't actually say that there should only be one statistics file per snapshot. This happens to be how it is implemented in Java. The spec simply allows for multiple statistics files.
I was thinking about the problem of tracking orphaned statistics files when they are recomputed. One idea I had was to keep replaced statistics files (for a snapshot) still in the list (as long as the files are tracked in metadata we can clean up unused ones), but to keep the newest one before others. Hence findFirst. It was just an idea (and honestly not one I'm seriously considering).
In any case, I do not think that findAny is faster than findFirst here.
|
Why is this PR against Spark 3.4? |
|
Thanks @ajantha-bhat for catching that this is a backport! @jeesou: Please when backporting tell it in the PR description. Also highlight any changes needed compared to the original PR, so the reviewers could have easier time. |
|
@pvary, @ajantha-bhat I already reviewed this a day before you and if you read my comment, I mentioned that this is a clean backport of my fix. @jeesou as Peter stated, you should state in the description that this is a backport of #12482 (and that I am the author of the fix). I implied it but did not state it explicitly. |
No description provided.