[SPARK-39678][SQL] Improve stats estimation for v2 tables#37083
[SPARK-39678][SQL] Improve stats estimation for v2 tables#37083singhpk234 wants to merge 8 commits intoapache:masterfrom
Conversation
wangyum
left a comment
There was a problem hiding this comment.
Could you enable spark.sql.cbo.enabled to estimate row count?
Thanks @wangyum, I am aware of the alternate visitor we use with cbo. I raised this pr considering :
Are you recommending it's an expected behavior / by design ? |
6f683ef to
dcaebec
Compare
|
rebased and regenerated the golden files via :
|
dcaebec to
c5526c2
Compare
c5526c2 to
2175a1a
Compare
5e5e72c to
7d88612
Compare
|
I think it's by design. So enabling |
|
Thanks @wangyum !
I believe then setting
for my knowledge, can you please point me to some jira's ,happy to learn more. Love to know your thoughts on the same, Happy to close this as well if we consider this is not a problem at all. |
|
Can one of the admins verify this patch? |
|
I'm a bit confused. After this PR, what's the difference between |
BasicStatsPlanVisitor additionally takes has columnStats such as (NDV / NullCount / min / max etc) on estimation, which generally is not passed from DSv1 / Dsv2 relation itself. As per my understanding, prior to this PR, SizeInBytesOnlyStatsPlanVisitor was estimating stats on the subset of info i.e only sizeInBytes and BasicStatsPlanVisitor on all 3 info (sizeInBytes, rowcount,ColumStats (min /max /NDV etc), now via this PR SizeInBytesOnlyStatsPlanVisitor is estimating stats on the subset of info but this subset is now (sizeInBytes / rowCount) and BasicStatsPlanVisitor on all 3 info (sizeInBytes, rowcount,ColumStats (min /max /NDV etc). |
|
Maybe we should name them BTW, with CBO off, where do we use row count? |
we use it in places like : where we just multiply row-count with row size. We also use it for BF to create bloomFilterAgg. In v1 scenario in case of logical relation row-count can seep in from catalog stats but as you correctly pointed out it has a has a chance of |
|
OK I think the idea makes sense. With CBO off, the optimizer/planner only needs size in bytes, but row count is also an important statistics to estimate size in bytes, and should be propagated in the stats plan visitor. |
c21
left a comment
There was a problem hiding this comment.
Thanks singhpk234 for the work! Having some comments/questions.
...a/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/AdvancedStatsPlanVisitor.scala
Outdated
Show resolved
Hide resolved
|
|
||
| override def visitIntersect(p: Intersect): Statistics = fallback(p) | ||
|
|
||
| override def visitJoin(p: Join): Statistics = fallback(p) |
There was a problem hiding this comment.
Why we fallback here, but not use JoinEstimation.estimate?
There was a problem hiding this comment.
fallback here would endup calling BasicStatsPlanVisitor.visit(p) which, will in turn call BasicStatsPlanVisitor#visitJoin which will be JoinEstimation(p).estimate.getOrElse(default(p)). Hence added the same.
...a/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/AdvancedStatsPlanVisitor.scala
Outdated
Show resolved
Hide resolved
...cala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/BasicStatsPlanVisitor.scala
Outdated
Show resolved
Hide resolved
|
|
||
| // v2 sources can bubble-up rowCount, so always propagate. | ||
| // Don't propagate attributeStats, since they are not estimated here. | ||
| Statistics(sizeInBytes = sizeInBytes, rowCount = p.child.stats.rowCount) |
There was a problem hiding this comment.
I am confused here. In the top-level comment - computes a single dimension for plan stats: size in bytes. But why we populate rowCount as well here?
There was a problem hiding this comment.
In this estimator i.e visitUnaryNode, we adjust the size by scaling it by (input row size / output row size) but since we don't have much info (in terms of min / max / ndv etc) to estimate the row count we just say the node output's child output row count which is mostly true for operators like project etc.
Since we were just computing sizeInBytes and just propagating rowCounts as it is.
Appologies I forgot to update the comment as per proposed behaviour.
Should I rephrase it to:
estimates size in bytes, row count for plan stats
There was a problem hiding this comment.
for dsv2 sources rowCount can be passed from the relation itself without running analyze, hence BasicStatsPlanVisitor which will be our default now, post this change will take rowcount into consideration.
| +- ReusedExchange (28) | ||
| TakeOrderedAndProject (37) | ||
| +- * Project (36) | ||
| +- * SortMergeJoin Inner (35) |
There was a problem hiding this comment.
is this expected ? looks like a plan regression to me
thought these stats are available in AQE and more accurate though |
|
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
What changes were proposed in this pull request?
We should propagate the row count stats in SizeInBytesOnlyStatsPlanVisitor if available. Row counts are propagated from connectors to spark in case of v2 tables.
Why are the changes needed?
This can improve stats estimation for v2 tables, since row count is used at places to estimate sizeInBytes.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Modified existing UT's to match the proposed behavior.