[SPARK-39678][SQL] Improve stats estimation for v2 tables by singhpk234 · Pull Request #37083 · apache/spark

singhpk234 · 2022-07-05T08:26:54Z

What changes were proposed in this pull request?

We should propagate the row count stats in SizeInBytesOnlyStatsPlanVisitor if available. Row counts are propagated from connectors to spark in case of v2 tables.

Why are the changes needed?

This can improve stats estimation for v2 tables, since row count is used at places to estimate sizeInBytes.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Modified existing UT's to match the proposed behavior.

singhpk234 · 2022-07-05T08:27:35Z

cc @huaxingao @cloud-fan @wangyum

wangyum

Could you enable spark.sql.cbo.enabled to estimate row count?

spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/LogicalPlanStats.scala

Lines 33 to 40 in b1d719e

    def stats: Statistics = statsCache.getOrElse {
      if (conf.cboEnabled) {
        statsCache = Option(BasicStatsPlanVisitor.visit(self))
      } else {
        statsCache = Option(SizeInBytesOnlyStatsPlanVisitor.visit(self))
      }
      statsCache.get
    }
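As an editorial illustration of the snippet above, here is a minimal, self-contained sketch of a cached stats lookup that picks a visitor based on a CBO flag. `Stats`, `PlanNode`, and the function-valued visitor parameters are simplified stand-ins introduced for this sketch; they are not Spark's actual classes.

```scala
// Simplified model of a cached, flag-dependent stats lookup.
// Stats and PlanNode are hypothetical stand-ins, not Spark's real types.
final case class Stats(sizeInBytes: BigInt, rowCount: Option[BigInt] = None)

final class PlanNode(
    cboEnabled: Boolean,
    cboVisitor: () => Stats,      // stands in for BasicStatsPlanVisitor.visit(self)
    sizeOnlyVisitor: () => Stats  // stands in for SizeInBytesOnlyStatsPlanVisitor.visit(self)
) {
  private var statsCache: Option[Stats] = None

  // Compute the stats once with the visitor selected by the flag, then reuse them.
  def stats: Stats = statsCache.getOrElse {
    val computed = if (cboEnabled) cboVisitor() else sizeOnlyVisitor()
    statsCache = Some(computed)
    computed
  }
}
```

With the flag off, only the size-only function is ever invoked, and repeated calls to `stats` return the cached value.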

singhpk234 · 2022-07-05T09:10:51Z

> Could you enable spark.sql.cbo.enabled to estimate row count?

Thanks @wangyum, I am aware of the alternate visitor we use with CBO.

I raised this PR considering the following:

CBO is turned off by default.

rowCount is already propagated via leaf nodes (DSv2Relation) and is used to estimate output size in SizeInBytesOnlyStatsPlanVisitor:

spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/SizeInBytesOnlyStatsPlanVisitor.scala

Lines 93 to 100 in 161c596

    override def visitOffset(p: Offset): Statistics = {
      val offset = p.offsetExpr.eval().asInstanceOf[Int]
      val childStats = p.child.stats
      val rowCount: BigInt = childStats.rowCount.map(c => c - offset).map(_.max(0)).getOrElse(0)
      Statistics(
        sizeInBytes = EstimationUtils.getOutputSize(p.output, rowCount, childStats.attributeStats),
        rowCount = Some(rowCount))
    }

ANALYZE is not supported for v2 tables, so apart from row count, IMHO we can't have NDV etc. I am referring to this JIRA: https://issues.apache.org/jira/browse/SPARK-39420

As per my understanding, v1 tables can only pass in sizeInBytes unless they have stats in the catalog, whereas v2 tables already provide both sizeInBytes and row count from the relation itself. Hence I thought row count was unaccounted for in the case of v2 tables.

spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/LogicalRelation.scala

Lines 43 to 45 in 161c596

    catalogTable
      .flatMap(_.stats.map(_.toPlanStats(output, conf.cboEnabled || conf.planStatsEnabled)))
      .getOrElse(Statistics(sizeInBytes = relation.sizeInBytes))

Are you suggesting that this is expected behavior, i.e. by design?
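To make the visitOffset arithmetic quoted above concrete, here is a small worked sketch. It assumes a fixed, hypothetical per-row width (`rowSize`); the real helper derives the row width from the output attributes and their statistics, which this sketch does not model.

```scala
// Worked example of the visitOffset row-count arithmetic.
// OffsetEstimate and rowSize are hypothetical names used only for illustration.
object OffsetEstimate {
  // Rows remaining after skipping `offset` rows, floored at zero.
  // A missing child row count is treated as zero, as in the quoted snippet.
  def rowsAfterOffset(childRows: Option[BigInt], offset: Int): BigInt =
    childRows.map(c => c - offset).map(_.max(0)).getOrElse(BigInt(0))

  // Size estimate as row count times an assumed fixed row width in bytes.
  def sizeInBytes(rows: BigInt, rowSize: Int): BigInt = rows * rowSize
}
```

For example, a child with 100 rows and an offset of 30 leaves 70 rows; at an assumed 16 bytes per row that is 1120 bytes. An offset larger than the child's row count yields 0 rows rather than a negative number.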

singhpk234 · 2022-07-05T11:21:22Z

Rebased and regenerated the golden files via:

SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly PlanStabilitySuite"
SPARK_GENERATE_GOLDEN_FILES=1 SPARK_ANSI_SQL_MODE=true build/sbt "sql/testOnly PlanStabilitySuite"

wangyum · 2022-07-06T00:58:52Z

I think it's by design. So enabling spark.sql.cbo.enabled is what you want?

singhpk234 · 2022-07-06T05:30:32Z

Thanks @wangyum!

> So enabling spark.sql.cbo.enabled is what you want?

I believe setting spark.sql.cbo.enabled to true by default could help (what I wanted was for the row count bubbled up from the v2 connector to be accounted for in default Spark behaviour), but I think it requires some additional effort, since other defaults such as auto-BHJ would need to be adjusted accordingly.

> I think it's by design

For my knowledge, can you please point me to some JIRAs? Happy to learn more.

I would love to know your thoughts on this. Happy to close this as well if we consider it not a problem at all.

AmplabJenkins · 2022-07-06T08:06:16Z

Can one of the admins verify this patch?

cloud-fan · 2022-07-08T06:37:34Z

I'm a bit confused. After this PR, what's the difference between SizeInBytesOnlyStatsPlanVisitor and BasicStatsPlanVisitor?

singhpk234 · 2022-07-08T07:22:05Z

> After this PR, what's the difference between SizeInBytesOnlyStatsPlanVisitor and BasicStatsPlanVisitor

BasicStatsPlanVisitor additionally takes column stats (NDV / null count / min / max etc.) into account during estimation, which are generally not passed from the DSv1 / DSv2 relation itself.

As per my understanding, prior to this PR SizeInBytesOnlyStatsPlanVisitor estimated stats from a subset of the info, i.e. only sizeInBytes, while BasicStatsPlanVisitor used all three (sizeInBytes, rowCount, column stats such as min / max / NDV). With this PR, SizeInBytesOnlyStatsPlanVisitor still estimates from a subset, but that subset is now (sizeInBytes, rowCount); BasicStatsPlanVisitor continues to use all three.

cloud-fan · 2022-07-08T07:33:51Z

Maybe we should name them BasicStatsPlanVisitor and AdvancedStatsPlanVisitor. We also need to make sure the updated SizeInBytesOnlyStatsPlanVisitor can propagate row count properly in all cases.

BTW, with CBO off, where do we use row count?

singhpk234 · 2022-07-08T08:16:44Z

> BTW, with CBO off, where do we use row count?

We use it in places like:

spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/SizeInBytesOnlyStatsPlanVisitor.scala

Lines 93 to 100 in 161c596

    override def visitOffset(p: Offset): Statistics = {
      val offset = p.offsetExpr.eval().asInstanceOf[Int]
      val childStats = p.child.stats
      val rowCount: BigInt = childStats.rowCount.map(c => c - offset).map(_.max(0)).getOrElse(0)
      Statistics(
        sizeInBytes = EstimationUtils.getOutputSize(p.output, rowCount, childStats.attributeStats),
        rowCount = Some(rowCount))
    }

where we just multiply row count by row size. We also use it for BF (bloom filter) to create bloomFilterAgg. In the v1 scenario, for a logical relation, row count can seep in from catalog stats, but as you correctly pointed out, the row count can get lost in places where we assume we only have sizeInBytes, for example here:

spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/SizeInBytesOnlyStatsPlanVisitor.scala

Lines 54 to 58 in 161c596

    override def default(p: LogicalPlan): Statistics = p match {
      case p: LeafNode => p.computeStats()
      case _: LogicalPlan =>
        Statistics(sizeInBytes = p.children.map(_.stats.sizeInBytes).filter(_ > 0L).product)
    }
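The non-leaf branch of `default` quoted above can be sketched in isolation to show where the row count is lost. `Stats` and `DefaultEstimate` are simplified stand-ins introduced for this sketch, not Spark's classes.

```scala
// Simplified model of the non-leaf branch of `default`.
// Stats and DefaultEstimate are hypothetical names, not Spark's real types.
final case class Stats(sizeInBytes: BigInt, rowCount: Option[BigInt] = None)

object DefaultEstimate {
  // Multiply the positive child sizes together; no rowCount is set on the result,
  // so any row count the children carried is dropped at this node.
  def default(children: Seq[Stats]): Stats =
    Stats(sizeInBytes = children.map(_.sizeInBytes).filter(_ > 0L).product)
}
```

Two children of 100 and 20 bytes, each carrying a row count, produce an estimate of 2000 bytes with `rowCount = None`.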

cloud-fan · 2022-07-08T08:45:51Z

OK, I think the idea makes sense. With CBO off, the optimizer/planner only needs size in bytes, but row count is also an important statistic for estimating size in bytes, and should be propagated in the stats plan visitor.

cloud-fan · 2022-07-11T05:13:56Z

cc @wzhfy @c21 can you take a look first?

c21

Thanks singhpk234 for the work! I have some comments/questions.

c21 · 2022-07-11T23:05:38Z

...a/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/AdvancedStatsPlanVisitor.scala

+
+  override def visitIntersect(p: Intersect): Statistics = fallback(p)
+
+  override def visitJoin(p: Join): Statistics = fallback(p)


Why do we fall back here instead of using JoinEstimation.estimate?

singhpk234 · 2022-07-12

fallback here ends up calling BasicStatsPlanVisitor.visit(p), which in turn calls BasicStatsPlanVisitor#visitJoin, which is JoinEstimation(p).estimate.getOrElse(default(p)). Hence I added the same.
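The delegation described in this reply can be pictured with a toy model. Everything here is hypothetical: the `Plan` cases, the two visitor objects, and the string labels exist only to show which handler ends up running, and they do not reproduce the PR's actual visitor code.

```scala
// Toy model of one visitor delegating selected node types to another.
// All names and labels are hypothetical and used only for illustration.
sealed trait Plan
case object Join extends Plan
case object Intersect extends Plan
case object Project extends Plan

object BasicVisitor {
  // Returns a label naming the handler that produced the estimate.
  def visit(p: Plan): String = p match {
    case Join      => "basic.visitJoin"
    case Intersect => "basic.visitIntersect"
    case Project   => "basic.visitProject"
  }
}

object AdvancedVisitor {
  // fallback hands the node to the basic visitor, which dispatches on the node type again.
  private def fallback(p: Plan): String = BasicVisitor.visit(p)

  def visit(p: Plan): String = p match {
    case Join      => fallback(p)
    case Intersect => fallback(p)
    case Project   => "advanced.visitProject"
  }
}
```

In this model, visiting a `Join` through `AdvancedVisitor` is handled by the basic visitor's join handler, which is the chain the reply describes.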

c21 · 2022-07-11T23:10:03Z

...cala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/BasicStatsPlanVisitor.scala

+
+    // v2 sources can bubble-up rowCount, so always propagate.
+    // Don't propagate attributeStats, since they are not estimated here.
+    Statistics(sizeInBytes = sizeInBytes, rowCount = p.child.stats.rowCount)


I am confused here. The top-level comment says this visitor computes a single dimension for plan stats: size in bytes. So why do we populate rowCount here as well?

singhpk234 · 2022-07-12

In this estimator, i.e. visitUnaryNode, we adjust the size by scaling it by (output row size / input row size). Since we don't have enough info (min / max / NDV etc.) to estimate the row count, we just report the child's row count as the node's row count, which is mostly true for operators like Project.

So we were only computing sizeInBytes and propagating rowCount as is.
Apologies, I forgot to update the comment as per the proposed behaviour.

Should I rephrase it to:

estimates size in bytes, row count for plan stats

For DSv2 sources, rowCount can be passed from the relation itself without running ANALYZE; hence BasicStatsPlanVisitor, which will be our default after this change, will take rowCount into consideration.
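A minimal sketch of the unary-node estimate discussed in this thread: scale the child's size by the ratio of output row width to child row width, and pass the child's row count through unchanged. `Stats`, `UnaryNodeEstimate`, and the explicit row-width parameters are assumptions made for this sketch; in Spark the row widths come from the plan's attributes, and the floor of 1 byte is modeled here as an assumption about how a zero estimate is avoided.

```scala
// Simplified model of a unary-node size estimate that also propagates row count.
// Stats and UnaryNodeEstimate are hypothetical names, not Spark's real types.
final case class Stats(sizeInBytes: BigInt, rowCount: Option[BigInt] = None)

object UnaryNodeEstimate {
  def estimate(child: Stats, childRowSize: Int, outputRowSize: Int): Stats = {
    require(childRowSize > 0, "childRowSize must be positive")
    // Assume the node emits as many rows as its child, each with the output row width.
    val scaled = (child.sizeInBytes * outputRowSize) / childRowSize
    // Assumption: keep the estimate non-zero so later products of sizes do not collapse to zero.
    val size = if (scaled.signum == 0) BigInt(1) else scaled
    Stats(sizeInBytes = size, rowCount = child.rowCount)
  }
}
```

For a child of 1000 bytes and 50 rows, narrowing rows from 20 bytes to 8 bytes gives an estimate of 400 bytes while the row count stays at 50.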

zinking · 2022-07-25T02:43:32Z

sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q44.sf100/explain.txt

-      +- ReusedExchange (28)
+TakeOrderedAndProject (37)
+- * Project (36)
+   +- * SortMergeJoin Inner (35)


Is this expected? It looks like a plan regression to me.

zinking · 2022-07-25T02:53:13Z

> BTW, with CBO off, where do we use row count?
>
> we use it in places like:
>
> spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/SizeInBytesOnlyStatsPlanVisitor.scala
>
> Lines 93 to 100 in 161c596
>
>     override def visitOffset(p: Offset): Statistics = {
>       val offset = p.offsetExpr.eval().asInstanceOf[Int]
>       val childStats = p.child.stats
>       val rowCount: BigInt = childStats.rowCount.map(c => c - offset).map(_.max(0)).getOrElse(0)
>       Statistics(
>         sizeInBytes = EstimationUtils.getOutputSize(p.output, rowCount, childStats.attributeStats),
>         rowCount = Some(rowCount))
>     }
>
> where we just multiply row-count with row size. We also use it for BF to create bloomFilterAgg. In v1 scenario in case of logical relation row-count can seep in from catalog stats but as you correctly pointed out it has a chance of row-count getting lost in places where we assume we only have sizeInBytes for example here:
>
> spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/SizeInBytesOnlyStatsPlanVisitor.scala
>
> Lines 54 to 58 in 161c596
>
>     override def default(p: LogicalPlan): Statistics = p match {
>       case p: LeafNode => p.computeStats()
>       case _: LogicalPlan =>
>         Statistics(sizeInBytes = p.children.map(_.stats.sizeInBytes).filter(_ > 0L).product)
>     }

I thought these stats were available in AQE, and more accurate there, though.

github-actions · 2022-11-03T00:24:04Z

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

github-actions bot added the SQL label Jul 5, 2022

wangyum reviewed Jul 5, 2022

singhpk234 force-pushed the fix/stats_estimation_for_v2_sources branch from 6f683ef to dcaebec Compare July 5, 2022 11:10

singhpk234 force-pushed the fix/stats_estimation_for_v2_sources branch from dcaebec to c5526c2 Compare July 5, 2022 13:53

Prashant Singh added 4 commits July 5, 2022 19:32

Improve stats estimation for v2 tables

54f7cfa

generate golden files w.r.t the change

3548145

fix ut failure

8effb5b

fix union estimation

2175a1a

singhpk234 force-pushed the fix/stats_estimation_for_v2_sources branch from c5526c2 to 2175a1a Compare July 5, 2022 14:26

dont use cbo join estimation

7d88612

singhpk234 force-pushed the fix/stats_estimation_for_v2_sources branch from 5e5e72c to 7d88612 Compare July 5, 2022 17:18

Prashant Singh added 2 commits July 9, 2022 14:11

Address PR feedback - BasicStatesPlanVisitor and AdvancedStatsPlanVisitor

c88c2c2

Golden files for improved join size estimation

436ebba

c21 reviewed Jul 11, 2022

adress review feeback - round2

a49aaed

zinking reviewed Jul 25, 2022

github-actions bot added the Stale label Nov 3, 2022

github-actions bot closed this Nov 5, 2022

singhpk234 mentioned this pull request Jul 11, 2024

Support Spark Column Stats apache/iceberg#10659

Merged
