[SPARK-26741] Analyzer incorrectly resolves aggregate function outside of Aggregate operators


Details

    • Type: Bug
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.0.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

    Description

      The analyzer can sometimes resolve an aggregate function in a place where aggregate functions are not allowed. For example:

      select max(id)
      from range(10)
      group by id
      having count(1) >= 1
      order by max(id)
      

      The analyzed plan of this query is:

      == Analyzed Logical Plan ==
      max(id): bigint
      Project [max(id)#91L]
      +- Sort [max(id#88L) ASC NULLS FIRST], true
         +- Project [max(id)#91L, id#88L]
            +- Filter (count(1)#93L >= cast(1 as bigint))
               +- Aggregate [id#88L], [max(id#88L) AS max(id)#91L, count(1) AS count(1)#93L, id#88L]
                  +- Range (0, 10, step=1, splits=None)
      

      Note how an aggregate function appears outside of any Aggregate operator in the fully analyzed plan: the Sort operator orders by max(id#88L) directly (Sort [max(id#88L) ASC NULLS FIRST], true) instead of referencing the Aggregate's output attribute max(id)#91L, which makes the plan invalid.
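
      This can be reproduced directly from spark-shell. The sketch below assumes a local SparkSession bound to spark; queryExecution.analyzed prints the invalid plan without executing the query:

      // Minimal reproduction sketch, assuming a SparkSession named `spark`
      // (e.g. inside spark-shell).
      val df = spark.sql("""
        select max(id)
        from range(10)
        group by id
        having count(1) >= 1
        order by max(id)
      """)
      println(df.queryExecution.analyzed) // Sort orders by max(id#88L) directly
      df.show()                           // fails at codegen time, see below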

      Trying to run this query fails with a confusing codegen error, but the root cause is in the analyzer:

      java.lang.UnsupportedOperationException: Cannot generate code for expression: max(input[1, bigint, false])
        at org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode(Expression.scala:291)
        at org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode$(Expression.scala:290)
        at org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.doGenCode(interfaces.scala:87)
        at org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:138)
        at scala.Option.getOrElse(Option.scala:138)
        at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:133)
        at org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.$anonfun$createOrderKeys$1(GenerateOrdering.scala:82)
        at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:237)
        at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
        at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
        at scala.collection.TraversableLike.map(TraversableLike.scala:237)
        at scala.collection.TraversableLike.map$(TraversableLike.scala:230)
        at scala.collection.AbstractTraversable.map(Traversable.scala:108)
        at org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.createOrderKeys(GenerateOrdering.scala:82)
        at org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.genComparisons(GenerateOrdering.scala:91)
        at org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.create(GenerateOrdering.scala:152)
        at org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.create(GenerateOrdering.scala:44)
        at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1194)
        at org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.<init>(GenerateOrdering.scala:195)
        at org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.<init>(GenerateOrdering.scala:192)
        at org.apache.spark.sql.execution.TakeOrderedAndProjectExec.executeCollect(limit.scala:153)
        at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3302)
        at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2470)
        at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3291)
        at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
        at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:147)
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:74)
        at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3287)
        at org.apache.spark.sql.Dataset.head(Dataset.scala:2470)
        at org.apache.spark.sql.Dataset.take(Dataset.scala:2684)
        at org.apache.spark.sql.Dataset.getRows(Dataset.scala:262)
        at org.apache.spark.sql.Dataset.showString(Dataset.scala:299)
        at org.apache.spark.sql.Dataset.show(Dataset.scala:753)
        at org.apache.spark.sql.Dataset.show(Dataset.scala:712)
        at org.apache.spark.sql.Dataset.show(Dataset.scala:721)
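
      After analysis, aggregate expressions should only occur inside Aggregate (or Window) operators. The following is a rough sketch of that invariant as a check over a Catalyst plan; the helper name assertNoStrayAggregates is hypothetical, and this is not Spark's actual CheckAnalysis code:

      import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression
      import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LogicalPlan, Window}

      // Hypothetical post-analysis check (not Spark's actual CheckAnalysis):
      // walk the plan and fail if any operator other than Aggregate/Window
      // holds a raw AggregateExpression among its expressions.
      def assertNoStrayAggregates(plan: LogicalPlan): Unit = plan.foreach {
        case _: Aggregate | _: Window => // aggregate functions are legal here
        case other => other.expressions.foreach { expr =>
          expr.foreach {
            case agg: AggregateExpression =>
              throw new IllegalStateException(
                s"Aggregate function ${agg.sql} found outside Aggregate in:\n$other")
            case _ =>
          }
        }
      }

      Running such a check over the analyzed plan above would flag the Sort operator, which is exactly where the bad resolution manifests.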
      

      The test case "SPARK-23957 Remove redundant sort from subquery plan (scalar subquery)" in SubquerySuite has been disabled because it hits this issue, as caught by SPARK-26735. We should re-enable that test once this bug is fixed.
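
      Until the analyzer is fixed, a workaround sketch (untested against this exact bug): alias the aggregate and sort by the alias, so that the Sort operator references the Aggregate's output attribute rather than a fresh aggregate function:

      // Workaround sketch: order by the aggregate's output alias instead of
      // restating the aggregate function in the ORDER BY clause.
      spark.sql("""
        select max(id) as max_id
        from range(10)
        group by id
        having count(1) >= 1
        order by max_id
      """).show()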

People

    Assignee: Unassigned
    Reporter: Kris Mok (rednaxelafx)
    Votes: 0
    Watchers: 7
