Skip to content

[Bug] Optimizer could choose HashAggregate to dedup based on cost. #659

@my-ship-it

Description

@my-ship-it

Cloudberry Database version

main

What happened

Optimizer could produce better plan to dedup based on cost.

What you think should happen instead

No response

How to reproduce

For the planner, we need to create better plan which uses HashAggregate to dedup columns.
For example,

ebi_dwh=# explain analyze WITH UniqueIDs AS (
    SELECT DISTINCT gsp_id
    FROM eft_com_call_statistic
)
SELECT COUNT(*)
FROM UniqueIDs;
                                                                       QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=0.00..1419.47 rows=1 width=8) (actual time=2510.079..2510.081 rows=1 loops=1)
   ->  Gather Motion 4:1  (slice1; segments: 4)  (cost=0.00..1419.47 rows=1 width=8) (actual time=1998.786..2510.054 rows=4 loops=1)
         ->  Partial Aggregate  (cost=0.00..1419.47 rows=1 width=8) (actual time=1993.802..1993.803 rows=1 loops=1)
               ->  HashAggregate  (cost=0.00..1419.47 rows=2315005 width=1) (actual time=1294.980..1855.013 rows=2339996 loops=1)
                     Group Key: gsp_id
                     ->  Seq Scan on eft_com_call_statistic  (cost=0.00..969.33 rows=2375680 width=37) (actual time=0.343..339.551 rows=2377251 loops=1)
 Planning Time: 5.357 ms
   (slice0)    Executor memory: 98354K bytes.
   (slice1)    Executor memory: 244665K bytes avg x 4x(0) workers, 244755K bytes max (seg2).  Work_mem: 286737K bytes max.
 Memory used:  2097152kB
 Optimizer: Pivotal Optimizer (GPORCA)
 Execution Time: 2584.618 ms
(12 rows)

This plan uses HashAggregate to perform gsp_id deduplication, while

 ebi_dwh=# explain analyze select count(distinct gsp_id) from eft_com_call_statistic;
                                                                    QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=0.00..1142.67 rows=1 width=8) (actual time=10573.416..10573.418 rows=1 loops=1)
   ->  Gather Motion 4:1  (slice1; segments: 4)  (cost=0.00..1142.67 rows=1 width=8) (actual time=10205.645..10573.387 rows=4 loops=1)
         ->  Partial Aggregate  (cost=0.00..1142.67 rows=1 width=8) (actual time=10437.005..10437.007 rows=1 loops=1)
               ->  Seq Scan on eft_com_call_statistic  (cost=0.00..969.33 rows=2375680 width=37) (actual time=0.269..548.673 rows=2377251 loops=1)
 Planning Time: 4.608 ms
   (slice0)    Executor memory: 37K bytes.
   (slice1)    Executor memory: 148480K bytes avg x 4x(0) workers, 148579K bytes max (seg2).
 Memory used:  2097152kB
 Optimizer: Pivotal Optimizer (GPORCA)
 Execution Time: 10573.853 ms
(10 rows)

It seems the cost estimation is no accurate.

Operating System

No specific

Anything else

No

Are you willing to submit PR?

  • Yes, I am willing to submit a PR!

Code of Conduct

Metadata

Metadata

Labels

type: BugSomething isn't workingtype: Performancecloudberry runs slow on some particular query

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions