Skip to content

The NOT IN Hash Join is not parallel #31

@avamingli

Description

@avamingli

Cloudberry Database version

No response

What happened

TPCH SQL16 plan shows degradation of performance . One reason: the not in part isn't parallel.

select
        p_brand,
        p_type,
        p_size,
        count(distinct ps_suppkey) as supplier_cnt
from
        partsupp,
        part
where
        p_partkey = ps_partkey
        and p_brand <> 'Brand#13'
        and p_type not like 'MEDIUM POLISHED%'
        and p_size in (40, 23, 12, 20, 48, 24, 21, 29)
        and ps_suppkey not in (
                select
                        s_suppkey
                from
                        supplier
                where
                        s_comment like '%Customer%Complaints%'
        )
group by
        p_brand,
        p_type,
        p_size
order by
        supplier_cnt desc,
        p_brand,
        p_type,
        p_size;


optimizer = off
enable_parallel = on
                                                                                          QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Gather Motion 6:1  (slice1; segments: 6)  (cost=60277.13..60277.37 rows=20 width=44)
   Merge Key: (count(DISTINCT partsupp.ps_suppkey)), part.p_brand, part.p_type, part.p_size
   ->  Sort  (cost=60277.13..60277.13 rows=3 width=44)
         Sort Key: (count(DISTINCT partsupp.ps_suppkey)) DESC, part.p_brand, part.p_type, part.p_size
         ->  GroupAggregate  (cost=60277.02..60277.10 rows=3 width=44)
               Group Key: part.p_brand, part.p_type, part.p_size
               ->  Sort  (cost=60277.02..60277.03 rows=3 width=40)
                     Sort Key: part.p_brand, part.p_type, part.p_size
                     ->  Redistribute Motion 6:6  (slice2; segments: 6)  (cost=48369.17..60276.99 rows=3 width=40)
                           Hash Key: part.p_brand, part.p_type, part.p_size
                           Hash Module: 3
                           ->  Hash Join  (cost=48369.17..60276.92 rows=3 width=40)
                                 Hash Cond: (part.p_partkey = partsupp.ps_partkey)
                                 ->  Parallel Seq Scan on part  (cost=0.00..11718.33 rows=50493 width=40)
                                       Filter: ((p_brand <> 'Brand#13'::bpchar) AND ((p_type)::text !~~ 'MEDIUM POLISHED%'::text) AND (p_size = ANY ('{40,23,12,20,48,24,21,29}'::integer[])))
                                 ->  Hash  (cost=48368.89..48368.89 rows=23 width=8)
                                       ->  Broadcast Motion 3:6  (slice3; segments: 3)  (cost=600.93..48368.89 rows=23 width=8)
                                             ->  Hash Left Anti Semi (Not-In) Join  (cost=600.93..48368.28 rows=8 width=8)
                                                   Hash Cond: (partsupp.ps_suppkey = supplier.s_suppkey)
                                                   ->  Seq Scan on partsupp  (cost=0.00..41100.55 rows=2666455 width=8)
                                                   ->  Hash  (cost=600.80..600.80 rows=10 width=4)
                                                         ->  Broadcast Motion 3:3  (slice4; segments: 3)  (cost=0.00..600.80 rows=10 width=4)
                                                               ->  Seq Scan on supplier  (cost=0.00..600.67 rows=3 width=4)
                                                                     Filter: ((s_comment)::text ~~ '%Customer%Complaints%'::text)

What you think should happen instead

For parallel-aware NOT IN HashJoin, it's hard to tell null values in which batches.
But the parallel-oblivious NOT IN HashJoin is possible as we will build all inner tables first.
Parallel-oblivious Hash Join should be enabled.

How to reproduce

Run TPCH16 SQL or just a NOT IN sql.

Operating System

ubuntu

Anything else

No response

Are you willing to submit PR?

  • Yes, I am willing to submit a PR!

Code of Conduct

Metadata

Metadata

Assignees

Labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions