Skip to content

Cost-based choice between nested loop join and hash join #7331

@dyemanov

Description

@dyemanov

Currently, only a nested loop join is used when streams have indexed relationships. But often this is sub-optimal and causes the dependent streams to be fetched more times than actually necessary. Hash join could be a better choice in these cases. Of course, cast-based approach should be used when choosing between the possible join algorithms.

Just an example from the TPC-R test suite:

select first 10
  l_orderkey, o_orderdate, o_shippriority,
  sum(l_extendedprice * (1 - l_discount)) as revenue,
from
  customer, orders, lineitem
where
  c_mktsegment = 'BUILDING'
  and c_custkey = o_custkey
  and l_orderkey = o_orderkey
  and o_orderdate < date '1995-03-15'
  and l_shipdate > date '1995-03-15'
group by
  l_orderkey, o_orderdate, o_shippriority
order by
  2 desc, o_orderdate;

PLAN SORT (
  SORT (
    JOIN (
      CUSTOMER NATURAL,
      ORDERS INDEX (ORDERS_CUSTKEY),
      LINEITEM INDEX (LINEITEM_PK))))

Elapsed time = 1.600 sec

vs

PLAN SORT (
  SORT (
    JOIN (
      HASH (
        ORDERS INDEX (ORDERS_ORDERDATE),
        CUSTOMER NATURAL),
      LINEITEM INDEX (LINEITEM_PK))))

Elapsed time = 1.031 sec

Metadata

Metadata

Assignees

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions