Skip to content

partial_merge_join algorithm lead to duplicated rows #31009

@alesapin

Description

@alesapin

How to reproduce:

CREATE TABLE left
(
    key UInt32,
    value String
)
ENGINE = MergeTree ORDER BY key;

CREATE TABLE right
(
    key UInt32,
    value String
)
ENGINE = MergeTree ORDER BY tuple();

-- super secrect amount of numbers stolen from production
INSERT INTO left SELECT number, toString(number) FROM numbers(25367182);

INSERT INTO right SELECT number, toString(number) FROM numbers(23124707);

set join_algorithm='partial_merge_join';

SELECT
    key,
    count(1) AS cnt
FROM
(
    SELECT data.key
    FROM
    (
        SELECT key
        FROM left AS s
    ) AS data
    LEFT JOIN
    (
        SELECT key
        FROM right
        GROUP BY key
    ) AS promo ON promo.key = data.key
)
GROUP BY key
HAVING count(1) > 1
ORDER BY key ASC
LIMIT 10

Query id: 7d5f5792-f9f5-4766-b24d-649ec470d2ff

┌─────key─┬─cnt─┐
│  458752 │   2 │
│ 1835008 │   2 │
│ 3670016 │   2 │
│ 4128768 │   2 │
│ 4587520 │   2 │
│ 5046272 │   2 │
│ 5505024 │   2 │
│ 6487878 │   2 │
│ 6946630 │   2 │
│ 7405258 │   2 │
└─────────┴─────┘

Obviously SELECT query must return nothing because we don't have duplicated values in both tables.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugConfirmed user-visible misbehaviour in official releasecomp-joinsJOINs end-to-end (planning hooks + runtime join operators/algorithms). Single bucket to avoid pla...

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions