Skip to content

Validate SummungMergeTree columns to sum are not in sort key, fix partition key validation. Fix auto resolve of summing cols.#78022

Merged
GrigoryPervakov merged 4 commits intoClickHouse:masterfrom
GrigoryPervakov:summing-keys
Apr 4, 2025
Merged

Validate SummungMergeTree columns to sum are not in sort key, fix partition key validation. Fix auto resolve of summing cols.#78022
GrigoryPervakov merged 4 commits intoClickHouse:masterfrom
GrigoryPervakov:summing-keys

Conversation

@GrigoryPervakov
Copy link
Copy Markdown
Member

@GrigoryPervakov GrigoryPervakov commented Mar 20, 2025

Closes: #76866

Except mentioned issue, it fixes cases like, when merged rows end up in the wrong partition.

CREATE TABLE t (key Int32, val Int32) Engine=SummingMergeTree PARTITION BY (key % 2) ORDER BY (key + 0);
INSERT INTO t VALUES (1, 1), (1, 2), (2, 3);
SELECT key % 2, _partition_id FROM t;

Changelog category (leave one):

  • Backward Incompatible Change

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Enhance SummingMergeTree validation to skip aggregation for columns used in partition or sort keys.

Documentation entry for user-facing changes

  • [*] Documentation is written (mandatory for new features)

@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented Mar 20, 2025

Workflow [PR], commit [7d73754]

@clickhouse-gh clickhouse-gh bot added the pr-improvement Pull request with some product improvements label Mar 20, 2025
@GrigoryPervakov GrigoryPervakov added the can be tested Allows running workflows for external contributors label Mar 20, 2025
@den-crane
Copy link
Copy Markdown
Contributor

den-crane commented Mar 20, 2025

BTW it's possible to use SimpleAggregateFunction
https://fiddle.clickhouse.com/6e0087d5-1aee-4758-a870-badae7795d01
It also should be forbidden

CREATE TABLE t0 (c0 SimpleAggregateFunction(sum, Int64)) 
  ENGINE = SummingMergeTree() ORDER BY (c0%1);

INSERT INTO TABLE t0 (c0) VALUES (2),(1),(2);

OPTIMIZE TABLE t0 FINAL;

SELECT * FROM t0;

┌─c0─┐
│  5 │
└────┘

@den-crane
Copy link
Copy Markdown
Contributor

Ah, the same issue with AggregatingMergeTree

https://fiddle.clickhouse.com/86457454-d0f1-4ee9-aa6b-92557feb7b3f

CREATE TABLE t0 (c0 SimpleAggregateFunction(sum, Int64)) 
  ENGINE = AggregatingMergeTree() ORDER BY (c0%1);

INSERT INTO TABLE t0 (c0) VALUES (2),(1),(2);

OPTIMIZE TABLE t0 FINAL;

SELECT * FROM t0;


 ┌─c0─┐
 │  5 │
 └────┘

@GrigoryPervakov
Copy link
Copy Markdown
Member Author

BTW it's possible to use SimpleAggregateFunction https://fiddle.clickhouse.com/6e0087d5-1aee-4758-a870-badae7795d01 It also should be forbidden

CREATE TABLE t0 (c0 SimpleAggregateFunction(sum, Int64)) 
  ENGINE = SummingMergeTree() ORDER BY (c0%1);

INSERT INTO TABLE t0 (c0) VALUES (2),(1),(2);

OPTIMIZE TABLE t0 FINAL;

SELECT * FROM t0;

┌─c0─┐
│  5 │
└────┘

This one works fine after the patch, it returns single line 2 as expected

@den-crane
Copy link
Copy Markdown
Contributor

den-crane commented Mar 21, 2025

This one works fine after the patch, it returns single line 2 as expected

No, no, no. It should be forbidden { serverError BAD_ARGUMENTS }

It should behave the same as

CREATE TABLE t0 (c0 SimpleAggregateFunction(sum, Int64))  ENGINE = SummingMergeTree() ORDER BY (c0);

DB::Exception: Column with type SimpleAggregateFunction(sum, Int64) is not allowed in key expression. (DATA_TYPE_CANNOT_BE_USED_IN_KEY)

@GrigoryPervakov
Copy link
Copy Markdown
Member Author

No, no, no. It should be forbidden { serverError BAD_ARGUMENTS }

I am not sure, it's a different check on column comparability, wrapping it with expression casts it to the comparable type, and then it seems to work fine. With that patch in such case, SummingMergeTree won't mistakenly aggregate it, so despite it being a strange choice for the sorting key, it should work ok

@den-crane
Copy link
Copy Markdown
Contributor

den-crane commented Mar 21, 2025

OK. Fine. From my POV there is no difference:

CREATE TABLE t0 (c0 SimpleAggregateFunction(sum, Int64)) ENGINE = SummingMergeTree() ORDER BY (c0%1);

CREATE TABLE t0 (c0 Int64) ENGINE = SummingMergeTree() ORDER BY (c0%1);

@GrigoryPervakov
Copy link
Copy Markdown
Member Author

OK. Fine. From my POV there is no difference:

CREATE TABLE t0 (c0 SimpleAggregateFunction(sum, Int64)) ENGINE = SummingMergeTree() ORDER BY (c0%1);

CREATE TABLE t0 (c0 Int64) ENGINE = SummingMergeTree() ORDER BY (c0%1);

From my POV too, and both are valid and behave equally. The only difference is that in the second case, you can use SumMerge on c0 in a select query.

SimpleAggregateFunction won't be magically changed by merge, so rows. would be in the right order

@den-crane
Copy link
Copy Markdown
Contributor

den-crane commented Mar 21, 2025

SimpleAggregateFunction won't be magically changed by merge, so rows. would be in the right order

It obviously leads to a corruption of a Primary Key and must be forbidden.

select * from t0 where c0 = 9945;
0 rows in set. Elapsed: 0.005 sec.

select * from t0 where identity(c0) = 9945;
   ┌───c0─┐
1. │ 9945 │
   └──────┘

https://fiddle.clickhouse.com/a6541456-a4d4-4d5e-afc5-1c940dbaf3c2

@GrigoryPervakov
Copy link
Copy Markdown
Member Author

SimpleAggregateFunction won't be magically changed by merge, so rows. would be in the right order

It obviously leads to a corruption of a Primary Key and must be forbidden.

https://fiddle.clickhouse.com/a6541456-a4d4-4d5e-afc5-1c940dbaf3c2

In your example Int64 and SimpleAggregateFunction(sum, Int64) behave the same way with the same result. PK corrupted because column c0 is aggregated, however, it shouldn't and this PR fixes exactly that, for both cases.

@GrigoryPervakov GrigoryPervakov force-pushed the summing-keys branch 3 times, most recently from 2cbac96 to e6ce234 Compare March 28, 2025 14:21
@jkartseva jkartseva self-assigned this Mar 29, 2025
addSettingsChanges(merge_tree_settings_changes_history, "25.4",
{
{"max_merge_delayed_streams_for_parallel_write", 1000, 100, "New setting"},
{"allow_summing_columns_in_partition_or_order_key", true, false, "Don't allow summing of partition or sorting key columns"},
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's specify that this is a new setting and not a change of the value of an existing setting:
New setting to allow summing...
The information that the behavior is changed from allowing to denying is stated in true to false.

Comment on lines +1245 to +1247
std::set_intersection(columns_to_sum.begin(), columns_to_sum.end(),
sorting_key_columns.begin(), sorting_key_columns.end(),
std::back_inserter(names_intersection));
Copy link
Copy Markdown
Member

@jkartseva jkartseva Apr 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are both columns_to_sum and sorting_key_columns ranges sorted to use std::set_intersection?

In the above snippet, sorting is applied to check for duplicates:

        auto columns_to_sum_copy = columns_to_sum;
        std::sort(columns_to_sum_copy.begin(), columns_to_sum_copy.end());
        if (const auto it = std::adjacent_find(columns_to_sum_copy.begin(), columns_to_sum_copy.end()); it != columns_to_sum_copy.end())
            throw Exception(ErrorCodes::BAD_ARGUMENTS, "Column {} is listed multiple times in the list of columns to sum", *it);

This implies that columns_to_sum may not be sorted.

We should use columns_to_sum_copy instead of columns_to_sum, and it's probably worth renaming it to columns_to_sum_sorted.
The type of sorting_key_columns is a vector of strings, I'm not sure about the ordering.
https://github.com/ClickHouse/ClickHouse/blob/master/src/Core/Names.h#L13

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Member

@jkartseva jkartseva left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

@GrigoryPervakov GrigoryPervakov added this pull request to the merge queue Apr 4, 2025
Merged via the queue into ClickHouse:master with commit 4d2e00b Apr 4, 2025
116 of 119 checks passed
@GrigoryPervakov GrigoryPervakov deleted the summing-keys branch April 4, 2025 17:05
@robot-ch-test-poll4 robot-ch-test-poll4 added the pr-synced-to-cloud The PR is synced to the cloud repo label Apr 4, 2025
baibaichen pushed a commit to Kyligence/gluten that referenced this pull request Apr 5, 2025
baibaichen pushed a commit to apache/gluten that referenced this pull request Apr 5, 2025
* [GLUTEN-1632][CH]Daily Update Clickhouse Version (20250405)

* Fix Build due to ClickHouse/ClickHouse#78515

* remove temp code due to #9192

* Fix test fail due to ClickHouse/ClickHouse#78022

---------

Co-authored-by: kyligence-git <[email protected]>
Co-authored-by: Chang chen <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors pr-improvement Pull request with some product improvements pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Logical error: Sort order of blocks violated for column

4 participants