Skip to content

OPTIMIZE DEDUPLICATE BY COLUMNS#17846

Merged
alexey-milovidov merged 13 commits intoClickHouse:masterfrom
Enmk:Optimize_deduplicate
Dec 20, 2020
Merged

OPTIMIZE DEDUPLICATE BY COLUMNS#17846
alexey-milovidov merged 13 commits intoClickHouse:masterfrom
Enmk:Optimize_deduplicate

Conversation

@Enmk
Copy link
Copy Markdown
Contributor

@Enmk Enmk commented Dec 6, 2020

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Extended OPTIMIZE ... DEDUPLICATE syntax to allow explicit (or implicit with asterisk/column transformers) list of columns to check for duplicates on.
...

Detailed description / Documentation draft:
Following syntax variants are now supported:

OPTIMIZE TABLE table DEDUPLICATE; -- the old one
OPTIMIZE TABLE table DEDUPLICATE BY *; -- not the same as the old one, excludes MATERIALIZED columns (see the note below)
OPTIMIZE TABLE table DEDUPLICATE BY * EXCEPT colX;
OPTIMIZE TABLE table DEDUPLICATE BY * EXCEPT (colX, colY);
OPTIMIZE TABLE table DEDUPLICATE BY col1,col2,col3;
OPTIMIZE TABLE table DEDUPLICATE BY COLUMNS('column-matched-by-regex');
OPTIMIZE TABLE table DEDUPLICATE BY COLUMNS('column-matched-by-regex') EXCEPT colX;
OPTIMIZE TABLE table DEDUPLICATE BY COLUMNS('column-matched-by-regex') EXCEPT (colX, colY);

Note that * behaves just like in SELECT: MATERIALIZED, and ALIAS columns are not used for expansion.
Also, it is an error to specify empty list of columns, or write an expression that results in an empty list of columns, or deduplicate by an ALIAS column.
Column transformers other than EXCEPT are not supported.

Please see tests for examples.
...

@robot-clickhouse robot-clickhouse added doc-alert pr-feature Pull request with new product feature labels Dec 6, 2020
@Enmk Enmk changed the title Optimize deduplicate OPTIMIZE DEDUPLICATE BY COLUMNS Dec 6, 2020
Extended OPTIMIZE ... DEDUPLICATE syntax to allow explicit (or implicit with asterisk/column transformers) list of columns to check for duplicates on.

Following syntax variants are now supported:

OPTIMIZE TABLE table DEDUPLICATE; -- the old one
OPTIMIZE TABLE table DEDUPLICATE BY *;
OPTIMIZE TABLE table DEDUPLICATE BY * EXCEPT colX;
OPTIMIZE TABLE table DEDUPLICATE BY * EXCEPT (colX, colY);
OPTIMIZE TABLE table DEDUPLICATE BY col1,col2,col3;
OPTIMIZE TABLE table DEDUPLICATE BY COLUMNS('column-matched-by-regex');
OPTIMIZE TABLE table DEDUPLICATE BY COLUMNS('column-matched-by-regex') EXCEPT colX;
OPTIMIZE TABLE table DEDUPLICATE BY COLUMNS('column-matched-by-regex') EXCEPT (colX, colY);

Note that * behaves just like in SELECT: MATERIALIZED, and ALIAS columns are not used for expansion.
Also, it is an error to specify empty list of columns, or write an expression that results in an empty list of columns, or deduplicate by an ALIAS column.
Column transformers other than EXCEPT are not supported.
Enmk added 4 commits December 7, 2020 13:18
* no more undefined values for attributes in ReplicatedMergeTreeLogEntry
* validation of string serialization format
Also a minor cleanup of the test code.
Enmk added 2 commits December 8, 2020 19:44
Also logging expanded list of columns passed from `DEDUPLICATE BY` to actual deduplication routines.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is quite dangerous, because JSON does not support non-unicode data while we support arbitrary bytes for column names.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will lead to idiosynchrasy as two different escaping methods will be used in the same file.
Let's reuse existing "tsv-escaped" method.

Copy link
Copy Markdown
Contributor Author

@Enmk Enmk Dec 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I've done some research, and it looks like the method we use with TabSeparatedWithNames (I believe this is what you meant by 'tsv-secaped') wouldn't cut it.

INPUT as stated in C++ code:

{"name with space", "\"column\"", "'column'", "колонка", "\u30ab\u30e9\u30e0", "\x00\x01\x03 column \x10\x11\x12"},

writeEscapedString:

deduplicate_by_columns: [name with space,"column",\'column\',колонка,カラム,\0 column ]

writeCSV:

deduplicate_by_columns: ["name with space","""column""","'column'","колонка","カラム","\0 column "]

writeJSONString:

deduplicate_by_columns: ["name with space","\"column\"","'column'","колонка","カラム","\u0000\u0001\u0003 column \u0010\u0011\u0012"]

Please note how writeEscapedString and writeCSV eats all non-unicode binary.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please note how writeEscapedString and writeCSV eats all non-unicode binary.

No, they don't eat binary data. The bytes are written but they are lost in copy-paste.
Actually our JSON format is also binary safe, but it can be misleading to applications.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$ clickhouse-client --query "SELECT '\x00\x01\x02\x03'"
\0
$ clickhouse-client --query "SELECT '\x00\x01\x02\x03'" | xxd
00000000: 5c30 0102 030a                           \0....

Copy link
Copy Markdown
Contributor Author

@Enmk Enmk Dec 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, my bad. Changed to CSV format, since that makes it easier to parse list of columns. Right now serialized list of columns from the unit tests like this:

deduplicate_by_columns: "name with space","""column""","'column'","колонка","カラム"," column "

However, I had to ditch using \0 in test data since it makes impossible to validate result with std::regex.

@alexey-milovidov alexey-milovidov self-assigned this Dec 17, 2020
@filimonov
Copy link
Copy Markdown
Contributor

filimonov commented Dec 17, 2020

Test name | Test status | Test time, sec.
-- | -- | 
test_materialize_mysql_database/test.py::test_select_without_columns_5_7[clickhouse_node1] | FAIL | 189.75
test_materialize_mysql_database/test.py::test_select_without_columns_8_0[clickhouse_node1

Broken in master

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-feature Pull request with new product feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants