MergeTree part-level insertion deduplication#8467
yuzhichang wants to merge 1 commit into ClickHouse:master from infinivision:mergetree_deduplicate_parts
|
@alexey-milovidov Any comments for this PR? By the way, I don't think the failed cases relate to my change. |
alexey-milovidov
left a comment
The idea is OK, but I would prefer the following changes:
- Write separate histories for separate partitions and drop them when a partition is dropped or detached.
- Don't use approximate filtering with a bloom filter; simply use two hash tables swapped on overflow. sparse_hash is OK.
- Don't use the tricky file-shortening method. Use two files and swap them in the same way as the hash tables in memory.
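The first two suggestions above can be illustrated with a minimal sketch. It is written in Python for brevity and is not the PR's code: the class names, the `capacity` parameter, and the use of plain `set` objects in place of sparse_hash are all assumptions made for illustration.

```python
class TwoGenerationDedup:
    """Exact recent-history dedup using two hash sets swapped on overflow.

    Hypothetical illustration; not taken from the PR.
    """

    def __init__(self, capacity):
        self.capacity = capacity  # max entries held by the current set
        self.current = set()
        self.previous = set()

    def seen_before(self, block_id):
        """Return True if block_id is a known duplicate; otherwise remember it."""
        if block_id in self.current or block_id in self.previous:
            return True
        if len(self.current) >= self.capacity:
            # Overflow: the current set becomes the previous one,
            # and the oldest generation is forgotten.
            self.previous = self.current
            self.current = set()
        self.current.add(block_id)
        return False


class PartitionedDedup:
    """One independent history per partition, dropped with the partition."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.histories = {}  # partition_id -> TwoGenerationDedup

    def seen_before(self, partition_id, block_id):
        history = self.histories.setdefault(
            partition_id, TwoGenerationDedup(self.capacity)
        )
        return history.seen_before(block_id)

    def drop_partition(self, partition_id):
        # Dropping or detaching a partition discards its history as well.
        self.histories.pop(partition_id, None)
```

With this layout the history remembers between `capacity` and `2 * capacity` of the most recent block IDs, and lookups are exact rather than probabilistic. The same swap could be applied to two on-disk files, which is what the third suggestion describes.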
|
@alexey-milovidov The failing case looks like it was caused by an unstable test environment. |
|
Probably off topic:
I'm concerned about scenarios where several inserts can legitimately contain the same data. Currently the only solution, if your data can have the same rows in different inserts, is to add one more column with something like a UUID. |
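A rough sketch of that workaround, assuming deduplication compares a digest of the whole inserted block. The helper names `block_digest` and `tag_rows` are hypothetical, and SHA-256 over `repr` is only a stand-in for whatever hash the server actually uses.

```python
import hashlib
import uuid


def block_digest(rows):
    """Digest of an inserted block (stand-in for the server-side block hash)."""
    h = hashlib.sha256()
    for row in rows:
        h.update(repr(row).encode("utf-8"))
        h.update(b"\n")
    return h.hexdigest()


def tag_rows(rows, insert_id=None):
    """Append a per-insert ID column so identical data gets a distinct digest."""
    insert_id = insert_id or str(uuid.uuid4())
    return [row + (insert_id,) for row in rows]


rows = [("page_view", 1), ("page_view", 1)]

# Two separate inserts of the same rows are indistinguishable by digest...
same = block_digest(rows) == block_digest(list(rows))

# ...unless each insert carries its own ID column.
distinct = block_digest(tag_rows(rows)) != block_digest(tag_rows(rows))
```

A client that wants a retry to be deduplicated while keeping intentional repeats would reuse the same `insert_id` for the retry and generate a new one for each new insert.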
|
@qoega Here's a scenario where several inserts have the same data: the JDBC client sends an insert command but gets a broken-connection exception. The client cannot tell whether the exception happened before or after the insert was applied, so it retries the insert with the same data. The server gets the same data twice, so later analysis queries may return incorrect results. |
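The retry scenario can be simulated with a toy model. This is only a sketch under stated assumptions: `ToyTable` and `insert_with_lost_ack` are hypothetical names, and the digest-based check stands in for the part-level deduplication this PR proposes.

```python
import hashlib


class ToyTable:
    """In-memory stand-in for a table that deduplicates whole inserted blocks."""

    def __init__(self):
        self.rows = []
        self.seen_digests = set()

    def insert(self, block):
        """Apply the block unless an identical one was already inserted."""
        digest = hashlib.sha256(repr(block).encode("utf-8")).hexdigest()
        if digest in self.seen_digests:
            return False  # duplicate block, silently skipped
        self.seen_digests.add(digest)
        self.rows.extend(block)
        return True


def insert_with_lost_ack(table, block):
    """Simulate a client whose first insert succeeds but whose reply is lost."""
    table.insert(block)         # server applied it; client saw an exception
    return table.insert(block)  # client retries with exactly the same block


table = ToyTable()
retry_applied = insert_with_lost_ack(table, [("a", 1), ("b", 2)])
```

Here the retry is recognized as a duplicate, so `table.rows` holds each row once and the client can safely retry without knowing whether the first attempt landed.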
|
The idea is clear to me, but I think we have to move this logic into |
|
Some more information about duplicated records in ClickHouse:
As I see it, MergeTree does not help keep data valid, because merging runs in the background and the data goes from invalid to valid and vice versa. Our main requirement is exact data, and it is hard to find a good solution for avoiding duplicated data in a large data set (big data) when using ClickHouse. Please advise! I hope we can have a solution similar to the one in Cassandra. |
I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en
Changelog category (leave one):
Changelog entry (up to few sentences, required except for Non-significant/Documentation categories):
Added part-level insertion deduplication for MergeTree just like ReplicatedMergeTree.