
MergeTree part-level insertion deduplication#8467

Closed
yuzhichang wants to merge 1 commit into ClickHouse:master from infinivision:mergetree_deduplicate_parts


Conversation

@yuzhichang
Contributor

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

Changelog category (leave one):

  • New Feature

Changelog entry (up to a few sentences; required except for Non-significant/Documentation categories):
Added part-level insertion deduplication for MergeTree just like ReplicatedMergeTree.

@yuzhichang yuzhichang changed the title MergeTree part-level insertion deduplication [WIP]MergeTree part-level insertion deduplication Dec 31, 2019
@yuzhichang yuzhichang changed the title [WIP]MergeTree part-level insertion deduplication MergeTree part-level insertion deduplication Jan 3, 2020
@yuzhichang
Contributor Author

@alexey-milovidov Any comments on this PR? By the way, I don't think the failed test cases are related to my change.

Member

@alexey-milovidov alexey-milovidov left a comment

The idea is OK, but I would prefer the following changes:

  1. Write separate histories for separate partitions and drop them when the partition is dropped/detached.
  2. Don't use approximate filtering with a bloom filter; simply use two hash tables swapped on overflow. sparse_hash is OK.
  3. Don't use a tricky file-shortening method. Use two files and swap them the same way as the hash tables in memory.
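The two-hash-table scheme in point 2 could be modelled roughly like this (a hypothetical Python sketch, not the actual C++ implementation; the class name and capacity are illustrative):

```python
class SwappingDedupSet:
    """Two exact hash sets swapped on overflow: no false positives,
    unlike a bloom filter, at the cost of forgetting old generations."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.current = set()   # most recent block hashes
        self.previous = set()  # prior generation, dropped on the next swap

    def seen(self, block_hash):
        """Return True if the hash is a duplicate; otherwise record it."""
        if block_hash in self.current or block_hash in self.previous:
            return True
        if len(self.current) >= self.capacity:
            # Swap on overflow: the oldest generation is forgotten wholesale.
            self.previous, self.current = self.current, set()
        self.current.add(block_hash)
        return False


dedup = SwappingDedupSet(capacity=2)
assert dedup.seen("h1") is False
assert dedup.seen("h1") is True   # exact duplicate detection
assert dedup.seen("h2") is False
assert dedup.seen("h3") is False  # triggers a swap internally
assert dedup.seen("h1") is True   # still remembered in the previous set
```

The on-disk variant in point 3 would mirror this: two files swapped the same way, so truncation is never needed.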

@alexey-milovidov alexey-milovidov added the st-discussion When the implementation aspects are not clear or when the PR is on hold due to questions. label Jan 23, 2020
@qoega qoega added the doc-alert label Feb 6, 2020
@yuzhichang
Contributor Author

@alexey-milovidov The failing case looks to be caused by an unstable test environment.

@blinkov blinkov added the pr-feature Pull request with new product feature label Apr 1, 2020
@blinkov blinkov removed their assignment Apr 2, 2020
@qoega
Member

qoega commented May 12, 2020

Probably off topic:
Have you thought about a more elaborate way to deduplicate inserts than a data-only hash? We could use a client-supplied insert ID for deduplication, if the client is able to set one on insert. There could be several strategies:

  • data_hash — your current way to deduplicate. Good for data with high entropy.
  • id — a client-defined unique ID per insert, used for later deduplication.
  • both — use both the ID and the data hash to search for duplicate inserts.

I'm concerned about scenarios where several inserts carry identical data. Currently the only workaround is to add an extra column with something like a UUID if your data can have the same rows in different inserts.
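The three strategies above could be sketched as follows (a hypothetical Python illustration; the function and parameter names are not part of ClickHouse):

```python
import hashlib

def dedup_key(rows, insert_id=None, strategy="data_hash"):
    """Build the key used to detect duplicate inserts under each strategy."""
    data_hash = hashlib.sha256("\n".join(rows).encode()).hexdigest()
    if strategy == "data_hash":
        return data_hash                   # content-only key
    if strategy == "id":
        return insert_id                   # client guarantees uniqueness
    if strategy == "both":
        return f"{insert_id}:{data_hash}"  # client ID plus content
    raise ValueError(f"unknown strategy: {strategy}")


rows = ["1,foo", "2,bar"]
# Two distinct inserts that happen to carry identical rows collide under
# data_hash, but stay distinct once a client-supplied ID enters the key:
assert dedup_key(rows) == dedup_key(rows)
assert dedup_key(rows, insert_id="ins-1", strategy="both") != \
       dedup_key(rows, insert_id="ins-2", strategy="both")
```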

@yuzhichang
Contributor Author

@qoega Here's a scenario where several inserts carry the same data: a JDBC client sends an insert command but gets a broken-connection exception. The client cannot tell whether the exception happened before or after the insert was applied, so it retries with the same data. The server then receives the same data twice, and later analytical queries may return incorrect results.
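A minimal model of that retry scenario, assuming the server keeps a set of recent block hashes (all names here are hypothetical, not ClickHouse code):

```python
import hashlib

seen_hashes = set()

def server_insert(batch):
    """Apply an insert, silently dropping a batch whose hash was seen before,
    similar in spirit to ReplicatedMergeTree's insert deduplication."""
    h = hashlib.sha256("\n".join(batch).encode()).hexdigest()
    if h in seen_hashes:
        return "deduplicated"  # retry of an already-applied insert
    seen_hashes.add(h)
    return "inserted"


batch = ["1,alice", "2,bob"]
assert server_insert(batch) == "inserted"
# The connection broke before the ack arrived, so the client retries:
assert server_insert(batch) == "deduplicated"
```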

@alesapin alesapin self-assigned this Jul 28, 2020
@alesapin
Member

alesapin commented Aug 26, 2020

The idea is clear to me, but I think we have to move this logic to the StorageMergeTree level, because in the current implementation it would apply to both replicated and non-replicated engines. I'll try to make some improvements.

@huynhphuong10284

Just to give more information about duplicated records in ClickHouse:

  • We're consuming data from Kafka and then saving it to ClickHouse (using a Distributed table across the cluster and ReplicatedMergeTree on each shard).
  • Sometimes Kafka carries duplicated data (in the producer, if a write does not receive a response from the Kafka cluster, it has to send the same message a second time).

Background merges don't help with correctness here, since deduplication runs asynchronously: the data moves from an invalid (duplicated) state to a valid one and back again as new inserts arrive. Our main requirement is exact data, and it's hard to find a good solution that avoids duplicates at large scale with ClickHouse. Please advise! I hope ClickHouse can offer the same kind of solution as Cassandra.

@alesapin
Member

alesapin commented Apr 2, 2021

#22514


Labels

  • pr-feature — Pull request with new product feature
  • st-discussion — When the implementation aspects are not clear or when the PR is on hold due to questions.


7 participants