
MergeTree part-level insertion deduplication#8467

Closed
yuzhichang wants to merge 1 commit into ClickHouse:master from infinivision:mergetree_deduplicate_parts


Conversation

@yuzhichang
Contributor

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

Changelog category (leave one):

  • New Feature

Changelog entry (up to a few sentences; required except for Non-significant/Documentation categories):
Added part-level insertion deduplication for MergeTree just like ReplicatedMergeTree.

@yuzhichang yuzhichang changed the title MergeTree part-level insertion deduplication [WIP]MergeTree part-level insertion deduplication Dec 31, 2019
@yuzhichang yuzhichang changed the title [WIP]MergeTree part-level insertion deduplication MergeTree part-level insertion deduplication Jan 3, 2020
@yuzhichang
Contributor Author

@alexey-milovidov Any comments on this PR? By the way, I don't think the failed test cases are related to my change.

Member

@alexey-milovidov alexey-milovidov left a comment

The idea is OK, but I would prefer the following changes:

  1. Write separate histories for separate partitions and drop them when the partition is dropped/detached.
  2. Don't use approximate filtering with a bloom filter; simply use two hash tables swapped on overflow. sparse_hash is OK.
  3. Don't use a tricky file-shortening method. Use two files and swap them the same way as the hash tables in memory.
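The two-hash-table scheme in point 2 could be modelled roughly like this (a hypothetical Python sketch, not the actual C++ implementation; the class name and capacity are illustrative):

```python
class SwappingDedupSet:
    """Two exact hash sets swapped on overflow: no false positives,
    unlike a bloom filter, at the cost of forgetting old generations."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.current = set()   # most recent block hashes
        self.previous = set()  # prior generation, dropped on the next swap

    def seen(self, block_hash):
        """Return True if the hash is a duplicate; otherwise record it."""
        if block_hash in self.current or block_hash in self.previous:
            return True
        if len(self.current) >= self.capacity:
            # Swap on overflow: the oldest generation is forgotten wholesale.
            self.previous, self.current = self.current, set()
        self.current.add(block_hash)
        return False


dedup = SwappingDedupSet(capacity=2)
assert dedup.seen("h1") is False
assert dedup.seen("h1") is True   # exact duplicate detection
assert dedup.seen("h2") is False
assert dedup.seen("h3") is False  # triggers a swap internally
assert dedup.seen("h1") is True   # still remembered in the previous set
```

The on-disk variant in point 3 would mirror this: two files swapped the same way, so truncation is never needed.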

@alexey-milovidov alexey-milovidov added the st-discussion When the implementation aspects are not clear or when the PR is on hold due to questions. label Jan 23, 2020
@qoega qoega added the doc-alert label Feb 6, 2020
@yuzhichang
Contributor Author

@alexey-milovidov The failing case looks to be caused by an unstable test environment.

@blinkov blinkov added the pr-feature Pull request with new product feature label Apr 1, 2020
@blinkov blinkov removed their assignment Apr 2, 2020
@qoega
Member

qoega commented May 12, 2020

Probably off topic:
Have you thought about a more elaborate way to deduplicate inserts than a data-only hash? We could use a client-supplied insert ID for deduplication, if the client is able to set one on insert. There could be several strategies:

  • data_hash — your current way to deduplicate. Good for data with high entropy.
  • id — a client-defined unique ID per insert, used for later deduplication.
  • both — use both the ID and the data hash to search for duplicate inserts.

I'm concerned about scenarios where several inserts carry identical data. Currently the only workaround is to add an extra column with something like a UUID if your data can have the same rows in different inserts.
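The three strategies above could be sketched as follows (a hypothetical Python illustration; the function and parameter names are not part of ClickHouse):

```python
import hashlib

def dedup_key(rows, insert_id=None, strategy="data_hash"):
    """Build the key used to detect duplicate inserts under each strategy."""
    data_hash = hashlib.sha256("\n".join(rows).encode()).hexdigest()
    if strategy == "data_hash":
        return data_hash                   # content-only key
    if strategy == "id":
        return insert_id                   # client guarantees uniqueness
    if strategy == "both":
        return f"{insert_id}:{data_hash}"  # client ID plus content
    raise ValueError(f"unknown strategy: {strategy}")


rows = ["1,foo", "2,bar"]
# Two distinct inserts that happen to carry identical rows collide under
# data_hash, but stay distinct once a client-supplied ID enters the key:
assert dedup_key(rows) == dedup_key(rows)
assert dedup_key(rows, insert_id="ins-1", strategy="both") != \
       dedup_key(rows, insert_id="ins-2", strategy="both")
```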

@yuzhichang
Contributor Author

@qoega Here's a scenario where several inserts carry the same data: a JDBC client sends an insert command but gets a broken-connection exception. The client cannot tell whether the exception happened before or after the insert was applied, so it retries with the same data. The server then receives the same data twice, and later analytical queries may return incorrect results.
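A minimal model of that retry scenario, assuming the server keeps a set of recent block hashes (all names here are hypothetical, not ClickHouse code):

```python
import hashlib

seen_hashes = set()

def server_insert(batch):
    """Apply an insert, silently dropping a batch whose hash was seen before,
    similar in spirit to ReplicatedMergeTree's insert deduplication."""
    h = hashlib.sha256("\n".join(batch).encode()).hexdigest()
    if h in seen_hashes:
        return "deduplicated"  # retry of an already-applied insert
    seen_hashes.add(h)
    return "inserted"


batch = ["1,alice", "2,bob"]
assert server_insert(batch) == "inserted"
# The connection broke before the ack arrived, so the client retries:
assert server_insert(batch) == "deduplicated"
```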

@alesapin alesapin self-assigned this Jul 28, 2020
@alesapin
Member

alesapin commented Aug 26, 2020

The idea is clear to me, but I think we have to move this logic to the StorageMergeTree level, because in the current implementation it would apply to both replicated and non-replicated engines. I'll try to make some improvements.

@huynhphuong10284

Just to give more information about duplicated records in ClickHouse:

  • We're consuming data from Kafka and then saving it to ClickHouse (using a Distributed table across the cluster and ReplicatedMergeTree on each shard).
  • Sometimes Kafka carries duplicated data (in the producer, if a write does not receive a response from the Kafka cluster, it has to send the same message a second time).

Background merges don't help with correctness here, since deduplication runs asynchronously: the data moves from an invalid (duplicated) state to a valid one and back again as new inserts arrive. Our main requirement is exact data, and it's hard to find a good solution that avoids duplicates at large scale with ClickHouse. Please advise! I hope ClickHouse can offer the same kind of solution as Cassandra.

@alesapin
Member

alesapin commented Apr 2, 2021

#22514


Labels

  • pr-feature — Pull request with new product feature
  • st-discussion — When the implementation aspects are not clear or when the PR is on hold due to questions.


7 participants