migration deduplications hashes with insert_deduplication_version setting #95409

Merged
CheSema merged 10 commits into master from chesema-deduplication-unify on Feb 9, 2026

CheSema (Member) commented Jan 28, 2026

Motivation: #95160

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

The server setting insert_deduplication_version makes it possible to migrate to a unified deduplication hash.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

clickhouse-gh bot (Contributor) commented Jan 28, 2026

Workflow [PR], commit [0ee4d4c]

Summary:

| job_name | test_name | status | info | comment |
|---|---|---|---|---|
| Stateless tests (arm_asan, azure, parallel) | | failure | | |
| | 03100_lwu_40_cleanup_race | FAIL | cidb | |
| Stress test (amd_msan) | | failure | | |
| | Logical error: Block structure mismatch in A stream: different number of columns: (STID: 0993-38e6) | FAIL | cidb, issue | |
| Upgrade check (amd_release) | | failure | | |
| | Cannot start clickhouse-server | FAIL | cidb | |
| | Check failed | failure | cidb | |
| Finish Workflow | | failure | | |
| | python3 ./ci/jobs/scripts/workflow_hooks/feature_docs.py | failure | | |

clickhouse-gh bot added the pr-feature label (Pull request with new product feature) on Jan 28, 2026
CheSema changed the title from "declare migration setting" to "migration deduplications hashes with deduplication_unification_stage setting" on Jan 28, 2026
CheSema force-pushed the chesema-deduplication-unify branch 6 times, most recently from f493fd0 to f7f10d5, on January 30, 2026 15:16
CheSema force-pushed the chesema-deduplication-unify branch from f7f10d5 to 49fb9e9 on January 30, 2026 17:06
CheSema marked this pull request as ready for review on January 30, 2026 17:07
CheSema requested a review from Copilot on January 30, 2026 17:07

Copilot AI (Contributor) left a comment

Pull request overview

This PR introduces a new server setting deduplication_unification_stage that enables migration from separate deduplication hashes for sync and async inserts to a unified deduplication hash scheme. The setting supports three stages: old_separate_hashes (default, backward compatible), compatible_double_hashes (transition stage using both hash types), and new_unified_hash (final state using only unified hashes).
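
The three stages can be pictured as a mapping from stage to the kinds of hash an insert records. The following is a minimal Python sketch, not the PR's C++ code: the stage names come from the overview above, while the function name `hash_kinds_to_write` and the hash-kind labels `sync`, `async`, and `unified` are hypothetical.

```python
from enum import Enum


class Stage(Enum):
    # Stage names as listed in the overview above.
    OLD_SEPARATE_HASHES = "old_separate_hashes"
    COMPATIBLE_DOUBLE_HASHES = "compatible_double_hashes"
    NEW_UNIFIED_HASH = "new_unified_hash"


def hash_kinds_to_write(stage: Stage, is_async_insert: bool) -> list[str]:
    """Return the (hypothetical) hash kinds an insert records at a given stage."""
    # Before the migration, sync and async inserts each use their own hash.
    legacy = "async" if is_async_insert else "sync"
    if stage is Stage.OLD_SEPARATE_HASHES:
        return [legacy]
    if stage is Stage.COMPATIBLE_DOUBLE_HASHES:
        # Transition stage: both hash types are used.
        return [legacy, "unified"]
    # Final stage: only the unified hash is used.
    return ["unified"]
```

For example, `hash_kinds_to_write(Stage.COMPATIBLE_DOUBLE_HASHES, True)` yields both the async-specific label and the unified one, which is what lets the transition stage stay readable from either side of the migration.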

Changes:

  • Added deduplication_unification_stage server setting with three migration stages
  • Introduced DeduplicationHash struct to encapsulate hash type and block ID generation
  • Modified deduplication logic across replicated and non-replicated tables to support multiple hash types
  • Added integration tests covering migration scenarios between different stages

Reviewed changes

Copilot reviewed 29 out of 30 changed files in this pull request and generated 10 comments.

| File | Description |
|---|---|
| src/Core/ServerSettings.h/cpp | Added deduplication_unification_stage server setting definition |
| src/Core/SettingsEnums.h/cpp | Added DeduplicationUnificationStage enum with three stages |
| src/Interpreters/InsertDeduplication.h/cpp | Introduced DeduplicationHash struct and refactored hash generation logic |
| src/Storages/StorageReplicatedMergeTree.h/cpp | Added deduplication_hashes_cache member and ZooKeeper paths for unified hashes |
| src/Storages/MergeTree/ReplicatedMergeTreeSink.h/cpp | Updated commitPart signature to use DeduplicationHash instead of string block IDs |
| src/Storages/MergeTree/AsyncBlockIDsCache.h/cpp | Generalized cache to work with DeduplicationHash objects and configurable directory names |
| src/Storages/MergeTree/ReplicatedMergeTreeCleanupThread.cpp | Added cleanup logic for deduplication_hashes directory |
| tests/integration/test_migrtation_deduplication_hash/* | Added integration tests for migration scenarios and sync/async deduplication |
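
The kind of migration scenario the integration tests cover can be pictured with a toy model of a table's stored block IDs. This is an illustrative Python simulation under stated assumptions, not the PR's tests or implementation: `ToyTable`, `block_ids`, the use of SHA-256, and the rule that an insert is a duplicate when any of its block IDs is already stored are all hypothetical.

```python
import hashlib


def block_ids(stage: str, kind: str, data: bytes) -> set[str]:
    """Block IDs an insert of `kind` ('sync' or 'async') records at `stage`."""
    digest = hashlib.sha256(data).hexdigest()[:16]
    legacy, unified = f"{kind}:{digest}", f"unified:{digest}"
    return {
        "old_separate_hashes": {legacy},
        "compatible_double_hashes": {legacy, unified},
        "new_unified_hash": {unified},
    }[stage]


class ToyTable:
    """Hypothetical stand-in for a table's stored deduplication block IDs."""

    def __init__(self) -> None:
        self.known_ids: set[str] = set()
        self.rows: list[bytes] = []

    def insert(self, stage: str, kind: str, data: bytes) -> bool:
        """Insert `data`; return False if it was dropped as a duplicate."""
        ids = block_ids(stage, kind, data)
        if ids & self.known_ids:  # any matching ID marks a duplicate
            return False
        self.known_ids |= ids
        self.rows.append(data)
        return True
```

In this model, a block inserted under compatible_double_hashes is still recognized as a duplicate after the server moves to new_unified_hash, even when the retry arrives as the other insert kind. A block inserted under old_separate_hashes is not recognized once only unified hashes are used, which is the gap a transition stage closes.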

azat self-assigned this on Feb 2, 2026

azat (Member) left a comment

I've looked at everything except ReplicatedMergeTreeSink for now.

CheSema changed the title from "migration deduplications hashes with deduplication_unification_stage setting" to "migration deduplications hashes with insert_deduplication_version setting" on Feb 6, 2026
CheSema added this pull request to the merge queue on Feb 9, 2026
Merged via the queue into master with commit ce10089 on Feb 9, 2026
128 of 133 checks passed
CheSema deleted the chesema-deduplication-unify branch on February 9, 2026 10:18
robot-clickhouse-ci-2 added the pr-synced-to-cloud label (The PR is synced to the cloud repo) on Feb 9, 2026