
Add functions for content-defined chunking #101264

Open
aobelski wants to merge 4 commits into ClickHouse:master from aobelski:aobelski/rolling-hashes-cdc

Conversation


@aobelski commented Mar 30, 2026

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Add SQL functions for content-defined chunking using a Buzhash rolling hash: contentDefinedChunks, contentDefinedChunksUTF8, contentDefinedChunkOffsets, and contentDefinedChunkOffsetsUTF8 with arguments (string, window_size, reverse_probability); UTF-8 variants cut only at code point boundaries.

Resolves #81183.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Embedded reference documentation is provided via REGISTER_FUNCTION in code (description, syntax, arguments, examples)


Summary

The functions split a string into substrings whose boundaries depend on the content itself (content-defined chunking, CDC), so the boundaries stay mostly stable when data is prepended or appended. This makes them useful for deduplication and similarity estimation of binary blobs (artifacts, repositories, etc.).

Semantics

  • Returns Array(String) of chunks covering the whole input, or Array(UInt64) byte offsets of chunk starts.
  • window_size: sliding window for Buzhash, 1–256.
  • reverse_probability: a cut is made when (hash % reverse_probability) == 0; larger values give larger average chunks (the average chunk size is on the order of this parameter).
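The semantics above can be sketched in a few lines of Python. This is an illustrative model, not ClickHouse's implementation: the hash table contents, the random seed, and the function name are assumptions, and only the cut rule (hash % reverse_probability == 0 over a Buzhash sliding window) matches the description.

```python
import random

MASK64 = (1 << 64) - 1
rng = random.Random(0)                      # fixed seed, for reproducibility only
TABLE = [rng.getrandbits(64) for _ in range(256)]  # assumed per-byte random table

def rotl64(x: int, n: int) -> int:
    n %= 64
    return ((x << n) | (x >> (64 - n))) & MASK64

def content_defined_chunks(data: bytes, window_size: int, reverse_probability: int):
    """Split `data` into chunks whose boundaries depend only on the bytes
    inside the sliding window, so edits far away do not move them."""
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = rotl64(h, 1) ^ TABLE[b]         # Buzhash: rotate, mix in new byte
        if i >= window_size:                # oldest byte leaves the window;
            h ^= rotl64(TABLE[data[i - window_size]], window_size)  # undo its contribution
        if i + 1 >= window_size and h % reverse_probability == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])         # final (possibly short) chunk
    return chunks
```

Note that chunks always concatenate back to the input, and an input shorter than the window produces a single chunk.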

Example

SELECT contentDefinedChunks('SOME STRING', 4, 1000) AS chunks;
WITH
    'some_long_string' AS s1,
    'some_really_long_string' AS s2,
    4 AS window_size,
    1000 AS reverse_probability,
    contentDefinedChunks(s1, window_size, reverse_probability) AS c1,
    contentDefinedChunks(s2, window_size, reverse_probability) AS c2,
    arrayDistinct(c1) AS d1,
    arrayDistinct(c2) AS d2,
    arrayIntersect(d1, d2) AS inter,
    arrayDistinct(arrayConcat(d1, d2)) AS uni
SELECT
    length(inter) AS intersection_size,
    length(uni) AS union_size,
    if(length(uni) = 0, 1.0, length(inter) / toFloat64(length(uni))) AS jaccard;

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


aobelski seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.

@alexey-milovidov added the can be tested (Allows running workflows for external contributors) label Mar 30, 2026

clickhouse-gh bot commented Mar 30, 2026

Workflow [PR], commit [4d8eb7e]

Summary:

  • Stress test (arm_debug), failure: Hung check failed, possible deadlock found (FAIL, cidb)
  • Stress test (arm_msan), failure: Logical error: '(isConst() || isSparse() || isReplicated()) ? getDataType() == rhs.getDataType() : typeid(*this) == typeid(rhs)' (STID: 2508-394e) (FAIL, cidb)
  • Stress test (arm_ubsan), failure: Hung check failed, possible deadlock found (FAIL, cidb)

AI Review

Summary

This PR adds four new SQL functions for content-defined chunking based on a Buzhash rolling hash (contentDefinedChunks, contentDefinedChunksUTF8, contentDefinedChunkOffsets, contentDefinedChunkOffsetsUTF8) with argument validation, docs metadata, and unit/stateless tests. After reviewing the full diff and modified files in context, I did not find any remaining blocker/major issues in the current PR head.

Missing context
  • ⚠️ No CI run results/logs were provided in the review context.
ClickHouse Rules

  • Deletion logging
  • Serialization versioning
  • Core-area scrutiny
  • No test removal
  • Experimental gate
  • No magic constants
  • Backward compatibility
  • SettingsChangesHistory.cpp
  • PR metadata quality
  • Safe rollout
  • Compilation time
Final Verdict
  • Status: ✅ Approve

@clickhouse-gh bot added the pr-feature (Pull request with new product feature) label Mar 30, 2026
if (tentative <= chunk_start)
    return tentative;
const UInt8 * p = data + tentative;
UTF8::syncBackward(p, data + chunk_start);

forceCutPositionUtf8 can pass p = data + data_size into UTF8::syncBackward when tentative == data_size (for example, when chunk_start + max_chunk_size >= data_size). syncBackward dereferences *s before checking bounds, so calling it with a one-past-end pointer is undefined behavior.

Concrete trace:

  • data_size = 10, chunk_start = 0, max_chunk_size = 16
  • tentative = min(16, 10) = 10
  • p = data + 10 (one-past-end)
  • syncBackward(p, data) executes isContinuationOctet(*s) on invalid memory.

Please avoid calling syncBackward with end pointer, e.g. start from data + tentative - 1 (when tentative > chunk_start) or guard tentative == data_size explicitly.
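The suggested fix can be modeled in Python, where indices make the bounds explicit. This is a hypothetical analogue of the guarded logic, not the PR's C++ code; the function name and the progress fallback are assumptions. The key point is that the backward scan over UTF-8 continuation octets (0b10xxxxxx) never reads at index len(data).

```python
def force_cut_position_utf8(data: bytes, chunk_start: int, max_chunk_size: int) -> int:
    """Return a cut position at or before chunk_start + max_chunk_size that
    falls on a UTF-8 code point boundary, without reading past the buffer."""
    tentative = min(chunk_start + max_chunk_size, len(data))
    if tentative <= chunk_start:
        return tentative
    # Walk backward only while inside the buffer (pos < len(data) guards the
    # one-past-end case) and the byte is a continuation octet 0b10xxxxxx.
    pos = tentative
    while pos > chunk_start and pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos -= 1
    # If we backed all the way to chunk_start (invalid UTF-8), fall back to
    # the tentative cut so the chunker still makes progress.
    return pos if pos > chunk_start else tentative
```

When tentative equals len(data), the end of the string is always a valid boundary and the loop body never executes, which is exactly the case the review comment flags.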


size_t maxChunkSizeForCdc(UInt64 reverse_probability)
{
    const uint64_t scaled = std::min<uint64_t>(reverse_probability * 64, static_cast<uint64_t>(MAX_CDC_CHUNK_SIZE));

reverse_probability * 64 is evaluated in uint64_t before the clamp, so very large values overflow (wrap) and can produce a much smaller max_chunk_size than intended.

Please clamp before multiplication, for example:

if (reverse_probability > MAX_CDC_CHUNK_SIZE / 64)
    return MAX_CDC_CHUNK_SIZE;
return std::max<size_t>(DEFAULT_MIN_MAX_CHUNK, reverse_probability * 64);

This keeps behavior monotonic and preserves the documented safety cap.
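The wraparound can be demonstrated in Python by masking to 64 bits, mimicking uint64_t arithmetic. The constants below are assumed for illustration and do not come from the PR; only the multiply-then-clamp vs clamp-then-multiply contrast matters.

```python
MASK64 = (1 << 64) - 1
MAX_CDC_CHUNK_SIZE = 1 << 20   # assumed cap, for illustration only
DEFAULT_MIN_MAX_CHUNK = 64     # assumed floor, for illustration only

def max_chunk_buggy(reverse_probability: int) -> int:
    # Mimics uint64_t: multiply first (may wrap modulo 2**64), then clamp.
    scaled = (reverse_probability * 64) & MASK64
    return min(scaled, MAX_CDC_CHUNK_SIZE)

def max_chunk_fixed(reverse_probability: int) -> int:
    # Clamp before multiplying, as the review comment suggests.
    if reverse_probability > MAX_CDC_CHUNK_SIZE // 64:
        return MAX_CDC_CHUNK_SIZE
    return max(DEFAULT_MIN_MAX_CHUNK, reverse_probability * 64)
```

For reverse_probability = 2**60, the product 2**66 wraps to 0 in 64-bit arithmetic, so the buggy version returns 0 while the fixed version returns the cap, illustrating the lost monotonicity.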


clickhouse-gh bot commented Mar 31, 2026

LLVM Coverage Report

Metric    | Baseline | Current | Δ
Lines     | 84.20%   | 84.10%  | -0.10%
Functions | 90.90%   | 90.90%  | +0.00%
Branches  | 76.70%   | 76.70%  | +0.00%

Changed lines: 78.52% (435/554) · Uncovered code

Full report · Diff report


Labels

can be tested (Allows running workflows for external contributors), pr-feature (Pull request with new product feature)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Functions for rolling hashes

3 participants