
Add functions for content-defined chunking #101264

Open
aobelski wants to merge 4 commits into ClickHouse:master from aobelski:aobelski/rolling-hashes-cdc

Conversation


@aobelski commented Mar 30, 2026

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Add SQL functions for content-defined chunking using a Buzhash rolling hash: contentDefinedChunks, contentDefinedChunksUTF8, contentDefinedChunkOffsets, and contentDefinedChunkOffsetsUTF8 with arguments (string, window_size, reverse_probability); UTF-8 variants cut only at code point boundaries.

Resolves #81183.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Embedded reference documentation is provided via REGISTER_FUNCTION in code (description, syntax, arguments, examples)


Summary

The functions split a string into substrings whose boundaries depend on the content itself (content-defined chunking, CDC), so the boundaries stay mostly stable when data is prepended or appended. This makes them useful for deduplication and similarity estimation of binary blobs (artifacts, repositories, etc.).

Semantics

  • Returns Array(String) of chunks covering the whole input, or Array(UInt64) byte offsets of chunk starts.
  • window_size: sliding window for Buzhash, 1–256.
  • reverse_probability: a cut is made when (hash % reverse_probability) == 0; larger values give larger average chunks (the average chunk size is on the order of this parameter).
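The semantics above can be sketched in a few lines of Python. This is an illustrative model, not ClickHouse's implementation: the hash table contents, the random seed, and the function name are assumptions, and only the cut rule (hash % reverse_probability == 0 over a Buzhash sliding window) matches the description.

```python
import random

MASK64 = (1 << 64) - 1
rng = random.Random(0)                      # fixed seed, for reproducibility only
TABLE = [rng.getrandbits(64) for _ in range(256)]  # assumed per-byte random table

def rotl64(x: int, n: int) -> int:
    n %= 64
    return ((x << n) | (x >> (64 - n))) & MASK64

def content_defined_chunks(data: bytes, window_size: int, reverse_probability: int):
    """Split `data` into chunks whose boundaries depend only on the bytes
    inside the sliding window, so edits far away do not move them."""
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = rotl64(h, 1) ^ TABLE[b]         # Buzhash: rotate, mix in new byte
        if i >= window_size:                # oldest byte leaves the window;
            h ^= rotl64(TABLE[data[i - window_size]], window_size)  # undo its contribution
        if i + 1 >= window_size and h % reverse_probability == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])         # final (possibly short) chunk
    return chunks
```

Note that chunks always concatenate back to the input, and an input shorter than the window produces a single chunk.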

Example

SELECT contentDefinedChunks('SOME STRING', 4, 1000) AS chunks;
WITH
    'some_long_string' AS s1,
    'some_really_long_string' AS s2,
    4 AS window_size,
    1000 AS reverse_probability,
    contentDefinedChunks(s1, window_size, reverse_probability) AS c1,
    contentDefinedChunks(s2, window_size, reverse_probability) AS c2,
    arrayDistinct(c1) AS d1,
    arrayDistinct(c2) AS d2,
    arrayIntersect(d1, d2) AS inter,
    arrayDistinct(arrayConcat(d1, d2)) AS uni
SELECT
    length(inter) AS intersection_size,
    length(uni) AS union_size,
    if(length(uni) = 0, 1.0, length(inter) / toFloat64(length(uni))) AS jaccard;

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


aobelski seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.

@alexey-milovidov added the can be tested (Allows running workflows for external contributors) label Mar 30, 2026

clickhouse-gh bot commented Mar 30, 2026

Workflow [PR], commit [4d8eb7e]

Summary:

  • Stress test (arm_debug), failure: Hung check failed, possible deadlock found (FAIL, cidb)
  • Stress test (arm_msan), failure: Logical error: '(isConst() || isSparse() || isReplicated()) ? getDataType() == rhs.getDataType() : typeid(*this) == typeid(rhs)' (STID: 2508-394e) (FAIL, cidb)
  • Stress test (arm_ubsan), failure: Hung check failed, possible deadlock found (FAIL, cidb)

AI Review

Summary

This PR adds four new SQL functions for content-defined chunking based on a Buzhash rolling hash (contentDefinedChunks, contentDefinedChunksUTF8, contentDefinedChunkOffsets, contentDefinedChunkOffsetsUTF8) with argument validation, docs metadata, and unit/stateless tests. After reviewing the full diff and modified files in context, I did not find any remaining blocker/major issues in the current PR head.

Missing context
  • ⚠️ No CI run results/logs were provided in the review context.
ClickHouse Rules

  • Deletion logging
  • Serialization versioning
  • Core-area scrutiny
  • No test removal
  • Experimental gate
  • No magic constants
  • Backward compatibility
  • SettingsChangesHistory.cpp
  • PR metadata quality
  • Safe rollout
  • Compilation time
Final Verdict
  • Status: ✅ Approve

@clickhouse-gh bot added the pr-feature (Pull request with new product feature) label Mar 30, 2026
if (tentative <= chunk_start)
    return tentative;
const UInt8 * p = data + tentative;
UTF8::syncBackward(p, data + chunk_start);

forceCutPositionUtf8 can pass p = data + data_size into UTF8::syncBackward when tentative == data_size (for example, when chunk_start + max_chunk_size >= data_size). syncBackward dereferences *s before checking bounds, so calling it with a one-past-end pointer is undefined behavior.

Concrete trace:

  • data_size = 10, chunk_start = 0, max_chunk_size = 16
  • tentative = min(16, 10) = 10
  • p = data + 10 (one-past-end)
  • syncBackward(p, data) executes isContinuationOctet(*s) on invalid memory.

Please avoid calling syncBackward with end pointer, e.g. start from data + tentative - 1 (when tentative > chunk_start) or guard tentative == data_size explicitly.
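The suggested fix can be modeled in Python, where indices make the bounds explicit. This is a hypothetical analogue of the guarded logic, not the PR's C++ code; the function name and the progress fallback are assumptions. The key point is that the backward scan over UTF-8 continuation octets (0b10xxxxxx) never reads at index len(data).

```python
def force_cut_position_utf8(data: bytes, chunk_start: int, max_chunk_size: int) -> int:
    """Return a cut position at or before chunk_start + max_chunk_size that
    falls on a UTF-8 code point boundary, without reading past the buffer."""
    tentative = min(chunk_start + max_chunk_size, len(data))
    if tentative <= chunk_start:
        return tentative
    # Walk backward only while inside the buffer (pos < len(data) guards the
    # one-past-end case) and the byte is a continuation octet 0b10xxxxxx.
    pos = tentative
    while pos > chunk_start and pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos -= 1
    # If we backed all the way to chunk_start (invalid UTF-8), fall back to
    # the tentative cut so the chunker still makes progress.
    return pos if pos > chunk_start else tentative
```

When tentative equals len(data), the end of the string is always a valid boundary and the loop body never executes, which is exactly the case the review comment flags.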


size_t maxChunkSizeForCdc(UInt64 reverse_probability)
{
    const uint64_t scaled = std::min<uint64_t>(reverse_probability * 64, static_cast<uint64_t>(MAX_CDC_CHUNK_SIZE));

reverse_probability * 64 is evaluated in uint64_t before the clamp, so very large values overflow (wrap) and can produce a much smaller max_chunk_size than intended.

Please clamp before multiplication, for example:

if (reverse_probability > MAX_CDC_CHUNK_SIZE / 64)
    return MAX_CDC_CHUNK_SIZE;
return std::max<size_t>(DEFAULT_MIN_MAX_CHUNK, reverse_probability * 64);

This keeps behavior monotonic and preserves the documented safety cap.
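The wraparound can be demonstrated in Python by masking to 64 bits, mimicking uint64_t arithmetic. The constants below are assumed for illustration and do not come from the PR; only the multiply-then-clamp vs clamp-then-multiply contrast matters.

```python
MASK64 = (1 << 64) - 1
MAX_CDC_CHUNK_SIZE = 1 << 20   # assumed cap, for illustration only
DEFAULT_MIN_MAX_CHUNK = 64     # assumed floor, for illustration only

def max_chunk_buggy(reverse_probability: int) -> int:
    # Mimics uint64_t: multiply first (may wrap modulo 2**64), then clamp.
    scaled = (reverse_probability * 64) & MASK64
    return min(scaled, MAX_CDC_CHUNK_SIZE)

def max_chunk_fixed(reverse_probability: int) -> int:
    # Clamp before multiplying, as the review comment suggests.
    if reverse_probability > MAX_CDC_CHUNK_SIZE // 64:
        return MAX_CDC_CHUNK_SIZE
    return max(DEFAULT_MIN_MAX_CHUNK, reverse_probability * 64)
```

For reverse_probability = 2**60, the product 2**66 wraps to 0 in 64-bit arithmetic, so the buggy version returns 0 while the fixed version returns the cap, illustrating the lost monotonicity.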


clickhouse-gh bot commented Mar 31, 2026

LLVM Coverage Report

Metric    | Baseline | Current | Δ
Lines     | 84.20%   | 84.10%  | -0.10%
Functions | 90.90%   | 90.90%  | +0.00%
Branches  | 76.70%   | 76.70%  | +0.00%

Changed lines: 78.52% (435/554) · Uncovered code

Full report · Diff report


Labels

can be tested (Allows running workflows for external contributors), pr-feature (Pull request with new product feature)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Functions for rolling hashes

3 participants