Add functions for content-defined chunking #101264
aobelski wants to merge 4 commits into ClickHouse:master from
Conversation
aobelski does not appear to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account. You have already signed the CLA but the status is still pending? Let us recheck it.
Workflow [PR], commit [4d8eb7e] Summary: ❌
AI Review Summary: This PR adds four new SQL functions for content-defined chunking based on a Buzhash rolling hash.
if (tentative <= chunk_start)
    return tentative;
const UInt8 * p = data + tentative;
UTF8::syncBackward(p, data + chunk_start);
forceCutPositionUtf8 can pass p = data + data_size into UTF8::syncBackward when tentative == data_size (for example when chunk_start + max_chunk_size >= data_size). syncBackward dereferences *s before checking bounds, so calling it with a one-past-the-end pointer is undefined behavior.
Concrete trace:
- data_size = 10, chunk_start = 0, max_chunk_size = 16
- tentative = min(16, 10) = 10
- p = data + 10 (one-past-the-end)
- syncBackward(p, data) executes isContinuationOctet(*s) on invalid memory.
Please avoid calling syncBackward with an end pointer, e.g. start from data + tentative - 1 (when tentative > chunk_start) or guard tentative == data_size explicitly.
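A minimal sketch of the suggested guard, using a hypothetical stand-in for UTF8::isContinuationOctet rather than ClickHouse's actual helpers: treat tentative == data_size as a valid cut and only inspect bytes strictly inside the buffer.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for UTF8::isContinuationOctet (10xxxxxx bytes).
static bool isContinuationOctet(uint8_t c)
{
    return (c & 0xC0) == 0x80;
}

// Force a cut at or before chunk_start + max_chunk_size without splitting a
// UTF-8 code point and without ever dereferencing a one-past-the-end pointer.
size_t forceCutPositionUtf8(const uint8_t * data, size_t data_size,
                            size_t chunk_start, size_t max_chunk_size)
{
    size_t tentative = std::min(chunk_start + max_chunk_size, data_size);
    // End of input (or an empty range) is always a valid cut point.
    if (tentative <= chunk_start || tentative == data_size)
        return tentative;
    // data[tentative] would start the next chunk; if it is a continuation
    // octet, walk back until the cut lands on a code point boundary.
    while (tentative > chunk_start && isContinuationOctet(data[tentative]))
        --tentative;
    return tentative;
}
```

The one-past-the-end dereference disappears because the loop only reads data[tentative] after the tentative == data_size case has returned.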
size_t maxChunkSizeForCdc(UInt64 reverse_probability)
{
    const uint64_t scaled = std::min<uint64_t>(reverse_probability * 64, static_cast<uint64_t>(MAX_CDC_CHUNK_SIZE));
reverse_probability * 64 is evaluated in uint64_t before the clamp, so very large values overflow (wrap) and can produce a much smaller max_chunk_size than intended.
Please clamp before multiplication, for example:
if (reverse_probability > MAX_CDC_CHUNK_SIZE / 64)
    return MAX_CDC_CHUNK_SIZE;
return std::max<size_t>(DEFAULT_MIN_MAX_CHUNK, reverse_probability * 64);

This keeps the behavior monotonic and preserves the documented safety cap.
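Sketched as a standalone function (the constant values below are placeholders for illustration, not ClickHouse's real MAX_CDC_CHUNK_SIZE and DEFAULT_MIN_MAX_CHUNK), the clamp-before-multiply version cannot wrap:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Placeholder caps for illustration; the real constants live in the PR.
constexpr uint64_t MAX_CDC_CHUNK_SIZE = 1ULL << 20;
constexpr size_t DEFAULT_MIN_MAX_CHUNK = 4096;

size_t maxChunkSizeForCdc(uint64_t reverse_probability)
{
    // Compare against the cap first, so reverse_probability * 64 is only
    // evaluated when the product is known to fit in uint64_t.
    if (reverse_probability > MAX_CDC_CHUNK_SIZE / 64)
        return MAX_CDC_CHUNK_SIZE;
    return std::max<size_t>(DEFAULT_MIN_MAX_CHUNK, reverse_probability * 64);
}
```

With the original order of operations, a value like UINT64_MAX would wrap to a tiny product before the min; with the reordered check it saturates at the cap instead.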
LLVM Coverage Report
Changed lines: 78.52% (435/554)
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Add SQL functions for content-defined chunking using a Buzhash rolling hash:
contentDefinedChunks, contentDefinedChunksUTF8, contentDefinedChunkOffsets, and contentDefinedChunkOffsetsUTF8, with arguments (string, window_size, reverse_probability); the UTF-8 variants cut only at code point boundaries. Resolves #81183.
Documentation entry for user-facing changes
Embedded reference documentation is provided via REGISTER_FUNCTION in code (description, syntax, arguments, examples).

Summary
The functions split a string into substrings whose boundaries depend on the content (CDC), so the boundaries stay mostly stable when data is prepended or appended. This is useful for deduplication and similarity detection on binary blobs (artifacts, repos, etc.).
Semantics
- Returns Array(String) of chunks covering the whole input, or Array(UInt64) byte offsets of chunk starts.
- window_size: sliding window for Buzhash, 1–256.
- reverse_probability: cut when (hash % reverse_probability) == 0; larger values give larger average chunks (the average chunk size is on the order of this parameter).

Example
WITH
    'some_long_string' AS s1,
    'some_really_long_string' AS s2,
    4 AS window_size,
    1000 AS reverse_probability,
    contentDefinedChunks(s1, window_size, reverse_probability) AS c1,
    contentDefinedChunks(s2, window_size, reverse_probability) AS c2,
    arrayDistinct(c1) AS d1,
    arrayDistinct(c2) AS d2,
    arrayIntersect(d1, d2) AS inter,
    arrayDistinct(arrayConcat(d1, d2)) AS uni
SELECT
    length(inter) AS intersection_size,
    length(uni) AS union_size,
    if(length(uni) = 0, 1.0, length(inter) / toFloat64(length(uni))) AS jaccard;
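For readers outside ClickHouse, the cut rule described above ((hash % reverse_probability) == 0 over a sliding Buzhash window) can be sketched as follows. The byte-mixing function is a toy stand-in for a real Buzhash random table, and the min/max chunk-size clamps are omitted, so exact boundaries will not match the PR's functions; the append-stability property carries over, though.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Toy byte -> 64-bit value mixer (a real Buzhash uses a fixed random table).
static uint64_t byteHash(uint8_t b)
{
    uint64_t x = b + 0x9E3779B97F4A7C15ULL;
    x ^= x >> 30; x *= 0xBF58476D1CE4E5B9ULL;
    x ^= x >> 27; x *= 0x94D049BB133111EBULL;
    return x ^ (x >> 31);
}

static uint64_t rotl64(uint64_t v, size_t r)
{
    r %= 64;
    return r == 0 ? v : (v << r) | (v >> (64 - r));
}

// Return byte offsets where each chunk starts (always includes 0).
std::vector<size_t> contentDefinedChunkOffsets(const std::string & s,
                                               size_t window_size,
                                               uint64_t reverse_probability)
{
    std::vector<size_t> offsets = {0};
    if (s.size() <= window_size)
        return offsets;
    uint64_t hash = 0;
    for (size_t i = 0; i < s.size(); ++i)
    {
        // Buzhash update: rotate and mix in the new byte...
        hash = rotl64(hash, 1) ^ byteHash(static_cast<uint8_t>(s[i]));
        // ...and remove the byte that left the window (it has been rotated
        // window_size times since it was added).
        if (i >= window_size)
            hash ^= rotl64(byteHash(static_cast<uint8_t>(s[i - window_size])), window_size);
        // Cut after a full window when the hash hits the boundary condition;
        // the end of the string is an implicit boundary.
        if (i + 1 >= window_size && i + 1 < s.size() && hash % reverse_probability == 0)
            offsets.push_back(i + 1);
    }
    return offsets;
}
```

Because a boundary depends only on the last window_size bytes, appending data leaves every earlier boundary in place, which is what makes the chunks usable for deduplication.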