Implement automatic buckets in Map data type MergeTree serialization #99200
Conversation
…s. Enable new serialization by default for testing
Workflow [PR], commit [25b41d9] Summary: ❌

AI Review
Summary: This PR introduces bucketed …
Missing context

Findings
Tests
ClickHouse Rules

Performance & Safety
Bucketed reads can materially improve single-key lookup performance, but current offset/bucket-boundary defects create correctness/safety risk in both write and read paths. The feature should not be enabled broadly until those blockers are resolved and corruption guards are complete.

Final Verdict
Status: ❌ Block
Minimum required actions:
…in MergeTree writer, split dynamic structure and dynamic subcolumns logic, split taking dynamic structure and statistics logic
I don't see a test for repeating keys. Something like:

I believe we should have existing tests with repeated keys. And by default
I enabled new serialization for all maps by default for testing purposes in the CI. Later I will change defaults back to normal.
```cpp
if (min_avg_size > 0 && statistics->avg < static_cast<Float64>(min_avg_size))
    return 1;
```

```cpp
UInt64 result;
```
`calculateNumberOfBuckets` can read an uninitialized `result` value.

`result` is declared without initialization, and the `switch` has no `default`. If `strategy` ever contains an unexpected enum value (e.g., a future enum extension, corrupted metadata, or a mismatch across binaries), `result` is read uninitialized in the return expression, which is undefined behavior.

Please make this path explicit, e.g. add a `default` branch that throws, or initialize `result` to a safe value before the `switch`.
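A minimal sketch of the suggested fix, assuming an illustrative `MapBucketsStrategy` enum (the names and function signature here are hypothetical, not the PR's actual code):

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical enum mirroring the documented strategies; not the PR's type.
enum class MapBucketsStrategy { Constant, Sqrt, Linear };

// Sketch of the fix: every path either returns or throws, so no
// uninitialized value can be read even if `strategy` holds an
// unexpected value (corrupted metadata, binary mismatch, etc.).
uint64_t numBucketsFor(MapBucketsStrategy strategy, uint64_t max_buckets)
{
    switch (strategy)
    {
        case MapBucketsStrategy::Constant:
            return max_buckets;
        case MapBucketsStrategy::Sqrt:
        case MapBucketsStrategy::Linear:
            return 1; // placeholder for the real statistics-based computation
    }
    // Reached only for out-of-range enum values.
    throw std::logic_error("Unexpected MapBucketsStrategy value");
}
```

With this shape the compiler can also warn about newly added enumerators that are not yet handled, which an initialized-to-default `result` would silently hide.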
LLVM Coverage Report

PR changed-lines coverage: 88.40% (2911/3293, 13 noise lines excluded)
…Stream memory fix

- Add entry for bucketed serialization for Map columns (#99200) under New Feature.
- Move #98893 (ArrowStream excessive memory during format auto-detection) from Bug Fix to Performance Improvement, since it is a performance/resource usage improvement rather than a correctness bug fix.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Cherry pick #99200 to 26.3: Implement automatic buckets in Map data type MergeTree serialization
Backport #99200 to 26.3: Implement automatic buckets in Map data type MergeTree serialization
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Added bucketed serialization for Map columns in MergeTree (`map_serialization_version = 'with_buckets'`). Keys are split into hash-based buckets so that reading a single key (`m['key']`) only reads one bucket instead of the entire column, providing a 2-49x speedup for single-key lookups depending on map size. The number of buckets and the bucketing strategy can be controlled by new MergeTree settings: `map_serialization_version`, `max_buckets_in_map`, `map_buckets_strategy`, `map_buckets_coefficient`, and `map_buckets_min_avg_size`. The setting `map_serialization_version_for_zero_level_parts` allows keeping basic serialization for inserts to avoid write overhead while merged parts use buckets. Skip indexes on `mapKeys`/`mapValues` are supported with bucketed serialization. The `optimize_functions_to_subcolumns` pass rewrites `m['key']` to a per-key subcolumn read when buckets are available.

Documentation entry for user-facing changes
Bucketed Map Serialization in MergeTree
By default, a `Map` column in MergeTree is stored as a single `Array(Tuple(K, V))` stream. Reading a single key with `m['key']` requires scanning the entire column — every key-value pair for every row — even if only one key is needed. For maps with many distinct keys this becomes a bottleneck.

Bucketed serialization (`with_buckets`) splits the key-value pairs into multiple independent substreams (buckets) by hashing the key. When a query accesses `m['key']`, only the bucket that contains that key is read from disk, skipping all other buckets.

Enabling Bucketed Serialization
To avoid slowing down inserts, you can keep `basic` serialization for zero-level parts (created during `INSERT`) and only use `with_buckets` for merged parts.

How It Works
When a data part is written with `with_buckets` serialization:

- Each key-value pair is assigned to a bucket as `bucket = hash(key) % num_buckets`.
- A `buckets_info` metadata stream records the bucket count and statistics.

When a query reads a specific key (`m['key']`), the optimizer rewrites the expression to a key subcolumn (`m.key_<serialized_key>`). The serialization layer computes which bucket the requested key belongs to and reads only that single bucket from disk.
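The routing step can be sketched as follows; `std::hash` stands in for the actual hash function used by the PR, which is not shown here:

```cpp
#include <cstdint>
#include <functional>
#include <string>

// Illustrative sketch of bucket routing: every occurrence of a key hashes
// to the same bucket, so a single-key read touches exactly one substream.
// std::hash is a stand-in; the real serialization uses its own hash.
uint64_t bucketForKey(const std::string & key, uint64_t num_buckets)
{
    return std::hash<std::string>{}(key) % num_buckets;
}
```

Because the mapping is deterministic per key, a reader can recompute the target bucket for `m['key']` from the key alone plus the bucket count stored in `buckets_info`.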
When the full map is read (e.g., `SELECT m`), all buckets are read and reassembled into the original map. This is slower than `basic` serialization due to the overhead of reading and merging multiple substreams.

The bucket count can vary between parts. When parts with different bucket counts are merged, the new part's bucket count is recalculated from the merged statistics. Parts with `basic` and `with_buckets` serialization can coexist in the same table and are merged transparently.

Settings

| Setting | Default | Description |
| --- | --- | --- |
| `map_serialization_version` | `basic` | Serialization version for `Map` columns. `basic` stores as a single array stream. `with_buckets` splits keys into buckets for faster single-key reads. |
| `map_serialization_version_for_zero_level_parts` | `basic` | Serialization version for zero-level parts (created during `INSERT`). Allows keeping `basic` for inserts to avoid write overhead, while merged parts use `with_buckets`. |
| `max_buckets_in_map` | `32` | Upper bound for the bucket count computed by `map_buckets_strategy`. |
| `map_buckets_strategy` | `sqrt` | How the bucket count is computed: `constant` — always use `max_buckets_in_map`; `sqrt` — use `round(coefficient * sqrt(avg_size))`; `linear` — use `round(coefficient * avg_size)`. Result is clamped to `[1, max_buckets_in_map]`. |
| `map_buckets_coefficient` | `1.0` | Coefficient for the `sqrt` and `linear` strategies. Ignored when strategy is `constant`. |
| `map_buckets_min_avg_size` | `32` | Minimum average map size for bucketing to apply. Set to `0` to disable the threshold. |
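As a sketch, the strategy, threshold, and clamping rules from the table above could be combined like this (function and parameter names are assumptions for illustration, not the PR's code):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <stdexcept>

// Hypothetical enum mirroring the documented strategies.
enum class MapBucketsStrategy { Constant, Sqrt, Linear };

// Illustrative sketch of how the bucket count follows the settings:
// small maps keep one bucket, otherwise the strategy formula applies,
// clamped to [1, max_buckets_in_map].
uint64_t chooseNumBuckets(
    MapBucketsStrategy strategy,
    double avg_size,       // average number of keys per row
    double coefficient,    // map_buckets_coefficient
    uint64_t max_buckets,  // max_buckets_in_map
    uint64_t min_avg_size) // map_buckets_min_avg_size
{
    // Below the threshold, bucketing is not worth the overhead.
    if (min_avg_size > 0 && avg_size < static_cast<double>(min_avg_size))
        return 1;

    double raw;
    switch (strategy)
    {
        case MapBucketsStrategy::Constant:
            return max_buckets;
        case MapBucketsStrategy::Sqrt:
            raw = std::round(coefficient * std::sqrt(avg_size));
            break;
        case MapBucketsStrategy::Linear:
            raw = std::round(coefficient * avg_size);
            break;
        default:
            throw std::logic_error("Unexpected map buckets strategy");
    }

    // Clamp to [1, max_buckets_in_map].
    return std::clamp<uint64_t>(static_cast<uint64_t>(raw), 1, max_buckets);
}
```

With the defaults (`sqrt` strategy, coefficient `1.0`, cap `32`, threshold `32`), a map averaging 100 keys per row would get 10 buckets, while very large maps saturate at the cap.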
The following table summarizes the performance impact of `with_buckets` compared to `basic` serialization at various map sizes (10 to 10,000 keys per row). The bucket count was determined by the `sqrt` strategy capped at 32. The exact numbers depend on key/value types, data distribution, and hardware. The measured scenarios are single-key reads (`m['key']`), filtered reads (`SELECT m WHERE m['key'] = ...`), and full map scans (`SELECT m`).

More detailed benchmark summary: https://gist.github.com/Avogar/5bc5867a1a3e587b916b520d06f92185
Recommendations
- Small maps (< 32 keys on average): keep `basic` serialization. The overhead of bucketing is not justified for small maps. The default `map_buckets_min_avg_size = 32` enforces this automatically.
- Medium maps (32–100 keys): use `with_buckets` with the `sqrt` strategy if queries frequently access individual keys. The speedup is 4–8x for single-key lookups.
- Large maps (100+ keys): use `with_buckets`. Single-key lookups are 16–49x faster. Consider `map_serialization_version_for_zero_level_parts = 'basic'` to keep insert speed close to the baseline.
- Full map scans dominate the workload: keep `basic`. Bucketed serialization adds ~2x overhead for full scans.
- Mixed workload (some key lookups, some full scans): use `with_buckets` with zero-level parts set to `basic`. The `PREWHERE` optimization reads only the relevant bucket for the filter, then reads the full map only for matching rows, giving a significant net speedup.

Documentation is written (mandatory for new features)
Note
High Risk
High risk because it changes MergeTree on-disk serialization/stream layout for `Map` (new bucket substreams + metadata) and extends the analyzer to rewrite `m['key']` into dynamic subcolumn reads, impacting query planning and part merge behavior.

Overview
- Adds bucketed on-disk serialization for `Map` columns in MergeTree (`map_serialization_version = 'with_buckets'`) that hashes key/value pairs into multiple bucket substreams plus a `buckets_info` metadata stream, so single-key access can read only the relevant bucket.
- Introduces new MergeTree settings/enums to control map bucketing (`map_serialization_version[_for_zero_level_parts]`, `max_buckets_in_map`, `map_buckets_strategy`, `map_buckets_coefficient`, `map_buckets_min_avg_size`) and persists the chosen Map serialization version in `SerializationInfo`.
- Enables key-level reads via dynamic subcolumns by teaching `DataTypeMap` to resolve `m.key_<serialized_key>` and adding `DataTypeMapHelpers::extractKeyValueFromMap` for fast constant-key extraction; the analyzer pass now rewrites `m['const']` (internal `arrayElement`) to these key subcolumns, including cases involving indexes/full-column reads.
- Refactors dynamic-structure APIs and adds statistics plumbing across `IColumn` and multiple column types (Array/Tuple/Variant/Dynamic/Object/Map/etc.), splitting "take exact structure" vs "choose structure for merge" and adding `takeOrCalculateStatisticsFrom`/`hasStatistics` to support bucket/structure decisions and invalidation when structures change.
- Docs add a new section describing bucketed Map serialization, settings, and trade-offs.
Written by Cursor Bugbot for commit 9518ddc.