
Implement automatic buckets in Map data type MergeTree serialization#99200

Merged
Avogar merged 30 commits into master from buckets-in-map-2 on Mar 25, 2026

Conversation

@Avogar
Member

@Avogar Avogar commented Mar 10, 2026

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Added bucketed serialization for Map columns in MergeTree (map_serialization_version = 'with_buckets'). Keys are split into hash-based buckets so that reading a single key (m['key']) only reads one bucket instead of the entire column, providing 2-49x speedup for single-key lookups depending on map size. The number of buckets and the bucketing strategy can be controlled by new MergeTree settings: map_serialization_version, max_buckets_in_map, map_buckets_strategy, map_buckets_coefficient, and map_buckets_min_avg_size. The setting map_serialization_version_for_zero_level_parts allows keeping basic serialization for inserts to avoid write overhead while merged parts use buckets. Skip indexes on mapKeys/mapValues are supported with bucketed serialization. The optimize_functions_to_subcolumns pass rewrites m['key'] to a per-key subcolumn read when buckets are available.

Documentation entry for user-facing changes

Bucketed Map Serialization in MergeTree

By default, a Map column in MergeTree is stored as a single Array(Tuple(K, V)) stream.
Reading a single key with m['key'] requires scanning the entire column — every key-value pair for every row — even if only one key is needed.
For maps with many distinct keys this becomes a bottleneck.

Bucketed serialization (with_buckets) splits the key-value pairs into multiple independent substreams (buckets) by hashing the key.
When a query accesses m['key'], only the bucket that contains that key is read from disk, skipping all other buckets.
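The read path can be sketched as follows. This is an illustrative Python model, not ClickHouse's implementation: the real hash function, stream layout, and missing-key handling are internal, so `bucket_of`, `write_buckets`, and `read_single_key` are hypothetical names.

```python
# Illustrative sketch of bucketed single-key reads. The hash function and
# in-memory "layout" are stand-ins, not ClickHouse's actual implementation.
import hashlib

def bucket_of(key: str, num_buckets: int) -> int:
    """Assign a key to a bucket: bucket = hash(key) % num_buckets."""
    h = int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "little")
    return h % num_buckets

def write_buckets(rows, num_buckets):
    """Simulated layout: each bucket holds only the pairs hashed into it."""
    buckets = [[] for _ in range(num_buckets)]
    for row_id, kv in enumerate(rows):
        for k, v in kv.items():
            buckets[bucket_of(k, num_buckets)].append((row_id, k, v))
    return buckets

def read_single_key(buckets, num_buckets, key, num_rows):
    """Read m['key']: touch only one bucket instead of all of them."""
    result = [None] * num_rows  # ClickHouse would use the value type's default
    for row_id, k, v in buckets[bucket_of(key, num_buckets)]:
        if k == key:
            result[row_id] = v
    return result

rows = [{"a": 1, "b": 2}, {"a": 3}, {"c": 4}]
buckets = write_buckets(rows, 4)
print(read_single_key(buckets, 4, "a", len(rows)))  # [1, 3, None]
```

Whatever bucket `"a"` hashes to, all of its pairs land in that one bucket, so the lookup never needs to open the others.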

Enabling Bucketed Serialization

CREATE TABLE tab (id UInt64, m Map(String, UInt64))
ENGINE = MergeTree ORDER BY id
SETTINGS
    map_serialization_version = 'with_buckets',
    max_buckets_in_map = 32,
    map_buckets_strategy = 'sqrt';

To avoid slowing down inserts, you can keep basic serialization for zero-level parts (created during INSERT) and only use with_buckets for merged parts:

CREATE TABLE tab (id UInt64, m Map(String, UInt64))
ENGINE = MergeTree ORDER BY id
SETTINGS
    map_serialization_version = 'with_buckets',
    map_serialization_version_for_zero_level_parts = 'basic',
    max_buckets_in_map = 32,
    map_buckets_strategy = 'sqrt';

How It Works

When a data part is written with with_buckets serialization:

  1. The average number of keys per row is computed from the block statistics.
  2. The number of buckets is determined by the configured strategy.
  3. Each key-value pair is assigned to a bucket by hashing the key: bucket = hash(key) % num_buckets.
  4. Each bucket is stored as an independent substream with its own keys, values, and offsets.
  5. A buckets_info metadata stream records the bucket count and statistics.
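Steps 3 and 4 can be sketched as a minimal Python model, assuming ClickHouse-style cumulative row offsets; `hash_key` and `split_into_buckets` are stand-in names, not the real C++ implementation:

```python
# Sketch of splitting one Map column (flat keys/values plus cumulative row
# offsets) into per-bucket substreams, each with its own offsets array.

def hash_key(key):
    """Stand-in hash; ClickHouse uses its own internal hash function."""
    return sum(key.encode()) if isinstance(key, str) else key

def split_into_buckets(keys, values, offsets, num_buckets):
    bkeys = [[] for _ in range(num_buckets)]
    bvalues = [[] for _ in range(num_buckets)]
    boffsets = [[] for _ in range(num_buckets)]
    for i in range(len(offsets)):
        start = 0 if i == 0 else offsets[i - 1]  # guard the first row
        end = offsets[i]
        for j in range(start, end):
            b = hash_key(keys[j]) % num_buckets
            bkeys[b].append(keys[j])
            bvalues[b].append(values[j])
        for b in range(num_buckets):
            boffsets[b].append(len(bkeys[b]))  # cumulative per-bucket offsets
    return bkeys, bvalues, boffsets

# Rows {'a':1,'b':2}, {'a':3}, {'c':4} as a flat column with offsets [2, 3, 4]:
bkeys, bvalues, boffsets = split_into_buckets(
    ["a", "b", "a", "c"], [1, 2, 3, 4], [2, 3, 4], 2)
```

Note that each bucket keeps a full offsets array (one entry per row), so rows with no keys in a bucket are still representable.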

When a query reads a specific key (m['key']), the optimizer rewrites the expression to a key subcolumn (m.key_<serialized_key>).
The serialization layer computes which bucket the requested key belongs to and reads only that single bucket from disk.

When the full map is read (e.g., SELECT m), all buckets are read and reassembled into the original map. This is slower than basic serialization due to the overhead of reading and merging multiple substreams.

The bucket count can vary between parts. When parts with different bucket counts are merged, the new part's bucket count is recalculated from the merged statistics. Parts with basic and with_buckets serialization can coexist in the same table and are merged transparently.

Settings

| Setting | Default | Description |
|---|---|---|
| map_serialization_version | basic | Serialization format for Map columns. basic stores the map as a single array stream; with_buckets splits keys into buckets for faster single-key reads. |
| map_serialization_version_for_zero_level_parts | basic | Serialization format for zero-level parts (created by INSERT). Allows keeping basic for inserts to avoid write overhead, while merged parts use with_buckets. |
| max_buckets_in_map | 32 | Upper bound on the number of buckets. The actual count depends on map_buckets_strategy. |
| map_buckets_strategy | sqrt | Strategy for computing the bucket count from the average map size: constant always uses max_buckets_in_map; sqrt uses round(coefficient * sqrt(avg_size)); linear uses round(coefficient * avg_size). The result is clamped to [1, max_buckets_in_map]. |
| map_buckets_coefficient | 1.0 | Multiplier for the sqrt and linear strategies. Ignored when the strategy is constant. |
| map_buckets_min_avg_size | 32 | Minimum average keys per row to enable bucketing. If the average is below this threshold, a single bucket is used regardless of other settings. Set to 0 to disable the threshold. |
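Putting the strategy, threshold, and clamping rules together, the bucket-count computation can be modeled as below. The function name and the error on an unknown strategy are my assumptions; the semantics follow the settings described above.

```python
# Model of the bucket-count selection described by the settings above.
# Parameter names mirror the MergeTree settings.
import math

def calculate_num_buckets(avg_size: float, strategy: str,
                          max_buckets: int = 32,
                          coefficient: float = 1.0,
                          min_avg_size: int = 32) -> int:
    # Below the threshold, bucketing is effectively disabled (one bucket).
    if min_avg_size > 0 and avg_size < min_avg_size:
        return 1
    if strategy == "constant":
        result = max_buckets
    elif strategy == "sqrt":
        result = round(coefficient * math.sqrt(avg_size))
    elif strategy == "linear":
        result = round(coefficient * avg_size)
    else:
        raise ValueError(f"unknown map_buckets_strategy: {strategy}")
    return min(max(result, 1), max_buckets)  # clamp to [1, max_buckets]

print(calculate_num_buckets(100, "sqrt"))     # 10
print(calculate_num_buckets(10_000, "sqrt"))  # 32 (clamped to max_buckets)
print(calculate_num_buckets(10, "linear"))    # 1  (below min_avg_size)
```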

Performance Tradeoffs

The following table summarizes the performance impact of with_buckets compared to basic serialization at various map sizes (10 to 10,000 keys per row). The bucket count was determined by the sqrt strategy capped at 32. The exact numbers depend on key/value types, data distribution, and hardware.

| Operation | 10 keys | 100 keys | 1,000 keys | 10,000 keys | Notes |
|---|---|---|---|---|---|
| Single key lookup (m['key']) | 1.6–3.2x faster | 4.5–7.7x faster | 16–39x faster | 21–49x faster | Reads only one bucket instead of the entire column. |
| 5 key lookups | ~1x | 1.5–3.1x faster | 2.9–8.3x faster | 4.5–6.7x faster | Each key reads its own bucket; some buckets may overlap. |
| PREWHERE (SELECT m WHERE m['key'] = ...) | 1.5–3.0x faster | 2.9–7.3x faster | 5.3–31x faster | 20–45x faster | The PREWHERE filter reads only one bucket; the full map is read only for matching rows. Speedup depends on selectivity: fewer matching granules means less full-map I/O. |
| Full map scan (SELECT m) | ~2x slower | ~2x slower | ~2x slower | ~2x slower | Must read and reassemble all buckets. |
| INSERT | 1.5–2.5x slower | 1.5–2.5x slower | 1.5–2.5x slower | 1.5–2.5x slower | Overhead of hashing keys and writing to multiple substreams. |

More detailed benchmark summary: https://gist.github.com/Avogar/5bc5867a1a3e587b916b520d06f92185

Recommendations

  • Small maps (< 32 keys on average): Keep basic serialization. The overhead of bucketing is not justified for small maps. The default map_buckets_min_avg_size = 32 enforces this automatically.

  • Medium maps (32–100 keys): Use with_buckets with sqrt strategy if queries frequently access individual keys. The speedup is 4–8x for single-key lookups.

  • Large maps (100+ keys): Use with_buckets. Single-key lookups are 16–49x faster. Consider map_serialization_version_for_zero_level_parts = 'basic' to keep insert speed close to the baseline.

  • Full map scans dominate the workload: Keep basic. Bucketed serialization adds ~2x overhead for full scans.

  • Mixed workload (some key lookups, some full scans): Use with_buckets with zero-level parts set to basic. The PREWHERE optimization reads only the relevant bucket for the filter, then reads the full map only for matching rows, giving a significant net speedup.

  • Documentation is written (mandatory for new features)


Note

High Risk
High risk because it changes MergeTree on-disk serialization/stream layout for Map (new bucket substreams + metadata) and extends the analyzer to rewrite m['key'] into dynamic subcolumn reads, impacting query planning and part merge behavior.

Overview
Adds bucketed on-disk serialization for Map columns in MergeTree (map_serialization_version='with_buckets') that hashes key/value pairs into multiple bucket substreams plus a buckets_info metadata stream, so single-key access can read only the relevant bucket.

Introduces new MergeTree settings/enums to control map bucketing (map_serialization_version[_for_zero_level_parts], max_buckets_in_map, map_buckets_strategy, map_buckets_coefficient, map_buckets_min_avg_size) and persists the chosen Map serialization version in SerializationInfo.

Enables key-level reads via dynamic subcolumns by teaching DataTypeMap to resolve m.key_<serialized_key> and adding DataTypeMapHelpers::extractKeyValueFromMap for fast constant-key extraction; the analyzer pass now rewrites m['const'] (internal arrayElement) to these key subcolumns, including cases involving indexes/full-column reads.

Refactors dynamic-structure APIs and adds statistics plumbing across IColumn and multiple column types (Array/Tuple/Variant/Dynamic/Object/Map/etc.), splitting “take exact structure” vs “choose structure for merge” and adding takeOrCalculateStatisticsFrom/hasStatistics to support bucket/structure decisions and invalidation when structures change.

Docs add a new section describing bucketed Map serialization, settings, and trade-offs.

Written by Cursor Bugbot for commit 9518ddc.

@clickhouse-gh
Contributor

clickhouse-gh bot commented Mar 10, 2026

Workflow [PR], commit [25b41d9]

Summary:

| job_name | test_name | status | info | comment |
|---|---|---|---|---|
| Integration tests (amd_asan_ubsan, targeted) | test_ldap_follow_referrals/test.py::test_follow_referrals_false | FAIL | cidb | IGNORED |
| Integration tests (amd_asan_ubsan, targeted) | test_ldap_follow_referrals/test.py::test_follow_referrals_true | FAIL | cidb | IGNORED |

AI Review

Summary

This PR introduces bucketed Map serialization in MergeTree (with_buckets), new settings controlling bucket selection, and optimizer rewrites for m['key'] key-subcolumn reads. The change is valuable but currently not merge-ready: there are multiple correctness issues in bucket row/offset handling and corrupted-stream guards (already raised inline), plus a rollout-compliance gap: the new feature is not protected by an explicit experimental gate.

Missing context
  • ⚠️ No CI report/log analysis was provided in this review context, so runtime coverage is inferred only from test diffs and code inspection.
Findings
  • ❌ Blockers

    • [src/Columns/ColumnMap.cpp:467] Out-of-bounds read for empty ranges due to offsets[end - 1] / offsets[start - 1] when start == 0 or start == end. This can produce incorrect statistics or exception paths.
      • Suggested fix: special-case start == end early and use start == 0 ? 0 : offsets[start - 1] style boundaries.
    • [src/DataTypes/Serializations/SerializationMap.cpp:972,1095] First-row bucket offsets use offsets[i - 1] at i == 0, causing invalid memory access and wrong reconstruction/splitting.
      • Suggested fix: use i == 0 ? 0 : offsets[i - 1] consistently across split/collect paths.
    • [src/DataTypes/DataTypeMapHelpers.cpp:113; src/DataTypes/Serializations/SerializationMapKeysOrValues.cpp:180] Same first-row offsets[-1] pattern in helper/subcolumn extraction path.
      • Suggested fix: unify safe offset-boundary helper and reuse it in all map offset traversals.
    • [src/DataTypes/Serializations/SerializationMap.cpp:1087; src/DataTypes/Serializations/SerializationMapSize.cpp:164] Multi-bucket collectors assume non-empty bucket arrays and/or valid first bucket without robust corrupted-metadata validation.
      • Suggested fix: validate bucket count from metadata (>=1) and fail with clear exception before indexing bucket 0.
    • [src/Storages/MergeTree/MergeTreeSettings.cpp:352,387] Defaults in settings block were left in testing values and diverge from SettingsChangesHistory/docs (basic, min avg size 32), creating compatibility/documentation mismatch.
      • Suggested fix: restore production defaults and keep testing overrides only in tests.
    • [src/DataTypes/Serializations/SerializationMap.cpp:894,903] calculateNumberOfBuckets has robustness gaps (uninitialized/default handling and unsafe FP-to-UInt64 conversion for non-finite/invalid values).
      • Suggested fix: initialize with safe default, handle unknown strategy explicitly, and validate computed value before cast.
    • [src/Storages/MergeTree/MergeTreeSettings.cpp:355] New feature rollout is missing an explicit experimental gate for enabling with_buckets serialization.
      • Suggested fix: require a dedicated allow_experimental_* setting (or equivalent gate) when map_serialization_version / _for_zero_level_parts uses with_buckets.
  • 💡 Nits

    • [src/Storages/MergeTree/MergeTreeSettings.cpp:353] Typo in docs string: "take affect" should be "take effect".
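The first-row/empty-range offset pattern flagged in the blockers above can be centralized in one safe boundary helper, as the suggested fixes propose. A Python sketch of the idea (the real code is C++; these helper names are hypothetical):

```python
# Sketch of a safe boundary helper for ClickHouse-style cumulative offsets,
# avoiding offsets[i - 1] reads at row 0 and handling empty ranges.

def offset_at(offsets, i):
    """Start boundary of row i; offsets[i - 1] is invalid when i == 0."""
    return 0 if i == 0 else offsets[i - 1]

def range_size(offsets, start, end):
    """Total number of elements in rows [start, end); 0 for empty ranges."""
    if start == end:  # special-case empty ranges before touching offsets
        return 0
    return offsets[end - 1] - offset_at(offsets, start)

offsets = [2, 3, 7]               # rows of size 2, 1, 4
print(range_size(offsets, 0, 3))  # 7
print(range_size(offsets, 1, 1))  # 0 (no out-of-bounds read)
```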
Tests
  • ⚠️ Add compatibility tests for mixed-part scenarios with Map serialization versions (basic + with_buckets) across merges and reads.
  • ⚠️ Add negative/corruption tests for invalid MapBucketsInfo metadata (for example buckets == 0) to ensure deterministic exception instead of invalid memory access.
ClickHouse Rules
| Item | Status | Notes |
|---|---|---|
| Deletion logging | | |
| Serialization versioning | | |
| Core-area scrutiny | | |
| No test removal | | |
| Experimental gate | | New user-facing Map bucketed serialization mode is not behind an explicit experimental gate. |
| No magic constants | | |
| Backward compatibility | ⚠️ | Testing defaults in the settings block diverged from documented/history defaults in the PR state under review. |
| SettingsChangesHistory.cpp | | |
| PR metadata quality | | |
| Safe rollout | ⚠️ | Missing explicit experimental gate weakens staged rollout control. |
| Compilation time | ⚠️ | PR touches many high-fan-out headers; risk is acceptable only if unavoidable, but should be monitored. |
Performance & Safety

Bucketed reads can materially improve single-key lookup performance, but current offset/bucket-boundary defects create correctness/safety risk in both write and read paths. The feature should not be enabled broadly until those blockers are resolved and corruption guards are complete.

Final Verdict

Status: ❌ Block

Minimum required actions:

  1. Fix all first-row/empty-range offset boundary issues in Map statistics/split/collect/helper paths.
  2. Add strict corrupted-metadata validation for bucket counts before bucket indexing.
  3. Restore non-testing defaults and keep settings/history/docs aligned.
  4. Add an explicit experimental gate for with_buckets activation.

@clickhouse-gh clickhouse-gh bot added the pr-feature Pull request with new product feature label Mar 10, 2026
…in MergeTree writer, split dynamic structure and dynamic subcolumns logic, split taking dynamic structure and statistics logic
@den-crane
Contributor

den-crane commented Mar 11, 2026

I don't see a test for repeated keys.
Maybe I overlooked it; if so, sorry.

something like:

insert into t(id,m_uint) values (66, {1 : 'a', 1 : 'b', 1 : 'c'});

select * from t where id = 66;
┌─id─┬─m_uint──────────────┬─m_int─┬─m_date─┬─m_uuid─┬─m_ipv4─┬─m_fs─┬─m_lc─┐
│ 66 │ {1:'a',1:'b',1:'c'} │ {}    │ {}     │ {}     │ {}     │ {}   │ {}   │
└────┴─────────────────────┴───────┴────────┴────────┴────────┴──────┴──────┘


select m_uint[1] from t;
┌─arrayElement(m_uint, 1)─┐
│ a                       │
└─────────────────────────┘

select mapKeys(m_uint), mapValues(m_uint) from t;
┌─mapKeys(m_uint)─┬─mapValues(m_uint)─┐
│ [1,1,1]         │ ['a','b','c']     │
└─────────────────┴───────────────────┘

@Avogar
Member Author

Avogar commented Mar 11, 2026

I don't see a test for repeated keys.
Maybe I overlooked it; if so, sorry.

I believe we already have existing tests with repeated keys. And by default, map[key] will always be rewritten to a subcolumn (unless optimize_functions_to_subcolumns is disabled), so existing tests should cover this. Subcolumn extraction should behave exactly as the arrayElement function would.
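The behavior being discussed (modeled on the query output shown in the comment above) can be sketched as follows; `array_element` is an illustrative Python stand-in for the arrayElement semantics the subcolumn must match:

```python
# With repeated keys, the key subcolumn must return the same value as
# arrayElement: the first matching value, or the type default if absent.
# (Behavior modeled on the output shown in the comment above.)

def array_element(pairs, key, default=""):
    """Mimic Map arrayElement: first matching value, default if missing."""
    for k, v in pairs:
        if k == key:
            return v
    return default

pairs = [(1, "a"), (1, "b"), (1, "c")]
print(array_element(pairs, 1))  # 'a' (first match, as in the example above)
print(array_element(pairs, 2))  # ''  (type default for a missing key)
```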

Member Author


I enabled new serialization for all maps by default for testing purposes in the CI. Later I will change defaults back to normal.

if (min_avg_size > 0 && statistics->avg < static_cast<Float64>(min_avg_size))
    return 1;

UInt64 result;
Contributor


calculateNumberOfBuckets can use an uninitialized result value.

result is declared without initialization, and the switch has no default. If strategy ever contains an unexpected enum value (e.g., future enum extension, corrupted metadata, or mismatch across binaries), result is read uninitialized in the return expression, which is undefined behavior.

Please make this path explicit, e.g. add a default branch that throws, or initialize result to a safe value before the switch.

@clickhouse-gh
Contributor

clickhouse-gh bot commented Mar 25, 2026

LLVM Coverage Report

| Metric | Baseline | Current | Δ |
|---|---|---|---|
| Lines | 84.10% | 84.20% | +0.10% |
| Functions | 24.60% | 24.40% | -0.20% |
| Branches | 76.70% | 76.70% | +0.00% |

PR changed-lines coverage: 88.40% (2911/3293, 13 noise lines excluded)
Diff coverage report
Uncovered code

@Avogar
Member Author

Avogar commented Mar 25, 2026

@Avogar Avogar added this pull request to the merge queue Mar 25, 2026
Merged via the queue into master with commit f597822 Mar 25, 2026
151 of 153 checks passed
@Avogar Avogar deleted the buckets-in-map-2 branch March 25, 2026 21:08
@robot-clickhouse robot-clickhouse added the pr-synced-to-cloud The PR is synced to the cloud repo label Mar 25, 2026
@robot-clickhouse-ci-2 robot-clickhouse-ci-2 added the pr-must-backport-synced The `*-must-backport` labels are synced into the cloud Sync PR label Mar 25, 2026
alexey-milovidov added a commit that referenced this pull request Mar 26, 2026
…Stream memory fix

- Add entry for bucketed serialization for Map columns (#99200) under New Feature.
- Move #98893 (ArrowStream excessive memory during format auto-detection) from Bug Fix to Performance Improvement, since it is a performance/resource usage improvement rather than a correctness bug fix.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
robot-ch-test-poll added a commit that referenced this pull request Mar 27, 2026
Cherry pick #99200 to 26.3: Implement automatic buckets in Map data type MergeTree serialization
robot-clickhouse added a commit that referenced this pull request Mar 27, 2026
@robot-ch-test-poll1 robot-ch-test-poll1 added the pr-backports-created Backport PRs are successfully created, it won't be processed by CI script anymore label Mar 27, 2026
Avogar added a commit that referenced this pull request Mar 27, 2026
Backport #99200 to 26.3: Implement automatic buckets in Map data type MergeTree serialization
Desel72 pushed a commit to Desel72/ClickHouse that referenced this pull request Mar 30, 2026
…Stream memory fix

- Add entry for bucketed serialization for Map columns (ClickHouse#99200) under New Feature.
- Move ClickHouse#98893 (ArrowStream excessive memory during format auto-detection) from Bug Fix to Performance Improvement, since it is a performance/resource usage improvement rather than a correctness bug fix.

Labels

  • pr-backports-created: Backport PRs are successfully created, it won't be processed by CI script anymore
  • pr-feature: Pull request with new product feature
  • pr-must-backport-synced: The `*-must-backport` labels are synced into the cloud Sync PR
  • pr-synced-to-cloud: The PR is synced to the cloud repo
  • v26.3-must-backport


9 participants