Skip to content

Significantly improve performance of JSON subcolumns reading from shared data in MergeTree#83777

Merged
Avogar merged 47 commits intoClickHouse:masterfrom
Avogar:json-shared-data-v2
Aug 25, 2025
Merged

Significantly improve performance of JSON subcolumns reading from shared data in MergeTree#83777
Avogar merged 47 commits intoClickHouse:masterfrom
Avogar:json-shared-data-v2

Conversation

@Avogar
Copy link
Copy Markdown
Member

@Avogar Avogar commented Jul 15, 2025

Changelog category (leave one):

  • Performance Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Significantly improve performance of JSON subcolumns reading from shared data in MergeTree by implementing new serializations for JSON shared data in MergeTree.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Now there are 3 possible JSON shared data serializations in MergeTree:

  • map - serializes shared data as a single column with type Map(String, String) (old serialization). This serialization is efficient for writing data and reading the whole JSON column, but it's not efficient for reading paths sub-columns.
  • map_with_buckets - serializes shared data as N columns ("buckets") with type Map(String, String) where each bucket contains only subset of paths. This serialization is less efficient for writing data and reading the whole JSON column, but it's more efficient for reading paths sub-columns because it reads data only from required buckets.
  • advanced - serializes shared data in a special data structure that maximizes the performance
    of paths sub-columns reading by storing some additional information that allows to read only the data of requested paths. This serialization also supports buckets, so each bucket contains only subset of paths. This serialization is quite inefficient for writing data (so it's not recommended to use this serialization for zero-level parts), reading the whole JSON column is slightly less efficient compared to map serialization, but it's very efficient for reading paths sub-columns.

Type of serialization is controlled by MergeTree
settings object_shared_data_serialization_version
and object_shared_data_serialization_version_for_zero_level_parts (zero level part is the part created during inserting data into the table, during merges parts have higher level). Default serialization will be map for next few releases, but after that it will be changed to advanced (for zero-level parts it will be map or map_with_buckets). Note: changing shared data serialization version is supported only for object serialization version v3 (setting object_serialization_version) which is currently v2 by default (but will be changed after a few releases to v3).

Number of buckets is controlled by MergeTree settings object_shared_data_buckets_for_compact_part (8 by default)
and object_shared_data_buckets_for_wide_part (32 by default).

See the performance comparison of JSON paths subcolumns below in the comments

@Avogar Avogar mentioned this pull request Jul 15, 2025
56 tasks
@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented Jul 15, 2025

Workflow [PR], commit [e36c1ea]

Summary:

job_name test_name status info comment
Stateless tests (amd_asan, distributed plan, sequential) failure
01150_ddl_guard_rwr FAIL
Install packages (amd_debug) failure
Install clickhouse binary in deb failure
Stress test (arm_asan) failure
Server died FAIL
Hung check failed, possible deadlock found (see hung_check.log) FAIL
Killed by signal (in clickhouse-server.log) FAIL
Fatal message in clickhouse-server.log (see fatal_messages.txt) FAIL
Killed by signal (output files) FAIL
Stress test (amd_msan) failure
Server died FAIL
Hung check failed, possible deadlock found (see hung_check.log) FAIL
Killed by signal (in clickhouse-server.log) FAIL
Fatal message in clickhouse-server.log (see fatal_messages.txt) FAIL
Killed by signal (output files) FAIL
Found signal in gdb.log FAIL

@clickhouse-gh clickhouse-gh bot added the pr-performance Pull request with some performance improvements label Jul 15, 2025
@Avogar
Copy link
Copy Markdown
Member Author

Avogar commented Jul 15, 2025

Performance comparison of reading JSON paths sub-columns on local filesystem

In this comparison we have simple JSONs {"key0" : 0, "key1" : 1, "key2" : 2, ..., "keyN" : N} stored using:

  • Map
  • String
  • JSON with all paths separated into sub-columns
  • JSON with all paths stored in shared data with map serialization
  • JSON with all paths stored in shared data with map_with_buckets serialization
  • JSON with all paths stored in shared data with advanced serialization

For map_with_buckets and advanced shared data serializations 8 buckets was used for Compact parts and 32 for Wide.

JSON with 10 paths, 5,000,000 rows

Results

Compact part:

Q1: Reading single key

┌─table_type──────────────────────────┬─time_sec─┬─memory_usage─┬─query───────────────────────────────┐
│ JSON no shared data                 │     0.04 │ 73.41 MiB    │ select json.key0                    │
│ JSON shared data "advanced"         │     0.05 │ 78.84 MiB    │ select json.key0                    │
│ JSON shared data "map_with_buckets" │    0.084 │ 77.76 MiB    │ select json.key0                    │
│ JSON shared data "map"              │    0.095 │ 128.95 MiB   │ select json.key0                    │
│ Map                                 │    0.097 │ 637.94 MiB   │ select json['key0']                 │
│ String                              │    0.138 │ 463.17 MiB   │ select JSONExtractInt(json, 'key0') │
└─────────────────────────────────────┴──────────┴──────────────┴─────────────────────────────────────┘

Q2: Reading non existing key

┌─table_type──────────────────────────┬─time_sec─┬─memory_usage─┬─query───────────────────────────────────────────┐
│ JSON no shared data                 │    0.043 │ 74.18 MiB    │ select json.non_existing_key                    │
│ JSON shared data "map_with_buckets" │    0.043 │ 82.75 MiB    │ select json.non_existing_key                    │
│ JSON shared data "advanced"         │    0.049 │ 78.82 MiB    │ select json.non_existing_key                    │
│ JSON shared data "map"              │    0.065 │ 101.99 MiB   │ select json.non_existing_key                    │
│ Map                                 │    0.087 │ 599.00 MiB   │ select json['non_existing_key']                 │
│ String                              │     0.13 │ 428.92 MiB   │ select JSONExtractInt(json, 'non_existing_key') │
└─────────────────────────────────────┴──────────┴──────────────┴─────────────────────────────────────────────────┘

Q3: Reading 5 different keys

┌─table_type──────────────────────────┬─time_sec─┬─memory_usage─┬─query───────────────────────────────────────────────────────────────────────┐
│ JSON no shared data                 │     0.05 │ 154.72 MiB   │ select json.key0, json.key1, json.key2, json.key3, json.key4                │
│ JSON shared data "advanced"         │    0.069 │ 222.61 MiB   │ select json.key0, json.key1, json.key2, json.key3, json.key4                │
│ Map                                 │    0.164 │ 698.95 MiB   │ select json['key0'], json['key1'], json['key2'], json['key3'], json['key4'] │
│ JSON shared data "map_with_buckets" │    0.247 │ 254.73 MiB   │ select json.key0, json.key1, json.key2, json.key3, json.key4                │
│ JSON shared data "map"              │    0.253 │ 237.10 MiB   │ select json.key0, json.key1, json.key2, json.key3, json.key4                │
│ String                              │    0.472 │ 463.18 MiB   │ select JSONExtractInt(json, 'key0'),                                       ↴│
│                                     │          │              │↳       JSONExtractInt(json, 'key1'),                                       ↴│
│                                     │          │              │↳       JSONExtractInt(json, 'key2'),                                       ↴│
│                                     │          │              │↳       JSONExtractInt(json, 'key3'),                                       ↴│
│                                     │          │              │↳       JSONExtractInt(json, 'key4')                                         │
└─────────────────────────────────────┴──────────┴──────────────┴─────────────────────────────────────────────────────────────────────────────┘

Q4: Reading the whole JSON

┌─table_type──────────────────────────┬─time_sec─┬─memory_usage─┬─query───────┐
│ String                              │    0.059 │ 384.38 MiB   │ select json │
│ JSON no shared data                 │    0.067 │ 363.50 MiB   │ select json │
│ JSON shared data "map"              │    0.085 │ 583.25 MiB   │ select json │
│ Map                                 │    0.086 │ 570.58 MiB   │ select json │
│ JSON shared data "advanced"         │    0.127 │ 632.79 MiB   │ select json │
│ JSON shared data "map_with_buckets" │    0.314 │ 652.40 MiB   │ select json │
└─────────────────────────────────────┴──────────┴──────────────┴─────────────┘

Wide part:

Q1: Reading single key

┌─table_type──────────────────────────┬─time_sec─┬─memory_usage─┬─query───────────────────────────────┐
│ JSON no shared data                 │    0.007 │ 61.07 MiB    │ select json.key0                    │
│ JSON shared data "advanced"         │     0.01 │ 67.58 MiB    │ select json.key0                    │
│ JSON shared data "map_with_buckets" │    0.048 │ 102.98 MiB   │ select json.key0                    │
│ Map                                 │    0.075 │ 100.66 MiB   │ select json['key0']                 │
│ String                              │    0.121 │ 119.15 MiB   │ select JSONExtractInt(json, 'key0') │
│ JSON shared data "map"              │    0.124 │ 775.28 MiB   │ select json.key0                    │
└─────────────────────────────────────┴──────────┴──────────────┴─────────────────────────────────────┘

Q2: Reading non existing key

┌─table_type──────────────────────────┬─time_sec─┬─memory_usage─┬─query───────────────────────────────────────────┐
│ JSON shared data "map_with_buckets" │    0.004 │ 31.95 MiB    │ select json.non_existing_key                    │
│ JSON shared data "advanced"         │    0.004 │ 29.49 MiB    │ select json.non_existing_key                    │
│ JSON no shared data                 │    0.005 │ 32.51 MiB    │ select json.non_existing_key                    │
│ Map                                 │    0.062 │ 97.11 MiB    │ select json['non_existing_key']                 │
│ JSON shared data "map"              │    0.091 │ 762.46 MiB   │ select json.non_existing_key                    │
│ String                              │    0.116 │ 119.15 MiB   │ select JSONExtractInt(json, 'non_existing_key') │
└─────────────────────────────────────┴──────────┴──────────────┴─────────────────────────────────────────────────┘

Q3: Reading 5 different keys

┌─table_type──────────────────────────┬─time_sec─┬─memory_usage─┬─query───────────────────────────────────────────────────────────────────────┐
│ JSON no shared data                 │    0.034 │ 119.84 MiB   │ select json.key0, json.key1, json.key2, json.key3, json.key4                │
│ JSON shared data "advanced"         │    0.052 │ 289.89 MiB   │ select json.key0, json.key1, json.key2, json.key3, json.key4                │
│ Map                                 │    0.116 │ 109.39 MiB   │ select json['key0'], json['key1'], json['key2'], json['key3'], json['key4'] │
│ JSON shared data "map_with_buckets" │    0.236 │ 536.31 MiB   │ select json.key0, json.key1, json.key2, json.key3, json.key4                │
│ JSON shared data "map"              │    0.257 │ 794.16 MiB   │ select json.key0, json.key1, json.key2, json.key3, json.key4                │
│ String                              │    0.503 │ 118.41 MiB   │ select JSONExtractInt(json, 'key0'),                                       ↴│
│                                     │          │              │↳       JSONExtractInt(json, 'key1'),                                       ↴│
│                                     │          │              │↳       JSONExtractInt(json, 'key2'),                                       ↴│
│                                     │          │              │↳       JSONExtractInt(json, 'key3'),                                       ↴│
│                                     │          │              │↳       JSONExtractInt(json, 'key4')                                         │
└─────────────────────────────────────┴──────────┴──────────────┴─────────────────────────────────────────────────────────────────────────────┘

Q4: Reading the whole JSON

┌─table_type──────────────────────────┬─time_sec─┬─memory_usage─┬─query───────┐
│ String                              │     0.04 │ 92.00 MiB    │ select json │
│ Map                                 │    0.061 │ 95.35 MiB    │ select json │
│ JSON shared data "map"              │    0.067 │ 99.96 MiB    │ select json │
│ JSON no shared data                 │    0.069 │ 100.20 MiB   │ select json │
│ JSON shared data "advanced"         │    0.083 │ 108.27 MiB   │ select json │
│ JSON shared data "map_with_buckets" │    0.441 │ 480.54 MiB   │ select json │
└─────────────────────────────────────┴──────────┴──────────────┴─────────────┘

JSON with 100 paths, 2,000,000 rows

Results

Compact part:

Q1: Reading single key

┌─table_type──────────────────────────┬─time_sec─┬─memory_usage─┬─query───────────────────────────────┐
│ JSON no shared data                 │     0.02 │ 65.28 MiB    │ select json.key0                    │
│ JSON shared data "advanced"         │    0.028 │ 69.52 MiB    │ select json.key0                    │
│ JSON shared data "map_with_buckets" │    0.066 │ 102.94 MiB   │ select json.key0                    │
│ JSON shared data "map"              │    0.304 │ 417.05 MiB   │ select json.key0                    │
│ String                              │    0.488 │ 3.46 GiB     │ select JSONExtractInt(json, 'key0') │
│ Map                                 │    0.509 │ 4.89 GiB     │ select json['key0']                 │
└─────────────────────────────────────┴──────────┴──────────────┴─────────────────────────────────────┘

Q2: Reading non existing key

┌─table_type──────────────────────────┬─time_sec─┬─memory_usage─┬─query───────────────────────────────────────────┐
│ JSON shared data "advanced"         │    0.027 │ 31.72 MiB    │ select json.non_existing_key                    │
│ JSON no shared data                 │    0.032 │ 31.81 MiB    │ select json.non_existing_key                    │
│ JSON shared data "map_with_buckets" │    0.033 │ 31.49 MiB    │ select json.non_existing_key                    │
│ JSON shared data "map"              │    0.252 │ 413.62 MiB   │ select json.non_existing_key                    │
│ Map                                 │    0.488 │ 4.60 GiB     │ select json['non_existing_key']                 │
│ String                              │      0.5 │ 3.61 GiB     │ select JSONExtractInt(json, 'non_existing_key') │
└─────────────────────────────────────┴──────────┴──────────────┴─────────────────────────────────────────────────┘

Q3: Reading 5 different keys

┌─table_type──────────────────────────┬─time_sec─┬─memory_usage─┬─query───────────────────────────────────────────────────────────────────────┐
│ JSON no shared data                 │    0.039 │ 185.00 MiB   │ select json.key0, json.key1, json.key2, json.key3, json.key4                │
│ JSON shared data "advanced"         │    0.068 │ 195.54 MiB   │ select json.key0, json.key1, json.key2, json.key3, json.key4                │
│ JSON shared data "map_with_buckets" │     0.25 │ 308.87 MiB   │ select json.key0, json.key1, json.key2, json.key3, json.key4                │
│ JSON shared data "map"              │     0.44 │ 548.58 MiB   │ select json.key0, json.key1, json.key2, json.key3, json.key4                │
│ Map                                 │    0.488 │ 5.31 GiB     │ select json['key0'], json['key1'], json['key2'], json['key3'], json['key4'] │
│ String                              │    1.337 │ 3.62 GiB     │ select JSONExtractInt(json, 'key0'),                                       ↴│
│                                     │          │              │↳       JSONExtractInt(json, 'key1'),                                       ↴│
│                                     │          │              │↳       JSONExtractInt(json, 'key2'),                                       ↴│
│                                     │          │              │↳       JSONExtractInt(json, 'key3'),                                       ↴│
│                                     │          │              │↳       JSONExtractInt(json, 'key4')                                         │
└─────────────────────────────────────┴──────────┴──────────────┴─────────────────────────────────────────────────────────────────────────────┘

Q4: Reading the whole JSON

┌─table_type──────────────────────────┬─time_sec─┬─memory_usage─┬─query───────┐
│ String                              │    0.276 │ 2.37 GiB     │ select json │
│ JSON no shared data                 │    0.365 │ 3.48 GiB     │ select json │
│ Map                                 │     0.48 │ 4.75 GiB     │ select json │
│ JSON shared data "advanced"         │     0.55 │ 5.51 GiB     │ select json │
│ JSON shared data "map"              │    0.552 │ 5.21 GiB     │ select json │
│ JSON shared data "map_with_buckets" │     1.25 │ 6.19 GiB     │ select json │
└─────────────────────────────────────┴──────────┴──────────────┴─────────────┘

Wide part:

Q1: Reading single key

┌─table_type──────────────────────────┬─time_sec─┬─memory_usage─┬─query───────────────────────────────┐
│ JSON no shared data                 │    0.004 │ 25.55 MiB    │ select json.key0                    │
│ JSON shared data "advanced"         │    0.007 │ 31.82 MiB    │ select json.key0                    │
│ JSON shared data "map_with_buckets" │    0.029 │ 434.77 MiB   │ select json.key0                    │
│ Map                                 │     0.29 │ 355.89 MiB   │ select json['key0']                 │
│ JSON shared data "map"              │    0.357 │ 6.68 GiB     │ select json.key0                    │
│ String                              │     0.38 │ 437.79 MiB   │ select JSONExtractInt(json, 'key0') │
└─────────────────────────────────────┴──────────┴──────────────┴─────────────────────────────────────┘

Q2: Reading non existing key

┌─table_type──────────────────────────┬─time_sec─┬─memory_usage─┬─query───────────────────────────────────────────┐
│ JSON no shared data                 │    0.003 │ 10.12 MiB    │ select json.non_existing_key                    │
│ JSON shared data "advanced"         │    0.003 │ 17.07 MiB    │ select json.non_existing_key                    │
│ JSON shared data "map_with_buckets" │     0.01 │ 122.53 MiB   │ select json.non_existing_key                    │
│ Map                                 │    0.298 │ 364.51 MiB   │ select json['non_existing_key']                 │
│ JSON shared data "map"              │    0.345 │ 6.65 GiB     │ select json.non_existing_key                    │
│ String                              │    0.388 │ 428.46 MiB   │ select JSONExtractInt(json, 'non_existing_key') │
└─────────────────────────────────────┴──────────┴──────────────┴─────────────────────────────────────────────────┘

Q3: Reading 5 different keys

┌─table_type──────────────────────────┬─time_sec─┬─memory_usage─┬─query───────────────────────────────────────────────────────────────────────┐
│ JSON no shared data                 │    0.016 │ 127.10 MiB   │ select json.key0, json.key1, json.key2, json.key3, json.key4                │
│ JSON shared data "advanced"         │    0.029 │ 295.10 MiB   │ select json.key0, json.key1, json.key2, json.key3, json.key4                │
│ JSON shared data "map_with_buckets" │    0.156 │ 1.60 GiB     │ select json.key0, json.key1, json.key2, json.key3, json.key4                │
│ Map                                 │    0.301 │ 361.95 MiB   │ select json['key0'], json['key1'], json['key2'], json['key3'], json['key4'] │
│ JSON shared data "map"              │    0.549 │ 6.66 GiB     │ select json.key0, json.key1, json.key2, json.key3, json.key4                │
│ String                              │    1.225 │ 424.91 MiB   │ select JSONExtractInt(json, 'key0'),                                       ↴│
│                                     │          │              │↳       JSONExtractInt(json, 'key1'),                                       ↴│
│                                     │          │              │↳       JSONExtractInt(json, 'key2'),                                       ↴│
│                                     │          │              │↳       JSONExtractInt(json, 'key3'),                                       ↴│
│                                     │          │              │↳       JSONExtractInt(json, 'key4')                                         │
└─────────────────────────────────────┴──────────┴──────────────┴─────────────────────────────────────────────────────────────────────────────┘

Q4: Reading the whole JSON

┌─table_type──────────────────────────┬─time_sec─┬─memory_usage─┬─query───────┐
│ String                              │    0.192 │ 508.18 MiB   │ select json │
│ Map                                 │    0.286 │ 370.25 MiB   │ select json │
│ JSON shared data "map"              │    0.314 │ 432.68 MiB   │ select json │
│ JSON no shared data                 │     0.32 │ 1.02 GiB     │ select json │
│ JSON shared data "advanced"         │    0.384 │ 397.92 MiB   │ select json │
│ JSON shared data "map_with_buckets" │    1.323 │ 1.07 GiB     │ select json │
└─────────────────────────────────────┴──────────┴──────────────┴─────────────┘

JSON with 1000 paths, 1,000,000 rows

Results

Compact part:

Q1: Reading single key

┌─table_type──────────────────────────┬─time_sec─┬─memory_usage─┬─query───────────────────────────────┐
│ JSON shared data "advanced"         │    0.056 │ 16.74 MiB    │ select json.key0                    │
│ JSON shared data "map_with_buckets" │    0.226 │ 126.73 MiB   │ select json.key0                    │
│ JSON no shared data                 │    0.259 │ 16.13 MiB    │ select json.key0                    │
│ JSON shared data "map"              │    1.366 │ 483.22 MiB   │ select json.key0                    │
│ String                              │    2.014 │ 9.70 GiB     │ select JSONExtractInt(json, 'key0') │
│ Map                                 │    2.236 │ 9.47 GiB     │ select json['key0']                 │
└─────────────────────────────────────┴──────────┴──────────────┴─────────────────────────────────────┘

Q2: Reading non existing key

┌─table_type──────────────────────────┬─time_sec─┬─memory_usage─┬─query───────────────────────────────────────────┐
│ JSON shared data "advanced"         │    0.074 │ 9.55 MiB     │ select json.non_existing_key                    │
│ JSON shared data "map_with_buckets" │    0.233 │ 8.61 MiB     │ select json.non_existing_key                    │
│ JSON no shared data                 │    0.259 │ 8.74 MiB     │ select json.non_existing_key                    │
│ JSON shared data "map"              │    1.364 │ 491.05 MiB   │ select json.non_existing_key                    │
│ String                              │     2.11 │ 9.70 GiB     │ select JSONExtractInt(json, 'non_existing_key') │
│ Map                                 │    2.371 │ 9.89 GiB     │ select json['non_existing_key']                 │
└─────────────────────────────────────┴──────────┴──────────────┴─────────────────────────────────────────────────┘

Q3: Reading 5 different keys

┌─table_type──────────────────────────┬─time_sec─┬─memory_usage─┬─query───────────────────────────────────────────────────────────────────────┐
│ JSON shared data "advanced"         │    0.092 │ 122.48 MiB   │ select json.key0, json.key1, json.key2, json.key3, json.key4                │
│ JSON no shared data                 │    0.273 │ 115.47 MiB   │ select json.key0, json.key1, json.key2, json.key3, json.key4                │
│ JSON shared data "map_with_buckets" │    0.727 │ 242.62 MiB   │ select json.key0, json.key1, json.key2, json.key3, json.key4                │
│ JSON shared data "map"              │    1.548 │ 509.09 MiB   │ select json.key0, json.key1, json.key2, json.key3, json.key4                │
│ Map                                 │    2.255 │ 9.76 GiB     │ select json['key0'], json['key1'], json['key2'], json['key3'], json['key4'] │
│ String                              │    5.755 │ 11.28 GiB    │ select JSONExtractInt(json, 'key0'),                                       ↴│
│                                     │          │              │↳       JSONExtractInt(json, 'key1'),                                       ↴│
│                                     │          │              │↳       JSONExtractInt(json, 'key2'),                                       ↴│
│                                     │          │              │↳       JSONExtractInt(json, 'key3'),                                       ↴│
│                                     │          │              │↳       JSONExtractInt(json, 'key4')                                         │
└─────────────────────────────────────┴──────────┴──────────────┴─────────────────────────────────────────────────────────────────────────────┘

Q4: Reading the whole JSON

┌─table_type──────────────────────────┬─time_sec─┬─memory_usage─┬─query───────┐
│ String                              │    1.263 │ 8.90 GiB     │ select json │
│ JSON no shared data                 │    1.974 │ 8.46 GiB     │ select json │
│ Map                                 │    2.328 │ 10.20 GiB    │ select json │
│ JSON shared data "map"              │     2.45 │ 9.84 GiB     │ select json │
│ JSON shared data "advanced"         │    2.497 │ 9.74 GiB     │ select json │
│ JSON shared data "map_with_buckets" │    6.063 │ 10.77 GiB    │ select json │
└─────────────────────────────────────┴──────────┴──────────────┴─────────────┘

Wide part:

Q1: Reading single key

┌─table_type──────────────────────────┬─time_sec─┬─memory_usage─┬─query───────────────────────────────┐
│ JSON no shared data                 │    0.006 │ 9.05 MiB     │ select json.key0                    │
│ JSON shared data "advanced"         │    0.013 │ 18.20 MiB    │ select json.key0                    │
│ JSON shared data "map_with_buckets" │    0.054 │ 450.86 MiB   │ select json.key0                    │
│ Map                                 │    1.363 │ 507.60 MiB   │ select json['key0']                 │
│ JSON shared data "map"              │    1.508 │ 8.98 GiB     │ select json.key0                    │
│ String                              │    1.588 │ 630.03 MiB   │ select JSONExtractInt(json, 'key0') │
└─────────────────────────────────────┴──────────┴──────────────┴─────────────────────────────────────┘

Q2: Reading non existing key

┌─table_type──────────────────────────┬─time_sec─┬─memory_usage─┬─query───────────────────────────────────────────┐
│ JSON no shared data                 │    0.005 │ 4.65 MiB     │ select json.non_existing_key                    │
│ JSON shared data "advanced"         │    0.005 │ 11.13 MiB    │ select json.non_existing_key                    │
│ JSON shared data "map_with_buckets" │    0.038 │ 273.85 MiB   │ select json.non_existing_key                    │
│ Map                                 │     1.43 │ 530.90 MiB   │ select json['non_existing_key']                 │
│ JSON shared data "map"              │    1.489 │ 9.19 GiB     │ select json.non_existing_key                    │
│ String                              │    1.617 │ 615.93 MiB   │ select JSONExtractInt(json, 'non_existing_key') │
└─────────────────────────────────────┴──────────┴──────────────┴─────────────────────────────────────────────────┘

Q3: Reading 5 different keys

┌─table_type──────────────────────────┬─time_sec─┬─memory_usage─┬─query───────────────────────────────────────────────────────────────────────┐
│ JSON no shared data                 │    0.018 │ 48.25 MiB    │ select json.key0, json.key1, json.key2, json.key3, json.key4                │
│ JSON shared data "advanced"         │    0.056 │ 139.46 MiB   │ select json.key0, json.key1, json.key2, json.key3, json.key4                │
│ JSON shared data "map_with_buckets" │    0.271 │ 1.68 GiB     │ select json.key0, json.key1, json.key2, json.key3, json.key4                │
│ Map                                 │    1.452 │ 514.23 MiB   │ select json['key0'], json['key1'], json['key2'], json['key3'], json['key4'] │
│ JSON shared data "map"              │    1.722 │ 18.53 GiB    │ select json.key0, json.key1, json.key2, json.key3, json.key4                │
│ String                              │    5.044 │ 567.16 MiB   │ select JSONExtractInt(json, 'key0'),                                       ↴│
│                                     │          │              │↳       JSONExtractInt(json, 'key1'),                                       ↴│
│                                     │          │              │↳       JSONExtractInt(json, 'key2'),                                       ↴│
│                                     │          │              │↳       JSONExtractInt(json, 'key3'),                                       ↴│
│                                     │          │              │↳       JSONExtractInt(json, 'key4')                                         │
└─────────────────────────────────────┴──────────┴──────────────┴─────────────────────────────────────────────────────────────────────────────┘

Q4: Reading the whole JSON

┌─table_type──────────────────────────┬─time_sec─┬─memory_usage─┬─query───────┐
│ String                              │    1.206 │ 629.30 MiB   │ select json │
│ Map                                 │    1.466 │ 559.75 MiB   │ select json │
│ JSON shared data "map"              │    1.751 │ 645.92 MiB   │ select json │
│ JSON shared data "advanced"         │    2.096 │ 668.15 MiB   │ select json │
│ JSON no shared data                 │    5.343 │ 8.52 GiB     │ select json │
│ JSON shared data "map_with_buckets" │    6.222 │ 1.32 GiB     │ select json │
└─────────────────────────────────────┴──────────┴──────────────┴─────────────┘

JSON with 10000 paths, 200,000 rows:

Results

Compact part:

Q1: Reading single key

┌─table_type──────────────────────────┬─time_sec─┬─memory_usage─┬─query───────────────────────────────┐
│ JSON shared data "advanced"         │    0.242 │ 4.63 MiB     │ select json.key0                    │
│ JSON shared data "map_with_buckets" │    0.464 │ 4.92 MiB     │ select json.key0                    │
│ JSON shared data "map"              │    3.577 │ 483.57 MiB   │ select json.key0                    │
│ Map                                 │    4.391 │ 7.83 GiB     │ select json['key0']                 │
│ JSON no shared data                 │    4.404 │ 128.02 MiB   │ select json.key0                    │
│ String                              │    5.194 │ 9.77 GiB     │ select JSONExtractInt(json, 'key0') │
└─────────────────────────────────────┴──────────┴──────────────┴─────────────────────────────────────┘

Q2: Reading non existing key

┌─table_type──────────────────────────┬─time_sec─┬─memory_usage─┬─query───────────────────────────────────────────┐
│ JSON shared data "advanced"         │    0.314 │ 3.68 MiB     │ select json.non_existing_key                    │
│ JSON shared data "map_with_buckets" │    0.552 │ 121.90 MiB   │ select json.non_existing_key                    │
│ JSON shared data "map"              │    3.417 │ 516.85 MiB   │ select json.non_existing_key                    │
│ String                              │    3.778 │ 9.53 GiB     │ select JSONExtractInt(json, 'non_existing_key') │
│ JSON no shared data                 │    4.389 │ 128.02 MiB   │ select json.non_existing_key                    │
│ Map                                 │    4.511 │ 7.88 GiB     │ select json['non_existing_key']                 │
└─────────────────────────────────────┴──────────┴──────────────┴─────────────────────────────────────────────────┘

Q3: Reading 5 different keys

┌─table_type──────────────────────────┬─time_sec─┬─memory_usage─┬─query───────────────────────────────────────────────────────────────────────┐
│ JSON shared data "advanced"         │    0.605 │ 21.11 MiB    │ select json.key0, json.key1, json.key2, json.key3, json.key4                │
│ JSON shared data "map_with_buckets" │    1.325 │ 227.89 MiB   │ select json.key0, json.key1, json.key2, json.key3, json.key4                │
│ JSON shared data "map"              │    3.403 │ 502.56 MiB   │ select json.key0, json.key1, json.key2, json.key3, json.key4                │
│ Map                                 │    4.344 │ 7.76 GiB     │ select json['key0'], json['key1'], json['key2'], json['key3'], json['key4'] │
│ JSON no shared data                 │    4.437 │ 127.24 MiB   │ select json.key0, json.key1, json.key2, json.key3, json.key4                │
│ String                              │   11.144 │ 11.06 GiB    │ select JSONExtractInt(json, 'key0'),                                       ↴│
│                                     │          │              │↳       JSONExtractInt(json, 'key1'),                                       ↴│
│                                     │          │              │↳       JSONExtractInt(json, 'key2'),                                       ↴│
│                                     │          │              │↳       JSONExtractInt(json, 'key3'),                                       ↴│
│                                     │          │              │↳       JSONExtractInt(json, 'key4')                                         │
└─────────────────────────────────────┴──────────┴──────────────┴─────────────────────────────────────────────────────────────────────────────┘

Q4: Reading the whole JSON

┌─table_type──────────────────────────┬─time_sec─┬─memory_usage─┬─query───────┐
│ String                              │    2.015 │ 9.81 GiB     │ select json │
│ Map                                 │    4.279 │ 8.09 GiB     │ select json │
│ JSON shared data "advanced"         │    5.595 │ 10.52 GiB    │ select json │
│ JSON shared data "map"              │    5.643 │ 10.92 GiB    │ select json │
│ JSON shared data "map_with_buckets" │    15.55 │ 14.59 GiB    │ select json │
│ JSON no shared data                 │   23.845 │ 13.65 GiB    │ select json │
└─────────────────────────────────────┴──────────┴──────────────┴─────────────┘

Wide part:

Q1: Reading single key

┌─table_type──────────────────────────┬─time_sec─┬─memory_usage─┬─query───────────────────────────────┐
│ JSON no shared data                 │    0.027 │ 18.64 MiB    │ select json.key0                    │
│ JSON shared data "advanced"         │    0.063 │ 3.89 MiB     │ select json.key0                    │
│ JSON shared data "map_with_buckets" │    0.087 │ 403.55 MiB   │ select json.key0                    │
│ String                              │    3.216 │ 582.70 MiB   │ select JSONExtractInt(json, 'key0') │
│ Map                                 │    3.594 │ 538.37 MiB   │ select json['key0']                 │
│ JSON shared data "map"              │     3.63 │ 12.53 GiB    │ select json.key0                    │
└─────────────────────────────────────┴──────────┴──────────────┴─────────────────────────────────────┘

Q2: Reading non existing key

┌─table_type──────────────────────────┬─time_sec─┬─memory_usage─┬─query───────────────────────────────────────────┐
│ JSON shared data "advanced"         │     0.01 │ 3.36 MiB     │ select json.non_existing_key                    │
│ JSON no shared data                 │    0.027 │ 19.38 MiB    │ select json.non_existing_key                    │
│ JSON shared data "map_with_buckets" │    0.081 │ 378.06 MiB   │ select json.non_existing_key                    │
│ String                              │    3.388 │ 578.70 MiB   │ select JSONExtractInt(json, 'non_existing_key') │
│ JSON shared data "map"              │    3.624 │ 12.00 GiB    │ select json.non_existing_key                    │
│ Map                                 │    3.821 │ 547.12 MiB   │ select json['non_existing_key']                 │
└─────────────────────────────────────┴──────────┴──────────────┴─────────────────────────────────────────────────┘

Q3: Reading 5 different keys

┌─table_type──────────────────────────┬─time_sec─┬─memory_usage─┬─query───────────────────────────────────────────────────────────────────────┐
│ JSON no shared data                 │      0.1 │ 92.35 MiB    │ select json.key0, json.key1, json.key2, json.key3, json.key4                │
│ JSON shared data "advanced"         │     0.27 │ 123.30 MiB   │ select json.key0, json.key1, json.key2, json.key3, json.key4                │
│ JSON shared data "map_with_buckets" │    0.453 │ 1.89 GiB     │ select json.key0, json.key1, json.key2, json.key3, json.key4                │
│ Map                                 │    3.665 │ 536.60 MiB   │ select json['key0'], json['key1'], json['key2'], json['key3'], json['key4'] │
│ JSON shared data "map"              │    3.828 │ 22.54 GiB    │ select json.key0, json.key1, json.key2, json.key3, json.key4                │
│ String                              │    9.911 │ 561.36 MiB   │ select JSONExtractInt(json, 'key0'),                                       ↴│
│                                     │          │              │↳       JSONExtractInt(json, 'key1'),                                       ↴│
│                                     │          │              │↳       JSONExtractInt(json, 'key2'),                                       ↴│
│                                     │          │              │↳       JSONExtractInt(json, 'key3'),                                       ↴│
│                                     │          │              │↳       JSONExtractInt(json, 'key4')                                         │
└─────────────────────────────────────┴──────────┴──────────────┴─────────────────────────────────────────────────────────────────────────────┘

Q4: Reading the whole JSON

┌─table_type──────────────────────────┬─time_sec─┬─memory_usage─┬─query───────┐
│ String                              │    2.734 │ 581.15 MiB   │ select json │
│ Map                                 │    3.869 │ 550.14 MiB   │ select json │
│ JSON shared data "map"              │    4.182 │ 618.78 MiB   │ select json │
│ JSON shared data "advanced"         │    4.774 │ 683.27 MiB   │ select json │
│ JSON shared data "map_with_buckets" │   15.271 │ 1.45 GiB     │ select json │
│ JSON no shared data                 │        - │ OOM          │ select json │
└─────────────────────────────────────┴──────────┴──────────────┴─────────────┘

@UnamedRus
Copy link
Copy Markdown
Contributor

Use case, when all paths are read is also important. (And i assume, String still will be winner here)
May be, it make sense to benchmark it as well?

@Avogar
Copy link
Copy Markdown
Member Author

Avogar commented Jul 15, 2025

May be, it make sense to benchmark it as well?

Done

@Avogar
Copy link
Copy Markdown
Member Author

Avogar commented Jul 15, 2025

~ Half of the added lines is tests

@Avogar Avogar force-pushed the json-shared-data-v2 branch from 5f48060 to 0dfc6b9 Compare July 15, 2025 22:42
@Avogar
Copy link
Copy Markdown
Member Author

Avogar commented Jul 17, 2025

Screenshot 2025-07-17 at 15 40 09 Screenshot 2025-07-17 at 15 38 04

@Avogar Avogar added this pull request to the merge queue Aug 21, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to a conflict with the base branch Aug 21, 2025
@Avogar
Copy link
Copy Markdown
Member Author

Avogar commented Aug 25, 2025

Stateless tests (amd_asan, distributed plan, sequential) - #85781
Stress test (arm_asan) - #84669
Stress test (amd_msan) - #81144
Install packages (amd_debug) - Error DB::Exception: Not enough space for clickhouse binary in /usr/bin, required 5.68 GiB, available 4.24 GiB. (NOT_ENOUGH_SPACE), I don't think it's caused by this PR

@Avogar Avogar enabled auto-merge August 25, 2025 00:07
@Avogar Avogar added this pull request to the merge queue Aug 25, 2025
Merged via the queue into ClickHouse:master with commit e6e5d88 Aug 25, 2025
117 of 122 checks passed
@Avogar Avogar deleted the json-shared-data-v2 branch August 25, 2025 00:37
@robot-clickhouse-ci-1 robot-clickhouse-ci-1 added the pr-synced-to-cloud The PR is synced to the cloud repo label Aug 25, 2025
@piyushsingariya
Copy link
Copy Markdown

hi @Avogar would like to test this improvement, can you share a relevant docker tag for this? Thanks!

@Avogar
Copy link
Copy Markdown
Member Author

Avogar commented Aug 25, 2025

hi @Avogar would like to test this improvement, can you share a relevant docker tag for this? Thanks!

It's not included in any release yet, so right now it's visible only in master. To test master I think you can use tag head. Or you can download the binary from master using curl https://clickhouse.com/ | sh.

The closest release is 25.8 and it will include this change

@xzx0xzx
Copy link
Copy Markdown

xzx0xzx commented Oct 10, 2025

@Avogar Hi, When should we use map_with_buckets or advanced?
If we don’t take storage space into consideration, should we always use advanced?

@Avogar
Copy link
Copy Markdown
Member Author

Avogar commented Oct 10, 2025

@Avogar Hi, When should we use map_with_buckets or advanced?
If we don’t take storage space into consideration, should we always use advanced?

For object_shared_data_serialization_version it's better to always use advanced, yes. For object_shared_data_serialization_version_for_zero_level_parts better to use map or map_with_buckets. Using advanced for zero level parts makes insertion very slow, using map_with_buckets makes insertion slower compared to map, but not as slow as advanced, so if you need good performance for zero level parts as well and ready to sacrifice some insertion performance, you can use map_with_buckets for them.

Also, as for all new features, make sure you test it out on your use case before using it on production data to identify any possible problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-performance Pull request with some performance improvements pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants