Significantly improve performance of JSON subcolumns reading from shared data in MergeTree#83777
Conversation
|
Workflow [PR], commit [e36c1ea] Summary: ❌
|
Performance comparison of reading JSON paths sub-columns on local filesystemIn this comparison we have simple JSONs
For JSON with 10 paths, 5,000,000 rowsResultsCompact part:Q1: Reading single key Q2: Reading non existing key Q3: Reading 5 different keys Q4: Reading the whole JSON Wide part:Q1: Reading single key Q2: Reading non existing key Q3: Reading 5 different keys Q4: Reading the whole JSON JSON with 100 paths, 2,000,000 rowsResultsCompact part:Q1: Reading single key Q2: Reading non existing key Q3: Reading 5 different keys Q4: Reading the whole JSON Wide part:Q1: Reading single key Q2: Reading non existing key Q3: Reading 5 different keys Q4: Reading the whole JSON JSON with 1000 paths, 1,000,000 rowsResultsCompact part:Q1: Reading single key Q2: Reading non existing key Q3: Reading 5 different keys Q4: Reading the whole JSON Wide part:Q1: Reading single key Q2: Reading non existing key Q3: Reading 5 different keys Q4: Reading the whole JSON JSON with 10000 paths, 200,000 rows:ResultsCompact part:Q1: Reading single key Q2: Reading non existing key Q3: Reading 5 different keys Q4: Reading the whole JSON Wide part:Q1: Reading single key Q2: Reading non existing key Q3: Reading 5 different keys Q4: Reading the whole JSON |
|
Use case, when all paths are read is also important. (And i assume, String still will be winner here) |
Done |
|
~ Half of the added lines is tests |
…to json-shared-data-v2
5f48060 to
0dfc6b9
Compare
|
Stateless tests (amd_asan, distributed plan, sequential) - #85781 |
e6e5d88
|
hi @Avogar would like to test this improvement, can you share a relevant docker tag for this? Thanks! |
It's not included in any release yet, so right now it's visible only in master. To test master I think you can use tag The closest release is 25.8 and it will include this change |
|
@Avogar Hi, When should we use map_with_buckets or advanced? |
For Also, as for all new features, make sure you test it out on your use case before using it on production data to identify any possible problems. |


Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Significantly improve performance of
JSONsubcolumns reading from shared data in MergeTree by implementing new serializations forJSONshared data in MergeTree.Documentation entry for user-facing changes
Now there are 3 possible
JSONshared data serializations in MergeTree:map- serializes shared data as a single column with typeMap(String, String)(old serialization). This serialization is efficient for writing data and reading the wholeJSONcolumn, but it's not efficient for reading paths sub-columns.map_with_buckets- serializes shared data asNcolumns ("buckets") with typeMap(String, String)where each bucket contains only subset of paths. This serialization is less efficient for writing data and reading the wholeJSONcolumn, but it's more efficient for reading paths sub-columns because it reads data only from required buckets.advanced- serializes shared data in a special data structure that maximizes the performanceof paths sub-columns reading by storing some additional information that allows to read only the data of requested paths. This serialization also supports buckets, so each bucket contains only subset of paths. This serialization is quite inefficient for writing data (so it's not recommended to use this serialization for zero-level parts), reading the whole
JSONcolumn is slightly less efficient compared tomapserialization, but it's very efficient for reading paths sub-columns.Type of serialization is controlled by MergeTree
settings
object_shared_data_serialization_versionand
object_shared_data_serialization_version_for_zero_level_parts(zero level part is the part created during inserting data into the table, during merges parts have higher level). Default serialization will bemapfor next few releases, but after that it will be changed toadvanced(for zero-level parts it will bemapormap_with_buckets). Note: changing shared data serialization version is supported only for object serialization versionv3(settingobject_serialization_version) which is currentlyv2by default (but will be changed after a few releases tov3).Number of buckets is controlled by MergeTree settings
object_shared_data_buckets_for_compact_part(8 by default)and
object_shared_data_buckets_for_wide_part(32 by default).See the performance comparison of JSON paths subcolumns below in the comments