size-separated String serialization for MergeTree #82850
Avogar merged 41 commits into ClickHouse:master
Workflow [PR], commit [01e4bfe] Summary: ❌
How wasteful would it be to store each string's size twice, in the original column and in |
It's probably not a good design direction. In that case we should manually store a string length column instead.
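To make the trade-off in this exchange concrete, here is a minimal Python sketch of the two layouts: a single stream with inline length prefixes, and a size-separated layout where lengths live in their own stream. It is only an illustration and not the code from this PR. The function names are hypothetical, and the 8-byte little-endian encoding of the size stream is an assumption chosen for simplicity.

```python
import struct


def encode_varuint(n: int) -> bytes:
    """Encode a non-negative integer as a LEB128-style variable-length integer."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def serialize_inline(values: list[bytes]) -> bytes:
    """Single stream: every value is a length prefix followed by its bytes."""
    return b"".join(encode_varuint(len(v)) + v for v in values)


def serialize_size_separated(values: list[bytes]) -> tuple[bytes, bytes]:
    """Two streams: all sizes first, then all string bytes back to back.

    Assumption: sizes are fixed-width 8-byte little-endian integers.
    """
    sizes = b"".join(struct.pack("<Q", len(v)) for v in values)
    data = b"".join(values)
    return sizes, data


def lengths_from_sizes(sizes: bytes) -> list[int]:
    """Read string lengths from the size stream alone, without touching the data."""
    return [n for (n,) in struct.iter_unpack("<Q", sizes)]


def deserialize_size_separated(sizes: bytes, data: bytes) -> list[bytes]:
    """Rebuild the original values by slicing the data stream with the sizes."""
    out, pos = [], 0
    for n in lengths_from_sizes(sizes):
        out.append(data[pos:pos + n])
        pos += n
    return out


values = [b"", b"ab", b"hello"]
inline = serialize_inline(values)
sizes, data = serialize_size_separated(values)
```

In this sketch the size-separated layout stores each length exactly once, and a length-only query needs only the `sizes` stream. Keeping the inline prefixes and adding a size stream on top would store every length twice, which is the overhead the question above asks about.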
test_storage_delta/test.py::test_replicated_database_and_unavailable_s3[1]
Avogar left a comment:
Amazing work! Just 2 final small comments and ready to be merged
Co-authored-by: Pavel Kruglov <[email protected]>
Integration tests (arm_binary, distributed plan, 1/4) - #87995
59d18f4



Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Add optional .size subcolumn serialization for top-level String columns in MergeTree tables to improve compression and enable efficient subcolumn access. Introduce new MergeTree settings for serialization version control and expression optimization for empty strings.

Documentation entry for user-facing changes

New MergeTree settings:
- serialization_info_version – Controls serialization info format when writing serialization.json. Required for cluster upgrades.
  - DEFAULT – Legacy format, compatible with old servers during rolling upgrades.
  - WITH_TYPES – New format with types_serialization_versions, enabling per-type serialization settings like string_serialization_version. Switch to this after upgrades.
- string_serialization_version – Controls top-level String column serialization (effective only when serialization_info_version = WITH_TYPES).
  - DEFAULT – Standard inline size format.
  - WITH_SIZE_STREAM – Serialize top-level String columns with separate .size stream for better compression. Backward incompatible.

Subcolumn support:
- .size subcolumn across both legacy and new String formats, supporting mixed-format queries.

Expression optimizations:
- optimize_empty_string_comparisons rewrites str = '' into isEmpty(str)/isNotEmpty(str).
- FunctionToSubcolumnsPass extended to rewrite length(str) as str.size.

Sparse encoding enhancements: