Subcolumn storage for Map #78373
Company or project name
No response
Use case
Improved storage compression and query performance for map-structured data
Describe the solution you'd like
It would be nice for the underlying storage of, e.g., `Map(LowCardinality(String), LowCardinality(String))` to be split into multiple subcolumns, where each subcolumn contains only the values corresponding to a single key. This can lead to much better compression when there are long runs of identical values (for example, common key-value labels attached to observability data). A reasonable limit should be in place to prevent an explosion of columns on disk.
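As a concrete illustration (table and column names below are hypothetical), a label map on observability data and a single-key aggregation might look like this:

```sql
-- Hypothetical table: observability events carrying common key-value labels.
CREATE TABLE events
(
    ts     DateTime,
    labels Map(LowCardinality(String), LowCardinality(String))
)
ENGINE = MergeTree
ORDER BY ts;

-- Aggregating on a single key. With per-key subcolumn storage, this query
-- could read only the subcolumn holding the 'service' values, instead of
-- decoding every (key, value) pair of every row's map.
SELECT labels['service'] AS service, count()
FROM events
GROUP BY service;
```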
Describe alternatives you've considered
There are plenty of alternatives; here are some thoughts on each of them:
- Plain Map
  - As I understand it, the storage for a map right now is to splat out each key and value consecutively in a single file, like `(key1, value1, key2, value2, ...)`. This is okay for compression, but less performant when aggregating on a single key (like `my_map['my-key']`), because the `m[k]` syntax scans the whole map and is therefore O(n) in the size of the map.
- Parallel arrays
  - Very similar in concept: you can store two arrays, `my_keys` and `my_values`, which always have the same length. This compresses better than a map in my experience, because the keys tend to change less frequently and you get more long runs of identical values in both arrays. But the syntax is much worse, something like `my_values[indexOf(my_keys, 'key')]`. It's a lot less intuitive for users, and I suspect it still requires reading the entire arrays (now coordinated across two files).
- JSON type
  - This type has the ideal storage: great compression of each key's values while accommodating a dynamic number of keys. In one sense, this feature request is asking to decouple the special subcolumn storage from the JSON front end, allowing the optimal storage to be used for more types. For example, a Map can be stricter than normal JSON by allowing only a fixed level of nesting and constraining the value types. Maybe it's possible to combine the flexibility of Maps (especially with the -Map aggregation combinator) with the good storage of JSON, without special-casing the JSON format. Practically, I can probably just use JSON while being mindful of parsing edge cases like Output of toJSONString is not recognized as valid JSON #75017, and waiting for it to be possible to set the field type for all fields (to avoid Dynamic and Variant indirection).
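For comparison, here is a sketch of what a single-key lookup looks like under each alternative. The table and column names are hypothetical: an `events` table with a `labels` map, `label_keys`/`label_values` parallel arrays, and a `labels_json` JSON column.

```sql
-- Plain Map: m[k] scans the whole map, O(n) in the number of entries.
SELECT labels['service'] FROM events;

-- Parallel arrays: similar scan cost, now spread across two files,
-- and the syntax is much less intuitive.
SELECT label_values[indexOf(label_keys, 'service')] FROM events;

-- JSON type: each path is stored as its own subcolumn, so only the
-- 'service' values need to be read from disk.
SELECT labels_json.service FROM events;
```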
Additional context
No response