Subcolumn storage for Map #78373

@EmeraldShift

Description

Company or project name

No response

Use case

Improved storage compression and query performance for map-structured data

Describe the solution you'd like

It would be nice for the underlying storage of, e.g., Map(LowCardinality(String), LowCardinality(String)) to be split into multiple subcolumns, where each subcolumn contains only the values corresponding to a single key. This can lead to much better compression when there are long runs of identical values (for example, common key-value labels attached to observability data). A reasonable limit should be in place to prevent an explosion of columns on disk.
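A minimal sketch of the split described above, in plain Python rather than ClickHouse internals. The cap on dedicated subcolumns and the overflow fallback are illustrative assumptions, not a description of how ClickHouse would implement it:

```python
def split_into_subcolumns(rows, max_subcolumns=2):
    """Split map-typed rows into per-key value streams, with a cap.

    Keys beyond the cap stay in a shared overflow map, so the number of
    on-disk columns stays bounded (an assumption for illustration).
    """
    # Choose which keys get dedicated subcolumns (here: first-seen order).
    chosen = []
    for row in rows:
        for k in row:
            if k not in chosen and len(chosen) < max_subcolumns:
                chosen.append(k)
    # One value stream per chosen key; None marks "key absent in this row".
    subcolumns = {k: [row.get(k) for row in rows] for k in chosen}
    # Remaining keys fall back to a shared map, as in a plain Map column.
    overflow = [{k: v for k, v in row.items() if k not in chosen} for row in rows]
    return subcolumns, overflow

rows = [
    {"env": "prod", "region": "us-east-1"},
    {"env": "prod", "region": "us-east-1", "team": "obs"},
    {"env": "dev"},
]
subcolumns, overflow = split_into_subcolumns(rows)
print(subcolumns)  # {'env': ['prod', 'prod', 'dev'], 'region': ['us-east-1', 'us-east-1', None]}
print(overflow)    # [{}, {'team': 'obs'}, {}]
```

Each per-key stream (e.g. the 'env' stream above) is exactly the kind of long run of identical values that compresses well.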

Describe alternatives you've considered

There are plenty of alternatives, here are some thoughts on each of them:

  • Plain Map
    • As I understand it, the current storage for a map splats out each key and value consecutively in a single file, like (key1, value1, key2, value2, ...). This is okay for compression, but less performant when aggregating on a single key (like my_map['my-key']) because the m[k] syntax scans the whole map, so lookup is O(n) in the size of the map.
  • Parallel arrays
    • Very similar in concept: you can store two arrays, my_keys and my_values, which always have the same length. In my experience this compresses better than a map, because the keys tend to change less frequently and you get more long runs of identical values in both arrays. But the syntax is much worse, something like my_values[indexOf(my_keys, 'key')]. It's a lot less intuitive for users, and I suspect it still requires reading both entire arrays (now coordinating across two files).
  • JSON type
    • This type has the ideal storage: great compression of each key's values, while accommodating a dynamic number of keys. In one sense, this feature request is asking to decouple the special subcolumn storage from the JSON front-end, allowing the optimal storage to be used for more types. For example, a Map can be stricter than general JSON by allowing only a fixed level of nesting and constraining the value types. Maybe it's possible to combine the flexibility of Maps (especially with the -Map aggregation combinator) with the good storage of JSON, without special-casing the JSON format. Practically, I can probably just use JSON while being mindful of parsing edge cases like Output of toJSONString is not recognized as valid JSON #75017 and waiting for it to be possible to set the field type for all fields (to avoid Dynamic and Variant indirection).
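The access-cost difference behind the first two alternatives can be sketched as follows (illustrative Python, not ClickHouse code): interleaved (key, value) pairs must be scanned row by row, while a per-key subcolumn is read directly.

```python
def map_lookup(pairs, key):
    # Plain Map / parallel-array style: O(n) scan over the row's pairs.
    for k, v in pairs:
        if k == key:
            return v
    return None

# Two rows stored as interleaved (key, value) pairs.
rows_as_pairs = [
    [("env", "prod"), ("region", "us-east-1")],
    [("env", "dev"), ("region", "eu-west-1")],
]

# Subcolumn style: the values for one key already live in their own stream,
# so aggregating on my_map['env'] never touches other keys' data.
env_subcolumn = ["prod", "dev"]

scanned = [map_lookup(pairs, "env") for pairs in rows_as_pairs]
print(scanned)        # ['prod', 'dev']
print(env_subcolumn)  # ['prod', 'dev'], same result without any per-row scan
```

The per-row scan is cheap for small maps, but for wide observability label sets the subcolumn read skips all unrelated keys and values entirely.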

Additional context

No response
