Skip to content

New form of default value keyword,PROMOTED, for handling materialization of map values into dedicated columns #80434

@tanner-bruce

Description

@tanner-bruce

Company or project name

Tracing and logging data, anything making use of maps

Use case

For tracing/logging data, using maps is a very simple method to store relatively unstructured, user supplied data. There are multiple downfalls with maps however: the linear scan, and requirement of decompressing the entire map to access it.

A common approach to speeding up queries is to promote the column, aka create another column for that attribute, and then use MATERIALIZED string_map['my_attribute'] to populate it. This keeps insertion queries simple, and allows ease of use at query time.

The downside with this approach is when a user supplies one or more very large attribute, or there are a large number of materialized columns. One common example we see is the graphql.document, or db.statement. These are generally useful attributes, but they can be quite large. Materializing them helps with access time, but it requires storing them twice. For very large datasets this is a non-trivial cost.

The current method of working around this is to instead write directly to a new column, and no longer include it in the map column. This comes at a query complexity cost, which is when querying the data prior to making this change, it will be in the map, and afterwards it will be in the new column.

Describe the solution you'd like

Add a new form of default value, I'm calling it PROMOTED as a placeholder, that works similarly to MATERIALZED, but is specific to maps. When used, it will work the same as MATERIALIZED, but will also remove the original value from the map. Ideally, this could be done transparently as well, e.g a query could still run my_map['my_attribute'], and the query planner would be aware of the PROMOTED column, and instead access that column directly and not access the map.

OPTIMIZE statements could perform this for historic data. I'm not sure how it would work for parts created prior to the existence of the column at query time. Could they contain metadata around which columns have been promoted?

Describe alternatives you've considered

No response

Additional context

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions