-
Notifications
You must be signed in to change notification settings - Fork 8.3k
New form of default value keyword,PROMOTED, for handling materialization of map values into dedicated columns #80434
Description
Company or project name
Tracing and logging data, anything making use of maps
Use case
For tracing/logging data, using maps is a very simple method to store relatively unstructured, user supplied data. There are multiple downfalls with maps however: the linear scan, and requirement of decompressing the entire map to access it.
A common approach to speeding up queries is to promote the column, aka create another column for that attribute, and then use MATERIALIZED string_map['my_attribute'] to populate it. This keeps insertion queries simple, and allows ease of use at query time.
The downside with this approach is when a user supplies one or more very large attribute, or there are a large number of materialized columns. One common example we see is the graphql.document, or db.statement. These are generally useful attributes, but they can be quite large. Materializing them helps with access time, but it requires storing them twice. For very large datasets this is a non-trivial cost.
The current method of working around this is to instead write directly to a new column, and no longer include it in the map column. This comes at a query complexity cost, which is when querying the data prior to making this change, it will be in the map, and afterwards it will be in the new column.
Describe the solution you'd like
Add a new form of default value, I'm calling it PROMOTED as a placeholder, that works similarly to MATERIALZED, but is specific to maps. When used, it will work the same as MATERIALIZED, but will also remove the original value from the map. Ideally, this could be done transparently as well, e.g a query could still run my_map['my_attribute'], and the query planner would be aware of the PROMOTED column, and instead access that column directly and not access the map.
OPTIMIZE statements could perform this for historic data. I'm not sure how it would work for parts created prior to the existence of the column at query time. Could they contain metadata around which columns have been promoted?
Describe alternatives you've considered
No response
Additional context
No response