-
Notifications
You must be signed in to change notification settings - Fork 8.3k
Automatic LowCardinality representation of columns #69916
Description
(I hope this was not opened already elsewhere.)
Logical LowCardinality (dictionary) compression must currently be enabled explicitly for a column when the table is created. This is too inconvenient. Instead, we should apply LowCardinality compression more aggressively.
The simplest way is to add LowCardinality implicitly during CREATE TABLE, perhaps combined with NONE as a codec (to avoid compressing data twice). Logical and physical compression have different space - performance trade-offs, most analytics databases these days prefer dictionary compression (also see #69362). But that is ideally configurable.
Another way is more subtle and similar to how we apply sparse compression to columns. We would check the distinctness of the data when a part is written, and apply dictionary compression on-the-fly if it is below a configurable threshold. This was mentioned on the Roadmap 2024 as "Uniform treatment of LowCardinality, Sparse, and Const columns" and discussed in detail in this issue.