Skip to content

Feature: best effort sorting to optimize compression #4413

@arctica

Description

@arctica

Use case:

A MergeTree table with free modifications of colums (add/drop) can't make good use of a ORDER BY sorting expression in order to sort rows so that compression can work well. For example dropping column2 from a table which has an ORDER BY expression of (column1, column2, column3) does not seem possible currently. One can only drop the last column of the table. This makes sense as dropping a column from the middle of the table would result in data in the columns after the dropped one being not properly sorted.
If the ORDER BY expression does not contain the columns, then dropping them works.

This would be an optimization for tables which collect metrics by dimensions which might change every now and then.

Proposed solution:

A setting that would allow the MergeTree table engine to sort rows by an implicit expression that contains all columns which are not already in the sorting expression but only on a best effort basis. The sorting would be performed during insertion of data and merging. E.g. if a table has columns column1 ... column4 with explicit strong sorting expression (column1, column2), then this setting would add an implicit soft sorting expression (column1, column2, column3, column4). Ordering of column3 and column4 are not guaranteed.

For tables with many columns, sorting them by all columns might not be efficient so explicitly specifying columns for the soft sort expression might be ideal. ALTER queries would allow modification of the columns that are in the best effort sorting expression freely while still having the same restrictions for the current ORDER BY expression.

Example:

CREATE TABLE metrics (date Date, country String, domain String, browser String, hits UInt64, bandwidth Uint64)
ORDER BY (date, country)
SOFTORDER BY (date, country, domain, path)

Dropping the column domain would be allowed.

Workaround considered:

One can of course create a new table with only column1 and column2 and then insert the data from the old table into it but that gets painful for large tables.

Comment:

I am not sure if my proposed solution is ideal. Maybe there are easier ways to improve the compression of data without requiring strict sorting with its limitations.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions