Skip to content

Efficient storage to get last non null value of each column #3852

@bgranvea

Description

@bgranvea

We have a table with this structure:

id VARCHAR, Timestamp DateTime, v0 Nullable(DOUBLE), v1 Nullable(DOUBLE),...

The key is (id,Timestamp).

We receive our data v0,v1,... in several parts so that inserted rows are not complete, but we want the final result to be complete rows. For example:

Inserted:

id   Timestamp v0   v1   v2   v3
================================
id1  ts1       1    1    null null
id1  ts1       null null 2    2

Expected result:

id   Timestamp v0   v1   v2   v3
================================
id1  ts1       1    1    2    2

We also want to be able to update values, so we added a 'version' column. The rule we want to implement is "keep the latest non-null value for each column".

We tried different solutions but none is really perfect:

  • dynamically compute the final result in a view with argMax(v0,version): the performance are bad as soon as we request the view with extra criteria
  • AggregatingMergeTree with argMaxState(v0, version): this also requires a view to extract the real value, so same problem as above, and the compression ratio is not very good because of the way -State is stored compared to a simple DOUBLE.
  • ReplacingMergeTree: that's our current solution but we have to reload previous rows for a given set of keys, merge with new data and insert the result.

The ideal solution for us would be to have a "ignoreNullValues" option in ReplacingMergeTree so that instead of just keeping the latest row, the engine replaces a value only if the new value is not null (for each column).

Is this something that could be added?

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions