-
Notifications
You must be signed in to change notification settings - Fork 8.3k
Efficient storage to get last non null value of each column #3852
Description
We have a table with this structure:
id VARCHAR, Timestamp DateTime, v0 Nullable(DOUBLE), v1 Nullable(DOUBLE),...
The key is (id,Timestamp).
We receive our data v0,v1,... in several parts so that inserted rows are not complete, but we want the final result to be complete rows. For example:
Inserted:
id Timestamp v0 v1 v2 v3
================================
id1 ts1 1 1 null null
id1 ts1 null null 2 2
Expected result:
id Timestamp v0 v1 v2 v3
================================
id1 ts1 1 1 2 2
We also want to be able to update values, so we added a 'version' column. The rule we want to implement is "keep the latest non-null value for each column".
We tried different solutions but none is really perfect:
- dynamically compute the final result in a view with argMax(v0,version): the performance are bad as soon as we request the view with extra criteria
- AggregatingMergeTree with argMaxState(v0, version): this also requires a view to extract the real value, so same problem as above, and the compression ratio is not very good because of the way -State is stored compared to a simple DOUBLE.
- ReplacingMergeTree: that's our current solution but we have to reload previous rows for a given set of keys, merge with new data and insert the result.
The ideal solution for us would be to have a "ignoreNullValues" option in ReplacingMergeTree so that instead of just keeping the latest row, the engine replaces a value only if the new value is not null (for each column).
Is this something that could be added?