RFC: Unique key constraint #70589

@alexey-milovidov

Description

Use case

Checking for duplicate records on insertion.
Allow INSERT IGNORE queries.

Describe the solution you'd like

The feature will be supported for tables of the MergeTree family, including Replicated and Shared MergeTree.

A user creates a constraint in the table:

CONSTRAINT check_unique_ab UNIQUE (a, b) TYPE impl_engine_name

The constraint will maintain a data structure containing a set of all unique values of the corresponding columns tuple (a, b in this example). The data structure of the set depends on the implementation engine, which could be specified inside TYPE.

The implementation could be:

- a hash table in memory with 128-bit hashes of the columns;
- a RocksDB table.
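As a rough illustration of the in-memory variant (a sketch with hypothetical names; the actual hash function and serialization would be ClickHouse's own), the set stores a 128-bit hash of each constrained column tuple instead of the values themselves:

```python
import hashlib

class UniqueSet:
    """Hypothetical in-memory uniqueness set: stores 128-bit hashes
    of the constrained column tuple (a, b) rather than the values."""

    def __init__(self):
        self._hashes = set()

    @staticmethod
    def _hash(key_tuple):
        # 128-bit BLAKE2b digest of the serialized tuple; stands in for
        # whatever 128-bit hash the real implementation would use.
        return hashlib.blake2b(repr(key_tuple).encode(), digest_size=16).digest()

    def try_insert(self, key_tuple):
        """Return True if the tuple is new, False if it is a duplicate."""
        h = self._hash(key_tuple)
        if h in self._hashes:
            return False
        self._hashes.add(h)
        return True

s = UniqueSet()
print(s.try_insert((1, "x")))  # True: first occurrence
print(s.try_insert((1, "x")))  # False: duplicate rejected
```

Storing fixed-size hashes keeps the memory footprint independent of the width of the constrained columns, at the cost of a negligible false-positive probability from hash collisions.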

The set is not persistent. On server restart, it is reconstructed from scratch by reading the corresponding columns from the table's data. However, we can do this asynchronously and have INSERT queries either wait or throw an exception while reconstruction is in progress. We will also introduce a setting to ignore the constraint. Additionally, we can dump a snapshot on server shutdown to improve startup time when the snapshot exists.
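The startup behavior can be sketched as follows (hypothetical names; `existing_keys` stands in for scanning the constrained columns of all data parts, and `wait_on_insert` for the proposed wait-or-throw setting):

```python
import threading

class RebuildingSet:
    """Hypothetical sketch: after a restart the uniqueness set is rebuilt
    in a background thread; INSERTs either wait for it to finish or
    throw, depending on a setting."""

    def __init__(self, existing_keys, wait_on_insert=True):
        self._keys = set()
        self._ready = threading.Event()
        self._wait_on_insert = wait_on_insert
        threading.Thread(target=self._rebuild, args=(existing_keys,),
                         daemon=True).start()

    def _rebuild(self, existing_keys):
        # Simulates reading the constrained columns from every data part.
        for k in existing_keys:
            self._keys.add(k)
        self._ready.set()

    def try_insert(self, key):
        if not self._ready.is_set():
            if not self._wait_on_insert:
                raise RuntimeError("unique set is still being reconstructed")
            self._ready.wait()  # block the INSERT until the rebuild is done
        if key in self._keys:
            return False
        self._keys.add(key)
        return True

s = RebuildingSet(existing_keys=[(1, "x"), (2, "y")])
print(s.try_insert((1, "x")))  # False: present in the rebuilt set
print(s.try_insert((3, "z")))  # True
```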

We should not reuse the previous state of the data structure after an unclean restart, because insertions into the table and into the set are not atomic. This justifies why the data structure should be ephemeral and reconstructible from scratch.

The set is updated on each insert and on the replication of data parts. Whenever a new data part is added, we have to read the corresponding columns and fill their values into the set. If an exception occurs while updating the data structure, the in-memory transaction with the data parts should also be reverted.

The main difficulty is concurrent and distributed operation. For non-replicated MergeTree, as well as for concurrent inserts within a single server, consistency will be achieved by serializing inserts with a mutex.
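The single-server path can be sketched as below (hypothetical names; `fail` simulates an exception while committing the data part). The mutex serializes inserts, and a failure to commit the part rolls the set back, keeping the two in sync:

```python
import threading

class LocalTable:
    """Hypothetical single-server sketch: a mutex serializes inserts,
    and a failed part commit reverts the uniqueness set."""

    def __init__(self):
        self._lock = threading.Lock()
        self._unique = set()
        self._parts = []

    def insert(self, rows, key_of, fail=False):
        with self._lock:  # serialize concurrent inserts
            keys = [key_of(r) for r in rows]
            for k in keys:
                if k in self._unique:
                    raise ValueError(f"duplicate key {k!r}")
            self._unique.update(keys)
            try:
                if fail:  # simulated exception while committing the part
                    raise IOError("part commit failed")
                self._parts.append(rows)
            except Exception:
                # Revert the set together with the part transaction.
                self._unique.difference_update(keys)
                raise

t = LocalTable()
t.insert([{"a": 1, "b": "x"}], key_of=lambda r: (r["a"], r["b"]))
try:
    t.insert([{"a": 2, "b": "y"}], key_of=lambda r: (r["a"], r["b"]), fail=True)
except IOError:
    pass
print((2, "y") in t._unique)  # False: rolled back with the failed part
```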

For replicated and shared MergeTree, each server maintains its own copy of the set independently. Consistency will be achieved by serializing operations as optimistic transactions with retries: we obtain the set of data parts, perform the insertion and update the data structure, and on the final commit to Keeper atomically check that the set of data parts has not been updated behind our back. If it has, we roll back the changes to the set and retry the insertion of the data chunk.

This serialization of operations should be done independently for each partition of the table.
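The optimistic retry loop can be sketched as follows (a simplification with hypothetical names: `KeeperMock` stands in for the versioned parts set in Keeper, and the compare-and-swap `commit` for the atomic multi-op check; in reality one such loop would run per partition):

```python
class KeeperMock:
    """Hypothetical stand-in for Keeper: a versioned set of data parts
    with an atomic compare-and-swap commit."""
    def __init__(self):
        self.version = 0
        self.parts = []

    def commit(self, seen_version, part):
        if seen_version != self.version:  # parts changed behind our back
            return False
        self.parts.append(part)
        self.version += 1
        return True

def insert_with_retries(keeper, unique_set, keys, part, max_retries=5):
    """Optimistic transaction: check the keys, tentatively update the
    local set, then try to commit; on a version conflict, roll the set
    back and retry the whole chunk."""
    for _ in range(max_retries):
        seen = keeper.version
        if any(k in unique_set for k in keys):
            raise ValueError("duplicate key")
        unique_set |= set(keys)       # tentative update of the local set
        if keeper.commit(seen, part):
            return True
        unique_set -= set(keys)       # roll back and retry
    return False

keeper = KeeperMock()
uniq = set()
print(insert_with_retries(keeper, uniq, [(1, "x")], part="part_1"))  # True
```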

Describe alternatives you've considered

Put the whole set into a replicated state machine and unify it with Keeper.
As of today, this is impractical and inconvenient, because we use Keeper only for a small amount of metadata, and the number of table replicas is often less than the number of Keeper nodes.

Additional context

Positional reads. If the data structure maintains not only the presence of key values but also their row numbers in the data parts, it will allow fast key-value requests via positional reads.
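A minimal sketch of this extension (hypothetical names; the map from key to position replaces the plain set):

```python
class PositionalIndex:
    """Hypothetical extension of the uniqueness set: store
    (part_name, row_number) per key, so a key-value lookup becomes a
    positional read of a single row from one data part."""

    def __init__(self):
        self._pos = {}

    def add_part(self, part_name, keys):
        # Record the position of each key as the part's rows are scanned.
        for row_no, key in enumerate(keys):
            self._pos[key] = (part_name, row_no)

    def locate(self, key):
        return self._pos.get(key)  # None if the key is absent

idx = PositionalIndex()
idx.add_part("all_1_1_0", [(1, "x"), (2, "y")])
print(idx.locate((2, "y")))  # ('all_1_1_0', 1)
```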

Index of type Set or Join. A much simpler, alternative idea: allow a MergeTree table to also act as a Set or Join table by specifying indices of type Set or Join. The data structures behind these indexes should be ephemeral and reconstructed on load, as described above.

We can implement REPLACE, MERGE, and ON DUPLICATE KEY UPDATE queries, but this will require running a SELECT query before the INSERT and, more importantly, guaranteeing that the SELECT uses the same fresh set of data parts. This increases the duration of the "optimistic transaction" and makes INSERTs rather slow.
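The cost can be seen in a sketch of the read-modify-write cycle (hypothetical names; `VersionedStore` stands in for the versioned snapshot of data parts): the SELECT must sit inside the same optimistic transaction as the INSERT, so any concurrent commit invalidates both and forces a full retry.

```python
class VersionedStore:
    """Hypothetical versioned snapshot of the table with a CAS commit."""
    def __init__(self):
        self.version = 0
        self.rows = {}

    def commit(self, seen_version, key, row):
        if seen_version != self.version:  # snapshot moved on: retry needed
            return False
        self.rows[key] = row
        self.version += 1
        return True

def upsert(store, key, new_cols, max_retries=5):
    """ON DUPLICATE KEY UPDATE sketch: the SELECT before the INSERT must
    observe the same snapshot the commit is checked against, which
    lengthens the optimistic transaction."""
    for _ in range(max_retries):
        seen = store.version
        old = store.rows.get(key, {})            # the SELECT
        if store.commit(seen, key, {**old, **new_cols}):  # the INSERT
            return store.rows[key]
    raise RuntimeError("too many conflicting writers")

s = VersionedStore()
upsert(s, (1, "x"), {"cnt": 1})
print(upsert(s, (1, "x"), {"cnt": 2}))  # {'cnt': 2}
```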
