
Self-balancing architecture: Moving parts between shards #13574

@xjewer

Description

Issue:

So far ClickHouse as a data store shifts the responsibility for balancing onto the operator, who has to provide the inbound data flow in a balanced manner. That is to say, once data is written to any of the shards, it's hard to move it around without manual intervention: either tricking the read pipeline by building a table with the Merge engine on top and copying partitions around, or moving parts by hand, with inevitable maintenance windows and partial responses. Needless to say, this is cumbersome and exhausting.

Proposal:

The proposal doesn't cover an entire self-balancing architecture; rather, it suggests an extension of ALTER TABLE table MOVE PARTITION|PART to be able to move parts between shards seamlessly.

Similar to moving a PARTITION|PART to a disk or volume with TO DISK|VOLUME, the proposed syntax is:

ALTER TABLE table_name MOVE PARTITION|PART partition_expr TO SHARD 'shard_id'
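For illustration, moving one month of a hypothetical `hits` table to shard '2' might look like this (the table name, partition value, part name, and shard id are all made up):

```sql
ALTER TABLE hits MOVE PARTITION '202007' TO SHARD '2';
ALTER TABLE hits MOVE PART '202007_7145_7145_0' TO SHARD '2';
```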

The process is sketched out as follows:

  1. ClickHouse excludes the moving parts from upcoming merges and stops any running merges that involve them
  2. Copy the data to the detached directory on the designated shard (on all live replicas)
  3. When the copy is done, initiate the part attachment on the new node:
    a. mark the moving parts on the old node as OLD; during SELECTs, return to the root executor all OLD part ids that were processed during the query
    b. mark the moving parts on the new node as NEW and ATTACH the part locally on the replicas; during SELECTs, return to the root executor all NEW part ids that were processed during the query
    c. remove the part from the old node
    d. remove the NEW mark from the parts on the new node.
    During SELECTs against the cluster, the root executor (the node that receives the initial query from the client) will retry the query on the old leaf node if the cluster result has an intersection between OLD and NEW part ids, e.g. using virtual columns:
where _part != '202007_7145_7145_0'
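For instance, if part '202007_7145_7145_0' shows up in both the OLD and NEW result sets, the retried query against the old node could exclude it via the _part virtual column (the table name here is hypothetical):

```sql
SELECT count()
FROM hits
WHERE _part != '202007_7145_7145_0'
```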

This is the pessimistic scenario, when a SELECT hits both copies of a part on the old and new nodes between steps 3.b and 3.c. Depending on the data access pattern, a further optimization is possible: the old node can precompute both results, with and without the OLD parts.
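A minimal sketch of the duplicate-detection check the root executor would perform (the function name and shape are illustrative, not an actual ClickHouse API):

```python
def should_retry(old_part_ids, new_part_ids):
    """Retry against the old leaf node if any moved part was read on
    both the old shard (marked OLD) and the new shard (marked NEW)."""
    return bool(set(old_part_ids) & set(new_part_ids))
```

If the intersection is empty, exactly one copy of each moved part was read and the result is already correct; only an overlap forces the retry with the `_part` exclusion shown above.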

UPDATE:

IRL @nvartolomei raised a few concerns regarding the concept:

  1. Parts don't necessarily have the same names across shards
  2. Re-attaching a part changes its name by renumbering the block_id suffix.

Both of these can be covered by extending a part's metadata during the movement: not only add the OLD/NEW marker, but also a unique identifier that can be discarded later, once the movement is complete.
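One way to picture that extended per-part movement metadata (a hypothetical sketch, not ClickHouse's actual part format):

```python
from dataclasses import dataclass
from typing import Optional
from uuid import UUID, uuid4

@dataclass
class PartMoveMetadata:
    """Hypothetical per-part metadata kept only while a move is in flight."""
    part_name: str            # may differ between shards (concern 1)
    marker: Optional[str]     # 'OLD' on the source shard, 'NEW' on the destination
    move_id: Optional[UUID]   # shared by both copies; dropped once the move completes

    def same_move(self, other: "PartMoveMetadata") -> bool:
        # Correlate the two copies by move_id rather than by name, since
        # re-attachment may have renumbered the block_id suffix (concern 2).
        return self.move_id is not None and self.move_id == other.move_id
```

Because correlation goes through `move_id` rather than part names, both concerns disappear: the names are free to differ and to be renumbered.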

Metadata


Labels: feature, not planned (Known issue, no plans to fix it currently)
