
Self-balancing architecture: Moving parts between shards #13574

@xjewer

Description

Issue:

So far ClickHouse as a data store shifts the responsibility for balancing onto the operator, who has to provide the inbound data flow in a balanced manner. That is to say, once data is written to any of the shards, it's hard to move it around without manual intervention: either tricking the read pipeline by building a table with the Merge engine on top and copying partitions around, or moving parts by hand, with inevitable maintenance windows and partial responses. Needless to say, this is cumbersome and exhausting.

Proposal:

The proposal doesn't cover an entire self-balancing architecture; rather, it suggests an extension of ALTER TABLE table MOVE PARTITION|PART to be able to move parts between shards seamlessly.

Similar to moving a PARTITION|PART to a disk or volume with TO DISK|VOLUME, the proposed syntax is:

ALTER TABLE table_name MOVE PARTITION|PART partition_expr TO SHARD 'shard_id'
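For illustration, moving one month of a hypothetical `hits` table to shard '2' might look like this (the table name, partition value, part name, and shard id are all made up):

```sql
ALTER TABLE hits MOVE PARTITION '202007' TO SHARD '2';
ALTER TABLE hits MOVE PART '202007_7145_7145_0' TO SHARD '2';
```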

The process is sketched out as follows:

  1. ClickHouse excludes the moving parts from upcoming merges and stops any running merges that involve them
  2. Copy the data to the detached directory on the designated shard (on all live replicas)
  3. When the copy is done, initiate the part attachment on the new node:
    a. mark the moving parts on the old node as OLD; during SELECTs, return to the root executor all OLD part ids that were processed during the query
    b. mark the moving parts on the new node as NEW and ATTACH the part locally on the replicas; during SELECTs, return to the root executor all NEW part ids that were processed during the query
    c. remove the part from the old node
    d. remove the NEW mark from the parts on the new node.
    During SELECTs against the cluster, the root executor (the node that receives the initial query from the client) will retry the query on the old leaf node if the cluster result has an intersection between OLD and NEW part ids, e.g. using virtual columns:
where _part != '202007_7145_7145_0'
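For instance, if part '202007_7145_7145_0' shows up in both the OLD and NEW result sets, the retried query against the old node could exclude it via the _part virtual column (the table name here is hypothetical):

```sql
SELECT count()
FROM hits
WHERE _part != '202007_7145_7145_0'
```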

This is the pessimistic scenario, when a SELECT hits both copies of a part on the old and new nodes between steps 3.b and 3.c. Depending on the data access pattern, a further optimization is possible: the old node can precompute both results, with and without the OLD parts.
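A minimal sketch of the duplicate-detection check the root executor would perform (the function name and shape are illustrative, not an actual ClickHouse API):

```python
def should_retry(old_part_ids, new_part_ids):
    """Retry against the old leaf node if any moved part was read on
    both the old shard (marked OLD) and the new shard (marked NEW)."""
    return bool(set(old_part_ids) & set(new_part_ids))
```

If the intersection is empty, exactly one copy of each moved part was read and the result is already correct; only an overlap forces the retry with the `_part` exclusion shown above.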

UPDATE:

IRL @nvartolomei raised a few concerns regarding the concept:

  1. Parts don't necessarily have the same names across shards
  2. Re-attaching a part changes its name by renumbering the block_id suffix.

Both of these can be covered by extending a part's metadata during the movement: not only add the OLD/NEW marker, but also a unique identifier that can be discarded later, once the movement is complete.
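One way to picture that extended per-part movement metadata (a hypothetical sketch, not ClickHouse's actual part format):

```python
from dataclasses import dataclass
from typing import Optional
from uuid import UUID, uuid4

@dataclass
class PartMoveMetadata:
    """Hypothetical per-part metadata kept only while a move is in flight."""
    part_name: str            # may differ between shards (concern 1)
    marker: Optional[str]     # 'OLD' on the source shard, 'NEW' on the destination
    move_id: Optional[UUID]   # shared by both copies; dropped once the move completes

    def same_move(self, other: "PartMoveMetadata") -> bool:
        # Correlate the two copies by move_id rather than by name, since
        # re-attachment may have renumbered the block_id suffix (concern 2).
        return self.move_id is not None and self.move_id == other.move_id
```

Because correlation goes through `move_id` rather than part names, both concerns disappear: the names are free to differ and to be renumbered.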

Metadata


Labels: feature, not planned (Known issue, no plans to fix it currently)
