"Tape" style backups. #8841
Description
Use case
Import and export table data and metadata to S3, the local filesystem, or another ClickHouse server.
Allow incremental backups. Backups can be restored automatically, but they should be simple enough that the user can always restore them manually if needed.
Describe the solution you'd like
A backup consists of a data file and a metadata file.
The data file is a tar archive whose directory structure resembles the directory layout of clickhouse-server.
It contains data and metadata: databases and tables with data parts for MergeTree and other table engines.
If the server uses multiple "disks", the backup will contain all data in a single directory, regardless of how it is distributed across disks. Symlinks will be dereferenced rather than put into the backup as links. Data parts for MergeTree can be prepared in the shadow directory, but in the backup they will reside in the usual data directory. Another example: if a MergeTree table uses remote storage (S3, HDFS - it's in development), the data will be read and written into the backup.
(The principle: the backup contains files as if it were restored on a server without a custom storage configuration.)
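The consolidation described above can be sketched as follows. This is a minimal illustration, assuming a hypothetical `pack_backup` helper; the `data/` prefix and the path layout inside the archive are assumptions, not the final format:

```python
import os
import tarfile

def pack_backup(disk_roots, out_path):
    """Merge per-disk data roots into one tar whose layout matches a
    single-disk server (illustrative sketch, not the real implementation)."""
    with tarfile.open(out_path, "w") as tar:
        for root in disk_roots:
            for dirpath, _dirs, filenames in os.walk(root, followlinks=True):
                for name in filenames:
                    full = os.path.join(dirpath, name)
                    # Disk-independent path inside the archive: all disks
                    # collapse into a single "data/" tree.
                    arcname = os.path.join("data", os.path.relpath(full, root))
                    # realpath dereferences symlinks, so the archive stores
                    # file contents rather than links.
                    tar.add(os.path.realpath(full), arcname=arcname,
                            recursive=False)
```

Whatever disk a part originally lived on, it ends up under the same `data/` tree, so restoring on a server with a different (or default) storage configuration needs no path translation.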
The metadata file contains a list of all paths inside the tar with a 128-bit checksum for every file (and the offset and length in the tar for random access?). It also contains one total checksum. For incremental backups it may contain the total checksum of the previous backup, its name, and its suggested location. If a backup is incremental, the metadata file still contains a record with a checksum for every file, but the data tar may lack some of these files because they can be found in previous backups. The format of the metadata file is JSON.
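A minimal sketch of such a metadata file, using MD5 purely as a stand-in for the unspecified 128-bit checksum; the field names (`files`, `total_checksum`, `base`) and the way the total checksum is derived are assumptions for illustration:

```python
import hashlib
import json
import tarfile

def build_metadata(tar_path, base_backup=None):
    """Produce a JSON metadata document for a backup tar (sketch).
    base_backup: optional dict describing the previous backup in an
    incremental chain (name, location, total checksum)."""
    files = []
    total = hashlib.md5()  # 128-bit; stand-in for the real hash function
    with tarfile.open(tar_path) as tar:
        for member in tar:
            if not member.isreg():
                continue
            data = tar.extractfile(member).read()
            digest = hashlib.md5(data).hexdigest()
            total.update(digest.encode())
            files.append({
                "path": member.name,
                "checksum": digest,
                # offset/size enable random access into the tar
                "offset": member.offset_data,
                "size": member.size,
            })
    meta = {"files": files, "total_checksum": total.hexdigest()}
    if base_backup is not None:
        meta["base"] = base_backup
    return json.dumps(meta, indent=2)
```

In an incremental backup, the `files` list stays complete; a restorer that cannot find a listed path in the current tar would follow `base` to the previous backup in the chain.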
The user may ask to back up all tables, one or more tables, or a subset of partitions of a single table. On restore, the user may ask to put the data parts of MergeTree tables into the detached directory or to replace the data immediately.
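The restore side can be sketched under the same assumed `data/<db>/<table>/<part>/` layout, with a flag choosing between the detached directory and the data directory (again, a hypothetical helper, not the actual interface):

```python
import os
import tarfile

def restore_table(tar_path, db, table, dest_root, to_detached=True):
    """Extract one table's parts from a backup tar (sketch).
    to_detached=True puts parts under the table's detached/ directory so the
    user can attach them manually; otherwise they land in the data directory
    directly."""
    prefix = f"data/{db}/{table}/"
    with tarfile.open(tar_path) as tar:
        for member in tar:
            if not member.isreg() or not member.name.startswith(prefix):
                continue
            rel = member.name[len(prefix):]  # "<part>/<file>"
            subdir = "detached" if to_detached else ""
            dest = os.path.join(dest_root, db, table, subdir, rel)
            os.makedirs(os.path.dirname(dest), exist_ok=True)
            with open(dest, "wb") as out:
                out.write(tar.extractfile(member).read())
```

Restoring into detached is the safer default: nothing becomes visible to queries until the user attaches the parts explicitly.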
It should be possible to use backups for importing example datasets from S3. Allow proceeding if a backup doesn't contain a metadata file.
For Replicated tables, the backup is created from a single replica (of the user's choice: the one where the command is run). Distributed backups are out of scope for this task (backups will be run on a per-shard basis).
Caveats?