Skip to content

Add DatabaseBackup#75725

Merged
kssenii merged 14 commits intoClickHouse:masterfrom
kitaisreal:add-database-backup
Feb 19, 2025
Merged

Add DatabaseBackup#75725
kssenii merged 14 commits intoClickHouse:masterfrom
kitaisreal:add-database-backup

Conversation

@kitaisreal
Copy link
Copy Markdown
Contributor

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Add new Database engine DatabaseBackup that allows to instantly attach table/database from backup.

@nikitamikhaylov nikitamikhaylov added the can be tested Allows running workflows for external contributors label Feb 7, 2025
@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented Feb 7, 2025

Workflow [PR], commit [0eb298b]

@clickhouse-gh clickhouse-gh bot added the pr-feature Pull request with new product feature label Feb 7, 2025
@nikitamikhaylov
Copy link
Copy Markdown
Member

Duplicate? #64222

@kitaisreal
Copy link
Copy Markdown
Contributor Author

I thought that backups implementation is very similar to disk implementation internally in ClickHouse. And it can work similar to web disks. Except that even if you adapt backup engine to disk interface, you cannot just use this disk in MergeTree table storage policy, because underlying data in backup is stored in format similar to DatabaseOrdinary and you cannot attach such a table to an DatabaseAtomic that expects that table data path will be store/first_uuid_characters/uuid.

So then I decided that it makes sense to implement a new Backup database engine. That way we actually solve a lot of problems:

  1. You can have backup with any amount of databases and tables. You just need to specify a concrete database name in the Backup database engine.
  2. Table backup looks exactly the same as a database backup. We store the database in backup from which we created the table. So that way the interface is very generic and will work with databases and tables.
  3. You do not need to specify table types, because all metadata for all tables is already stored in backup.
  4. During database load it is possible to properly bootstrap tables, when we read table metadata, so we can change engines in ReplicatedMergeTree tables, change storage policy, change views/materialized views queries.

Although, we still have the problem that the database relative path will be completely different from the backup database relative path. So we need to also have a path translation layer somewhere inside disk implementation.

So current design has following components:

  1. BackupDisk - adapts backup interface to disk interface and performs path translation.
  2. DatabaseBackup - uses backup to load tables metadata, changes tables metadata to properly bootstrap tables, creates storage policy with backup disk and sets this storage policy to bootstrap tables.

When the MergeTree table inside the database Backup tries to read some file, it uses the relative path of the Backup database that gets translated inside the DiskBackup to a proper path that can be used to query files/directories in backup.

@kitaisreal
Copy link
Copy Markdown
Contributor Author

DatabaseBackup provides a very familiar interface to people who work with backups.

You have backup query:

BACKUP DATABASE/TABLE name TO BACKUP_ENGINE

Backup engine can be anything like S3, Azure, File, Disk. With Disk backup engine it looks like this:

BACKUP DATABASE/TABLE name TO Disk(‘disk_name’, ‘backup_name’)

To create database Backup you can just use such query:

CREATE DATABASE backup ENGINE = Backup(‘database_name_inside_backup’, BACKUP_ENGINE)

For example with Disk backup engine query will be like this:

CREATE DATABASE backup ENGINE = Backup(‘database_name’, Disk(‘disk_name’, ‘backup_name’))

The goal was for the client to specify exactly the same engine that he used during backup creation. For all backup engines database Backup will work.

@kitaisreal
Copy link
Copy Markdown
Contributor Author

kitaisreal commented Feb 7, 2025

Duplicate? #64222

I discussed with @alexey-milovidov and this pull request looks abandoned. Last commit May 22, 2024 .

Plus only Backup disk implementation is not useful at all, because you need to have implementation of full Backup attach mechanism.

In this pull request I implemented it on database layer, so it can work with any table/database from backup. I wrote more motivation above.

@kssenii kssenii self-assigned this Feb 7, 2025
@kitaisreal
Copy link
Copy Markdown
Contributor Author

Database Backup works with both incremental and non-incremental backups.

Let’s make an example with a Disk engine because it looks more like production configuration. Let’s first setup backups disk in storage.xml:

<storage_configuration>
	<disks>
		<backups>
			<type>local</type>
			<path>/home/ubuntu/ClickHouseWorkDir/backups/</path>
		</backups>
	</disks>
</storage_configuration>
<backups>
	<allowed_disk>backups</allowed_disk>
	<allowed_path>/home/ubuntu/ClickHouseWorkDir/backups/</allowed_path>
</backups>

Example of usage. Let’s create test database, tables, insert some data and then create a backup:

CREATE DATABASE test_database;

CREATE TABLE test_database.test_table_1 (id UInt64, value String) ENGINE=MergeTree ORDER BY id;
INSERT INTO test_database.test_table_1 VALUES (0, 'test_database.test_table_1');

CREATE TABLE test_database.test_table_2 (id UInt64, value String) ENGINE=MergeTree ORDER BY id;
INSERT INTO test_database.test_table_2 VALUES (0, 'test_database.test_table_2');

CREATE TABLE test_database.test_table_3 (id UInt64, value String) ENGINE=MergeTree ORDER BY id;
INSERT INTO test_database.test_table_3 VALUES (0, 'test_database.test_table_3');

BACKUP DATABASE test_database TO Disk('backups', 'test_database_backup');

So now we have test_database_backup backup and let’s create database Backup:

CREATE DATABASE test_database_backup ENGINE = Backup('test_database', Disk('backups', 'test_database_backup'));

Now we can query any table from database:

SELECT id, value FROM test_database.test_table_1;

┌─id─┬─value──────────────────────┐
│  0 │ test_database.test_table_1 │
└────┴────────────────────────────┘

SELECT id, value FROM test_database.test_table_2;

┌─id─┬─value──────────────────────┐
│  0 │ test_database.test_table_2 │
└────┴────────────────────────────┘

SELECT id, value FROM test_database.test_table_3;

┌─id─┬─value──────────────────────┐
│  0 │ test_database.test_table_3 │
└────┴────────────────────────────┘

It is also possible to work with this database as with any ordinary database. For example query tables in it:
SELECT database, name FROM system.tables WHERE database = 'test_database_backup':

┌─database─────────────┬─name─────────┐
│ test_database_backup │ test_table_1 │
│ test_database_backup │ test_table_2 │
│ test_database_backup │ test_table_3 │
└──────────────────────┴──────────────┘

Copy link
Copy Markdown
Member

@kssenii kssenii left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add some documentation for this feature in docs/en/engines/database-engines/.

@kitaisreal
Copy link
Copy Markdown
Contributor Author

Please add some documentation for this feature in docs/en/engines/database-engines/.

I added documentation.

@kitaisreal kitaisreal requested a review from kssenii February 16, 2025 08:40
SELECT '--';

DROP DATABASE IF EXISTS 03279_test_database_backup_database;
CREATE DATABASE 03279_test_database_backup_database ENGINE = Backup('03279_test_database', Disk('backups', '03279_test_database'));
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add one integration test with server restart? Just to test that server restarts successfully with created DatabaseBackup and tables are accessible after restart.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added integation tests.

@kitaisreal kitaisreal requested a review from kssenii February 17, 2025 13:03
@kitaisreal kitaisreal requested a review from kssenii February 17, 2025 16:52
@kssenii
Copy link
Copy Markdown
Member

kssenii commented Feb 18, 2025

@kssenii
Copy link
Copy Markdown
Member

kssenii commented Feb 18, 2025

Could you please merge this branch with master one more time - there was a change in style check (#76082) and now style check fails in the private synchronization PR of this PR, because checks were started there on a branch including the mentioned change in the style check, but this PR's checks were run without it, but it will fail in this PR as well once you merge with master branch.
The concrete failure reason is this

# Left/right single/double quotation marks (too easy to confuse with standard single/double quotation marks)
EXCLUDE_QUOTATION_MARKS='src/Functions/HTMLCharacterReference.generated.cpp|src/Parsers/Lexer.cpp'
find $ROOT_PATH/{base,src,programs,utils,docs/en} -name '*.md' -or -name '*.h' -or -name '*.cpp' -or -name '*.js' -or -name '*.py' -or -name '*.html' | grep -vP $EXCLUDE_QUOTATION_MARKS | xargs grep -l -P '”|“|‘|’' && echo "^ Files contain left or right single/double quotation marks: ”, “, ‘, or ’."

@kssenii kssenii enabled auto-merge February 18, 2025 15:13
auto-merge was automatically disabled February 19, 2025 10:34

Head branch was pushed to by a user without write access

@kssenii
Copy link
Copy Markdown
Member

kssenii commented Feb 19, 2025

Integration tests (asan, old analyzer, 2/6)

test_quorum_inserts

Integration tests (asan, old analyzer, 3/6)

test_storage_rabbitmq

CH Inc sync

It was broken in general, no relevant failures

@kssenii kssenii added this pull request to the merge queue Feb 19, 2025
Merged via the queue into ClickHouse:master with commit 638e7fe Feb 19, 2025
120 of 127 checks passed
@robot-ch-test-poll4 robot-ch-test-poll4 added the pr-synced-to-cloud The PR is synced to the cloud repo label Feb 19, 2025
@azat azat mentioned this pull request Mar 18, 2025
31 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors pr-feature Pull request with new product feature pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants