Fix recovering replicated databases with long moving metadata file #85177

Merged
tuanpach merged 2 commits into ClickHouse:master from tuanpach:fix-recover-replicated-database-with-long-moving-file on Aug 15, 2025

@tuanpach tuanpach commented Aug 7, 2025

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fix recovering replicated databases when moving the metadata file takes a long time.

When a Replicated database is created, it is considered started as soon as its loading tasks have finished. From that point on, the DDLWorker can process DDL entries. To execute a CREATE TABLE entry, it must acquire the table's ddl_guard, and the ddl_guard in turn requires the database's shared mutex.

If moving the metadata file takes a long time, the database's ddl_guard is still held. While the DDLWorker is creating a table, it cannot acquire the database's shared mutex and throws an exception:

throw Exception(ErrorCodes::UNKNOWN_DATABASE, "Database {} is currently dropped or renamed", database_name);

This PR:

  • Retries with a timeout when acquiring the DB lock fails and the DB still exists.
  • Moves the metadata file before checking whether the table loading tasks are finished.
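
For illustration, the retry behaviour in the first bullet could be sketched as follows. This is a minimal standalone sketch, not the code from this PR: `DatabaseMutex`, `lockSharedWithRetry`, and the `database_exists` callback are hypothetical names, `std::shared_mutex` stands in for the real database lock, and `std::runtime_error` stands in for the `UNKNOWN_DATABASE` exception. The defaults of 10 tries and 100 ms mirror the `MAX_TRY` and `INTERVAL_MS` constants visible in the review snippet.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <shared_mutex>
#include <stdexcept>
#include <string>
#include <thread>

// Hypothetical stand-in for the database-level lock; not the actual ClickHouse type.
using DatabaseMutex = std::shared_mutex;

// Try to take a shared lock on db_mutex. If the lock is busy but the database
// still exists, wait and try again, up to max_tries attempts in total.
// Throws if the database is gone or if every attempt fails.
std::shared_lock<DatabaseMutex> lockSharedWithRetry(
    DatabaseMutex & db_mutex,
    const std::function<bool()> & database_exists,
    const std::string & database_name,
    unsigned max_tries = 10,
    std::chrono::milliseconds interval = std::chrono::milliseconds(100))
{
    assert(max_tries > 0);

    for (unsigned attempt = 0; attempt < max_tries; ++attempt)
    {
        std::shared_lock<DatabaseMutex> lock(db_mutex, std::try_to_lock);
        if (lock.owns_lock())
            return lock;

        // The lock is busy. If the database no longer exists, it really was
        // dropped or renamed, so retrying cannot help.
        if (!database_exists())
            break;

        // Do not sleep after the final attempt.
        if (attempt + 1 < max_tries)
            std::this_thread::sleep_for(interval);
    }

    // Stand-in for Exception(ErrorCodes::UNKNOWN_DATABASE, ...).
    throw std::runtime_error("Database " + database_name + " is currently dropped or renamed");
}
```

A caller would hold the returned `std::shared_lock` for the duration of the table DDL, for example `auto lock = lockSharedWithRetry(mutex, [&] { return db_is_known; }, "db");` where `db_is_known` is any caller-side existence check. Checking existence between attempts keeps the original fast failure for databases that really are dropped or renamed, while tolerating a slow metadata move.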

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

@tuanpach tuanpach added the can be tested Allows running workflows for external contributors label Aug 7, 2025
clickhouse-gh bot commented Aug 7, 2025

Workflow [PR], commit [4e17c72]

Summary:

| job_name | test_name | status | info | comment |
|---|---|---|---|---|
| Integration tests (amd_asan, old analyzer, 4/6) | | failure | | |
| | test_restore_db_replica/test.py::test_query_after_restore_db_replica[rename table-with exists table-with restart] | FAIL | | |
| Finish Workflow | | failure | | |
| | python3 ./ci/jobs/scripts/workflow_hooks/new_tests_check.py | failure | | |

@clickhouse-gh clickhouse-gh bot added the pr-bugfix Pull request with bugfix, not backported by default label Aug 7, 2025
@tuanpach tuanpach force-pushed the fix-recover-replicated-database-with-long-moving-file branch 2 times, most recently from cfbaabd to 4ce8c5a Compare August 7, 2025 08:37
@jkartseva jkartseva self-assigned this Aug 7, 2025
@jkartseva jkartseva left a comment

Looks good.

if (!is_database_guard)
{
static constexpr int MAX_TRY = 10;
static constexpr int INTERVAL_MS = 100;
nit: the parameter of void sleepForMilliseconds(uint64_t milliseconds); in base/base/sleep.h is unsigned.

Member Author

Updated

…meout.

- Moves the moving file before checking if the table loading tasks are finished
@tuanpach tuanpach force-pushed the fix-recover-replicated-database-with-long-moving-file branch from 710e533 to 4e17c72 Compare August 15, 2025 02:18
@tuanpach tuanpach enabled auto-merge August 15, 2025 08:33
@tuanpach tuanpach added this pull request to the merge queue Aug 15, 2025
Merged via the queue into ClickHouse:master with commit cf137c2 Aug 15, 2025
119 of 122 checks passed
@tuanpach tuanpach deleted the fix-recover-replicated-database-with-long-moving-file branch August 15, 2025 10:54
@robot-ch-test-poll4 robot-ch-test-poll4 added the pr-synced-to-cloud The PR is synced to the cloud repo label Aug 15, 2025
Labels

  • can be tested: Allows running workflows for external contributors
  • pr-bugfix: Pull request with bugfix, not backported by default
  • pr-synced-to-cloud: The PR is synced to the cloud repo

3 participants