Skip to content

Fix crash in clearDirectory on concurrent removal of delete_tmp_ directory#94892

Merged
alexey-milovidov merged 2 commits intoClickHouse:masterfrom
myeongjjun:fix-race-cleardir-remove
Jan 30, 2026
Merged

Fix crash in clearDirectory on concurrent removal of delete_tmp_ directory#94892
alexey-milovidov merged 2 commits intoClickHouse:masterfrom
myeongjjun:fix-race-cleardir-remove

Conversation

@myeongjjun
Copy link
Copy Markdown
Contributor

@myeongjjun myeongjjun commented Jan 23, 2026

During ReplicatedMergeTree startup, clearOldTemporaryDirectories (called from AttachThread with delete_tmp_ prefix, added in #37906) can concurrently delete a directory that loadOutdatedDataParts is also removing via clearDirectory. The fallback fs::remove_all throws filesystem_error when files disappear mid-traversal, propagating to std::terminate.

Applied the same catch (const fs::filesystem_error & e) pattern already used in clearOldTemporaryDirectories (MergeTreeData.cpp:3130).

Fixes #94891

Changelog category (leave one):

  • Critical Bug Fix (crash, data loss, RBAC) or LOGICAL_ERROR

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fixes a crash during ReplicatedMergeTree startup caused by concurrent removal of delete_tmp_* directories.

@CLAassistant
Copy link
Copy Markdown

CLAassistant commented Jan 23, 2026

CLA assistant check
All committers have signed the CLA.

@tiandiwonder tiandiwonder requested a review from Copilot January 23, 2026 05:31
@tiandiwonder tiandiwonder self-assigned this Jan 23, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a crash during ReplicatedMergeTree startup caused by concurrent directory removal. When clearOldTemporaryDirectories (running in AttachThread) and loadOutdatedDataParts both attempt to delete the same delete_tmp_* directory, the fallback fs::remove_all can throw a filesystem_error when files disappear mid-traversal, leading to termination.

Changes:

  • Added exception handling in clearDirectory to gracefully handle concurrent directory removal by catching filesystem_error with no_such_file_or_directory error code

@tiandiwonder tiandiwonder added the can be tested Allows running workflows for external contributors label Jan 23, 2026
@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented Jan 23, 2026

Workflow [PR], commit [fbc2f28]

Summary:

job_name test_name status info comment
Stateless tests (amd_debug, parallel) failure
00625_arrays_in_nested FAIL cidb
AST fuzzer (amd_ubsan) failure
UndefinedBehaviorSanitizer: undefined behavior (STID: 4795-66c7) FAIL cidb
BuzzHouse (amd_debug) failure
Logical error: 'Inconsistent AST formatting: the query: (STID: 1941-1bfa) FAIL cidb, issue
Finish Workflow failure
python3 ./ci/jobs/scripts/workflow_hooks/new_tests_check.py failure

@clickhouse-gh clickhouse-gh bot added pr-critical-bugfix pr-must-backport Pull request should be backported intentionally. Use this label with great care! labels Jan 23, 2026
@ClickHouse ClickHouse deleted a comment from Copilot AI Jan 23, 2026
Copy link
Copy Markdown
Contributor

@tiandiwonder tiandiwonder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

really good analysis!

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to handle it too.

try
{
can_remove_description.emplace(can_remove_callback());
disk->removeSharedRecursive(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need to handle it too.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Applied the same catch to both call sites.

@alexey-milovidov alexey-milovidov merged commit 90eee86 into ClickHouse:master Jan 30, 2026
128 of 134 checks passed
@robot-clickhouse-ci-1 robot-clickhouse-ci-1 added the pr-synced-to-cloud The PR is synced to the cloud repo label Jan 30, 2026
@robot-ch-test-poll robot-ch-test-poll added the pr-must-backport-synced The `*-must-backport` labels are synced into the cloud Sync PR label Jan 30, 2026
robot-ch-test-poll2 added a commit that referenced this pull request Jan 30, 2026
Cherry pick #94892 to 25.3: Fix crash in clearDirectory on concurrent removal of delete_tmp_ directory
robot-clickhouse added a commit that referenced this pull request Jan 30, 2026
robot-ch-test-poll2 added a commit that referenced this pull request Jan 30, 2026
Cherry pick #94892 to 25.8: Fix crash in clearDirectory on concurrent removal of delete_tmp_ directory
robot-clickhouse added a commit that referenced this pull request Jan 30, 2026
robot-ch-test-poll2 added a commit that referenced this pull request Jan 30, 2026
Cherry pick #94892 to 25.10: Fix crash in clearDirectory on concurrent removal of delete_tmp_ directory
robot-clickhouse added a commit that referenced this pull request Jan 30, 2026
robot-ch-test-poll2 added a commit that referenced this pull request Jan 30, 2026
Cherry pick #94892 to 25.11: Fix crash in clearDirectory on concurrent removal of delete_tmp_ directory
robot-clickhouse added a commit that referenced this pull request Jan 30, 2026
robot-ch-test-poll2 added a commit that referenced this pull request Jan 30, 2026
Cherry pick #94892 to 25.12: Fix crash in clearDirectory on concurrent removal of delete_tmp_ directory
robot-clickhouse added a commit that referenced this pull request Jan 30, 2026
robot-ch-test-poll2 added a commit that referenced this pull request Jan 30, 2026
Cherry pick #94892 to 26.1: Fix crash in clearDirectory on concurrent removal of delete_tmp_ directory
robot-clickhouse added a commit that referenced this pull request Jan 30, 2026
@robot-ch-test-poll3 robot-ch-test-poll3 added the pr-backports-created Backport PRs are successfully created, it won't be processed by CI script anymore label Jan 30, 2026
clickhouse-gh bot added a commit that referenced this pull request Jan 30, 2026
Backport #94892 to 26.1: Fix crash in clearDirectory on concurrent removal of delete_tmp_ directory
clickhouse-gh bot added a commit that referenced this pull request Jan 30, 2026
Backport #94892 to 25.12: Fix crash in clearDirectory on concurrent removal of delete_tmp_ directory
clickhouse-gh bot added a commit that referenced this pull request Jan 30, 2026
Backport #94892 to 25.11: Fix crash in clearDirectory on concurrent removal of delete_tmp_ directory
clickhouse-gh bot added a commit that referenced this pull request Jan 30, 2026
Backport #94892 to 25.8: Fix crash in clearDirectory on concurrent removal of delete_tmp_ directory
alexey-milovidov added a commit that referenced this pull request Mar 17, 2026
Backport #94892 to 25.3: Fix crash in clearDirectory on concurrent removal of delete_tmp_ directory
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors pr-backports-created Backport PRs are successfully created, it won't be processed by CI script anymore pr-critical-bugfix pr-must-backport Pull request should be backported intentionally. Use this label with great care! pr-must-backport-synced The `*-must-backport` labels are synced into the cloud Sync PR pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Server crash (std::terminate) during ReplicatedMergeTree startup due to race in delete_tmp_ directory removal

8 participants