Back/Restore concurrency check on previous fails by SmitaRKulkarni · Pull Request #48726 · ClickHouse/ClickHouse

SmitaRKulkarni · 2023-04-12T18:28:13Z

Changelog category (leave one):

Bug Fix

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md): 
Add an error or completed status in ZooKeeper for a cluster-wide backup/restore, to avoid misinterpreting a previously failed backup/restore when ZooKeeper is unable to remove its nodes. Resolves #45486.


vitlibar · 2023-04-28T19:30:40Z

src/Backups/IBackupCoordination.h


    /// Sets the current stage and waits for other hosts to come to this stage too.
    virtual void setStage(const String & new_stage, const String & message) = 0;
+    virtual void setStageForCluster(const String & new_stage) = 0;


What is the "stage for cluster"? Each host has its own stage.

When we check for concurrency and allow a backup/restore to proceed, we need to inform ZooKeeper that this backup/restore is going to be processed, even before each host gets the query. As mentioned in the comment above, a state is needed here.

vitlibar · 2023-04-28T19:35:40Z

src/Backups/RestoreCoordinationRemote.cpp

+
 void RestoreCoordinationRemote::setError(const Exception & exception)
 {
+    stage_sync->setStageForCluster(Stage::ERROR);


stage_sync->setError() already adds a ZooKeeper node <backup_path>/stage/error with information about the error. Why isn't it enough?

In some of our CI failures, I see that ZooKeeper is not able to delete all the nodes. At times it deletes the error node, but the stage node (which was added for the concurrency check) remains, causing misinterpretation.
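The failure mode described above can be illustrated with a minimal sketch. It is not the ClickHouse implementation: `FakeNodes`, the node names, the stage values, and all three functions are hypothetical, and an in-memory map stands in for ZooKeeper. It only shows why a leftover stage node is ambiguous unless the node records a terminal status.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical in-memory stand-in for the ZooKeeper nodes of one backup/restore.
// Keys are node paths relative to the operation's root, values are node contents.
using FakeNodes = std::map<std::string, std::string>;

// Assumed presence-only check: any leftover stage node counts as an active operation.
bool blocksByPresence(const FakeNodes & nodes)
{
    return nodes.count("stage") > 0;
}

// Check with a terminal status: the stage node blocks only while its value is not final.
bool blocksByStatus(const FakeNodes & nodes)
{
    auto it = nodes.find("stage");
    if (it == nodes.end())
        return false;
    return it->second != "error" && it->second != "completed";
}

// Simulates a failed operation whose cleanup removed the error node but not the stage node.
FakeNodes failedWithPartialCleanup(bool record_terminal_status)
{
    FakeNodes nodes{{"stage", "scheduled_to_start"}, {"stage/error", "some exception"}};
    if (record_terminal_status)
        nodes["stage"] = "error";
    // Cleanup got as far as the error node and then stopped.
    [[maybe_unused]] auto erased = nodes.erase("stage/error");
    assert(erased == 1);
    return nodes;
}
```

In this sketch, a presence-only check keeps blocking new operations after the partial cleanup, while the status-based check lets them proceed once the stage has been set to an error value.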

vitlibar · 2023-04-28T20:38:16Z

Can we just use alive nodes? I mean, while a backup or restore process is working, there are always alive nodes in ZooKeeper. So all you need is to check whether there is another backup/restore process with alive nodes.

SmitaRKulkarni · 2023-05-01T07:02:13Z

I don't think we can use just alive nodes. They are a good indication of whether there are ongoing backups/restores, but if we decide to proceed with a backup/restore, we need to inform ZooKeeper (and thus the other hosts), and this happens before each host gets the query, when we cannot create alive nodes yet. So I had added a 'SCHEDULED_TO_START' stage; if it is present, no other backup/restore will start. The changes in this PR update this stage to COMPLETED or ERROR after the backup/restore succeeds or fails.
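The lifecycle described in this reply could be sketched roughly as follows. This is an illustration under assumptions, not the code from the PR: `ClusterStage`, `ClusterStageRegistry`, `trySchedule`, and `finish` are hypothetical names, and a plain in-memory map replaces the shared state in ZooKeeper.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical cluster-wide stages. The names follow the discussion,
// the type itself does not come from ClickHouse.
enum class ClusterStage { ScheduledToStart, Completed, Error };

// Hypothetical stand-in for the shared state kept in ZooKeeper, keyed by operation id.
class ClusterStageRegistry
{
public:
    // Called by the initiator before the query reaches the other hosts,
    // i.e. before any alive nodes can exist.
    // Returns false if another operation is scheduled and has not finished.
    bool trySchedule(const std::string & operation_id)
    {
        for (const auto & entry : stages)
            if (entry.second == ClusterStage::ScheduledToStart)
                return false;
        stages[operation_id] = ClusterStage::ScheduledToStart;
        return true;
    }

    // Called once when the operation ends, whether it succeeded or failed.
    void finish(const std::string & operation_id, bool succeeded)
    {
        auto it = stages.find(operation_id);
        assert(it != stages.end());
        it->second = succeeded ? ClusterStage::Completed : ClusterStage::Error;
    }

private:
    std::map<std::string, ClusterStage> stages;
};
```

Because `finish` overwrites the stage rather than relying on node removal, a failed operation stops blocking later ones even if its entry is never cleaned up. A real implementation would also need the check and the write in `trySchedule` to be atomic across hosts, which this single-process sketch does not model.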


robot-ch-test-poll2 · 2023-05-05T15:55:14Z

This is an automated comment for commit 1e52926 with description of existing statuses. It's updated for the latest CI running
The full report is available here
The overall status of the commit is 🟡 pending

Check name	Description	Status
CI running	A meta-check that indicates the running CI. Normally, it's in success or pending state. The failed status indicates some problems with the PR	🟡 pending
Mergeable Check	Checks if all other necessary checks are successful	🟢 success


vitlibar · 2023-05-16T11:19:32Z

@SmitaRKulkarni Did you notice that one of the CI builds failed?

SmitaRKulkarni · 2023-05-16T14:25:52Z

> @SmitaRKulkarni Did you notice that one of the CI builds failed?

Fixed the build issue.

tavplubix · 2023-05-17T10:57:46Z

Unit tests (asan) - IOResourceDynamicResourceManager was broken in master


Updated to add error or completed status in zookeeper for a cluster for backup/restore, to avoid interpreting previously failed backup/restore when zookeeper is unable to remove nodes

49c95a5

SmitaRKulkarni requested a review from vitlibar April 12, 2023 18:28

robot-ch-test-poll2 added the pr-bugfix Pull request with bugfix, not backported by default label Apr 12, 2023

SmitaRKulkarni and others added 4 commits April 13, 2023 09:46

Merge branch 'master' into Follow_up_Backup_Restore_concurrency_check_node_2

6568c33

Fixed comment

d4b2297

Removed line from test_disallow_concurrrency for CI checks

74c6ca5

Removed parameter from setStage function and added function setStageForCluster

93572ab

vitlibar self-assigned this Apr 24, 2023

vitlibar reviewed Apr 28, 2023


SmitaRKulkarni requested a review from vitlibar May 1, 2023 07:02

vitlibar reviewed May 5, 2023

src/Backups/IBackupCoordination.h (outdated, resolved)

vitlibar reviewed May 5, 2023

src/Backups/BackupCoordinationStage.h (resolved)

Merge branch 'master' into Follow_up_Backup_Restore_concurrency_check_node_2

b0c408f

SmitaRKulkarni added 2 commits May 8, 2023 13:53

Merge branch 'master' into Follow_up_Backup_Restore_concurrency_check_node_2

f20901d

Removed setStageForCluster and added option all_hosts to set stage for cluster

49ecba6

SmitaRKulkarni requested a review from vitlibar May 8, 2023 16:59

vitlibar approved these changes May 16, 2023


Fixed clang build

9a2645a

Merge branch 'master' into Follow_up_Backup_Restore_concurrency_check_node_2

1e52926

tavplubix mentioned this pull request May 17, 2023

Improve concurrent parts removal with zero copy replication #49630

Merged

tavplubix merged commit c4d074a into master May 17, 2023

tavplubix deleted the Follow_up_Backup_Restore_concurrency_check_node_2 branch May 17, 2023 11:03

tavplubix merged 10 commits into master from Follow_up_Backup_Restore_concurrency_check_node_2
