Back/Restore concurrency check on previous fails#48726
Conversation
…or backup/restore, to avoid interpreting previously failed backup/restore when zookeeper is unable to remove nodes
src/Backups/IBackupCoordination.h
Outdated
|
|
||
| /// Sets the current stage and waits for other hosts to come to this stage too. | ||
| virtual void setStage(const String & new_stage, const String & message) = 0; | ||
| virtual void setStageForCluster(const String & new_stage) = 0; |
There was a problem hiding this comment.
What is the "stage for cluster"? Each host has its own stage.
There was a problem hiding this comment.
When we check for concurrency and allow a backup/restore to proceed, we need to inform zookeeper that this backup/restore is going to be processed even before each host gets the query. As mentioned in above comment, here having a state is needed.
|
|
||
| void RestoreCoordinationRemote::setError(const Exception & exception) | ||
| { | ||
| stage_sync->setStageForCluster(Stage::ERROR); |
There was a problem hiding this comment.
stage_sync->setError() already adds a ZooKeeper node <backup_path>/stage/error with information about the error. Why isn't it enough?
There was a problem hiding this comment.
In some of our CI fails, I see that zookeeper is not able to delete all the nodes. And at times, it deletes the error node, but the stage (which was added for concurrency check) remains, causing misinterpretation.
|
Can we just use |
|
I dont think we can use just alive nodes. It is a good indication to check if there are ongoing backup/restore, but if we decide to proceed with a backup/restore, we need to inform zookeeper (other hosts), and this happens before each host gets the query & we cannot create alive nodes yet. So I had added 'SCHEDULED_TO_START' phase, if this is present then no other backp/restore will start. The changes in this PR update this stage to COMPLETED or ERROR after success or failure of backup/restore. |
|
This is an automated comment for commit 1e52926 with description of existing statuses. It's updated for the latest CI running
|
|
@SmitaRKulkarni Did you notice that one of the CI builds failed? |
Fixed the build issue |
|
Unit tests (asan) - |
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Updated to add error or completed status in zookeeper for a cluster for backup/restore, to avoid interpreting previously failed backup/restore when zookeeper is unable to remove nodes resolves #45486