Skip to content

Back/Restore concurrency check on previous fails#48726

Merged
tavplubix merged 10 commits intomasterfrom
Follow_up_Backup_Restore_concurrency_check_node_2
May 17, 2023
Merged

Back/Restore concurrency check on previous fails#48726
tavplubix merged 10 commits intomasterfrom
Follow_up_Backup_Restore_concurrency_check_node_2

Conversation

@SmitaRKulkarni
Copy link
Copy Markdown
Member

@SmitaRKulkarni SmitaRKulkarni commented Apr 12, 2023

Changelog category (leave one):

  • Bug Fix

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Updated to add error or completed status in zookeeper for a cluster for backup/restore, to avoid interpreting previously failed backup/restore when zookeeper is unable to remove nodes resolves #45486

…or backup/restore, to avoid interpreting previously failed backup/restore when zookeeper is unable to remove nodes
@SmitaRKulkarni SmitaRKulkarni requested a review from vitlibar April 12, 2023 18:28
@robot-ch-test-poll2 robot-ch-test-poll2 added the pr-bugfix Pull request with bugfix, not backported by default label Apr 12, 2023
@vitlibar vitlibar self-assigned this Apr 24, 2023

/// Sets the current stage and waits for other hosts to come to this stage too.
virtual void setStage(const String & new_stage, const String & message) = 0;
virtual void setStageForCluster(const String & new_stage) = 0;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the "stage for cluster"? Each host has its own stage.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we check for concurrency and allow a backup/restore to proceed, we need to inform zookeeper that this backup/restore is going to be processed even before each host gets the query. As mentioned in above comment, here having a state is needed.


void RestoreCoordinationRemote::setError(const Exception & exception)
{
stage_sync->setStageForCluster(Stage::ERROR);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stage_sync->setError() already adds a ZooKeeper node <backup_path>/stage/error with information about the error. Why isn't it enough?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In some of our CI fails, I see that zookeeper is not able to delete all the nodes. And at times, it deletes the error node, but the stage (which was added for concurrency check) remains, causing misinterpretation.

@vitlibar
Copy link
Copy Markdown
Member

Can we just use alive nodes? I mean while a backup or restore process is working there are always alive nodes in ZooKeeper. So all you need is to check if there is other backup/restore process with alive nodes.

@SmitaRKulkarni
Copy link
Copy Markdown
Member Author

I dont think we can use just alive nodes. It is a good indication to check if there are ongoing backup/restore, but if we decide to proceed with a backup/restore, we need to inform zookeeper (other hosts), and this happens before each host gets the query & we cannot create alive nodes yet. So I had added 'SCHEDULED_TO_START' phase, if this is present then no other backp/restore will start. The changes in this PR update this stage to COMPLETED or ERROR after success or failure of backup/restore.

@SmitaRKulkarni SmitaRKulkarni requested a review from vitlibar May 1, 2023 07:02
@robot-ch-test-poll2
Copy link
Copy Markdown
Contributor

robot-ch-test-poll2 commented May 5, 2023

This is an automated comment for commit 1e52926 with description of existing statuses. It's updated for the latest CI running
The full report is available here
The overall status of the commit is 🟡 pending

Check nameDescriptionStatus
CI runningA meta-check that indicates the running CI. Normally, it's in success or pending state. The failed status indicates some problems with the PR🟡 pending
Mergeable CheckChecks if all other necessary checks are successful🟢 success

@SmitaRKulkarni SmitaRKulkarni requested a review from vitlibar May 8, 2023 16:59
@vitlibar
Copy link
Copy Markdown
Member

vitlibar commented May 16, 2023

@SmitaRKulkarni Did you notice that one of the CI builds failed?

@SmitaRKulkarni
Copy link
Copy Markdown
Member Author

SmitaRKulkarni commented May 16, 2023

@SmitaRKulkarni Did you notice that one of the CI builds failed?

Fixed the build issue

@tavplubix
Copy link
Copy Markdown
Member

Unit tests (asan) - IOResourceDynamicResourceManager was broken in master

@tavplubix tavplubix merged commit c4d074a into master May 17, 2023
@tavplubix tavplubix deleted the Follow_up_Backup_Restore_concurrency_check_node_2 branch May 17, 2023 11:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-bugfix Pull request with bugfix, not backported by default

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Flaky test_backup_restore_on_cluster/test_disallow_concurrency.py

4 participants