Make asynchronous replica re-initialization reliable #8324
Merged
dyemanov merged 1 commit into v5.0-release on Nov 25, 2024
It appears something went wrong with the diff, sorry. Will fix ASAP.
The wrong branch was initially selected. The patch is against v5 but can be (should be, I'd say) back- and front-ported.
dyemanov added a commit that referenced this pull request Dec 10, 2024
dyemanov added a commit that referenced this pull request Dec 10, 2024
::: QA NOTE :::
Currently, when a physical backup is performed, the journal segment is switched from N to N+1 at the start of the backup, so that the backup file is guaranteed to contain only data up to and including sequence N. However, a long-running writable transaction may already have some of its changes stored in segments <= N while its commit event is stored in some later segment. After re-initialization on the replica side, we continue with segment N+1 and (a) the older changes are lost and (b) the error "Transaction X is not found" usually occurs. This means that the replica is inconsistent and must be re-initialized again. If the primary is under high load, this may happen over and over.
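To make the failure mode concrete, here is a toy model of it, not the actual replication code. The `Event` record, the `naive_replay` function and the `TransactionNotFound` exception are hypothetical names invented for this illustration. The journal is assumed to be an ordered list of events tagged with their segment number, and the replica is assumed to replay only segments from N+1 onwards.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    # Hypothetical journal record, for illustration only.
    segment: int  # journal segment sequence number
    txn: int      # transaction id
    kind: str     # "start", "change" or "commit"


class TransactionNotFound(Exception):
    """Stands in for the 'Transaction X is not found' error."""


def naive_replay(journal, first_segment):
    """Replay only segments >= first_segment and return committed txn ids."""
    known = set()
    committed = []
    for ev in journal:
        if ev.segment < first_segment:
            continue  # segments <= N were dropped after re-initialization
        if ev.kind == "start":
            known.add(ev.txn)
        elif ev.txn not in known:
            # The transaction started in a segment that was skipped.
            raise TransactionNotFound(f"Transaction {ev.txn} is not found")
        elif ev.kind == "commit":
            known.discard(ev.txn)
            committed.append(ev.txn)
    return committed


# Transaction 11 starts in segment 2 but commits in segment 4.
journal = [
    Event(2, 11, "start"),
    Event(2, 11, "change"),
    Event(3, 12, "start"),
    Event(3, 12, "commit"),
    Event(4, 11, "commit"),
]
```

With a backup taken at N = 3, calling `naive_replay(journal, 4)` hits the commit of transaction 11 without ever having seen its start, which is the situation described above.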
The solution is not to delete segments <= N immediately, but instead to scan them to find the transactions still active at the end of N, calculate the new replication OAT, delete everything < OAT, replay the journal (active transactions only) starting with OAT, and then proceed normally with N+1 and beyond.
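The steps of this approach can be sketched roughly as follows. This is a simplified, self-contained model, not the code from the patch. `Event` and `plan_reinit` are hypothetical names, the journal is assumed to be an ordered list of events, and the replication OAT is assumed here to be the number of the segment in which the oldest still-active transaction started (or N+1 when nothing is active).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    # Hypothetical journal record, for illustration only.
    segment: int  # journal segment sequence number
    txn: int      # transaction id
    kind: str     # "start", "change", "commit" or "rollback"


def plan_reinit(journal, n):
    """Return (oat, deletable_segments, events_to_replay) for a backup at segment n."""
    # 1. Scan segments <= n to find transactions still active at the end of n.
    active = {}  # txn id -> segment in which it started
    for ev in journal:
        if ev.segment > n:
            continue
        if ev.kind == "start":
            active[ev.txn] = ev.segment
        elif ev.kind in ("commit", "rollback"):
            active.pop(ev.txn, None)

    # 2. The new OAT is the oldest segment still needed by an active transaction.
    oat = min(active.values(), default=n + 1)

    # 3. Everything before the OAT can be deleted.
    deletable = sorted({ev.segment for ev in journal if ev.segment < oat})

    # 4. Replay only active transactions from OAT..n, then everything from n+1 on.
    replay = [
        ev for ev in journal
        if ev.segment > n or (ev.segment >= oat and ev.txn in active)
    ]
    return oat, deletable, replay
```

In this model, a transaction that starts in segment 2 and commits in segment 4 keeps segment 2 alive when the backup is taken at N = 3, so its start and change events are replayed before its commit arrives in segment 4.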