-
Notifications
You must be signed in to change notification settings - Fork 2.1k
[mysql] Fix issue #1944: Fix GTIDs on startup to correctly recover from checkpoint #2220
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
PatrickRen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wallkop Thanks for the patch! LGTM.
Could you rebase the latest master to resolve the conflict, and squash all commits into one? Also it looks like the author and email of your commits don't match your profile on GitHub, and you may want to modify them before pushing.
|
hi @PatrickRen It looks like the master branch of debezium was upgraded to I will solve the all problem you describe later. |
0445b8d to
b4ce0dc
Compare
…p mode Author: wallkop <[email protected]> Date: Sat Jun 17 22:36:54 2023 +0800
PatrickRen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the patch! LGTM.
…-2062 * origin/feat/issue-2062: [docs] Update connector link to 2.5-SNAPSHOT in docs [build] Bump version to 2.5-SNAPSHOT [hotfix][mysql] remove unused code (apache#2231) [hotfix] Add vitess connector to the release profile (apache#2233) [docs][hotfix] Update debezium reference links to 1.9 version [build] Update the copyright year to 2023 (apache#2205) [postgres] Fix postgres e2e test [postgres] scan.incremental.snapshot.enabled is closed by default [postgres] Backfill task will be able to end when there is not new change data but read the ending lsn [postgres] Create slot for backfill task before snapshot reading [postgres] Prepare a slot for the unique global stream split [mysql] Fix GTID issues to recover from checkpoint normally in specifying startup mode (apache#2220)
|
@wallkop
|
…ying startup mode (apache#2220)
|
@wallkop @PatrickRen I believe the fix logic for the "old uuid" when gtid.new.channel.position=EARLIEST configuration should equally apply to LATEST. Let’s examine this example. 1.Obtain the available GTIDs, i.e., show master status. 2.Obtain the checkpoint GTIDs. 3.Obtain the purged GTIDs, i.e., @@global.gtid_purged. When gtid.new.channel.position=EARLIEST, mergedGtidSet whould be When gtid.new.channel.position=LATEST, mergedGtidSet whould be The gtid.new.channel.position configuration should only affect new channels. In the example above, only 7aec1281-719c-11eb-afcf-00163e06a35c is the new channel. Therefore, the results for UUID 106a4bb6-ec0d-11ec-a2d4-00163e279211 should remain unchanged. I believe when LATEST is configured, the mergedGtidSet should be Additionally, with the current logic, if purged GTIDs is empty, configuring LATEST would return This would disrupt the expected GTID consumption order - causing transactions 203495054-204182173 to be processed before 1-203495053, which could lead to errors. |
Resolve #1944
Issue #1944 reports a bug where recovery from checkpoint fails when starting from offset-binlogfile&pos. After conducting further tests, I found that this bug can occur in Earliest, Timestamp, BinlogFile&Pos startup modes, as well as when incomplete Gtids are incorrectly set. The root cause of this bug is that when MySQL has GTID enabled and the aforementioned startup modes are used, the complete GTIDs cannot be recorded in the offsetContext. Instead, an empty GtidSet is initialized and subsequently appended by received Gtid Events. This leads to recovery failure when restoring from checkpoint due to incomplete GTIDs being recorded.
Test premise: MySQL needs to have GTID enabled, and due to settings such as master-slave switching and multiple masters, there are multiple sourceIDs for GTIDs
PoC From @PatrickRen: PatrickRen@2189da3
The solution comes from the POC of @PatrickRen. I have carefully reviewed this POC and conducted many tests. In the end, I believe this solution is viable.
The POC implementation is very elegant, and I only supplemented some comments and added a few test cases. The modified code has been validated using the company's production environment, and it has proven to perform very well under the classic master-slave architecture, correctly recovering from various start-mode checkpoints.
However, there are still two potential risks:
When MySQL is under a dual-master architecture, GTIDs may have gaps, similar to
A:1-102, 105-150. Such gaps are temporary but will eventually be consistent. But if the CDC happens to recover from the checkpoint when there are gaps in MySQL's GTID, it may access non-existent transactions, leading to recovery failure. This is because our GtidUtils' fixRestoredGtidSet method does not fix the gaps in the server GTID. I initially wanted to optimize this issue, but after weighing it, I believe that in most cases, the occurrence of GTID gaps is a problem with MySQL itself, which should be fixed by the DBA in MySQL, not to be compatible with CDC, to avoid triggering other unpredictable issues.The current implementation will correct all GTIDs, including those set by the user during the initial startup of CDC via StartupOptions. This may cause a problem: the user manually sets the GTID-offset to
A:300-500, the original intention might be to expect CDC to consume dataA:1-299:501~xxx, but CDC corrects it toA:1-500, only starting consumption from500. Solving this problem is not difficult, the key is whether it is necessary to solve it. I finally convinced myself to form a unified specification and inform the users: CDC only cares about the maximum GTID position and starts from it. This standard will make our program more user-friendly, but to some extent, it will violate MySQL's replication protocol.It's worth mentioning that I'm just raising the above two risks for discussion, and I don't suggest immediately solving and fixing them, as the current implementation is already very well done.