[mysql] Fix issue #1944: Fix GTIDs on startup to correctly recover from checkpoint #2220

wallkop · 2023-06-17T15:00:16Z

Resolve #1944

Problem analysis

Issue #1944 reports a bug where recovery from a checkpoint fails when starting from offset-binlogfile&pos. After further testing, I found that the bug can occur in the Earliest, Timestamp, and BinlogFile&Pos startup modes, as well as when an incomplete GTID set is configured by mistake. The root cause is that when MySQL has GTID enabled and one of these startup modes is used, the complete GTID set is never recorded in the offsetContext. Instead, an empty GtidSet is initialized and then extended only by the GTID events received afterwards. The checkpoint therefore records an incomplete GTID set, and restoring from it fails.
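To make the root cause concrete, here is a minimal sketch in Python (not the connector's Java code). The helper names `add_gtid`, `format_gtid_set`, and `replay`, the source IDs `A` and `B`, and the transaction numbers are all hypothetical. It only models how a recorded set that starts empty differs from one that starts from the server's executed set.

```python
def add_gtid(gtid_set, uuid, txn):
    """Append one received GTID event to a {uuid: [(start, end), ...]} map."""
    intervals = gtid_set.setdefault(uuid, [])
    if intervals and intervals[-1][1] + 1 == txn:
        # Extend the last interval when the transaction is contiguous.
        intervals[-1] = (intervals[-1][0], txn)
    else:
        intervals.append((txn, txn))


def format_gtid_set(gtid_set):
    """Render the map in MySQL's 'uuid:1-5:7,uuid2:1-3' notation."""
    def fmt(start, end):
        return str(start) if start == end else f"{start}-{end}"
    return ",".join(
        uuid + ":" + ":".join(fmt(s, e) for s, e in intervals)
        for uuid, intervals in sorted(gtid_set.items())
    )


def replay(initial, events):
    """Start from `initial`, apply the received events, return the recorded set."""
    recorded = {uuid: list(intervals) for uuid, intervals in initial.items()}
    for uuid, txn in events:
        add_gtid(recorded, uuid, txn)
    return format_gtid_set(recorded)


# Hypothetical server state: A:1-100 and B:1-50 were executed before the
# connector attached; it then receives transactions A:101 and A:102.
events = [("A", 101), ("A", 102)]
from_empty = replay({}, events)
from_server = replay({"A": [(1, 100)], "B": [(1, 50)]}, events)
print(from_empty)   # A:101-102
print(from_server)  # A:1-102,B:1-50
```

The set recorded from an empty start omits A:1-100 and all of B, so a restore that presents it to the server looks as if those transactions were never consumed.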

Test premise: MySQL must have GTID enabled, and because of events such as master-slave switchover or a multi-master setup, the GTID set contains multiple source IDs.

| startupMode | How the GTID is initialized | Restore from a checkpoint |
| --- | --- | --- |
| initial | Obtain the GTID value through SHOW MASTER STATUS before starting the snapshot | √ |
| latest | Obtain the GTID value through SHOW MASTER STATUS before starting to consume | √ |
| offset-gtid | Use the user-set GTID before starting to consume. Note: the user needs to set the complete GTID, otherwise it will not pass the startup verification | √ |
| offset-binlogfile&pos | Set the consumption anchor of the binlog through binlog_file='xxx', offset=yyy before starting to consume, but the GTID is null after initialization | X |
| earliest | Set the consumption anchor of the binlog through binlog_file='', offset=0 before starting to consume, but the GTID is null after initialization | X |
| timestamp | Set the consumption anchor of the binlog through binlog_file='', offset=0 before starting to consume, but the GTID is null after initialization. After consumption starts, all binlog events with timestamps earlier than the specified timestamp are skipped until an event with a timestamp greater than or equal to it is found, and the GTID is recorded at that point | X |

Solution approach

PoC From @PatrickRen: PatrickRen@2189da3

The solution comes from the POC of @PatrickRen. I have carefully reviewed this POC and conducted many tests. In the end, I believe this solution is viable.

The POC implementation is very elegant, and I only supplemented some comments and added a few test cases. The modified code has been validated using the company's production environment, and it has proven to perform very well under the classic master-slave architecture, correctly recovering from various start-mode checkpoints.

However, there are still two potential risks:

1. When MySQL runs in a dual-master architecture, GTIDs may have gaps, for example A:1-102:105-150. Such gaps are temporary and are eventually filled. But if CDC happens to recover from a checkpoint while MySQL's GTID set still has gaps, it may try to access transactions that do not exist, and recovery fails. This is because the fixRestoredGtidSet method in GtidUtils does not fix gaps in the server GTID set. I initially wanted to handle this case, but after weighing it I believe that in most cases a GTID gap is a problem in MySQL itself, which the DBA should fix in MySQL rather than CDC working around it, to avoid triggering other unpredictable issues.
2. The current implementation corrects all GTIDs, including those the user sets through StartupOptions when CDC first starts. This can cause a problem: if the user manually sets the GTID offset to A:300-500, the intent may be for CDC to consume A:1-299:501~xxx, but CDC corrects the set to A:1-500 and starts consuming only after transaction 500. Solving this is not difficult; the question is whether it is necessary. In the end I convinced myself that we should adopt a unified rule and tell users: CDC only cares about the maximum GTID position and starts from there. This rule makes the program more user-friendly, but to some extent it departs from MySQL's replication protocol.
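Both risks can be illustrated with a small Python sketch over the interval part of a single source's GTID set. The helpers `parse_intervals`, `find_gaps`, and `collapse_to_max` are hypothetical names invented for this illustration; they are not methods of GtidUtils, and `collapse_to_max` is only a model of the "CDC only cares about the maximum GTID position" rule described above.

```python
def parse_intervals(spec):
    """Parse the interval part of a GTID set, e.g. '1-102:105-150'."""
    intervals = []
    for part in spec.split(":"):
        start, _, end = part.partition("-")
        # A single transaction such as '7' has no '-', so end defaults to start.
        intervals.append((int(start), int(end or start)))
    return sorted(intervals)


def find_gaps(intervals):
    """Return the transaction ranges missing between consecutive intervals."""
    gaps = []
    for (_, prev_end), (next_start, _) in zip(intervals, intervals[1:]):
        if next_start > prev_end + 1:
            gaps.append((prev_end + 1, next_start - 1))
    return gaps


def collapse_to_max(intervals):
    """Model of the unified rule: keep only 1..(highest transaction number)."""
    return [(1, max(end for _, end in intervals))]


# Risk 1: a dual-master gap; transactions 103-104 are not there yet.
print(find_gaps(parse_intervals("1-102:105-150")))   # [(103, 104)]

# Risk 2: a user-supplied A:300-500 becomes A:1-500.
print(collapse_to_max(parse_intervals("300-500")))   # [(1, 500)]
```

A check like `find_gaps` could be used to warn before restoring, but as noted above, whether CDC should compensate for gaps at all is a design question rather than a settled decision.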

It's worth mentioning that I'm just raising the above two risks for discussion, and I don't suggest immediately solving and fixing them, as the current implementation is already very well done.

PatrickRen

@wallkop Thanks for the patch! LGTM.

Could you rebase onto the latest master to resolve the conflict, and squash all commits into one? Also, it looks like the author name and email of your commits don't match your GitHub profile, so you may want to fix them before pushing.

wallkop · 2023-06-19T08:31:27Z

Hi @PatrickRen, it looks like Debezium on the master branch was upgraded to 1.9.7 a few hours ago. I may need to do some retesting after resolving the code conflicts to make sure this change works with Debezium 1.9.7.

I will address all the problems you described later.

PatrickRen

Thanks for the patch! LGTM.

SML0127 · 2023-08-05T07:55:34Z

@wallkop
Thanks for fixing this issue! (I had a similar issue.)

I have one question about your description.
What is a complete GTID?
As I understand it, it means one of the following:
~~1. GTIDs like instance:111-2222 (not start from 1)~~
~~2. { kind: "SPECIFIC", binlogfile: "binlog.000123", pos: "3834747", gtids: "instance:123-45667"}, { kind: "NON_STOPPING", binlogfile: "", pos: "-982387918273", gtids: ""} (in checkpoint)~~

xyw0537 · 2025-06-19T23:41:57Z

@wallkop @PatrickRen I believe the fix logic applied to the "old UUID" when gtid.new.channel.position=EARLIEST is configured should apply equally to LATEST. Consider this example.

1. Obtain the available GTIDs, i.e., SHOW MASTER STATUS.

106a4bb6-ec0d-11ec-a2d4-00163e279211:1-204479617,
7aec1281-719c-11eb-afcf-00163e06a35c:1-147359662

2. Obtain the checkpoint GTIDs.

106a4bb6-ec0d-11ec-a2d4-00163e279211:203495054-204182173

3. Obtain the purged GTIDs, i.e., @@global.gtid_purged.

106a4bb6-ec0d-11ec-a2d4-00163e279211:1-203495053,
7aec1281-719c-11eb-afcf-00163e06a35c:1-147359662

When gtid.new.channel.position=EARLIEST, mergedGtidSet would be

106a4bb6-ec0d-11ec-a2d4-00163e279211:1-204182173
7aec1281-719c-11eb-afcf-00163e06a35c:1-147359662

When gtid.new.channel.position=LATEST, mergedGtidSet would be

106a4bb6-ec0d-11ec-a2d4-00163e279211:203495054-204182173
7aec1281-719c-11eb-afcf-00163e06a35c:1-147359662

The gtid.new.channel.position configuration should only affect new channels. In the example above, only 7aec1281-719c-11eb-afcf-00163e06a35c is the new channel. Therefore, the results for UUID 106a4bb6-ec0d-11ec-a2d4-00163e279211 should remain unchanged. I believe when LATEST is configured, the mergedGtidSet should be

106a4bb6-ec0d-11ec-a2d4-00163e279211:1-204182173
7aec1281-719c-11eb-afcf-00163e06a35c:1-147359662

Additionally, with the current logic, if the purged GTID set is empty, configuring LATEST would return

106a4bb6-ec0d-11ec-a2d4-00163e279211:1-203495053
106a4bb6-ec0d-11ec-a2d4-00163e279211:204182174-204479617

This would disrupt the expected GTID consumption order - causing transactions 203495054-204182173 to be processed before 1-203495053, which could lead to errors.
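The proposal and the arithmetic in this example can be sketched in Python. This is one possible reading of the suggestion, not the connector's actual merge code. The names `merge_proposed` and `subtract` are hypothetical. Treating a new channel as "purged set" under EARLIEST and "available set" under LATEST is an assumption made for this illustration.

```python
OLD = "106a4bb6-ec0d-11ec-a2d4-00163e279211"
NEW = "7aec1281-719c-11eb-afcf-00163e06a35c"

available = {OLD: [(1, 204479617)], NEW: [(1, 147359662)]}
checkpoint = {OLD: [(203495054, 204182173)]}
purged = {OLD: [(1, 203495053)], NEW: [(1, 147359662)]}


def merge_proposed(available, checkpoint, purged, new_channel_position):
    """Hypothetical merge: the setting only influences UUIDs absent from the checkpoint."""
    merged = {}
    for uuid, server_intervals in available.items():
        if uuid in checkpoint:
            # Old channel: everything up to the checkpoint's highest
            # transaction is treated as consumed, in both modes.
            merged[uuid] = [(1, max(end for _, end in checkpoint[uuid]))]
        elif new_channel_position == "EARLIEST":
            # Assumption: a new channel starts right after what was purged.
            if purged.get(uuid):
                merged[uuid] = list(purged[uuid])
        else:
            # Assumption: LATEST skips everything the server already has.
            merged[uuid] = list(server_intervals)
    return merged


def subtract(intervals, removed):
    """Parts of `intervals` not covered by `removed` (both sorted, non-overlapping)."""
    result = []
    for start, end in intervals:
        cursor = start
        for r_start, r_end in removed:
            if r_end < cursor or r_start > end:
                continue
            if r_start > cursor:
                result.append((cursor, r_start - 1))
            cursor = max(cursor, r_end + 1)
        if cursor <= end:
            result.append((cursor, end))
    return result


earliest = merge_proposed(available, checkpoint, purged, "EARLIEST")
latest = merge_proposed(available, checkpoint, purged, "LATEST")
print(earliest == latest)  # True: the old channel is fixed identically in both modes

# The two ranges listed for the empty-purged case are what remains of the
# available range once the checkpoint range is removed.
print(subtract(available[OLD], checkpoint[OLD]))
# [(1, 203495053), (204182174, 204479617)]
```

The subtraction only checks that the quoted ranges are consistent with "available minus checkpoint"; the thread does not state how the connector actually derives them.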

This was referenced Jun 17, 2023

[Bug] Flink CDC 2.3.0 set startupOptions = specificOffset set specificOffsetFile and specificOffsetPos then can not start from checkpoint #1944

Closed

[mysql] Fix issue #1944: Initialize complete GTIDs to ensure subsequent recovery from checkpoint #2063

Closed

PatrickRen self-assigned this Jun 19, 2023

PatrickRen reviewed Jun 19, 2023

wallkop force-pushed the fix_1944 branch 6 times, most recently from 0445b8d to b4ce0dc Compare June 19, 2023 12:10

Fix issue#1944, recover from checkpoint normally in specifying startu…

8e46aaa

…p mode Author: wallkop <[email protected]> Date: Sat Jun 17 22:36:54 2023 +0800

wallkop force-pushed the fix_1944 branch from 2fcdbd1 to 8e46aaa Compare June 19, 2023 12:31

Merge branch 'master' into fix_1944

718d844

PatrickRen approved these changes Jun 20, 2023

PatrickRen merged commit debd6ef into apache:master Jun 20, 2023

wallkop mentioned this pull request Jun 21, 2023

[hotfix] remove unused code in PR#2220 #2231

Merged

This was referenced Jun 22, 2023

[Bug][MySql] Job restart failed from savepoint When set 'scan.startup.mode' = 'timestamp' #2230

Closed

[INLONG-8307][Sort] Fix job restart failed from savepoint When set 'scan.startup.mode' = 'timestamp|earliest-offset|specific-offset' apache/inlong#8308

Merged

ChaomingZhangCN pushed a commit to ChaomingZhangCN/flink-cdc that referenced this pull request Jan 13, 2025

[mysql] Fix GTID issues to recover from checkpoint normally in specif…

d0c128c

…ying startup mode (apache#2220)

jw-itq mentioned this pull request Jan 16, 2025

[Fix][mysql-cdc] Fix GTIDs on startup to correctly recover from checkpoint apache/seatunnel#8528

Merged

jw-itq added a commit to jw-itq/seatunnel that referenced this pull request Jan 19, 2025

refer to flink pr（apache/flink-cdc#3065 and apache/flink-cdc#2220 ）fi…

d4f04b2

…x method

lvyanquan mentioned this pull request Aug 7, 2025

[FLINK-37065]: MySQL cdc can lose/skip data during recovering from the checkpoint #3845

Merged
