A zk session timeout can leave an instance with multiple consumers, causing out-of-order messages and data loss by lwd-coding · Pull Request #5270 · alibaba/canal

lwd-coding · 2024-09-12T16:10:55Z

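Symptom:

Kafka is already configured with max.in.flight.requests.per.connection = 1, yet messages that canal deploy sends to Kafka still arrive out of order. When the reordering occurs, the following key log entries appear:
instance log:

zk log: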

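Analysis:

Each instance implements HA by watching its own ephemeral data node, runningData, in zk. Under extreme network jitter, or a pause that makes the process appear dead (frequent full GC, for example), the session between zk and the client cannot be renewed in time, so the session times out and the ephemeral node is deleted. Once the fault clears, the client receives the event reporting that the data node no longer exists, and that event triggers ServerRunningMonitor.initRunning.

If initRunning succeeds in claiming runningData, it calls CanalMQStarter.startDestination. startDestination stops first and then starts. Because stop only sets a flag, the old worker exits only on its next polling iteration. Starting a new thread immediately therefore leaves multiple threads performing get, commit, and rollback at the same time, which can reorder or lose data.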
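The overlap described above can be reproduced with a small model. This is a sketch under stated assumptions, not canal's code: `FlagOnlyStopDemo`, `Worker`, and the `release` latch (which stands in for a blocking get/send) are hypothetical names chosen for illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model of "stop only sets a flag, then start immediately".
class FlagOnlyStopDemo {
    static class Worker implements Runnable {
        final AtomicBoolean running = new AtomicBoolean(true);
        final CountDownLatch inIteration = new CountDownLatch(1);
        private final CountDownLatch release;
        private final AtomicInteger active;
        private final AtomicInteger maxActive;

        Worker(CountDownLatch release, AtomicInteger active, AtomicInteger maxActive) {
            this.release = release;
            this.active = active;
            this.maxActive = maxActive;
        }

        @Override
        public void run() {
            // Track how many workers are alive at the same time.
            maxActive.accumulateAndGet(active.incrementAndGet(), Math::max);
            try {
                while (running.get()) {
                    inIteration.countDown(); // signal: inside a polling iteration
                    release.await();         // stands in for a blocking get/send
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                active.decrementAndGet();
            }
        }

        // Flag only: returns at once, the loop ends on its next check.
        void stop() {
            running.set(false);
        }
    }

    // Returns the largest number of workers that were alive simultaneously.
    static int maxConcurrentWorkers() {
        CountDownLatch release = new CountDownLatch(1);
        AtomicInteger active = new AtomicInteger();
        AtomicInteger maxActive = new AtomicInteger();
        Worker old = new Worker(release, active, maxActive);
        Worker fresh = new Worker(release, active, maxActive);
        Thread t1 = new Thread(old);
        Thread t2 = new Thread(fresh);
        try {
            t1.start();
            old.inIteration.await();   // old worker is mid-iteration
            old.stop();                // stop: sets the flag, does not wait
            t2.start();                // start: new worker begins right away
            fresh.inIteration.await(); // both workers are now alive
            int overlap = maxActive.get();
            fresh.stop();
            release.countDown();       // let both loops observe their flags
            t1.join();
            t2.join();
            return overlap;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(maxConcurrentWorkers()); // prints 2
    }
}
```

In this model the old worker is still blocked inside its iteration when the new one starts, so two workers for the same destination are alive together.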

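Example cases of multiple producer threads on the same instance:

Out-of-order scenario:
The sends run in parallel but the acks still happen in order, so no error is raised; the rule of acking the smallest batchId first is still satisfied.

Thread 1: get batchId 257
Thread 2: get batchId 258
Thread 1: send batchId 257
Thread 2: send batchId 258
The sends ran concurrently, so the Kafka deque for that partition is already out of order at this point
Thread 1: ack 257
Thread 2: ack 258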

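Scenario where commit and rollback fail:
The "is not the firstly" error can block ack, so the event buffer backs up and consumption stalls; it can also block rollback, so data is lost.

Thread 1: get batchId 257
Thread 2: get batchId 258
Thread 1: send batchId 257
Thread 2: send batchId 258
Thread 2: rollback 258. Because the smallest batchId is 257, this fails with batchId:258 is not the firstly:257. Batch 258 is now lost, and no later commit or rollback can succeed
Thread 1: ack 257
Thread 2: get batchId 259
Thread 2: commit 259. Because the smallest batchId is 258, this fails with batchId:259 is not the firstly:258
Thread 1: get batchId 260
Thread 1: commit 260. Because the smallest batchId is 258, this fails with batchId:260 is not the firstly:258
...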
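The trace above can be replayed against a simplified model of the "smallest batchId first" rule. This is only a sketch: `BatchOrderModel` and its methods are hypothetical names, and the model assumes that commit and rollback both have to remove the smallest outstanding batchId, which is what the error messages in the trace suggest.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical model of the batch bookkeeping, not canal's actual classes.
class BatchOrderModel {
    private final TreeSet<Long> outstanding = new TreeSet<>();
    private long nextBatchId;

    BatchOrderModel(long firstBatchId) {
        this.nextBatchId = firstBatchId;
    }

    // get: hand out the next batchId and remember it as outstanding.
    synchronized long get() {
        long id = nextBatchId++;
        outstanding.add(id);
        return id;
    }

    // Shared by commit and rollback in this model: only the smallest
    // outstanding batchId may be removed.
    synchronized void remove(long batchId) {
        if (!outstanding.contains(batchId)) {
            return; // unknown batch: ignored in this sketch
        }
        long first = outstanding.first();
        if (first != batchId) {
            throw new IllegalStateException(
                "batchId:" + batchId + " is not the firstly:" + first);
        }
        outstanding.remove(batchId);
    }

    private static void tryRemove(BatchOrderModel m, long id, List<String> errors) {
        try {
            m.remove(id);
        } catch (IllegalStateException e) {
            errors.add(e.getMessage());
        }
    }

    // Replays the interleaving from the trace on a single thread.
    static List<String> replayTrace() {
        BatchOrderModel m = new BatchOrderModel(257);
        List<String> errors = new ArrayList<>();
        long b257 = m.get();         // thread 1: get 257
        long b258 = m.get();         // thread 2: get 258
        tryRemove(m, b258, errors);  // thread 2: rollback 258 fails
        tryRemove(m, b257, errors);  // thread 1: ack 257 succeeds
        long b259 = m.get();         // thread 2: get 259
        tryRemove(m, b259, errors);  // thread 2: commit 259 fails
        long b260 = m.get();         // thread 1: get 260
        tryRemove(m, b260, errors);  // thread 1: commit 260 fails
        return errors;
    }

    public static void main(String[] args) {
        replayTrace().forEach(System.out::println);
    }
}
```

Once 258 is stuck as the smallest outstanding batch, every later removal fails against it, which matches the buffer backlog described in the scenario.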

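Proposed fix:

After a successful claim, if the MQ producer is already running there is no need to stop it and start it again; just skip the restart. I don't see any particular purpose in restarting here.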
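The "skip if already running" idea might look roughly like the guard below. It is a sketch under assumptions: `SkipIfRunningStarter`, its `Worker` class, and the `started` counter are hypothetical and only mimic the shape of a start/stop pair keyed by destination; no real threads are launched.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical starter that refuses to restart a worker that is still running.
class SkipIfRunningStarter {
    static class Worker {
        volatile boolean running = true;
    }

    private final Map<String, Worker> workers = new HashMap<>();
    private int started; // counts how many workers were actually created

    // Returns true if a new worker was created, false if the call was skipped.
    synchronized boolean startDestination(String destination) {
        Worker existing = workers.get(destination);
        if (existing != null && existing.running) {
            return false; // already running: no stop, no second worker
        }
        workers.put(destination, new Worker());
        started++;
        return true;
    }

    synchronized void stopDestination(String destination) {
        Worker w = workers.remove(destination);
        if (w != null) {
            w.running = false;
        }
    }

    synchronized int startedCount() {
        return started;
    }

    public static void main(String[] args) {
        SkipIfRunningStarter s = new SkipIfRunningStarter();
        System.out.println(s.startDestination("example")); // true
        System.out.println(s.startDestination("example")); // false
    }
}
```

With this guard, a repeated claim of the same destination leaves the existing worker untouched, so there is never a moment where an old and a new worker coexist.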

agapple · 2024-09-13T09:02:23Z

tks

agapple · 2024-09-13T10:05:42Z

Tightened the concurrency control around stop/start: stop now waits via latch.wait(), so the old worker is guaranteed to have exited cleanly.
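A minimal sketch of the stop-and-wait idea, assuming the latch is a `java.util.concurrent.CountDownLatch` (whose blocking method is `await()`). `LatchStopDemo`, `Worker`, and the timeout are hypothetical choices for illustration and are not taken from the actual commit.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model: stop() blocks until the worker loop has really ended.
class LatchStopDemo {
    static class Worker implements Runnable {
        private final AtomicBoolean running = new AtomicBoolean(true);
        private final CountDownLatch exited = new CountDownLatch(1);
        private final AtomicInteger active;
        private final AtomicInteger maxActive;

        Worker(AtomicInteger active, AtomicInteger maxActive) {
            this.active = active;
            this.maxActive = maxActive;
        }

        @Override
        public void run() {
            maxActive.accumulateAndGet(active.incrementAndGet(), Math::max);
            try {
                while (running.get()) {
                    Thread.sleep(5); // stands in for one get/send/ack iteration
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                active.decrementAndGet();
                exited.countDown(); // signal that the loop has ended
            }
        }

        // Sets the flag, then waits for the loop to exit or the timeout to expire.
        boolean stop(long timeoutMillis) {
            running.set(false);
            try {
                return exited.await(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    // Returns the largest number of workers that were alive simultaneously.
    static int maxConcurrentWorkers() {
        AtomicInteger active = new AtomicInteger();
        AtomicInteger maxActive = new AtomicInteger();
        Worker old = new Worker(active, maxActive);
        new Thread(old).start();
        if (!old.stop(5000)) {
            throw new IllegalStateException("old worker did not exit in time");
        }
        Worker fresh = new Worker(active, maxActive); // started only after old exited
        new Thread(fresh).start();
        if (!fresh.stop(5000)) {
            throw new IllegalStateException("new worker did not exit in time");
        }
        return maxActive.get();
    }

    public static void main(String[] args) {
        System.out.println(maxConcurrentWorkers()); // prints 1
    }
}
```

Because the worker decrements the counter before counting down the latch, a replacement started after `stop` returns can never overlap with the old one. The timeout is a defensive choice in this sketch so a stuck worker cannot block the caller forever.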

A zk session timeout can leave an instance with multiple consumers, causing out-of-order messages

1e50295

agapple merged commit be0d945 into alibaba:master Sep 13, 2024

agapple added a commit that referenced this pull request Sep 13, 2024

fixed #5270 , add CanalMQStarter stop latch

80ca436

agapple merged 1 commit into alibaba:master from lwd-coding:master
