This is an expansion of the report made in #4308. I created a separate issue because I don't have edit rights to the original one, and I found that this behavior applies not only to the shutdown of the Ruby process but also to any close attempt on the consumer during a cooperative-sticky rebalance.
If you consider it part of #4308, please merge them 🙏
How to reproduce
1. Initialize a few consumers in the same process (multiple threads) or in many processes.
2. Let them rebalance; during the cooperative-sticky rebalance, try to close them.
3. Profit (crash).
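For reference, the only configuration detail that matters for triggering this is the assignor choice; a minimal consumer configuration (illustrative placeholder values, not my exact snippet) looks like:

```properties
# Illustrative consumer configuration; only the assignor choice matters here.
bootstrap.servers=localhost:9092
group.id=example_app
partition.assignment.strategy=cooperative-sticky
```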
Here is the simplified code I used to reproduce this. It reproduces in 100% of cases:
As a temporary mitigation, it is possible to call rd_kafka_unsubscribe and then wait out the rebalance. This works, but it is not reliable and drastically increases the time needed to shut down the consumer, as there is no direct way (aside from maybe using metrics via poll) to establish that the consumer is not mid-rebalance exactly at the moment of running rd_kafka_destroy. The wait is needed despite rd_kafka_assignment returning no assignments: after revocation but prior to re-assignment the TPL is empty, which falsely suggests that no TPLs are (or will be) assigned.
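The unsubscribe-and-wait workaround can be sketched as a small helper. This is a hypothetical pure-Ruby illustration: the `unsubscribe` and `assignment` methods stand in for the wrappers around rd_kafka_unsubscribe and rd_kafka_assignment, and the grace period is an arbitrary guess, not a guaranteed bound.

```ruby
# Hypothetical sketch of the unsubscribe-and-wait mitigation. `consumer` is
# assumed to respond to #unsubscribe and #assignment (topic => partitions),
# mirroring rd_kafka_unsubscribe / rd_kafka_assignment.
def drain_before_destroy(consumer, grace_seconds: 5, poll_interval: 0.1)
  consumer.unsubscribe

  # Wait until the reported assignment is empty...
  sleep(poll_interval) until consumer.assignment.empty?

  # ...then wait an extra grace period anyway: between revocation and
  # re-assignment the TPL is also empty, so an empty assignment alone does
  # not prove the rebalance has finished.
  sleep(grace_seconds)
end
```

Even with such a helper, shutdown time grows by the full grace period on every close, which is exactly the drawback described above.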
Locking the rebalance callback via a mutex seems to help; however, my integration suite under load still showed crashes (though far fewer).
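The mutex approach can be sketched as follows. This is a simplified illustration, not the exact code from my suite: the rebalance callback and the close path contend for the same lock, so close cannot interleave with an in-flight callback.

```ruby
# Simplified sketch: serialize the rebalance callback and the close path on
# one lock so close never runs mid-callback.
REBALANCE_MUTEX = Mutex.new

# Rebalance callback body as it would be registered with the consumer;
# the actual partition handling is elided.
ON_REBALANCE = lambda do |topic_partition_list|
  REBALANCE_MUTEX.synchronize do
    # react to the assignment / revocation here
  end
end

# Close path: taking the same lock means we wait for any in-flight callback
# before calling close (and, later, destroy).
def safe_close(consumer)
  REBALANCE_MUTEX.synchronize { consumer.close }
end
```

As noted above, this only narrows the window: the crash can still occur when the sticky assignor state already exists but the callback has not fired yet, so nobody is holding the lock.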
Waiting after rd_kafka_consumer_close also partially mitigates this, because rd_kafka_consumer_close unsubscribes automatically. This may help for long-lived consumers (ref: Segfaults during consumer shutdown, karafka/rdkafka-ruby#254), but it does not solve the problem for short-lived consumers (I don't know why) that are in the middle of getting their first assignment.
This issue does not exist with the standard assignment strategy; it occurs only with cooperative-sticky.
Using a custom rebalance callback with a mutex also partially mitigates the issue, especially when combined with process-global state indicating that a rebalance is in progress. However, there is a window where the rebalance callback has not yet been triggered but the rdkafka_sticky_assignor has already been created; if we close and destroy the consumer at that point, the crash happens as well. This is less likely because the window between the initialization of the rdkafka_sticky_assignor and its hand-off to the rebalance callback is short, but the issue persists.
It affects more complex consumer frameworks like karafka less, because they do a lot of work around the consumer before running close, so the probability of being mid-rebalance is lower (though it can still happen).
I did not find similar assertion checks in rdkafka_assignor.c or in rdkafka_roundrobin_assignor.c.
Not calling rd_kafka_destroy in a process that is going to exit anyway (long-running processes under shutdown) can also partially mitigate this (9/10 times).
A really long initial poll on the first rebalance of a consumer instance (more than 2 seconds) also seems to mitigate this, at least partially. This may be because such a poll outlasts the whole rebalance, but it is hard for me to give a 100% certain result here.
Suggested fix
The consumer should probably wait for the rebalance to finish before fully closing itself; however, this may introduce a closing lag during a long-running rebalance. The second option would be to drop out of the consumer group and just let the rebalance proceed, but I have no idea what the effect of that would be on the group.
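The first option could look roughly like the following sketch, which is hypothetical and entirely outside librdkafka's real internals: a close that waits on rebalance completion but gives up after a deadline, so a long-running rebalance delays shutdown only up to a bound.

```ruby
# Hypothetical model of "wait for the rebalance, but bounded". Something in
# the client would call #rebalance_started / #rebalance_finished around the
# rebalance; #close then waits, but never longer than `timeout` seconds.
class GracefulCloser
  def initialize
    @mutex = Mutex.new
    @cond = ConditionVariable.new
    @rebalancing = false
  end

  def rebalance_started
    @mutex.synchronize { @rebalancing = true }
  end

  def rebalance_finished
    @mutex.synchronize do
      @rebalancing = false
      @cond.broadcast
    end
  end

  # Wait for any in-flight rebalance, then close. A stuck rebalance can
  # delay shutdown by at most `timeout` seconds.
  def close(consumer, timeout: 5)
    deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
    @mutex.synchronize do
      while @rebalancing
        remaining = deadline - Process.clock_gettime(Process::CLOCK_MONOTONIC)
        break if remaining <= 0
        @cond.wait(@mutex, remaining)
      end
    end
    consumer.close
  end
end
```

Hitting the deadline here just means closing mid-rebalance again, which is where the second option (leaving the group and letting the rebalance proceed without us) would have to take over.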
Logs
Here is the tail of the debug=all log (if you need more, just ping me, I can generate it on the spot):
D, [2023-06-09T09:35:10.457239 #2162967] DEBUG -- : rdkafka: [thrd:GroupCoordinator]: Broadcasting state change
D, [2023-06-09T09:35:10.457246 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app": resetting group leader info: JoinGroup response clean-up
D, [2023-06-09T09:35:10.457254 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app" changed join state wait-join -> wait-metadata (state up)
D, [2023-06-09T09:35:10.457262 #2162967] DEBUG -- : rdkafka: [thrd:main]: GroupCoordinator/1001: Request metadata for 1 topic(s): partition assignor
D, [2023-06-09T09:35:10.457271 #2162967] DEBUG -- : rdkafka: [thrd:GroupCoordinator]: GroupCoordinator/1001: Sent MetadataRequest (v9, 41 bytes @ 0, CorrId 5)
D, [2023-06-09T09:35:10.457278 #2162967] DEBUG -- : rdkafka: [thrd:GroupCoordinator]: GroupCoordinator/1001: Received MetadataResponse (v9, 341 bytes, CorrId 5, rtt 0.51ms)
D, [2023-06-09T09:35:10.457286 #2162967] DEBUG -- : rdkafka: [thrd:main]: GroupCoordinator/1001: ===== Received metadata (for 1 requested topics): partition assignor =====
D, [2023-06-09T09:35:10.457294 #2162967] DEBUG -- : rdkafka: [thrd:main]: GroupCoordinator/1001: ClusterId: o9n0mScnRY-4mvOHIvPnZQ, ControllerId: 1001
D, [2023-06-09T09:35:10.457302 #2162967] DEBUG -- : rdkafka: [thrd:main]: GroupCoordinator/1001: 1 brokers, 1 topics
D, [2023-06-09T09:35:10.457309 #2162967] DEBUG -- : rdkafka: [thrd:main]: GroupCoordinator/1001: Topic 6bad105f430f with 10 partitions
D, [2023-06-09T09:35:10.457317 #2162967] DEBUG -- : rdkafka: [thrd:main]: GroupCoordinator/1001: Broker #0/1: localhost:9092 NodeId 1001
D, [2023-06-09T09:35:10.457325 #2162967] DEBUG -- : rdkafka: [thrd:main]: GroupCoordinator/1001: 1/1 requested topic(s) seen in metadata
D, [2023-06-09T09:35:10.457332 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app" running cooperative-sticky assignor for 1 member(s) and 1 eligible subscribed topic(s):
D, [2023-06-09T09:35:10.457340 #2162967] DEBUG -- : rdkafka: [thrd:main]: Member "rdkafka-e07f7749-a606-4b0b-9cb5-1904de921d69" (me) with 0 owned partition(s) and 1 subscribed topic(s):
D, [2023-06-09T09:35:10.457349 #2162967] DEBUG -- : rdkafka: [thrd:main]: 6bad105f430f [-1]
D, [2023-06-09T09:35:10.457356 #2162967] DEBUG -- : rdkafka: [thrd:main]: Sort 10 partitions in existing assignment
D, [2023-06-09T09:35:10.457364 #2162967] DEBUG -- : rdkafka: [thrd:main]: Prepared balanced reassignment for 1 consumers, 0 available partition(s) where of 0 are unassigned (initializing=true, revocationRequired=false, 1 fixed assignments)
D, [2023-06-09T09:35:10.457372 #2162967] DEBUG -- : rdkafka: [thrd:main]: Reassignment not performed after 1 iteration(s) of 0 reassignable partition(s)
D, [2023-06-09T09:35:10.457380 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app" cooperative-sticky assignment for 1 member(s) finished in 0.037ms:
D, [2023-06-09T09:35:10.457387 #2162967] DEBUG -- : rdkafka: [thrd:main]: Member "rdkafka-e07f7749-a606-4b0b-9cb5-1904de921d69" (me) assigned 10 partition(s):
D, [2023-06-09T09:35:10.457395 #2162967] DEBUG -- : rdkafka: [thrd:main]: 6bad105f430f [0]
D, [2023-06-09T09:35:10.457403 #2162967] DEBUG -- : rdkafka: [thrd:main]: 6bad105f430f [1]
D, [2023-06-09T09:35:10.457410 #2162967] DEBUG -- : rdkafka: [thrd:main]: 6bad105f430f [2]
D, [2023-06-09T09:35:10.457418 #2162967] DEBUG -- : rdkafka: [thrd:main]: 6bad105f430f [3]
D, [2023-06-09T09:35:10.457425 #2162967] DEBUG -- : rdkafka: [thrd:main]: 6bad105f430f [4]
D, [2023-06-09T09:35:10.457432 #2162967] DEBUG -- : rdkafka: [thrd:main]: 6bad105f430f [5]
D, [2023-06-09T09:35:10.457440 #2162967] DEBUG -- : rdkafka: [thrd:main]: 6bad105f430f [6]
D, [2023-06-09T09:35:10.457460 #2162967] DEBUG -- : rdkafka: [thrd:main]: 6bad105f430f [7]
D, [2023-06-09T09:35:10.457470 #2162967] DEBUG -- : rdkafka: [thrd:main]: 6bad105f430f [8]
D, [2023-06-09T09:35:10.457477 #2162967] DEBUG -- : rdkafka: [thrd:main]: 6bad105f430f [9]
D, [2023-06-09T09:35:10.457485 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app": "cooperative-sticky" assignor run for 1 member(s)
D, [2023-06-09T09:35:10.457493 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app": Partitions owned by members: 0, partitions assigned by assignor: 10
D, [2023-06-09T09:35:10.457501 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app": COOPERATIVE protocol collection sizes: maybe revoking: 0, ready to migrate: 10, unknown but owned: 0
D, [2023-06-09T09:35:10.457508 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app": 10 partitions assigned to consumers
D, [2023-06-09T09:35:10.457518 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app" changed join state wait-metadata -> wait-sync (state up)
D, [2023-06-09T09:35:10.457526 #2162967] DEBUG -- : rdkafka: [thrd:GroupCoordinator]: GroupCoordinator/1001: Sent SyncGroupRequest (v3, 221 bytes @ 0, CorrId 6)
D, [2023-06-09T09:35:10.457533 #2162967] DEBUG -- : rdkafka: [thrd:GroupCoordinator]: Broadcasting state change
D, [2023-06-09T09:35:10.457541 #2162967] DEBUG -- : rdkafka: [thrd:GroupCoordinator]: GroupCoordinator/1001: Received SyncGroupResponse (v3, 10 bytes, CorrId 6, rtt 0.21ms)
D, [2023-06-09T09:35:10.457549 #2162967] DEBUG -- : rdkafka: [thrd:main]: GroupCoordinator/1001: SyncGroupRequest failed: Broker: Group rebalance in progress: actions Permanent
D, [2023-06-09T09:35:10.457556 #2162967] DEBUG -- : rdkafka: [thrd:main]: SyncGroup response: Broker: Group rebalance in progress (0 bytes of MemberState data)
D, [2023-06-09T09:35:10.457564 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app": synchronization failed: Broker: Group rebalance in progress: rejoining
D, [2023-06-09T09:35:10.457572 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app": Rejoining group without an assignment: SyncGroup error: Broker: Group rebalance in progress
D, [2023-06-09T09:35:10.457580 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app" changed join state wait-sync -> init (state up)
I, [2023-06-09T09:35:10.458469 #2162967] INFO -- : [691ad1f2b56f] Sync producing of a message to '6bad105f430f' topic took 10.185397148132324 ms
D, [2023-06-09T09:35:10.458501 #2162967] DEBUG -- : [691ad1f2b56f] {:partition=>8, :topic=>"6bad105f430f", :payload=>"9"}
I, [2023-06-09T09:35:10.468784 #2162967] INFO -- : [691ad1f2b56f] Sync producing of a message to '6bad105f430f' topic took 10.214381694793701 ms
D, [2023-06-09T09:35:10.468830 #2162967] DEBUG -- : [691ad1f2b56f] {:partition=>9, :topic=>"6bad105f430f", :payload=>"10"}
D, [2023-06-09T09:35:10.941137 #2162967] DEBUG -- : rdkafka: [thrd:app]: Closing consumer
D, [2023-06-09T09:35:10.941271 #2162967] DEBUG -- : rdkafka: [thrd:app]: Waiting for close events
D, [2023-06-09T09:35:10.941342 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app" received op TERMINATE in state up (join-state init)
D, [2023-06-09T09:35:10.941424 #2162967] DEBUG -- : rdkafka: [thrd:main]: Terminating group "node.rb-e4ef35230e0d_app" in state up with 0 partition(s)
D, [2023-06-09T09:35:10.941485 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app": unsubscribe from current subscription of size 1 (leave group=true, has joined=true, rdkafka-e07f7749-a606-4b0b-9cb5-1904de921d69, join-state init)
D, [2023-06-09T09:35:10.941568 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app": clearing subscribed topics list (1)
D, [2023-06-09T09:35:10.941634 #2162967] DEBUG -- : rdkafka: [thrd:app]: Closing consumer
D, [2023-06-09T09:35:10.941700 #2162967] DEBUG -- : rdkafka: [thrd:app]: Waiting for close events
D, [2023-06-09T09:35:10.941753 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app": effective subscription list changed from 1 to 0 topic(s):
D, [2023-06-09T09:35:10.941841 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app": resetting group leader info: unsubscribe
D, [2023-06-09T09:35:10.941904 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app" initiating rebalance (COOPERATIVE) in state up (join-state init) with 0 assigned partition(s): unsubscribe
D, [2023-06-09T09:35:10.941950 #2162967] DEBUG -- : rdkafka: [thrd:main]: Group "node.rb-e4ef35230e0d_app" received op TERMINATE in state up (join-state wait-join)
ruby: rdkafka_sticky_assignor.c:1898: rd_kafka_sticky_assignor_state_destroy: Assertion `assignor_state' failed.
And the matching Kafka broker log from the same time window:
karafka_21_kafka | [2023-06-09 07:35:07,160] INFO [GroupCoordinator 1001]: Preparing to rebalance group node.rb-e4ef35230e0d_app in state PreparingRebalance with old generation 1 (__consumer_offsets-31) (reason: Adding new member rdkafka-e023de05-5c28-4577-af61-bc3ac4217652 with group instance id None) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:07,856] INFO [GroupCoordinator 1001]: Member[group.instance.id None, member.id rdkafka-390196ef-cf8f-4d1a-a029-b64793413735] in group node.rb-e4ef35230e0d_app has left, removing it from the group (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:07,856] INFO [GroupCoordinator 1001]: Stabilized group node.rb-e4ef35230e0d_app generation 2 (__consumer_offsets-31) with 1 members (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:07,856] INFO [GroupCoordinator 1001]: Member[group.instance.id None, member.id rdkafka-e023de05-5c28-4577-af61-bc3ac4217652] in group node.rb-e4ef35230e0d_app has left, removing it from the group (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:07,857] INFO [GroupCoordinator 1001]: Preparing to rebalance group node.rb-e4ef35230e0d_app in state PreparingRebalance with old generation 2 (__consumer_offsets-31) (reason: removing member rdkafka-e023de05-5c28-4577-af61-bc3ac4217652 on LeaveGroup) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:07,857] INFO [GroupCoordinator 1001]: Group node.rb-e4ef35230e0d_app with generation 3 is now empty (__consumer_offsets-31) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:08,375] INFO [GroupCoordinator 1001]: Preparing to rebalance group node.rb-e4ef35230e0d_app in state PreparingRebalance with old generation 3 (__consumer_offsets-31) (reason: Adding new member rdkafka-275f4343-0436-4314-899a-14bd98f1c20b with group instance id None) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:08,375] INFO [GroupCoordinator 1001]: Stabilized group node.rb-e4ef35230e0d_app generation 4 (__consumer_offsets-31) with 1 members (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:08,377] INFO [GroupCoordinator 1001]: Assignment received from leader for group node.rb-e4ef35230e0d_app for generation 4. The group has 1 members, 0 of which are static. (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:08,378] INFO [GroupCoordinator 1001]: Preparing to rebalance group node.rb-e4ef35230e0d_app in state PreparingRebalance with old generation 4 (__consumer_offsets-31) (reason: Adding new member rdkafka-83086da6-e6b6-4744-9330-635edd095984 with group instance id None) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:08,887] INFO [GroupCoordinator 1001]: Member[group.instance.id None, member.id rdkafka-275f4343-0436-4314-899a-14bd98f1c20b] in group node.rb-e4ef35230e0d_app has left, removing it from the group (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:08,888] INFO [GroupCoordinator 1001]: Stabilized group node.rb-e4ef35230e0d_app generation 5 (__consumer_offsets-31) with 1 members (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:08,888] INFO [GroupCoordinator 1001]: Member[group.instance.id None, member.id rdkafka-83086da6-e6b6-4744-9330-635edd095984] in group node.rb-e4ef35230e0d_app has left, removing it from the group (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:08,889] INFO [GroupCoordinator 1001]: Preparing to rebalance group node.rb-e4ef35230e0d_app in state PreparingRebalance with old generation 5 (__consumer_offsets-31) (reason: removing member rdkafka-83086da6-e6b6-4744-9330-635edd095984 on LeaveGroup) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:08,889] INFO [GroupCoordinator 1001]: Group node.rb-e4ef35230e0d_app with generation 6 is now empty (__consumer_offsets-31) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:09,399] INFO [GroupCoordinator 1001]: Preparing to rebalance group node.rb-e4ef35230e0d_app in state PreparingRebalance with old generation 6 (__consumer_offsets-31) (reason: Adding new member rdkafka-e237d590-7f41-49f2-9796-35cf02d0e6f1 with group instance id None) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:09,400] INFO [GroupCoordinator 1001]: Stabilized group node.rb-e4ef35230e0d_app generation 7 (__consumer_offsets-31) with 2 members (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:09,404] INFO [GroupCoordinator 1001]: Assignment received from leader for group node.rb-e4ef35230e0d_app for generation 7. The group has 2 members, 0 of which are static. (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:09,935] INFO [GroupCoordinator 1001]: Member[group.instance.id None, member.id rdkafka-f177c5aa-5f0e-41b0-8044-276cbd5554bf] in group node.rb-e4ef35230e0d_app has left, removing it from the group (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:09,936] INFO [GroupCoordinator 1001]: Preparing to rebalance group node.rb-e4ef35230e0d_app in state PreparingRebalance with old generation 7 (__consumer_offsets-31) (reason: removing member rdkafka-f177c5aa-5f0e-41b0-8044-276cbd5554bf on LeaveGroup) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:09,936] INFO [GroupCoordinator 1001]: Member[group.instance.id None, member.id rdkafka-e237d590-7f41-49f2-9796-35cf02d0e6f1] in group node.rb-e4ef35230e0d_app has left, removing it from the group (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:09,936] INFO [GroupCoordinator 1001]: Group node.rb-e4ef35230e0d_app with generation 8 is now empty (__consumer_offsets-31) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:10,445] INFO [GroupCoordinator 1001]: Preparing to rebalance group node.rb-e4ef35230e0d_app in state PreparingRebalance with old generation 8 (__consumer_offsets-31) (reason: Adding new member rdkafka-e07f7749-a606-4b0b-9cb5-1904de921d69 with group instance id None) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:10,446] INFO [GroupCoordinator 1001]: Stabilized group node.rb-e4ef35230e0d_app generation 9 (__consumer_offsets-31) with 1 members (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:10,446] INFO [GroupCoordinator 1001]: Preparing to rebalance group node.rb-e4ef35230e0d_app in state PreparingRebalance with old generation 9 (__consumer_offsets-31) (reason: Adding new member rdkafka-9d820a59-2cba-48f1-a7fe-0478d97d7e33 with group instance id None) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:12,910] INFO [GroupCoordinator 1001]: Member rdkafka-d7596725-b61d-4386-a4e3-6b35e6994d75 in group node.rb-c0211bcb57d6_app has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:12,910] INFO [GroupCoordinator 1001]: Stabilized group node.rb-c0211bcb57d6_app generation 4 (__consumer_offsets-25) with 1 members (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:24,014] INFO [GroupCoordinator 1001]: Member rdkafka-d518effd-ace9-4f38-a3b2-7736948e2109 in group node.rb-e293c87d05db_app has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:24,015] INFO [GroupCoordinator 1001]: Stabilized group node.rb-e293c87d05db_app generation 4 (__consumer_offsets-27) with 1 members (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:34,221] INFO [GroupCoordinator 1001]: Member rdkafka-bbfb59a4-11b8-4f61-a08e-fe34f54e25aa in group node.rb-1969f87a8fdf_app has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:34,222] INFO [GroupCoordinator 1001]: Stabilized group node.rb-1969f87a8fdf_app generation 2 (__consumer_offsets-48) with 1 members (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:34,222] INFO [GroupCoordinator 1001]: Member[group.instance.id None, member.id rdkafka-2ecd8630-a28a-438b-ae4c-d1e094cf92bc] in group node.rb-1969f87a8fdf_app has left, removing it from the group (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:34,222] INFO [GroupCoordinator 1001]: Preparing to rebalance group node.rb-1969f87a8fdf_app in state PreparingRebalance with old generation 2 (__consumer_offsets-48) (reason: removing member rdkafka-2ecd8630-a28a-438b-ae4c-d1e094cf92bc on LeaveGroup) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:34,222] INFO [GroupCoordinator 1001]: Group node.rb-1969f87a8fdf_app with generation 3 is now empty (__consumer_offsets-48) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:55,446] INFO [GroupCoordinator 1001]: Member rdkafka-e07f7749-a606-4b0b-9cb5-1904de921d69 in group node.rb-e4ef35230e0d_app has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:55,446] INFO [GroupCoordinator 1001]: Stabilized group node.rb-e4ef35230e0d_app generation 10 (__consumer_offsets-31) with 1 members (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:55,447] INFO [GroupCoordinator 1001]: Member[group.instance.id None, member.id rdkafka-9d820a59-2cba-48f1-a7fe-0478d97d7e33] in group node.rb-e4ef35230e0d_app has left, removing it from the group (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:55,448] INFO [GroupCoordinator 1001]: Preparing to rebalance group node.rb-e4ef35230e0d_app in state PreparingRebalance with old generation 10 (__consumer_offsets-31) (reason: removing member rdkafka-9d820a59-2cba-48f1-a7fe-0478d97d7e33 on LeaveGroup) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:55,448] INFO [GroupCoordinator 1001]: Group node.rb-e4ef35230e0d_app with generation 11 is now empty (__consumer_offsets-31) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:57,911] INFO [GroupCoordinator 1001]: Member rdkafka-775f3cc0-a2ac-4bd5-a23e-c47162d31486 in group node.rb-c0211bcb57d6_app has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:57,911] INFO [GroupCoordinator 1001]: Preparing to rebalance group node.rb-c0211bcb57d6_app in state PreparingRebalance with old generation 4 (__consumer_offsets-25) (reason: removing member rdkafka-775f3cc0-a2ac-4bd5-a23e-c47162d31486 on heartbeat expiration) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:35:57,911] INFO [GroupCoordinator 1001]: Group node.rb-c0211bcb57d6_app with generation 5 is now empty (__consumer_offsets-25) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:36:09,016] INFO [GroupCoordinator 1001]: Member rdkafka-83758cf9-a418-41cd-be06-adcdac81a094 in group node.rb-e293c87d05db_app has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:36:09,016] INFO [GroupCoordinator 1001]: Preparing to rebalance group node.rb-e293c87d05db_app in state PreparingRebalance with old generation 4 (__consumer_offsets-27) (reason: removing member rdkafka-83758cf9-a418-41cd-be06-adcdac81a094 on heartbeat expiration) (kafka.coordinator.group.GroupCoordinator)
karafka_21_kafka | [2023-06-09 07:36:09,016] INFO [GroupCoordinator 1001]: Group node.rb-e293c87d05db_app with generation 5 is now empty (__consumer_offsets-27) (kafka.coordinator.group.GroupCoordinator)
Additional details

The close code we use in Ruby follows the expected flow: we first close the consumer and then destroy it. Every single execution ends with the assignor_state assertion failure shown in the log tail above.

I confirmed this behavior exists in the following librdkafka versions: 2.1.1, 2.0.2, 1.9.2.

- librdkafka version (release number or git tag): 2.1.1, and all mentioned above
- Apache Kafka version: 2.8.1 and 3.4.0 (from Bitnami)
- librdkafka client configuration: presented in the above code snippet
- Operating system: Linux 5.4.0-146-generic #163-Ubuntu SMP Fri Mar 17 18:26:02 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
- Logs (with debug=.. as necessary) from librdkafka: see the Logs section above
- Broker log excerpts: see the Kafka log above
- Critical issue: I would consider it critical, because the race condition is not mentioned in the termination docs (https://github.com/confluentinc/librdkafka/blob/master/INTRODUCTION.md#termination) and, on the contrary, they state that "There is no need to unsubscribe".