Redpanda Consumer Mass Disconnect Reproduction#79
Lyndon-prequel merged 2 commits into prequel-dev:main
Conversation
Hi @Lyndon-prequel and @tonymeehan, I just wanted to clarify that the failure I submitted in my PR was not my original plan. Initially, I was trying to simulate a high-severity failure described in this GitHub issue: redpanda-data/redpanda#3643. If I had succeeded, it would have been the perfect candidate for a serious production-level failure. That failure involves a Redpanda broker running on an ARM64 AWS instance where, after several hours of normal operation, both the producer and the broker experience a sudden surge of errors and memory usage grows steadily (~1 GB every 5 hours). The producer uses rust-rdkafka to send around 200 messages/second to 14 topics with idempotency and SASL authentication enabled. Despite trying to reproduce this in multiple languages (Rust, Python), I was not able to trigger the issue, even after many attempts and long-running tests. Because of this, I had to switch to a simpler but still valid failure scenario, which delayed my PR submission. Thanks for understanding!
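For context, a producer set up roughly as described above (idempotent, SASL-authenticated) can be expressed as librdkafka configuration properties. This is a sketch: the broker address, SASL mechanism, and credentials below are placeholders, not values taken from the issue.

```properties
# Hypothetical producer config approximating the setup in redpanda-data/redpanda#3643;
# all values are illustrative placeholders.
bootstrap.servers=redpanda:9092
enable.idempotence=true
security.protocol=SASL_PLAINTEXT
sasl.mechanism=SCRAM-SHA-256
sasl.username=producer
sasl.password=change-me
```

With rust-rdkafka these same keys are passed through `ClientConfig::set` before building the producer.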
@tonymeehan Can I have your eyes on this 😄?
This rule looks good to merge! Please rename the CRE id (folder and rule ID) to
Done, @tonymeehan!
Thanks @Lyndon-prequel
Add Detection Rule: Redpanda Consumer Mass Disconnect → Coordinator Failure
Overview
This PR adds a CRE detection rule for a critical Redpanda failure in which mass consumer disconnections overwhelm the group coordinator, halting message processing entirely.
Rule ID: CRE-2025-0091
Severity: 10/10 (Critical)
Category: distributed-messaging-connectivity

Failure Scenario Reproduced
Issue: When 100+ consumers are forcibly disconnected simultaneously, Redpanda's consumer group coordinator becomes unresponsive.
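The mechanics can be illustrated with a toy coordinator model. This is purely illustrative and not Redpanda's actual implementation: each abrupt disconnect (no LeaveGroup sent) queues a rebalance round, and while the group is rebalancing, join attempts are rejected, mirroring the errors listed below.

```python
# Toy model of a consumer-group coordinator (illustrative only, not Redpanda
# code). It shows how a mass disconnect can wedge the group: every abrupt
# leave queues a rebalance round, and while rebalancing, joins are rejected.

class ToyCoordinator:
    def __init__(self):
        self.members = set()
        self.rebalancing = False
        self.pending_rebalances = 0

    def join(self, member_id=None):
        if member_id is None:
            # First contact without a member id is rejected (KIP-394-style),
            # analogous to MemberIdRequiredError on the client side.
            return "MemberIdRequiredError"
        if self.rebalancing:
            return "REBALANCE_IN_PROGRESS"
        self.members.add(member_id)
        return "OK"

    def abrupt_disconnect(self, member_id):
        # A forced disconnect is only noticed via session timeout and
        # queues another rebalance round.
        self.members.discard(member_id)
        self.pending_rebalances += 1
        self.rebalancing = True

    def tick(self):
        # One rebalance round completes per tick; with 100+ queued rounds
        # the group stays unready for a long time.
        if self.pending_rebalances > 0:
            self.pending_rebalances -= 1
        self.rebalancing = self.pending_rebalances > 0


coord = ToyCoordinator()
for i in range(120):
    coord.join(f"consumer-{i}")

# Forcibly disconnect all 120 consumers at once.
for i in range(120):
    coord.abrupt_disconnect(f"consumer-{i}")

print(coord.join())                # MemberIdRequiredError
print(coord.join("consumer-new"))  # REBALANCE_IN_PROGRESS
```

The point of the model is the queue of pending rebalance rounds: one disconnect is absorbed quickly, but 100+ simultaneous ones keep the coordinator in a rebalancing state long enough that the group appears dead to clients.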
Impact:
- MemberIdRequiredError
- NodeNotReadyError

🔗 References
/solves #69
/claim #69
closes #69
Demo Video
Screen.Recording.2025-06-10.034013.1.1.mp4
Screen.Recording.2025-06-10.094058.mp4
Impact Score: 10/10 - Complete message processing halt
Mitigation Score: 7/10 - Requires restart + graceful consumer management