fix: detect dead nodes via HTTP/2 keepalives (Issue #1001) #1025
Enables HTTP/2 keepalives and TCP keepalives in gRPC client to detect dead nodes (e.g., power loss) in ~8 seconds, preventing cluster hangs.
Pull request overview
This PR resolves issue #1001 where the cluster becomes unresponsive after a node experiences an abrupt power-off by implementing HTTP/2 keepalive mechanisms for faster dead node detection.
Key Changes:
- Configured gRPC client with HTTP/2 PING frames (5s interval, 3s timeout) for active health checking
- Reduced dead node detection time from ~15+ minutes to ~8 seconds
- Added comprehensive documentation explaining the root cause, solution, and configuration
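The 5 s interval / 3 s timeout figures above imply a simple worst-case detection budget: a node can die just after answering a PING, so detection takes at most one full interval plus one timeout. A minimal sketch of that arithmetic (values from the PR description; the function name is illustrative):

```rust
use std::time::Duration;

// Worst-case time to detect a dead peer with periodic HTTP/2 keepalive pings:
// the peer may die immediately after answering a ping, so detection waits up
// to one full ping interval for the next ping, then the ping timeout.
fn worst_case_detection(ping_interval: Duration, ping_timeout: Duration) -> Duration {
    ping_interval + ping_timeout
}

fn main() {
    let d = worst_case_detection(Duration::from_secs(5), Duration::from_secs(3));
    println!("worst-case detection: {} s", d.as_secs()); // 5 s + 3 s = 8 s
}
```

This is where the "~8 seconds" figure in the PR description comes from.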
Reviewed changes
Copilot reviewed 2 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| docs/cluster_recovery.md | New documentation explaining the power-off recovery issue, technical approach, and implementation details |
| crates/protos/src/lib.rs | Enhanced gRPC endpoint configuration with HTTP/2 keepalive, TCP keepalive, and timeout settings |
| Cargo.lock | Dependency lock file update (unrelated transitive dependency changes) |
> The system "hung" indefinitely, unlike the immediate recovery observed during a graceful process termination (`kill`).
>
> **Root Cause**:
> The standard TCP protocol does not immediately detect a silent peer disappearance (power loss) because no `FIN` or `RST` packets are sent. Without active application-layer heartbeats, the surviving nodes kept connections implementation in an `ESTABLISHED` state, waiting indefinitely for responses that would never arrive.
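To illustrate why the hang lasts so long without heartbeats: once the kernel starts retransmitting into the void, it backs off exponentially before declaring the connection dead. The sketch below models that backoff with commonly cited Linux defaults (initial RTO around 200 ms, doubling per retry, capped at 120 s, `tcp_retries2 = 15`); these values are assumptions for illustration, not taken from the PR:

```rust
use std::time::Duration;

// Rough model of TCP retransmission backoff: each retry waits one RTO,
// the RTO doubles after every retry, and is capped at `cap`.
// Assumed values (typical Linux defaults, not from the PR):
// initial RTO 200 ms, cap 120 s, tcp_retries2 = 15.
fn total_retransmit_time(initial_rto: Duration, cap: Duration, retries: u32) -> Duration {
    let mut rto = initial_rto;
    let mut total = Duration::ZERO;
    for _ in 0..retries {
        total += rto;          // wait one RTO for this retry
        rto = (rto * 2).min(cap); // exponential backoff, capped
    }
    total
}

fn main() {
    let total = total_retransmit_time(Duration::from_millis(200), Duration::from_secs(120), 15);
    // Roughly 13-15 minutes before the kernel gives up on the connection,
    // which matches the "~15+ minutes" observed in the issue.
    println!("total: {} s", total.as_secs());
}
```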
**Copilot AI** commented on Dec 6, 2025:
Stray word: "implementation" appears to be an accidental insertion. The phrase should read "kept connections in an `ESTABLISHED` state" rather than "kept connections implementation in an `ESTABLISHED` state".
Suggested change:

Before: The standard TCP protocol does not immediately detect a silent peer disappearance (power loss) because no `FIN` or `RST` packets are sent. Without active application-layer heartbeats, the surviving nodes kept connections implementation in an `ESTABLISHED` state, waiting indefinitely for responses that would never arrive.

After: The standard TCP protocol does not immediately detect a silent peer disappearance (power loss) because no `FIN` or `RST` packets are sent. Without active application-layer heartbeats, the surviving nodes kept connections in an `ESTABLISHED` state, waiting indefinitely for responses that would never arrive.
Uploading and downloading don't use tonic RPC. Instead, they use `HttpReader` from `crates/rio`. Maybe we can work in this direction.
Indeed, please say more about this, so that I can help fix it.
Resolution for Issue #1001
This PR resolves the issue where the cluster becomes unresponsive after a node experiences an abrupt power-off.
The Fix
Configures the internal gRPC client (`crates/protos/src/lib.rs`) to use active application-layer heartbeats.

Verification

`cargo check`, `cargo test`, and `cargo clippy` passed.
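The heartbeat configuration described above can be sketched with tonic's `Endpoint` builder. This is a configuration sketch, not the PR's exact code: the 5 s / 3 s values come from the PR description, while the TCP keepalive period, function name, and address parameter are illustrative assumptions.

```rust
use std::time::Duration;
use tonic::transport::Endpoint;

// Sketch of an HTTP/2-keepalive-enabled gRPC endpoint (assumed shape;
// see crates/protos/src/lib.rs for the actual implementation).
fn make_endpoint(addr: &'static str) -> Endpoint {
    Endpoint::from_static(addr)
        // Send an HTTP/2 PING every 5 s, even when the connection is idle.
        .http2_keep_alive_interval(Duration::from_secs(5))
        .keep_alive_while_idle(true)
        // Declare the peer dead if a PING goes unanswered for 3 s.
        .keep_alive_timeout(Duration::from_secs(3))
        // OS-level TCP keepalive as a second line of defense
        // (period is an illustrative value).
        .tcp_keepalive(Some(Duration::from_secs(5)))
}
```

With this configuration, a silently powered-off peer is detected within roughly one ping interval plus one timeout, rather than waiting on TCP's retransmission backoff.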