@Jitterx69
Contributor

Resolution for Issue #1001

This PR resolves the issue where the cluster becomes unresponsive after a node experiences an abrupt power-off.

The Fix

Configures the internal gRPC client (crates/protos/src/lib.rs) to use active Application-Layer Heartbeats:

  • HTTP/2 PING Interval: 5 seconds
  • Timeout: 3 seconds
  • TCP Keepalive: 10 seconds (Backup)
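
As a rough illustration of how these three values relate, here is a minimal Rust sketch. It is not the code from crates/protos/src/lib.rs, which is not shown in this conversation; the `KeepaliveSettings` type and its method names are hypothetical. It only records the intervals listed above and derives a worst-case detection time, assuming a PING goes out one full interval after the last activity and the timeout then runs to completion.

```rust
use std::time::Duration;

/// Hypothetical holder for the three values listed in the PR description.
/// This is an illustration only, not a type from the rustfs code base.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct KeepaliveSettings {
    /// How often an HTTP/2 PING frame is sent.
    pub http2_ping_interval: Duration,
    /// How long to wait for the PING acknowledgement.
    pub http2_ping_timeout: Duration,
    /// OS-level TCP keepalive, described in the PR as a backup.
    pub tcp_keepalive: Duration,
}

impl KeepaliveSettings {
    /// The values quoted in the PR description.
    pub const fn from_pr_description() -> Self {
        Self {
            http2_ping_interval: Duration::from_secs(5),
            http2_ping_timeout: Duration::from_secs(3),
            tcp_keepalive: Duration::from_secs(10),
        }
    }

    /// Assumed worst case: one full interval passes before the PING is sent,
    /// then the whole timeout elapses without an acknowledgement.
    pub fn worst_case_detection(&self) -> Duration {
        self.http2_ping_interval + self.http2_ping_timeout
    }
}
```

With the values above, `KeepaliveSettings::from_pr_description().worst_case_detection()` evaluates to 8 seconds. If the client is built on tonic, the matching knobs would presumably be `Endpoint::http2_keep_alive_interval`, `Endpoint::keep_alive_timeout`, and `Endpoint::tcp_keepalive`, but that mapping is an assumption here.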

Verification

  • Surviving nodes now detect dead peers in ~8 seconds (the 5-second PING interval plus the 3-second timeout).
  • cargo check, cargo test, and cargo clippy passed.

@CLAassistant

CLAassistant commented Dec 6, 2025

CLA assistant check
All committers have signed the CLA.

Enables HTTP/2 keepalives and TCP keepalives in gRPC client to detect dead nodes (e.g., power loss) in ~8 seconds, preventing cluster hangs.
@Jitterx69 force-pushed the fix/issue-1001-dead-node-detection branch from ed586d9 to 3eca2e3 (December 6, 2025 10:38)
@0xdx2 requested a review from Copilot (December 6, 2025 11:05)
Contributor

Copilot AI left a comment

Pull request overview

This PR resolves issue #1001 where the cluster becomes unresponsive after a node experiences an abrupt power-off by implementing HTTP/2 keepalive mechanisms for faster dead node detection.

Key Changes:

  • Configured gRPC client with HTTP/2 PING frames (5s interval, 3s timeout) for active health checking
  • Reduced dead node detection time from ~15+ minutes to ~8 seconds
  • Added comprehensive documentation explaining the root cause, solution, and configuration

Reviewed changes

Copilot reviewed 2 out of 3 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| docs/cluster_recovery.md | New documentation explaining the power-off recovery issue, technical approach, and implementation details |
| crates/protos/src/lib.rs | Enhanced gRPC endpoint configuration with HTTP/2 keepalive, TCP keepalive, and timeout settings |
| Cargo.lock | Dependency lock file update (unrelated transitive dependency changes) |

- The system "hung" indefinitely, unlike the immediate recovery observed during a graceful process termination (`kill`).

**Root Cause**:
The standard TCP protocol does not immediately detect a silent peer disappearance (power loss) because no `FIN` or `RST` packets are sent. Without active application-layer heartbeats, the surviving nodes kept connections implementation in an `ESTABLISHED` state, waiting indefinitely for responses that would never arrive.
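
The difference between a passive connection and one with heartbeats can be sketched with a small simulation. This is a simplified model and not the project's code: `HeartbeatMonitor`, `PeerState`, and `seconds_until_detected` are hypothetical names, time is a caller-supplied `Duration` advanced in one-second ticks, and the peer is assumed to go silent at t = 0 without sending `FIN` or `RST`.

```rust
use std::time::Duration;

/// Connection state as seen by the surviving node.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PeerState {
    Established,
    Dead,
}

/// Hypothetical heartbeat tracker driven by a caller-supplied clock.
pub struct HeartbeatMonitor {
    /// `None` models a connection with no application-layer heartbeats.
    interval: Option<Duration>,
    timeout: Duration,
    last_ack: Duration,
    ping_sent_at: Option<Duration>,
    state: PeerState,
}

impl HeartbeatMonitor {
    /// A connection that never probes its peer.
    pub fn passive() -> Self {
        Self::build(None, Duration::ZERO)
    }

    /// A connection that pings after `interval` of silence and waits `timeout` for the ack.
    pub fn with_heartbeat(interval: Duration, timeout: Duration) -> Self {
        Self::build(Some(interval), timeout)
    }

    fn build(interval: Option<Duration>, timeout: Duration) -> Self {
        Self {
            interval,
            timeout,
            last_ack: Duration::ZERO,
            ping_sent_at: None,
            state: PeerState::Established,
        }
    }

    /// Record that the peer acknowledged a ping at `now`.
    pub fn on_ack(&mut self, now: Duration) {
        if self.state == PeerState::Established {
            self.last_ack = now;
            self.ping_sent_at = None;
        }
    }

    /// Advance the simulated clock to `now` and return the resulting state.
    pub fn tick(&mut self, now: Duration) -> PeerState {
        if self.state == PeerState::Established {
            match (self.ping_sent_at, self.interval) {
                // A ping is outstanding and its timeout has expired: give up on the peer.
                (Some(sent), _) if now >= sent + self.timeout => self.state = PeerState::Dead,
                // No ping outstanding and the interval has elapsed: send one now.
                (None, Some(interval)) if now >= self.last_ack + interval => {
                    self.ping_sent_at = Some(now)
                }
                _ => {}
            }
        }
        self.state
    }
}

/// Tick once per simulated second against a peer that never answers.
/// Returns the first second at which the peer is declared dead, if any.
pub fn seconds_until_detected(monitor: &mut HeartbeatMonitor, horizon_secs: u64) -> Option<u64> {
    (1..=horizon_secs).find(|&s| monitor.tick(Duration::from_secs(s)) == PeerState::Dead)
}
```

In this model, a monitor built with a 5-second interval and a 3-second timeout marks the silent peer `Dead` at the 8-second tick, while the passive monitor still reports `Established` at the end of any horizon.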

Copilot AI Dec 6, 2025

Stray word: "implementation" should be removed. The phrase should read "kept connections in an ESTABLISHED state" rather than "kept connections implementation in an ESTABLISHED state".

Suggested change
- The standard TCP protocol does not immediately detect a silent peer disappearance (power loss) because no `FIN` or `RST` packets are sent. Without active application-layer heartbeats, the surviving nodes kept connections implementation in an `ESTABLISHED` state, waiting indefinitely for responses that would never arrive.
+ The standard TCP protocol does not immediately detect a silent peer disappearance (power loss) because no `FIN` or `RST` packets are sent. Without active application-layer heartbeats, the surviving nodes kept connections in an `ESTABLISHED` state, waiting indefinitely for responses that would never arrive.

@loverustfs loverustfs merged commit b10d80c into rustfs:main Dec 6, 2025
14 checks passed
@weisd
Contributor

weisd commented Dec 6, 2025

Uploading and downloading don’t use tonic rpc. Instead, they use HttpReader from crates/rio. Maybe we can work in this direction.

@Jitterx69
Contributor Author

> Uploading and downloading don’t use tonic rpc. Instead, they use HttpReader from crates/rio. Maybe we can work in this direction.

Indeed. Could you say more about this, so that I can help fix it?
