fix: detect dead nodes via HTTP/2 keepalives (Issue #1001) #1025
Enables HTTP/2 keepalives and TCP keepalives in gRPC client to detect dead nodes (e.g., power loss) in ~8 seconds, preventing cluster hangs.
Pull request overview
This PR resolves issue #1001 where the cluster becomes unresponsive after a node experiences an abrupt power-off by implementing HTTP/2 keepalive mechanisms for faster dead node detection.
Key Changes:
- Configured gRPC client with HTTP/2 PING frames (5s interval, 3s timeout) for active health checking
- Reduced dead node detection time from ~15+ minutes to ~8 seconds
- Added comprehensive documentation explaining the root cause, solution, and configuration
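The 5 s interval / 3 s timeout figures above imply a simple worst-case detection budget: a node can die just after answering a PING, so detection takes at most one full interval plus one timeout. A minimal sketch of that arithmetic (values from the PR description; the function name is illustrative):

```rust
use std::time::Duration;

// Worst-case time to detect a dead peer with periodic HTTP/2 keepalive pings:
// the peer may die immediately after answering a ping, so detection waits up
// to one full ping interval for the next ping, then the ping timeout.
fn worst_case_detection(ping_interval: Duration, ping_timeout: Duration) -> Duration {
    ping_interval + ping_timeout
}

fn main() {
    let d = worst_case_detection(Duration::from_secs(5), Duration::from_secs(3));
    println!("worst-case detection: {} s", d.as_secs()); // 5 s + 3 s = 8 s
}
```

This is where the "~8 seconds" figure in the PR description comes from.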
Reviewed changes
Copilot reviewed 2 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| docs/cluster_recovery.md | New documentation explaining the power-off recovery issue, technical approach, and implementation details |
| crates/protos/src/lib.rs | Enhanced gRPC endpoint configuration with HTTP/2 keepalive, TCP keepalive, and timeout settings |
| Cargo.lock | Dependency lock file update (unrelated transitive dependency changes) |
> The system "hung" indefinitely, unlike the immediate recovery observed during a graceful process termination (`kill`).
>
> **Root Cause**:
> The standard TCP protocol does not immediately detect a silent peer disappearance (power loss) because no `FIN` or `RST` packets are sent. Without active application-layer heartbeats, the surviving nodes kept connections implementation in an `ESTABLISHED` state, waiting indefinitely for responses that would never arrive.
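To illustrate why the hang lasts so long without heartbeats: once the kernel starts retransmitting into the void, it backs off exponentially before declaring the connection dead. The sketch below models that backoff with commonly cited Linux defaults (initial RTO around 200 ms, doubling per retry, capped at 120 s, `tcp_retries2 = 15`); these values are assumptions for illustration, not taken from the PR:

```rust
use std::time::Duration;

// Rough model of TCP retransmission backoff: each retry waits one RTO,
// the RTO doubles after every retry, and is capped at `cap`.
// Assumed values (typical Linux defaults, not from the PR):
// initial RTO 200 ms, cap 120 s, tcp_retries2 = 15.
fn total_retransmit_time(initial_rto: Duration, cap: Duration, retries: u32) -> Duration {
    let mut rto = initial_rto;
    let mut total = Duration::ZERO;
    for _ in 0..retries {
        total += rto;          // wait one RTO for this retry
        rto = (rto * 2).min(cap); // exponential backoff, capped
    }
    total
}

fn main() {
    let total = total_retransmit_time(Duration::from_millis(200), Duration::from_secs(120), 15);
    // Roughly 13-15 minutes before the kernel gives up on the connection,
    // which matches the "~15+ minutes" observed in the issue.
    println!("total: {} s", total.as_secs());
}
```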
**Copilot AI** commented on Dec 6, 2025:
Stray word: "implementation" appears to be an accidental insertion. The phrase should read "kept connections in an `ESTABLISHED` state" rather than "kept connections implementation in an `ESTABLISHED` state".
Suggested change:

Before: The standard TCP protocol does not immediately detect a silent peer disappearance (power loss) because no `FIN` or `RST` packets are sent. Without active application-layer heartbeats, the surviving nodes kept connections implementation in an `ESTABLISHED` state, waiting indefinitely for responses that would never arrive.

After: The standard TCP protocol does not immediately detect a silent peer disappearance (power loss) because no `FIN` or `RST` packets are sent. Without active application-layer heartbeats, the surviving nodes kept connections in an `ESTABLISHED` state, waiting indefinitely for responses that would never arrive.
Uploading and downloading don't use tonic RPC. Instead, they use `HttpReader` from `crates/rio`. Maybe we can work in this direction.
Indeed, please say more about this, so that I can help fix it.
Resolution for Issue #1001
This PR resolves the issue where the cluster becomes unresponsive after a node experiences an abrupt power-off.
The Fix
Configures the internal gRPC client (`crates/protos/src/lib.rs`) to use active application-layer heartbeats.

Verification

`cargo check`, `cargo test`, and `cargo clippy` passed.
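The heartbeat configuration described above can be sketched with tonic's `Endpoint` builder. This is a configuration sketch, not the PR's exact code: the 5 s / 3 s values come from the PR description, while the TCP keepalive period, function name, and address parameter are illustrative assumptions.

```rust
use std::time::Duration;
use tonic::transport::Endpoint;

// Sketch of an HTTP/2-keepalive-enabled gRPC endpoint (assumed shape;
// see crates/protos/src/lib.rs for the actual implementation).
fn make_endpoint(addr: &'static str) -> Endpoint {
    Endpoint::from_static(addr)
        // Send an HTTP/2 PING every 5 s, even when the connection is idle.
        .http2_keep_alive_interval(Duration::from_secs(5))
        .keep_alive_while_idle(true)
        // Declare the peer dead if a PING goes unanswered for 3 s.
        .keep_alive_timeout(Duration::from_secs(3))
        // OS-level TCP keepalive as a second line of defense
        // (period is an illustrative value).
        .tcp_keepalive(Some(Duration::from_secs(5)))
}
```

With this configuration, a silently powered-off peer is detected within roughly one ping interval plus one timeout, rather than waiting on TCP's retransmission backoff.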