Fix/issue #1001 dead node detection #1054

Jitterx69 · 2025-12-08T03:32:19Z

Type of Change

Related Issues

Summary of Changes

This PR implements a comprehensive solution for persistent cluster recovery failures after abrupt node power-offs. Previously, the system hung indefinitely when a node experienced a hard failure (power loss), causing:

Login failures for rustfsadmin user
Console UI showing 0 storage/0 objects/0 servers
Upload operations stopping completely

Root Causes Addressed:

Stale Connection Cache: GLOBAL_Conn_Map retained dead gRPC channels indefinitely
Blocking IAM Sync: Authentication operations waited for all peers to acknowledge
No Per-Peer Timeouts: Console aggregation hung on unresponsive nodes
Passive Failure Detection: Relied solely on TCP timeouts (15+ minutes)

Solution Implemented:

Dead Connection Eviction - Automatic cache cleanup on RPC failures
Graceful Console Degradation - 2s timeout per peer with partial data fallback
Non-Blocking IAM Sync - Fire-and-forget peer notifications via tokio::spawn
Enhanced gRPC Config - Reduced timeouts (3s connect, 30s RPC) with aggressive keepalives
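
The eviction idea can be illustrated with a minimal sketch. It is not the code from this PR: `ConnCache`, `Channel`, and `call_with_eviction` are hypothetical names, and a plain struct stands in for a cached gRPC channel so the example needs only the standard library. The point it shows is that any failed RPC removes the cached entry, so the next request reconnects instead of reusing a dead channel.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Stand-in for a cached gRPC channel (hypothetical, for illustration only).
#[derive(Clone, Debug, PartialEq)]
struct Channel {
    addr: String,
}

/// Hypothetical connection cache keyed by peer address.
#[derive(Default)]
struct ConnCache {
    inner: Mutex<HashMap<String, Channel>>,
}

impl ConnCache {
    /// Return the cached channel for `addr`, creating one if none is cached.
    fn get_or_connect(&self, addr: &str) -> Channel {
        let mut map = self.inner.lock().unwrap();
        map.entry(addr.to_string())
            .or_insert_with(|| Channel { addr: addr.to_string() })
            .clone()
    }

    /// Drop the cached channel for `addr`; returns true if an entry was removed.
    fn evict(&self, addr: &str) -> bool {
        self.inner.lock().unwrap().remove(addr).is_some()
    }

    /// Report whether a channel for `addr` is currently cached.
    fn contains(&self, addr: &str) -> bool {
        self.inner.lock().unwrap().contains_key(addr)
    }
}

/// Run `rpc` against the cached channel and evict the entry if the call fails.
fn call_with_eviction<T, E>(
    cache: &ConnCache,
    addr: &str,
    rpc: impl FnOnce(&Channel) -> Result<T, E>,
) -> Result<T, E> {
    let chan = cache.get_or_connect(addr);
    let result = rpc(&chan);
    if result.is_err() {
        // The channel may be dead: forget it so the next call reconnects.
        cache.evict(addr);
    }
    result
}
```

A successful call leaves the entry cached. A failing call returns the error to the caller unchanged and leaves the cache without an entry for that address.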

Performance Impact:

Detection time: ~8 seconds (down from 15+ minutes)
Login: Completes immediately even with dead peers
Console: Loads in <2s showing healthy nodes + offline status
Recovery: Automatic reconnection on next request

Files Modified:

crates/common/src/globals.rs - Connection eviction logic
crates/protos/src/lib.rs - Timeout configuration + eviction helper
crates/ecstore/src/rpc/peer_rest_client.rs - Auto-eviction on RPC errors
crates/ecstore/src/notification_sys.rs - Per-peer timeouts + degradation
crates/iam/src/sys.rs - Non-blocking notifications
crates/common/Cargo.toml + crates/protos/Cargo.toml - Added tracing dependency
docs/cluster_recovery.md - Comprehensive resolution report

Checklist

I have read and followed the CONTRIBUTING.md guidelines
Passed make pre-commit (cargo build + cargo test + cargo fmt)
Added/updated necessary tests (350 tests passing in rustfs-ecstore, rustfs-iam)
Documentation updated (if needed) - Added docs/cluster_recovery.md
CI/CD passed (if applicable) - cargo build --release + cargo test --release successful

Impact

Breaking change (compatibility)
Requires doc/config/deployment update
Other impact: Significantly improves cluster resilience and recovery time

Behavioral Changes:

Console now shows partial cluster data when nodes are offline (previously showed 0/0/0)
Login succeeds immediately regardless of peer node status (previously hung)
Failed RPCs trigger automatic connection cleanup (new behavior)
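
The fire-and-forget behavior behind the immediate login can be sketched as follows. The PR description says the notifications are wrapped in tokio::spawn. This sketch substitutes std::thread::spawn so it runs without external crates, and `notify_peer`, `notify_peers_detached`, and `login` are hypothetical names. The returned receiver exists only so the background work can be observed; a real fire-and-forget caller would simply drop it.

```rust
use std::sync::mpsc::{channel, Receiver};
use std::thread;
use std::time::Duration;

/// Hypothetical peer notification. It blocks for `delay` to simulate a slow or dead peer.
fn notify_peer(peer: &str, delay: Duration) -> String {
    thread::sleep(delay);
    format!("notified {peer}")
}

/// Start one background notification per peer and return without waiting for any of them.
fn notify_peers_detached(peers: Vec<(String, Duration)>) -> Receiver<String> {
    let (tx, rx) = channel();
    for (peer, delay) in peers {
        let tx = tx.clone();
        // The thread is detached: its JoinHandle is dropped and never joined.
        thread::spawn(move || {
            let msg = notify_peer(&peer, delay);
            // The receiver may already be gone; a failed send is ignored on purpose.
            let _ = tx.send(msg);
        });
    }
    rx
}

/// Hypothetical login path: the result does not depend on any peer answering.
fn login(peers: Vec<(String, Duration)>) -> (&'static str, Receiver<String>) {
    let rx = notify_peers_detached(peers);
    ("login ok", rx)
}
```

The trade-off is that the caller no longer learns whether a peer received the notification, so peers that were down need another way to catch up once they return.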

No Breaking Changes:

All existing APIs remain unchanged
Backward compatible with existing deployments
No configuration changes required

Additional Notes

Testing Performed:

Unit tests: 299 tests (rustfs-ecstore) + 51 tests (rustfs-iam) = 350 passing
Build verification: cargo build --release successful
Code formatting: cargo fmt --all applied
No debug artifacts (println!, dbg!, unwanted todo!)

Verification Steps for Reviewers:

Console Recovery Test: Kill a node → Verify dashboard loads with offline status in <2s
Login Recovery Test: Kill a node → Verify rustfsadmin login succeeds immediately
Connection Eviction Test: Check logs for "Evicted stale connection" warnings after node failure
Graceful Degradation Test: Verify console shows N-1 healthy nodes when 1 node is down
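
What the degradation test checks can be sketched like this: every peer is queried concurrently against one shared deadline, and a peer that has not answered by then is reported as offline instead of blocking the whole response. This is an illustration, not the code in notification_sys.rs. `PeerStatus` and `aggregate` are hypothetical names, the peer query is simulated with a sleep, the object count is a placeholder, and std threads replace the async runtime.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::{Duration, Instant};

/// Result for one peer (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
enum PeerStatus {
    Online { objects: u64 },
    Offline,
}

/// Query all peers concurrently. `latency` simulates how long each peer takes to answer.
fn aggregate(peers: Vec<(String, Duration)>, timeout: Duration) -> Vec<(String, PeerStatus)> {
    // One shared deadline bounds the total wait, no matter how many peers are dead.
    let deadline = Instant::now() + timeout;

    // Start every query first so that slow peers overlap instead of adding up.
    let pending: Vec<(String, mpsc::Receiver<u64>)> = peers
        .into_iter()
        .map(|(name, latency)| {
            let (tx, rx) = mpsc::channel();
            thread::spawn(move || {
                thread::sleep(latency); // simulated RPC to the peer
                let _ = tx.send(42u64); // placeholder object count
            });
            (name, rx)
        })
        .collect();

    // Collect answers; anything missing at the deadline is marked offline.
    pending
        .into_iter()
        .map(|(name, rx)| {
            let remaining = deadline.saturating_duration_since(Instant::now());
            let status = match rx.recv_timeout(remaining) {
                Ok(objects) => PeerStatus::Online { objects },
                Err(_) => PeerStatus::Offline,
            };
            (name, status)
        })
        .collect()
}
```

With three peers and one of them unresponsive, the result lists two `Online` entries and one `Offline` entry, which corresponds to the N-1 healthy nodes expected in the test above.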

Technical Details:

Connection timeout: 5s → 3s
RPC timeout: 60s → 30s
HTTP/2 keepalive: 5s interval, 3s ACK timeout
TCP keepalive: 10s
Per-peer aggregation timeout: 2s
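
The values above can be collected in one place to show how they relate. The struct and method names below are hypothetical. Reading the ~8 second detection time as the HTTP/2 keepalive interval plus the ACK timeout is an inference from the listed numbers, not something the PR states. In a tonic-based client, values like these are typically passed to the `Endpoint` builder (`connect_timeout`, `timeout`, `http2_keep_alive_interval`, `keep_alive_timeout`, `tcp_keepalive`); whether this PR wires them exactly that way is an assumption.

```rust
use std::time::Duration;

/// Hypothetical container for the timeout values listed in this PR description.
#[derive(Debug, Clone, Copy)]
struct ClientTimeouts {
    connect_timeout: Duration,
    rpc_timeout: Duration,
    http2_keepalive_interval: Duration,
    http2_keepalive_ack_timeout: Duration,
    tcp_keepalive: Duration,
    per_peer_aggregation_timeout: Duration,
}

impl ClientTimeouts {
    /// Values copied from the "Technical Details" list.
    fn from_pr_description() -> Self {
        ClientTimeouts {
            connect_timeout: Duration::from_secs(3),
            rpc_timeout: Duration::from_secs(30),
            http2_keepalive_interval: Duration::from_secs(5),
            http2_keepalive_ack_timeout: Duration::from_secs(3),
            tcp_keepalive: Duration::from_secs(10),
            per_peer_aggregation_timeout: Duration::from_secs(2),
        }
    }

    /// Assumed worst case for noticing a dead peer on an open connection:
    /// the peer dies just after a ping was acknowledged, so the client waits one
    /// full interval before the next ping, then the ACK timeout before giving up.
    fn worst_case_dead_peer_detection(&self) -> Duration {
        self.http2_keepalive_interval + self.http2_keepalive_ack_timeout
    }
}
```

Under that assumption the worst case is 5s + 3s = 8s, in line with the detection time quoted in the summary.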

Related Documentation:

Full technical analysis in docs/cluster_recovery.md
Original issue: #1001, "Cluster Behaves Differently to kill vs. Abrupt Power-Off"
Previous attempt: PR #1035, "fix(net): resolve 1GB upload hang and macos build (Issue #1001 regression)" (partial fix with keepalives only)


Resolves rustfs#1001

This PR implements a comprehensive solution for cluster recovery failures after abrupt node power-offs. The fix addresses four critical issues:

1. **Dead Connection Eviction**
   - Added evict_connection() in globals.rs for explicit cache management
   - Implemented evict_failed_connection() helper in protos/lib.rs
   - Connections are immediately removed on RPC failures
2. **Graceful Console Degradation**
   - Added 2-second per-peer timeout in notification_sys.rs
   - server_info() and storage_info() return partial data from healthy nodes
   - Console loads successfully even with dead nodes (shows offline status)
3. **Non-Blocking IAM Sync**
   - Wrapped peer notifications in tokio::spawn (fire-and-forget)
   - Login operations complete immediately regardless of peer status
   - Prevents authentication hangs when nodes are down
4. **PeerRestClient Auto-Eviction**
   - Updated critical IAM methods (load_user, load_group, etc.)
   - Automatic connection eviction on any RPC error
   - Ensures fresh connection attempts after failures

**Technical Details:**
- Reduced connect timeout: 5s → 3s
- Reduced RPC timeout: 60s → 30s
- Maintained aggressive keepalives (5s HTTP/2, 10s TCP)
- Detection time: ~8 seconds (down from 15+ minutes)

**Testing:**
- All 350 tests passing (rustfs-ecstore, rustfs-iam)
- cargo build --release: successful
- cargo fmt --all: applied

**Files Changed:**
- crates/common/src/globals.rs (eviction logic)
- crates/common/Cargo.toml (added tracing)
- crates/protos/src/lib.rs (timeouts + eviction helper)
- crates/protos/Cargo.toml (added tracing)
- crates/ecstore/src/rpc/peer_rest_client.rs (auto-eviction)
- crates/ecstore/src/notification_sys.rs (timeouts + degradation)
- crates/iam/src/sys.rs (non-blocking notifications)
- docs/cluster_recovery.md (comprehensive resolution report)

**Impact:**
- Login: Works immediately even with dead peers
- Console: Loads in <2s with partial data (no more 0/0/0)
- Uploads: Fail-fast with retry capability
- Recovery: 8 seconds vs 15+ minutes

Co-authored-by: Jitterx69 <[email protected]>

loverustfs · 2025-12-08T03:34:32Z

Hi @Jitterx69,
A similar PR was also submitted.

#1044

Jitterx69 · 2025-12-08T03:37:36Z

> Hi @Jitterx69, a similar PR was also submitted.
>
> #1044

Merge the best

loverustfs · 2025-12-08T04:20:25Z

Hey @Jitterx69,

We compared the code in PRs #1044 and #1054 and felt that the approach in #1054 was more accurate.
We will continue from #1054 to fix the bug that caused the RustFS cluster to crash after a forced power outage.

Jitterx69 and others added 10 commits December 6, 2025 16:08

fix(protos): enable app-layer keepalives (rustfs#1001)

3eca2e3

Enables HTTP/2 keepalives and TCP keepalives in gRPC client to detect dead nodes (e.g., power loss) in ~8 seconds, preventing cluster hangs.

Merge branch 'main' into fix/issue-1001-dead-node-detection

7d0a735

fix(net): enable data-plane and server-side keepalives

8288edc

Enables reqwest keepalives in Rio (data stream) and TCP keepalives in Server (accepted sockets) to prevent hangs during large transfers if a peer dies.

fix(prof): fix macos/aarch64 profiling build error

40c0b7f

Guard profiling logic with cfg(target_os=linux) and provide no-ops for other platforms to prevent compilation failures on macOS.

docs: update recovery report with data plane and build fixes

435a4ce

Merge branch 'main' into fix/issue-1001-dead-node-detection

1661394

style: fix rustfmt errors in profiling and http

e673509

style: fix clippy redundancy error in profiling cfg

94222b1

Merge branch 'main' into fix/issue-1001-dead-node-detection

9b85fb3

loverustfs merged commit 76d25d9 into rustfs:main Dec 8, 2025
14 checks passed

loverustfs mentioned this pull request Dec 8, 2025

Cluster Behaves Differently to kill vs. Abrupt Power-Off #1001

Open

Copilot AI mentioned this pull request Dec 8, 2025

Comprehensive cluster power-off resilience solution with multi-layer defense and EC quorum management #1068

Closed
