@Jitterx69 Jitterx69 commented Dec 8, 2025

Type of Change

  • New Feature
  • Bug Fix
  • Documentation
  • Performance Improvement
  • Test/CI
  • Refactor
  • Other:

Related Issues

#1001

Summary of Changes

This PR implements a comprehensive fix for persistent cluster recovery failures after abrupt node power-offs. Previously, the system hung indefinitely when a node suffered a hard failure (power loss), causing:

  • Login failures for rustfsadmin user
  • Console UI showing 0 storage/0 objects/0 servers
  • Upload operations stopping completely

Root Causes Addressed:

  1. Stale Connection Cache: GLOBAL_Conn_Map retained dead gRPC channels indefinitely
  2. Blocking IAM Sync: Authentication operations waited for all peers to acknowledge
  3. No Per-Peer Timeouts: Console aggregation hung on unresponsive nodes
  4. Passive Failure Detection: Relied solely on TCP timeouts (15+ minutes)

Solution Implemented:

  1. Dead Connection Eviction - Automatic cache cleanup on RPC failures
  2. Graceful Console Degradation - 2s timeout per peer with partial data fallback
  3. Non-Blocking IAM Sync - Fire-and-forget peer notifications via tokio::spawn
  4. Enhanced gRPC Config - Reduced timeouts (3s connect, 30s RPC) with aggressive keepalives
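The eviction idea in item 1 can be sketched as follows. This is a minimal, dependency-free illustration, not the code from this PR: the `Channel` struct stands in for a real gRPC channel, `conn_map()` only mirrors the role of `GLOBAL_Conn_Map`, and `call_with_eviction` is a hypothetical helper name. Only `evict_connection` is a name taken from the PR, and its body here is assumed.

```rust
use std::collections::HashMap;
use std::sync::{Mutex, OnceLock};

/// Stand-in for a cached gRPC channel (assumption: the real cache stores channel handles).
#[derive(Clone, Debug, PartialEq)]
struct Channel {
    addr: String,
}

/// Hypothetical global cache keyed by peer address, playing the role of GLOBAL_Conn_Map.
fn conn_map() -> &'static Mutex<HashMap<String, Channel>> {
    static MAP: OnceLock<Mutex<HashMap<String, Channel>>> = OnceLock::new();
    MAP.get_or_init(|| Mutex::new(HashMap::new()))
}

/// Return the cached channel for `addr`, creating and caching one if absent.
fn get_or_connect(addr: &str) -> Channel {
    let mut map = conn_map().lock().unwrap();
    map.entry(addr.to_string())
        .or_insert_with(|| Channel { addr: addr.to_string() })
        .clone()
}

/// Remove the cached channel for `addr`; returns true if an entry was evicted.
fn evict_connection(addr: &str) -> bool {
    conn_map().lock().unwrap().remove(addr).is_some()
}

/// Run an RPC-like closure against the cached channel and evict it on any error,
/// so the next call has to build a fresh connection.
fn call_with_eviction<T, E>(
    addr: &str,
    rpc: impl FnOnce(&Channel) -> Result<T, E>,
) -> Result<T, E> {
    let channel = get_or_connect(addr);
    let result = rpc(&channel);
    if result.is_err() {
        evict_connection(addr);
    }
    result
}
```

Note that the map lock is released before the RPC closure runs, so a slow or hanging peer does not block lookups for other peers.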

Performance Impact:

  • Detection time: ~8 seconds (down from 15+ minutes)
  • Login: Completes immediately even with dead peers
  • Console: Loads in <2s showing healthy nodes + offline status
  • Recovery: Automatic reconnection on next request

Files Modified:

  • crates/common/src/globals.rs - Connection eviction logic
  • crates/protos/src/lib.rs - Timeout configuration + eviction helper
  • crates/ecstore/src/rpc/peer_rest_client.rs - Auto-eviction on RPC errors
  • crates/ecstore/src/notification_sys.rs - Per-peer timeouts + degradation
  • crates/iam/src/sys.rs - Non-blocking notifications
  • crates/common/Cargo.toml + crates/protos/Cargo.toml - Added tracing dependency
  • docs/cluster_recovery.md - Comprehensive resolution report

Checklist

  • I have read and followed the CONTRIBUTING.md guidelines
  • Passed make pre-commit (cargo build + cargo test + cargo fmt)
  • Added/updated necessary tests (350 tests passing in rustfs-ecstore, rustfs-iam)
  • Documentation updated (if needed) - Added [docs/cluster_recovery.md]
  • CI/CD passed (if applicable) - cargo build --release + cargo test --release successful

Impact

  • Breaking change (compatibility)
  • Requires doc/config/deployment update
  • Other impact: Significantly improves cluster resilience and recovery time

Behavioral Changes:

  • Console now shows partial cluster data when nodes are offline (previously showed 0/0/0)
  • Login succeeds immediately regardless of peer node status (previously hung)
  • Failed RPCs trigger automatic connection cleanup (new behavior)

No Breaking Changes:

  • All existing APIs remain unchanged
  • Backward compatible with existing deployments
  • No configuration changes required

Additional Notes

Testing Performed:

  • Unit tests: 299 tests (rustfs-ecstore) + 51 tests (rustfs-iam) = 350 passing
  • Build verification: cargo build --release successful
  • Code formatting: cargo fmt --all applied
  • No debug artifacts (println!, dbg!, unwanted todo!)

Verification Steps for Reviewers:

  1. Console Recovery Test: Kill a node → Verify dashboard loads with offline status in <2s
  2. Login Recovery Test: Kill a node → Verify rustfsadmin login succeeds immediately
  3. Connection Eviction Test: Check logs for "Evicted stale connection" warnings after node failure
  4. Graceful Degradation Test: Verify console shows N-1 healthy nodes when 1 node is down
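The behavior that steps 1 and 4 exercise (partial data from healthy nodes, offline status for the rest) could look roughly like the sketch below. It is an illustration under stated assumptions, not the PR's `notification_sys.rs` code: it uses standard-library threads and channels in place of the async runtime, and `PeerStatus`, `collect_server_info`, and the `query` callback are hypothetical names.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical per-peer result shown by the console.
#[derive(Debug, Clone, PartialEq)]
enum PeerStatus {
    Online { objects: u64 },
    Offline,
}

/// Query all peers concurrently. Every peer shares one deadline (`timeout` from
/// the start of collection); peers that have not answered by then are reported
/// as Offline instead of stalling the whole aggregation.
fn collect_server_info<F>(
    peers: &[&str],
    timeout: Duration,
    query: F,
) -> Vec<(String, PeerStatus)>
where
    F: Fn(&str) -> u64 + Send + Clone + 'static,
{
    // Start one worker per peer; each reports through its own channel.
    let pending: Vec<_> = peers
        .iter()
        .map(|&peer| {
            let (tx, rx) = mpsc::channel();
            let name = peer.to_string();
            let query = query.clone();
            thread::spawn(move || {
                // The receiver may already have given up; ignore send errors.
                let _ = tx.send(query(&name));
            });
            (peer.to_string(), rx)
        })
        .collect();

    let deadline = Instant::now() + timeout;
    pending
        .into_iter()
        .map(|(peer, rx)| {
            let remaining = deadline.saturating_duration_since(Instant::now());
            let status = match rx.recv_timeout(remaining) {
                Ok(objects) => PeerStatus::Online { objects },
                // Timed out, or the worker died: degrade to Offline.
                Err(_) => PeerStatus::Offline,
            };
            (peer, status)
        })
        .collect()
}
```

With three peers of which one never answers, the result holds two `Online` entries and one `Offline` entry, and the call returns after at most roughly `timeout`.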

Technical Details:

  • Connection timeout: 5s → 3s
  • RPC timeout: 60s → 30s
  • HTTP/2 keepalive: 5s interval, 3s ACK timeout
  • TCP keepalive: 10s
  • Per-peer aggregation timeout: 2s
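For reference, the values above can be collected in one place, which also makes the ~8 second detection figure easy to check. The struct and method names below are hypothetical and not part of the PR, and the assumption that detection time is the HTTP/2 keepalive interval plus the ACK timeout is an inference from the listed numbers, not something the PR states explicitly.

```rust
use std::time::Duration;

/// Timeout values quoted in this PR; the struct itself is a hypothetical container.
#[derive(Debug, Clone, Copy)]
struct PeerTimeouts {
    connect: Duration,
    rpc: Duration,
    http2_keepalive_interval: Duration,
    http2_keepalive_ack_timeout: Duration,
    tcp_keepalive: Duration,
    per_peer_aggregation: Duration,
}

impl PeerTimeouts {
    /// Values as listed under "Technical Details".
    const fn from_pr() -> Self {
        Self {
            connect: Duration::from_secs(3),
            rpc: Duration::from_secs(30),
            http2_keepalive_interval: Duration::from_secs(5),
            http2_keepalive_ack_timeout: Duration::from_secs(3),
            tcp_keepalive: Duration::from_secs(10),
            per_peer_aggregation: Duration::from_secs(2),
        }
    }

    /// Assumed model: a dead peer is noticed one ping interval plus one
    /// ACK timeout after it stops responding on an otherwise idle connection.
    fn dead_peer_detection(&self) -> Duration {
        self.http2_keepalive_interval + self.http2_keepalive_ack_timeout
    }
}
```

Under that model, `PeerTimeouts::from_pr().dead_peer_detection()` is 8 seconds, which matches the detection time reported in this PR.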

Related Documentation:


Thank you for your contribution! Please ensure your PR follows the community standards (CODE_OF_CONDUCT.md) and sign the CLA if this is your first contribution.

Jitterx69 and others added 10 commits December 6, 2025 16:08
Enables HTTP/2 keepalives and TCP keepalives in gRPC client to detect dead nodes (e.g., power loss) in ~8 seconds, preventing cluster hangs.
Enables reqwest keepalives in Rio (data stream) and TCP keepalives in Server (accepted sockets) to prevent hangs during large transfers if a peer dies.
Guard profiling logic with cfg(target_os=linux) and provide no-ops for other platforms to prevent compilation failures on macOS.
Resolves rustfs#1001

This PR implements a comprehensive solution for cluster recovery failures
after abrupt node power-offs. The fix addresses four critical issues:

1. **Dead Connection Eviction**
   - Added evict_connection() in globals.rs for explicit cache management
   - Implemented evict_failed_connection() helper in protos/lib.rs
   - Connections are immediately removed on RPC failures

2. **Graceful Console Degradation**
   - Added 2-second per-peer timeout in notification_sys.rs
   - server_info() and storage_info() return partial data from healthy nodes
   - Console loads successfully even with dead nodes (shows offline status)

3. **Non-Blocking IAM Sync**
   - Wrapped peer notifications in tokio::spawn (fire-and-forget)
   - Login operations complete immediately regardless of peer status
   - Prevents authentication hangs when nodes are down

4. **PeerRestClient Auto-Eviction**
   - Updated critical IAM methods (load_user, load_group, etc.)
   - Automatic connection eviction on any RPC error
   - Ensures fresh connection attempts after failures
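
The fire-and-forget pattern from item 3 can be sketched as below. This is an illustration only: it substitutes `std::thread::spawn` for the `tokio::spawn` used in this PR so that it runs without external crates, and `notify_peers_detached` and the `notify` callback are hypothetical names. The returned receiver exists only so that outcomes can be observed (for logging or tests); the caller never has to wait on it.

```rust
use std::sync::mpsc::{self, Receiver};
use std::thread;

/// Start one background notification per peer and return immediately.
/// The caller (for example, a login path) does not wait for any peer.
fn notify_peers_detached<F>(
    peers: Vec<String>,
    notify: F,
) -> Receiver<(String, Result<(), String>)>
where
    F: Fn(&str) -> Result<(), String> + Send + Clone + 'static,
{
    let (tx, rx) = mpsc::channel();
    for peer in peers {
        let tx = tx.clone();
        let notify = notify.clone();
        thread::spawn(move || {
            let outcome = notify(&peer);
            // Nobody may be listening any more; a failed send is not an error.
            let _ = tx.send((peer, outcome));
        });
    }
    // Dropping the original sender here lets the receiver finish once all workers are done.
    rx
}
```

The trade-off of this design is that a failed notification no longer fails the originating operation, so a peer that was down has to pick up the change later by some other means.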

**Technical Details:**
- Reduced connect timeout: 5s → 3s
- Reduced RPC timeout: 60s → 30s
- Maintained aggressive keepalives (5s HTTP/2, 10s TCP)
- Detection time: ~8 seconds (down from 15+ minutes)

**Testing:**
- All 350 tests passing (rustfs-ecstore, rustfs-iam)
- cargo build --release: successful
- cargo fmt --all: applied

**Files Changed:**
- crates/common/src/globals.rs (eviction logic)
- crates/common/Cargo.toml (added tracing)
- crates/protos/src/lib.rs (timeouts + eviction helper)
- crates/protos/Cargo.toml (added tracing)
- crates/ecstore/src/rpc/peer_rest_client.rs (auto-eviction)
- crates/ecstore/src/notification_sys.rs (timeouts + degradation)
- crates/iam/src/sys.rs (non-blocking notifications)
- docs/cluster_recovery.md (comprehensive resolution report)

**Impact:**
- Login: Works immediately even with dead peers
- Console: Loads in <2s with partial data (no more 0/0/0)
- Uploads: Fail-fast with retry capability
- Recovery: 8 seconds vs 15+ minutes

Co-authored-by: Jitterx69 <[email protected]>
@loverustfs
Contributor

Hi @Jitterx69,
A similar PR has also been submitted:

#1044

@Jitterx69
Contributor Author

> Hi @Jitterx69, a similar PR was also submitted.
>
> #1044

Merge the best

@loverustfs
Contributor

Hey @Jitterx69,

We compared the code in #1044 and #1054 and found that the approach in #1054 was more accurate.
We will continue working from #1054 to fix the bug that causes the RustFS cluster to crash after a forced power outage.

@loverustfs loverustfs merged commit 76d25d9 into rustfs:main Dec 8, 2025
14 checks passed