Fix/issue #1001 dead node detection #1054
Merged
loverustfs
merged 10 commits into
rustfs:main
from
Jitterx69:fix/issue-1001-dead-node-detection
Dec 8, 2025
Conversation
Enables HTTP/2 keepalives and TCP keepalives in gRPC client to detect dead nodes (e.g., power loss) in ~8 seconds, preventing cluster hangs.
Enables reqwest keepalives in Rio (data stream) and TCP keepalives in Server (accepted sockets) to prevent hangs during large transfers if a peer dies.
Guard profiling logic with cfg(target_os=linux) and provide no-ops for other platforms to prevent compilation failures on macOS.
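The platform guard described above can be sketched roughly as follows. This is a minimal illustration only: the module name `profiling` and the functions `start` and `init_profiling` are hypothetical and are not taken from the PR's code. The idea is that both builds expose the same signature, so call sites need no `cfg` attributes of their own.

```rust
// Hypothetical names throughout; the PR's real profiling API is not shown on this page.

#[cfg(target_os = "linux")]
mod profiling {
    /// Linux build: the real profiler setup would live here.
    /// Returns true to signal that profiling is active.
    pub fn start() -> bool {
        // Placeholder for Linux-only initialisation.
        true
    }
}

#[cfg(not(target_os = "linux"))]
mod profiling {
    /// Non-Linux build (e.g. macOS): a no-op with the same signature,
    /// so the crate still compiles and callers stay unchanged.
    pub fn start() -> bool {
        false
    }
}

/// Call site shared by all platforms; it carries no cfg attribute itself.
pub fn init_profiling() -> bool {
    profiling::start()
}
```

Calling `init_profiling()` returns `true` only on Linux in this sketch; on other targets the no-op variant is compiled in instead.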
Resolves rustfs#1001

This PR implements a comprehensive solution for cluster recovery failures after abrupt node power-offs. The fix addresses four critical issues:

1. **Dead Connection Eviction**
   - Added evict_connection() in globals.rs for explicit cache management
   - Implemented evict_failed_connection() helper in protos/lib.rs
   - Connections are immediately removed on RPC failures
2. **Graceful Console Degradation**
   - Added 2-second per-peer timeout in notification_sys.rs
   - server_info() and storage_info() return partial data from healthy nodes
   - Console loads successfully even with dead nodes (shows offline status)
3. **Non-Blocking IAM Sync**
   - Wrapped peer notifications in tokio::spawn (fire-and-forget)
   - Login operations complete immediately regardless of peer status
   - Prevents authentication hangs when nodes are down
4. **PeerRestClient Auto-Eviction**
   - Updated critical IAM methods (load_user, load_group, etc.)
   - Automatic connection eviction on any RPC error
   - Ensures fresh connection attempts after failures

**Technical Details:**
- Reduced connect timeout: 5s → 3s
- Reduced RPC timeout: 60s → 30s
- Maintained aggressive keepalives (5s HTTP/2, 10s TCP)
- Detection time: ~8 seconds (down from 15+ minutes)

**Testing:**
- All 350 tests passing (rustfs-ecstore, rustfs-iam)
- cargo build --release: successful
- cargo fmt --all: applied

**Files Changed:**
- crates/common/src/globals.rs (eviction logic)
- crates/common/Cargo.toml (added tracing)
- crates/protos/src/lib.rs (timeouts + eviction helper)
- crates/protos/Cargo.toml (added tracing)
- crates/ecstore/src/rpc/peer_rest_client.rs (auto-eviction)
- crates/ecstore/src/notification_sys.rs (timeouts + degradation)
- crates/iam/src/sys.rs (non-blocking notifications)
- docs/cluster_recovery.md (comprehensive resolution report)

**Impact:**
- Login: Works immediately even with dead peers
- Console: Loads in <2s with partial data (no more 0/0/0)
- Uploads: Fail-fast with retry capability
- Recovery: 8 seconds vs 15+ minutes

Co-authored-by: Jitterx69 <[email protected]>
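To illustrate the eviction-on-failure idea from the commit message, here is a minimal stdlib-only sketch. It is not the PR's implementation: `Channel`, `ConnCache`, and `call_with_eviction` are hypothetical stand-ins, and the actual code works with cached gRPC channels and async RPCs rather than synchronous closures. The point shown is only the pattern: reuse a cached connection, and drop it from the cache as soon as a call through it fails, so the next attempt reconnects.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Stand-in for a cached gRPC channel (hypothetical type).
#[derive(Clone, Debug, PartialEq)]
pub struct Channel {
    pub addr: String,
}

/// Hypothetical connection cache keyed by peer address.
pub struct ConnCache {
    inner: Mutex<HashMap<String, Channel>>,
}

impl ConnCache {
    pub fn new() -> Self {
        ConnCache { inner: Mutex::new(HashMap::new()) }
    }

    /// Return the cached channel for `addr`, creating one if absent.
    /// The lock is released before the caller uses the channel.
    pub fn get_or_connect(&self, addr: &str) -> Channel {
        let mut map = self.inner.lock().unwrap();
        map.entry(addr.to_string())
            .or_insert_with(|| Channel { addr: addr.to_string() })
            .clone()
    }

    /// Remove the cached channel; returns true if one was present.
    pub fn evict(&self, addr: &str) -> bool {
        self.inner.lock().unwrap().remove(addr).is_some()
    }

    /// Report whether a channel for `addr` is currently cached.
    pub fn contains(&self, addr: &str) -> bool {
        self.inner.lock().unwrap().contains_key(addr)
    }
}

/// Run `rpc` over the cached channel and evict the channel on any error,
/// so the next call starts from a fresh connection attempt.
pub fn call_with_eviction<T, E>(
    cache: &ConnCache,
    addr: &str,
    rpc: impl FnOnce(&Channel) -> Result<T, E>,
) -> Result<T, E> {
    let chan = cache.get_or_connect(addr);
    let result = rpc(&chan);
    if result.is_err() {
        cache.evict(addr);
    }
    result
}
```

In this sketch a successful call leaves the entry cached, while a failed call removes it; the error itself is still returned to the caller, which can then retry.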
Contributor
Hi @Jitterx69,
Contributor
Author
Merge the best
Contributor
Hey @Jitterx69, we compared the code in versions 1044 and 1054, and felt that version 1054's approach was more accurate.
Type of Change
Related Issues
#1001
Summary of Changes
This PR implements a comprehensive solution for persistent cluster recovery failures after abrupt node power-offs. The system previously hung indefinitely when a node experienced hard failure (power loss), causing:
rustfsadmin user
Root Causes Addressed:
GLOBAL_Conn_Map retained dead gRPC channels indefinitely
Solution Implemented:
tokio::spawn
Performance Impact:
Files Modified:
tracing dependency
Checklist
make pre-commit (cargo build + cargo test + cargo fmt)
cargo build --release + cargo test --release successful
Impact
Behavioral Changes:
No Breaking Changes:
Additional Notes
Testing Performed:
cargo build --release successful
cargo fmt --all applied
println!, dbg!, unwanted todo!)
Verification Steps for Reviewers:
rustfsadmin login succeeds immediately
Technical Details:
Related Documentation:
Thank you for your contribution! Please ensure your PR follows the community standards (CODE_OF_CONDUCT.md) and sign the CLA if this is your first contribution.