fix(cluster): use lazy connections to prevent unresponsive cluster on node power-off #1044

tennisleng · 2025-12-07T23:42:05Z

Description

This PR fixes issue #1001 where the cluster becomes unresponsive when a node is abruptly powered off (e.g., by cutting the power), while gracefully killing the process works correctly.

Root Cause Analysis

The gRPC client was using connect().await which blocks waiting for a connection to be established. When a node is abruptly powered off:

TCP connections are never closed (the dead node sends no FIN or RST, so peers see no 'goodbye' message)
connect().await blocks waiting for OS TCP timeouts (can be 2+ minutes)
All cluster operations hang while waiting for the unreachable node
The Console Web UI becomes unresponsive

In contrast, when a process is gracefully killed:

TCP FIN packets are sent properly
Connections are cleanly closed
Other nodes quickly detect the failure and continue operating

Changes

Core Fix

Changed from connect().await to connect_lazy() in node_service_time_out_client:

connect_lazy() returns immediately without establishing a connection
Connection is established lazily on first request
Tonic's lazy channel handles automatic reconnection when nodes come back online
Requests to unreachable nodes fail with a timeout error once the request timeout elapses, instead of blocking until the OS TCP timeout expires
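The lazy-connection behaviour described above can be illustrated with a minimal, std-only sketch. It is not the tonic API and not the code in this PR: `LazyConn`, its fields, and its methods are hypothetical names chosen for illustration, and a plain `TcpStream` stands in for the gRPC channel. It shows only the shape of the idea: construction opens no socket, the first use connects with a bounded timeout, and dropping the stream lets the next use reconnect.

```rust
use std::io;
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

/// Hypothetical stand-in for a lazily connected channel (not tonic's type).
pub struct LazyConn {
    addr: SocketAddr,
    connect_timeout: Duration,
    stream: Option<TcpStream>,
}

impl LazyConn {
    /// Returns immediately; no socket is opened here.
    pub fn new(addr: SocketAddr, connect_timeout: Duration) -> Self {
        Self { addr, connect_timeout, stream: None }
    }

    /// True once a connection has been established by `get`.
    pub fn is_connected(&self) -> bool {
        self.stream.is_some()
    }

    /// Connects on first use. The attempt is bounded by `connect_timeout`,
    /// so an unreachable peer yields an error instead of an open-ended wait.
    pub fn get(&mut self) -> io::Result<&mut TcpStream> {
        if self.stream.is_none() {
            let stream = TcpStream::connect_timeout(&self.addr, self.connect_timeout)?;
            self.stream = Some(stream);
        }
        Ok(self.stream.as_mut().expect("stream was set above"))
    }

    /// Drops the current connection so the next `get` reconnects.
    pub fn reset(&mut self) {
        self.stream = None;
    }
}
```

In this sketch a failed `get` leaves `stream` empty, so a later call simply tries again. That mirrors, in simplified form, the reconnection the description attributes to Tonic's lazy channel.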

Additional Improvements

Reduced default request timeout from 60s to 30s for faster failure detection
Added clear_connection() and clear_all_connections() helper functions to allow manual clearing of potentially stale connections from the cache
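As a rough illustration of what such cache helpers might look like, here is a std-only sketch. Only the names `clear_connection` and `clear_all_connections` come from this PR. The `ConnCache` type, the string key, the return values, and the generic parameter `C` (standing in for the real channel type) are all assumptions, and the actual helpers in crates/common/src/globals.rs may be free functions with different signatures.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical connection cache keyed by node address.
/// `C` stands in for whatever channel type the real cache stores.
pub struct ConnCache<C> {
    inner: Mutex<HashMap<String, C>>,
}

impl<C: Clone> ConnCache<C> {
    pub fn new() -> Self {
        Self { inner: Mutex::new(HashMap::new()) }
    }

    /// Stores (or replaces) the cached connection for `addr`.
    pub fn insert(&self, addr: &str, conn: C) {
        self.inner.lock().unwrap().insert(addr.to_string(), conn);
    }

    /// Returns a clone of the cached connection, if any.
    pub fn get(&self, addr: &str) -> Option<C> {
        self.inner.lock().unwrap().get(addr).cloned()
    }

    /// Removes the entry for one node; returns true if an entry existed.
    pub fn clear_connection(&self, addr: &str) -> bool {
        self.inner.lock().unwrap().remove(addr).is_some()
    }

    /// Removes every cached entry; returns how many were dropped.
    pub fn clear_all_connections(&self) -> usize {
        let mut map = self.inner.lock().unwrap();
        let dropped = map.len();
        map.clear();
        dropped
    }
}
```

A caller that suspects a node's connection is stale would call `clear_connection` with that node's address, so the next lookup misses the cache and a fresh connection is created.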

Files Changed

crates/protos/src/lib.rs: Changed connection strategy from eager to lazy
crates/common/src/globals.rs: Added helper functions for connection cache management

Testing

To test this fix:

Deploy a 4-node rustfs cluster
Verify the cluster is healthy and file uploads work
Abruptly power off one node
Verify the application can still upload files (may see errors for that specific node, but cluster remains responsive)
Verify the Console Web UI remains responsive
Power the node back on and verify it rejoins the cluster

Fixes: #1001

… node power-off

This fixes issue rustfs#1001 where the cluster becomes unresponsive when a node is abruptly powered off (vs. graceful kill which works correctly).

Root cause: The gRPC client was using connect().await which blocks waiting for a connection to be established. When a node is abruptly powered off without sending 'goodbye' messages, TCP connections can hang for extended periods (2+ minutes) waiting for OS TCP timeouts.

Changes:
- Changed from connect().await to connect_lazy() in node_service_time_out_client
- connect_lazy() returns immediately without establishing a connection
- Connection is established lazily on first request
- Tonic's lazy channel handles automatic reconnection when nodes come back
- Requests to unreachable nodes fail quickly with timeout errors instead of blocking indefinitely
- Reduced default request timeout from 60s to 30s for faster failure detection
- Added clear_connection() and clear_all_connections() helper functions to allow manual clearing of potentially stale connections from the cache

This ensures the cluster remains responsive even when nodes vanish abruptly, as requests will fail fast with proper timeout errors rather than blocking indefinitely.

Fixes: rustfs#1001

loverustfs · 2025-12-08T04:18:28Z

Hey @tennisleng,

#1054 and #1044 both address this issue, and the two fixes may conflict.
At the moment, the approach and direction of #1054 look more accurate.
You are welcome to submit other PRs; this PR is closed.
Thank you very much for your contribution!

loverustfs mentioned this pull request Dec 8, 2025

Fix/issue #1001 dead node detection #1054

Merged

15 tasks

Merge branch 'main' into fix/issue-1001-cluster-power-off

c0f4c1b

loverustfs closed this Dec 8, 2025
