@tennisleng
Contributor

Description

This PR fixes issue #1001, in which the cluster becomes unresponsive when a node is abruptly powered off (e.g., by cutting the power). Gracefully killing the process does not trigger the problem.

Root Cause Analysis

The gRPC client was using connect().await, which does not return until a connection is established or the attempt fails. When a node is abruptly powered off:

  • TCP connections don't properly close (no 'goodbye' messages sent)
  • connect().await blocks waiting for OS TCP timeouts (can be 2+ minutes)
  • All cluster operations hang while waiting for the unreachable node
  • The Console Web UI becomes unresponsive
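
To illustrate the failure mode above: the hang comes from a dial that has no deadline of its own. The following is a minimal, std-only sketch (not the code in this PR, which goes through tonic) of bounding a dial with `TcpStream::connect_timeout`. The helper name `dial_with_deadline` is hypothetical and exists only for illustration.

```rust
use std::io;
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

/// Dial `addr`, giving up after `timeout` instead of waiting for the OS-level
/// TCP connect timeout. Hypothetical helper, not part of rustfs.
pub fn dial_with_deadline(addr: &str, timeout: Duration) -> io::Result<TcpStream> {
    // `connect_timeout` needs a concrete socket address, so parse it first and
    // report a malformed address as `InvalidInput`.
    let addr: SocketAddr = addr
        .parse()
        .map_err(|e| io::Error::new(io::ErrorKind::InvalidInput, e))?;
    // Returns an error if no connection is established within `timeout`.
    // Note: `connect_timeout` itself rejects a zero duration.
    TcpStream::connect_timeout(&addr, timeout)
}
```

With a deadline of a few seconds, a caller sees an error for a powered-off peer after roughly that long rather than after the OS gives up.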

In contrast, when a process is gracefully killed:

  • TCP FIN packets are sent properly
  • Connections are cleanly closed
  • Other nodes quickly detect the failure and continue operating

Changes

Core Fix

Changed from connect().await to connect_lazy() in node_service_time_out_client:

  • connect_lazy() returns immediately without establishing a connection
  • Connection is established lazily on first request
  • Tonic's lazy channel handles automatic reconnection when nodes come back online
  • Requests to unreachable nodes fail with a timeout error once the request timeout elapses, instead of blocking indefinitely
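
The lazy behavior listed above can be modeled without tonic. The sketch below is a std-only model, not tonic's implementation: construction does no I/O, the first request dials, and a failed dial or request leaves the channel empty so the next call redials. `LazyChannel`, `call`, and `dial_attempts` are hypothetical names, and `T` stands in for a live connection.

```rust
use std::io;

/// Std-only model of a lazily connected channel (hypothetical type).
pub struct LazyChannel<T> {
    addr: String,
    conn: Option<T>,
    dial_attempts: u32,
}

impl<T> LazyChannel<T> {
    /// Returns immediately: no I/O happens at construction time.
    pub fn new(addr: &str) -> Self {
        Self { addr: addr.to_string(), conn: None, dial_attempts: 0 }
    }

    /// Number of times a dial has been attempted so far.
    pub fn dial_attempts(&self) -> u32 {
        self.dial_attempts
    }

    /// Run `request` on the connection, dialing first if none is cached.
    /// A failed dial or a failed request leaves the channel empty, so the
    /// next call dials again (a simple model of reconnection).
    pub fn call<R>(
        &mut self,
        dial: impl FnOnce(&str) -> io::Result<T>,
        request: impl FnOnce(&mut T) -> io::Result<R>,
    ) -> io::Result<R> {
        if self.conn.is_none() {
            self.dial_attempts += 1;
            self.conn = Some(dial(&self.addr)?);
        }
        let conn = self.conn.as_mut().expect("connection was just established");
        let result = request(conn);
        if result.is_err() {
            // Drop the possibly broken connection so the next call redials.
            self.conn = None;
        }
        result
    }
}
```

In this model an unreachable node costs nothing until a request is actually sent to it, and one failed request does not poison later requests once the node is back.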

Additional Improvements

  • Reduced default request timeout from 60s to 30s for faster failure detection
  • Added clear_connection() and clear_all_connections() helper functions to allow manual clearing of potentially stale connections from the cache
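
For readers unfamiliar with the connection cache, here is a rough sketch of what such helpers could look like. Only the names `clear_connection` and `clear_all_connections` come from this PR. The signatures, return values, the `Channel` placeholder type, and the `cache_connection` and `cached_connections` helpers are assumptions made for illustration; the real functions in crates/common/src/globals.rs may differ (for example, they may be async).

```rust
use std::collections::HashMap;
use std::sync::{Mutex, OnceLock};

/// Placeholder for a real gRPC channel type (assumption for this sketch).
type Channel = String;

/// Process-wide cache keyed by node address.
fn cache() -> &'static Mutex<HashMap<String, Channel>> {
    static CACHE: OnceLock<Mutex<HashMap<String, Channel>>> = OnceLock::new();
    CACHE.get_or_init(|| Mutex::new(HashMap::new()))
}

/// Hypothetical helper: store a channel for `addr`, replacing any old one.
pub fn cache_connection(addr: &str, channel: Channel) {
    cache().lock().unwrap().insert(addr.to_string(), channel);
}

/// Hypothetical helper: number of cached channels.
pub fn cached_connections() -> usize {
    cache().lock().unwrap().len()
}

/// Drop the cached channel for one node. Returns true if one was cached.
pub fn clear_connection(addr: &str) -> bool {
    cache().lock().unwrap().remove(addr).is_some()
}

/// Drop every cached channel. Returns how many were removed.
pub fn clear_all_connections() -> usize {
    let mut map = cache().lock().unwrap();
    let removed = map.len();
    map.clear();
    removed
}
```

A caller that suspects a node's channel is stale could call `clear_connection` for that address, so the next request builds a fresh channel instead of reusing the cached one.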

Files Changed

  • crates/protos/src/lib.rs: Changed connection strategy from eager to lazy
  • crates/common/src/globals.rs: Added helper functions for connection cache management

Testing

To test this fix:

  1. Deploy a 4-node rustfs cluster
  2. Verify the cluster is healthy and file uploads work
  3. Abruptly power off one node
  4. Verify the application can still upload files (requests that touch the powered-off node may return errors, but the cluster should remain responsive)
  5. Verify the Console Web UI remains responsive
  6. Power the node back on and verify it rejoins the cluster

Fixes: #1001

… node power-off

This fixes issue rustfs#1001 where the cluster becomes unresponsive when a node is
abruptly powered off (vs. graceful kill which works correctly).

Root cause: The gRPC client was using connect().await which blocks waiting for
a connection to be established. When a node is abruptly powered off without
sending 'goodbye' messages, TCP connections can hang for extended periods (2+
minutes) waiting for OS TCP timeouts.

Changes:
- Changed from connect().await to connect_lazy() in node_service_time_out_client
  - connect_lazy() returns immediately without establishing a connection
  - Connection is established lazily on first request
  - Tonic's lazy channel handles automatic reconnection when nodes come back
  - Requests to unreachable nodes fail quickly with timeout errors instead of
    blocking indefinitely

- Reduced default request timeout from 60s to 30s for faster failure detection

- Added clear_connection() and clear_all_connections() helper functions to
  allow manual clearing of potentially stale connections from the cache

This ensures the cluster remains responsive even when nodes vanish abruptly,
as requests will fail fast with proper timeout errors rather than blocking
indefinitely.

Fixes: rustfs#1001
@loverustfs loverustfs mentioned this pull request Dec 8, 2025
15 tasks
@loverustfs
Contributor

Hey @tennisleng,

There may be a conflict between the fixes in 1054 and 1044.
For now, the approach and direction of 1054 seem more accurate.
You are welcome to submit other PRs; this PR is closed.
Thank you very much for your contribution!

@loverustfs loverustfs closed this Dec 8, 2025
Development

Successfully merging this pull request may close these issues.

Cluster Behaves Differently to kill vs. Abrupt Power-Off
