fix(cluster): use lazy connections to prevent unresponsive cluster on node power-off #1044
Description
This PR fixes issue #1001 where the cluster becomes unresponsive when a node is abruptly powered off (e.g., by cutting the power), while gracefully killing the process works correctly.
Root Cause Analysis
The gRPC client was using `connect().await`, which blocks waiting for a connection to be established. When a node is abruptly powered off:
- `connect().await` blocks waiting for OS TCP timeouts (can be 2+ minutes)

In contrast, when a process is gracefully killed:
Changes
Core Fix
Changed from `connect().await` to `connect_lazy()` in `node_service_time_out_client`:
- `connect_lazy()` returns immediately without establishing a connection

Additional Improvements
- Added `clear_connection()` and `clear_all_connections()` helper functions to allow manual clearing of potentially stale connections from the cache

Files Changed
- `crates/protos/src/lib.rs`: Changed connection strategy from eager to lazy
- `crates/common/src/globals.rs`: Added helper functions for connection cache management

Testing
This fix should be tested by:
Fixes: #1001
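
For reviewers who want a feel for the two ideas in this PR (lazy connection setup and manual cache clearing), here is a rough std-only sketch. It is not the code in this PR and does not use the real gRPC client: `LazyConn`, `insert_connection`, the string cache key, and the return types of `clear_connection` and `clear_all_connections` are all assumptions made for illustration. The point is only that construction does no I/O, the first use connects with a bounded timeout, and stale entries can be dropped from a shared cache.

```rust
use std::collections::HashMap;
use std::io;
use std::net::{SocketAddr, TcpStream};
use std::sync::{Mutex, OnceLock};
use std::time::Duration;

/// Hypothetical stand-in for a lazily connected channel.
/// Construction performs no network I/O.
pub struct LazyConn {
    addr: SocketAddr,
    connect_timeout: Duration,
    stream: Option<TcpStream>,
}

impl LazyConn {
    /// Returns immediately; no socket is opened here.
    pub fn new(addr: SocketAddr, connect_timeout: Duration) -> Self {
        Self { addr, connect_timeout, stream: None }
    }

    /// True once a connection attempt has succeeded.
    pub fn is_connected(&self) -> bool {
        self.stream.is_some()
    }

    /// Connects on first use, waiting at most `connect_timeout`
    /// rather than the OS default TCP timeout.
    pub fn get(&mut self) -> io::Result<&TcpStream> {
        if self.stream.is_none() {
            let s = TcpStream::connect_timeout(&self.addr, self.connect_timeout)?;
            self.stream = Some(s);
        }
        Ok(self.stream.as_ref().expect("stream was just set"))
    }
}

/// Process-wide cache keyed by node address string (assumed key type).
fn cache() -> &'static Mutex<HashMap<String, LazyConn>> {
    static CACHE: OnceLock<Mutex<HashMap<String, LazyConn>>> = OnceLock::new();
    CACHE.get_or_init(|| Mutex::new(HashMap::new()))
}

/// Hypothetical helper: store a connection under `key`.
pub fn insert_connection(key: &str, conn: LazyConn) {
    cache().lock().unwrap().insert(key.to_string(), conn);
}

/// Drops one cached connection; returns whether an entry was removed.
pub fn clear_connection(key: &str) -> bool {
    cache().lock().unwrap().remove(key).is_some()
}

/// Drops every cached connection; returns how many were removed.
pub fn clear_all_connections() -> usize {
    let mut map = cache().lock().unwrap();
    let removed = map.len();
    map.clear();
    removed
}
```

In this model, a caller that sees a request fail against a powered-off node could call `clear_connection` for that node so the next request builds a fresh entry instead of reusing a dead one. Whether the real helpers are meant to be used that way is for the PR author to confirm.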