Cluster Behaves Differently After kill vs. Abrupt Power-Off #1001


## Context
We have a cluster consisting of 4 machines, each running the rustfs service with 4 disks. This setup forms a single, distributed storage cluster.

## Expected Behavior
The cluster should maintain availability and allow normal file uploads from the application even when one node becomes unavailable, as long as the quorum and data redundancy requirements are met.
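To make the quorum expectation concrete, here is a minimal sketch of the arithmetic. It is not taken from the rustfs source. It assumes a single erasure set spanning all 16 disks and a write quorum equal to the number of data shards (plus one when data and parity counts are equal), a convention used by some erasure-coded object stores. The parity count is also an assumption, since the report does not state the configured value.

```python
def write_quorum(data_shards: int, parity_shards: int) -> int:
    """Assumed rule: a write needs all data shards' worth of disks online.

    When data and parity counts are equal, one extra disk is required so
    that two disjoint halves of the set cannot both accept a write.
    """
    if data_shards == parity_shards:
        return data_shards + 1
    return data_shards


def survives_node_loss(nodes: int, disks_per_node: int,
                       parity_shards: int, nodes_lost: int = 1) -> bool:
    """True if the remaining disks still satisfy the assumed write quorum.

    Assumes one erasure set spread evenly over every disk in the cluster.
    """
    total_disks = nodes * disks_per_node
    data_shards = total_disks - parity_shards
    online_disks = (nodes - nodes_lost) * disks_per_node
    return online_disks >= write_quorum(data_shards, parity_shards)
```

Under these assumptions, `survives_node_loss(4, 4, parity_shards=4)` is true (12 disks online, 12 required), so losing one node should not block writes. With a hypothetical parity of 2 it would be false, which is why the configured parity matters when triaging this report.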

## Actual Behavior
- Case 1: Terminating the rustfs process (kill)
  - Action: We terminate the rustfs process on one of the machines.
  - Result: The application continues to upload files normally. This behavior is as expected.
- Case 2: Abruptly powering off a machine  
  - Action: We abruptly power off one of the machines (e.g., by cutting the power).
  - Result: The application cannot upload files normally. Furthermore, the Console Web UI for all rustfs nodes becomes unresponsive or appears to be "hung"/"frozen". The entire cluster seems to become unresponsive.

## Problem Summary
The cluster exhibits different behaviors depending on how a node is taken offline:
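- It stays available when the rustfs process on one node is terminated with kill. In this case the node's operating system keeps running and closes the process's network connections.
- It fails to recover and becomes unresponsive when a node is abruptly powered off. In this case nothing on the node is left to close connections or answer new ones.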

This suggests a potential issue in how the cluster handles the sudden, ungraceful failure of a node, possibly related to:
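- Network partition detection and handling.
- Quorum calculation and re-election processes.
- Session or connection state management when a node vanishes without closing its connections (no TCP FIN or RST reaches its peers).
- Timeouts for detecting a truly dead node, for example requests to the dead peer that block without a connect or read deadline.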
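The difference between the two failure modes is visible at the TCP level, independent of rustfs. The sketch below is a generic probe, not rustfs code; `probe_peer` and its return labels are hypothetical names chosen for illustration. When only the process is killed, the host's kernel still answers and a connection attempt is refused almost immediately. When the host is powered off, nothing answers, so the caller waits until its own timeout expires (or until the local network stack reports the host as unreachable).

```python
import socket


def probe_peer(host: str, port: int, timeout_s: float = 2.0) -> str:
    """Classify how a peer responds to a TCP connection attempt.

    Returns one of:
      "up"          - the connection was accepted
      "refused"     - the host is alive but nothing listens on the port
                      (typical after the service process is killed)
      "timeout"     - no reply arrived within timeout_s
                      (typical after an abrupt power-off)
      "unreachable" - the OS reported another network error, e.g. "no route to host"
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return "up"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError:
        return "unreachable"
```

A "refused" result comes back in milliseconds, while a "timeout" result takes the full `timeout_s`. Code that talks to peers without any such deadline would block for much longer on a powered-off node, which would be consistent with the hang described here, though the report alone does not confirm that this is the cause.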

## Steps to Reproduce
1. Deploy a 4-node rustfs cluster (4 machines, each with 4 disks).
2. Verify the cluster is healthy and the application can upload files.
3. Abruptly power off one of the machines (e.g., by cutting the power).
4. Observe that the application can no longer upload files.
5. Attempt to access the Console Web UI of any rustfs node. It will be unresponsive.
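To record how long the surviving nodes stay unresponsive during steps 4 and 5, a timed HTTP request with an explicit deadline can be run against each node before and after the power-off. This is a sketch using only the Python standard library; `timed_get` is a hypothetical helper, and the URL to poll is whatever address your console or API endpoint is served on (none is given in this report).

```python
import socket
import time
import urllib.error
import urllib.request

# Talk to the node directly, ignoring any proxy configured in the environment.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def timed_get(url: str, timeout_s: float = 5.0):
    """Issue one GET and return (outcome, elapsed_seconds).

    outcome is "http <status>" when a response arrives, "timeout" when the
    node accepts or ignores the request but never answers in time, and
    "error: ..." for other failures such as a refused connection.
    """
    start = time.monotonic()
    try:
        with _opener.open(url, timeout=timeout_s) as resp:
            outcome = f"http {resp.status}"
    except urllib.error.HTTPError as exc:
        # The server answered, just with an error status.
        outcome = f"http {exc.code}"
    except urllib.error.URLError as exc:
        # Failures while connecting or sending are wrapped in URLError.
        if isinstance(exc.reason, socket.timeout):
            outcome = "timeout"
        else:
            outcome = f"error: {exc.reason}"
    except socket.timeout:
        # Timeouts while waiting for the response are raised unwrapped.
        outcome = "timeout"
    except OSError as exc:
        outcome = f"error: {exc}"
    return outcome, time.monotonic() - start
```

Calling `timed_get` in a loop for each surviving node and logging the outcome with a timestamp shows whether the nodes recover after some interval or stay hung indefinitely, which helps narrow down which timeout, if any, is involved.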

## Additional Information
- Cluster Size: 4 nodes.
- Disks per Node: 4.
- Failure Mode: The key difference is between a process-level kill and a hardware-level power-off.
