Open
Labels
S-reproducing (Status: Reproducing a bug report), bug (Something isn't working)
Description
Context
We have a cluster consisting of 4 machines, each running the rustfs service with 4 disks. This setup forms a single, distributed storage cluster.
Expected Behavior
The cluster should maintain availability and allow normal file uploads from the application even when one node becomes unavailable, as long as the quorum and data redundancy requirements are met.
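The redundancy condition can be made concrete with a small calculation. The sketch below assumes a single erasure set spanning all 16 drives and a hypothetical parity count of 4. Neither value is stated in this report, and the rule `online >= total - parity` is a generic erasure-coding assumption, not confirmed rustfs quorum logic.

```python
def drives_online(nodes: int, drives_per_node: int, nodes_down: int) -> int:
    """Drives still reachable after `nodes_down` whole nodes fail."""
    return (nodes - nodes_down) * drives_per_node


def can_write(total_drives: int, parity: int, online: int) -> bool:
    """Assumed rule: a write needs at least the data-shard count online.

    This is a generic erasure-coding assumption and may differ from
    the quorum rule rustfs actually applies.
    """
    data_shards = total_drives - parity
    return online >= data_shards


TOTAL = 4 * 4   # 4 nodes x 4 disks, from this report
PARITY = 4      # hypothetical value, not taken from the report

one_down = drives_online(4, 4, nodes_down=1)   # 12 drives left
two_down = drives_online(4, 4, nodes_down=2)   # 8 drives left
print(can_write(TOTAL, PARITY, one_down))
print(can_write(TOTAL, PARITY, two_down))
```

Under these assumptions, losing one node still leaves enough drives for writes, which matches the expectation above that a single-node failure should be tolerated.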
Actual Behavior
- Case 1: Terminating the rustfs process (kill)
  - Action: We terminate the rustfs process on one of the machines.
  - Result: The application continues to upload files normally. This behavior is as expected.
- Case 2: Abruptly powering off a machine
  - Action: We abruptly power off one of the machines (e.g., by cutting the power).
  - Result: The application can no longer upload files. The Console Web UI of every rustfs node also hangs, so the entire cluster appears unresponsive.
Problem Summary
The cluster exhibits different behaviors depending on how a node is taken offline:
- It recovers correctly from a graceful (or signal-based) process termination (kill).
- It fails to recover and becomes unresponsive when a node is abruptly powered off.
This suggests a potential issue in how the cluster handles the sudden, ungraceful failure of a node, possibly related to:
- Network partition detection and handling.
- Quorum calculation and re-election processes.
- Session or connection state management when a node vanishes without sending "goodbye" messages.
- Timeouts for detecting a truly dead node.
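One general TCP difference between the two cases may be relevant here. When only the process is killed, the host's operating system is still running and typically refuses or resets connections, so peers learn of the failure right away. When the host loses power, nothing answers at all, and a caller without an explicit timeout can block for a long time. The sketch below classifies a connection attempt along those lines. The function name `probe` and the default timeout are hypothetical, and whether this mechanism is the actual cause in rustfs is not confirmed.

```python
import socket


def probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify a TCP connect attempt as 'open', 'refused', or 'timeout'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # Host is up but nothing is listening (e.g. the process was killed).
        return "refused"
    except socket.timeout:
        # No reply within the deadline (e.g. the host was powered off).
        return "timeout"
    except OSError as exc:
        # Other network errors, such as "no route to host".
        return f"error: {exc}"
```

Running `probe` against the affected node's service port in each case would show whether peers see an immediate refusal or a silent timeout.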
Steps to Reproduce
- Deploy a 4-node rustfs cluster (4 machines, each with 4 disks).
- Verify the cluster is healthy and the application can upload files.
- Abruptly power off one of the machines (e.g., by cutting the power).
- Observe that the application can no longer upload files.
- Attempt to access the Console Web UI of any rustfs node. It will be unresponsive.
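To record the last step consistently, each node's Console Web UI can be requested with an explicit client-side timeout, so a hang shows up as a definite failure instead of a stalled browser tab. This is a minimal sketch using only the Python standard library. The helper names and the URLs passed to them are hypothetical, since the report does not state the console address or port.

```python
import urllib.error
import urllib.request


def console_responds(url: str, timeout: float = 5.0) -> bool:
    """Return True if the server at `url` sends any HTTP response in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, even if with an error status code.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, unreachable, or no answer before the timeout.
        return False


def check_nodes(urls: list[str], timeout: float = 5.0) -> dict[str, bool]:
    """Check every console URL and map it to responsive / unresponsive."""
    return {url: console_responds(url, timeout) for url in urls}
```

Calling `check_nodes` with the console URLs of all four nodes before and after the power-off gives a per-node record of which consoles stopped answering.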
Additional Information
- Cluster Size: 4 nodes.
- Disks per Node: 4.
- Failure Mode: The key difference is between a process-level kill and a hardware-level power-off.