
Cluster Responds Differently to kill vs. Abrupt Power-Off #1001

@Pandas886

Context

We run a cluster of 4 machines, each running the rustfs service with 4 disks (16 disks in total). Together they form a single distributed storage cluster.

Expected Behavior

The cluster should maintain availability and allow normal file uploads from the application even when one node becomes unavailable, as long as the quorum and data redundancy requirements are met.
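
To make the "quorum and data redundancy requirements" concrete, here is a minimal sketch of the arithmetic. It assumes a quorum rule commonly used by erasure-coded object stores (reads need as many disks as there are data shards; writes need the same, plus one when data and parity counts are equal) and that all 16 disks sit in a single erasure set. Whether rustfs follows this rule, and which parity it picks for this layout, are assumptions and are not stated in this report.

```python
def quorums(total_disks: int, parity: int) -> tuple:
    """Return (read_quorum, write_quorum) under the assumed rule."""
    data = total_disks - parity
    read_quorum = data
    # Assumed tie-breaker: when data == parity, writes need one extra disk.
    write_quorum = data + 1 if data == parity else data
    return read_quorum, write_quorum


def survives_node_loss(nodes: int, disks_per_node: int, parity: int,
                       failed_nodes: int = 1) -> tuple:
    """Return (reads_ok, writes_ok) after losing `failed_nodes` whole nodes."""
    total = nodes * disks_per_node
    remaining = total - failed_nodes * disks_per_node
    read_quorum, write_quorum = quorums(total, parity)
    return remaining >= read_quorum, remaining >= write_quorum


# 4 nodes x 4 disks, hypothetical parity of 4: one node down leaves 12 disks.
print(survives_node_loss(4, 4, parity=4))  # (True, True)
# With a hypothetical parity of 2, losing 4 disks would break quorum.
print(survives_node_loss(4, 4, parity=2))  # (False, False)
```

Under these assumptions, losing one node is survivable only if the parity is at least 4, which is the condition the expected behavior relies on.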

Actual Behavior

  • Case 1: Terminating the rustfs process (kill)
    • Action: We terminate the rustfs process on one of the machines.
    • Result: The application continues to upload files normally. This behavior is as expected.
  • Case 2: Abruptly powering off a machine
    • Action: We abruptly power off one of the machines (e.g., by cutting the power).
    • Result: The application can no longer upload files. In addition, the Console Web UI of every rustfs node becomes unresponsive (it appears hung), and the entire cluster seems to stop responding.

Problem Summary

The cluster exhibits different behaviors depending on how a node is taken offline:

  • It keeps working after the rustfs process on one node is terminated (kill).
  • It becomes unresponsive and does not recover when a node is abruptly powered off.

This suggests a potential issue in how the cluster handles the sudden, ungraceful failure of a node, possibly related to:

  • Network partition detection and handling.
  • Quorum calculation and re-election processes.
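  • Session or connection state management when a node vanishes without closing its connections. After a kill, the operating system still closes the process's sockets, so peers are notified at once; after a power-off, peers receive nothing.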
  • Timeouts for detecting a truly dead node.
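
The last two points can be illustrated with a small local analogy (not a reproduction of the rustfs behavior). It uses Python's standard `socket.socketpair()`: closing one end stands in for a killed process, whose sockets the OS closes, while leaving the other end open but silent stands in for a powered-off machine. The helper name `observe_peer` and the 0.2 s timeout are illustrative choices.

```python
import socket


def observe_peer(sock: socket.socket, timeout_s: float) -> str:
    """Report what the surviving side sees on a connection within timeout_s."""
    sock.settimeout(timeout_s)
    try:
        data = sock.recv(1)
    except socket.timeout:
        # Nothing arrived and no close was signalled: the peer is just silent.
        return "silent"
    except ConnectionError:
        return "closed"
    # An empty read means the peer's end was closed in an orderly way.
    return "closed" if data == b"" else "data"


def demo() -> tuple:
    # "kill": the peer's socket is closed, so the survivor is told immediately.
    a, b = socket.socketpair()
    b.close()
    killed = observe_peer(a, 0.2)
    a.close()

    # "power-off": the peer never closes and never sends anything.
    c, d = socket.socketpair()
    powered_off = observe_peer(c, 0.2)
    c.close()
    d.close()
    return killed, powered_off


print(demo())  # ('closed', 'silent')
```

In the "silent" case the only thing that ends the wait is the timeout. Without one, `recv` would simply block, which is consistent with a cluster that hangs rather than fails.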

Steps to Reproduce

  1. Deploy a 4-node rustfs cluster (4 machines, each with 4 disks).
  2. Verify the cluster is healthy and the application can upload files.
  3. Abruptly power off one of the machines (e.g., by cutting the power).
  4. Observe that the application can no longer upload files.
  5. Attempt to access the Console Web UI of any rustfs node. It will be unresponsive.
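
For step 5, a bounded probe can help distinguish "hung" from "refused" when checking each node. The sketch below uses only Python's standard `http.client`; the function name `check_endpoint`, the plain-HTTP assumption, and any host or port passed to it are placeholders rather than rustfs specifics.

```python
import http.client
import socket


def check_endpoint(host: str, port: int, path: str = "/",
                   timeout_s: float = 3.0) -> str:
    """Classify an HTTP endpoint as 'responsive', 'timeout', or 'unreachable'."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout_s)
    try:
        conn.request("GET", path)
        conn.getresponse()  # any status line counts as an answer
        return "responsive"
    except socket.timeout:
        # No connection or no reply within the deadline (the "hung" symptom).
        return "timeout"
    except (OSError, http.client.HTTPException):
        # Actively refused, reset, or otherwise failed right away.
        return "unreachable"
    finally:
        conn.close()
```

Calling this for every node after the power-off (for example `check_endpoint("node2.example", 9001)`, with hypothetical host and port) would show whether the surviving nodes time out or fail fast.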

Additional Information

  • Cluster Size: 4 nodes.
  • Disks per Node: 4.
  • Failure Mode: The key difference is between a process-level kill and a hardware-level power-off.

Metadata

Labels

  • S-reproducing: Status: Reproducing a bug report
  • bug: Something isn't working
