Conversation

Copilot AI commented Nov 17, 2025

Type of Change

  • Bug Fix

Related Issues

Summary of Changes

Problem: When a node rejoins after being offline during writes, reads succeed using available shards but never trigger healing. Missing shards on the rejoined node remain unrecovered indefinitely, degrading redundancy protection.

Root Cause: Self-heal was only triggered on decode errors. If enough shards exist to satisfy the read quorum (e.g., 3 of 4 data shards), the read succeeds silently despite the missing shards.

Fix: Added proactive missing-shard detection in get_object_with_fileinfo:

// After verifying that the read quorum is satisfied.
// Note: `nil_count` here is assumed to count the shards that are actually
// available (non-nil), which is what the quorum check below relies on.
let total_shards = erasure.data_shards + erasure.parity_shards;
let missing_shards = total_shards - nil_count;
if missing_shards > 0 && nil_count >= erasure.data_shards {
    // Readable but incomplete: trigger a low-priority background heal so the
    // missing shards are rebuilt without waiting for a decode error.
    send_heal_request(..., HealChannelPriority::Low, ...)
}

Uses low priority to avoid interfering with critical heal operations. Restores full redundancy automatically on first read after node recovery.
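
For illustration only, here is a minimal, self-contained sketch of the same decision expressed as a pure function, assuming simple integer shard counts; the names ErasureLayout and should_trigger_heal are hypothetical and not taken from the codebase.

/// Hypothetical, simplified model of the erasure layout used only for this sketch.
struct ErasureLayout {
    data_shards: usize,
    parity_shards: usize,
}

/// Decide whether a successful read should still enqueue a background heal:
/// the read quorum is met, yet at least one shard is absent.
fn should_trigger_heal(layout: &ErasureLayout, available_shards: usize) -> bool {
    let total_shards = layout.data_shards + layout.parity_shards;
    let missing_shards = total_shards.saturating_sub(available_shards);
    missing_shards > 0 && available_shards >= layout.data_shards
}

fn main() {
    // Example layout: 4 data + 2 parity shards.
    let layout = ErasureLayout { data_shards: 4, parity_shards: 2 };

    assert!(should_trigger_heal(&layout, 5));  // readable, but one shard missing -> heal
    assert!(!should_trigger_heal(&layout, 6)); // fully intact -> no heal
    assert!(!should_trigger_heal(&layout, 3)); // below read quorum -> the read fails and the existing decode-error path applies
    println!("missing-shard detection sketch behaves as expected");
}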

Checklist

  • I have read and followed the CONTRIBUTING.md guidelines
  • Passed make pre-commit
  • Added/updated necessary tests
  • Documentation updated (if needed)
  • CI/CD passed (if applicable)

Impact

  • Breaking change (compatibility)
  • Requires doc/config/deployment update
  • Other impact:

Additional Notes

Changed file: crates/ecstore/src/set_disk.rs (+27 lines)

Scenario this fixes:

  1. node2 offline
  2. Write myname.zip → shards on node1,3,4 only
  3. node2 rejoins
  4. GET myname.zip → now triggers heal, rebuilds node2 shards (see the sketch below)
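
As a rough illustration of step 4, the sketch below models per-node shard presence for this scenario and reports which nodes would need healing; all names are hypothetical and not taken from the codebase.

/// Which 1-based node indices are missing their shard for an object,
/// given a per-node presence flag (a simplified, hypothetical model).
fn missing_shard_nodes(present: &[bool]) -> Vec<usize> {
    present
        .iter()
        .enumerate()
        .filter_map(|(i, &ok)| (!ok).then_some(i + 1))
        .collect()
}

fn main() {
    // myname.zip was written while node2 was offline: shards exist on nodes 1, 3, 4 only.
    let present = [true, false, true, true];
    let targets = missing_shard_nodes(&present);
    // With the fix, the first GET after node2 rejoins queues a heal covering node2.
    assert_eq!(targets, vec![2]);
    println!("heal targets after node2 rejoins: {targets:?}");
}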

Thank you for your contribution! Please ensure your PR follows the community standards (CODE_OF_CONDUCT.md) and sign the CLA if this is your first contribution.

Warning

Firewall rules blocked me from connecting to one or more addresses.

I tried to connect to the following addresses, but was blocked by firewall rules:

  • example.com
    • Triggering command: /home/REDACTED/work/rustfs/rustfs/target/debug/deps/rustfs_ecstore-8b0e737f8ad72232 (dns block)
  • example.org
    • Triggering command: /home/REDACTED/work/rustfs/rustfs/target/debug/deps/rustfs_ecstore-8b0e737f8ad72232 (dns block)
  • server
    • Triggering command: /home/REDACTED/work/rustfs/rustfs/target/debug/deps/rustfs_ecstore-8b0e737f8ad72232 (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

Original prompt

This section details the original issue you should resolve.

<issue_title>Data was not automatically recovered during the disconnection</issue_title>
<issue_description>

Configuration

  • Startup Arguments: http://node{1...4}:9000/data/rustfs{0...3}
    (This implies a 4-node cluster, with 4 drives/volumes per node.)

Steps to Reproduce

  1. Simulate Failure: node2 experienced an outage and was taken offline.
  2. Write While Degraded: While the cluster was in a degraded state (with node2 offline), I successfully uploaded a new object, myname.zip.
  3. Recovery: node2 was repaired and brought back online, rejoining the cluster.
  4. Trigger Read: I then performed a GET request to download the myname.zip object.

Expected Behavior

My expectation is that reading the object (myname.zip) would trigger a self-heal process. The system should detect that node2 is back online but missing the newly written shards. It should then reconstruct the missing data shards for myname.zip and populate (write) them to the appropriate drives on node2 (i.e., /data/rustfs{0...3}).

Actual Result

The data shards for myname.zip are still missing from all drives on node2. The read operation succeeded (likely by reading from the other nodes), but it did not trigger the expected data reconstruction and healing process for the newly rejoined node.</issue_description>

<agent_instructions>Based on the error described in the issue, read the project source code and analyze in depth why the problem occurred, what its causes were, and how to prevent it from recurring, then determine how to fix and improve it cleanly and completely. First provide a repair plan, then carry out a complete, thorough, and elegant fix according to that plan.</agent_instructions>

Comments on the Issue (you are @copilot in this section)

@loverustfs We will retest this issue. If it recurs, we will fix the bug.

houseme changed the base branch from main to copilot/fix-upload-freeze-issue November 17, 2025 07:08
houseme and others added 3 commits November 17, 2025 15:08
- Added proactive heal detection in get_object_with_fileinfo
- When reading an object, now checks if any shards are missing even if read succeeds
- Sends low-priority heal request to reconstruct missing shards on rejoined nodes
- This fixes the issue where data written during node outage is not healed when node rejoins

Co-authored-by: houseme <[email protected]>
Copilot AI changed the title from "[WIP] Fix data recovery issue after node disconnection" to "Fix: Trigger self-heal on read when shards missing from rejoined nodes" on Nov 17, 2025
Copilot AI requested a review from houseme November 17, 2025 07:28
houseme and others added 2 commits November 17, 2025 16:14
* Initial plan

* Replace CRC libraries with unified crc-fast implementation

Co-authored-by: houseme <[email protected]>

* fix

* fix: replace low to Normal

---------

Co-authored-by: copilot-swe-agent[bot] <[email protected]>
Co-authored-by: houseme <[email protected]>
Co-authored-by: houseme <[email protected]>
houseme marked this pull request as ready for review November 17, 2025 14:53
houseme merged commit a6fd558 into copilot/fix-upload-freeze-issue Nov 17, 2025
houseme deleted the copilot/fix-data-recovery-during-disconnection branch November 17, 2025 14:54
houseme added a commit that referenced this pull request Nov 17, 2025
* Initial plan

* Fix large file upload freeze by increasing StreamReader buffer size

Co-authored-by: houseme <[email protected]>

* Add comprehensive documentation for large file upload freeze fix

Co-authored-by: houseme <[email protected]>

* upgrade s3s version

* Fix compilation error: use BufReader instead of non-existent StreamReader::with_capacity

Co-authored-by: houseme <[email protected]>

* Update documentation with correct BufReader implementation

Co-authored-by: houseme <[email protected]>

* add tokio feature `io-util`

* Implement adaptive buffer sizing based on file size

Co-authored-by: houseme <[email protected]>

* Constants are managed uniformly and fmt code

* fix

* Fix: Trigger self-heal on read when shards missing from rejoined nodes (#871)

* Initial plan

* Fix: Trigger self-heal when missing shards detected during read

- Added proactive heal detection in get_object_with_fileinfo
- When reading an object, now checks if any shards are missing even if read succeeds
- Sends low-priority heal request to reconstruct missing shards on rejoined nodes
- This fixes the issue where data written during node outage is not healed when node rejoins

Co-authored-by: houseme <[email protected]>

* fix

* Unify CRC implementations to crc-fast (#873)

* Initial plan

* Replace CRC libraries with unified crc-fast implementation

Co-authored-by: houseme <[email protected]>

* fix

* fix: replace low to Normal

---------

Co-authored-by: copilot-swe-agent[bot] <[email protected]>
Co-authored-by: houseme <[email protected]>
Co-authored-by: houseme <[email protected]>

---------

Co-authored-by: copilot-swe-agent[bot] <[email protected]>
Co-authored-by: houseme <[email protected]>
Co-authored-by: houseme <[email protected]>

---------

Co-authored-by: copilot-swe-agent[bot] <[email protected]>
Co-authored-by: houseme <[email protected]>
Co-authored-by: houseme <[email protected]>