Conversation

Copilot AI commented Nov 17, 2025

Type of Change

  • Bug Fix

Related Issues

Summary of Changes

Problem: When a node rejoins after being offline during writes, reads succeed using available shards but never trigger healing. Missing shards on the rejoined node remain unrecovered indefinitely, degrading redundancy protection.

Root Cause: Self-heal was only triggered on decode errors. If enough shards exist to satisfy the read quorum (e.g., 3 of 4 data shards), the read succeeds silently despite the missing shards.

Fix: Added proactive missing-shard detection in get_object_with_fileinfo:

// After verifying that the read quorum is satisfied.
// Note: `nil_count` here is assumed to count the shards that are actually
// available (non-nil), which is what the quorum check below relies on.
let total_shards = erasure.data_shards + erasure.parity_shards;
let missing_shards = total_shards - nil_count;
if missing_shards > 0 && nil_count >= erasure.data_shards {
    // Readable but incomplete: trigger a low-priority background heal so the
    // missing shards are rebuilt without waiting for a decode error.
    send_heal_request(..., HealChannelPriority::Low, ...)
}

Uses low priority to avoid interfering with critical heal operations. Restores full redundancy automatically on first read after node recovery.
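
For illustration only, here is a minimal, self-contained sketch of the same decision expressed as a pure function, assuming simple integer shard counts; the names ErasureLayout and should_trigger_heal are hypothetical and not taken from the codebase.

/// Hypothetical, simplified model of the erasure layout used only for this sketch.
struct ErasureLayout {
    data_shards: usize,
    parity_shards: usize,
}

/// Decide whether a successful read should still enqueue a background heal:
/// the read quorum is met, yet at least one shard is absent.
fn should_trigger_heal(layout: &ErasureLayout, available_shards: usize) -> bool {
    let total_shards = layout.data_shards + layout.parity_shards;
    let missing_shards = total_shards.saturating_sub(available_shards);
    missing_shards > 0 && available_shards >= layout.data_shards
}

fn main() {
    // Example layout: 4 data + 2 parity shards.
    let layout = ErasureLayout { data_shards: 4, parity_shards: 2 };

    assert!(should_trigger_heal(&layout, 5));  // readable, but one shard missing -> heal
    assert!(!should_trigger_heal(&layout, 6)); // fully intact -> no heal
    assert!(!should_trigger_heal(&layout, 3)); // below read quorum -> the read fails and the existing decode-error path applies
    println!("missing-shard detection sketch behaves as expected");
}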

Checklist

  • I have read and followed the CONTRIBUTING.md guidelines
  • Passed make pre-commit
  • Added/updated necessary tests
  • Documentation updated (if needed)
  • CI/CD passed (if applicable)

Impact

  • Breaking change (compatibility)
  • Requires doc/config/deployment update
  • Other impact:

Additional Notes

Changed file: crates/ecstore/src/set_disk.rs (+27 lines)

Scenario this fixes:

  1. node2 offline
  2. Write myname.zip → shards on node1,3,4 only
  3. node2 rejoins
  4. GET myname.zip → now triggers heal, rebuilds node2 shards (see the sketch below)
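
As a rough illustration of step 4, the sketch below models per-node shard presence for this scenario and reports which nodes would need healing; all names are hypothetical and not taken from the codebase.

/// Which 1-based node indices are missing their shard for an object,
/// given a per-node presence flag (a simplified, hypothetical model).
fn missing_shard_nodes(present: &[bool]) -> Vec<usize> {
    present
        .iter()
        .enumerate()
        .filter_map(|(i, &ok)| (!ok).then_some(i + 1))
        .collect()
}

fn main() {
    // myname.zip was written while node2 was offline: shards exist on nodes 1, 3, 4 only.
    let present = [true, false, true, true];
    let targets = missing_shard_nodes(&present);
    // With the fix, the first GET after node2 rejoins queues a heal covering node2.
    assert_eq!(targets, vec![2]);
    println!("heal targets after node2 rejoins: {targets:?}");
}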

Thank you for your contribution! Please ensure your PR follows the community standards (CODE_OF_CONDUCT.md) and sign the CLA if this is your first contribution.

Warning

Firewall rules blocked me from connecting to one or more addresses.

I tried to connect to the following addresses, but was blocked by firewall rules:

  • example.com
    • Triggering command: /home/REDACTED/work/rustfs/rustfs/target/debug/deps/rustfs_ecstore-8b0e737f8ad72232 (dns block)
  • example.org
    • Triggering command: /home/REDACTED/work/rustfs/rustfs/target/debug/deps/rustfs_ecstore-8b0e737f8ad72232 (dns block)
  • server
    • Triggering command: /home/REDACTED/work/rustfs/rustfs/target/debug/deps/rustfs_ecstore-8b0e737f8ad72232 (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

Original prompt

This section details the original issue you should resolve.

<issue_title>Data was not automatically recovered during the disconnection</issue_title>
<issue_description>

Configuration

  • Startup Arguments: http://node{1...4}:9000/data/rustfs{0...3}
    (This implies a 4-node cluster, with 4 drives/volumes per node.)

Steps to Reproduce

  1. Simulate Failure: node2 experienced an outage and was taken offline.
  2. Write While Degraded: While the cluster was in a degraded state (with node2 offline), I successfully uploaded a new object, myname.zip.
  3. Recovery: node2 was repaired and brought back online, rejoining the cluster.
  4. Trigger Read: I then performed a GET request to download the myname.zip object.

Expected Behavior

My expectation is that reading the object (myname.zip) would trigger a self-heal process. The system should detect that node2 is back online but missing the newly written shards. It should then reconstruct the missing data shards for myname.zip and populate (write) them to the appropriate drives on node2 (i.e., /data/rustfs{0...3}).

Actual Result

The data shards for myname.zip are still missing from all drives on node2. The read operation succeeded (likely by reading from the other nodes), but it did not trigger the expected data reconstruction and healing process for the newly rejoined node.</issue_description>

<agent_instructions>Based on the error described in the issue, read the project source code and analyze in depth why the problem occurred, what its causes were, and how to prevent it from recurring, then determine how to fix and improve it cleanly and completely. First provide a repair plan, then carry out a complete, thorough, and elegant fix according to that plan.</agent_instructions>

Comments on the Issue (you are @copilot in this section)

@loverustfs We will retest this issue. If it recurs, we will fix the bug.

houseme changed the base branch from main to copilot/fix-upload-freeze-issue November 17, 2025 07:08
houseme and others added 3 commits November 17, 2025 15:08
- Added proactive heal detection in get_object_with_fileinfo
- When reading an object, now checks if any shards are missing even if read succeeds
- Sends low-priority heal request to reconstruct missing shards on rejoined nodes
- This fixes the issue where data written during node outage is not healed when node rejoins

Co-authored-by: houseme <[email protected]>
Copilot AI changed the title from "[WIP] Fix data recovery issue after node disconnection" to "Fix: Trigger self-heal on read when shards missing from rejoined nodes" on Nov 17, 2025
Copilot AI requested a review from houseme November 17, 2025 07:28
houseme and others added 2 commits November 17, 2025 16:14
* Initial plan

* Replace CRC libraries with unified crc-fast implementation

Co-authored-by: houseme <[email protected]>

* fix

* fix: replace low to Normal

---------

Co-authored-by: copilot-swe-agent[bot] <[email protected]>
Co-authored-by: houseme <[email protected]>
Co-authored-by: houseme <[email protected]>
houseme marked this pull request as ready for review November 17, 2025 14:53
houseme merged commit a6fd558 into copilot/fix-upload-freeze-issue Nov 17, 2025
houseme deleted the copilot/fix-data-recovery-during-disconnection branch November 17, 2025 14:54
houseme added a commit that referenced this pull request Nov 17, 2025
* Initial plan

* Fix large file upload freeze by increasing StreamReader buffer size

Co-authored-by: houseme <[email protected]>

* Add comprehensive documentation for large file upload freeze fix

Co-authored-by: houseme <[email protected]>

* upgrade s3s version

* Fix compilation error: use BufReader instead of non-existent StreamReader::with_capacity

Co-authored-by: houseme <[email protected]>

* Update documentation with correct BufReader implementation

Co-authored-by: houseme <[email protected]>

* add tokio feature `io-util`

* Implement adaptive buffer sizing based on file size

Co-authored-by: houseme <[email protected]>

* Constants are managed uniformly and fmt code

* fix

* Fix: Trigger self-heal on read when shards missing from rejoined nodes (#871)

* Initial plan

* Fix: Trigger self-heal when missing shards detected during read

- Added proactive heal detection in get_object_with_fileinfo
- When reading an object, now checks if any shards are missing even if read succeeds
- Sends low-priority heal request to reconstruct missing shards on rejoined nodes
- This fixes the issue where data written during node outage is not healed when node rejoins

Co-authored-by: houseme <[email protected]>

* fix

* Unify CRC implementations to crc-fast (#873)

* Initial plan

* Replace CRC libraries with unified crc-fast implementation

Co-authored-by: houseme <[email protected]>

* fix

* fix: replace low to Normal

---------

Co-authored-by: copilot-swe-agent[bot] <[email protected]>
Co-authored-by: houseme <[email protected]>
Co-authored-by: houseme <[email protected]>

---------

Co-authored-by: copilot-swe-agent[bot] <[email protected]>
Co-authored-by: houseme <[email protected]>
Co-authored-by: houseme <[email protected]>

---------

Co-authored-by: copilot-swe-agent[bot] <[email protected]>
Co-authored-by: houseme <[email protected]>
Co-authored-by: houseme <[email protected]>