Copilot AI commented Nov 18, 2025

Type of Change

  • Bug Fix
  • Performance Improvement
  • New Feature
  • Documentation
  • Test/CI
  • Refactor
  • Other:

Related Issues

Summary of Changes

This PR addresses issue #829 where data written while a node is offline is not automatically healed to the rejoined node, even after reading the object. While PR #871 added code to trigger heal requests during reads, the heal queue was using simple FIFO processing that ignored priority levels. After initial implementation and user testing feedback, comprehensive diagnostic logging was added to enable precise identification of where the heal flow may fail.

Problem Analysis

Issue #829: After a node rejoins following an outage, data written while the node was offline is not automatically healed to the rejoined node, even after reading the object.

PR #871 Fix: Added code to detect missing shards during reads and trigger heal requests with HealChannelPriority::Normal (changed from Low).

Remaining Issues:

  1. Heal queue used simple FIFO (VecDeque::pop_front()), completely ignoring priority fields
  2. Even with Normal priority, requests waited behind all previously queued tasks
  3. No priority-based scheduling mechanism
  4. High-priority heal requests couldn't be processed faster
  5. Insufficient diagnostic information to identify where heal flow failures occur

Root Cause

  1. Non-priority queue: HealManager's heal_queue used VecDeque<HealRequest> but processed items with pop_front(), completely ignoring the priority field
  2. Single queue approach: All heal requests entered one queue regardless of priority
  3. No preemption mechanism: No mechanism to prioritize high-priority tasks over low-priority ones
  4. Limited observability: Lack of detailed logging made it difficult to diagnose heal flow issues

Changes Made

Core Priority Queue Implementation:

  1. New PriorityHealQueue structure:

    • Uses BinaryHeap for priority-based ordering
    • Maintains sequence numbers for FIFO ordering within same priority
    • Uses HashSet for automatic deduplication of duplicate heal requests
    • Supports priority ordering: Urgent > High > Normal > Low
  2. Updated HealPriority:

    • Added Copy and Hash traits to avoid ownership issues
    • Maintains backward compatibility
  3. Updated HealManager:

    • Changed heal_queue from VecDeque to PriorityHealQueue
    • Updated submit_heal_request with deduplication support
    • Updated process_heal_queue to process by priority
    • Updated auto disk scanner to use new queue API
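The structure described above can be sketched as a minimal standalone version. `PriorityHealQueue` and `HealPriority` are names from the PR, but the fields, the dedup key, and the `Ord` implementation here are assumptions used to illustrate the design (max-heap by priority, FIFO within a priority via sequence numbers, `HashSet` deduplication):

```rust
use std::cmp::Ordering;
use std::collections::{BinaryHeap, HashSet};

// Hypothetical mirror of the PR's priority type; Copy + Hash as described.
#[derive(Clone, Copy, PartialEq, Eq, Hash, PartialOrd, Ord, Debug)]
pub enum HealPriority { Low, Normal, High, Urgent }

#[derive(Eq, PartialEq)]
struct QueuedRequest {
    priority: HealPriority,
    seq: u64,    // insertion order, for FIFO within the same priority
    key: String, // dedup key, e.g. "bucket/object"
}

// Higher priority first; within the same priority, lower seq (older) first.
impl Ord for QueuedRequest {
    fn cmp(&self, other: &Self) -> Ordering {
        self.priority
            .cmp(&other.priority)
            .then_with(|| other.seq.cmp(&self.seq))
    }
}
impl PartialOrd for QueuedRequest {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

pub struct PriorityHealQueue {
    heap: BinaryHeap<QueuedRequest>,
    pending: HashSet<String>, // rejects duplicate requests for the same object
    next_seq: u64,
}

impl PriorityHealQueue {
    pub fn new() -> Self {
        Self { heap: BinaryHeap::new(), pending: HashSet::new(), next_seq: 0 }
    }

    /// Returns false if an identical request is already queued.
    pub fn push(&mut self, key: &str, priority: HealPriority) -> bool {
        if !self.pending.insert(key.to_string()) {
            return false;
        }
        self.heap.push(QueuedRequest {
            priority,
            seq: self.next_seq,
            key: key.to_string(),
        });
        self.next_seq += 1;
        true
    }

    /// Pops the oldest request of the highest queued priority.
    pub fn pop(&mut self) -> Option<(String, HealPriority)> {
        let req = self.heap.pop()?;
        self.pending.remove(&req.key);
        Some((req.key, req.priority))
    }

    pub fn len(&self) -> usize { self.heap.len() }
}
```

Because `BinaryHeap` is a max-heap, deriving `Ord` on the enum with variants declared `Low` through `Urgent` gives the `Urgent > High > Normal > Low` ordering for free, and reversing the sequence comparison preserves FIFO order among equal priorities.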

Enhanced Monitoring & Observability:

  1. Queue monitoring features:
    • Added get_priority_stats() to track request counts by priority
    • Logs queue statistics when adding high/urgent priority tasks
    • Warns when queue capacity exceeds 80% full
    • Logs queue status when backlog exceeds 10 items
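The per-priority statistics and the 80%-capacity warning can be sketched as below. `get_priority_stats()` is the PR's name; the signatures here, and modeling the queue contents as a slice of priorities, are simplifying assumptions to keep the sketch self-contained:

```rust
use std::collections::HashMap;

// Hypothetical mirror of the PR's priority type (Hash enables HashMap keys).
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
pub enum HealPriority { Low, Normal, High, Urgent }

/// Counts queued requests by priority, as get_priority_stats() is described to do.
pub fn priority_stats(queued: &[HealPriority]) -> HashMap<HealPriority, usize> {
    let mut stats = HashMap::new();
    for p in queued {
        *stats.entry(*p).or_insert(0) += 1;
    }
    stats
}

/// Mirrors the ">80% full" warning threshold, using integer math to avoid
/// floating-point comparisons.
pub fn should_warn(len: usize, capacity: usize) -> bool {
    capacity > 0 && len * 100 > capacity * 80
}
```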

Adaptive Task Processing:

  1. Improved throughput:
    • Process up to 2 tasks per cycle instead of 1
    • Intelligent slot allocation based on available capacity
    • Improves throughput for high-priority heal operations
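The slot-allocation rule reduces to a small helper like the following; `max_concurrent` and `active` are assumed knobs rather than names taken from the PR:

```rust
/// How many heal tasks to start this cycle: up to 2, bounded by remaining
/// capacity (never exceeding the configured concurrency limit).
pub fn slots_this_cycle(max_concurrent: usize, active: usize) -> usize {
    max_concurrent.saturating_sub(active).min(2)
}
```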

Robust Error Handling:

  1. Better error reporting:
    • Fixed all instances where heal request errors were silently ignored
    • Added contextual warning logs when heal requests fail to enqueue
    • Improved error messages across ecstore operations
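The "log instead of silently ignore" pattern looks roughly like this; `try_enqueue` stands in for the real submit call, and the message text is an assumption:

```rust
// Stand-in for the real heal-request submission, which can fail (e.g. full queue).
fn try_enqueue(queue_full: bool) -> Result<(), String> {
    if queue_full { Err("heal queue is full".into()) } else { Ok(()) }
}

/// Returns whether the request was enqueued; failures are surfaced with
/// context instead of being dropped.
pub fn submit_with_logging(bucket: &str, object: &str, queue_full: bool) -> bool {
    match try_enqueue(queue_full) {
        Ok(()) => true,
        Err(e) => {
            eprintln!("warn: failed to enqueue heal request for {bucket}/{object}: {e}");
            false
        }
    }
}
```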

Comprehensive Diagnostic Logging:

  1. Read path diagnostics:

    • Detailed shard availability logging (total_shards, available_shards, missing_shards)
    • Success confirmation when heal requests are enqueued
    • Improved variable naming for clarity (nil_count → available_shards)
  2. Heal execution diagnostics:

    • Log available disk count vs total disks from disks_with_all_parts
    • Per-disk heal decision logging with reason and endpoint information
    • Summary of how many disks need healing
  3. Heal writing diagnostics:

    • Detailed logging for rename phase with disk endpoints
    • Success/failure logging for each disk's heal operation
    • Clear indication when healed data is successfully written to rejoined nodes
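As a sketch of the read-path diagnostic, the shard-availability fields can be rendered as a single log line. Only the field names (`total_shards`, `available_shards`, `missing_shards`) come from the PR description; the message formatting is an assumption:

```rust
/// Formats the shard-availability diagnostic emitted when a read detects
/// missing shards. Missing count is derived, so the three fields stay consistent.
pub fn shard_availability_log(total: usize, available: usize) -> String {
    let missing = total.saturating_sub(available);
    format!("total_shards={total} available_shards={available} missing_shards={missing}")
}
```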

Comprehensive Testing:

  1. Test coverage:
    • 8 unit tests covering all queue functionality
    • Tests for priority ordering, FIFO within priority, deduplication
    • Tests for queue statistics and empty state checking
    • Tests for different heal types and mixed priorities

Checklist

  • I have read and followed the CONTRIBUTING.md guidelines
  • Passed make pre-commit
  • Added/updated necessary tests
  • Documentation updated (if needed)
  • CI/CD passed (if applicable)

Impact

  • Breaking change (compatibility)
  • Requires doc/config/deployment update
  • Other impact: Internal implementation change with improved observability and performance

Additional Notes

Test Results

  • ✅ All 8 new unit tests pass
  • ✅ All existing tests pass
  • ✅ Clippy checks pass (no warnings)
  • ✅ Code formatting complete

Performance Impact

  • Minimal overhead from BinaryHeap operations (O(log n))
  • Improved throughput from processing multiple tasks per cycle
  • Better resource utilization through priority-based scheduling

Behavior Changes

  • Heal requests now processed by priority instead of FIFO order
  • Duplicate heal requests automatically deduplicated
  • Queue capacity warnings help prevent queue overflow
  • Failed heal request submissions now logged instead of silently ignored
  • Comprehensive logging at all heal flow stages for diagnostics

Diagnostic Guidance

The enhanced logging enables precise diagnosis of heal flow issues. After testing, check logs for:

  1. Read detection: "Detected missing shards during read" - confirms heal is triggered
  2. Request acceptance: "Successfully enqueued heal request" - confirms queue accepts request
  3. Task creation: "Starting heal task" - confirms HealManager processes queue
  4. Disk marking: "Disk X heal check: should_heal=true" - identifies which disks need healing
  5. Heal completion: "Successfully renamed healed data" - confirms data written to disk

This diagnostic information helps identify exactly where in the heal flow any issues occur, enabling targeted fixes.



Original prompt

Issue #829: after the fix in #871, the problem still exists. Please analyze the code further and produce a detailed, complete fix. Even after `HealChannelPriority::Low` was changed to `HealChannelPriority::Normal`, and after waiting a while, no healed data appears on the newly added machine. Please analyze and interpret the code in depth and implement a thorough, complete, and elegant fix.




houseme commented Nov 18, 2025

@copilot Further improvements and optimizations:

- Add priority-based queue statistics tracking
- Implement queue capacity warnings (>80% full)
- Process multiple tasks per cycle when capacity allows
- Add proper error logging for failed heal request submissions
- Add Hash trait to HealPriority for HashMap support
- Improve observability with detailed queue status logs

Co-authored-by: houseme <[email protected]>

Copilot AI commented Nov 18, 2025

@copilot Further improvements and optimizations

Completed the further improvements and optimizations, mainly:

  1. Queue monitoring and observability: added per-priority statistics, capacity warnings (>80% full), and detailed queue status logs
  2. Adaptive task processing: up to 2 tasks are processed per cycle to improve throughput
  3. Improved error handling: fixed every place where heal request errors were ignored and added appropriate warning logs
  4. Enhanced test coverage: added tests for queue statistics and the is_empty check

See commit 6d3299d for details.

Copilot AI changed the title [WIP] Analyze and repair heal channel priority issues Fix: Implement priority-based heal queue with monitoring and adaptive processing Nov 18, 2025

houseme commented Nov 18, 2025

@copilot After testing and verification, the changes above did not fix the problem. Please analyze the problem and the project code in greater depth, construct a detailed and thorough repair plan, and then implement the fix.


Copilot AI commented Nov 18, 2025

@copilot After testing and verification, the above transformation did not successfully fix the problem. Please further analyze the problem and project code in depth, and conduct an in-depth and detail...

I've performed a deep analysis of the heal flow and identified several potential issues. To provide a targeted fix, I need to understand exactly where the failure occurs:

Questions to help diagnose:

  1. Are heal requests being triggered during reads? (Check logs for "Detected missing shards during read")
  2. Are heal tasks being created and executed? (Check logs for "Starting heal task")
  3. Are heal operations completing? (Check logs for "Heal completed successfully" or "Heal task failed")
  4. If heal tasks are running, what error messages appear?

Potential root causes identified:

  1. Priority queue working correctly - My changes ensure requests are processed by priority
  2. Heal detection during read working - PR #871 ("Fix: Trigger self-heal on read when shards missing from rejoined nodes") triggers heal requests
  3. Possible issue in heal execution - The heal logic might not be identifying the rejoined node as needing data

Next steps:
I'm implementing enhanced diagnostic logging and a potential fix for the heal detection logic. This will include:

  • More detailed logging throughout the heal flow
  • Verification that should_heal_object_on_disk() correctly identifies rejoined nodes
  • Ensure heal writes are actually performed and renamed to final location

Working on the fix now...

- Add detailed logging for shard availability checks during reads
- Log successful heal request submissions
- Add detailed logging in heal_object for disk checking
- Log each disk's heal decision with reason and endpoint
- Add logging for heal writing and renaming phases
- Improve variable naming (nil_count -> available_shards)
- Add success confirmation messages for renamed healed data

These logs will help diagnose where the heal flow is failing.

Co-authored-by: houseme <[email protected]>
Copilot AI changed the title Fix: Implement priority-based heal queue with monitoring and adaptive processing Fix: Implement priority-based heal queue with comprehensive diagnostic logging Nov 18, 2025
- **Logging & Observability**:
  - Upgrade critical heal logs from `info` to `warn` for better visibility.
  - Implement structured logging with `tracing` fields for machine readability.
  - Add `#[tracing::instrument]` to `HealTask` (`crates/ahm/src/heal/task.rs`) and `SetDisks` (`crates/ecstore/src/set_disk.rs`) methods for automatic context propagation.

- **Robustness**:
  - Add exponential backoff retry (3 attempts) for acquiring write locks in `heal_object` (`crates/ahm/src/heal/storage.rs`) to handle contention.
  - Handle `rename_data` (`crates/ecstore/src/set_disk.rs`) failures gracefully by preserving temporary files instead of forcing deletion, preventing potential data loss.

- **Data Safety**:
  - Fix `object_exists` (`crates/ahm/src/heal/storage.rs`) to propagate IO errors instead of treating them as "object not found".
  - Update `ErasureSetHealer` (`crates/ahm/src/heal/erasure_healer.rs`) to mark objects as failed rather than skipped when existence checks error, ensuring they are tracked for retry.
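The write-lock retry described above follows a standard exponential-backoff shape. This is a minimal synchronous sketch (the real code path is presumably async); the generic helper, its delays, and its name are assumptions, with only the 3-attempt count coming from the commit message:

```rust
use std::thread::sleep;
use std::time::Duration;

/// Retries `op` up to `attempts` times, doubling the delay between attempts.
/// Returns the first success, or the last error once attempts are exhausted.
pub fn retry_with_backoff<T, E>(
    attempts: u32,
    base_delay: Duration,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut delay = base_delay;
    let mut last_err = None;
    for i in 0..attempts {
        match op() {
            Ok(v) => return Ok(v),
            Err(e) => {
                last_err = Some(e);
                if i + 1 < attempts {
                    sleep(delay);
                    delay *= 2; // exponential growth between attempts
                }
            }
        }
    }
    Err(last_err.expect("attempts must be > 0"))
}
```

Used for lock acquisition, transient contention clears within a retry or two while persistent failure still surfaces as an error.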
@houseme houseme marked this pull request as ready for review November 19, 2025 16:18
@houseme houseme merged commit 277d80d into main Nov 19, 2025
14 checks passed
@houseme houseme deleted the copilot/analyze-and-repair-heal-channel branch November 19, 2025 16:36


Development

Successfully merging this pull request may close these issues.

Data was not automatically recovered during the disconnection
