Copilot AI commented Nov 18, 2025

Type of Change

  • Bug Fix
  • Performance Improvement
  • New Feature
  • Documentation
  • Test/CI
  • Refactor
  • Other:

Related Issues

Summary of Changes

This PR addresses issue #829 where data written while a node is offline is not automatically healed to the rejoined node, even after reading the object. While PR #871 added code to trigger heal requests during reads, the heal queue was using simple FIFO processing that ignored priority levels. After initial implementation and user testing feedback, comprehensive diagnostic logging was added to enable precise identification of where the heal flow may fail.

Problem Analysis

Issue #829: After a node rejoins following an outage, data written while the node was offline is not automatically healed to the rejoined node, even after reading the object.

PR #871 Fix: Added code to detect missing shards during reads and trigger heal requests with HealChannelPriority::Normal (changed from Low).

Remaining Issues:

  1. Heal queue used simple FIFO (VecDeque::pop_front()), completely ignoring priority fields
  2. Even with Normal priority, requests waited behind all previously queued tasks
  3. No priority-based scheduling mechanism
  4. High-priority heal requests couldn't be processed faster
  5. Insufficient diagnostic information to identify where heal flow failures occur

Root Cause

  1. Non-priority queue: HealManager's heal_queue used VecDeque<HealRequest> but processed items with pop_front(), completely ignoring the priority field
  2. Single queue approach: All heal requests entered one queue regardless of priority
  3. No preemption mechanism: No mechanism to prioritize high-priority tasks over low-priority ones
  4. Limited observability: Lack of detailed logging made it difficult to diagnose heal flow issues

Changes Made

Core Priority Queue Implementation:

  1. New PriorityHealQueue structure:

    • Uses BinaryHeap for priority-based ordering
    • Maintains sequence numbers for FIFO ordering within same priority
    • Uses HashSet for automatic deduplication of duplicate heal requests
    • Supports priority ordering: Urgent > High > Normal > Low
  2. Updated HealPriority:

    • Added Copy and Hash traits to avoid ownership issues
    • Maintains backward compatibility
  3. Updated HealManager:

    • Changed heal_queue from VecDeque to PriorityHealQueue
    • Updated submit_heal_request with deduplication support
    • Updated process_heal_queue to process by priority
    • Updated auto disk scanner to use new queue API
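The structure described above can be sketched as a minimal standalone version. `PriorityHealQueue` and `HealPriority` are names from the PR, but the fields, the dedup key, and the `Ord` implementation here are assumptions used to illustrate the design (max-heap by priority, FIFO within a priority via sequence numbers, `HashSet` deduplication):

```rust
use std::cmp::Ordering;
use std::collections::{BinaryHeap, HashSet};

// Hypothetical mirror of the PR's priority type; Copy + Hash as described.
#[derive(Clone, Copy, PartialEq, Eq, Hash, PartialOrd, Ord, Debug)]
pub enum HealPriority { Low, Normal, High, Urgent }

#[derive(Eq, PartialEq)]
struct QueuedRequest {
    priority: HealPriority,
    seq: u64,    // insertion order, for FIFO within the same priority
    key: String, // dedup key, e.g. "bucket/object"
}

// Higher priority first; within the same priority, lower seq (older) first.
impl Ord for QueuedRequest {
    fn cmp(&self, other: &Self) -> Ordering {
        self.priority
            .cmp(&other.priority)
            .then_with(|| other.seq.cmp(&self.seq))
    }
}
impl PartialOrd for QueuedRequest {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

pub struct PriorityHealQueue {
    heap: BinaryHeap<QueuedRequest>,
    pending: HashSet<String>, // rejects duplicate requests for the same object
    next_seq: u64,
}

impl PriorityHealQueue {
    pub fn new() -> Self {
        Self { heap: BinaryHeap::new(), pending: HashSet::new(), next_seq: 0 }
    }

    /// Returns false if an identical request is already queued.
    pub fn push(&mut self, key: &str, priority: HealPriority) -> bool {
        if !self.pending.insert(key.to_string()) {
            return false;
        }
        self.heap.push(QueuedRequest {
            priority,
            seq: self.next_seq,
            key: key.to_string(),
        });
        self.next_seq += 1;
        true
    }

    /// Pops the oldest request of the highest queued priority.
    pub fn pop(&mut self) -> Option<(String, HealPriority)> {
        let req = self.heap.pop()?;
        self.pending.remove(&req.key);
        Some((req.key, req.priority))
    }

    pub fn len(&self) -> usize { self.heap.len() }
}
```

Because `BinaryHeap` is a max-heap, deriving `Ord` on the enum with variants declared `Low` through `Urgent` gives the `Urgent > High > Normal > Low` ordering for free, and reversing the sequence comparison preserves FIFO order among equal priorities.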

Enhanced Monitoring & Observability:

  1. Queue monitoring features:
    • Added get_priority_stats() to track request counts by priority
    • Logs queue statistics when adding high/urgent priority tasks
    • Warns when queue capacity exceeds 80% full
    • Logs queue status when backlog exceeds 10 items
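The per-priority statistics and the 80%-capacity warning can be sketched as below. `get_priority_stats()` is the PR's name; the signatures here, and modeling the queue contents as a slice of priorities, are simplifying assumptions to keep the sketch self-contained:

```rust
use std::collections::HashMap;

// Hypothetical mirror of the PR's priority type (Hash enables HashMap keys).
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
pub enum HealPriority { Low, Normal, High, Urgent }

/// Counts queued requests by priority, as get_priority_stats() is described to do.
pub fn priority_stats(queued: &[HealPriority]) -> HashMap<HealPriority, usize> {
    let mut stats = HashMap::new();
    for p in queued {
        *stats.entry(*p).or_insert(0) += 1;
    }
    stats
}

/// Mirrors the ">80% full" warning threshold, using integer math to avoid
/// floating-point comparisons.
pub fn should_warn(len: usize, capacity: usize) -> bool {
    capacity > 0 && len * 100 > capacity * 80
}
```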

Adaptive Task Processing:

  1. Improved throughput:
    • Process up to 2 tasks per cycle instead of 1
    • Intelligent slot allocation based on available capacity
    • Improves throughput for high-priority heal operations
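The slot-allocation rule reduces to a small helper like the following; `max_concurrent` and `active` are assumed knobs rather than names taken from the PR:

```rust
/// How many heal tasks to start this cycle: up to 2, bounded by remaining
/// capacity (never exceeding the configured concurrency limit).
pub fn slots_this_cycle(max_concurrent: usize, active: usize) -> usize {
    max_concurrent.saturating_sub(active).min(2)
}
```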

Robust Error Handling:

  1. Better error reporting:
    • Fixed all instances where heal request errors were silently ignored
    • Added contextual warning logs when heal requests fail to enqueue
    • Improved error messages across ecstore operations
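The "log instead of silently ignore" pattern looks roughly like this; `try_enqueue` stands in for the real submit call, and the message text is an assumption:

```rust
// Stand-in for the real heal-request submission, which can fail (e.g. full queue).
fn try_enqueue(queue_full: bool) -> Result<(), String> {
    if queue_full { Err("heal queue is full".into()) } else { Ok(()) }
}

/// Returns whether the request was enqueued; failures are surfaced with
/// context instead of being dropped.
pub fn submit_with_logging(bucket: &str, object: &str, queue_full: bool) -> bool {
    match try_enqueue(queue_full) {
        Ok(()) => true,
        Err(e) => {
            eprintln!("warn: failed to enqueue heal request for {bucket}/{object}: {e}");
            false
        }
    }
}
```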

Comprehensive Diagnostic Logging:

  1. Read path diagnostics:

    • Detailed shard availability logging (total_shards, available_shards, missing_shards)
    • Success confirmation when heal requests are enqueued
    • Improved variable naming for clarity (nil_count → available_shards)
  2. Heal execution diagnostics:

    • Log available disk count vs total disks from disks_with_all_parts
    • Per-disk heal decision logging with reason and endpoint information
    • Summary of how many disks need healing
  3. Heal writing diagnostics:

    • Detailed logging for rename phase with disk endpoints
    • Success/failure logging for each disk's heal operation
    • Clear indication when healed data is successfully written to rejoined nodes
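As a sketch of the read-path diagnostic, the shard-availability fields can be rendered as a single log line. Only the field names (`total_shards`, `available_shards`, `missing_shards`) come from the PR description; the message formatting is an assumption:

```rust
/// Formats the shard-availability diagnostic emitted when a read detects
/// missing shards. Missing count is derived, so the three fields stay consistent.
pub fn shard_availability_log(total: usize, available: usize) -> String {
    let missing = total.saturating_sub(available);
    format!("total_shards={total} available_shards={available} missing_shards={missing}")
}
```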

Comprehensive Testing:

  1. Test coverage:
    • 8 unit tests covering all queue functionality
    • Tests for priority ordering, FIFO within priority, deduplication
    • Tests for queue statistics and empty state checking
    • Tests for different heal types and mixed priorities

Checklist

  • I have read and followed the CONTRIBUTING.md guidelines
  • Passed make pre-commit
  • Added/updated necessary tests
  • Documentation updated (if needed)
  • CI/CD passed (if applicable)

Impact

  • Breaking change (compatibility)
  • Requires doc/config/deployment update
  • Other impact: Internal implementation change with improved observability and performance

Additional Notes

Test Results

  • ✅ All 8 new unit tests pass
  • ✅ All existing tests pass
  • ✅ Clippy checks pass (no warnings)
  • ✅ Code formatting complete

Performance Impact

  • Minimal overhead from BinaryHeap operations (O(log n))
  • Improved throughput from processing multiple tasks per cycle
  • Better resource utilization through priority-based scheduling

Behavior Changes

  • Heal requests now processed by priority instead of FIFO order
  • Duplicate heal requests automatically deduplicated
  • Queue capacity warnings help prevent queue overflow
  • Failed heal request submissions now logged instead of silently ignored
  • Comprehensive logging at all heal flow stages for diagnostics

Diagnostic Guidance

The enhanced logging enables precise diagnosis of heal flow issues. After testing, check logs for:

  1. Read detection: "Detected missing shards during read" - confirms heal is triggered
  2. Request acceptance: "Successfully enqueued heal request" - confirms queue accepts request
  3. Task creation: "Starting heal task" - confirms HealManager processes queue
  4. Disk marking: "Disk X heal check: should_heal=true" - identifies which disks need healing
  5. Heal completion: "Successfully renamed healed data" - confirms data written to disk

This diagnostic information helps identify exactly where in the heal flow any issues occur, enabling targeted fixes.



Original prompt

Issue #829: after the fix in #871, the problem still exists. Please analyze the code further and produce a detailed, complete fix. Even after `HealChannelPriority::Low` was changed to `HealChannelPriority::Normal`, and after waiting a while, no healed data appears on the newly added machine. Please analyze and interpret the code in depth and implement a thorough, complete, and elegant fix.




houseme commented Nov 18, 2025

@copilot Further improvements and optimizations:

- Add priority-based queue statistics tracking
- Implement queue capacity warnings (>80% full)
- Process multiple tasks per cycle when capacity allows
- Add proper error logging for failed heal request submissions
- Add Hash trait to HealPriority for HashMap support
- Improve observability with detailed queue status logs

Co-authored-by: houseme <[email protected]>

Copilot AI commented Nov 18, 2025

@copilot Further improvements and optimizations

Completed the further improvements and optimizations, mainly:

  1. Queue monitoring and observability: added per-priority statistics, capacity warnings (>80% full), and detailed queue status logs
  2. Adaptive task processing: up to 2 tasks are processed per cycle to improve throughput
  3. Improved error handling: fixed every place where heal request errors were ignored and added appropriate warning logs
  4. Enhanced test coverage: added tests for queue statistics and the is_empty check

See commit 6d3299d for details.

Copilot AI changed the title [WIP] Analyze and repair heal channel priority issues Fix: Implement priority-based heal queue with monitoring and adaptive processing Nov 18, 2025

houseme commented Nov 18, 2025

@copilot After testing and verification, the changes above did not fix the problem. Please analyze the problem and the project code in greater depth, construct a detailed and thorough repair plan, and then implement the fix.


Copilot AI commented Nov 18, 2025

@copilot After testing and verification, the above transformation did not successfully fix the problem. Please further analyze the problem and project code in depth, and conduct an in-depth and detail...

I've performed a deep analysis of the heal flow and identified several potential issues. To provide a targeted fix, I need to understand exactly where the failure occurs:

Questions to help diagnose:

  1. Are heal requests being triggered during reads? (Check logs for "Detected missing shards during read")
  2. Are heal tasks being created and executed? (Check logs for "Starting heal task")
  3. Are heal operations completing? (Check logs for "Heal completed successfully" or "Heal task failed")
  4. If heal tasks are running, what error messages appear?

Potential root causes identified:

  1. Priority queue working correctly - My changes ensure requests are processed by priority
  2. Heal detection during read working - PR #871 ("Fix: Trigger self-heal on read when shards missing from rejoined nodes") triggers heal requests
  3. Possible issue in heal execution - The heal logic might not be identifying the rejoined node as needing data

Next steps:
I'm implementing enhanced diagnostic logging and a potential fix for the heal detection logic. This will include:

  • More detailed logging throughout the heal flow
  • Verification that should_heal_object_on_disk() correctly identifies rejoined nodes
  • Ensure heal writes are actually performed and renamed to final location

Working on the fix now...

- Add detailed logging for shard availability checks during reads
- Log successful heal request submissions
- Add detailed logging in heal_object for disk checking
- Log each disk's heal decision with reason and endpoint
- Add logging for heal writing and renaming phases
- Improve variable naming (nil_count -> available_shards)
- Add success confirmation messages for renamed healed data

These logs will help diagnose where the heal flow is failing.

Co-authored-by: houseme <[email protected]>
Copilot AI changed the title Fix: Implement priority-based heal queue with monitoring and adaptive processing Fix: Implement priority-based heal queue with comprehensive diagnostic logging Nov 18, 2025
- **Logging & Observability**:
  - Upgrade critical heal logs from `info` to `warn` for better visibility.
  - Implement structured logging with `tracing` fields for machine readability.
  - Add `#[tracing::instrument]` to `HealTask` (`crates/ahm/src/heal/task.rs`) and `SetDisks` (`crates/ecstore/src/set_disk.rs`) methods for automatic context propagation.

- **Robustness**:
  - Add exponential backoff retry (3 attempts) for acquiring write locks in `heal_object` (`crates/ahm/src/heal/storage.rs`) to handle contention.
  - Handle `rename_data` (`crates/ecstore/src/set_disk.rs`) failures gracefully by preserving temporary files instead of forcing deletion, preventing potential data loss.

- **Data Safety**:
  - Fix `object_exists` (`crates/ahm/src/heal/storage.rs`) to propagate IO errors instead of treating them as "object not found".
  - Update `ErasureSetHealer` (`crates/ahm/src/heal/erasure_healer.rs`) to mark objects as failed rather than skipped when existence checks error, ensuring they are tracked for retry.
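The write-lock retry described above follows a standard exponential-backoff shape. This is a minimal synchronous sketch (the real code path is presumably async); the generic helper, its delays, and its name are assumptions, with only the 3-attempt count coming from the commit message:

```rust
use std::thread::sleep;
use std::time::Duration;

/// Retries `op` up to `attempts` times, doubling the delay between attempts.
/// Returns the first success, or the last error once attempts are exhausted.
pub fn retry_with_backoff<T, E>(
    attempts: u32,
    base_delay: Duration,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut delay = base_delay;
    let mut last_err = None;
    for i in 0..attempts {
        match op() {
            Ok(v) => return Ok(v),
            Err(e) => {
                last_err = Some(e);
                if i + 1 < attempts {
                    sleep(delay);
                    delay *= 2; // exponential growth between attempts
                }
            }
        }
    }
    Err(last_err.expect("attempts must be > 0"))
}
```

Used for lock acquisition, transient contention clears within a retry or two while persistent failure still surfaces as an error.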
@houseme houseme marked this pull request as ready for review November 19, 2025 16:18
@houseme houseme merged commit 277d80d into main Nov 19, 2025
14 checks passed
@houseme houseme deleted the copilot/analyze-and-repair-heal-channel branch November 19, 2025 16:36


Development

Successfully merging this pull request may close these issues.

Data was not automatically recovered during the disconnection
