Fix: Implement priority-based heal queue with comprehensive diagnostic logging #884
Conversation
Co-authored-by: houseme <[email protected]>
@copilot Please continue to improve and refine this.
- Add priority-based queue statistics tracking
- Implement queue capacity warnings (>80% full)
- Process multiple tasks per cycle when capacity allows
- Add proper error logging for failed heal request submissions
- Add Hash trait to HealPriority for HashMap support
- Improve observability with detailed queue status logs

Co-authored-by: houseme <[email protected]>
@copilot After testing and verification, the change above did not fix the problem. Please analyze the problem and the project code in greater depth, and produce a detailed, in-depth repair plan before making the fix.
I've performed a deep analysis of the heal flow and identified several potential issues. To provide a targeted fix, I need to understand exactly where the failure occurs.

Questions to help diagnose:
Potential root causes identified:
Next steps:
Working on the fix now...
- Add detailed logging for shard availability checks during reads
- Log successful heal request submissions
- Add detailed logging in heal_object for disk checking
- Log each disk's heal decision with reason and endpoint
- Add logging for heal writing and renaming phases
- Improve variable naming (nil_count -> available_shards)
- Add success confirmation messages for renamed healed data

These logs will help diagnose where the heal flow is failing.

Co-authored-by: houseme <[email protected]>
…cess

- **Logging & Observability**:
  - Upgrade critical heal logs from [info](/crates/e2e_test/src/reliant/node_interact_test.rs:196:0-213:1) to `warn` for better visibility.
  - Implement structured logging with `tracing` fields for machine readability.
  - Add `#[tracing::instrument]` to [HealTask](/crates/ahm/src/heal/task.rs:182:0-205:1) and [SetDisks](/crates/ecstore/src/set_disk.rs:120:0-131:1) methods for automatic context propagation.
- **Robustness**:
  - Add exponential backoff retry (3 attempts) for acquiring write locks in [heal_object](/crates/ahm/src/heal/storage.rs:438:4-460:5) to handle contention.
  - Handle [rename_data](/crates/ecstore/src/set_disk.rs:392:4-516:5) failures gracefully by preserving temporary files instead of forcing deletion, preventing potential data loss.
- **Data Safety**:
  - Fix [object_exists](/crates/ahm/src/heal/storage.rs:395:4-412:5) to propagate IO errors instead of treating them as "object not found".
  - Update [ErasureSetHealer](/crates/ahm/src/heal/erasure_healer.rs:28:0-33:1) to mark objects as failed rather than skipped when existence checks error, ensuring they are tracked for retry.
Type of Change
Related Issues
Summary of Changes
This PR addresses issue #829 where data written while a node is offline is not automatically healed to the rejoined node, even after reading the object. While PR #871 added code to trigger heal requests during reads, the heal queue was using simple FIFO processing that ignored priority levels. After initial implementation and user testing feedback, comprehensive diagnostic logging was added to enable precise identification of where the heal flow may fail.
Problem Analysis
Issue #829: After a node rejoins following an outage, data written while the node was offline is not automatically healed to the rejoined node, even after reading the object.
PR #871 Fix: Added code to detect missing shards during reads and trigger heal requests with `HealChannelPriority::Normal` (changed from `Low`).

Remaining Issues:

- The heal queue was processed FIFO (`VecDeque::pop_front()`), completely ignoring priority fields
- With `Normal` priority, requests waited behind all previously queued tasks

Root Cause

`HealManager`'s `heal_queue` used `VecDeque<HealRequest>` but processed items with `pop_front()`, completely ignoring the `priority` field.

Changes Made
Core Priority Queue Implementation:
New `PriorityHealQueue` structure:

- `BinaryHeap` for priority-based ordering
- `HashSet` for automatic deduplication of duplicate heal requests
- Priority order: `Urgent > High > Normal > Low`

Updated `HealPriority`:

- Added `Copy` and `Hash` traits to avoid ownership issues

Updated `HealManager`:

- Changed `heal_queue` from `VecDeque` to `PriorityHealQueue`
- Updated `submit_heal_request` with deduplication support
- Updated `process_heal_queue` to process by priority

Enhanced Monitoring & Observability:

- Added `get_priority_stats()` to track request counts by priority

Adaptive Task Processing:

Robust Error Handling:

Comprehensive Diagnostic Logging:

Read path diagnostics:

Heal execution diagnostics:

- Logging around `disks_with_all_parts`

Heal writing diagnostics:

Comprehensive Testing:
Checklist
- `make pre-commit`

Impact
Additional Notes
Test Results
Performance Impact
- `BinaryHeap` operations (O(log n))

Behavior Changes
Diagnostic Guidance
The enhanced logging enables precise diagnosis of heal flow issues. After testing, check logs for:
- `"Detected missing shards during read"` - confirms heal is triggered
- `"Successfully enqueued heal request"` - confirms the queue accepts the request
- `"Starting heal task"` - confirms HealManager processes the queue
- `"Disk X heal check: should_heal=true"` - identifies which disks need healing
- `"Successfully renamed healed data"` - confirms data written to disk

This diagnostic information helps identify exactly where in the heal flow any issues occur, enabling targeted fixes.
Thank you for your contribution! Please ensure your PR follows the community standards (CODE_OF_CONDUCT.md) and sign the CLA if this is your first contribution.