-
Notifications
You must be signed in to change notification settings - Fork 759
fix: resolve logic errors in ahm heal module #788
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
- Fix response publishing logic in HealChannelProcessor to properly handle errors - Fix negative index handling in DiskStatusChange event to fail fast instead of silently converting to 0 - Enhance timeout control in heal_erasure_set Step 3 loop to immediately respond to cancellation/timeout - Add proper error propagation for task cancellation and timeout in bucket healing loop
- Increase measurement count from 3 to 5 runs for better stability - Increase workload from 5000 to 10000 operations for more accurate timing - Use median of 5 measurements instead of single measurement - Ensure with_scanner duration is at least baseline to avoid negative overhead - Increase wait time for scanner state stabilization
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR enhances the heal channel system by adding broadcast support for heal responses and introduces utilities for standardizing set disk identifiers. Key changes include timeout/cancellation control for heal tasks and improvements to test stability.
- Added global broadcast channel for heal responses to support multiple subscribers
- Introduced utilities for parsing, formatting, and normalizing set disk identifiers
- Implemented timeout and cancellation control mechanisms for heal tasks
- Improved integration test stability by using multiple measurements and median values
Reviewed Changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
crates/common/src/heal_channel.rs |
Adds broadcast channel infrastructure for heal responses with publish/subscribe functions and test coverage |
crates/ahm/src/heal/utils.rs |
New utility module for standardized set disk identifier formatting, parsing, and normalization |
crates/ahm/src/heal/task.rs |
Implements timeout/cancellation control with await_with_control helper and proper error propagation |
crates/ahm/src/heal/storage.rs |
Refactored to use new utility functions for parsing set disk identifiers |
crates/ahm/src/heal/resume.rs |
Updated ResumeState to include set_disk_id field for better task identification |
crates/ahm/src/heal/manager.rs |
Updated to use canonical set disk identifier format from utility functions |
crates/ahm/src/heal/event.rs |
Updated to use utility functions for set disk identifier formatting |
crates/ahm/src/heal/erasure_healer.rs |
Enhanced task ID generation and resume state initialization with set disk identifiers |
crates/ahm/src/heal/channel.rs |
Refactored to publish responses to both local channel and global broadcast |
crates/ahm/tests/integration_tests.rs |
Improved test stability using multiple measurements and median calculation |
crates/ahm/src/heal/mod.rs |
Added new utils module to public interface |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
crates/ahm/src/heal/event.rs
Outdated
| .ok_or_else(|| Error::InvalidHealType { | ||
| heal_type: format!("Invalid pool/set indices: pool={}, set={}", endpoint.pool_idx, endpoint.set_idx), | ||
| }) | ||
| .expect("Invalid pool/set indices for disk status change event"); |
Copilot
AI
Nov 4, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using both ok_or_else() and expect() is redundant. The ok_or_else() creates an Error, but expect() will panic anyway if the value is None. Either use just ok_or_else() and propagate the error with ?, or use expect() directly with a descriptive message. The current code defeats the purpose of constructing a proper error type.
| .ok_or_else(|| Error::InvalidHealType { | |
| heal_type: format!("Invalid pool/set indices: pool={}, set={}", endpoint.pool_idx, endpoint.set_idx), | |
| }) | |
| .expect("Invalid pool/set indices for disk status change event"); | |
| .expect(&format!("Invalid pool/set indices for disk status change event: pool={}, set={}", endpoint.pool_idx, endpoint.set_idx)); |
- Keep Error import as it's required for InvalidHealType usage - Maintain compatibility with main branch changes
Co-authored-by: Copilot <[email protected]>
Replace redundant ok_or_else() + expect() pattern with unwrap_or_else() + panic!() to avoid creating unnecessary Error type when the value will panic anyway. This also defers error message formatting until the error actually occurs.
Co-authored-by: Copilot <[email protected]>
- Fix panic in HealEvent::to_heal_request for invalid indices - Replace unwrap() calls with proper error handling in resume.rs - Fix race conditions and timeout calculation in task.rs - Fix semaphore acquisition error handling in erasure_healer.rs - Improve error message for large objects in storage.rs - Add comprehensive unit tests for progress, event, and channel modules - Fix clippy warning: move test module to end of file in heal_channel.rs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 14 out of 14 changed files in this pull request and generated 7 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Baseline test: no scanner load - measure multiple times for stability | ||
| let mut baseline_measurements = Vec::new(); | ||
| for _ in 0..5 { | ||
| let duration = measure_workload(10_000, Duration::ZERO).await; | ||
| baseline_measurements.push(duration); | ||
| } | ||
| // Use median to reduce impact of outliers | ||
| baseline_measurements.sort(); | ||
| let baseline_duration = baseline_measurements[2].max(Duration::from_millis(20)); |
Copilot
AI
Nov 4, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code hardcodes the median index as [2] for a collection of 5 measurements. If the loop count changes in the future, this index will be incorrect. Consider using a constant for the measurement count or computing the median index dynamically: let median_idx = baseline_measurements.len() / 2;.
| // Performance test: with scanner load - measure multiple times for stability | ||
| let mut scanner_measurements = Vec::new(); | ||
| for _ in 0..5 { | ||
| let duration = measure_workload(10_000, Duration::ZERO).await; | ||
| scanner_measurements.push(duration); | ||
| } | ||
| scanner_measurements.sort(); | ||
| let with_scanner_duration = scanner_measurements[2].max(baseline_duration); |
Copilot
AI
Nov 4, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same issue as the baseline measurements: the median index [2] is hardcoded. Consider using a constant or computing the median index dynamically.
- Add comment to check_control_flags explaining why return value is discarded - Fix hardcoded median index in performance test using constant and dynamic calculation
| let file_path = Path::new(BUCKET_META_PREFIX).join(format!("{task_id}_{RESUME_CHECKPOINT_FILE}")); | ||
| match disk.read_all(RUSTFS_META_BUCKET, file_path.to_str().unwrap()).await { | ||
| Ok(data) => !data.is_empty(), | ||
| match path_to_str(&file_path) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does error information need to be recorded?
Summary
This PR fixes several critical logic errors identified in the AHM (Automatic Health Management) heal module code review.
Changes
1. Response Publishing Logic Fix
crates/ahm/src/heal/channel.rs2. Negative Index Handling Fix
crates/ahm/src/heal/event.rs3. Timeout Control Enhancement
crates/ahm/src/heal/task.rsheal_erasure_setdidn't properly handle cancellation/timeout errorsTesting
Verification Commands
make fmt-check cargo clippy --all-targets --all-features --package rustfs-ahm -- -D warnings cargo test --package rustfs-ahm --lib cargo check --all-targets --package rustfs-ahmImpact