fix: resolve logic errors in ahm heal module #788

overtrue · 2025-11-04T15:10:30Z

Summary

This PR fixes several critical logic errors identified in the AHM (Automatic Health Management) heal module code review.

Changes

1. Response Publishing Logic Fix

File: crates/ahm/src/heal/channel.rs
Issue: Response publishing had unclear error handling logic
Fix: Improved error handling to ensure broadcast always attempts even if local send fails, with clearer comments

2. Negative Index Handling Fix

File: crates/ahm/src/heal/event.rs
Issue: Negative pool/set indices were silently converted to 0, hiding data problems
Fix: Changed to fail fast with explicit error message when invalid indices are detected

3. Timeout Control Enhancement

File: crates/ahm/src/heal/task.rs
Issue: Step 3 loop in heal_erasure_set didn't properly handle cancellation/timeout errors
Fix: Added immediate error propagation for TaskCancelled and TaskTimeout errors in bucket healing loop

Testing

✅ All unit tests pass (27 passed)
✅ Code formatting verified
✅ Clippy checks pass
✅ Compilation check passes

Verification Commands

make fmt-check
cargo clippy --all-targets --all-features --package rustfs-ahm -- -D warnings
cargo test --package rustfs-ahm --lib
cargo check --all-targets --package rustfs-ahm

Impact

Critical: Fixes potential data corruption issues from invalid index handling
Important: Improves error handling and timeout responsiveness
Low Risk: Changes are localized to error handling paths

- Fix response publishing logic in HealChannelProcessor to properly handle errors - Fix negative index handling in DiskStatusChange event to fail fast instead of silently converting to 0 - Enhance timeout control in heal_erasure_set Step 3 loop to immediately respond to cancellation/timeout - Add proper error propagation for task cancellation and timeout in bucket healing loop

- Increase measurement count from 3 to 5 runs for better stability - Increase workload from 5000 to 10000 operations for more accurate timing - Use median of 5 measurements instead of single measurement - Ensure with_scanner duration is at least baseline to avoid negative overhead - Increase wait time for scanner state stabilization

Copilot

Pull Request Overview

This PR enhances the heal channel system by adding broadcast support for heal responses and introduces utilities for standardizing set disk identifiers. Key changes include timeout/cancellation control for heal tasks and improvements to test stability.

Added global broadcast channel for heal responses to support multiple subscribers
Introduced utilities for parsing, formatting, and normalizing set disk identifiers
Implemented timeout and cancellation control mechanisms for heal tasks
Improved integration test stability by using multiple measurements and median values

Reviewed Changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
`crates/common/src/heal_channel.rs`	Adds broadcast channel infrastructure for heal responses with publish/subscribe functions and test coverage
`crates/ahm/src/heal/utils.rs`	New utility module for standardized set disk identifier formatting, parsing, and normalization
`crates/ahm/src/heal/task.rs`	Implements timeout/cancellation control with `await_with_control` helper and proper error propagation
`crates/ahm/src/heal/storage.rs`	Refactored to use new utility functions for parsing set disk identifiers
`crates/ahm/src/heal/resume.rs`	Updated `ResumeState` to include `set_disk_id` field for better task identification
`crates/ahm/src/heal/manager.rs`	Updated to use canonical set disk identifier format from utility functions
`crates/ahm/src/heal/event.rs`	Updated to use utility functions for set disk identifier formatting
`crates/ahm/src/heal/erasure_healer.rs`	Enhanced task ID generation and resume state initialization with set disk identifiers
`crates/ahm/src/heal/channel.rs`	Refactored to publish responses to both local channel and global broadcast
`crates/ahm/tests/integration_tests.rs`	Improved test stability using multiple measurements and median calculation
`crates/ahm/src/heal/mod.rs`	Added new `utils` module to public interface

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

crates/ahm/src/heal/channel.rs

Copilot · 2025-11-04T21:17:58Z

crates/ahm/src/heal/event.rs

+                    .ok_or_else(|| Error::InvalidHealType {
+                        heal_type: format!("Invalid pool/set indices: pool={}, set={}", endpoint.pool_idx, endpoint.set_idx),
+                    })
+                    .expect("Invalid pool/set indices for disk status change event");


Using both ok_or_else() and expect() is redundant. The ok_or_else() creates an Error, but expect() will panic anyway if the value is None. Either use just ok_or_else() and propagate the error with ?, or use expect() directly with a descriptive message. The current code defeats the purpose of constructing a proper error type.

Suggested change

.ok_or_else(|| Error::InvalidHealType {

heal_type: format!("Invalid pool/set indices: pool={}, set={}", endpoint.pool_idx, endpoint.set_idx),

})

.expect("Invalid pool/set indices for disk status change event");

.expect(&format!("Invalid pool/set indices for disk status change event: pool={}, set={}", endpoint.pool_idx, endpoint.set_idx));

crates/ahm/src/heal/task.rs

- Keep Error import as it's required for InvalidHealType usage - Maintain compatibility with main branch changes

Co-authored-by: Copilot <[email protected]>

Replace redundant ok_or_else() + expect() pattern with unwrap_or_else() + panic!() to avoid creating unnecessary Error type when the value will panic anyway. This also defers error message formatting until the error actually occurs.

Co-authored-by: Copilot <[email protected]>

- Fix panic in HealEvent::to_heal_request for invalid indices - Replace unwrap() calls with proper error handling in resume.rs - Fix race conditions and timeout calculation in task.rs - Fix semaphore acquisition error handling in erasure_healer.rs - Improve error message for large objects in storage.rs - Add comprehensive unit tests for progress, event, and channel modules - Fix clippy warning: move test module to end of file in heal_channel.rs

Copilot

Pull Request Overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 7 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

crates/ahm/src/heal/channel.rs

crates/ahm/src/heal/task.rs

crates/ahm/src/heal/erasure_healer.rs

crates/ahm/src/heal/task.rs

Copilot · 2025-11-04T22:28:32Z

crates/ahm/tests/integration_tests.rs

+    // Baseline test: no scanner load - measure multiple times for stability
+    let mut baseline_measurements = Vec::new();
+    for _ in 0..5 {
+        let duration = measure_workload(10_000, Duration::ZERO).await;
+        baseline_measurements.push(duration);
+    }
+    // Use median to reduce impact of outliers
+    baseline_measurements.sort();
+    let baseline_duration = baseline_measurements[2].max(Duration::from_millis(20));


The code hardcodes the median index as [2] for a collection of 5 measurements. If the loop count changes in the future, this index will be incorrect. Consider using a constant for the measurement count or computing the median index dynamically: let median_idx = baseline_measurements.len() / 2;.

Copilot · 2025-11-04T22:28:32Z

crates/ahm/tests/integration_tests.rs

+    // Performance test: with scanner load - measure multiple times for stability
+    let mut scanner_measurements = Vec::new();
+    for _ in 0..5 {
+        let duration = measure_workload(10_000, Duration::ZERO).await;
+        scanner_measurements.push(duration);
+    }
+    scanner_measurements.sort();
+    let with_scanner_duration = scanner_measurements[2].max(baseline_duration);


Same issue as the baseline measurements: the median index [2] is hardcoded. Consider using a constant or computing the median index dynamically.

- Add comment to check_control_flags explaining why return value is discarded - Fix hardcoded median index in performance test using constant and dynamic calculation

houseme · 2025-11-05T00:15:26Z

crates/ahm/src/heal/resume.rs

        let file_path = Path::new(BUCKET_META_PREFIX).join(format!("{task_id}_{RESUME_CHECKPOINT_FILE}"));
-        match disk.read_all(RUSTFS_META_BUCKET, file_path.to_str().unwrap()).await {
-            Ok(data) => !data.is_empty(),
+        match path_to_str(&file_path) {


Does error information need to be recorded?

overtrue requested review from guojidan, houseme and weisd November 4, 2025 15:11

overtrue added 2 commits November 5, 2025 05:11

wip

f3cc282

overtrue requested a review from Copilot November 4, 2025 21:16

Copilot AI reviewed Nov 4, 2025

View reviewed changes

overtrue and others added 6 commits November 5, 2025 05:18

merge: resolve conflict with main branch in channel.rs

6c87a40

- Keep Error import as it's required for InvalidHealType usage - Maintain compatibility with main branch changes

Update crates/ahm/src/heal/channel.rs

1af2d61

Co-authored-by: Copilot <[email protected]>

Update crates/ahm/src/heal/task.rs

7d39142

Co-authored-by: Copilot <[email protected]>

style: apply cargo fmt formatting

181e2b2

overtrue requested a review from Copilot November 4, 2025 22:25

Copilot AI reviewed Nov 4, 2025

View reviewed changes

refactor(ahm): address copilot review suggestions

7a37e6f

- Add comment to check_control_flags explaining why return value is discarded - Fix hardcoded median index in performance test using constant and dynamic calculation

loverustfs merged commit eb7eb9c into main Nov 5, 2025
14 checks passed

loverustfs deleted the feat/fix-ahm-logic-errors branch November 5, 2025 00:15

houseme reviewed Nov 5, 2025

View reviewed changes

loverustfs mentioned this pull request Nov 5, 2025

What was this zombie process for? What is rustfs curl requesting for? #792

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: resolve logic errors in ahm heal module #788

fix: resolve logic errors in ahm heal module #788

Uh oh!

overtrue commented Nov 4, 2025

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Copilot AI Nov 4, 2025

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI Nov 4, 2025

Uh oh!

Copilot AI Nov 4, 2025

Uh oh!

Uh oh!

houseme Nov 5, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

fix: resolve logic errors in ahm heal module #788

fix: resolve logic errors in ahm heal module #788

Uh oh!

Conversation

overtrue commented Nov 4, 2025

Summary

Changes

1. Response Publishing Logic Fix

2. Negative Index Handling Fix

3. Timeout Control Enhancement

Testing

Verification Commands

Impact

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Uh oh!

Copilot AI Nov 4, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI Nov 4, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 4, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

houseme Nov 5, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants