Skip to content

Conversation

@overtrue
Copy link
Collaborator

@overtrue overtrue commented Nov 4, 2025

Summary

This PR fixes several critical logic errors identified in the AHM (Automatic Health Management) heal module code review.

Changes

1. Response Publishing Logic Fix

  • File: crates/ahm/src/heal/channel.rs
  • Issue: Response publishing had unclear error handling logic
  • Fix: Improved error handling to ensure broadcast always attempts even if local send fails, with clearer comments

2. Negative Index Handling Fix

  • File: crates/ahm/src/heal/event.rs
  • Issue: Negative pool/set indices were silently converted to 0, hiding data problems
  • Fix: Changed to fail fast with explicit error message when invalid indices are detected

3. Timeout Control Enhancement

  • File: crates/ahm/src/heal/task.rs
  • Issue: Step 3 loop in heal_erasure_set didn't properly handle cancellation/timeout errors
  • Fix: Added immediate error propagation for TaskCancelled and TaskTimeout errors in bucket healing loop

Testing

  • ✅ All unit tests pass (27 passed)
  • ✅ Code formatting verified
  • ✅ Clippy checks pass
  • ✅ Compilation check passes

Verification Commands

make fmt-check
cargo clippy --all-targets --all-features --package rustfs-ahm -- -D warnings
cargo test --package rustfs-ahm --lib
cargo check --all-targets --package rustfs-ahm

Impact

  • Critical: Fixes potential data corruption issues from invalid index handling
  • Important: Improves error handling and timeout responsiveness
  • Low Risk: Changes are localized to error handling paths

- Fix response publishing logic in HealChannelProcessor to properly handle errors
- Fix negative index handling in DiskStatusChange event to fail fast instead of silently converting to 0
- Enhance timeout control in heal_erasure_set Step 3 loop to immediately respond to cancellation/timeout
- Add proper error propagation for task cancellation and timeout in bucket healing loop
- Increase measurement count from 3 to 5 runs for better stability
- Increase workload from 5000 to 10000 operations for more accurate timing
- Use median of 5 measurements instead of single measurement
- Ensure with_scanner duration is at least baseline to avoid negative overhead
- Increase wait time for scanner state stabilization
@overtrue overtrue requested a review from Copilot November 4, 2025 21:16
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR enhances the heal channel system by adding broadcast support for heal responses and introduces utilities for standardizing set disk identifiers. Key changes include timeout/cancellation control for heal tasks and improvements to test stability.

  • Added global broadcast channel for heal responses to support multiple subscribers
  • Introduced utilities for parsing, formatting, and normalizing set disk identifiers
  • Implemented timeout and cancellation control mechanisms for heal tasks
  • Improved integration test stability by using multiple measurements and median values

Reviewed Changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
crates/common/src/heal_channel.rs Adds broadcast channel infrastructure for heal responses with publish/subscribe functions and test coverage
crates/ahm/src/heal/utils.rs New utility module for standardized set disk identifier formatting, parsing, and normalization
crates/ahm/src/heal/task.rs Implements timeout/cancellation control with await_with_control helper and proper error propagation
crates/ahm/src/heal/storage.rs Refactored to use new utility functions for parsing set disk identifiers
crates/ahm/src/heal/resume.rs Updated ResumeState to include set_disk_id field for better task identification
crates/ahm/src/heal/manager.rs Updated to use canonical set disk identifier format from utility functions
crates/ahm/src/heal/event.rs Updated to use utility functions for set disk identifier formatting
crates/ahm/src/heal/erasure_healer.rs Enhanced task ID generation and resume state initialization with set disk identifiers
crates/ahm/src/heal/channel.rs Refactored to publish responses to both local channel and global broadcast
crates/ahm/tests/integration_tests.rs Improved test stability using multiple measurements and median calculation
crates/ahm/src/heal/mod.rs Added new utils module to public interface

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 152 to 155
.ok_or_else(|| Error::InvalidHealType {
heal_type: format!("Invalid pool/set indices: pool={}, set={}", endpoint.pool_idx, endpoint.set_idx),
})
.expect("Invalid pool/set indices for disk status change event");
Copy link

Copilot AI Nov 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using both ok_or_else() and expect() is redundant. The ok_or_else() creates an Error, but expect() will panic anyway if the value is None. Either use just ok_or_else() and propagate the error with ?, or use expect() directly with a descriptive message. The current code defeats the purpose of constructing a proper error type.

Suggested change
.ok_or_else(|| Error::InvalidHealType {
heal_type: format!("Invalid pool/set indices: pool={}, set={}", endpoint.pool_idx, endpoint.set_idx),
})
.expect("Invalid pool/set indices for disk status change event");
.expect(&format!("Invalid pool/set indices for disk status change event: pool={}, set={}", endpoint.pool_idx, endpoint.set_idx));

Copilot uses AI. Check for mistakes.
overtrue and others added 6 commits November 5, 2025 05:18
- Keep Error import as it's required for InvalidHealType usage
- Maintain compatibility with main branch changes
Replace redundant ok_or_else() + expect() pattern with
unwrap_or_else() + panic!() to avoid creating unnecessary Error
type when the value will panic anyway. This also defers error
message formatting until the error actually occurs.
- Fix panic in HealEvent::to_heal_request for invalid indices
- Replace unwrap() calls with proper error handling in resume.rs
- Fix race conditions and timeout calculation in task.rs
- Fix semaphore acquisition error handling in erasure_healer.rs
- Improve error message for large objects in storage.rs
- Add comprehensive unit tests for progress, event, and channel modules
- Fix clippy warning: move test module to end of file in heal_channel.rs
@overtrue overtrue requested a review from Copilot November 4, 2025 22:25
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 7 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 245 to 253
// Baseline test: no scanner load - measure multiple times for stability
let mut baseline_measurements = Vec::new();
for _ in 0..5 {
let duration = measure_workload(10_000, Duration::ZERO).await;
baseline_measurements.push(duration);
}
// Use median to reduce impact of outliers
baseline_measurements.sort();
let baseline_duration = baseline_measurements[2].max(Duration::from_millis(20));
Copy link

Copilot AI Nov 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code hardcodes the median index as [2] for a collection of 5 measurements. If the loop count changes in the future, this index will be incorrect. Consider using a constant for the measurement count or computing the median index dynamically: let median_idx = baseline_measurements.len() / 2;.

Copilot uses AI. Check for mistakes.
Comment on lines 260 to 267
// Performance test: with scanner load - measure multiple times for stability
let mut scanner_measurements = Vec::new();
for _ in 0..5 {
let duration = measure_workload(10_000, Duration::ZERO).await;
scanner_measurements.push(duration);
}
scanner_measurements.sort();
let with_scanner_duration = scanner_measurements[2].max(baseline_duration);
Copy link

Copilot AI Nov 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same issue as the baseline measurements: the median index [2] is hardcoded. Consider using a constant or computing the median index dynamically.

Copilot uses AI. Check for mistakes.
- Add comment to check_control_flags explaining why return value is discarded
- Fix hardcoded median index in performance test using constant and dynamic calculation
@loverustfs loverustfs merged commit eb7eb9c into main Nov 5, 2025
14 checks passed
@loverustfs loverustfs deleted the feat/fix-ahm-logic-errors branch November 5, 2025 00:15
let file_path = Path::new(BUCKET_META_PREFIX).join(format!("{task_id}_{RESUME_CHECKPOINT_FILE}"));
match disk.read_all(RUSTFS_META_BUCKET, file_path.to_str().unwrap()).await {
Ok(data) => !data.is_empty(),
match path_to_str(&file_path) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does error information need to be recorded?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants