Skip to content

Conversation

@weisd
Copy link
Contributor

@weisd weisd commented Nov 27, 2025

Fix Admin Heal API and Add Pagination Support for Large Buckets

Summary

This PR fixes the admin heal API endpoint and introduces pagination support for healing operations to handle large buckets efficiently without loading all objects into memory.

Changes

1. Admin Heal API Fix

  • Added missing /v3/heal/{bucket} endpoint to support healing entire buckets without requiring a prefix
  • Previously, only /v3/heal/{bucket}/{prefix} was available, which required a prefix parameter

2. Pagination Support for Object Listing

  • New API: Added list_objects_for_heal_page() method to HealStorageAPI trait

    • Returns objects in pages with continuation token support
    • Prevents memory exhaustion when processing buckets with millions of objects
    • Default page size: 1000 objects per page
  • Refactored heal operations:

    • heal_bucket_with_resume(): Now processes objects page-by-page instead of loading all at once
    • heal_bucket(): Updated to use pagination for better memory efficiency
    • Both methods now handle checkpoint/resume correctly with pagination

3. HealScanMode Serialization Improvements

  • Implemented custom Serialize and Deserialize for HealScanMode enum
  • Supports both integer (0, 1, 2) and string ("Unknown", "Normal", "Deep") formats
  • Better compatibility with different API clients

Type of Change

  • New Feature
  • Bug Fix
  • Documentation
  • Performance Improvement
  • Test/CI
  • Refactor
  • Other:

Related Issues

Summary of Changes

Checklist

  • I have read and followed the CONTRIBUTING.md guidelines
  • Passed make pre-commit
  • Added/updated necessary tests
  • Documentation updated (if needed)
  • CI/CD passed (if applicable)

Impact

  • Breaking change (compatibility)
  • Requires doc/config/deployment update
  • Other impact:

Additional Notes


Thank you for your contribution! Please ensure your PR follows the community standards (CODE_OF_CONDUCT.md) and sign the CLA if this is your first contribution.

fix heal bucket objects with pagination
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes the admin heal API by adding a missing endpoint and introduces pagination support for healing operations to efficiently handle large buckets without loading all objects into memory.

  • Added /v3/heal/{bucket} endpoint to heal entire buckets without requiring a prefix parameter
  • Implemented list_objects_for_heal_page() method with continuation token support (page size: 1000 objects)
  • Refactored heal_bucket_with_resume() and heal_bucket() to process objects page-by-page instead of loading all at once
  • Added custom Serialize/Deserialize implementations for HealScanMode enum to support both integer and string formats

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
rustfs/src/admin/mod.rs Added missing /v3/heal/{bucket} endpoint for healing buckets without prefix
crates/common/src/heal_channel.rs Implemented custom serialization for HealScanMode enum to support multiple formats; added serde attributes to HealOpts fields
crates/ahm/src/heal/storage.rs Added list_objects_for_heal_page() method with pagination support; refactored list_objects_for_heal() to use pagination internally
crates/ahm/src/heal/erasure_healer.rs Refactored heal_bucket_with_resume() and heal_bucket() to process objects using pagination instead of loading all objects into memory
crates/ahm/src/heal/manager.rs Enhanced log messages in auto disk scanner for better debugging
crates/ahm/src/scanner/node_scanner.rs Added comment to scanning loop initialization
crates/ahm/src/scanner/data_scanner.rs Commented out immediate data usage collection on startup
crates/ahm/src/heal/channel.rs Improved heal start request log message to include object prefix; updated test mock with new pagination method
crates/ahm/tests/heal_bug_fixes_test.rs Updated test mock to implement new list_objects_for_heal_page() method

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +259 to 276
// 2. process objects with pagination to avoid loading all objects into memory
let mut continuation_token: Option<String> = None;
let mut global_obj_idx = 0usize;

loop {
// Get one page of objects
let (objects, next_token, is_truncated) = self
.storage
.list_objects_for_heal_page(bucket, "", continuation_token.as_deref())
.await?;

// Check if object still exists before attempting heal
let object_exists = match self.storage.object_exists(bucket, object).await {
Ok(exists) => exists,
Err(e) => {
warn!("Failed to check existence of {}/{}: {}, marking as failed", bucket, object, e);
*failed_objects += 1;
checkpoint_manager.add_failed_object(object.clone()).await?;
*current_object_index = obj_idx + 1;
// Process objects in this page
for object in objects {
// Skip objects before the checkpoint
if global_obj_idx < *current_object_index {
global_obj_idx += 1;
continue;
}
Copy link

Copilot AI Nov 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The resume logic with pagination is inefficient. When resuming from checkpoint object index N, the code fetches and iterates through all pages from the beginning, skipping N objects one by one (lines 273-276). For large buckets, this means fetching thousands of objects just to skip them.

Consider using the start_after parameter in list_objects_v2 (which is available but currently passed as None on line 552) to resume directly from the last processed object name instead of skipping by index. This would require storing the last object name in the checkpoint rather than just the index.

Copilot uses AI. Check for mistakes.
}
// check if already processed
if checkpoint_manager.get_checkpoint().await.processed_objects.contains(&object) {
global_obj_idx += 1;
Copy link

Copilot AI Nov 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inconsistent checkpoint index update when skipping already-processed objects. When an object is found in processed_objects set (line 279), the code increments global_obj_idx but doesn't update *current_object_index. This is inconsistent with error handling paths (lines 297, 310) and the successful processing path (line 342), which all update *current_object_index = global_obj_idx.

This inconsistency could cause issues if the heal operation is interrupted and resumed - the checkpoint index might not accurately reflect the actual progress. Consider adding *current_object_index = global_obj_idx; at line 280 for consistency.

Suggested change
global_obj_idx += 1;
global_obj_idx += 1;
*current_object_index = global_obj_idx;

Copilot uses AI. Check for mistakes.
// start scanning loop
let scanner_clone = self.clone_for_background();
tokio::spawn(async move {
// update object count and size for each bucket
Copy link

Copilot AI Nov 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment "update object count and size for each bucket" doesn't accurately describe what the code does. The code is spawning a task that runs scan_loop_with_resume(None), which is a generic scanning loop, not specifically updating object counts and sizes. This comment appears to be a copy-paste error or placeholder. Consider removing it or providing a more accurate description of what scan_loop_with_resume does.

Suggested change
// update object count and size for each bucket
// spawn background scanning loop

Copilot uses AI. Check for mistakes.

// Initialize and start the node scanner
self.node_scanner.initialize_stats().await?;
// update object count and size for each bucket
Copy link

Copilot AI Nov 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment "update object count and size for each bucket" doesn't accurately describe what the code does. The code is calling node_scanner.start(), which initializes and starts the node scanner, not specifically updating object counts and sizes. This comment appears to be a copy-paste error or placeholder. Consider removing it or providing a more accurate description.

Suggested change
// update object count and size for each bucket

Copilot uses AI. Check for mistakes.
@weisd weisd merged commit a8b7b28 into main Dec 3, 2025
14 checks passed
@weisd weisd deleted the weisd/heal branch December 3, 2025 10:10
houseme added a commit that referenced this pull request Dec 3, 2025
…s/rustfs into dependabot/cargo/dependencies-1126

* 'dependabot/cargo/dependencies-1126' of github.com:rustfs/rustfs:
  Fix Admin Heal API and Add Pagination Support for Large Buckets (#933)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants