Fix Admin Heal API and Add Pagination Support for Large Buckets #933

weisd · 2025-11-27T06:39:04Z

Fix Admin Heal API and Add Pagination Support for Large Buckets

Summary

This PR fixes the admin heal API endpoint and introduces pagination support for healing operations to handle large buckets efficiently without loading all objects into memory.

Changes

1. Admin Heal API Fix

Added missing /v3/heal/{bucket} endpoint to support healing entire buckets without requiring a prefix
Previously, only /v3/heal/{bucket}/{prefix} was available, which required a prefix parameter

2. Pagination Support for Object Listing

New API: Added list_objects_for_heal_page() method to HealStorageAPI trait
- Returns objects in pages with continuation token support
- Prevents memory exhaustion when processing buckets with millions of objects
- Default page size: 1000 objects per page
Refactored heal operations:
- heal_bucket_with_resume(): Now processes objects page-by-page instead of loading all at once
- heal_bucket(): Updated to use pagination for better memory efficiency
- Both methods now handle checkpoint/resume correctly with pagination

3. HealScanMode Serialization Improvements

Implemented custom Serialize and Deserialize for HealScanMode enum
Supports both integer (0, 1, 2) and string ("Unknown", "Normal", "Deep") formats
Better compatibility with different API clients

Type of Change

Related Issues

Summary of Changes

Checklist

I have read and followed the CONTRIBUTING.md guidelines
Passed make pre-commit
Added/updated necessary tests
Documentation updated (if needed)
CI/CD passed (if applicable)

Impact

Breaking change (compatibility)
Requires doc/config/deployment update
Other impact:

Additional Notes

Thank you for your contribution! Please ensure your PR follows the community standards (CODE_OF_CONDUCT.md) and sign the CLA if this is your first contribution.

fix heal bucket objects with pagination

Copilot

Pull request overview

This PR fixes the admin heal API by adding a missing endpoint and introduces pagination support for healing operations to efficiently handle large buckets without loading all objects into memory.

Added /v3/heal/{bucket} endpoint to heal entire buckets without requiring a prefix parameter
Implemented list_objects_for_heal_page() method with continuation token support (page size: 1000 objects)
Refactored heal_bucket_with_resume() and heal_bucket() to process objects page-by-page instead of loading all at once
Added custom Serialize/Deserialize implementations for HealScanMode enum to support both integer and string formats

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
rustfs/src/admin/mod.rs	Added missing `/v3/heal/{bucket}` endpoint for healing buckets without prefix
crates/common/src/heal_channel.rs	Implemented custom serialization for `HealScanMode` enum to support multiple formats; added serde attributes to `HealOpts` fields
crates/ahm/src/heal/storage.rs	Added `list_objects_for_heal_page()` method with pagination support; refactored `list_objects_for_heal()` to use pagination internally
crates/ahm/src/heal/erasure_healer.rs	Refactored `heal_bucket_with_resume()` and `heal_bucket()` to process objects using pagination instead of loading all objects into memory
crates/ahm/src/heal/manager.rs	Enhanced log messages in auto disk scanner for better debugging
crates/ahm/src/scanner/node_scanner.rs	Added comment to scanning loop initialization
crates/ahm/src/scanner/data_scanner.rs	Commented out immediate data usage collection on startup
crates/ahm/src/heal/channel.rs	Improved heal start request log message to include object prefix; updated test mock with new pagination method
crates/ahm/tests/heal_bug_fixes_test.rs	Updated test mock to implement new `list_objects_for_heal_page()` method

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

crates/ahm/src/scanner/data_scanner.rs

Copilot · 2025-11-27T06:43:23Z

crates/ahm/src/heal/erasure_healer.rs

+        // 2. process objects with pagination to avoid loading all objects into memory
+        let mut continuation_token: Option<String> = None;
+        let mut global_obj_idx = 0usize;
+
+        loop {
+            // Get one page of objects
+            let (objects, next_token, is_truncated) = self
+                .storage
+                .list_objects_for_heal_page(bucket, "", continuation_token.as_deref())
                .await?;

-            // Check if object still exists before attempting heal
-            let object_exists = match self.storage.object_exists(bucket, object).await {
-                Ok(exists) => exists,
-                Err(e) => {
-                    warn!("Failed to check existence of {}/{}: {}, marking as failed", bucket, object, e);
-                    *failed_objects += 1;
-                    checkpoint_manager.add_failed_object(object.clone()).await?;
-                    *current_object_index = obj_idx + 1;
+            // Process objects in this page
+            for object in objects {
+                // Skip objects before the checkpoint
+                if global_obj_idx < *current_object_index {
+                    global_obj_idx += 1;
                    continue;
                }


The resume logic with pagination is inefficient. When resuming from checkpoint object index N, the code fetches and iterates through all pages from the beginning, skipping N objects one by one (lines 273-276). For large buckets, this means fetching thousands of objects just to skip them.

Consider using the start_after parameter in list_objects_v2 (which is available but currently passed as None on line 552) to resume directly from the last processed object name instead of skipping by index. This would require storing the last object name in the checkpoint rather than just the index.

Copilot · 2025-11-27T06:43:24Z

crates/ahm/src/heal/erasure_healer.rs

-            }
+                // check if already processed
+                if checkpoint_manager.get_checkpoint().await.processed_objects.contains(&object) {
+                    global_obj_idx += 1;


Inconsistent checkpoint index update when skipping already-processed objects. When an object is found in processed_objects set (line 279), the code increments global_obj_idx but doesn't update *current_object_index. This is inconsistent with error handling paths (lines 297, 310) and the successful processing path (line 342), which all update *current_object_index = global_obj_idx.

This inconsistency could cause issues if the heal operation is interrupted and resumed - the checkpoint index might not accurately reflect the actual progress. Consider adding *current_object_index = global_obj_idx; at line 280 for consistency.

Suggested change

global_obj_idx += 1;

global_obj_idx += 1;

*current_object_index = global_obj_idx;

Copilot · 2025-11-27T06:43:24Z

crates/ahm/src/scanner/node_scanner.rs

        // start scanning loop
        let scanner_clone = self.clone_for_background();
        tokio::spawn(async move {
+            // update object count and size for each bucket


The comment "update object count and size for each bucket" doesn't accurately describe what the code does. The code is spawning a task that runs scan_loop_with_resume(None), which is a generic scanning loop, not specifically updating object counts and sizes. This comment appears to be a copy-paste error or placeholder. Consider removing it or providing a more accurate description of what scan_loop_with_resume does.

Suggested change

// update object count and size for each bucket

// spawn background scanning loop

Copilot · 2025-11-27T06:43:24Z

crates/ahm/src/scanner/data_scanner.rs


        // Initialize and start the node scanner
        self.node_scanner.initialize_stats().await?;
+        // update object count and size for each bucket


The comment "update object count and size for each bucket" doesn't accurately describe what the code does. The code is calling node_scanner.start(), which initializes and starts the node scanner, not specifically updating object counts and sizes. This comment appears to be a copy-paste error or placeholder. Consider removing it or providing a more accurate description.

Suggested change

// update object count and size for each bucket

Co-authored-by: Copilot <[email protected]>

fix heal bucket objects with pagination

…s/rustfs into dependabot/cargo/dependencies-1126 * 'dependabot/cargo/dependencies-1126' of github.com:rustfs/rustfs: Fix Admin Heal API and Add Pagination Support for Large Buckets (#933)

weisd added 2 commits November 27, 2025 13:53

fix admin heal

c2d011f

fix heal bucket objects with pagination

checkout run.sh

d2be4ce

weisd requested review from Copilot and guojidan November 27, 2025 06:39

Copilot started reviewing on behalf of weisd November 27, 2025 06:39 View session

Copilot finished reviewing on behalf of weisd November 27, 2025 06:42

Copilot AI reviewed Nov 27, 2025

View reviewed changes

weisd and others added 13 commits November 27, 2025 15:22

Update crates/ahm/src/scanner/data_scanner.rs

22c0870

Co-authored-by: Copilot <[email protected]>

Merge branch 'main' into weisd/heal

0da6a09

Merge branch 'main' into weisd/heal

964c461

Merge branch 'main' into weisd/heal

0c33897

Merge branch 'main' into weisd/heal

91bc3e7

fix admin heal

7a5ff21

fix heal bucket objects with pagination

checkout run.sh

1a4f7e9

add disks_with_all_parts logs

5835234

test heal

d70dae4

test heal

b78272f

fix heal object

2c95ddc

fix heal object

b9e84f8

fix heal object

ef85d77

weisd merged commit a8b7b28 into main Dec 3, 2025
14 checks passed

weisd deleted the weisd/heal branch December 3, 2025 10:10

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix Admin Heal API and Add Pagination Support for Large Buckets #933

Fix Admin Heal API and Add Pagination Support for Large Buckets #933

Uh oh!

weisd commented Nov 27, 2025

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Copilot AI Nov 27, 2025

Uh oh!

Copilot AI Nov 27, 2025

Uh oh!

Copilot AI Nov 27, 2025

Uh oh!

Copilot AI Nov 27, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

	global_obj_idx += 1;
	global_obj_idx += 1;
	*current_object_index = global_obj_idx;

	// update object count and size for each bucket
	// spawn background scanning loop

Fix Admin Heal API and Add Pagination Support for Large Buckets #933

Fix Admin Heal API and Add Pagination Support for Large Buckets #933

Uh oh!

Conversation

weisd commented Nov 27, 2025

Fix Admin Heal API and Add Pagination Support for Large Buckets

Summary

Changes

1. Admin Heal API Fix

2. Pagination Support for Object Listing

3. HealScanMode Serialization Improvements

Type of Change

Related Issues

Summary of Changes

Checklist

Impact

Additional Notes

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Copilot AI Nov 27, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 27, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 27, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 27, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants