[ENH] Garbage collect usearch index files#6416

Merged
Sicheng-Pan merged 27 commits into main from
02-11-_enh_garbage_collect_usearch_index_files
Feb 14, 2026

@Sicheng-Pan
Contributor

@Sicheng-Pan Sicheng-Pan commented Feb 12, 2026

## Description of changes

Summarize the changes made by this PR.

  • Improvements & Bug fixes
    • N/A
  • New functionality
    • Wire up GC to clean up usearch binary files

Test plan

How are these changes tested?

  • Tests pass locally with pytest for python, yarn test for js, cargo test for rust

Migration plan

Are there any migrations, or any forwards/backwards compatibility changes needed in order to make sure this change deploys reliably?

Observability plan

What is the plan to instrument and monitor this change?

Documentation Changes

Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs section?

@github-actions

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of an unexpectedly high quality (Readability, Modularity, Intuitiveness)?

Contributor Author

Sicheng-Pan commented Feb 12, 2026

@Sicheng-Pan Sicheng-Pan marked this pull request as ready for review February 12, 2026 06:11
@propel-code-bot
Contributor

propel-code-bot bot commented Feb 12, 2026

Garbage collector now cleans up USearch index artifacts

Extends GC operators so USearch/Spann index binaries are discovered, preserved when needed, and deleted when unused. Adds USearchIndex::format_storage_key usage and new QUANTIZED_SPANN_* constants in file discovery paths plus enables the usearch feature on the chroma-index dependency.

Key Changes

• Updated ListFilesAtVersionsOperator to recognize QUANTIZED_SPANN_* file types and translate segment paths into USearch binary storage keys via USearchIndex::format_storage_key.
• Modified FetchSparseIndexFilesOperator to treat USearch files like HNSW indices, skipping fetch but collecting prefixes for deferred deletion unless the version is preserved.
• Enhanced ComputeUnusedFilesOperator to emit USearch .bin keys into unused_hnsw_prefixes, ensuring GC deletes orphaned quantized and raw centroids.
• Enabled the usearch feature on chroma-index within rust/garbage_collector/Cargo.toml so the crate exposes USearchIndex helpers.
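
The routing described in the list above can be sketched roughly as follows. This is a minimal, self-contained illustration rather than the code from this PR: the constant values, the `HNSW_INDEX` name, the `GcAction` enum, and the `classify` and `deletion_candidates` helpers are all hypothetical stand-ins, and the real operators work on Chroma's own types.

```rust
// Hypothetical stand-ins: the real constants are defined in Chroma's crates,
// and their string values are not shown in this conversation.
const HNSW_INDEX: &str = "hnsw_index";
const QUANTIZED_SPANN_RAW_CENTROID: &str = "quantized_spann_raw_centroid";
const QUANTIZED_SPANN_QUANTIZED_CENTROID: &str = "quantized_spann_quantized_centroid";

/// What the GC does with a file of a given type (illustrative only).
#[derive(Debug, PartialEq)]
enum GcAction {
    /// Sparse index file: fetch it so the blocks it references can be listed.
    FetchSparseIndex,
    /// HNSW or USearch file: skip the fetch and record it for deferred deletion.
    CollectForDeletion,
}

fn classify(file_type: &str) -> GcAction {
    match file_type {
        HNSW_INDEX | QUANTIZED_SPANN_RAW_CENTROID | QUANTIZED_SPANN_QUANTIZED_CENTROID => {
            GcAction::CollectForDeletion
        }
        _ => GcAction::FetchSparseIndex,
    }
}

/// Returns the paths to queue for deletion; nothing is queued for a preserved version.
fn deletion_candidates(files: &[(&str, &str)], version_is_preserved: bool) -> Vec<String> {
    if version_is_preserved {
        return Vec::new();
    }
    let mut candidates = Vec::new();
    for &(file_type, path) in files {
        if classify(file_type) == GcAction::CollectForDeletion {
            candidates.push(path.to_string());
        }
    }
    candidates
}
```

Under these assumptions, USearch files take the same branch as HNSW files, which is the behavior the summary attributes to FetchSparseIndexFilesOperator.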

Possible Issues

• No automated tests cover the USearch GC paths, so regressions (e.g., changing key formatting) would go unnoticed.
• If segment paths ever omit the USearch shard suffix, Segment::extract_prefix_and_id may fail and prevent GC cleanup for those files.
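
The second issue can be made concrete with a small sketch. It assumes, purely for illustration, that a segment path ends in `/<uuid>` and that the storage key is `<prefix>/<uuid>.bin`; the real `Segment::extract_prefix_and_id` and `USearchIndex::format_storage_key` may parse and format differently, and the functions below are hypothetical stand-ins for them.

```rust
/// Rough shape check for a hyphenated UUID (8-4-4-4-12 hex digits).
fn looks_like_uuid(s: &str) -> bool {
    s.len() == 36
        && s.char_indices().all(|(i, c)| match i {
            8 | 13 | 18 | 23 => c == '-',
            _ => c.is_ascii_hexdigit(),
        })
}

/// Hypothetical stand-in for `Segment::extract_prefix_and_id`.
fn extract_prefix_and_id(path: &str) -> Result<(&str, &str), String> {
    let (prefix, id) = path
        .rsplit_once('/')
        .ok_or_else(|| format!("no '/' in path: {}", path))?;
    if !looks_like_uuid(id) {
        return Err(format!("last path component is not a UUID: {}", id));
    }
    Ok((prefix, id))
}

/// Mirrors the `?` propagation in the PR snippet: the first bad path aborts the batch.
fn usearch_keys(paths: &[&str]) -> Result<Vec<String>, String> {
    let mut keys = Vec::new();
    for &path in paths {
        let (prefix, id) = extract_prefix_and_id(path)?;
        // Hypothetical key layout; only the `.bin` suffix comes from the PR discussion.
        keys.push(format!("{}/{}.bin", prefix, id));
    }
    Ok(keys)
}
```

Because the error is propagated with `?`, one malformed path yields no keys at all for that batch in this sketch, which is the failure mode the bullet warns about.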

This summary was automatically generated by @propel-code-bot

@Sicheng-Pan Sicheng-Pan force-pushed the 02-11-_enh_garbage_collect_usearch_index_files branch from 34250b1 to da75d4e Compare February 12, 2026 18:48
Comment on lines +107 to +123
// For usearch index files, add the single .bin file directly.
if file_type == QUANTIZED_SPANN_RAW_CENTROID
    || file_type == QUANTIZED_SPANN_QUANTIZED_CENTROID
{
    let quantized = file_type == QUANTIZED_SPANN_QUANTIZED_CENTROID;
    for file_path in file_paths.paths.iter() {
        let (prefix, id) =
            Segment::extract_prefix_and_id(file_path).map_err(|e| {
                tracing::error!(error = %e, "Failed to extract prefix and ID");
                ComputeUnusedFilesError::InvalidUuid(e, file_path.to_string())
            })?;
        let s3_key =
            USearchIndex::format_storage_key(prefix, IndexUuid(id), quantized);
        unused_hnsw_prefixes.push(s3_key);
    }
    continue;
}
Contributor

Important

[Testing] The new SPANN handling in compute_unused_between_successive_versions isn’t covered by any tests—the existing suites only exercise sparse index blocks and HNSW prefixes. Without a regression test we can’t detect if future refactors stop emitting the formatted .bin key, which would silently leak storage. Please extend the tests in this file to create a CollectionSegmentInfo containing a QUANTIZED_SPANN_* entry and assert that compute_unused_between_successive_versions returns the expected USearchIndex::format_storage_key(...) value in unused_hnsw_prefixes.

File: rust/garbage_collector/src/operators/compute_unused_files.rs
Line: 123
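
A regression test along the lines requested above might look like the following sketch. It is self-contained and does not use the real `CollectionSegmentInfo` or `compute_unused_between_successive_versions`, whose signatures are not shown in this conversation: `VersionFiles`, `unused_usearch_keys`, `version`, the constant values, and the key layout in `format_storage_key` are hypothetical stand-ins.

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical stand-ins for the real constants and helpers.
const QUANTIZED_SPANN_RAW_CENTROID: &str = "quantized_spann_raw_centroid";
const QUANTIZED_SPANN_QUANTIZED_CENTROID: &str = "quantized_spann_quantized_centroid";

/// File type -> file paths, for one collection version (simplified).
type VersionFiles = HashMap<String, Vec<String>>;

/// Hypothetical key layout; the real one is `USearchIndex::format_storage_key`.
fn format_storage_key(prefix: &str, id: &str, quantized: bool) -> String {
    let kind = if quantized { "quantized" } else { "raw" };
    format!("{}/{}_{}.bin", prefix, kind, id)
}

/// Keys for USearch files present in `older` but no longer referenced by `newer`.
fn unused_usearch_keys(older: &VersionFiles, newer: &VersionFiles) -> Vec<String> {
    let mut keys = Vec::new();
    for file_type in [QUANTIZED_SPANN_RAW_CENTROID, QUANTIZED_SPANN_QUANTIZED_CENTROID] {
        let quantized = file_type == QUANTIZED_SPANN_QUANTIZED_CENTROID;
        let still_used: HashSet<&str> = newer
            .get(file_type)
            .into_iter()
            .flatten()
            .map(String::as_str)
            .collect();
        for path in older.get(file_type).into_iter().flatten() {
            if still_used.contains(path.as_str()) {
                continue;
            }
            if let Some((prefix, id)) = path.rsplit_once('/') {
                keys.push(format_storage_key(prefix, id, quantized));
            }
        }
    }
    keys
}

/// Test helper: builds a version from (file type, path) pairs.
fn version(entries: &[(&str, &str)]) -> VersionFiles {
    let mut files = VersionFiles::new();
    for &(file_type, path) in entries {
        files.entry(file_type.to_string()).or_default().push(path.to_string());
    }
    files
}

#[test]
fn emits_key_for_dropped_usearch_file() {
    let older = version(&[
        (QUANTIZED_SPANN_QUANTIZED_CENTROID, "prefix/old-id"),
        (QUANTIZED_SPANN_RAW_CENTROID, "prefix/kept-id"),
    ]);
    let newer = version(&[(QUANTIZED_SPANN_RAW_CENTROID, "prefix/kept-id")]);
    assert_eq!(
        unused_usearch_keys(&older, &newer),
        vec![format_storage_key("prefix", "old-id", true)]
    );
}
```

Asserting against the output of the key formatter, rather than a hard-coded string, keeps the test focused on whether a key is emitted at all; a separate test would be needed to pin the key format itself.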

@Sicheng-Pan Sicheng-Pan force-pushed the 02-11-_enh_garbage_collect_usearch_index_files branch from da75d4e to dc0d345 Compare February 12, 2026 21:27
@Sicheng-Pan Sicheng-Pan force-pushed the 02-11-_enh_wire_up_quantized_reader_in_new_orchestrator branch from d067206 to 2acf8f0 Compare February 12, 2026 21:27
@Sicheng-Pan Sicheng-Pan force-pushed the 02-11-_enh_garbage_collect_usearch_index_files branch from dc0d345 to a6a9ce3 Compare February 12, 2026 21:33
@Sicheng-Pan Sicheng-Pan force-pushed the 02-11-_enh_wire_up_quantized_reader_in_new_orchestrator branch from 2acf8f0 to 588f737 Compare February 12, 2026 21:33
@Sicheng-Pan Sicheng-Pan force-pushed the 02-11-_enh_wire_up_quantized_reader_in_new_orchestrator branch from 588f737 to b64d5d2 Compare February 13, 2026 00:43
@Sicheng-Pan Sicheng-Pan force-pushed the 02-11-_enh_garbage_collect_usearch_index_files branch from a6a9ce3 to e43785c Compare February 13, 2026 00:43
@Sicheng-Pan Sicheng-Pan force-pushed the 02-11-_enh_wire_up_quantized_reader_in_new_orchestrator branch from b64d5d2 to 9debc15 Compare February 13, 2026 01:39
@Sicheng-Pan Sicheng-Pan force-pushed the 02-11-_enh_garbage_collect_usearch_index_files branch 2 times, most recently from e3c0a2b to eb342bb Compare February 13, 2026 05:39
@Sicheng-Pan Sicheng-Pan force-pushed the 02-11-_enh_garbage_collect_usearch_index_files branch from eb342bb to e1fb908 Compare February 13, 2026 19:19
@Sicheng-Pan Sicheng-Pan force-pushed the 02-11-_enh_wire_up_quantized_reader_in_new_orchestrator branch from 96c6e0e to 1a1b6e8 Compare February 13, 2026 19:19
@Sicheng-Pan Sicheng-Pan force-pushed the 02-11-_enh_wire_up_quantized_reader_in_new_orchestrator branch from 1a1b6e8 to 7cefa67 Compare February 14, 2026 00:05
@Sicheng-Pan Sicheng-Pan force-pushed the 02-11-_enh_garbage_collect_usearch_index_files branch from e1fb908 to 5769085 Compare February 14, 2026 00:05
@Sicheng-Pan Sicheng-Pan force-pushed the 02-11-_enh_wire_up_quantized_reader_in_new_orchestrator branch from 7cefa67 to 8f4b1ec Compare February 14, 2026 00:09
@Sicheng-Pan Sicheng-Pan force-pushed the 02-11-_enh_garbage_collect_usearch_index_files branch from 5769085 to def34fc Compare February 14, 2026 00:09
Contributor Author

Sicheng-Pan commented Feb 14, 2026

Merge activity

  • Feb 14, 3:39 AM UTC: A user started a stack merge that includes this pull request via Graphite.
  • Feb 14, 4:16 AM UTC: @Sicheng-Pan merged this pull request with Graphite.

@Sicheng-Pan Sicheng-Pan changed the base branch from 02-11-_enh_wire_up_quantized_reader_in_new_orchestrator to graphite-base/6416 February 14, 2026 03:40
@Sicheng-Pan Sicheng-Pan changed the base branch from graphite-base/6416 to main February 14, 2026 04:15
@Sicheng-Pan Sicheng-Pan merged commit 188d6fe into main Feb 14, 2026
69 of 70 checks passed
tanujnay112 added a commit that referenced this pull request Feb 18, 2026
- **[ENH]: Cache rust git submodules in mounted volume (#6424)**
- **[CHORE](k8s) increase dev CPU limits from 100m to 200-300m (#6435)**
- **[ENH] replace live cloud tests with k8s integration tests (#6434)**
- **[ENH] Make dirty_log_collections metric mcmr-aware. (#6353)**
- **[ENH] Quantized Spann Segment Writer (#6397)**
- **[ENH] Wire up quantized writer in compaction (#6399)**
- **[ENH] Quantized Spann Segment Reader (#6405)**
- **[ENH] Wire up quantized reader in new orchestrator (#6409)**
- **[ENH] Garbage collect usearch index files (#6416)**
- **[ENH] Trace quantized spann implementation (#6425)**
- **[ENH]: Precompute data chunk len() (#6442)**
- **[BUG]: Compaction version file flush was incomplete on MCMR (#6423)**
- **[DOC]: Fixed broken links in Readme (#6440)**
- **[DOC] Fix link to Rust documentation (#6443)**
- **[ENH]: Allow users to disable FTS in schema (#6214)**

---------

Co-authored-by: Robert Escriva <[email protected]>
Co-authored-by: Macronova <[email protected]>
Co-authored-by: Nilpotent <[email protected]>
Co-authored-by: anderk222 <[email protected]>
Co-authored-by: Sanket Kedia <[email protected]>
@Sicheng-Pan Sicheng-Pan deleted the 02-11-_enh_garbage_collect_usearch_index_files branch February 25, 2026 23:47

2 participants