[ENH] Quantized Spann Segment Writer #6397

Merged
Sicheng-Pan merged 9 commits into main from 02-10-_enh_quantized_spann_segment
Feb 14, 2026

@Sicheng-Pan Sicheng-Pan commented Feb 10, 2026

Description of changes

Summarize the changes made by this PR.

  • Improvements & Bug fixes
    • Updated a few structs for QuantizedSpannIndexWriter to support the segment writer
    • Moved the scrub and centroid-rebuild logic out of commit into a separate finish step
    • Introduced the new QuantizedSpannSegmentWriter behind the feature flag
    • Updated the VectorSegmentWriter to use the new writer implementation
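
One way to make the finish-then-commit ordering hard to get wrong is a typestate split, where `commit` only exists on the value that `finish` returns. The sketch below illustrates that idea only; `OpenWriter` and `FinishedWriter` are hypothetical stand-ins, not types from this PR, and sorting plus deduplication merely stands in for the scrub and centroid-rebuild work.

```rust
/// Hypothetical stand-in for a segment writer that is still accepting writes.
#[derive(Debug, Default)]
pub struct OpenWriter {
    pending: Vec<u32>,
}

/// Hypothetical stand-in for a writer whose maintenance pass has already run.
#[derive(Debug)]
pub struct FinishedWriter {
    records: Vec<u32>,
}

impl OpenWriter {
    /// Buffers an offset id; duplicates are tolerated until `finish`.
    pub fn add(&mut self, offset_id: u32) {
        self.pending.push(offset_id);
    }

    /// Consumes the open writer so it cannot be reused afterwards.
    /// Sorting and deduplicating is a placeholder for scrub / rebuild work.
    pub fn finish(mut self) -> FinishedWriter {
        self.pending.sort_unstable();
        self.pending.dedup();
        FinishedWriter {
            records: self.pending,
        }
    }
}

impl FinishedWriter {
    /// Only reachable through `finish`, so the maintenance pass cannot be skipped.
    pub fn commit(self) -> Vec<u32> {
        self.records
    }
}
```

With this shape, calling `commit` on an `OpenWriter` is a compile error rather than a runtime hazard; whether that trade-off suits the real writer is a design decision for the maintainers.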

Test plan

How are these changes tested?

  • Tests pass locally with pytest for python, yarn test for js, cargo test for rust

Migration plan

Are there any migrations, or any forwards/backwards compatibility changes needed in order to make sure this change deploys reliably?

Observability plan

What is the plan to instrument and monitor this change?

Documentation Changes

Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs section?

Sicheng-Pan commented Feb 10, 2026

@github-actions

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of an unexpectedly high quality (Readability, Modularity, Intuitiveness)?

@Sicheng-Pan Sicheng-Pan force-pushed the 02-10-_enh_quantized_spann_segment branch from 8bded9c to 70b715c Compare February 10, 2026 22:26
@Sicheng-Pan Sicheng-Pan changed the title from "[ENH] Quantized Spann Segment" to "[ENH] Quantized Spann Segment Writer" Feb 11, 2026
@Sicheng-Pan Sicheng-Pan force-pushed the 02-10-_enh_quantized_spann_segment branch from 15768fa to e3957b5 Compare February 11, 2026 02:22
@Sicheng-Pan Sicheng-Pan marked this pull request as ready for review February 11, 2026 02:23
propel-code-bot bot commented Feb 11, 2026

The changes also expand the schema and configuration utilities to recognize quantized SPANN variants, add the associated file-path, prefetch, and materialized-log plumbing needed for the new segment type, and back the feature with integration tests covering persistence, reopen, garbage collection, and log-application flows under the usearch gate.

Possible Issues

• QuantizedSpannSegmentWriter::finish must run before commit; any caller skipping it will miss scrub/drop/rebuild logic and potentially corrupt segment state.
• Schema::get_spann_config defaults unspecified spaces to cosine, which may alter behavior for collections expecting another default.
• Distributed SPANN writers now reject quantized segments; ensure migration tooling creates quantized segments via the new module.
• Prefetch lists omit centroid artifacts; confirm this is intentional to avoid performance regressions on reopen.
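
To make the defaulting concern concrete, here is a minimal sketch of the two policies a config lookup could take when the distance space is unset: fall back silently to cosine, or require an explicit choice. `Space`, `SpannSettings`, `resolve_space`, and `resolve_space_strict` are hypothetical names for illustration and do not reflect the actual `Schema::get_spann_config` signature.

```rust
/// Hypothetical distance spaces; the real enum may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Space {
    L2,
    Cosine,
    InnerProduct,
}

/// Hypothetical settings holder where the space may be left unset.
#[derive(Debug, Default)]
pub struct SpannSettings {
    pub space: Option<Space>,
}

/// Lenient policy: an unset space silently becomes cosine.
pub fn resolve_space(settings: &SpannSettings) -> Space {
    settings.space.unwrap_or(Space::Cosine)
}

/// Strict policy: an unset space is reported instead of guessed.
pub fn resolve_space_strict(settings: &SpannSettings) -> Result<Space, String> {
    settings
        .space
        .ok_or_else(|| "distance space must be set explicitly".to_string())
}
```

The lenient form is convenient, but it can hide a mismatch when a collection was created under a different default; the strict form surfaces that case at the cost of more error handling at call sites.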

This summary was automatically generated by @propel-code-bot

Comment on lines +147 to +187
QuantizedSpannSegmentError::Config(format!(
"failed to parse record segment file path: {e}"
))
})?;
let options = BlockfileReaderOptions::new(id, prefix.to_string());
let reader = blockfile_provider.read(options).await.map_err(|e| {
QuantizedSpannSegmentError::Config(format!(
"failed to open record segment reader: {e}"
))
})?;
Some(reader)
}
None => None,
},
None => None,
};

// Order matches file_path_keys: cluster[0], embedding_metadata[1],
// quantized_centroid[2], raw_centroid[3], scalar_metadata[4].
let file_ids = QuantizedSpannIds {
embedding_metadata_id: parsed[1].1,
prefix_path: prefix_path.clone(),
quantized_centroid_id: IndexUuid(parsed[2].1),
quantized_cluster_id: parsed[0].1,
raw_centroid_id: IndexUuid(parsed[3].1),
scalar_metadata_id: parsed[4].1,
};
QuantizedSpannIndexWriter::open(
cluster_block_size,
vector_segment.collection,
spann_config,
dimensionality,
distance_function,
file_ids,
cmek,
prefix_path.clone(),
raw_embedding_reader,
blockfile_provider,
usearch_provider,
)
.await?

Critical

[Logic] apply_materialized_log_chunk() now hard-fails whenever the materialized log record doesn’t contain an embedding inline. In production, materialize_logs() commonly hydrates embeddings from the record segment (the log itself often omits them after compaction), so legitimate AddNew/OverwriteExisting operations will now fail with QuantizedSpannSegmentError::Data. You should accept the RecordSegmentReader that’s already being passed to VectorSegmentWriter::apply_materialized_log_chunk, hydrate the record when embeddings_ref_from_log() returns None, and only error when both sources are missing. For example:

use std::borrow::Cow;

pub async fn apply_materialized_log_chunk(
    &self,
    record_segment_reader: &RecordSegmentReader<'_>,
    materialized_chunk: &MaterializeLogsResult,
) -> Result<(), ApplyMaterializedLogError> {
    for record in materialized_chunk {
        // Prefer the embedding carried inline by the log record.
        let embedding = match record.embeddings_ref_from_log() {
            Some(v) => Cow::Borrowed(v),
            // Otherwise hydrate the record from the record segment.
            None => Cow::Owned(
                record
                    .hydrate(record_segment_reader, 1)
                    .await?
                    .embedding
                    .to_vec(),
            ),
        };
        self.index.add(record.get_offset_id(), &embedding).await?;
    }
    Ok(())
}

Without this fallback any replay that relies on the record segment (which is the default case) will immediately fail.


File: rust/segment/src/quantized_spann.rs
Line: 187

Contributor Author

AddNew/OverwriteExisting require the embedding to be present; this is a system invariant.
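
Under that reading, a missing embedding on an add or overwrite is an invariant violation to report, not a case to recover from. The sketch below shows one way such a check might look. `Operation::Delete`, `SegmentError`, and `embedding_to_index` are hypothetical names chosen for illustration, and only the `AddNew` and `OverwriteExisting` variant names come from the discussion above.

```rust
/// Hypothetical subset of materialized log operations.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Operation {
    AddNew,
    OverwriteExisting,
    Delete,
}

/// Hypothetical error type mirroring a "bad data" failure.
#[derive(Debug, PartialEq)]
pub enum SegmentError {
    Data(String),
}

/// Returns the embedding that should be indexed for this record, if any.
/// Adds and overwrites must carry an embedding; deletes never index one.
pub fn embedding_to_index<'a>(
    op: Operation,
    embedding: Option<&'a [f32]>,
) -> Result<Option<&'a [f32]>, SegmentError> {
    match (op, embedding) {
        (Operation::AddNew | Operation::OverwriteExisting, Some(v)) => Ok(Some(v)),
        (Operation::AddNew | Operation::OverwriteExisting, None) => Err(SegmentError::Data(
            format!("{op:?} record is missing its embedding"),
        )),
        (Operation::Delete, _) => Ok(None),
    }
}
```

Returning an error rather than panicking keeps a violated invariant visible to the caller without taking down the process that applies the log.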

@Sicheng-Pan Sicheng-Pan force-pushed the 02-10-_enh_quantized_spann_segment branch 2 times, most recently from fa35f48 to b02eb80 Compare February 12, 2026 01:41
@Sicheng-Pan Sicheng-Pan force-pushed the 02-10-_enh_quantized_spann_segment branch from b02eb80 to 41c8cc9 Compare February 12, 2026 21:33
@Sicheng-Pan Sicheng-Pan force-pushed the 02-10-_enh_quantized_spann_segment branch from 41c8cc9 to d58781f Compare February 13, 2026 00:43

Sicheng-Pan commented Feb 14, 2026

Merge activity

  • Feb 14, 1:52 AM UTC: A user started a stack merge that includes this pull request via Graphite.
  • Feb 14, 1:52 AM UTC: @Sicheng-Pan merged this pull request with Graphite.

@Sicheng-Pan Sicheng-Pan merged commit ead5dff into main Feb 14, 2026
67 checks passed
tanujnay112 added a commit that referenced this pull request Feb 18, 2026
- **[ENH]: Cache rust git submodules in mounted volume (#6424)**
- **[CHORE](k8s) increase dev CPU limits from 100m to 200-300m (#6435)**
- **[ENH] replace live cloud tests with k8s integration tests (#6434)**
- **[ENH] Make dirty_log_collections metric mcmr-aware. (#6353)**
- **[ENH] Quantized Spann Segment Writer (#6397)**
- **[ENH] Wire up quantized writer in compaction (#6399)**
- **[ENH] Quantized Spann Segment Reader (#6405)**
- **[ENH] Wire up quantized reader in new orchestrator (#6409)**
- **[ENH] Garbage collect usearch index files (#6416)**
- **[ENH] Trace quantized spann implementation (#6425)**
- **[ENH]: Precompute data chunk len() (#6442)**
- **[BUG]: Compaction version file flush was incomplete on MCMR (#6423)**
- **[DOC]: Fixed broken links in Readme (#6440)**
- **[DOC] Fix link to Rust documentation (#6443)**
- **[ENH]: Allow users to disable FTS in schema (#6214)**

---------

Co-authored-by: Robert Escriva
Co-authored-by: Macronova
Co-authored-by: Nilpotent
Co-authored-by: anderk222
Co-authored-by: Sanket Kedia
@Sicheng-Pan Sicheng-Pan deleted the 02-10-_enh_quantized_spann_segment branch February 25, 2026 23:47
2 participants