[ENH] Quantized Spann Segment Writer#6397
Reviewer Checklist
Please leverage this checklist to ensure your code review is thorough before approving.
Testing, Bugs, Errors, Logs, Documentation
System Compatibility
Quality
The changes also expand the schema and configuration utilities to recognize quantized SPANN variants, add the associated file-path, prefetch, and materialized-log plumbing needed for the new segment type, and back the feature with integration tests covering persistence, reopen, garbage collection, and log-application flows under the
Possible Issues
This summary was automatically generated by @propel-code-bot
                    QuantizedSpannSegmentError::Config(format!(
                        "failed to parse record segment file path: {e}"
                    ))
                })?;
                let options = BlockfileReaderOptions::new(id, prefix.to_string());
                let reader = blockfile_provider.read(options).await.map_err(|e| {
                    QuantizedSpannSegmentError::Config(format!(
                        "failed to open record segment reader: {e}"
                    ))
                })?;
                Some(reader)
            }
            None => None,
        },
        None => None,
    };

    // Order matches file_path_keys: cluster[0], embedding_metadata[1],
    // quantized_centroid[2], raw_centroid[3], scalar_metadata[4].
    let file_ids = QuantizedSpannIds {
        embedding_metadata_id: parsed[1].1,
        prefix_path: prefix_path.clone(),
        quantized_centroid_id: IndexUuid(parsed[2].1),
        quantized_cluster_id: parsed[0].1,
        raw_centroid_id: IndexUuid(parsed[3].1),
        scalar_metadata_id: parsed[4].1,
    };
    QuantizedSpannIndexWriter::open(
        cluster_block_size,
        vector_segment.collection,
        spann_config,
        dimensionality,
        distance_function,
        file_ids,
        cmek,
        prefix_path.clone(),
        raw_embedding_reader,
        blockfile_provider,
        usearch_provider,
    )
    .await?
[Logic] `apply_materialized_log_chunk()` now hard-fails whenever the materialized log record doesn’t contain an embedding inline. In production, `materialize_logs()` commonly hydrates embeddings from the record segment (the log itself often omits them after compaction), so legitimate `AddNew`/`OverwriteExisting` operations will now fail with `QuantizedSpannSegmentError::Data`. You should accept the `RecordSegmentReader` that’s already being passed to `VectorSegmentWriter::apply_materialized_log_chunk`, hydrate the record when `embeddings_ref_from_log()` returns `None`, and only error when both sources are missing. For example:
```rust
use std::borrow::Cow;

pub async fn apply_materialized_log_chunk(
    &self,
    record_segment_reader: &RecordSegmentReader<'_>,
    materialized_chunk: &MaterializeLogsResult,
) -> Result<(), ApplyMaterializedLogError> {
    for record in materialized_chunk {
        // Prefer the embedding carried inline by the log record.
        let embedding = match record.embeddings_ref_from_log() {
            Some(v) => Cow::Borrowed(v),
            // Otherwise hydrate it from the record segment.
            None => Cow::Owned(
                record
                    .hydrate(record_segment_reader, 1)
                    .await?
                    .embedding
                    .to_vec(),
            ),
        };
        self.index.add(record.get_offset_id(), &embedding).await?;
    }
    Ok(())
}
```
Without this fallback, any replay that relies on the record segment (which is the default case) will immediately fail.
File: rust/segment/src/quantized_spann.rs
Line: 187
`AddNew`/`OverwriteExisting` require the embedding to be present in the log record; this is a system invariant.
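For readers comparing the two positions in this thread, here is a minimal, self-contained sketch of both behaviours. Everything in it is a hypothetical stand-in (`LogRecord`, `ApplyError`, `embedding_strict`, `embedding_with_fallback`, and the `lookup` closure are not names from this PR or from the codebase); it only illustrates the difference between treating a missing inline embedding as an invariant violation and falling back to a second source.

```rust
use std::borrow::Cow;

/// Hypothetical stand-in for a materialized log record.
struct LogRecord {
    offset_id: u32,
    /// Embedding carried inline by the log, if any.
    inline_embedding: Option<Vec<f32>>,
}

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
enum ApplyError {
    MissingEmbedding(u32),
}

/// Strict variant: a missing inline embedding is treated as a violated
/// invariant and reported as an error.
fn embedding_strict(record: &LogRecord) -> Result<&[f32], ApplyError> {
    record
        .inline_embedding
        .as_deref()
        .ok_or(ApplyError::MissingEmbedding(record.offset_id))
}

/// Fallback variant: borrow the inline embedding when present, otherwise
/// ask `lookup` (standing in for a record segment read) for an owned copy.
/// Errors only when both sources are missing.
fn embedding_with_fallback<'a, F>(
    record: &'a LogRecord,
    lookup: F,
) -> Result<Cow<'a, [f32]>, ApplyError>
where
    F: FnOnce(u32) -> Option<Vec<f32>>,
{
    match record.inline_embedding.as_deref() {
        Some(v) => Ok(Cow::Borrowed(v)),
        None => lookup(record.offset_id)
            .map(Cow::Owned)
            .ok_or(ApplyError::MissingEmbedding(record.offset_id)),
    }
}
```

If the invariant stated in the reply holds, the strict variant never takes its error path for `AddNew`/`OverwriteExisting`, and the fallback variant would only add an unused read path.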
- **[ENH]: Cache rust git submodules in mounted volume (#6424)**
- **[CHORE](k8s) increase dev CPU limits from 100m to 200-300m (#6435)**
- **[ENH] replace live cloud tests with k8s integration tests (#6434)**
- **[ENH] Make dirty_log_collections metric mcmr-aware. (#6353)**
- **[ENH] Quantized Spann Segment Writer (#6397)**
- **[ENH] Wire up quantized writer in compaction (#6399)**
- **[ENH] Quantized Spann Segment Reader (#6405)**
- **[ENH] Wire up quantized reader in new orchestrator (#6409)**
- **[ENH] Garbage collect usearch index files (#6416)**
- **[ENH] Trace quantized spann implementation (#6425)**
- **[ENH]: Precompute data chunk len() (#6442)**
- **[BUG]: Compaction version file flush was incomplete on MCMR (#6423)**
- **[DOC]: Fixed broken links in Readme (#6440)**
- **[DOC] Fix link to Rust documentation (#6443)**
- **[ENH]: Allow users to disable FTS in schema (#6214)**
---------
Co-authored-by: Robert Escriva <[email protected]>
Co-authored-by: Macronova <[email protected]>
Co-authored-by: Nilpotent <[email protected]>
Co-authored-by: anderk222 <[email protected]>
Co-authored-by: Sanket Kedia <[email protected]>

Description of changes
Summarize the changes made by this PR.
- `QuantizedSpannIndexWriter` to facilitate segment writer
- `commit` to a separate `finish`
- `QuantizedSpannSegmentWriter`, under the feature flag
- `VectorSegmentWriter` to use the new writer impl
Test plan
How are these changes tested?
- `pytest` for python, `yarn test` for js, `cargo test` for rust
Migration plan
Are there any migrations, or any forwards/backwards compatibility changes needed in order to make sure this change deploys reliably?
Observability plan
What is the plan to instrument and monitor this change?
Documentation Changes
Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs section?