[ENH]: Precompute data chunk len() by tanujnay112 · Pull Request #6442 · chroma-core/chroma

tanujnay112 · 2026-02-15T00:38:59Z

Description of changes

Improvements & Bug fixes
- DataChunk.len() used to count the number of visibility[i] = True values on every invocation, which is inefficient. The metadata log reader has a loop that invokes len() for every fetched log record. This appears to take up > 40% of CPU in stack traces during gets on a high number of log records. This change precomputes the length value to fix this perf issue.
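
To make the cost described above concrete, here is a minimal sketch of the pre-change pattern. `ScanningChunk` and `flags_scanned_by_reader` are hypothetical names used only for illustration, and the field layout is an assumption, not the actual definition in rust/types/src/data_chunk.rs.

```rust
/// Simplified stand-in for the chunk type before this change (assumed layout).
pub struct ScanningChunk<T> {
    data: Vec<T>,
    visibility: Vec<bool>,
}

impl<T> ScanningChunk<T> {
    /// Every element starts out visible.
    pub fn new(data: Vec<T>) -> Self {
        let visibility = vec![true; data.len()];
        ScanningChunk { data, visibility }
    }

    /// O(n): walks the whole visibility vector on every call.
    pub fn len(&self) -> usize {
        self.visibility.iter().filter(|&&v| v).count()
    }
}

/// Hypothetical reader loop that calls len() once per record.
/// Returns how many visibility flags were inspected in total.
pub fn flags_scanned_by_reader<T>(chunk: &ScanningChunk<T>) -> usize {
    let mut scanned = 0;
    for _record in chunk.data.iter() {
        // Each call re-reads every visibility flag.
        let _ = chunk.len();
        scanned += chunk.visibility.len();
    }
    scanned
}
```

With n records the loop inspects n * n flags in total, which is the kind of quadratic behavior that can dominate a CPU profile on large fetches.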
New functionality
- ...

Test plan

How are these changes tested?

Tests pass locally with pytest for python, yarn test for js, cargo test for rust

Migration plan

Are there any migrations, or any forwards/backwards compatibility changes needed in order to make sure this change deploys reliably?

Observability plan

What is the plan to instrument and monitor this change?

Documentation Changes

Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs section?

github-actions · 2026-02-15T00:39:06Z

tanujnay112 · 2026-02-15T00:39:11Z

[ENH]: Precompute data chunk len() #6442 👈 (View in Graphite)
main

This stack of pull requests is managed by Graphite. Learn more about stacking.

propel-code-bot · 2026-02-15T00:48:48Z

Precompute Chunk visible length

Adds a cached visible_count to Chunk<T> so that len() returns the precomputed value instead of iterating visibility on every call. The constructor, Clone, and set_visibility paths now initialize or refresh the cached count, eliminating the repeated CPU-intensive scan noted in metadata log reads.

Key Changes

• Extended Chunk<T> in rust/types/src/data_chunk.rs with a visible_count field cloned alongside data and visibility.
• Initialized visible_count in Chunk::new to the full length of the backing data and returned it directly in len().
• Updated set_visibility to recompute visible_count whenever the visibility vector is replaced, keeping the cache consistent.
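
The key changes above can be sketched roughly as follows. This is an illustration under assumptions, not the merged code: the `Arc<[T]>` field types, the `get` helper, and the exact method signatures are guesses based on the summary, and only the names `Chunk`, `visible_count`, `len`, and `set_visibility` come from the PR.

```rust
use std::sync::Arc;

/// Simplified chunk with a cached count of visible elements (assumed layout).
pub struct Chunk<T> {
    data: Arc<[T]>,
    visibility: Arc<[bool]>,
    visible_count: usize,
}

impl<T> Clone for Chunk<T> {
    fn clone(&self) -> Self {
        Chunk {
            data: Arc::clone(&self.data),
            visibility: Arc::clone(&self.visibility),
            // The cached count is copied alongside the shared slices.
            visible_count: self.visible_count,
        }
    }
}

impl<T> Chunk<T> {
    /// All elements start visible, so the cached count is the data length.
    pub fn new(data: Arc<[T]>) -> Self {
        let len = data.len();
        Chunk {
            data,
            visibility: vec![true; len].into(),
            visible_count: len,
        }
    }

    /// O(1): returns the cached value instead of scanning visibility.
    pub fn len(&self) -> usize {
        self.visible_count
    }

    /// Recomputes the cache once, when the visibility vector is replaced.
    pub fn set_visibility(&mut self, visibility: Vec<bool>) {
        self.visible_count = visibility.iter().filter(|&&v| v).count();
        self.visibility = visibility.into();
    }

    /// Hypothetical accessor: returns the element only if it is visible.
    pub fn get(&self, index: usize) -> Option<&T> {
        if *self.visibility.get(index)? {
            self.data.get(index)
        } else {
            None
        }
    }
}
```

The scan moves from every `len()` call to the comparatively rare `set_visibility` call, so the cache stays correct only as long as `set_visibility` is the sole way to change visibility.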

Possible Issues

• If any consumer mutates visibility without using set_visibility (e.g., by cloning the arc slice and modifying it), visible_count would desynchronize; reviewers should confirm such patterns do not exist.

This summary was automatically generated by @propel-code-bot

propel-code-bot · 2026-02-15T00:50:42Z

rust/types/src/data_chunk.rs

    /// # Arguments
    /// * `visibility` - A vector of boolean values indicating the visibility of the elements
    pub fn set_visibility(&mut self, visibility: Vec<bool>) {
+        self.visible_count = visibility.iter().filter(|&v| *v).count();


[Logic] The method documentation explicitly states that visibility length should match the data chunk length. However, this is not enforced, which could lead to inconsistent state where len() (based on visible_count) disagrees with iter().count() (based on data.len() and visibility) if a longer visibility vector is passed.

Add an assertion to enforce this invariant and prevent subtle bugs.

Suggested change

-        self.visible_count = visibility.iter().filter(|&v| *v).count();
+        assert_eq!(visibility.len(), self.data.len(), "Visibility vector length must match data length");
+        self.visible_count = visibility.iter().filter(|&v| *v).count();
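
The mismatch this comment warns about can be reproduced in isolation. The sketch below assumes that iteration pairs each data element with its visibility flag and stops at the shorter of the two; the function names are hypothetical, and `checked_count` is a non-panicking alternative to the suggested `assert_eq!`, not something proposed in the review.

```rust
/// Count computed the way the cached field is: from visibility alone.
pub fn cached_count(visibility: &[bool]) -> usize {
    visibility.iter().filter(|&&v| v).count()
}

/// Count observed by iterating data with visibility (assumed behavior):
/// zip stops at the shorter slice, so extra flags are ignored.
pub fn iterated_count<T>(data: &[T], visibility: &[bool]) -> usize {
    data.iter()
        .zip(visibility.iter())
        .filter(|pair| *pair.1)
        .count()
}

/// Hypothetical variant of the suggested check that returns an error
/// instead of panicking when the lengths differ.
pub fn checked_count<T>(data: &[T], visibility: &[bool]) -> Result<usize, String> {
    if data.len() != visibility.len() {
        return Err(format!(
            "Visibility vector length ({}) must match data length ({})",
            visibility.len(),
            data.len()
        ));
    }
    Ok(cached_count(visibility))
}
```

With two data elements and three `true` flags, the cached count is 3 while iteration yields 2, which is exactly the disagreement the assertion is meant to rule out.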

Context for Agents

The method documentation explicitly states that `visibility` length should match the data chunk length. However, this is not enforced, which could lead to inconsistent state where `len()` (based on `visible_count`) disagrees with `iter().count()` (based on `data.len()` and `visibility`) if a longer visibility vector is passed. Add an assertion to enforce this invariant and prevent subtle bugs.

```suggestion
assert_eq!(visibility.len(), self.data.len(), "Visibility vector length must match data length");
self.visible_count = visibility.iter().filter(|&v| *v).count();
```

File: rust/types/src/data_chunk.rs
Line: 81

- **[ENH]: Cache rust git submodules in mounted volume (#6424)**
- **[CHORE](k8s) increase dev CPU limits from 100m to 200-300m (#6435)**
- **[ENH] replace live cloud tests with k8s integration tests (#6434)**
- **[ENH] Make dirty_log_collections metric mcmr-aware. (#6353)**
- **[ENH] Quantized Spann Segment Writer (#6397)**
- **[ENH] Wire up quantized writer in compaction (#6399)**
- **[ENH] Quantized Spann Segment Reader (#6405)**
- **[ENH] Wire up quantized reader in new orchestrator (#6409)**
- **[ENH] Garbage collect usearch index files (#6416)**
- **[ENH] Trace quantized spann implementation (#6425)**
- **[ENH]: Precompute data chunk len() (#6442)**
- **[BUG]: Compaction version file flush was incomplete on MCMR (#6423)**
- **[DOC]: Fixed broken links in Readme (#6440)**
- **[DOC] Fix link to Rust documentation (#6443)**
- **[ENH]: Allow users to disable FTS in schema (#6214)**

---------

Co-authored-by: Robert Escriva <[email protected]>
Co-authored-by: Macronova <[email protected]>
Co-authored-by: Nilpotent <[email protected]>
Co-authored-by: anderk222 <[email protected]>
Co-authored-by: Sanket Kedia <[email protected]>

[ENH]: Precompute data chunk len()

28d3150

tanujnay112 marked this pull request as ready for review February 15, 2026 00:48

propel-code-bot bot reviewed Feb 15, 2026


HammadB approved these changes Feb 15, 2026


tanujnay112 merged commit 874a1f9 into main Feb 15, 2026
121 of 133 checks passed

tanujnay112 mentioned this pull request Feb 18, 2026

[CHORE]: fast forward rc/2026-02-13 to 3cb097601ddd379ab109456727a926e593e43a5c #6456

Merged

tanujnay112 merged 1 commit into main from metadata_len_opt




github-actions bot commented Feb 15, 2026

Reviewer Checklist

Testing, Bugs, Errors, Logs, Documentation

System Compatibility

Quality
