What happens
`chromadb/api/types.py` ships (chromadb 1.5.8):

```python
class DefaultEmbeddingFunction(EmbeddingFunction[Documents]):
    def __call__(self, input: Documents) -> Embeddings:
        from chromadb.utils.embedding_functions.onnx_mini_lm_l6_v2 import (
            ONNXMiniLM_L6_V2,
        )
        return ONNXMiniLM_L6_V2()(input)
```
`DefaultEmbeddingFunction.__call__` constructs a fresh `ONNXMiniLM_L6_V2` every time it runs, triggering cold lazy-init of the tokenizer (~5 ms) and the ONNX `InferenceSession` (~180 ms) on every invocation. Any workload that hits embedding on a hot path (per-request retrieval, streaming ingest, RAG loops) pays ~200 ms of avoidable tokenizer+model setup per call.
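A minimal timing sketch that makes the cost visible (numbers vary by hardware; assumes the MiniLM model is already in the local cache so no download happens inside the timed region):

```python
import time

from chromadb.utils.embedding_functions.onnx_mini_lm_l6_v2 import ONNXMiniLM_L6_V2

def fresh_instance_ms() -> float:
    # Mirrors the shipped __call__: a brand-new instance per call,
    # so tokenizer + InferenceSession init lands inside the timing.
    t0 = time.perf_counter()
    ONNXMiniLM_L6_V2()(["hello palace"])
    return (time.perf_counter() - t0) * 1e3

cached = ONNXMiniLM_L6_V2()
cached(["warm-up"])  # pay lazy init once, up front

t0 = time.perf_counter()
cached(["hello palace"])
warm_ms = (time.perf_counter() - t0) * 1e3

print(f"fresh instance: {fresh_instance_ms():.0f} ms, cached instance: {warm_ms:.0f} ms")
```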
Secondary issue — thread contention under concurrency
Even with a cached instance, the default `intra_op_num_threads=0` ("use all cores") causes severe context-switch thrashing when multiple concurrent queries hit the same session: each embed call fans out across all CPUs, producing worse-than-serial scaling.
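For reference, pinning the thread counts is standard onnxruntime API (the model path below is a placeholder):

```python
import onnxruntime as ort

so = ort.SessionOptions()
so.intra_op_num_threads = 1  # one compute thread inside each run() call
so.inter_op_num_threads = 1  # no cross-op parallelism either
sess = ort.InferenceSession("all-MiniLM-L6-v2.onnx", sess_options=so)
```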
Measurement
AMD EPYC-Rome VPS, 16 cores, 300-drawer palace, chromadb 1.5.8:
| Scenario | Pre-fix | Post-fix (singleton + intra_op=1) | Speedup |
|---|---|---|---|
| `semantic_search` p95 (single-user; per `/search` call, one embed inside) | 412 ms | 95 ms | 4.3× |
| Composite 4-layer wake-up p95 | 768 ms | 106 ms | 7.2× |
| 4-concurrent scaling ratio vs single call | 4.36× | 1.35× | 3.2× better |
| Ingest per drawer | 299 ms | 104 ms | 2.9× |
The 4-concurrent ratio is the one worth dwelling on: perfect serialization of four calls would give 4.0×, so 4.36× means four parallel embed calls take longer than running them one after another. The default thread fan-out produces negative scaling.
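A repro sketch for that ratio (helper names are mine; one shared, pre-warmed instance so only contention is measured):

```python
import time
from concurrent.futures import ThreadPoolExecutor

from chromadb.utils.embedding_functions.onnx_mini_lm_l6_v2 import ONNXMiniLM_L6_V2

ef = ONNXMiniLM_L6_V2()
ef(["warm-up"])  # lazy init happens outside the measurement

def one_embed(_) -> None:
    ef(["the quick brown fox jumps over the lazy dog"])

t0 = time.perf_counter()
one_embed(None)
single = time.perf_counter() - t0

t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(one_embed, range(4)))
concurrent4 = time.perf_counter() - t0

# Anything above 4.0 is worse than running the four calls serially.
print(f"4-concurrent / single-call ratio: {concurrent4 / single:.2f}x")
```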
Proposed fix
Two parts:
- Cache a single `ONNXMiniLM_L6_V2` instance at the class level (or via a module-level singleton) so `DefaultEmbeddingFunction.__call__` routes every invocation through one instance (sketched below).
- Construct that instance with `intra_op_num_threads=1, inter_op_num_threads=1` so concurrent embeds parallelize across separate cores instead of contending for the same ones.
`InferenceSession.run()` is documented as thread-safe, and the Rust-backed `tokenizers.Tokenizer` is thread-safe for `encode`. Vectors are byte-identical pre- and post-patch for identical input, so users applying the change need no index rebuild.
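A sketch of the caching half, keeping the shipped call site intact (the `_onnx_instance` attribute and the lock are mine; the thread-count half would have to be applied wherever `ONNXMiniLM_L6_V2` builds its `InferenceSession`, which this sketch deliberately does not reach into):

```python
import threading

from chromadb.api.types import Documents, EmbeddingFunction, Embeddings

class DefaultEmbeddingFunction(EmbeddingFunction[Documents]):
    _onnx_instance = None
    _lock = threading.Lock()

    def __call__(self, input: Documents) -> Embeddings:
        from chromadb.utils.embedding_functions.onnx_mini_lm_l6_v2 import (
            ONNXMiniLM_L6_V2,
        )
        # Double-checked locking: pay construction (and lazy model init)
        # exactly once per process instead of once per call.
        if DefaultEmbeddingFunction._onnx_instance is None:
            with DefaultEmbeddingFunction._lock:
                if DefaultEmbeddingFunction._onnx_instance is None:
                    DefaultEmbeddingFunction._onnx_instance = ONNXMiniLM_L6_V2()
        return DefaultEmbeddingFunction._onnx_instance(input)
```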
Workaround downstream
A monkey-patch of `DefaultEmbeddingFunction.__call__` landing the singleton + `intra_op=1` settings produces the numbers above. It is version-pinned to chromadb 1.5.8 because the patch depends on the exact call site. Discovered during MemPalace-sidecar integration work (xelauvas/xelasphere, April 2026); happy to share the patch module and full benchmark artifacts if useful.
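For anyone who wants to try it before an upstream fix lands, here is the shape of the caching half (module name illustrative; the `intra_op=1` half depends on `ONNXMiniLM_L6_V2` internals and is omitted here):

```python
# chromadb_singleton_patch.py -- import once at startup, before any embeds run
import chromadb.api.types as chroma_types
from chromadb.utils.embedding_functions.onnx_mini_lm_l6_v2 import ONNXMiniLM_L6_V2

_shared = ONNXMiniLM_L6_V2()  # constructed once per process

def _patched_call(self, input):
    # Same signature as the shipped __call__, routed through the shared instance.
    return _shared(input)

chroma_types.DefaultEmbeddingFunction.__call__ = _patched_call
```

Because the patch replaces the method on the class itself, every collection that falls back to the default embedding function picks it up without further changes.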
Version
chromadb 1.5.8 (reproducible back to at least 1.4.x based on the identical `__call__` body). Python 3.12.3, onnxruntime 1.24.4.