
DefaultEmbeddingFunction.__call__ constructs a new ONNXMiniLM_L6_V2 on every call (10× slowdown on repeated embeds) #6941

@xelauvas

Description


What happens

In chromadb 1.5.8, chromadb/api/types.py ships:

class DefaultEmbeddingFunction(EmbeddingFunction[Documents]):
    def __call__(self, input: Documents) -> Embeddings:
        from chromadb.utils.embedding_functions.onnx_mini_lm_l6_v2 import (
            ONNXMiniLM_L6_V2,
        )
        return ONNXMiniLM_L6_V2()(input)

DefaultEmbeddingFunction.__call__ constructs a fresh ONNXMiniLM_L6_V2 every time it runs, triggering cold lazy-init of the tokenizer (~5ms) and the ONNX InferenceSession (~180ms) per invocation. Users whose workload hits embedding on a hot path (per-request retrieval, streaming ingest, RAG loops) pay ~200ms of avoidable tokenizer+model setup per call.
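For scale, a minimal timing sketch of the difference (using the import path quoted above; the document text and loop counts are arbitrary, and absolute numbers will vary by machine):

import time

from chromadb.utils.embedding_functions.onnx_mini_lm_l6_v2 import ONNXMiniLM_L6_V2

docs = ["the quick brown fox jumps over the lazy dog"]

# Current behavior of DefaultEmbeddingFunction.__call__: a fresh instance per
# call, so the tokenizer and InferenceSession are lazily initialized every time.
t0 = time.perf_counter()
for _ in range(5):
    ONNXMiniLM_L6_V2()(docs)
fresh_ms = (time.perf_counter() - t0) / 5 * 1e3

# With a cached instance the lazy init is paid once.
ef = ONNXMiniLM_L6_V2()
ef(docs)  # warm-up call
t0 = time.perf_counter()
for _ in range(5):
    ef(docs)
cached_ms = (time.perf_counter() - t0) / 5 * 1e3

print(f"fresh instance per call:  {fresh_ms:.1f} ms")
print(f"cached instance per call: {cached_ms:.1f} ms")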

Secondary issue — thread contention under concurrency

Even with a cached instance, the default intra_op_num_threads=0 ("use all cores") causes severe context-switch thrashing when multiple concurrent queries hit the same session: each embed call fans out across all CPUs, producing worse-than-serial scaling.
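A sketch of how that ratio can be measured (shared cached instance, default thread settings; the ratio compares four concurrent embeds against the cost of a single call):

import time
from concurrent.futures import ThreadPoolExecutor

from chromadb.utils.embedding_functions.onnx_mini_lm_l6_v2 import ONNXMiniLM_L6_V2

ef = ONNXMiniLM_L6_V2()
docs = ["a medium-length sentence to embed for the benchmark"] * 8
ef(docs)  # warm up the tokenizer and the InferenceSession once

# Four calls back to back.
t0 = time.perf_counter()
for _ in range(4):
    ef(docs)
serial = time.perf_counter() - t0

# The same four calls in parallel against the shared session.
t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(lambda _: ef(docs), range(4)))
concurrent = time.perf_counter() - t0

single = serial / 4
print(f"4 serial calls:     {serial:.3f} s")
print(f"4 concurrent calls: {concurrent:.3f} s ({concurrent / single:.2f}x a single call)")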

Measurement

Benchmark environment: AMD EPYC-Rome VPS (16 cores), a 300-drawer palace, chromadb 1.5.8:

| Scenario | Pre-fix | Post-fix (singleton + intra_op=1) | Speedup |
| --- | --- | --- | --- |
| semantic_search p95 (single-user; per /search call, one embed inside) | 412 ms | 95 ms | 4.3× |
| Composite 4-layer wake-up p95 | 768 ms | 106 ms | 7.2× |
| 4-concurrent scaling ratio vs single-call | 4.36× | 1.35× | 3.2× better |
| Ingest per drawer | 299 ms | 104 ms | 2.9× |

The 4-concurrent ratio is the one worth dwelling on: 4.36× means four parallel embed calls take longer than running them in series would — the default thread fan-out produces negative scaling.

Proposed fix

Two parts:

  1. Cache a single ONNXMiniLM_L6_V2 instance at the class level (or via a module-level singleton) so that DefaultEmbeddingFunction.__call__ routes every invocation through one instance (sketched below).
  2. Construct that instance with intra_op_num_threads=1, inter_op_num_threads=1 so concurrent embeds parallelize across separate cores instead of contending for the same ones (see the onnxruntime sketch after the thread-safety note).
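A minimal sketch of part 1, following the types used in the snippet quoted above (the lock is an assumption added to keep lazy init safe under concurrent first calls; it is not part of the proposal itself):

import threading
from typing import Optional

from chromadb.api.types import Documents, EmbeddingFunction, Embeddings


class DefaultEmbeddingFunction(EmbeddingFunction[Documents]):
    _instance: Optional["ONNXMiniLM_L6_V2"] = None
    _lock = threading.Lock()

    def __call__(self, input: Documents) -> Embeddings:
        # Build the embedding model once and reuse it for every call.
        if DefaultEmbeddingFunction._instance is None:
            with DefaultEmbeddingFunction._lock:
                if DefaultEmbeddingFunction._instance is None:
                    from chromadb.utils.embedding_functions.onnx_mini_lm_l6_v2 import (
                        ONNXMiniLM_L6_V2,
                    )
                    # Part 2 would also wire in the single-thread session settings here.
                    DefaultEmbeddingFunction._instance = ONNXMiniLM_L6_V2()
        return DefaultEmbeddingFunction._instance(input)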

InferenceSession.run() is documented as thread-safe; the Rust-backed tokenizers.Tokenizer is thread-safe for encode. Vectors are byte-identical pre- and post-patch for identical input — no index rebuild required by users applying the change.
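On the onnxruntime side, part 2 amounts to the standard SessionOptions knobs (the model path below is a placeholder, not chroma's actual cache location):

import onnxruntime as ort

opts = ort.SessionOptions()
opts.intra_op_num_threads = 1  # don't fan a single embed call out across every core
opts.inter_op_num_threads = 1  # run operators for a call one at a time
session = ort.InferenceSession("model.onnx", sess_options=opts)

# Concurrent callers can share `session`: since InferenceSession.run() is
# thread-safe, parallelism comes from the callers' threads landing on
# different cores rather than from intra-op fan-out inside one call.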

Workaround downstream

A monkey-patch of DefaultEmbeddingFunction.__call__ that applies the singleton + intra_op=1 settings produces the numbers above. It is version-pinned to chromadb 1.5.8 because it depends on the exact call site. Discovered during MemPalace-sidecar integration work (xelauvas/xelasphere, April 2026); happy to share the patch module and full benchmark artifacts if useful.
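Roughly the shape of that patch, for anyone applying it before an upstream fix lands (names here are illustrative; only the singleton half is shown, since whether ONNXMiniLM_L6_V2 exposes session options directly is version-dependent):

import chromadb.api.types as chroma_types
from chromadb.utils.embedding_functions.onnx_mini_lm_l6_v2 import ONNXMiniLM_L6_V2

_shared_ef = None


def _patched_call(self, input):
    # Lazily build one shared instance and reuse it for every call.
    global _shared_ef
    if _shared_ef is None:
        _shared_ef = ONNXMiniLM_L6_V2()
    return _shared_ef(input)


# Patch the class so every DefaultEmbeddingFunction instance (including the
# ones chroma creates internally) routes through the shared instance.
chroma_types.DefaultEmbeddingFunction.__call__ = _patched_call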

Version

chromadb 1.5.8 (reproducible back to at least 1.4.x based on the identical __call__ body). Python 3.12.3, onnxruntime 1.24.4.
