
DefaultEmbeddingFunction.__call__ constructs a new ONNXMiniLM_L6_V2 on every call (10× slowdown on repeated embeds) #6941

@xelauvas

Description


What happens

In chromadb 1.5.8, chromadb/api/types.py ships:

class DefaultEmbeddingFunction(EmbeddingFunction[Documents]):
    def __call__(self, input: Documents) -> Embeddings:
        from chromadb.utils.embedding_functions.onnx_mini_lm_l6_v2 import (
            ONNXMiniLM_L6_V2,
        )
        return ONNXMiniLM_L6_V2()(input)

DefaultEmbeddingFunction.__call__ constructs a fresh ONNXMiniLM_L6_V2 every time it runs, triggering cold lazy-init of the tokenizer (~5ms) and the ONNX InferenceSession (~180ms) per invocation. Users whose workload hits embedding on a hot path (per-request retrieval, streaming ingest, RAG loops) pay ~200ms of avoidable tokenizer+model setup per call.
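For scale, a minimal timing sketch of the difference (using the import path quoted above; the document text and loop counts are arbitrary, and absolute numbers will vary by machine):

import time

from chromadb.utils.embedding_functions.onnx_mini_lm_l6_v2 import ONNXMiniLM_L6_V2

docs = ["the quick brown fox jumps over the lazy dog"]

# Current behavior of DefaultEmbeddingFunction.__call__: a fresh instance per
# call, so the tokenizer and InferenceSession are lazily initialized every time.
t0 = time.perf_counter()
for _ in range(5):
    ONNXMiniLM_L6_V2()(docs)
fresh_ms = (time.perf_counter() - t0) / 5 * 1e3

# With a cached instance the lazy init is paid once.
ef = ONNXMiniLM_L6_V2()
ef(docs)  # warm-up call
t0 = time.perf_counter()
for _ in range(5):
    ef(docs)
cached_ms = (time.perf_counter() - t0) / 5 * 1e3

print(f"fresh instance per call:  {fresh_ms:.1f} ms")
print(f"cached instance per call: {cached_ms:.1f} ms")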

Secondary issue — thread contention under concurrency

Even with a cached instance, the default intra_op_num_threads=0 ("use all cores") causes severe context-switch thrashing when multiple concurrent queries hit the same session: each embed call fans out across all CPUs, producing worse-than-serial scaling.
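A sketch of how that ratio can be measured (shared cached instance, default thread settings; the ratio compares four concurrent embeds against the cost of a single call):

import time
from concurrent.futures import ThreadPoolExecutor

from chromadb.utils.embedding_functions.onnx_mini_lm_l6_v2 import ONNXMiniLM_L6_V2

ef = ONNXMiniLM_L6_V2()
docs = ["a medium-length sentence to embed for the benchmark"] * 8
ef(docs)  # warm up the tokenizer and the InferenceSession once

# Four calls back to back.
t0 = time.perf_counter()
for _ in range(4):
    ef(docs)
serial = time.perf_counter() - t0

# The same four calls in parallel against the shared session.
t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(lambda _: ef(docs), range(4)))
concurrent = time.perf_counter() - t0

single = serial / 4
print(f"4 serial calls:     {serial:.3f} s")
print(f"4 concurrent calls: {concurrent:.3f} s ({concurrent / single:.2f}x a single call)")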

Measurement

Benchmark environment: AMD EPYC-Rome VPS (16 cores), a 300-drawer palace, chromadb 1.5.8:

| Scenario | Pre-fix | Post-fix (singleton + intra_op=1) | Speedup |
| --- | --- | --- | --- |
| semantic_search p95 (single-user; per /search call, one embed inside) | 412 ms | 95 ms | 4.3× |
| Composite 4-layer wake-up p95 | 768 ms | 106 ms | 7.2× |
| 4-concurrent scaling ratio vs single-call | 4.36× | 1.35× | 3.2× better |
| Ingest per drawer | 299 ms | 104 ms | 2.9× |

The 4-concurrent ratio is the one worth dwelling on: 4.36× means four parallel embed calls take longer than running them in series would — the default thread fan-out produces negative scaling.

Proposed fix

Two parts:

  1. Cache a single ONNXMiniLM_L6_V2 instance at the class level (or via a module-level singleton) so that DefaultEmbeddingFunction.__call__ routes every invocation through one instance (sketched below).
  2. Construct that instance with intra_op_num_threads=1, inter_op_num_threads=1 so concurrent embeds parallelize across separate cores instead of contending for the same ones (see the onnxruntime sketch after the thread-safety note).
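A minimal sketch of part 1, following the types used in the snippet quoted above (the lock is an assumption added to keep lazy init safe under concurrent first calls; it is not part of the proposal itself):

import threading
from typing import Optional

from chromadb.api.types import Documents, EmbeddingFunction, Embeddings


class DefaultEmbeddingFunction(EmbeddingFunction[Documents]):
    _instance: Optional["ONNXMiniLM_L6_V2"] = None
    _lock = threading.Lock()

    def __call__(self, input: Documents) -> Embeddings:
        # Build the embedding model once and reuse it for every call.
        if DefaultEmbeddingFunction._instance is None:
            with DefaultEmbeddingFunction._lock:
                if DefaultEmbeddingFunction._instance is None:
                    from chromadb.utils.embedding_functions.onnx_mini_lm_l6_v2 import (
                        ONNXMiniLM_L6_V2,
                    )
                    # Part 2 would also wire in the single-thread session settings here.
                    DefaultEmbeddingFunction._instance = ONNXMiniLM_L6_V2()
        return DefaultEmbeddingFunction._instance(input)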

InferenceSession.run() is documented as thread-safe; the Rust-backed tokenizers.Tokenizer is thread-safe for encode. Vectors are byte-identical pre- and post-patch for identical input — no index rebuild required by users applying the change.
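On the onnxruntime side, part 2 amounts to the standard SessionOptions knobs (the model path below is a placeholder, not chroma's actual cache location):

import onnxruntime as ort

opts = ort.SessionOptions()
opts.intra_op_num_threads = 1  # don't fan a single embed call out across every core
opts.inter_op_num_threads = 1  # run operators for a call one at a time
session = ort.InferenceSession("model.onnx", sess_options=opts)

# Concurrent callers can share `session`: since InferenceSession.run() is
# thread-safe, parallelism comes from the callers' threads landing on
# different cores rather than from intra-op fan-out inside one call.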

Workaround downstream

A monkey-patch of DefaultEmbeddingFunction.__call__ that applies the singleton + intra_op=1 settings produces the numbers above. It is version-pinned to chromadb 1.5.8 because it depends on the exact call site. Discovered during MemPalace-sidecar integration work (xelauvas/xelasphere, April 2026); happy to share the patch module and full benchmark artifacts if useful.
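Roughly the shape of that patch, for anyone applying it before an upstream fix lands (names here are illustrative; only the singleton half is shown, since whether ONNXMiniLM_L6_V2 exposes session options directly is version-dependent):

import chromadb.api.types as chroma_types
from chromadb.utils.embedding_functions.onnx_mini_lm_l6_v2 import ONNXMiniLM_L6_V2

_shared_ef = None


def _patched_call(self, input):
    # Lazily build one shared instance and reuse it for every call.
    global _shared_ef
    if _shared_ef is None:
        _shared_ef = ONNXMiniLM_L6_V2()
    return _shared_ef(input)


# Patch the class so every DefaultEmbeddingFunction instance (including the
# ones chroma creates internally) routes through the shared instance.
chroma_types.DefaultEmbeddingFunction.__call__ = _patched_call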

Version

chromadb 1.5.8 (reproducible back to at least 1.4.x based on the identical __call__ body). Python 3.12.3, onnxruntime 1.24.4.
