feat: GPU-accelerated embeddings via optional sentence-transformers #527
phobicdotno wants to merge 3 commits into MemPalace:develop
Conversation
Adds GPU support (NVIDIA CUDA, AMD ROCm, Apple Silicon MPS) for embedding computation during mining. Keeps the existing ONNX/CPU path as the default.

- New `embeddings.py`: shared embedding factory with device detection
- Batched `collection.add()` with 5,000-item chunking (ChromaDB hard limit)
- Embedding compatibility verification on cross-embedder palace access
- `--device auto|cuda|rocm|mps|cpu` flag for the CLI
- Zero new required dependencies: `pip install mempalace[gpu]` to opt in

Closes MemPalace#515
This is a clean implementation — the scope is well-bounded and the benchmark numbers are honest (the MPS regression is particularly important to publish openly rather than hide). A few observations from our production use case:

**The 5,000-item batch chunking solves the right problem.** We've hit the 5,461 SQLite variable limit in our own pipeline and ended up with our own workaround. Having this handled in the core library is welcome.

**The embedding compatibility probe is valuable.** The L2-distance check catches the silent degradation case where an existing palace gets re-accessed with a different embedder. One thing to verify: the probe embeds a fixed test string and queries for it — make sure the collection actually contains documents (i.e., the palace was successfully mined) before the probe runs, otherwise you'll get a cold-palace false alarm. A guard like a simple document-count check would cover it.

**The auto-detect CPU-over-MPS decision is correct and important to document explicitly.** The wall-clock regression on M1 is real (we've seen it in other embedding workloads), and a user who just bought Apple Silicon would rightfully be confused why their GPU isn't selected automatically.

**One gap:** the PR adds batching in the mining path, but the other `collection.add()` call sites should be checked against the same limit.

**12 tests for the new module is appropriate** given the surface area. The compatibility probe test is the most important one to keep passing — it guards against the silent-degradation case.
[MemPalace-AGI integration — production stats at https://milla-jovovich.github.io/mempalace/integrations/mempalace-agi/] |
Thanks for the thorough review — all good catches. Pushed fixes:

- **Batch constant:** already a single module-level constant, so the margin lives in one place.
- **Empty palace guard:** added a check so the probe only runs when the collection has documents.
- **CLI `--help` for `--device`:** updated help text to explain why MPS is not auto-selected on Apple Silicon — users will see this in `--help` output.
- **Remaining call sites:** checked all other `collection.add()` call sites.
- **Chunk progress logging:** added per-chunk progress output.

All 25 tests pass, ruff clean.
The --device flag added to argparse was missing from test Namespace objects, causing 3 test failures with AttributeError.
Summary
Closes #515
Adds optional GPU acceleration for embedding computation during mining. Keeps the existing ONNX/CPU path as the default -- zero new required dependencies.
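For context, the auto-detection behavior can be sketched roughly like this (a hypothetical helper, not the actual `embeddings.py` API; it assumes `torch` only arrives via the optional `[gpu]` extra):

```python
def detect_device(requested: str = "auto") -> str:
    """Resolve an explicit or auto-detected compute device (sketch only)."""
    if requested != "auto":
        return requested           # trust an explicit --device choice
    try:
        import torch               # present only after `pip install mempalace[gpu]`
    except ImportError:
        return "cpu"               # base install: ONNX/CPU path
    if torch.cuda.is_available():  # True on both CUDA and ROCm builds of torch
        return "cuda"
    # MPS exists on Apple Silicon, but small-batch wall-clock time is ~2x
    # slower (see benchmarks below), so auto mode deliberately falls back to CPU.
    return "cpu"


print(detect_device("mps"))  # explicit choice is passed through: "mps"
```

An explicit `--device` always wins over auto-detection, which keeps the MPS opt-in path available for users who want the CPU headroom.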
- `embeddings.py`: shared embedding factory with automatic GPU detection (NVIDIA CUDA, AMD ROCm, Apple Silicon MPS)
- `collection.add()`: 5,000-item chunking to stay under ChromaDB's 5,461 hard limit
- `--device` CLI flag: `auto|cuda|rocm|mps|cpu` on the `mine` command
- `device` config property: `MEMPALACE_DEVICE` env var or `config.json`
- `pip install mempalace[gpu]`: optional dependency group, no impact on base install

Benchmarks
NVIDIA RTX 4080
Apple M1 (MacBook Pro 16GB)
MPS finding: Apple Silicon MPS reduces CPU utilization by ~12x (CPU stays idle while GPU computes), but wall-clock time is ~2x slower for small embedding batches due to data transfer overhead between unified memory and the GPU shader cores. Auto-detect therefore skips MPS and defaults to CPU. Users can force `--device mps` if they want the CPU headroom for other tasks.

ChromaDB batch limit discovery
ChromaDB has an undocumented hard limit of 5,461 items per `collection.add()` call (inherited from the underlying SQLite `SQLITE_MAX_VARIABLE_NUMBER` default). Exceeding it causes a cryptic `too many SQL variables` error. The new `flush_batch()` function chunks at 5,000 to stay safely under this limit.

Embedding compatibility
When a palace was mined with one embedder (e.g. ONNX default) and is later accessed with a different one (e.g. SentenceTransformer GPU), the embedding vectors live in different spaces. The new `verify_embedding_compatibility()` function does a one-time probe: it embeds a test string, queries the collection, and warns if the L2 distance suggests a mismatch. This prevents silent degradation of search quality.

Architecture alignment
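The L2 check itself reduces to comparing the probe embedding against its stored counterpart. A sketch, with illustrative names and an illustrative threshold (not the PR's actual cutoff):

```python
import math


def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def embedders_look_compatible(probe_vec, stored_vec, threshold=0.5):
    # Same text, two embedders: if the current embedder's vector lands far
    # from the stored one, the palace was likely mined with a different
    # model, and search quality would silently degrade without a warning.
    return l2_distance(probe_vec, stored_vec) < threshold


same = [0.1, 0.2, 0.3]
print(embedders_look_compatible(same, same))              # True (distance 0)
print(embedders_look_compatible(same, [5.0, -3.0, 2.0]))  # False: mismatch
```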
This PR aligns with mempalace's core principles: the ONNX/CPU path stays the default, and GPU support is strictly opt-in with zero new required dependencies.
Files changed
- `mempalace/embeddings.py`: new shared embedding factory
- `mempalace/config.py`: `device` property (env var + config file)
- `mempalace/cli.py`: `--device` flag on the `mine` command, pre-warm embeddings
- `mempalace/miner.py`: `get_collection()` from embeddings module
- `mempalace/convo_miner.py`: `flush_batch()` instead of one-at-a-time `collection.add()`
- `mempalace/searcher.py`: `get_collection()` for consistent embedding function
- `mempalace/mcp_server.py`: `get_collection()` in MCP server
- `mempalace/layers.py`: `get_collection()` in 4-layer memory stack
- `mempalace/palace_graph.py`: `get_collection()` in graph traversal
- `pyproject.toml`: `gpu` optional dependency group
- `tests/test_embeddings.py`: 12 new tests

Test plan
- `ruff check .` passes
- `ruff format --check` passes (our files)
- `pytest tests/test_embeddings.py -v` -- 12/12 pass
- `pytest tests/ -v` -- full test suite
- `pip install mempalace[gpu]` on a CUDA machine, verify GPU detected
- `mempalace mine ~/project --device cuda` on NVIDIA GPU
- `mempalace mine ~/project --device cpu` falls back correctly

Reference
Fork with full development history: https://github.com/phobicdotno/mempalace-gpu