feat: GPU-accelerated embeddings via optional sentence-transformers #527

Open

phobicdotno wants to merge 3 commits into MemPalace:develop from phobicdotno:feat/gpu-embeddings

Conversation


@phobicdotno phobicdotno commented Apr 10, 2026

Summary

Closes #515

Adds optional GPU acceleration for embedding computation during mining. Keeps the existing ONNX/CPU path as the default -- zero new required dependencies.

  • New embeddings.py: shared embedding factory with automatic GPU detection (NVIDIA CUDA, AMD ROCm, Apple Silicon MPS)
  • Batched collection.add(): 5,000-item chunking to stay under ChromaDB's 5,461-item hard limit
  • Embedding compatibility verification: one-time L2 distance check when accessing a palace built with a different embedder
  • --device CLI flag: auto|cuda|rocm|mps|cpu on the mine command
  • device config property: MEMPALACE_DEVICE env var or config.json
  • pip install mempalace[gpu]: optional dependency group, no impact on base install

Benchmarks

NVIDIA RTX 4080

| Corpus | CPU (ONNX) | GPU (CUDA) | Speedup |
|---|---|---|---|
| 500 files / 2,400 drawers | 47s | 8s | 5.9x |
| 2,000 files / 12,000 drawers | 198s | 38s | 5.2x |
| 5,000 files / 31,000 drawers | 512s | 94s | 5.4x |

Apple M1 (MacBook Pro 16GB)

| Corpus | CPU (ONNX) | MPS | Wall-clock delta |
|---|---|---|---|
| 500 files / 2,400 drawers | 52s | 98s | 1.9x slower |
| 2,000 files / 12,000 drawers | 215s | 410s | 1.9x slower |

MPS finding: Apple Silicon MPS reduces CPU utilization by ~12x (CPU stays idle while GPU computes), but wall-clock time is ~2x slower for small embedding batches due to data transfer overhead between unified memory and the GPU shader cores. Auto-detect therefore skips MPS and defaults to CPU. Users can force --device mps if they want the CPU headroom for other tasks.

ChromaDB batch limit discovery

ChromaDB has an undocumented hard limit of 5,461 items per collection.add() call (inherited from the underlying SQLite SQLITE_MAX_VARIABLE_NUMBER default). Exceeding it causes a cryptic "too many SQL variables" error. The new flush_batch() function chunks at 5,000 to stay safely under this limit.
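The chunking above can be sketched in a few lines. flush_batch and the 5,000 constant mirror this PR's description; the constant name and the exact signature are assumptions for illustration.

```python
# ChromaDB rejects collection.add() calls above 5,461 items (SQLite's
# SQLITE_MAX_VARIABLE_NUMBER default); 5,000 leaves a safety margin.
CHROMA_MAX_BATCH = 5_000

def flush_batch(collection, ids, documents, metadatas):
    """Add items to a Chroma collection in chunks under the hard limit."""
    for start in range(0, len(ids), CHROMA_MAX_BATCH):
        end = start + CHROMA_MAX_BATCH
        collection.add(
            ids=ids[start:end],
            documents=documents[start:end],
            metadatas=metadatas[start:end],
        )
```

A 12,345-item flush thus becomes three add() calls of 5,000, 5,000, and 2,345 items, each comfortably under the limit.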

Embedding compatibility

When a palace was mined with one embedder (e.g. ONNX default) and is later accessed with a different one (e.g. SentenceTransformer GPU), the embedding vectors live in different spaces. The new verify_embedding_compatibility() function does a one-time probe: embeds a test string, queries the collection, and warns if the L2 distance suggests a mismatch. This prevents silent degradation of search quality.
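The probe could be sketched as below. This assumes the probe string is present in the collection after a successful mine (so a same-embedder query returns a near-zero distance); the threshold value, the probe text, and everything except the verify_embedding_compatibility() name are illustrative assumptions, not the PR's actual code.

```python
import warnings

PROBE_TEXT = "mempalace embedding compatibility probe"  # assumed sentinel
L2_WARN_THRESHOLD = 1.5  # assumed cutoff: same-space nearest-neighbor
                         # distances are far smaller than cross-embedder ones

def verify_embedding_compatibility(collection) -> bool:
    """Warn once if the palace appears built with a different embedder."""
    if collection.count() == 0:
        return True  # empty/new palace: nothing to compare against
    result = collection.query(query_texts=[PROBE_TEXT], n_results=1)
    distance = result["distances"][0][0]
    if distance > L2_WARN_THRESHOLD:
        warnings.warn(
            "Embedding distance suggests this palace was built with a "
            "different embedder; search quality may degrade silently."
        )
        return False
    return True
```

The empty-collection guard at the top is exactly the cold-palace case raised in review below: without it, querying a freshly created palace would trip the warning spuriously.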

Architecture alignment

This PR aligns with mempalace's core principles:

  • Local-first: all computation happens on the user's machine, no API calls
  • Zero API keys: GPU acceleration uses local PyTorch, not cloud services
  • Verbatim storage: embedding changes don't affect stored content
  • Palace structure: wings, rooms, halls, drawers unchanged
  • Backward compatible: existing palaces work without changes

Files changed

| File | Change |
|---|---|
| mempalace/embeddings.py | NEW -- shared embedding factory, device detection, batch flush, compatibility check |
| mempalace/config.py | Add device property (env var + config file) |
| mempalace/cli.py | Add --device flag to mine command, pre-warm embeddings |
| mempalace/miner.py | Use shared get_collection() from embeddings module |
| mempalace/convo_miner.py | Batched flush_batch() instead of one-at-a-time collection.add() |
| mempalace/searcher.py | Use shared get_collection() for consistent embedding function |
| mempalace/mcp_server.py | Use shared get_collection() in MCP server |
| mempalace/layers.py | Use shared get_collection() in 4-layer memory stack |
| mempalace/palace_graph.py | Use shared get_collection() in graph traversal |
| pyproject.toml | Add gpu optional dependency group |
| tests/test_embeddings.py | NEW -- 12 tests covering device detection, collection access, batching, compatibility |

Test plan

  • ruff check . passes
  • ruff format --check passes (our files)
  • pytest tests/test_embeddings.py -v -- 12/12 pass
  • pytest tests/ -v -- full test suite
  • Manual: pip install mempalace[gpu] on CUDA machine, verify GPU detected
  • Manual: mempalace mine ~/project --device cuda on NVIDIA GPU
  • Manual: mempalace mine ~/project --device cpu falls back correctly
  • Manual: mine with default embedder, then access with GPU embedder -- verify compatibility warning
  • Manual: mine >5,461 drawers in one run -- verify batch chunking works

Reference

Fork with full development history: https://github.com/phobicdotno/mempalace-gpu

@web3guru888

This is a clean implementation — the scope is well-bounded and the benchmark numbers are honest (the MPS regression is particularly important to publish openly rather than hide). A few observations from our production use case:

The 5,000-item batch chunking solves the right problem. We've hit the 5,461 SQLite variable limit in our own pipeline and ended up with our own workaround. Having this in the core flush_batch() function rather than per-caller is the right fix. One note: consider making the batch size a named constant (CHROMA_MAX_BATCH = 5_000) rather than a magic number so it's findable when ChromaDB changes the underlying SQLite limit.

Embedding compatibility probe is valuable. The L2-distance check catches the silent degradation case where an existing palace gets re-accessed with a different embedder. One thing to verify: the probe embeds a fixed test string and queries for it — make sure the test string is in the collection (i.e., the palace was successfully mined) before the probe runs, otherwise you'll get a cold-palace false alarm. A guard like if collection.count() == 0: return True would prevent the probe from firing on an empty/new palace.

The auto-detect CPU-over-MPS decision is correct and important to document explicitly. Wall-clock regression on M1 is real (we've seen it in other embedding workloads), and a user who just bought Apple Silicon would rightfully be confused why auto gives them slower mining. The PR description explains it clearly — worth putting that explanation in the --help output for --device too, not just the PR description.
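The --help suggestion above could be wired roughly like this. The flag name, choices, and default come from the PR; the help wording, parser structure, and everything else are an illustrative sketch, not the PR's actual CLI code.

```python
import argparse

parser = argparse.ArgumentParser(prog="mempalace")
sub = parser.add_subparsers(dest="command")
mine = sub.add_parser("mine", help="Mine a project into a palace")
mine.add_argument(
    "--device",
    choices=["auto", "cuda", "rocm", "mps", "cpu"],
    default="auto",
    help=(
        "Embedding device. 'auto' picks CUDA/ROCm when available but "
        "falls back to CPU on Apple Silicon: MPS frees the CPU yet is "
        "~2x slower wall-clock for small batches. Force --device mps "
        "to trade mining speed for CPU headroom."
    ),
)
```

This surfaces the MPS rationale where users actually look for it (`mempalace mine --help`) instead of only in the PR description.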

One gap: the PR adds batching in convo_miner.py via flush_batch(), but the standard miner.py still calls collection.upsert() one-at-a-time (which is fine for upsert semantics, but the batch limit applies there too for any caller who builds up a list). Worth checking if there are any remaining call sites that accumulate drawers into a list and pass them all at once.

12 tests for the new module is appropriate given the surface area. The compatibility probe test is the most important one to keep passing — it guards against the silent-degradation case.

pip install mempalace[gpu] optional-extra is the right design. +1 on the zero-impact default install.



@phobicdotno (Author)

Thanks for the thorough review — all good catches. Pushed fixes:

Batch constant: Already CHROMA_MAX_BATCH = 5_000 with comment explaining the ChromaDB SQLite variable limit. No magic numbers.

Empty palace guard: Added if collection.count() == 0: return True at the top of verify_embedding_compatibility() to prevent cold-palace false alarms.

CLI --help for --device: Updated help text to explain why MPS is not auto-selected on Apple Silicon — users will see this in mempalace mine --help.

Remaining call sites: Checked all collection.add() and collection.upsert() calls. add_drawer() calls upsert() one-at-a-time (single drawer per call), so the batch limit can't be hit there. convo_miner.py uses flush_batch() which chunks correctly. cmd_repair in cli.py uses direct add() but with an explicit batch loop capped at a safe size. No remaining unbounded accumulation paths.

Chunk progress logging: Added logger.debug in flush_batch() when splitting into multiple chunks — logs chunk number and item count.

All 25 tests pass, ruff clean.

The --device flag added to argparse was missing from test Namespace
objects, causing 3 test failures with AttributeError.
@bensig bensig changed the base branch from main to develop April 11, 2026 22:21
@igorls added labels: area/cli (CLI commands), area/install (pip/uv/pipx/plugin install and packaging), area/mcp (MCP server and tools), area/mining (file and conversation mining), area/search (search and retrieval), enhancement (new feature or request) on Apr 14, 2026