## Problem
Mining large directories is CPU-bound on the embedding step. On a 500-file directory producing ~9,000 drawers, a single mine takes 8+ minutes on CPU. For palaces with 50k+ drawers, full re-mines become impractical.
## Proposal
Add optional GPU-accelerated embeddings via `sentence-transformers`, keeping the existing ONNX/CPU path as the default. This is implemented and battle-tested in my fork `mempalace-gpu`.
## Design
- New `mempalace/embeddings.py` — shared embedding function factory with device detection (a minimal sketch follows this list)
- All ChromaDB collection access goes through `embeddings.get_collection()` for vector compatibility
- `sentence-transformers` `all-MiniLM-L6-v2` produces identical vectors to ChromaDB's default ONNX model — existing palaces stay compatible
- Device selection via `--device auto|cuda|rocm|mps|cpu`, the `MEMPALACE_DEVICE` env var, or `config.json`
- Change detection uses content hash (MD5) rather than mtime — more reliable across file systems and git operations
- Zero new required dependencies — GPU is an optional extra: `pip install mempalace[gpu]`
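To make the design concrete, here is a minimal sketch of how the device detection and shared factory could fit together. The function names (`detect_device`, `get_embedding_function`, `get_collection`) follow the proposal, but the bodies are my assumptions rather than the fork's actual code; it assumes `chromadb` is installed in the base package and `torch`/`sentence-transformers` only arrive with the `[gpu]` extra.

```python
# Illustrative sketch of the embeddings.py factory described above; names follow
# the proposal, but the implementation details are assumptions.
import os

from chromadb.utils import embedding_functions


def detect_device(requested: str = "auto") -> str:
    """Resolve --device / MEMPALACE_DEVICE down to a concrete torch device string."""
    requested = os.environ.get("MEMPALACE_DEVICE", requested)
    if requested != "auto":
        # ROCm builds of torch expose the GPU through the "cuda" device string
        return "cuda" if requested == "rocm" else requested
    try:
        import torch
    except ImportError:
        return "cpu"  # base install has no torch: stay on the ONNX/CPU path
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"  # Apple Silicon defaults to CPU; opt in explicitly with --device mps


def get_embedding_function(device: str | None = None):
    """One shared embedding function so every component produces compatible vectors."""
    device = device or detect_device()
    if device == "cpu":
        return embedding_functions.DefaultEmbeddingFunction()  # ChromaDB's ONNX MiniLM
    return embedding_functions.SentenceTransformerEmbeddingFunction(
        model_name="all-MiniLM-L6-v2", device=device
    )


def get_collection(client, name: str, device: str | None = None):
    """Single entry point for collection access, per the design item above."""
    return client.get_or_create_collection(
        name=name, embedding_function=get_embedding_function(device)
    )
```

The `config.json` precedence would slot into `detect_device` ahead of the env-var lookup; it is omitted here only to keep the sketch short.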
## Architecture alignment
- Local first: everything runs on the user's machine, no cloud
- Zero API by default: base install unchanged, GPU is opt-in
- Verbatim storage: embedding layer only affects indexing speed, not content
- Palace structure preserved: wings/halls/rooms untouched
## Benchmark results
### NVIDIA RTX 4080 SUPER (Windows)
| Test | Files | Drawers | CPU | CUDA | Speedup |
|------|-------|---------|-----|------|---------|
| Large mixed codebase | 118 | 13,673 | 156.7s | 26.3s | 6.0x |
| Medium Flutter app | 145 | 2,906 | 37.3s | 10.7s | 3.5x |
### Apple M1 (macOS)
| Test | Files | Drawers | CPU | MPS | Winner |
|------|-------|---------|-----|-----|--------|
| ~/Documents | 500 | 1,239 | 6:09 | 5:48 | MPS 1.06x |
| ~/home | 500 | 8,886 | 8:28 | 17:16 | CPU 2.0x |
Key finding: Apple MPS doesn't win on wall-clock time for mining — it's actually 2x slower on larger workloads due to CPU-GPU data transfer overhead with small embedding batches. However, MPS provides a 12x reduction in CPU utilization (5-6% vs 60-69%), leaving the processor almost entirely free for other work during mining. This matters for developers running mines in the background while working.
The implementation defaults to CPU on Apple Silicon since wall-clock time is the primary concern for most users. Those who prefer lower CPU load can opt in with `--device mps`.
NVIDIA CUDA shows consistent 3-6x wall-clock gains with no such trade-off.
Full benchmark methodology: `benchmarks/apple_m1_results.md`
## Scope
This proposal covers only the embedding acceleration layer. Kept separate from #353 (incremental update) so each can be reviewed independently.
## Files changed (from upstream)
| File | Change |
|------|--------|
| `mempalace/embeddings.py` | New — GPU detection, embedding factory, batch flush, collection wrapper |
| `mempalace/miner.py` | Batched `collection.add()` (100 docs/call instead of 1; sketched below) |
| `mempalace/convo_miner.py` | Batched `collection.add()` |
| `mempalace/config.py` | `device` property |
| `mempalace/cli.py` | `--device` flag |
| `mempalace/searcher.py` | Shared embedding function |
| `mempalace/mcp_server.py` | Shared embedding function |
| `mempalace/layers.py` | Shared embedding function |
| `mempalace/palace_graph.py` | Shared embedding function |
| `pyproject.toml` | `gpu` optional dependency group |
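For reference, the `miner.py` batching change listed above amounts to buffering drawer inserts and issuing one `collection.add()` per 100 documents. The sketch below shows the shape of that change under assumed drawer dict keys; it is not the fork's exact code.

```python
# Illustrative batched insert: one collection.add() call per BATCH_SIZE drawers
# instead of one call per drawer.
BATCH_SIZE = 100


def add_drawers(collection, drawers):
    docs, ids, metas = [], [], []
    for drawer in drawers:  # assumed shape: {"id": ..., "text": ..., "meta": {...}}
        docs.append(drawer["text"])
        ids.append(drawer["id"])
        metas.append(drawer["meta"])
        if len(docs) == BATCH_SIZE:
            collection.add(documents=docs, ids=ids, metadatas=metas)
            docs, ids, metas = [], [], []
    if docs:  # flush the final partial batch
        collection.add(documents=docs, ids=ids, metadatas=metas)
```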
## Additional discoveries (implemented in fork)
- ChromaDB `collection.add()` hard limit at 5,461 items — batch flush now chunks at 5,000
- Embedding function conflict when opening collections created with different embedders — added fallback with warning
- One-time `verify_embedding_compatibility` check: when falling back to the default embedder, queries the collection with a known test vector and warns if L2 distance suggests incompatible embedding spaces (cached per session so it only runs once per collection; sketched below)
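A rough illustration of that compatibility check follows. The probe string, distance threshold, and caching structure are made-up values for the sketch; the fork's implementation may differ.

```python
# Hedged sketch: query the collection once with a fixed probe string and warn if
# even the nearest stored vector is far away in L2 distance, which usually means
# the stored vectors came from a different embedding model.
import warnings

_checked_collections: set[str] = set()


def verify_embedding_compatibility(collection, probe="embedding compatibility probe",
                                   max_l2_distance=1.5) -> None:
    if collection.name in _checked_collections:
        return  # cached per session: run at most once per collection
    _checked_collections.add(collection.name)
    if collection.count() == 0:
        return  # empty collection: nothing to compare against
    result = collection.query(query_texts=[probe], n_results=1)
    distances = (result.get("distances") or [[]])[0]
    if distances and distances[0] > max_l2_distance:
        warnings.warn(
            f"Collection '{collection.name}' may have been built with a different "
            f"embedding model (nearest L2 distance {distances[0]:.2f})."
        )
```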
## Reference
Working implementation: `phobicdotno/mempalace-gpu`
Happy to open a PR from a clean branch once the approach is discussed. The previous #351 bundled too much — this is the focused GPU-only slice.