
feat: GPU-accelerated embeddings via optional sentence-transformers #515

@phobicdotno

Description

Problem

Mining large directories is CPU-bound on the embedding step. On a 500-file directory producing ~9,000 drawers, a single mine takes 8+ minutes on CPU. For palaces with 50k+ drawers, full re-mines become impractical.

Proposal

Add optional GPU-accelerated embeddings via sentence-transformers, keeping the existing ONNX/CPU path as the default. This is implemented and battle-tested in my fork mempalace-gpu.

Design

  • New mempalace/embeddings.py — shared embedding-function factory with device detection (sketched after this list)
  • All ChromaDB collection access goes through embeddings.get_collection() for vector compatibility
  • sentence-transformers all-MiniLM-L6-v2 produces identical vectors to ChromaDB's default ONNX model — existing palaces stay compatible
  • Device selection via --device auto|cuda|rocm|mps|cpu, MEMPALACE_DEVICE env var, or config.json
  • Change detection uses content hash (md5) rather than mtime — more reliable across file systems and git operations
  • Zero new required dependencies — GPU is an optional extra: pip install mempalace[gpu]
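
A rough sketch of what the embeddings.py factory could look like; the function names (detect_device, get_embedding_function, get_collection, content_hash) and the exact fallback logic are assumptions inferred from the bullets above, not necessarily the fork's actual code:

```python
# Illustrative sketch of the proposed embeddings.py, not the fork's exact API.
import hashlib
import os

def detect_device(requested: str = "auto") -> str:
    """Resolve the --device flag / MEMPALACE_DEVICE env var into a concrete device."""
    requested = os.environ.get("MEMPALACE_DEVICE", requested)
    if requested != "auto":
        return requested
    try:
        import torch  # only present with `pip install mempalace[gpu]`
    except ImportError:
        return "cpu"  # base install: keep the ONNX/CPU default path
    if torch.cuda.is_available():
        return "cuda"  # ROCm builds of PyTorch also report through the cuda API
    # MPS loses on wall-clock time for mining (see benchmarks below),
    # so `auto` stays on CPU for Apple Silicon; users opt in with --device mps.
    return "cpu"

def get_embedding_function(device: str = "auto"):
    """Return an embedding function; all-MiniLM-L6-v2 on both paths keeps vectors compatible."""
    from chromadb.utils import embedding_functions
    device = detect_device(device)
    if device == "cpu":
        return embedding_functions.DefaultEmbeddingFunction()  # ChromaDB's ONNX MiniLM
    return embedding_functions.SentenceTransformerEmbeddingFunction(
        model_name="all-MiniLM-L6-v2", device=device
    )

def get_collection(client, name: str, device: str = "auto"):
    """Single entry point for collection access so every caller embeds the same way."""
    return client.get_or_create_collection(
        name=name, embedding_function=get_embedding_function(device)
    )

def content_hash(path: str) -> str:
    """md5 of file contents for change detection, independent of mtime."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()
```

Using the same all-MiniLM-L6-v2 model on both paths is what keeps existing palaces compatible: only the execution backend changes, not the vectors.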

Architecture alignment

  • Local first: everything runs on the user's machine, no cloud
  • Zero API by default: base install unchanged, GPU is opt-in
  • Verbatim storage: embedding layer only affects indexing speed, not content
  • Palace structure preserved: wings/halls/rooms untouched

Benchmark results

NVIDIA RTX 4080 SUPER (Windows)

| Test | Files | Drawers | CPU | CUDA | Speedup |
|---|---|---|---|---|---|
| Large mixed codebase | 118 | 13,673 | 156.7s | 26.3s | 6.0x |
| Medium Flutter app | 145 | 2,906 | 37.3s | 10.7s | 3.5x |

Apple M1 (macOS)

| Test | Files | Drawers | CPU (mm:ss) | MPS (mm:ss) | Winner |
|---|---|---|---|---|---|
| ~/Documents | 500 | 1,239 | 6:09 | 5:48 | MPS (1.06x) |
| ~/home | 500 | 8,886 | 8:28 | 17:16 | CPU (2.0x) |

Key finding: Apple MPS doesn't win on wall-clock time for mining — it's actually 2x slower on larger workloads due to CPU-GPU data transfer overhead with small embedding batches. However, MPS provides a 12x reduction in CPU utilization (5-6% vs 60-69%), leaving the processor almost entirely free for other work during mining. This matters for developers running mines in the background while working.

The implementation defaults to CPU on Apple Silicon since wall-clock time is the primary concern for most users. Those who prefer lower CPU load can opt in with --device mps.

NVIDIA CUDA shows consistent 3-6x wall-clock gains with no such trade-off.

Full benchmark methodology: benchmarks/apple_m1_results.md

Scope

This proposal covers only the embedding acceleration layer. Kept separate from #353 (incremental update) so each can be reviewed independently.

Files changed (from upstream)

| File | Change |
|---|---|
| mempalace/embeddings.py | New — GPU detection, embedding factory, batch flush, collection wrapper |
| mempalace/miner.py | Batched collection.add() (100 docs/call instead of 1) |
| mempalace/convo_miner.py | Batched collection.add() |
| mempalace/config.py | device property |
| mempalace/cli.py | --device flag |
| mempalace/searcher.py | Shared embedding function |
| mempalace/mcp_server.py | Shared embedding function |
| mempalace/layers.py | Shared embedding function |
| mempalace/palace_graph.py | Shared embedding function |
| pyproject.toml | gpu optional dependency group |
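
For illustration, the batched collection.add() change in miner.py and convo_miner.py might look roughly like the sketch below; the buffer class and its name are hypothetical, and only the 100-docs-per-call figure comes from the table above:

```python
# Hypothetical buffer for the batched collection.add() change. Accumulates
# drawers and writes them 100 at a time instead of one add() per drawer.
BATCH_SIZE = 100  # docs per collection.add() call, per the table above

class DrawerBatch:
    def __init__(self, collection):
        self.collection = collection
        self.ids, self.docs, self.metas = [], [], []

    def add(self, drawer_id: str, text: str, metadata: dict):
        self.ids.append(drawer_id)
        self.docs.append(text)
        self.metas.append(metadata)
        if len(self.ids) >= BATCH_SIZE:
            self.flush()

    def flush(self):
        # One round trip per batch instead of per document.
        if self.ids:
            self.collection.add(ids=self.ids, documents=self.docs, metadatas=self.metas)
            self.ids, self.docs, self.metas = [], [], []
```

Callers would flush once more at the end of a mine so a partial batch is not dropped.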

Additional discoveries (implemented in fork)

  • ChromaDB's collection.add() has a hard limit of 5,461 items per call — batch flush now chunks at 5,000
  • Opening a collection created with a different embedder raises an embedding-function conflict — added a fallback with a warning
  • One-time verify_embedding_compatibility check: when falling back to the default embedder, it queries the collection with a known test vector and warns if the L2 distance suggests incompatible embedding spaces; the result is cached per session so the check runs only once per collection (the chunking and the compatibility check are sketched below)
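
Hedged sketches of the first and third fixes; the names chunked_add, CHUNK, and the warning threshold are assumptions, while the 5,000 figure and the L2-distance idea come from the bullets above:

```python
# Sketches only: CHUNK and warn_threshold are illustrative values.
import warnings

CHUNK = 5_000  # stay under ChromaDB's ~5,461-item add() ceiling

def chunked_add(collection, ids, documents, metadatas):
    # Split one large flush into several add() calls.
    for i in range(0, len(ids), CHUNK):
        collection.add(
            ids=ids[i:i + CHUNK],
            documents=documents[i:i + CHUNK],
            metadatas=metadatas[i:i + CHUNK],
        )

_verified: set = set()  # session cache: run the check once per collection

def verify_embedding_compatibility(collection, warn_threshold: float = 1.5):
    # Query with a known probe text using the fallback embedder; if even the
    # nearest stored vector is far away, the spaces are likely incompatible.
    if collection.name in _verified or collection.count() == 0:
        return
    _verified.add(collection.name)
    result = collection.query(query_texts=["compatibility probe"], n_results=1)
    distance = result["distances"][0][0]
    if distance > warn_threshold:
        warnings.warn(
            f"Collection {collection.name!r} may have been created with a "
            f"different embedding model (nearest L2 distance {distance:.2f})"
        )
```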

Reference

Working implementation: phobicdotno/mempalace-gpu

Happy to open a PR from a clean branch once the approach is discussed. The previous #351 bundled too much — this is the focused GPU-only slice.
