## Problem
Mining large directories is CPU-bound on the embedding step. On a 500-file directory producing ~9,000 drawers, a single mine takes 8+ minutes on CPU. For palaces with 50k+ drawers, full re-mines become impractical.
## Proposal
Add optional GPU-accelerated embeddings via `sentence-transformers`, keeping the existing ONNX/CPU path as the default. This is implemented and battle-tested in my fork `mempalace-gpu`.
## Design
- New `mempalace/embeddings.py` — shared embedding function factory with device detection (a minimal sketch follows this list)
- All ChromaDB collection access goes through `embeddings.get_collection()` for vector compatibility
- `sentence-transformers` `all-MiniLM-L6-v2` produces identical vectors to ChromaDB's default ONNX model — existing palaces stay compatible
- Device selection via `--device auto|cuda|rocm|mps|cpu`, the `MEMPALACE_DEVICE` env var, or `config.json`
- Change detection uses content hash (MD5) rather than mtime — more reliable across file systems and git operations
- Zero new required dependencies — GPU is an optional extra: `pip install mempalace[gpu]`
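To make the design concrete, here is a minimal sketch of how the device detection and shared factory could fit together. The function names (`detect_device`, `get_embedding_function`, `get_collection`) follow the proposal, but the bodies are my assumptions rather than the fork's actual code; it assumes `chromadb` is installed in the base package and `torch`/`sentence-transformers` only arrive with the `[gpu]` extra.

```python
# Illustrative sketch of the embeddings.py factory described above; names follow
# the proposal, but the implementation details are assumptions.
import os

from chromadb.utils import embedding_functions


def detect_device(requested: str = "auto") -> str:
    """Resolve --device / MEMPALACE_DEVICE down to a concrete torch device string."""
    requested = os.environ.get("MEMPALACE_DEVICE", requested)
    if requested != "auto":
        # ROCm builds of torch expose the GPU through the "cuda" device string
        return "cuda" if requested == "rocm" else requested
    try:
        import torch
    except ImportError:
        return "cpu"  # base install has no torch: stay on the ONNX/CPU path
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"  # Apple Silicon defaults to CPU; opt in explicitly with --device mps


def get_embedding_function(device: str | None = None):
    """One shared embedding function so every component produces compatible vectors."""
    device = device or detect_device()
    if device == "cpu":
        return embedding_functions.DefaultEmbeddingFunction()  # ChromaDB's ONNX MiniLM
    return embedding_functions.SentenceTransformerEmbeddingFunction(
        model_name="all-MiniLM-L6-v2", device=device
    )


def get_collection(client, name: str, device: str | None = None):
    """Single entry point for collection access, per the design item above."""
    return client.get_or_create_collection(
        name=name, embedding_function=get_embedding_function(device)
    )
```

The `config.json` precedence would slot into `detect_device` ahead of the env-var lookup; it is omitted here only to keep the sketch short.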
## Architecture alignment
- Local first: everything runs on the user's machine, no cloud
- Zero API by default: base install unchanged, GPU is opt-in
- Verbatim storage: embedding layer only affects indexing speed, not content
- Palace structure preserved: wings/halls/rooms untouched
## Benchmark results
### NVIDIA RTX 4080 SUPER (Windows)
| Test | Files | Drawers | CPU | CUDA | Speedup |
|------|-------|---------|-----|------|---------|
| Large mixed codebase | 118 | 13,673 | 156.7s | 26.3s | 6.0x |
| Medium Flutter app | 145 | 2,906 | 37.3s | 10.7s | 3.5x |
### Apple M1 (macOS)
| Test | Files | Drawers | CPU | MPS | Winner |
|------|-------|---------|-----|-----|--------|
| ~/Documents | 500 | 1,239 | 6:09 | 5:48 | MPS 1.06x |
| ~/home | 500 | 8,886 | 8:28 | 17:16 | CPU 2.0x |
Key finding: Apple MPS doesn't win on wall-clock time for mining — it's actually 2x slower on larger workloads due to CPU-GPU data transfer overhead with small embedding batches. However, MPS provides a 12x reduction in CPU utilization (5-6% vs 60-69%), leaving the processor almost entirely free for other work during mining. This matters for developers running mines in the background while working.
The implementation defaults to CPU on Apple Silicon since wall-clock time is the primary concern for most users. Those who prefer lower CPU load can opt in with `--device mps`.
NVIDIA CUDA shows consistent 3-6x wall-clock gains with no such trade-off.
Full benchmark methodology: `benchmarks/apple_m1_results.md`
## Scope
This proposal covers only the embedding acceleration layer. Kept separate from #353 (incremental update) so each can be reviewed independently.
## Files changed (from upstream)
| File | Change |
|------|--------|
| `mempalace/embeddings.py` | New — GPU detection, embedding factory, batch flush, collection wrapper |
| `mempalace/miner.py` | Batched `collection.add()` (100 docs/call instead of 1; sketched below) |
| `mempalace/convo_miner.py` | Batched `collection.add()` |
| `mempalace/config.py` | `device` property |
| `mempalace/cli.py` | `--device` flag |
| `mempalace/searcher.py` | Shared embedding function |
| `mempalace/mcp_server.py` | Shared embedding function |
| `mempalace/layers.py` | Shared embedding function |
| `mempalace/palace_graph.py` | Shared embedding function |
| `pyproject.toml` | `gpu` optional dependency group |
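For reference, the `miner.py` batching change listed above amounts to buffering drawer inserts and issuing one `collection.add()` per 100 documents. The sketch below shows the shape of that change under assumed drawer dict keys; it is not the fork's exact code.

```python
# Illustrative batched insert: one collection.add() call per BATCH_SIZE drawers
# instead of one call per drawer.
BATCH_SIZE = 100


def add_drawers(collection, drawers):
    docs, ids, metas = [], [], []
    for drawer in drawers:  # assumed shape: {"id": ..., "text": ..., "meta": {...}}
        docs.append(drawer["text"])
        ids.append(drawer["id"])
        metas.append(drawer["meta"])
        if len(docs) == BATCH_SIZE:
            collection.add(documents=docs, ids=ids, metadatas=metas)
            docs, ids, metas = [], [], []
    if docs:  # flush the final partial batch
        collection.add(documents=docs, ids=ids, metadatas=metas)
```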
## Additional discoveries (implemented in fork)
- ChromaDB `collection.add()` hard limit at 5,461 items — batch flush now chunks at 5,000
- Embedding function conflict when opening collections created with different embedders — added fallback with warning
- One-time `verify_embedding_compatibility` check: when falling back to the default embedder, queries the collection with a known test vector and warns if L2 distance suggests incompatible embedding spaces (cached per session so it only runs once per collection; sketched below)
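A rough illustration of that compatibility check follows. The probe string, distance threshold, and caching structure are made-up values for the sketch; the fork's implementation may differ.

```python
# Hedged sketch: query the collection once with a fixed probe string and warn if
# even the nearest stored vector is far away in L2 distance, which usually means
# the stored vectors came from a different embedding model.
import warnings

_checked_collections: set[str] = set()


def verify_embedding_compatibility(collection, probe="embedding compatibility probe",
                                   max_l2_distance=1.5) -> None:
    if collection.name in _checked_collections:
        return  # cached per session: run at most once per collection
    _checked_collections.add(collection.name)
    if collection.count() == 0:
        return  # empty collection: nothing to compare against
    result = collection.query(query_texts=[probe], n_results=1)
    distances = (result.get("distances") or [[]])[0]
    if distances and distances[0] > max_l2_distance:
        warnings.warn(
            f"Collection '{collection.name}' may have been built with a different "
            f"embedding model (nearest L2 distance {distances[0]:.2f})."
        )
```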
## Reference
Working implementation: `phobicdotno/mempalace-gpu`
Happy to open a PR from a clean branch once the approach is discussed. The previous #351 bundled too much — this is the focused GPU-only slice.