GPU-accelerated fork of milla-jovovich/mempalace
This fork adds GPU-accelerated embeddings and batch processing to MemPalace. Supports NVIDIA (CUDA), AMD (ROCm), and Apple Silicon (MPS). For documentation on MemPalace itself (palace structure, AAAK dialect, MCP tools, benchmarks), see the upstream README.
Embeddings are computed via sentence-transformers on GPU when available, falling back to ChromaDB's default CPU/ONNX model when not.
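The GPU-or-CPU fallback described above can be sketched roughly as follows. This is an illustrative sketch, not the fork's actual code: the function name and detection order are assumptions; only the `torch` calls are real APIs (ROCm builds of PyTorch report through `torch.cuda` as well).

```python
def detect_device() -> str:
    """Pick the best available embedding device, falling back to CPU.

    Illustrative sketch only -- the name and ordering are assumptions,
    not the fork's actual implementation.
    """
    try:
        import torch  # optional dependency: absent on CPU-only installs
    except ImportError:
        return "cpu"  # ChromaDB's default CPU/ONNX model handles this case
    if torch.cuda.is_available():  # True for both CUDA and ROCm builds
        return "cuda"
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"  # Apple Silicon
    return "cpu"
```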
```bash
mempalace mine ~/myproject --device auto   # auto-detect best GPU
mempalace mine ~/myproject --device cuda   # NVIDIA
mempalace mine ~/myproject --device rocm   # AMD
mempalace mine ~/myproject --device mps    # Apple Silicon (M1-M5)
mempalace mine ~/myproject --device cpu    # force CPU
```

Also configurable via the `MEMPALACE_DEVICE` env var or `"device"` in `~/.mempalace/config.json`.
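A minimal `~/.mempalace/config.json` setting the device (illustrative — only the `device` key is documented above; any other keys in the real config are omitted here):

```json
{
  "device": "cuda"
}
```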
`collection.add()` calls are batched (100 documents per call instead of one), reducing ChromaDB overhead in both CPU and GPU modes.
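The batching above can be sketched as follows. This is a minimal illustration, not the fork's miner code; `collection` stands for any object with a ChromaDB-style `add(ids=..., documents=...)` method.

```python
def add_in_batches(collection, ids, documents, batch_size=100):
    """Write documents to a ChromaDB-style collection in fixed-size
    batches, rather than one add() call per document."""
    for start in range(0, len(ids), batch_size):
        collection.add(
            ids=ids[start:start + batch_size],
            documents=documents[start:start + batch_size],
        )
```

For 250 drawers this issues three `add()` calls (100, 100, 50) instead of 250.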
The MCP server includes a mempalace_self_update tool that pulls the latest version from PyPI, callable directly from your AI assistant.
Tested on two real-world codebases. GPU: NVIDIA GeForce RTX 4080 SUPER. Same files, same drawers — only the device changes.
| Test | Files | Drawers | Size | CPU | RTX 4080 SUPER | Speedup |
|---|---|---|---|---|---|---|
| Large mixed codebase (JS/TS/Dart/Python/HTML) | 118 | 13,673 | ~1.7 GB | 156.7s | 26.3s | 6.0x |
| Medium Flutter app (Dart/YAML/JSON) | 145 | 2,906 | ~85 MB | 37.3s | 10.7s | 3.5x |
Speedup scales with drawer count. More chunks = more embedding work = bigger GPU advantage. Results will vary by GPU — expect similar gains on any modern NVIDIA/AMD/Apple Silicon GPU.
Tested on a MacBook M1. Key finding: for mining, CPU generally beats MPS on wall-clock time, because small embedding batches incur data-transfer overhead that outweighs the GPU's compute advantage.
| Test | Files | Drawers | MPS (GPU) | CPU | Winner |
|---|---|---|---|---|---|
| ~/Documents (mixed files) | 500 | 1,239 | 5:48 | 6:09 | MPS 1.06x |
| ~/phobic (mixed files) | 500 | 8,886 | 17:16 | 8:28 | CPU 2.0x |
MPS uses 12x less CPU (5-6% vs 60-69%), freeing the processor for other work — but wall-clock time is worse. The fork defaults to CPU on Apple Silicon for this reason. Use --device mps to override.
Full results: benchmarks/apple_m1_results.md
```bash
pip install mempalace-gpu
claude mcp add mempalace-gpu -- python -m mempalace.mcp_server
```

Restart Claude Code and mempalace-gpu appears in `/plugin` with all tools. Works on NVIDIA, AMD, and Apple Silicon; the GPU is auto-detected.
AMD GPUs need the ROCm version of PyTorch installed first:
```bash
pip install torch --index-url https://download.pytorch.org/whl/rocm6.2
pip install mempalace-gpu
```

Installing or upgrading mempalace-gpu only replaces the Python code. Your mined data lives in `~/.mempalace/palace/` (ChromaDB files) and is never touched. Existing palaces remain fully compatible.
```bash
git clone https://github.com/phobicdotno/mempalace-gpu.git
cd mempalace-gpu
pip install -e .
```

To pull in upstream MemPalace changes:

```bash
git remote add upstream https://github.com/milla-jovovich/mempalace.git
git fetch upstream
git merge upstream/main
```

Run your palace on a GPU machine and access it from any device over the network.
```bash
pip install mempalace-gpu[serve]
mempalace serve --port 8420 --token <your-token> --device cuda
```

On your local machine, add a proxy that forwards MCP calls to the remote server:
```bash
claude mcp add mempalace-remote \
  -e MEMPALACE_REMOTE_URL=http://<gpu-host>:8420 \
  -e MEMPALACE_TOKEN=<your-token> \
  -- python -m mempalace.mcp_proxy
```

The proxy speaks MCP stdio to Claude Code and HTTP to the server. All tool calls are forwarded transparently.
| Method | Path | Auth | Description |
|---|---|---|---|
| GET | /health | No | Server status |
| GET | /tools | Yes | List available tools |
| POST | /tool/{name} | Yes | Call a tool with JSON body |
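A direct call against the HTTP API can be sketched with only the standard library. The endpoint path comes from the table above; the `Authorization: Bearer` header scheme and the tool name used in the comment are assumptions, not confirmed by the fork's docs.

```python
import json
import urllib.request


def build_tool_request(base_url, token, name, arguments):
    """Build a POST /tool/{name} request for the HTTP API above.

    The bearer-token header is an assumption about the auth scheme.
    """
    return urllib.request.Request(
        f"{base_url}/tool/{name}",
        data=json.dumps(arguments).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Send with urllib.request.urlopen(req) and json.load() the response body.
```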
| File | Change |
|---|---|
| `mempalace/embeddings.py` | New -- GPU detection (NVIDIA/AMD/Apple), embedding factory, batch flush |
| `mempalace/miner.py` | Batched `collection.add()`, content hashing, `update()` command |
| `mempalace/convo_miner.py` | Batched `collection.add()` |
| `mempalace/config.py` | `device` property (auto/cuda/rocm/mps/cpu) |
| `mempalace/cli.py` | `--device` flag, `update` subcommand |
| `mempalace/mcp_server.py` | `mempalace_self_update` tool, shared embeddings |
| `mempalace/searcher.py` | Shared embedding function for vector compatibility |
| `mempalace/layers.py` | Shared embedding function |
| `mempalace/palace_graph.py` | Shared embedding function |
| `mempalace/http_server.py` | New -- FastAPI HTTP server for remote GPU access |
| `mempalace/mcp_proxy.py` | New -- MCP-to-HTTP proxy for remote palace access |
| `pyproject.toml` | `gpu` optional dependency group |
All other files are unmodified from upstream. Existing palaces remain compatible.
MIT -- same as upstream.