feat: add LanceDB backend abstraction and migration path #574
dekoza wants to merge 11 commits into MemPalace:develop from
Conversation
Pull request overview
This PR migrates MemPalace’s storage stack to a LanceDB-first architecture, introduces a pluggable embedding system (ONNX default + optional GPU/Ollama), and moves the knowledge graph into LanceDB tables while preserving a legacy Chroma/SQLite migration path.
Changes:
- Added a LanceDB/ChromaDB collection abstraction (db.py), routed consumers through palace.get_collection(), and added CLI migration and reindex tooling.
- Implemented pluggable embedders (embeddings.py) with model aliases, caching, and CLI discovery.
- Migrated knowledge graph storage to LanceDB tables by default, updated MCP/server and benchmarks, and refreshed tests/docs accordingly.
Reviewed changes
Copilot reviewed 31 out of 34 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| UPGRADE.md | New v4 upgrade guide covering backend migration, embedders, KG changes, benchmarks |
| PLAN.md | New/updated phased plan documenting backend/vectorizer/KG work and benchmarks |
| pyproject.toml | Switch core deps to LanceDB/ONNX stack; move Chroma + sentence-transformers behind extras |
| mempalace/db.py | New backend abstraction (LanceCollection/ChromaCollection), backend detection, open factory |
| mempalace/embeddings.py | New embedding backends (ONNX/ST/Ollama), aliasing, caching, listing utilities |
| mempalace/palace.py | Centralized collection open logic (now backend-agnostic) |
| mempalace/searcher.py | Updated search to use get_collection() instead of direct Chroma client |
| mempalace/miner.py | Updated status/mining paths to use backend-agnostic collection access |
| mempalace/mcp_server.py | Removed Chroma client caching and switched to palace collection abstraction; KG default updated |
| mempalace/palace_graph.py | Switched graph building to backend-agnostic collection retrieval |
| mempalace/layers.py | Updated layered context builders to use backend-agnostic collection access |
| mempalace/normalize.py | Minor formatting change in file-size error message |
| mempalace/split_mega_files.py | Minor formatting changes to size-limit skip messages |
| mempalace/__init__.py | Adjusted dependency logger silencing (incl. sentence-transformers) |
| benchmarks/results_v4_comparison.json | Added committed benchmark outputs for v4 comparison |
| benchmarks/longmemeval_v4.py | New benchmark runner using db abstraction + embedders |
| benchmarks/BENCHMARKS.md | Added v4 backend comparison section and updated tables/notes |
| benchmarks/BENCHMARKS_V4.md | Added dedicated v4 benchmark write-up and reproduction steps |
| tests/test_searcher.py | Updated mocking to patch get_collection() and changed empty-palace expectations |
| tests/test_palace_graph.py | Removed import-time Chroma mocking; direct palace_graph imports |
| tests/test_miner.py | Updated tests to use backend-agnostic get_collection() helper |
| tests/test_mcp_server.py | Updated collection helper and status expectations for empty palace |
| tests/test_layers.py | Updated tests to patch _get_palace_collection instead of Chroma client |
| tests/test_knowledge_graph.py | Updated WAL test setup to explicitly use SQLite backend path |
| tests/test_knowledge_graph_extra.py | Updated fixture to use palace directory (Lance-backed KG) |
| tests/test_embeddings.py | New tests for embedder backends, aliases, caching, and db metadata tracking |
| tests/test_convo_miner.py | Updated tests to use backend-agnostic get_collection() helper |
| tests/test_cli.py | Updated CLI tests to patch get_collection and removed Chroma module mocking |
| tests/conftest.py | Updated fixtures to use backend-agnostic collections and Lance-backed KG fixture |
| tests/benchmarks/test_layers_bench.py | Minor assertion formatting change |
Comments suppressed due to low confidence (2)
mempalace/searcher.py:34
- This except Exception path prints "No palace found" for all failures (including "empty palace", backend-specific errors, and ImportError when chromadb isn't installed for legacy palaces). Narrowing the exception handling (e.g., detect missing directory vs empty vs missing [chroma] extra) would avoid misleading guidance and make migration failures actionable.
```python
except Exception:
    print(f"\n No palace found at {palace_path}")
    print(" Run: mempalace init <dir> then mempalace mine <dir>")
    raise SearchError(f"No palace found at {palace_path}")
```
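A hedged sketch of the narrowed handling suggested above. SearchError is defined locally as a stand-in, and the exact error texts and the get_collection signature are assumptions drawn from this thread, not the PR's code:

```python
import os

class SearchError(Exception):
    """Local stand-in for mempalace's SearchError."""

def open_palace(palace_path, get_collection):
    """Distinguish missing-dir vs legacy-chroma vs empty-palace failures."""
    if not os.path.isdir(palace_path):
        raise SearchError(f"No palace found at {palace_path}. Run: mempalace init <dir>")
    try:
        col = get_collection(palace_path)
    except ImportError as e:
        # Legacy Chroma palace without the optional dependency installed
        raise SearchError(
            f"Palace at {palace_path} needs chromadb: "
            "pip install 'mempalace[chroma]' or run: mempalace migrate"
        ) from e
    if col.count() == 0:
        raise SearchError(f"Palace at {palace_path} has 0 drawers. Run: mempalace mine <dir>")
    return col
```

Each failure mode then carries its own remediation instead of a single "No palace found".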
mempalace/searcher.py:111
- Empty palaces (count() == 0) are currently reported as "No palace found" here as well. Since the API surface returns structured errors, it would be more accurate to return a distinct error/hint for "palace exists but contains 0 drawers" and to preserve actionable info for legacy Chroma palaces when chromadb isn't installed (suggest installing mempalace[chroma]).
```python
    col = get_collection(palace_path)
    if col.count() == 0:
        raise Exception("Empty palace")
except Exception as e:
    logger.error("No palace found at %s: %s", palace_path, e)
    return {
        "error": "No palace found",
        "hint": "Run: mempalace init <dir> && mempalace mine <dir>",
    }
```
```diff
 # Connect to palace
 try:
-    client = chromadb.PersistentClient(path=palace_path)
-    col = client.get_collection("mempalace_drawers")
+    col = get_collection(palace_path)
     if col.count() == 0:
         raise Exception("Empty palace")
 except Exception:
     print(f"\n No palace found at {palace_path}")
     print(" Run: mempalace init <dir> then mempalace mine <dir>")
     sys.exit(1)
```
cmd_compress treats any failure to open/read the collection (including an existing-but-empty palace, or an ImportError due to missing [chroma] for a legacy palace) as "No palace found" and exits. Consider distinguishing these cases so users get the right remediation (mine first vs install mempalace[chroma] vs fix path).
```python
preds = set()
if self._triples_table:
    for r in self._lance_get(self._triples_table, "id != ''", limit=100_000):
        preds.add(r.get("predicate", ""))
```
relationship_types for Lance is built by iterating up to 100k rows in Python, which can both miss predicates beyond the limit and be slow/memory-heavy. If LanceDB supports it, prefer SELECT DISTINCT predicate-style aggregation, or at least document/handle truncation explicitly.
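If a true DISTINCT isn't available at the table layer, one backend-agnostic fallback is to page through rows rather than issue a single capped read, so nothing past a fixed limit is silently missed. A sketch, where fetch_page(offset, limit) is a hypothetical stand-in for the backend's get() call:

```python
def distinct_predicates(fetch_page, page_size=10_000):
    """Collect distinct 'predicate' values by paging, with no hard row cap.

    fetch_page(offset, limit) -> list[dict] is a hypothetical stand-in for
    the backend's get(); it must return [] once rows are exhausted.
    """
    preds, offset = set(), 0
    while True:
        rows = fetch_page(offset, page_size)
        if not rows:
            return preds
        preds.update(r.get("predicate", "") for r in rows)
        offset += len(rows)
```

This trades one large allocation for many small ones; a server-side aggregation would still be preferable where the backend supports it.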
```python
def test_onnx_embedder_embed():
    e = OnnxEmbedder()
    result = e.embed(["hello world", "test sentence"])
    assert len(result) == 2
    assert len(result[0]) == 384
    assert all(isinstance(x, float) for x in result[0])
```
These ONNX embedder tests call OnnxEmbedder.embed() directly, which triggers a real model download/extract into the (test-isolated) HOME cache on first run. That makes the unit test suite network-dependent and potentially slow/flaky. Consider mocking _ensure_onnx_model()/InferenceSession (or marking these as an opt-in integration test) so the default test run stays hermetic.
```python
def get_collection(
    palace_path: str, collection_name: str = "mempalace_drawers", backend: str = None, embedder=None
):
    """Get or create the palace collection.

    This is the main entry point for all palace database access.
    Auto-detects the backend (LanceDB or ChromaDB) based on existing data.
    """
    return open_collection(
        palace_path=palace_path,
        collection_name=collection_name,
        backend=backend,
        embedder=embedder,
    )
```
get_collection() accepts a backend override, and docs/UPGRADE mention MEMPALACE_BACKEND / config backend to force a specific backend, but the default path (backend=None) currently only auto-detects via detect_backend() and never consults MempalaceConfig().backend. As a result, the env/config knobs appear to have no effect unless every caller explicitly passes backend. Consider defaulting backend from MempalaceConfig().backend (or reading the env var directly) before falling back to auto-detection.
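The suggested default chain can be sketched as a small resolver. MEMPALACE_BACKEND and the detect step mirror names used in this PR; the function shape itself is illustrative:

```python
import os

def resolve_backend(explicit, file_config, detect_backend):
    """Resolution order: explicit argument > env var > config file > auto-detect."""
    return (
        explicit
        or os.environ.get("MEMPALACE_BACKEND")
        or file_config.get("backend")
        or detect_backend()
    )
```

With this ordering the env/config knobs take effect for every caller, while existing palaces still fall through to auto-detection when nothing is forced.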
Force-pushed: 519b9cd to df316e2
web3guru888 left a comment:
This is the PR I've been waiting for. We've been pinning chromadb>=0.6.0,<0.7.0 in our integration for months specifically because of ChromaDB API surface volatility — having a backend abstraction that isolates the rest of the codebase from that is a significant quality-of-life improvement.
The LanceCollection / ChromaCollection + unified get_collection() factory is clean. A few things I noticed in db.py:
Where-clause translation (_chroma_where_to_sql): The current implementation handles $and, $or, and the common comparison operators. One gap: $in / $nin (used in some of MemPalace's own code via {"wing": {"$in": ["astrophysics", "economics"]}}). If any callers depend on those, they'll silently fall through without adding a condition. Worth either handling or adding an explicit NotImplementedError for unsupported operators.
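A sketch of how the missing operators could map onto SQL, with a defensive error for anything unknown. The function name and the (deliberately naive) quoting scheme are illustrative, not the PR's code:

```python
def translate_membership(field, op, values):
    """Translate Chroma-style $in/$nin into a SQL predicate string."""
    def quote(v):
        if isinstance(v, str):
            return "'" + v.replace("'", "''") + "'"  # naive single-quote escaping
        return str(v)
    rendered = ", ".join(quote(v) for v in values)
    if op == "$in":
        return f"{field} IN ({rendered})"
    if op == "$nin":
        return f"{field} NOT IN ({rendered})"
    raise ValueError(f"Unsupported where operator: {op}")
```

The ValueError branch is the key point: an unsupported operator should fail loudly rather than silently dropping the condition.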
get() with large offset and no limit: The fallback of limit(100_000) when only an offset is provided could return 100k rows into memory. The pattern appears intentional (used in get_changes_since in PR #575), but on large palaces this materializes everything. A generator/cursor approach would be better long-term, but I understand that's a bigger change.
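The generator/cursor idea could look roughly like this, with fetch_page again a hypothetical stand-in for the backend's offset/limit get():

```python
def iter_records(fetch_page, page_size=1_000):
    """Stream records page by page instead of materializing limit(100_000)."""
    offset = 0
    while True:
        rows = fetch_page(offset, page_size)
        if not rows:
            return
        yield from rows
        offset += len(rows)
```

Callers like get_changes_since could then consume rows lazily with bounded memory.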
upsert() fallback path: The merge_insert failure → delete+add fallback is pragmatic, but it means on a partial failure (merge fails midway, then delete+add also fails on some records) you get a partially-committed batch with no error surfaced to the caller. The outer try/except Exception on _to_records calls would hide this. Consider at minimum logging logger.warning("upsert partial failure...") in the fallback.
_check_dimension(): The RuntimeError message is excellent — actionable and includes the exact model names. Good pattern.
Embeddings: ONNX default (all-MiniLM-L6-v2, ~60MB vs 3GB torch) is the right default for most users. The reindex command is essential — you can't switch models without it, and it's not obvious why search quality drops when someone changes their config. The error message from _check_dimension should cite mempalace reindex as the fix, which it does. Nice.
Benchmark results committed: The results_v4_comparison.json in the diff is a nice touch for reproducibility, though committed benchmark artifacts can get stale. Consider a note in BENCHMARKS.md about when they were generated.
222 passing with 2 intentional skips is solid coverage for a change this large. I'd want to see the ChromaDB → LanceDB migration path tested with a real palace that has KG triples before merging, but that may be in the existing test suite.
Overall this is solid architectural work. The abstraction boundary is clean and the migration story (mempalace migrate) is well thought out.
Thanks, addressed your concerns in recent pushes
| """Storage backend: 'lance' (default) or 'chroma' (legacy).""" | ||
| env_val = os.environ.get("MEMPALACE_BACKEND") | ||
| if env_val: | ||
| return env_val | ||
| return self._file_config.get("backend", DEFAULT_BACKEND) |
MempalaceConfig.backend returns DEFAULT_BACKEND ('lance') when the user hasn't explicitly configured a backend. This makes it impossible for callers to distinguish “unset” vs “forced lance”, and breaks the advertised auto-detection path for existing ChromaDB palaces. Consider returning None when the config/env key is absent (and let db.detect_backend() decide), while still honoring an explicit env/config value.
| """Storage backend: 'lance' (default) or 'chroma' (legacy).""" | |
| env_val = os.environ.get("MEMPALACE_BACKEND") | |
| if env_val: | |
| return env_val | |
| return self._file_config.get("backend", DEFAULT_BACKEND) | |
| """Storage backend override, or None when not explicitly configured.""" | |
| env_val = os.environ.get("MEMPALACE_BACKEND") | |
| if env_val: | |
| return env_val | |
| return self._file_config.get("backend") |
```diff
 try:
-    client = chromadb.PersistentClient(path=palace_path)
-    col = client.get_collection("mempalace_drawers")
+    col = get_collection(palace_path)
+except ImportError:
+    print(f"\n Palace at {palace_path} uses ChromaDB but 'chromadb' is not installed.")
+    print(" Install with: pip install 'mempalace[chroma]'")
+    print(" Or migrate: mempalace migrate")
+    raise SearchError(f"Missing chromadb dependency for {palace_path}")
 except Exception:
     print(f"\n No palace found at {palace_path}")
     print(" Run: mempalace init <dir> then mempalace mine <dir>")
     raise SearchError(f"No palace found at {palace_path}")
```
The broad except Exception here will also catch a LanceDB embedder dimension mismatch RuntimeError and report it as “No palace found”. That makes recovery steps opaque. Consider catching the dimension-mismatch RuntimeError separately and surfacing its actionable message (e.g., instruct to run mempalace reindex).
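A sketch of letting the dimension-mismatch error surface with its own hint. The "dimension" substring check is an assumption about the guard's message text; SearchError is a local stand-in:

```python
class SearchError(Exception):
    """Local stand-in for mempalace's SearchError."""

def open_with_hints(get_collection, palace_path):
    """Surface a dimension-mismatch RuntimeError with a reindex hint
    instead of folding it into 'No palace found'."""
    try:
        return get_collection(palace_path)
    except RuntimeError as e:
        if "dimension" in str(e).lower():
            raise SearchError(f"{e}\nHint: run 'mempalace reindex' to re-embed.") from e
        raise  # other RuntimeErrors propagate unchanged
```

The actionable message from _check_dimension then reaches the user intact.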
```python
# Read all existing records
from .db import open_collection

col = open_collection(palace_path, embedder=embedder)
total = col.count()
```
cmd_reindex opens the collection using the target embedder. If the existing LanceDB table has a different vector dimension, open_collection() will raise the dimension-mismatch guard before any data can be read—so reindex can’t be used to fix a mismatch. The reindex flow likely needs to (1) open/read with the existing table schema (or bypass dimension check) and (2) recreate/rewrite the table with the new embedder.
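One possible shape for that flow, with all three callables as hypothetical seams rather than the PR's API:

```python
def reindex(read_raw, new_embedder, rewrite_table):
    """Reindex sketch: read documents via a path that skips the dimension
    guard, re-embed with the target model, then recreate the table."""
    ids, docs, metas = read_raw()              # must bypass the dimension check
    vectors = new_embedder.embed(docs)         # new model, possibly new dimension
    rewrite_table(ids, docs, metas, vectors)   # drop + recreate, not upsert
    return len(ids)
```

The essential property is that the read step never opens the table through the embedder-validated path, so reindex can always recover from a mismatch.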
```python
if has_embeddings:
    print(" Transferring with original embeddings (no re-embedding needed)...")
else:
    print(" Re-embedding all drawers (ChromaDB embeddings not available)...")

lance_col = open_collection(palace_path, backend="lance")
```
Migration writes transferred ChromaDB embeddings into a new LanceDB table but creates the LanceCollection with whatever embedder is active in config. If the configured embedder has a different dimension than the transferred vectors, the resulting palace will immediately fail to reopen/query (and may even fail queries within the same process). Consider forcing the migration embedder to match the transferred embeddings (MiniLM 384d) or, if the user’s configured embedder differs, ignore transferred embeddings and re-embed so the Lance schema and active embedder stay consistent. Also ensure embedding_model metadata reflects the actual model that produced the stored vectors.
```python
def _to_records(self, documents, ids, metadatas, embeddings=None):
    """Convert to LanceDB record format, computing embeddings if needed."""
    if embeddings is None:
        embeddings = self._embedder.embed(documents)

    records = []
    for doc, id_, meta, vec in zip(documents, ids, metadatas, embeddings):
        # Inject the embedding model name so we can detect mismatches later
        meta_with_model = dict(meta)
        meta_with_model.setdefault("embedding_model", self._embedder.model_name)
        record = {
            "id": id_,
            "document": doc,
            "vector": vec,
            "wing": str(meta.get("wing", "")),
            "room": str(meta.get("room", "")),
            "source_file": str(meta.get("source_file", "")),
            "node_id": str(meta.get("node_id", "")),
            "seq": int(meta.get("seq", 0)),
            "metadata_json": json.dumps(meta_with_model, default=str),
        }
```
When embeddings are provided (e.g., during migration), _to_records() stores vectors without validating that their length matches self._embedder.dimension. This can create a table whose vector column dimension conflicts with the embedder attached to the collection instance, leading to hard-to-diagnose failures later. Add a dimension consistency check (and a clearer error) when embeddings are supplied.
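A minimal guard for that case might look like the following; the function name and message wording are illustrative:

```python
def validate_supplied_embeddings(embeddings, expected_dim, model_name):
    """Dimension guard sketch for pre-computed vectors (e.g., migration transfers)."""
    for i, vec in enumerate(embeddings):
        if len(vec) != expected_dim:
            raise ValueError(
                f"Record {i}: supplied vector has {len(vec)} dims but embedder "
                f"'{model_name}' expects {expected_dim}. Re-embed the source "
                "data or configure a matching embedder before writing."
            )
```

Failing at write time with the record index keeps the error close to its cause, instead of a schema conflict surfacing on a later query.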
```python
def __init__(self, db_path: str = None, palace_path: str = None):
    if palace_path is not None:
        self._backend = "lance"
        self._palace_path = palace_path
        self._db = None
        self._entities_table = None
        self._triples_table = None
        self._init_lance()
    elif db_path is not None and not db_path.endswith(".sqlite3"):
        # Treat non-sqlite path as a palace directory
        self._backend = "lance"
        self._palace_path = db_path
        self._db = None
        self._entities_table = None
        self._triples_table = None
        self._init_lance()
```
KnowledgeGraph.__init__ treats any db_path not ending in .sqlite3 as a palace directory (Lance backend). This is a breaking change for existing callers that used other SQLite extensions (e.g., .db), and can also mis-handle an actual SQLite file path by creating a directory at that location. Consider detecting “directory vs file” (e.g., os.path.isdir) and/or accepting common SQLite suffixes instead of relying solely on .sqlite3.
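The directory-vs-file detection could be sketched like this; the return values and suffix set are illustrative assumptions, not the PR's code:

```python
import os

SQLITE_SUFFIXES = {".sqlite3", ".sqlite", ".db"}

def resolve_kg_backend(db_path):
    """Detect backend from the path: an existing directory -> Lance palace;
    an existing file or a SQLite-like suffix -> SQLite; otherwise assume a
    new palace directory."""
    if os.path.isdir(db_path):
        return "lance"
    if os.path.isfile(db_path) or os.path.splitext(db_path)[1] in SQLITE_SUFFIXES:
        return "sqlite"
    return "lance"
```

Checking for an existing file first also avoids creating a directory on top of a real SQLite database with an unusual extension.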
```markdown
---

## 4. Unified Knowledge Graph (Phase 5)

The knowledge graph (entities + triples) has moved from a separate SQLite
```
Section numbering jumps from “2. Pluggable Embedders” to “4. Unified Knowledge Graph”, which is confusing in an upgrade guide. Consider renumbering the headings (or adding a section 3) so references remain consistent.
```python
def test_onnx_embedder_embed():
    e = OnnxEmbedder()
    result = e.embed(["hello world", "test sentence"])
    assert len(result) == 2
    assert len(result[0]) == 384
    assert all(isinstance(x, float) for x in result[0])


def test_onnx_embedder_dimension():
    e = OnnxEmbedder()
    assert e.dimension == 384


def test_onnx_embedder_output_normalized():
    """ONNX embedder output vectors must be unit length."""
    e = OnnxEmbedder()
    result = e.embed(["normalization check"])
    norm = np.linalg.norm(result[0])
    assert abs(norm - 1.0) < 1e-5


def test_onnx_embedder_batch():
    """ONNX embedder handles batches larger than internal batch size."""
    e = OnnxEmbedder()
    texts = [f"document number {i}" for i in range(50)]
    result = e.embed(texts)
    assert len(result) == 50
    assert all(len(v) == 384 for v in result)


def test_onnx_embedder_deterministic():
    """Same input produces identical output across calls."""
    e = OnnxEmbedder()
    r1 = e.embed(["reproducibility test"])
    r2 = e.embed(["reproducibility test"])
    assert r1 == r2
```
These ONNX embedder tests call OnnxEmbedder.embed() directly, which will download ~tens of MB of model artifacts on first run and perform real inference. This makes the unit test suite slow and potentially flaky in environments without network access or with restricted home directories. Consider marking these as integration tests (skip unless an env var is set) and/or mocking the download/session/tokenizer so unit tests remain offline and deterministic.
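One way to keep the default run hermetic, sketched with unittest.mock. FakeOnnxEmbedder and _ensure_model are illustrative stand-ins for the PR's embedder internals, not its real names:

```python
from unittest import mock

class FakeOnnxEmbedder:
    """Stand-in embedder whose embed() would normally fetch model files."""
    def _ensure_model(self):
        raise RuntimeError("would download model artifacts here")
    def embed(self, texts):
        self._ensure_model()
        return [[0.0] * 384 for _ in texts]

def test_embed_offline():
    e = FakeOnnxEmbedder()
    # Patch the download step so the test never touches the network
    with mock.patch.object(FakeOnnxEmbedder, "_ensure_model", return_value=None):
        vecs = e.embed(["hello world", "test sentence"])
    assert len(vecs) == 2 and len(vecs[0]) == 384
```

The same tests can then be duplicated as opt-in integration tests (gated on an env var) that exercise the real download path.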
```python
if name == "ollama":
    model = options.get("model", "nomic-embed-text")
    base_url = options.get("base_url", "http://localhost:11434")
    timeout = float(options.get("timeout", 60.0))
    cache_key = f"ollama:{model}@{base_url}"

    if cache_key not in _embedder_cache:
        _embedder_cache[cache_key] = OllamaEmbedder(
            model=model, base_url=base_url, timeout=timeout
        )
    return _embedder_cache[cache_key]
```
get_embedder() caches Ollama embedders using a key that ignores timeout. If a caller changes only the timeout in embedder_options, they’ll still get the previously cached instance with the old timeout value. Consider including timeout in the cache key (or otherwise keying on the full options dict) to avoid surprising behavior.
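Keying on the full option set avoids the stale-instance problem entirely. A hedged sketch (an illustrative helper, not the PR's get_embedder()):

```python
_embedder_cache = {}

def get_cached(factory, **options):
    """Cache instances on the complete option set, so changing any option
    (timeout included) produces a fresh instance."""
    key = (factory.__name__, tuple(sorted(options.items())))
    if key not in _embedder_cache:
        _embedder_cache[key] = factory(**options)
    return _embedder_cache[key]
```

This assumes option values are hashable, which holds for the str/float options shown above.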
web3guru888 left a comment:
@dekoza — the four key fixes in commit 7a15a461 address the main issues I raised:
$in/$nin in _chroma_where_to_sql — the silent data loss path is now closed. Mapping $in/$nin to IN (...)/NOT IN (...) with proper quoting is correct, and the ValueError on unknown operators is the right defensive response.
KG count_rows(filter=) vs len(get(limit=100_000)) — this is the right fix. count_rows is O(1) at the Lance layer vs the full materialization that get(limit=100k) was doing. This also removes the silent truncation bug at exactly 100k triples.
ImportError catch in _open_chroma with actionable message — the "install mempalace[chromadb]" message makes the missing-dependency case debuggable without reading source.
merge_insert fallback log upgraded to warning — correct, it signals a performance regression, not a debug detail.
The searcher empty-palace / missing-palace / missing-chromadb distinction is a nice UX improvement.
One remaining observation (minor, non-blocking): the upsert fallback in the case where merge_insert isn't available still swallows partial failures by iterating per-record. Worth a follow-up to surface the partial failure count in the warning log. Not a blocker for merge.
Approving — the backend abstraction is solid and this is a significant capability addition.
[MemPalace-AGI integration — 215 tests, 710 KG entities]
web3guru888 left a comment:
This is a significant architectural change and the scope is substantial (21 files, full backend abstraction). The LanceCollection/ChromaCollection unified API approach is the right design — a storage seam that lets the two backends be swapped without changing call sites.
A few notes from the design review:
LanceDB V4 Windows blocker: This is the same issue flagged in #529 — lancedb still imports fcntl on Windows in some builds. If this PR is meant to work cross-platform, that needs to be resolved before merge. The [extras] approach (LanceDB as optional) helps, but the import at the top of db.py may still fail on Windows even without the extra.
Migration path completeness: mempalace migrate for ChromaDB → LanceDB is the right surface. The key question is whether the migration handles embedding re-indexing — LanceDB uses its own vector format and ChromaDB's serialized embeddings are not directly transferable. If the migration re-embeds from stored text (slower but correct), that should be documented explicitly.
ChromaDB as [chroma] optional extra: Making chromadb optional is the right long-term direction, but it means every existing installation breaks unless they either install the extra or get a clean upgrade path. The pyproject.toml changeset should include ChromaDB in a compatibility extra (e.g. pip install mempalace[chroma]) and the migration docs should be prominent.
Breaking change scope: This touches palace.py, searcher.py, miner.py, mcp_server.py — every major path. Given the number of community PRs currently open against these files (#542, #544, #562, etc.), the merge conflict surface is enormous. Worth coordinating with the maintainer before the review cycle closes on those PRs.
Benchmark data: benchmarks/BENCHMARKS_V4.md + results_v4_comparison.json is good evidence. Would be useful to see the search latency comparison at 10K+ drawers where LanceDB's columnar scan should outperform ChromaDB's HNSW.
This is the right architectural direction — LanceDB's embedded model without a server process is a better fit for local-first tooling. The coordination and migration story are the main things to nail before this is merge-ready.
[MemPalace-AGI integration — 215 tests, 710 KG entities]
@dekoza — thanks for the fast fix commits. The specific issues I raised in my initial review are addressed in

The separate COMMENTED review at 22:47Z raised some broader design-level questions that are separate from the code correctness — flagging them here since they bear on merge readiness:

Windows /

Migration re-embedding: Does

Merge conflict surface: There are several open PRs (#542, #544, #562) that touch

None of these are blockers if you can address the Windows import and the migration note — the architecture is solid.

[MemPalace-AGI integration — 540+ discoveries, 5,251 KG triples]
@web3guru888 — thanks for the thorough follow-up. Addressing each point:

1. Windows / All The

2. Migration re-embedding — doc fix pushed. You are right that this deserved explicit documentation. Added a note to UPGRADE.md (c329713) clarifying that

3. Merge conflict surface — mostly resolved. Of the three PRs flagged: #544 is merged, #562 is closed, #542 is still open but its overlap with #574 is limited (security/test hardening, not storage layer). Should be manageable on rebase.
@dekoza — all three concerns addressed cleanly. The implementation details are better than I assumed:

Windows /

Migration re-embedding — the

Merge conflict surface — good to hear #544 is merged and #562 is closed. #542 touching security/test hardening shouldn't conflict with the storage layer changes.

The architecture is solid and the open concerns are addressed. LGTM.

[MemPalace-AGI integration — 540+ discoveries, 5,251 KG triples]
Force-pushed: 7a15a46 to 161b511
Force-pushed: 161b511 to f82155f
Rebased onto

Key changes in this update:
Database abstraction (Phase 1):
- New db.py with LanceCollection/ChromaCollection sharing identical API
- New embeddings.py with SentenceTransformerEmbedder (default)
- palace.py now delegates to db.open_collection() with auto-detection
- All consumers (searcher, layers, mcp_server, miner, palace_graph) updated to use palace.get_collection() instead of direct chromadb
- Added 'mempalace migrate' for ChromaDB -> LanceDB migration
- LanceDB is new default; ChromaDB moved to optional [chroma] extra
- Dependencies: lancedb>=0.14, sentence-transformers>=2.2.0

Pluggable vectorizers (Phase 2):
- OllamaEmbedder for GPU server usage via HTTP
- Model aliases: bge-small, bge-base, e5-base, nomic, ollama
- 'mempalace reindex' to re-embed with different model
- 'mempalace embedders' to list available models
- embedding_model tracked in every record's metadata
- Config via ~/.mempalace/config.json embedder/embedder_options

Tests: 552 passed, 0 failed (18 new embedding tests)
knowledge_graph.py rewritten with dual backend:
- LanceDB (default): kg_entities + kg_triples tables in palace dir
- SQLite (legacy): preserved for existing .sqlite3 paths
- All operations (add_entity, add_triple, invalidate, query_entity, query_relationship, timeline, stats, seed_from_entity_facts) work on both backends with identical API

MCP server updated:
- KnowledgeGraph now uses palace_path instead of separate sqlite file
- One data directory, one format, one sync unit

UPGRADE.md updated with Phase 5 how-to and migration notes. PLAN.md Phase 5 marked done.

Tests: 588 passed (28 KG tests all pass on LanceDB backend)
New benchmark runner (benchmarks/longmemeval_v4.py):
- Runs LongMemEval against multiple backends in one invocation
- Modes: chroma-default, lance-default, lance-bge-small, lance-bge-base, lance-nomic, and custom embedder
- Produces side-by-side comparison table with R@5, R@10, NDCG, ms/query
- Per-type breakdown across question categories
- Presets: 'all' (3 modes), 'quick' (2), 'embedders' (4)

Results on full 500 questions:
- ChromaDB + MiniLM (v3.x): R@5=0.966 R@10=0.982 NDCG@5=0.888 1165ms/q
- LanceDB + MiniLM (v4.0): R@5=0.966 R@10=0.982 NDCG@5=0.888 638ms/q
- LanceDB + BGE-small: R@5=0.962 R@10=0.978 NDCG@5=0.895 2624ms/q

Key findings:
- Zero retrieval regression: LanceDB matches ChromaDB exactly
- 1.8x faster queries (638ms vs 1165ms) with cosine distance
- BGE-small trades tiny R@5 drop for better NDCG (ranking quality)

Docs:
- benchmarks/BENCHMARKS_V4.md — full results + reproduction steps
- UPGRADE.md updated with benchmark section + how-to
- PLAN.md Phase 6 marked done — all 6 phases complete

Tests: 588 passed
Added v4.0 Backend Comparison section at top of BENCHMARKS.md:
- LanceDB+MiniLM: R@5=0.966, identical to ChromaDB, 1.8x faster (638ms vs 1165ms)
- LanceDB+BGE-small: R@5=0.962, higher NDCG (0.895 vs 0.888)
- Per-type breakdown showing BGE-small tradeoffs
- Reproduction commands for longmemeval_v4.py

Updated existing sections:
- Score progression table: added LanceDB and BGE-small rows
- Comparison table: added v4 LanceDB entry alongside v3 ChromaDB
- Tradeoffs table: added v4 column (sync, pluggable embedders, query speed)
- Results files table: added results_v4_comparison.json

Tests: 588 passed
- Add OnnxEmbedder as default (onnxruntime+tokenizers, no torch)
- Move sentence-transformers to [gpu] optional extra
- Strip all sync code from db.py, cli.py, config.py, UPGRADE.md
- Apply ruff lint fixes from f21ba0a to migration-branch files
- Remove unused chromadb import from conftest.py
- Regenerate uv.lock for migration-only dependencies

config.py: mempalace/config.py
…sync from benchmark
- Add _check_dimension() to LanceCollection.__init__: raises RuntimeError if existing table vector dimension doesn't match the active embedder
- Replace invalid port 99999 with RFC discard port 9 in Ollama test
- Remove sync_meta import and sync_identity from benchmark script
These fields are used by the sync layer for indexed queries instead of full table scans. Adding them to FILTER_COLUMNS and SCHEMA_COLUMNS now avoids a schema migration later. Records without sync metadata get defaults (node_id='', seq=0).
- UPGRADE.md: fix default dep table (onnxruntime+tokenizers, not sentence-transformers)
- db.py: catch ImportError in _open_chroma with actionable message
- db.py: upgrade merge_insert fallback log to warning
- db.py: add $in/$nin support to _chroma_where_to_sql
- searcher.py: distinguish empty palace vs missing palace vs missing chromadb
- cli.py: same distinction in cmd_compress
- knowledge_graph.py: use count_rows(filter=) instead of len(get(limit=100k))
- knowledge_graph.py: remove limit=100 from _lance_timeline queries
- palace.py: get_collection reads MempalaceConfig().backend before auto-detect
…irectory creation
…tream backend seam
- Move LanceCollection and LanceBackend from db.py to backends/lance.py
- LanceCollection now extends BaseCollection (upstream's ABC)
- Add detect_backend() to backends/__init__.py
- Update palace.py to use backends package for both Lance and Chroma
- LanceDB is the default backend for new palaces
- Remove db.py (all logic migrated to backends/lance.py)
- Update cli.py and test imports accordingly
Required by upstream's tool_update_drawer (PR MemPalace#667). LanceCollection implements update as get+upsert (re-embeds on content change). ChromaCollection delegates to chromadb's native update.
Force-pushed: f82155f to d3085f9
Rebased onto current

Conflict resolutions:

New in this update:

All 679 tests pass (2 skipped).
Formalizes the BaseCollection/BaseBackend contract introduced as a seam in #413 into an interchangeability spec that third-party backends can build to. Driven by six in-flight backend PRs (#574, #643, #665, #697, #700, #381) each implementing the interface differently.

Key decisions captured: entry-point distribution, typed QueryResult/GetResult replacing Chroma dict shape, daemon-first multi-palace model via PalaceRef, required where-clause subset (incl. $contains), mandatory embedder injection with model-identity validation, capability tokens, shared pytest conformance suite, and a backend-neutral migrate/verify CLI.
Incorporates review feedback from skuznetsov (Postgres, #665) and dekoza (Lance, #574) on issue #737:

- §1.5: split 'accepts embeddings=' (signature compliance) from 'persists embeddings as-is' (correctness). Adds a supports_embeddings_passthrough capability; the former is universal, the latter is required to label a migration lossless.
- §1.5: the model identity check becomes a three-state machine (known_match / known_mismatch / unknown) so legacy palaces without recorded identity don't hard-fail on upgrade.
- §1.4: makes explicit that supports_contains_fast is the ONLY performance floor the spec promises; without it callers MUST assume O(n). $contains is a correctness requirement, not a performance one.
- §3.3: clarifies that auto-detect is an upgrade-compat path only, never the selection mechanism for new palaces.
- §8.2: the migrate CLI refuses to run against a target lacking supports_embeddings_passthrough unless --accept-re-embed is passed; the migration record now captures lossless status and model identities.
…nd registry (RFC 001 §10)

Advances RFC 001 §10 cleanup so backend-author PRs (#574 LanceDB, #665 Postgres, #700 Qdrant, #697 hosted, #643 PalaceStore, #381 Qdrant) have a stable target to align against.

Scope (this PR):

- Typed QueryResult / GetResult dataclasses replace Chroma's dict shape at the BaseCollection boundary (§1.3). A transitional _DictCompatMixin keeps existing callers working while the attribute-access migration proceeds.
- BaseCollection is now kwargs-only across add/upsert/query/get/delete/update with ABC defaults for estimated_count/close/health and a non-atomic default update() (§1.1–1.2).
- PalaceRef replaces raw path strings at the backend boundary (§2.2).
- BaseBackend ABC with get_collection/close_palace/close/health/detect (§2.3).
- mempalace.backends entry-point group + in-tree registry with a resolve_backend_for_palace priority order matching §3.2–3.3.
- ChromaCollection normalizes Chroma returns into typed results; unknown where-clause operators raise UnsupportedFilterError (no silent drop, §1.4).
- ChromaBackend absorbs the inode/mtime client-cache freshness check previously duplicated in mcp_server._get_client() (§10 + PR #757).
- searcher.py migrated to typed-attribute access as the reference call site; remaining callers land in a follow-up.
- pyproject: chroma registered via [project.entry-points."mempalace.backends"].

Out of scope (explicit follow-ups):

- Full caller migration off the dict-compat shim across palace.py, mcp_server.py, miner.py, convo_miner.py, dedup.py, repair.py, exporter.py, palace_graph.py, cli.py, closet_llm.py.
- Embedder injection + three-state EmbedderIdentityMismatchError check (§1.5).
- maintenance_state() / run_maintenance() benchmark hooks (§7.3).
- AbstractBackendContractSuite full coverage (§7.1–7.2).
- mempalace migrate / mempalace verify CLI rewrites through BaseCollection (§8).
Tests: 970 passed (up from 967 on develop); new coverage for typed results, empty-result outer-shape preservation, $regex rejection, registry lookup, the priority resolver, and PalaceRef-kwarg ChromaBackend.get_collection.

Refs: #743 (RFC 001), #989 (RFC 002 tracking issue).
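The typed-result-plus-compat-shim pattern described above can be sketched as dataclasses whose field names mirror Chroma's result keys, with a `__getitem__` mixin bridging old dict-style call sites. The exact field set and shapes are assumptions, not the RFC's actual definitions:

```python
from dataclasses import dataclass, field

class _DictCompatMixin:
    """Transitional shim: lets legacy callers keep using result["ids"]
    while new code migrates to result.ids attribute access."""
    def __getitem__(self, key):
        return getattr(self, key)

@dataclass
class GetResult(_DictCompatMixin):
    ids: list[str] = field(default_factory=list)
    documents: list[str] = field(default_factory=list)
    metadatas: list[dict] = field(default_factory=list)

@dataclass
class QueryResult(_DictCompatMixin):
    # Chroma-style query results nest one inner list per query embedding,
    # and that outer shape is preserved even for empty results
    ids: list[list[str]] = field(default_factory=list)
    documents: list[list[str]] = field(default_factory=list)
    distances: list[list[float]] = field(default_factory=list)
```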
Scanned all 233 open upstream PRs today against our open PRs and fork-ahead / planned-work items. Findings merged into README:

- P2 (decay) and P3 Tier-0 (LLM rerank): both covered by MemPalace#1032 (@zackchiutw, MERGEABLE, 2026-04-19 — Weibull decay + 4-stage rerank pipeline). An older, simpler version is at MemPalace#337. Dropped as fork work; watching MemPalace#1032.
- P7 (alternative storage): formally out of scope. RFC 001 MemPalace#743 (@igorls) defines the plugin contract; four backend PRs are already in flight (MemPalace#700, MemPalace#381 Qdrant; MemPalace#574, MemPalace#575 LanceDB). The fork consumes, does not rebuild.
- P0 (multi-label tags): still a fork/upstream candidate. MemPalace#1033 (@zackchiutw) ships adjacent privacy-tag + progressive disclosure work but not the full multi-label scheme.
- The merged MemPalace#1023 section acknowledges the complementary MemPalace#976 (felipetruman), which adds a broader mine_global_lock() + HNSW num_threads pin.

Gives future-us a map so we don't re-file MemPalace#1036-style duplicates.
Summary
Replace ChromaDB with LanceDB as the default storage backend, add pluggable embedding system, and unify the knowledge graph into LanceDB tables.
Changes
Database abstraction (Phase 1)
- `db.py` with `LanceCollection`/`ChromaCollection` sharing an identical API
- `palace.py` delegates to `db.open_collection()` with auto-detection
- `mempalace migrate` for ChromaDB → LanceDB migration
- ChromaDB moved behind the `[chroma]` extra

Pluggable vectorizers (Phase 2)
- Default: `all-MiniLM-L6-v2` via ONNX Runtime (no torch, ~60MB vs ~3GB)
- `OllamaEmbedder` for GPU server usage via HTTP (zero extra deps)
- `SentenceTransformerEmbedder` available via the `[gpu]` extra for cuda/mps or other HF models
- Model aliases: `bge-small`, `bge-base`, `e5-base`, `nomic`, `ollama`
- `mempalace reindex` to re-embed with a different model
- `mempalace embedders` to list available models

Knowledge graph unification (Phase 5)
Benchmarks (Phase 6)
Hardening
- Missing `chromadb` raises `RuntimeError` with an actionable message

Dependencies
Test results
222 passed, 2 skipped (sentence-transformers integration tests without `[gpu]`)
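The pluggable vectorizer seam described under Phase 2 can be sketched as a minimal protocol plus a caching wrapper. The names and signatures below are assumptions for illustration, not MemPalace's actual `embeddings.py` API:

```python
from typing import Protocol

class Embedder(Protocol):
    """Assumed shape of the embedder seam: texts in, vectors out."""
    model_name: str
    def embed(self, texts: list[str]) -> list[list[float]]: ...

class CachingEmbedder:
    """Wraps any embedder with a per-text memo so repeated passes
    (e.g. during reindex) don't re-encode identical documents."""
    def __init__(self, inner):
        self._inner = inner
        self._cache: dict[str, list[float]] = {}
        self.model_name = inner.model_name

    def embed(self, texts):
        # Only forward texts we haven't seen before
        missing = [t for t in texts if t not in self._cache]
        if missing:
            for t, vec in zip(missing, self._inner.embed(missing)):
                self._cache[t] = vec
        return [self._cache[t] for t in texts]
```

Because any ONNX, sentence-transformers, or Ollama backend satisfying the protocol can sit behind the wrapper, the cache stays backend-agnostic.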