
feat: add LanceDB backend abstraction and migration path#574

Open
dekoza wants to merge 11 commits into MemPalace:develop from dekoza:feat/lancedb-migration

Conversation


@dekoza dekoza commented Apr 10, 2026

Summary

Replace ChromaDB with LanceDB as the default storage backend, add pluggable embedding system, and unify the knowledge graph into LanceDB tables.

Changes

Database abstraction (Phase 1)

  • New db.py with LanceCollection/ChromaCollection sharing identical API
  • palace.py delegates to db.open_collection() with auto-detection
  • mempalace migrate for ChromaDB → LanceDB migration
  • LanceDB is new default; ChromaDB moved to optional [chroma] extra
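
The auto-detection step can be sketched roughly as follows. `detect_backend` here is a hypothetical simplification, and the on-disk markers (a `chroma.sqlite3` file for Chroma, `*.lance` table directories for Lance) are assumptions, not the exact logic in `db.py`:

```python
import os

def detect_backend(palace_path: str, default: str = "lance") -> str:
    """Guess which backend an existing palace uses (illustrative only)."""
    # ChromaDB persists a chroma.sqlite3 file at the palace root (assumption).
    if os.path.isfile(os.path.join(palace_path, "chroma.sqlite3")):
        return "chroma"
    # LanceDB stores each table as a *.lance directory (assumption).
    if os.path.isdir(palace_path) and any(
        name.endswith(".lance") for name in os.listdir(palace_path)
    ):
        return "lance"
    # Nothing on disk yet: new palaces get the default backend.
    return default
```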

Pluggable vectorizers (Phase 2)

  • ONNX default embedder: all-MiniLM-L6-v2 via ONNX Runtime (no torch, ~60MB vs ~3GB)
  • OllamaEmbedder for GPU server usage via HTTP (zero extra deps)
  • SentenceTransformerEmbedder available via [gpu] extra for cuda/mps or other HF models
  • Model aliases: bge-small, bge-base, e5-base, nomic, ollama
  • mempalace reindex to re-embed with a different model
  • mempalace embedders to list available models
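
The alias layer presumably maps short names to full model identifiers, roughly like this sketch. The right-hand IDs are plausible Hugging Face model names, not confirmed from the diff; the authoritative table lives in mempalace/embeddings.py:

```python
# Hypothetical alias table; the real one is in mempalace/embeddings.py.
MODEL_ALIASES = {
    "bge-small": "BAAI/bge-small-en-v1.5",
    "bge-base": "BAAI/bge-base-en-v1.5",
    "e5-base": "intfloat/e5-base-v2",
    "nomic": "nomic-ai/nomic-embed-text-v1.5",
}

def resolve_model(name: str) -> str:
    # Unknown names pass through unchanged so full model IDs keep working.
    return MODEL_ALIASES.get(name, name)
```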

Knowledge graph unification (Phase 5)

  • KG storage migrated from standalone SQLite to LanceDB tables
  • Single storage engine, single data format

Benchmarks (Phase 6)

  • LongMemEval v3 vs v4 comparison benchmarks
  • Results documented in BENCHMARKS.md

Hardening

  • Dimension mismatch guard: reopening a LanceDB collection with a different embedder dimension raises RuntimeError with actionable message
  • Invalid test port fixed (99999 → 9)
  • Full ruff lint pass
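
A minimal sketch of the dimension guard described above; the function name and message wording here are illustrative, not the exact code:

```python
def check_dimension(table_dim: int, embedder_dim: int,
                    table_model: str, embedder_model: str) -> None:
    """Raise an actionable error when stored vectors and the configured
    embedder disagree on dimensionality (sketch of the real guard)."""
    if table_dim != embedder_dim:
        raise RuntimeError(
            f"Embedding dimension mismatch: collection was built with "
            f"{table_model} ({table_dim}d) but the configured embedder is "
            f"{embedder_model} ({embedder_dim}d). "
            f"Run 'mempalace reindex' to re-embed with the new model."
        )
```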

Dependencies

# Core (default install)
lancedb>=0.14, pyyaml, onnxruntime>=1.14, tokenizers>=0.13

# Optional
[gpu]    → sentence-transformers (torch + full HF model zoo)
[chroma] → chromadb (legacy backend for migration)

Test results

222 passed, 2 skipped (sentence-transformers integration tests without [gpu])

Contributor

Copilot AI left a comment


Pull request overview

This PR migrates MemPalace’s storage stack to a LanceDB-first architecture, introduces a pluggable embedding system (ONNX default + optional GPU/Ollama), and moves the knowledge graph into LanceDB tables while preserving a legacy Chroma/SQLite migration path.

Changes:

  • Added a LanceDB/ChromaDB collection abstraction (db.py) and routed consumers through palace.get_collection(), plus CLI migration and reindex tooling.
  • Implemented pluggable embedders (embeddings.py) with model aliases, caching, and CLI discovery.
  • Migrated knowledge graph storage to LanceDB tables by default, updated MCP/server and benchmarks, and refreshed tests/docs accordingly.

Reviewed changes

Copilot reviewed 31 out of 34 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
UPGRADE.md New v4 upgrade guide covering backend migration, embedders, KG changes, benchmarks
PLAN.md New/updated phased plan documenting backend/vectorizer/KG work and benchmarks
pyproject.toml Switch core deps to LanceDB/ONNX stack; move Chroma + sentence-transformers behind extras
mempalace/db.py New backend abstraction (LanceCollection/ChromaCollection), backend detection, open factory
mempalace/embeddings.py New embedding backends (ONNX/ST/Ollama), aliasing, caching, listing utilities
mempalace/palace.py Centralized collection open logic (now backend-agnostic)
mempalace/searcher.py Updated search to use get_collection() instead of direct Chroma client
mempalace/miner.py Updated status/mining paths to use backend-agnostic collection access
mempalace/mcp_server.py Removed Chroma client caching and switched to palace collection abstraction; KG default updated
mempalace/palace_graph.py Switched graph building to backend-agnostic collection retrieval
mempalace/layers.py Updated layered context builders to use backend-agnostic collection access
mempalace/normalize.py Minor formatting change in file-size error message
mempalace/split_mega_files.py Minor formatting changes to size-limit skip messages
mempalace/init.py Adjusted dependency logger silencing (incl. sentence-transformers)
benchmarks/results_v4_comparison.json Added committed benchmark outputs for v4 comparison
benchmarks/longmemeval_v4.py New benchmark runner using db abstraction + embedders
benchmarks/BENCHMARKS.md Added v4 backend comparison section and updated tables/notes
benchmarks/BENCHMARKS_V4.md Added dedicated v4 benchmark write-up and reproduction steps
tests/test_searcher.py Updated mocking to patch get_collection() and changed empty-palace expectations
tests/test_palace_graph.py Removed import-time Chroma mocking; direct palace_graph imports
tests/test_miner.py Updated tests to use backend-agnostic get_collection() helper
tests/test_mcp_server.py Updated collection helper and status expectations for empty palace
tests/test_layers.py Updated tests to patch _get_palace_collection instead of Chroma client
tests/test_knowledge_graph.py Updated WAL test setup to explicitly use SQLite backend path
tests/test_knowledge_graph_extra.py Updated fixture to use palace directory (Lance-backed KG)
tests/test_embeddings.py New tests for embedder backends, aliases, caching, and db metadata tracking
tests/test_convo_miner.py Updated tests to use backend-agnostic get_collection() helper
tests/test_cli.py Updated CLI tests to patch get_collection and removed Chroma module mocking
tests/conftest.py Updated fixtures to use backend-agnostic collections and Lance-backed KG fixture
tests/benchmarks/test_layers_bench.py Minor assertion formatting change
Comments suppressed due to low confidence (2)

mempalace/searcher.py:34

  • This except Exception path prints "No palace found" for all failures (including "empty palace", backend-specific errors, and ImportError when chromadb isn't installed for legacy palaces). Narrowing the exception handling (e.g., detect missing directory vs empty vs missing [chroma] extra) would avoid misleading guidance and make migration failures actionable.
    except Exception:
        print(f"\n  No palace found at {palace_path}")
        print("  Run: mempalace init <dir> then mempalace mine <dir>")
        raise SearchError(f"No palace found at {palace_path}")
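
A narrowed handler along the lines suggested could look like this sketch; `open_palace_or_explain` is a hypothetical helper, and `SearchError` is redefined locally so the example is self-contained:

```python
import os

class SearchError(Exception):  # stand-in for mempalace's real SearchError
    pass

def open_palace_or_explain(palace_path, get_collection):
    """Hypothetical narrowed handler: distinguish a missing path, a missing
    chromadb dependency, and an empty-but-valid palace."""
    if not os.path.isdir(palace_path):
        raise SearchError(f"No palace found at {palace_path}. "
                          f"Run: mempalace init <dir> then mempalace mine <dir>")
    try:
        col = get_collection(palace_path)
    except ImportError:
        raise SearchError(f"Palace at {palace_path} uses ChromaDB. "
                          f"Install mempalace[chroma] or run: mempalace migrate")
    if col.count() == 0:
        raise SearchError(f"Palace at {palace_path} exists but has 0 drawers. "
                          f"Run: mempalace mine <dir>")
    return col
```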

mempalace/searcher.py:111

  • Empty palaces (count()==0) are currently reported as "No palace found" here as well. Since the API surface returns structured errors, it would be more accurate to return a distinct error/hint for "palace exists but contains 0 drawers" and to preserve actionable info for legacy Chroma palaces when chromadb isn't installed (suggest installing mempalace[chroma]).
        col = get_collection(palace_path)
        if col.count() == 0:
            raise Exception("Empty palace")
    except Exception as e:
        logger.error("No palace found at %s: %s", palace_path, e)
        return {
            "error": "No palace found",
            "hint": "Run: mempalace init <dir> && mempalace mine <dir>",
        }


Comment thread UPGRADE.md Outdated
Comment thread mempalace/db.py Outdated
Comment thread mempalace/searcher.py Outdated
Comment thread mempalace/cli.py
Comment on lines 549 to 557
# Connect to palace
try:
-   client = chromadb.PersistentClient(path=palace_path)
-   col = client.get_collection("mempalace_drawers")
+   col = get_collection(palace_path)
    if col.count() == 0:
        raise Exception("Empty palace")
except Exception:
    print(f"\n  No palace found at {palace_path}")
    print("  Run: mempalace init <dir> then mempalace mine <dir>")
    sys.exit(1)

Copilot AI Apr 10, 2026


cmd_compress treats any failure to open/read the collection (including an existing-but-empty palace, or an ImportError due to missing [chroma] for a legacy palace) as "No palace found" and exits. Consider distinguishing these cases so users get the right remediation (mine first vs install mempalace[chroma] vs fix path).

Comment thread mempalace/knowledge_graph.py Outdated
Comment thread mempalace/knowledge_graph.py Outdated
Comment on lines +620 to +623
        preds = set()
        if self._triples_table:
            for r in self._lance_get(self._triples_table, "id != ''", limit=100_000):
                preds.add(r.get("predicate", ""))

Copilot AI Apr 10, 2026


relationship_types for Lance is built by iterating up to 100k rows in Python, which can both miss predicates beyond the limit and be slow/memory-heavy. If LanceDB supports it, prefer SELECT DISTINCT predicate-style aggregation, or at least document/handle truncation explicitly.
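
If the table exposes a columnar view, the aggregation could be sketched like this; it assumes lancedb's `Table.to_arrow()`, which should be verified against the version in use:

```python
def relationship_types(triples_table):
    """Sketch: pull only the predicate column and deduplicate, instead of
    materializing up to 100k full rows in Python. Assumes the table exposes
    a PyArrow view via .to_arrow() (true for recent lancedb releases)."""
    column = triples_table.to_arrow().column("predicate")
    return sorted(set(column.to_pylist()))
```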

Comment thread tests/test_embeddings.py
Comment on lines +141 to +147
def test_onnx_embedder_embed():
    e = OnnxEmbedder()
    result = e.embed(["hello world", "test sentence"])
    assert len(result) == 2
    assert len(result[0]) == 384
    assert all(isinstance(x, float) for x in result[0])


Copilot AI Apr 10, 2026


These ONNX embedder tests call OnnxEmbedder.embed() directly, which triggers a real model download/extract into the (test-isolated) HOME cache on first run. That makes the unit test suite network-dependent and potentially slow/flaky. Consider mocking _ensure_onnx_model()/InferenceSession (or marking these as an opt-in integration test) so the default test run stays hermetic.
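
One way to keep the default run hermetic, sketched here with a toy stand-in class rather than the real embedder (the method name `_load_session` is hypothetical):

```python
from unittest import mock

class OnnxEmbedder:
    """Toy stand-in for the real embedder, to illustrate the mocking pattern."""
    def _load_session(self):
        # In the real class this would download/extract the ONNX model.
        raise RuntimeError("would download the ONNX model over the network")

    def embed(self, texts):
        self._load_session()
        return [[0.0] * 384 for _ in texts]

def test_embed_is_hermetic():
    # Patch the download/session setup so the unit test never hits the network.
    with mock.patch.object(OnnxEmbedder, "_load_session", return_value=None):
        result = OnnxEmbedder().embed(["hello", "world"])
    assert len(result) == 2
    assert len(result[0]) == 384
```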

Comment thread mempalace/palace.py
Comment on lines +39 to +52
def get_collection(
    palace_path: str, collection_name: str = "mempalace_drawers", backend: str = None, embedder=None
):
    """Get or create the palace collection.

    This is the main entry point for all palace database access.
    Auto-detects the backend (LanceDB or ChromaDB) based on existing data.
    """
    return open_collection(
        palace_path=palace_path,
        collection_name=collection_name,
        backend=backend,
        embedder=embedder,
    )

Copilot AI Apr 10, 2026


get_collection() accepts a backend override, and docs/UPGRADE mention MEMPALACE_BACKEND / config backend to force a specific backend, but the default path (backend=None) currently only auto-detects via detect_backend() and never consults MempalaceConfig().backend. As a result, the env/config knobs appear to have no effect unless every caller explicitly passes backend. Consider defaulting backend from MempalaceConfig().backend (or reading the env var directly) before falling back to auto-detection.
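
The resolution order being suggested, as a sketch (hypothetical helper; the real code would consult `MempalaceConfig` rather than take the value as a parameter):

```python
import os

def choose_backend(explicit, config_backend, detect):
    """Hypothetical resolution order: explicit argument, then the
    MEMPALACE_BACKEND env var, then the config file value, then
    on-disk auto-detection."""
    return (
        explicit
        or os.environ.get("MEMPALACE_BACKEND")
        or config_backend
        or detect()
    )
```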

@dekoza dekoza force-pushed the feat/lancedb-migration branch from 519b9cd to df316e2 Compare April 10, 2026 21:15

@web3guru888 web3guru888 left a comment


This is the PR I've been waiting for. We've been pinning chromadb>=0.6.0,<0.7.0 in our integration for months specifically because of ChromaDB API surface volatility — having a backend abstraction that isolates the rest of the codebase from that is a significant quality-of-life improvement.

The LanceCollection / ChromaCollection + unified get_collection() factory is clean. A few things I noticed in db.py:

Where-clause translation (_chroma_where_to_sql): The current implementation handles $and, $or, and the common comparison operators. One gap: $in / $nin (used in some of MemPalace's own code via {"wing": {"$in": ["astrophysics", "economics"]}}). If any callers depend on those, they'll silently fall through without adding a condition. Worth either handling or adding an explicit NotImplementedError for unsupported operators.
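
A sketch of the translation with `$in`/`$nin` handled and a defensive `ValueError` for unknown operators; this is a deliberate simplification of the real `_chroma_where_to_sql` (no nesting, no `$and`/`$or`):

```python
def chroma_where_to_sql(where: dict) -> str:
    """Translate a flat Chroma-style where dict to a SQL filter string
    (illustrative subset of the operators db.py supports)."""
    def quote(v):
        # Single-quote strings with '' escaping; pass numbers through.
        return "'" + str(v).replace("'", "''") + "'" if isinstance(v, str) else str(v)

    clauses = []
    for field, cond in where.items():
        if isinstance(cond, dict):
            for op, val in cond.items():
                if op == "$in":
                    clauses.append(f"{field} IN ({', '.join(quote(v) for v in val)})")
                elif op == "$nin":
                    clauses.append(f"{field} NOT IN ({', '.join(quote(v) for v in val)})")
                elif op in ("$gt", "$gte", "$lt", "$lte", "$eq", "$ne"):
                    sql_op = {"$gt": ">", "$gte": ">=", "$lt": "<",
                              "$lte": "<=", "$eq": "=", "$ne": "!="}[op]
                    clauses.append(f"{field} {sql_op} {quote(val)}")
                else:
                    # Fail loudly instead of silently dropping a condition.
                    raise ValueError(f"Unsupported operator: {op}")
        else:
            clauses.append(f"{field} = {quote(cond)}")
    return " AND ".join(clauses)
```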

get() with large offset and no limit: The fallback of limit(100_000) when only an offset is provided could return 100k rows into memory. The pattern appears intentional (used in get_changes_since in PR #575), but on large palaces this materializes everything. A generator/cursor approach would be better long-term, but I understand that's a bigger change.

upsert() fallback path: The merge_insert failure → delete+add fallback is pragmatic, but it means on a partial failure (merge fails midway, then delete+add also fails on some records) you get a partially-committed batch with no error surfaced to the caller. The outer try/except Exception on _to_records calls would hide this. Consider at minimum logging logger.warning("upsert partial failure...") in the fallback.

_check_dimension(): The RuntimeError message is excellent — actionable and includes the exact model names. Good pattern.

Embeddings: ONNX default (all-MiniLM-L6-v2, ~60MB vs 3GB torch) is the right default for most users. The reindex command is essential — you can't switch models without it, and it's not obvious why search quality drops when someone changes their config. The error message from _check_dimension should cite mempalace reindex as the fix, which it does. Nice.

Benchmark results committed: The results_v4_comparison.json in the diff is a nice touch for reproducibility, though committed benchmark artifacts can get stale. Consider a note in BENCHMARKS.md about when they were generated.

222 passing with 2 intentional skips is solid coverage for a change this large. I'd want to see the ChromaDB → LanceDB migration path tested with a real palace that has KG triples before merging, but that may be in the existing test suite.

Overall this is solid architectural work. The abstraction boundary is clean and the migration story (mempalace migrate) is well thought out.

Author

dekoza commented Apr 10, 2026

Thanks, addressed your concerns in recent pushes

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 31 out of 34 changed files in this pull request and generated 9 comments.



Comment thread mempalace/config.py
Comment on lines +158 to +162
    """Storage backend: 'lance' (default) or 'chroma' (legacy)."""
    env_val = os.environ.get("MEMPALACE_BACKEND")
    if env_val:
        return env_val
    return self._file_config.get("backend", DEFAULT_BACKEND)

Copilot AI Apr 10, 2026


MempalaceConfig.backend returns DEFAULT_BACKEND ('lance') when the user hasn't explicitly configured a backend. This makes it impossible for callers to distinguish “unset” vs “forced lance”, and breaks the advertised auto-detection path for existing ChromaDB palaces. Consider returning None when the config/env key is absent (and let db.detect_backend() decide), while still honoring an explicit env/config value.

Suggested change
-    """Storage backend: 'lance' (default) or 'chroma' (legacy)."""
-    env_val = os.environ.get("MEMPALACE_BACKEND")
-    if env_val:
-        return env_val
-    return self._file_config.get("backend", DEFAULT_BACKEND)
+    """Storage backend override, or None when not explicitly configured."""
+    env_val = os.environ.get("MEMPALACE_BACKEND")
+    if env_val:
+        return env_val
+    return self._file_config.get("backend")

Comment thread mempalace/searcher.py
Comment on lines 26 to 36
try:
-   client = chromadb.PersistentClient(path=palace_path)
-   col = client.get_collection("mempalace_drawers")
+   col = get_collection(palace_path)
except ImportError:
    print(f"\n  Palace at {palace_path} uses ChromaDB but 'chromadb' is not installed.")
    print("  Install with: pip install 'mempalace[chroma]'")
    print("  Or migrate: mempalace migrate")
    raise SearchError(f"Missing chromadb dependency for {palace_path}")
except Exception:
    print(f"\n  No palace found at {palace_path}")
    print("  Run: mempalace init <dir> then mempalace mine <dir>")
    raise SearchError(f"No palace found at {palace_path}")

Copilot AI Apr 10, 2026


The broad except Exception here will also catch a LanceDB embedder dimension mismatch RuntimeError and report it as “No palace found”. That makes recovery steps opaque. Consider catching the dimension-mismatch RuntimeError separately and surfacing its actionable message (e.g., instruct to run mempalace reindex).
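
A sketch of letting the guard's message surface instead of masking it (hypothetical helper; `SearchError` redefined locally for self-containment):

```python
class SearchError(Exception):  # stand-in for mempalace's real SearchError
    pass

def open_for_search(palace_path, get_collection):
    """Sketch: re-raise the dimension-mismatch RuntimeError with its
    actionable message instead of collapsing it into 'No palace found'."""
    try:
        return get_collection(palace_path)
    except RuntimeError as e:
        if "dimension" in str(e).lower():
            raise SearchError(f"{e}\nFix: mempalace reindex") from e
        raise
    except Exception as e:
        raise SearchError(f"No palace found at {palace_path}") from e
```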

Comment thread mempalace/cli.py
Comment on lines +275 to +279
    # Read all existing records
    from .db import open_collection

    col = open_collection(palace_path, embedder=embedder)
    total = col.count()

Copilot AI Apr 10, 2026


cmd_reindex opens the collection using the target embedder. If the existing LanceDB table has a different vector dimension, open_collection() will raise the dimension-mismatch guard before any data can be read—so reindex can’t be used to fix a mismatch. The reindex flow likely needs to (1) open/read with the existing table schema (or bypass dimension check) and (2) recreate/rewrite the table with the new embedder.

Comment thread mempalace/cli.py
Comment on lines +442 to +448
    if has_embeddings:
        print("  Transferring with original embeddings (no re-embedding needed)...")
    else:
        print("  Re-embedding all drawers (ChromaDB embeddings not available)...")

    lance_col = open_collection(palace_path, backend="lance")


Copilot AI Apr 10, 2026


Migration writes transferred ChromaDB embeddings into a new LanceDB table but creates the LanceCollection with whatever embedder is active in config. If the configured embedder has a different dimension than the transferred vectors, the resulting palace will immediately fail to reopen/query (and may even fail queries within the same process). Consider forcing the migration embedder to match the transferred embeddings (MiniLM 384d) or, if the user’s configured embedder differs, ignore transferred embeddings and re-embed so the Lance schema and active embedder stay consistent. Also ensure embedding_model metadata reflects the actual model that produced the stored vectors.

Comment on lines +146 to +166
    def _to_records(self, documents, ids, metadatas, embeddings=None):
        """Convert to LanceDB record format, computing embeddings if needed."""
        if embeddings is None:
            embeddings = self._embedder.embed(documents)

        records = []
        for doc, id_, meta, vec in zip(documents, ids, metadatas, embeddings):
            # Inject the embedding model name so we can detect mismatches later
            meta_with_model = dict(meta)
            meta_with_model.setdefault("embedding_model", self._embedder.model_name)
            record = {
                "id": id_,
                "document": doc,
                "vector": vec,
                "wing": str(meta.get("wing", "")),
                "room": str(meta.get("room", "")),
                "source_file": str(meta.get("source_file", "")),
                "node_id": str(meta.get("node_id", "")),
                "seq": int(meta.get("seq", 0)),
                "metadata_json": json.dumps(meta_with_model, default=str),
            }

Copilot AI Apr 10, 2026


When embeddings are provided (e.g., during migration), _to_records() stores vectors without validating that their length matches self._embedder.dimension. This can create a table whose vector column dimension conflicts with the embedder attached to the collection instance, leading to hard-to-diagnose failures later. Add a dimension consistency check (and a clearer error) when embeddings are supplied.
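
The suggested check, sketched as a hypothetical helper:

```python
def validate_embeddings(embeddings, expected_dim: int) -> None:
    """Sketch of a consistency check for caller-supplied vectors: reject
    any whose length disagrees with the collection's embedder dimension."""
    for i, vec in enumerate(embeddings):
        if len(vec) != expected_dim:
            raise ValueError(
                f"Embedding {i} has dimension {len(vec)}, but this collection "
                f"expects {expected_dim}. Re-embed the documents or pass "
                f"vectors from a matching model."
            )
```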

Comment on lines +43 to +58
    def __init__(self, db_path: str = None, palace_path: str = None):
        if palace_path is not None:
            self._backend = "lance"
            self._palace_path = palace_path
            self._db = None
            self._entities_table = None
            self._triples_table = None
            self._init_lance()
        elif db_path is not None and not db_path.endswith(".sqlite3"):
            # Treat non-sqlite path as a palace directory
            self._backend = "lance"
            self._palace_path = db_path
            self._db = None
            self._entities_table = None
            self._triples_table = None
            self._init_lance()

Copilot AI Apr 10, 2026


KnowledgeGraph.__init__ treats any db_path not ending in .sqlite3 as a palace directory (Lance backend). This is a breaking change for existing callers that used other SQLite extensions (e.g., .db), and can also mis-handle an actual SQLite file path by creating a directory at that location. Consider detecting “directory vs file” (e.g., os.path.isdir) and/or accepting common SQLite suffixes instead of relying solely on .sqlite3.
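
The suggested detection, as a sketch:

```python
import os

def resolve_kg_backend(db_path: str) -> str:
    """Sketch of the directory-vs-file detection suggested above."""
    if os.path.isdir(db_path):
        return "lance"                 # palace directory -> Lance tables
    if db_path.endswith((".sqlite3", ".sqlite", ".db")):
        return "sqlite"                # common SQLite suffixes
    if os.path.isfile(db_path):
        return "sqlite"                # an existing file can't be a palace dir
    return "lance"                     # new path with no SQLite suffix
```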

Comment thread UPGRADE.md
Comment on lines +124 to +128
---

## 4. Unified Knowledge Graph (Phase 5)

The knowledge graph (entities + triples) has moved from a separate SQLite

Copilot AI Apr 10, 2026


Section numbering jumps from “2. Pluggable Embedders” to “4. Unified Knowledge Graph”, which is confusing in an upgrade guide. Consider renumbering the headings (or adding a section 3) so references remain consistent.

Comment thread tests/test_embeddings.py
Comment on lines +141 to +176
def test_onnx_embedder_embed():
    e = OnnxEmbedder()
    result = e.embed(["hello world", "test sentence"])
    assert len(result) == 2
    assert len(result[0]) == 384
    assert all(isinstance(x, float) for x in result[0])


def test_onnx_embedder_dimension():
    e = OnnxEmbedder()
    assert e.dimension == 384


def test_onnx_embedder_output_normalized():
    """ONNX embedder output vectors must be unit length."""
    e = OnnxEmbedder()
    result = e.embed(["normalization check"])
    norm = np.linalg.norm(result[0])
    assert abs(norm - 1.0) < 1e-5


def test_onnx_embedder_batch():
    """ONNX embedder handles batches larger than internal batch size."""
    e = OnnxEmbedder()
    texts = [f"document number {i}" for i in range(50)]
    result = e.embed(texts)
    assert len(result) == 50
    assert all(len(v) == 384 for v in result)


def test_onnx_embedder_deterministic():
    """Same input produces identical output across calls."""
    e = OnnxEmbedder()
    r1 = e.embed(["reproducibility test"])
    r2 = e.embed(["reproducibility test"])
    assert r1 == r2

Copilot AI Apr 10, 2026


These ONNX embedder tests call OnnxEmbedder.embed() directly, which will download ~tens of MB of model artifacts on first run and perform real inference. This makes the unit test suite slow and potentially flaky in environments without network access or with restricted home directories. Consider marking these as integration tests (skip unless an env var is set) and/or mocking the download/session/tokenizer so unit tests remain offline and deterministic.

Comment thread mempalace/embeddings.py
Comment on lines +377 to +387
    if name == "ollama":
        model = options.get("model", "nomic-embed-text")
        base_url = options.get("base_url", "http://localhost:11434")
        timeout = float(options.get("timeout", 60.0))
        cache_key = f"ollama:{model}@{base_url}"

        if cache_key not in _embedder_cache:
            _embedder_cache[cache_key] = OllamaEmbedder(
                model=model, base_url=base_url, timeout=timeout
            )
        return _embedder_cache[cache_key]

Copilot AI Apr 10, 2026


get_embedder() caches Ollama embedders using a key that ignores timeout. If a caller changes only the timeout in embedder_options, they’ll still get the previously cached instance with the old timeout value. Consider including timeout in the cache key (or otherwise keying on the full options dict) to avoid surprising behavior.
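
A sketch of a key that folds in the timeout (hypothetical helper name):

```python
def ollama_cache_key(model: str, base_url: str, timeout: float) -> str:
    """Include every behavior-affecting option in the cache key so a
    changed timeout yields a fresh embedder instance (sketch)."""
    return f"ollama:{model}@{base_url}#timeout={timeout}"
```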


@web3guru888 web3guru888 left a comment


@dekoza — the four key fixes in commit 7a15a461 address the main issues I raised:

$in/$nin in _chroma_where_to_sql — the silent data loss path is now closed. Mapping $in/$nin to IN (...)/NOT IN (...) with proper quoting is correct, and the ValueError on unknown operators is the right defensive response.

KG count_rows(filter=) vs len(get(limit=100_000)) — this is the right fix. count_rows is O(1) at the Lance layer vs the full materialization that get(limit=100k) was doing. This also removes the silent truncation bug at exactly 100k triples.

ImportError catch in _open_chroma with actionable message — the "install mempalace[chroma]" message makes the missing-dependency case debuggable without reading source.

merge_insert fallback log upgraded to warning — correct, it signals a performance regression, not a debug detail.

The searcher empty-palace / missing-palace / missing-chromadb distinction is a nice UX improvement.

One remaining observation (minor, non-blocking): the upsert fallback in the case where merge_insert isn't available still swallows partial failures by iterating per-record. Worth a follow-up to surface the partial failure count in the warning log. Not a blocker for merge.

Approving — the backend abstraction is solid and this is a significant capability addition.


[MemPalace-AGI integration — 215 tests, 710 KG entities]


@web3guru888 web3guru888 left a comment


This is a significant architectural change and the scope is substantial (21 files, full backend abstraction). The LanceCollection/ChromaCollection unified API approach is the right design — a storage seam that lets the two backends be swapped without changing call sites.

A few notes from the design review:

LanceDB V4 Windows blocker: This is the same issue flagged in #529: lancedb still imports fcntl on Windows in some builds. If this PR is meant to work cross-platform, that needs to be resolved before merge. The [extras] approach (LanceDB as optional) helps, but the import at the top of db.py may still fail on Windows even without the extra.

Migration path completeness: mempalace migrate for ChromaDB → LanceDB is the right surface. The key question is whether the migration handles embedding re-indexing — LanceDB uses its own vector format and ChromaDB's serialized embeddings are not directly transferable. If the migration re-embeds from stored text (slower but correct), that should be documented explicitly.

ChromaDB as [chroma] optional extra: Making chromadb optional is the right long-term direction, but it means every existing installation breaks unless they either install the extra or get a clean upgrade path. The pyproject.toml changeset should include ChromaDB in a compatibility extra (e.g. pip install mempalace[chroma]) and the migration docs should be prominent.

Breaking change scope: This touches palace.py, searcher.py, miner.py, mcp_server.py — every major path. Given the number of community PRs currently open against these files (#542, #544, #562, etc.), the merge conflict surface is enormous. Worth coordinating with the maintainer before the review cycle closes on those PRs.

Benchmark data: benchmarks/BENCHMARKS_V4.md + results_v4_comparison.json is good evidence. Would be useful to see the search latency comparison at 10K+ drawers where LanceDB's columnar scan should outperform ChromaDB's HNSW.

This is the right architectural direction — LanceDB's embedded model without a server process is a better fit for local-first tooling. The coordination and migration story are the main things to nail before this is merge-ready.


[MemPalace-AGI integration — 215 tests, 710 KG entities]

@web3guru888

@dekoza — thanks for the fast fix commits. The specific issues I raised in my initial review are addressed in 7a15a461. The APPROVED review at 22:33Z stands for those.

The separate COMMENTED review at 22:47Z raised some broader design-level questions that are separate from the code correctness — flagging them here since they bear on merge readiness:

Windows / fcntl import: The [extras] approach helps, but if import lancedb at module top-level (without lazy import) still pulls in fcntl-dependent code on Windows, existing users who pip install mempalace will get an import error before they even try to use LanceDB. Lazy-import (import lancedb inside function body or TYPE_CHECKING block) would be the safe path. Happy to be wrong here if you have tested on Windows.

Migration re-embedding: Does mempalace migrate re-embed from stored text, or does it try to transfer raw ChromaDB embedding vectors? If the latter, a dimension mismatch between the old embedder and the new ONNX default will silently corrupt the index. Worth one sentence in UPGRADE.md either way.

Merge conflict surface: There are several open PRs (#542, #544, #562) that touch palace.py, miner.py, mcp_server.py. If those merge before #574, rebase becomes significant work. Flagging so you and the maintainer can coordinate timing.

None of these are blockers if you can address the Windows import and the migration note — the architecture is solid.


[MemPalace-AGI integration — 540+ discoveries, 5,251 KG triples]

Author

dekoza commented Apr 11, 2026

@web3guru888 — thanks for the thorough follow-up. Addressing each point:

1. Windows / fcntl import — already covered.

All import lancedb calls are already lazy (inside function bodies): db.py:475, knowledge_graph.py:69, benchmarks/longmemeval_v4.py:63. Zero top-level imports. lancedb itself has no fcntl dependency — it is Rust/PyArrow under the hood.

The fcntl usage in sync_meta.py is also guarded — try/except ImportError with a _HAS_FCNTL/_HAS_MSVCRT dual-path pattern (lines 23–30). Windows users hit msvcrt.locking() instead. No change needed here.

2. Migration re-embedding — doc fix pushed.

You are right that this deserved explicit documentation. mempalace migrate transfers raw ChromaDB vectors as-is (no re-embedding). If the old palace used a different embedder than the current config, vectors and queries would use mismatched models. There is a dimension guard (_check_dimension() in db.py) that raises RuntimeError on open if dimensions differ, so a hard mismatch cannot corrupt silently — but same-dimension model swaps (e.g. two 384d models) would degrade quality without error.

Added a note to UPGRADE.md (c329713) clarifying that mempalace reindex should be run after migration if the embedder changed.

3. Merge conflict surface — mostly resolved.

Of the three PRs flagged: #544 is merged, #562 is closed, #542 is still open but its overlap with #574 is limited (security/test hardening, not storage layer). Should be manageable on rebase.

@web3guru888

@dekoza — all three concerns addressed cleanly. The implementation details are better than I assumed:

Windows / fcntl — lazy imports throughout + the _HAS_FCNTL/_HAS_MSVCRT dual-path in sync_meta.py is the right approach. No change needed there.

Migration re-embedding — the _check_dimension() guard covering the hard case (different dimensions = RuntimeError on open) and the UPGRADE.md note covering the soft case (same-dimension model swap degrades silently) is exactly the right documentation split.

Merge conflict surface — good to hear #544 is merged and #562 is closed. #542 touching security/test hardening shouldn't conflict with the storage layer changes.

The architecture is solid and the open concerns are addressed. LGTM.



@dekoza dekoza force-pushed the feat/lancedb-migration branch from 7a15a46 to 161b511 Compare April 11, 2026 19:38
@bensig bensig changed the base branch from main to develop April 11, 2026 22:21
@bensig bensig requested a review from igorls as a code owner April 11, 2026 22:21
@dekoza dekoza force-pushed the feat/lancedb-migration branch from 161b511 to f82155f Compare April 12, 2026 00:13
@dekoza

dekoza commented Apr 12, 2026

Rebased onto develop, resolved all conflicts and aligned the architecture with the backend seam from PR #413.

Key changes in this update:

  • Aligned with backends/ architecture. Instead of a standalone db.py that replaced the backend seam, the LanceDB implementation now lives in backends/lance.py as LanceCollection(BaseCollection) + LanceBackend — extending the same pattern established by backends/chroma.py. The old db.py has been removed.
  • LanceDB is the default backend for new palaces. Existing ChromaDB palaces are auto-detected and continue to work. Migration path via mempalace migrate is unchanged.
  • Fixed create=False semantics — read-only operations like status no longer create palace directories as a side effect.
  • All 627 tests pass (2 skipped), including backend seam tests from #413 (Mempalace backend seam).
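The backend-seam shape described in the first bullet (a `LanceCollection` extending the upstream `BaseCollection` ABC, parallel to `backends/chroma.py`) can be sketched like this. Method names and the in-memory body are hypothetical stand-ins, not the real backends/lance.py:

```python
from abc import ABC, abstractmethod


class BaseCollection(ABC):
    """Minimal seam: every backend exposes the same add/query surface."""

    @abstractmethod
    def add(self, *, ids, documents, metadatas=None): ...

    @abstractmethod
    def query(self, *, query_texts, n_results=10): ...


class LanceCollection(BaseCollection):
    """Toy stand-in for backends/lance.py: rows live in memory here.
    The real class would embed documents and run ANN search over LanceDB."""

    def __init__(self):
        self.rows = {}

    def add(self, *, ids, documents, metadatas=None):
        for id_, doc in zip(ids, documents):
            self.rows[id_] = doc

    def query(self, *, query_texts, n_results=10):
        # Placeholder for vector search: a plain substring scan.
        hits = [i for i, d in self.rows.items() if query_texts[0] in d]
        return hits[:n_results]
```

The point of the pattern is that palace.py only ever talks to `BaseCollection`, so swapping Chroma for Lance (or a future Qdrant/Postgres backend) is a constructor choice, not a call-site rewrite.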

dekoza added 2 commits April 12, 2026 08:01
Database abstraction (Phase 1):
- New db.py with LanceCollection/ChromaCollection sharing identical API
- New embeddings.py with SentenceTransformerEmbedder (default)
- palace.py now delegates to db.open_collection() with auto-detection
- All consumers (searcher, layers, mcp_server, miner, palace_graph)
  updated to use palace.get_collection() instead of direct chromadb
- Added 'mempalace migrate' for ChromaDB -> LanceDB migration
- LanceDB is new default; ChromaDB moved to optional [chroma] extra
- Dependencies: lancedb>=0.14, sentence-transformers>=2.2.0

Pluggable vectorizers (Phase 2):
- OllamaEmbedder for GPU server usage via HTTP
- Model aliases: bge-small, bge-base, e5-base, nomic, ollama
- 'mempalace reindex' to re-embed with a different model
- 'mempalace embedders' to list available models
- embedding_model tracked in every record's metadata
- Config via ~/.mempalace/config.json embedder/embedder_options

Tests: 552 passed, 0 failed (18 new embedding tests)
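The zero-extra-deps Ollama path from Phase 2 can be sketched with stdlib `urllib` alone. The endpoint and request fields follow Ollama's documented `/api/embeddings` API; the class itself is illustrative, not the PR's `OllamaEmbedder`:

```python
import json
import urllib.request


class OllamaEmbedder:
    """Embeds text by POSTing to an Ollama server over HTTP -- no torch,
    no client library, just the standard library."""

    def __init__(self, model="nomic-embed-text", base_url="http://localhost:11434"):
        self.model = model
        self.url = f"{base_url}/api/embeddings"

    def _payload(self, text: str) -> bytes:
        # Ollama's embeddings endpoint takes {"model": ..., "prompt": ...}.
        return json.dumps({"model": self.model, "prompt": text}).encode()

    def embed(self, text: str) -> list:
        req = urllib.request.Request(
            self.url,
            data=self._payload(text),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())["embedding"]
```

This is what makes the `[gpu]` extra optional: the heavy model runs wherever Ollama runs, and mempalace only ships JSON over HTTP.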
knowledge_graph.py rewritten with dual backend:
- LanceDB (default): kg_entities + kg_triples tables in palace dir
- SQLite (legacy): preserved for existing .sqlite3 paths
- All operations (add_entity, add_triple, invalidate, query_entity,
  query_relationship, timeline, stats, seed_from_entity_facts) work
  on both backends with identical API

MCP server updated:
- KnowledgeGraph now uses palace_path instead of separate sqlite file
- One data directory, one format, one sync unit

UPGRADE.md updated with Phase 5 how-to and migration notes.
PLAN.md Phase 5 marked done.

Tests: 588 passed (28 KG tests all pass on LanceDB backend)
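The dual-backend KG above keeps one logical row shape for triples in both stores. A pure-Python sketch of that shape and of invalidation-as-tombstone (field names are assumptions, not the PR's schema):

```python
import time


class TripleStore:
    """Toy model of a kg_triples table: rows are append-only, and
    invalidate() sets valid_until instead of deleting, so the timeline
    query can still see superseded facts."""

    def __init__(self):
        self.rows = []

    def add_triple(self, subj, pred, obj):
        self.rows.append({
            "subject": subj, "predicate": pred, "object": obj,
            "valid_from": time.time(), "valid_until": None,
        })

    def invalidate(self, subj, pred):
        now = time.time()
        for r in self.rows:
            if (r["subject"] == subj and r["predicate"] == pred
                    and r["valid_until"] is None):
                r["valid_until"] = now

    def query_relationship(self, subj, pred):
        # Only currently-valid rows are returned to callers.
        return [r for r in self.rows
                if r["subject"] == subj and r["predicate"] == pred
                and r["valid_until"] is None]
```

Because both the LanceDB and SQLite paths honor the same row semantics, the 28 KG tests can run unchanged against either backend.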
dekoza added 9 commits April 12, 2026 08:04
New benchmark runner (benchmarks/longmemeval_v4.py):
- Runs LongMemEval against multiple backends in one invocation
- Modes: chroma-default, lance-default, lance-bge-small, lance-bge-base,
  lance-nomic, and custom embedder
- Produces side-by-side comparison table with R@5, R@10, NDCG, ms/query
- Per-type breakdown across question categories
- Presets: 'all' (3 modes), 'quick' (2), 'embedders' (4)

Results on full 500 questions:
  ChromaDB + MiniLM (v3.x):  R@5=0.966  R@10=0.982  NDCG@5=0.888  1165ms/q
  LanceDB  + MiniLM (v4.0):  R@5=0.966  R@10=0.982  NDCG@5=0.888   638ms/q
  LanceDB  + BGE-small:      R@5=0.962  R@10=0.978  NDCG@5=0.895  2624ms/q

Key findings:
- Zero retrieval regression: LanceDB matches ChromaDB exactly
- 1.8x faster queries (638ms vs 1165ms) with cosine distance
- BGE-small trades tiny R@5 drop for better NDCG (ranking quality)

Docs:
- benchmarks/BENCHMARKS_V4.md — full results + reproduction steps
- UPGRADE.md updated with benchmark section + how-to
- PLAN.md Phase 6 marked done — all 6 phases complete

Tests: 588 passed
Added v4.0 Backend Comparison section at top of BENCHMARKS.md:
- LanceDB+MiniLM: R@5=0.966, identical to ChromaDB, 1.8x faster (638ms vs 1165ms)
- LanceDB+BGE-small: R@5=0.962, higher NDCG (0.895 vs 0.888)
- Per-type breakdown showing BGE-small tradeoffs
- Reproduction commands for longmemeval_v4.py

Updated existing sections:
- Score progression table: added LanceDB and BGE-small rows
- Comparison table: added v4 LanceDB entry alongside v3 ChromaDB
- Tradeoffs table: added v4 column (sync, pluggable embedders, query speed)
- Results files table: added results_v4_comparison.json

Tests: 588 passed
- Add OnnxEmbedder as default (onnxruntime+tokenizers, no torch)
- Move sentence-transformers to [gpu] optional extra
- Strip all sync code from db.py, cli.py, config.py, UPGRADE.md
- Apply ruff lint fixes from f21ba0a to migration-branch files
- Remove unused chromadb import from conftest.py
- Regenerate uv.lock for migration-only dependencies

config.py:
  mempalace/config.py
…sync from benchmark

- Add _check_dimension() to LanceCollection.__init__: raises RuntimeError
  if existing table vector dimension doesn't match the active embedder
- Replace invalid port 99999 with RFC discard port 9 in Ollama test
- Remove sync_meta import and sync_identity from benchmark script
These fields are used by the sync layer for indexed queries instead of
full table scans. Adding them to FILTER_COLUMNS and SCHEMA_COLUMNS now
avoids a schema migration later.

Records without sync metadata get defaults (node_id='', seq=0).
- UPGRADE.md: fix default dep table (onnxruntime+tokenizers, not sentence-transformers)
- db.py: catch ImportError in _open_chroma with actionable message
- db.py: upgrade merge_insert fallback log to warning
- db.py: add $in/$nin support to _chroma_where_to_sql
- searcher.py: distinguish empty palace vs missing palace vs missing chromadb
- cli.py: same distinction in cmd_compress
- knowledge_graph.py: use count_rows(filter=) instead of len(get(limit=100k))
- knowledge_graph.py: remove limit=100 from _lance_timeline queries
- palace.py: get_collection reads MempalaceConfig().backend before auto-detect
…tream backend seam

- Move LanceCollection and LanceBackend from db.py to backends/lance.py
- LanceCollection now extends BaseCollection (upstream's ABC)
- Add detect_backend() to backends/__init__.py
- Update palace.py to use backends package for both Lance and Chroma
- LanceDB is the default backend for new palaces
- Remove db.py (all logic migrated to backends/lance.py)
- Update cli.py and test imports accordingly
Required by upstream's tool_update_drawer (PR MemPalace#667). LanceCollection
implements update as get+upsert (re-embeds on content change).
ChromaCollection delegates to chromadb's native update.
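The update-as-get-plus-upsert behavior described in this commit can be sketched as follows, with a stand-in embedder. This is an in-memory illustration of the semantics (re-embed only when content changes), not the real LanceCollection code:

```python
def fake_embed(text):
    # Stand-in for the real embedder: a length-based toy vector.
    return [float(len(text)), 0.0]


class Collection:
    """Models update implemented as get + conditional re-embed + upsert.
    Content changes trigger re-embedding; metadata-only updates keep the
    stored vector untouched."""

    def __init__(self):
        self.store = {}  # id -> {"doc": str, "vec": list, "meta": dict}

    def upsert(self, id_, doc, meta=None):
        self.store[id_] = {"doc": doc, "vec": fake_embed(doc), "meta": meta or {}}

    def update(self, id_, doc=None, meta=None):
        row = self.store[id_]                               # "get"
        if doc is not None and doc != row["doc"]:
            row["doc"], row["vec"] = doc, fake_embed(doc)   # re-embed
        if meta is not None:
            row["meta"].update(meta)
        self.store[id_] = row                               # "upsert"
```

The tradeoff is visible in the shape: this path is non-atomic (a crash between get and upsert can lose the change), which is why Chroma's native `update` is preferred where the backend has one.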
@dekoza dekoza force-pushed the feat/lancedb-migration branch from f82155f to d3085f9 Compare April 12, 2026 06:10
@dekoza

dekoza commented Apr 12, 2026

Rebased onto current develop (4621f85), picking up 5 new upstream commits.

Conflict resolutions:

  • layers.py — merged upstream's new build_where_filter import with our _get_palace_collection.
  • mcp_server.py — replaced ChromaDB-specific client/inode caching with backend-agnostic collection cache; removed import chromadb.
  • knowledge_graph.py — integrated upstream's threading.Lock around all SQLite operations into our dual-backend (LanceDB + SQLite) structure.
  • searcher.py — merged upstream's build_where_filter helper with our empty-palace validation checks.

New in this update:

All 679 tests pass (2 skipped).

igorls added a commit that referenced this pull request Apr 12, 2026
Formalizes the BaseCollection/BaseBackend contract introduced as a seam
in #413 into an interchangeability spec that third-party backends can
build to. Driven by six in-flight backend PRs (#574, #643, #665, #697,
#700, #381) each implementing the interface differently.

Key decisions captured: entry-point distribution, typed QueryResult/
GetResult replacing Chroma dict shape, daemon-first multi-palace model
via PalaceRef, required where-clause subset (incl. $contains),
mandatory embedder injection with model-identity validation, capability
tokens, shared pytest conformance suite, and a backend-neutral
migrate/verify CLI.
igorls added a commit that referenced this pull request Apr 12, 2026
Incorporates review feedback from skuznetsov (Postgres, #665) and
dekoza (Lance, #574) on issue #737:

- §1.5: split 'accepts embeddings=' (signature compliance) from
  'persists embeddings as-is' (correctness). Adds
  supports_embeddings_passthrough capability; the former is universal,
  the latter is required to label a migration lossless.
- §1.5: model identity check becomes a three-state machine
  (known_match / known_mismatch / unknown) so legacy palaces without
  recorded identity don't hard-fail on upgrade.
- §1.4: makes explicit that supports_contains_fast is the ONLY
  performance floor the spec promises; without it callers MUST assume
  O(n). $contains is a correctness requirement, not a performance one.
- §3.3: clarifies auto-detect is an upgrade-compat path only, never
  the selection mechanism for new palaces.
- §8.2: migrate CLI refuses to run against a target lacking
  supports_embeddings_passthrough unless --accept-re-embed is passed;
  migration record now captures lossless status and model identities.
@igorls igorls added area/cli CLI commands area/install pip/uv/pipx/plugin install and packaging area/kg Knowledge graph area/mcp MCP server and tools area/mining File and conversation mining area/search Search and retrieval enhancement New feature or request storage labels Apr 14, 2026
igorls added a commit that referenced this pull request Apr 18, 2026
…nd registry (RFC 001 §10)

Advances RFC 001 §10 cleanup so backend-author PRs (#574 LanceDB, #665 Postgres,
#700 Qdrant, #697 hosted, #643 PalaceStore, #381 Qdrant) have a stable target
to align against.

Scope (this PR):

- Typed QueryResult / GetResult dataclasses replace Chroma's dict shape at
  the BaseCollection boundary (§1.3). A transitional _DictCompatMixin keeps
  existing callers working while the attribute-access migration proceeds.
- BaseCollection is now kwargs-only across add/upsert/query/get/delete/update
  with ABC defaults for estimated_count/close/health and a non-atomic default
  update() (§1.1–1.2).
- PalaceRef replaces raw path strings at the backend boundary (§2.2).
- BaseBackend ABC with get_collection/close_palace/close/health/detect (§2.3).
- mempalace.backends entry-point group + in-tree registry with
  resolve_backend_for_palace priority order matching §3.2–3.3.
- ChromaCollection normalizes chroma returns into typed results; unknown
  where-clause operators raise UnsupportedFilterError (no silent drop, §1.4).
- ChromaBackend absorbs the inode/mtime client-cache freshness check
  previously duplicated in mcp_server._get_client() (§10 + PR #757).
- searcher.py migrated to typed-attribute access as the reference call
  site; remaining callers land in a follow-up.
- pyproject: chroma registered via [project.entry-points."mempalace.backends"].

Out of scope (explicit follow-ups):

- Full caller migration off the dict-compat shim across palace.py,
  mcp_server.py, miner.py, convo_miner.py, dedup.py, repair.py, exporter.py,
  palace_graph.py, cli.py, closet_llm.py.
- Embedder injection + three-state EmbedderIdentityMismatchError check (§1.5).
- maintenance_state() / run_maintenance() benchmark hooks (§7.3).
- AbstractBackendContractSuite full coverage (§7.1–7.2).
- mempalace migrate / mempalace verify CLI rewrites through BaseCollection (§8).

Tests: 970 passed (up from 967 on develop); new coverage for typed results,
empty-result outer-shape preservation, $regex rejection, registry lookup,
priority resolver, and PalaceRef-kwarg ChromaBackend.get_collection.

Refs: #743 (RFC 001), #989 (RFC 002 tracking issue).
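The typed-result migration in the commit above (QueryResult dataclass plus a transitional `_DictCompatMixin`) can be sketched like this. The class names come from the commit message; the field set and shim behavior are assumptions:

```python
from dataclasses import dataclass, field


class _DictCompatMixin:
    """Transitional shim: old callers keep using result["ids"] while new
    callers move to attribute access (result.ids)."""

    def __getitem__(self, key):
        try:
            return getattr(self, key)
        except AttributeError:
            raise KeyError(key)


@dataclass
class QueryResult(_DictCompatMixin):
    # Outer lists mirror Chroma's one-sublist-per-query shape.
    ids: list = field(default_factory=list)
    documents: list = field(default_factory=list)
    distances: list = field(default_factory=list)
```

Once every call site uses attributes, the mixin can be deleted without touching the dataclass, which is what makes the caller migration safe to land as follow-ups.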
jphein added a commit to jphein/mempalace that referenced this pull request Apr 19, 2026
Scanned all 233 open upstream PRs today against our open PRs and
fork-ahead / planned-work items. Findings merged into README:

- P2 (decay) and P3 Tier-0 (LLM rerank): both covered by MemPalace#1032
  (@zackchiutw, MERGEABLE, 2026-04-19 — Weibull decay + 4-stage
  rerank pipeline). Older simpler version at MemPalace#337. Dropped as
  fork work; watching MemPalace#1032.
- P7 (alternative storage): formally out of scope. RFC 001 MemPalace#743
  (@igorls) defines the plugin contract; four backend PRs already
  in flight (MemPalace#700, MemPalace#381 Qdrant; MemPalace#574, MemPalace#575 LanceDB). Fork consumes,
  does not rebuild.
- P0 (multi-label tags): still fork/upstream candidate. MemPalace#1033
  (@zackchiutw) ships adjacent privacy-tag + progressive disclosure
  but not the full multi-label scheme.
- Merged MemPalace#1023 section acknowledges complementary MemPalace#976 (felipetruman)
  which adds broader mine_global_lock() + HNSW num_threads pin.

Gives future-us a map so we don't re-file MemPalace#1036-style duplicates.


4 participants