V4.0 proposal - Ditch ChromaDB in favor of LanceDB, add sync capabilities #529

Closed
dekoza wants to merge 8 commits into MemPalace:main from dekoza:feat/lance-and-share

Conversation


dekoza commented Apr 10, 2026

What does this PR do?

(PI + Claude Opus 4.6)-assisted refactor of the codebase:

  1. Removes the inferior ChromaDB in favor of LanceDB
  2. Adds sync capabilities, so your laptop's palace can sync with your home palace
  3. Provides a migration path from ChromaDB to LanceDB
  4. Runs 1.8x faster than the original

How to test

Checklist

  • Tests pass (python -m pytest tests/ -v)
  • No hardcoded paths
  • Linter passes (ruff check .)

dekoza added 8 commits April 10, 2026 12:38

Database abstraction (Phase 1):
- New db.py with LanceCollection/ChromaCollection sharing identical API
- New embeddings.py with SentenceTransformerEmbedder (default)
- palace.py now delegates to db.open_collection() with auto-detection
- All consumers (searcher, layers, mcp_server, miner, palace_graph)
  updated to use palace.get_collection() instead of direct chromadb
- Added 'mempalace migrate' for ChromaDB -> LanceDB migration
- LanceDB is new default; ChromaDB moved to optional [chroma] extra
- Dependencies: lancedb>=0.14, sentence-transformers>=2.2.0

Pluggable vectorizers (Phase 2):
- OllamaEmbedder for GPU server usage via HTTP
- Model aliases: bge-small, bge-base, e5-base, nomic, ollama
- 'mempalace reindex' to re-embed with different model
- 'mempalace embedders' to list available models
- embedding_model tracked in every record's metadata
- Config via ~/.mempalace/config.json embedder/embedder_options

Tests: 552 passed, 0 failed (18 new embedding tests)

New sync_meta.py:
- NodeIdentity: persistent 12-char node_id (uuid4, written to ~/.mempalace/node_id)
- Atomic sequence counter with fcntl file locking (~/.mempalace/seq)
- inject_sync_meta() adds node_id, seq, updated_at to every write batch
- Thread-safe: 4 threads × 50 ops = 200 unique seqs (tested)

Integration into db.py:
- LanceCollection.upsert/add now call _inject_sync() before writing
- Every record stored carries node_id, seq, updated_at in metadata_json
- open_collection() accepts optional sync_identity for testing

Config:
- MempalaceConfig.node_id property for easy access

Tests: 568 passed (16 new sync_meta tests including thread-safety
and end-to-end LanceDB integration)

Sync engine (mempalace/sync.py):
- SyncEngine: get_changes_since(remote_vv), apply_changes(changeset)
- VersionVector: persistent JSON-backed node_id→seq mapping
- ChangeSet/SyncRecord: serialisable record batches
- Conflict resolution: last-writer-wins by updated_at, node_id tiebreak
- apply_changes uses _raw=True upsert to preserve original sync metadata

HTTP server (mempalace/sync_server.py):
- FastAPI app: GET /health, GET /sync/status, POST /sync/push, POST /sync/pull
- Lazy engine initialisation on first request
- 'mempalace serve --host 0.0.0.0 --port 7433'

HTTP client (mempalace/sync_client.py):
- SyncClient: is_reachable(), push(), pull(), sync(engine)
- Full bidirectional sync in one call: push local → pull remote
- 'mempalace sync --server URL [--auto --interval 300]'

CLI (mempalace/cli.py):
- 'mempalace serve' — start sync server
- 'mempalace sync --server URL' — one-shot sync
- 'mempalace sync --server URL --auto' — repeat every N seconds

Dependencies:
- fastapi + uvicorn + httpx as optional [server] extra in pyproject.toml

Tests: 588 passed (20 new sync tests):
- VersionVector CRUD + persistence
- ChangeSet serialisation round-trip
- SyncEngine: empty, new records, VV-filtered, conflict both directions
- Two-node bidirectional simulation (direct engine, no HTTP)
- Second sync is no-op (convergence)
- FastAPI TestClient: health, status, push+pull end-to-end

New UPGRADE.md covering:
- Breaking changes summary (ChromaDB→LanceDB)
- LanceDB backend: migration how-to, backend selection
- Pluggable embedders: model list, reindex workflow, Ollama config
- Multi-device sync: architecture, server setup, client usage,
  offline operation, conflict resolution
- New CLI commands reference
- New files inventory
- Full configuration reference (config.json, env vars, generated files)

PLAN.md updated with documentation requirement for all future phases.

knowledge_graph.py rewritten with dual backend:
- LanceDB (default): kg_entities + kg_triples tables in palace dir
- SQLite (legacy): preserved for existing .sqlite3 paths
- All operations (add_entity, add_triple, invalidate, query_entity,
  query_relationship, timeline, stats, seed_from_entity_facts) work
  on both backends with identical API

MCP server updated:
- KnowledgeGraph now uses palace_path instead of separate sqlite file
- One data directory, one format, one sync unit

UPGRADE.md updated with Phase 5 how-to and migration notes.
PLAN.md Phase 5 marked done.

Tests: 588 passed (28 KG tests all pass on LanceDB backend)

New benchmark runner (benchmarks/longmemeval_v4.py):
- Runs LongMemEval against multiple backends in one invocation
- Modes: chroma-default, lance-default, lance-bge-small, lance-bge-base,
  lance-nomic, and custom embedder
- Produces side-by-side comparison table with R@5, R@10, NDCG, ms/query
- Per-type breakdown across question categories
- Presets: 'all' (3 modes), 'quick' (2), 'embedders' (4)

Results on full 500 questions:
  ChromaDB + MiniLM (v3.x):  R@5=0.966  R@10=0.982  NDCG@5=0.888  1165ms/q
  LanceDB  + MiniLM (v4.0):  R@5=0.966  R@10=0.982  NDCG@5=0.888   638ms/q
  LanceDB  + BGE-small:      R@5=0.962  R@10=0.978  NDCG@5=0.895  2624ms/q

Key findings:
- Zero retrieval regression: LanceDB matches ChromaDB exactly
- 1.8x faster queries (638ms vs 1165ms) with cosine distance
- BGE-small trades tiny R@5 drop for better NDCG (ranking quality)

Docs:
- benchmarks/BENCHMARKS_V4.md — full results + reproduction steps
- UPGRADE.md updated with benchmark section + how-to
- PLAN.md Phase 6 marked done — all 6 phases complete

Tests: 588 passed

Added v4.0 Backend Comparison section at top of BENCHMARKS.md:
- LanceDB+MiniLM: R@5=0.966, identical to ChromaDB, 1.8x faster (638ms vs 1165ms)
- LanceDB+BGE-small: R@5=0.962, higher NDCG (0.895 vs 0.888)
- Per-type breakdown showing BGE-small tradeoffs
- Reproduction commands for longmemeval_v4.py

Updated existing sections:
- Score progression table: added LanceDB and BGE-small rows
- Comparison table: added v4 LanceDB entry alongside v3 ChromaDB
- Tradeoffs table: added v4 column (sync, pluggable embedders, query speed)
- Results files table: added results_v4_comparison.json

Tests: 588 passed

Ruff fixes:
- benchmarks/longmemeval_v4.py: removed unused var, replaced lambda
  assignments with dict dispatch + def
- tests/test_palace_graph.py: moved import to top of file

Test fix:
- test_conflict_last_writer_wins / test_conflict_local_wins_when_newer:
  used _raw=True to bypass sync injection and fixed timestamps to
  2020/2099 so conflict resolution is deterministic regardless of
  wall clock time

ruff check: 0 errors, pytest: 588 passed

Copilot AI review requested due to automatic review settings April 10, 2026 12:08
Contributor

Copilot AI left a comment


Pull request overview

This PR proposes MemPalace v4.0 by switching the default vector store from ChromaDB to LanceDB, introducing pluggable embedding backends, and adding multi-device sync (engine + server + client), along with updated docs, tests, and benchmark artifacts.

Changes:

  • Add a backend-agnostic DB abstraction (db.py) with LanceDB as default and ChromaDB as an optional legacy backend + migration command.
  • Introduce pluggable embedders (sentence-transformers, Ollama) and CLI support (reindex, embedders).
  • Add sync metadata + sync engine/server/client + CLI commands (serve, sync) plus expanded tests and benchmark docs/results.

Reviewed changes

Copilot reviewed 37 out of 40 changed files in this pull request and generated 7 comments.

Summary per file:

UPGRADE.md: v4.0 upgrade guide covering LanceDB, embedders, sync, KG changes, and CLI commands.
PLAN.md: Implementation plan/status tracker for phases 1–6.
pyproject.toml: Switch default deps to lancedb + sentence-transformers; add optional extras for chroma/server; add pre-commit to dev group.
mempalace/db.py: New DB abstraction + LanceDB implementation + backend detection + factory.
mempalace/embeddings.py: New pluggable embedder implementations and factory/cache.
mempalace/sync_meta.py: Node identity + sequence counter + metadata injection for sync.
mempalace/sync.py: Sync types, version vectors, sync engine, and conflict resolution.
mempalace/sync_server.py: FastAPI sync server exposing health/status/push/pull endpoints.
mempalace/sync_client.py: urllib-based sync client implementing push/pull flow.
mempalace/palace.py: Centralized backend-agnostic get_collection() entry point.
mempalace/searcher.py: Switch search to backend-agnostic collection access + updated error behavior.
mempalace/miner.py: Replace direct Chroma usage with backend-agnostic collection access.
mempalace/mcp_server.py: Remove Chroma client caching; use palace collection abstraction; KG defaults to palace-backed path.
mempalace/layers.py: Replace direct Chroma usage with palace collection abstraction and empty-palace checks.
mempalace/palace_graph.py: Replace direct Chroma usage with palace collection abstraction.
mempalace/knowledge_graph.py: Move KG default storage to LanceDB tables with SQLite fallback.
mempalace/cli.py: Add migrate/reindex/embedders/serve/sync commands; refactor repair/compress to use collection abstraction.
mempalace/__init__.py: Update dependency logger silencing (incl. sentence-transformers).
mempalace/normalize.py: Minor formatting change in size error message.
mempalace/split_mega_files.py: Minor formatting change in size skip message.
benchmarks/longmemeval_v4.py: New benchmark runner using db abstraction + embedders for v4 comparison.
benchmarks/results_v4_comparison.json: Committed benchmark results for v4 comparisons.
benchmarks/BENCHMARKS.md: Document v4 comparison results and update benchmark tables.
benchmarks/BENCHMARKS_V4.md: New focused v4 benchmark report and reproduction instructions.
tests/conftest.py: Update fixtures to use palace collection abstraction + upsert semantics + KG palace path.
tests/test_sync.py: New sync engine + server integration tests.
tests/test_sync_meta.py: New tests for node identity/seq locking and sync metadata injection.
tests/test_embeddings.py: New tests for embedder implementations and metadata tracking.
tests/test_searcher.py: Update searcher tests for new collection access + error semantics.
tests/test_palace_graph.py: Remove import-time chromadb patching; import palace_graph directly.
tests/test_miner.py: Use palace collection helper instead of Chroma client.
tests/test_convo_miner.py: Use palace collection helper instead of Chroma client.
tests/test_mcp_server.py: Update test helper to use palace collection helper.
tests/test_layers.py: Update layer tests to patch _get_palace_collection instead of Chroma client.
tests/test_knowledge_graph.py: Adjust WAL-mode test to explicitly use SQLite backend path.
tests/test_knowledge_graph_extra.py: Switch KG fixture to palace-backed path.
tests/benchmarks/test_layers_bench.py: Minor assert formatting change.
Comments suppressed due to low confidence (1)

mempalace/searcher.py:35

  • Empty palaces are treated as “No palace found”: col.count()==0 triggers the same error message/exception path as an invalid path. This is user-visible and misleading (an existing but empty palace isn’t missing). Consider differentiating these cases and raising a clearer SearchError/message such as “Palace is empty; run mempalace mine …”.
    try:
        col = get_collection(palace_path)
        # Verify palace has data
        if col.count() == 0:
            raise Exception("Empty palace")
    except Exception:
        print(f"\n  No palace found at {palace_path}")
        print("  Run: mempalace init <dir> then mempalace mine <dir>")
        raise SearchError(f"No palace found at {palace_path}")
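
A sketch of the differentiated handling the comment asks for (the messages are illustrative; SearchError and get_collection are the names from the snippet above):

    try:
        col = get_collection(palace_path)
    except Exception as exc:
        print(f"\n  No palace found at {palace_path}")
        print("  Run: mempalace init <dir> then mempalace mine <dir>")
        raise SearchError(f"No palace found at {palace_path}") from exc

    # An existing-but-empty palace is a different user error than a missing one.
    if col.count() == 0:
        print(f"\n  Palace at {palace_path} is empty")
        print("  Run: mempalace mine <dir> to populate it")
        raise SearchError(f"Palace is empty at {palace_path}; run mempalace mine")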



Comment thread mempalace/sync.py
Comment on lines +171 to +200
    def get_changes_since(self, remote_vv: dict[str, int]) -> ChangeSet:
        """Get all local records that the remote hasn't seen.

        Scans records written by THIS node whose seq > remote_vv[our_node_id].
        """
        our_node = self._identity.node_id
        remote_knows = remote_vv.get(our_node, 0)

        # Scan all records and filter by node_id + seq
        # (LanceDB doesn't have complex metadata queries inside metadata_json,
        # so we scan and filter in Python.)
        all_records = self._col.get(limit=100_000, include=["documents", "metadatas"])

        changeset = ChangeSet(source_node=our_node)

        for id_, doc, meta in zip(
            all_records["ids"], all_records["documents"], all_records["metadatas"]
        ):
            rec_node = meta.get("node_id", "")
            rec_seq = meta.get("seq", 0)
            if isinstance(rec_seq, str):
                rec_seq = int(rec_seq)

            if rec_node == our_node and rec_seq > remote_knows:
                changeset.records.append(
                    SyncRecord(
                        id=id_,
                        document=doc,
                        metadata=meta,
                    )

Copilot AI Apr 10, 2026

get_changes_since() only includes records written by our_node (it filters rec_node == our_node). On the server this prevents clients from pulling changes that originated from other clients (hub-and-spoke replication won’t converge). Consider selecting any record whose (node_id, seq) is newer than what remote_vv reports for that node_id (i.e., per-record comparison against remote_vv.get(rec_node, 0)), not just the local node’s records.
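
A sketch of the per-record comparison described above, reusing the names from the quoted hunk (only the filter inside the scan loop changes):

    # Ship any record the remote hasn't seen, regardless of which node wrote it;
    # remote_vv maps node_id -> highest seq the remote has applied from that node.
    if rec_seq > remote_vv.get(rec_node, 0):
        changeset.records.append(SyncRecord(id=id_, document=doc, metadata=meta))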

Comment thread mempalace/sync.py
Comment on lines +264 to +276
    def _remote_wins(self, remote_meta: dict, local_meta: dict) -> bool:
        """Return True if the remote record should overwrite the local one.

        Comparison: updated_at descending, then node_id descending as tiebreak.
        """
        r_time = remote_meta.get("updated_at", "")
        l_time = local_meta.get("updated_at", "")

        if r_time > l_time:
            return True
        if r_time < l_time:
            return False

Copilot AI Apr 10, 2026

Conflict resolution compares updated_at as plain strings. Lexicographic comparison can be wrong when timestamps use different timezone offsets or formats (e.g., Z vs +00:00). Parse updated_at into timezone-aware datetime (normalize to UTC) before comparing, and define a clear fallback for missing/invalid timestamps.
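
A sketch of that normalization using only the standard library (the helper name _parse_ts is illustrative; datetime.fromisoformat accepts "+00:00" offsets, and the replace() handles a trailing "Z"):

    from datetime import datetime, timezone

    def _parse_ts(value) -> datetime:
        """Parse an ISO-8601 timestamp into aware UTC; fall back to the epoch
        so records with missing/invalid timestamps always lose the conflict."""
        try:
            dt = datetime.fromisoformat(str(value).replace("Z", "+00:00"))
        except ValueError:
            return datetime.fromtimestamp(0, tz=timezone.utc)
        if dt.tzinfo is None:
            dt = dt.replace(tzinfo=timezone.utc)
        return dt.astimezone(timezone.utc)

    r_time = _parse_ts(remote_meta.get("updated_at", ""))
    l_time = _parse_ts(local_meta.get("updated_at", ""))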

Comment thread mempalace/sync.py
Comment on lines +222 to +258
    to_upsert_docs = []
    to_upsert_ids = []
    to_upsert_metas = []
    to_upsert_embs = []

    for rec in changeset.records:
        local_meta = existing_map.get(rec.id)

        if local_meta is None:
            # New record — accept
            to_upsert_docs.append(rec.document)
            to_upsert_ids.append(rec.id)
            to_upsert_metas.append(rec.metadata)
            to_upsert_embs.append(rec.embedding)
            result.accepted += 1
        else:
            # Conflict — last-writer-wins
            if self._remote_wins(rec.metadata, local_meta):
                to_upsert_docs.append(rec.document)
                to_upsert_ids.append(rec.id)
                to_upsert_metas.append(rec.metadata)
                to_upsert_embs.append(rec.embedding)
                result.accepted += 1
            else:
                result.rejected_conflicts += 1

    if to_upsert_ids:
        # If any records lack embeddings, let the collection re-embed
        has_embs = all(e is not None for e in to_upsert_embs)
        self._col.upsert(
            documents=to_upsert_docs,
            ids=to_upsert_ids,
            metadatas=to_upsert_metas,
            embeddings=to_upsert_embs if has_embs else None,
            _raw=True,  # preserve original sync metadata
        )

Copilot AI Apr 10, 2026

apply_changes() treats embeddings as all-or-nothing: if any incoming record has embedding=None, it passes embeddings=None to upsert, which triggers re-embedding for all records in the batch (including those that provided embeddings). This can waste compute and can overwrite remote-provided vectors. Consider splitting the upsert into two groups (with embeddings vs without) or computing embeddings only for the missing ones and passing a complete embeddings list.

Suggested change
    to_upsert_docs_with_embs = []
    to_upsert_ids_with_embs = []
    to_upsert_metas_with_embs = []
    to_upsert_embs = []
    to_upsert_docs_without_embs = []
    to_upsert_ids_without_embs = []
    to_upsert_metas_without_embs = []

    for rec in changeset.records:
        local_meta = existing_map.get(rec.id)

        if local_meta is None:
            # New record — accept
            if rec.embedding is None:
                to_upsert_docs_without_embs.append(rec.document)
                to_upsert_ids_without_embs.append(rec.id)
                to_upsert_metas_without_embs.append(rec.metadata)
            else:
                to_upsert_docs_with_embs.append(rec.document)
                to_upsert_ids_with_embs.append(rec.id)
                to_upsert_metas_with_embs.append(rec.metadata)
                to_upsert_embs.append(rec.embedding)
            result.accepted += 1
        else:
            # Conflict — last-writer-wins
            if self._remote_wins(rec.metadata, local_meta):
                if rec.embedding is None:
                    to_upsert_docs_without_embs.append(rec.document)
                    to_upsert_ids_without_embs.append(rec.id)
                    to_upsert_metas_without_embs.append(rec.metadata)
                else:
                    to_upsert_docs_with_embs.append(rec.document)
                    to_upsert_ids_with_embs.append(rec.id)
                    to_upsert_metas_with_embs.append(rec.metadata)
                    to_upsert_embs.append(rec.embedding)
                result.accepted += 1
            else:
                result.rejected_conflicts += 1

    if to_upsert_ids_with_embs:
        self._col.upsert(
            documents=to_upsert_docs_with_embs,
            ids=to_upsert_ids_with_embs,
            metadatas=to_upsert_metas_with_embs,
            embeddings=to_upsert_embs,
            _raw=True,  # preserve original sync metadata
        )
    if to_upsert_ids_without_embs:
        self._col.upsert(
            documents=to_upsert_docs_without_embs,
            ids=to_upsert_ids_without_embs,
            metadatas=to_upsert_metas_without_embs,
            embeddings=None,
            _raw=True,  # preserve original sync metadata
        )

Comment thread mempalace/sync_meta.py
"""

import os
import fcntl

Copilot AI Apr 10, 2026

sync_meta.py imports and uses fcntl.flock, which is unavailable on Windows. This will raise at import time and breaks the library/tests on that platform. Consider a cross-platform locking approach (e.g., conditional import with a Windows implementation such as msvcrt.locking, or a small dependency like filelock/portalocker), or fall back to an in-process lock when file locking isn’t supported.

Suggested change
    try:
        import fcntl
    except ImportError:
        try:
            import msvcrt

            class _CompatFcntl:
                LOCK_EX = 1
                LOCK_UN = 2

                @staticmethod
                def flock(fileobj, operation):
                    fd = fileobj.fileno()
                    fileobj.seek(0)
                    if operation == _CompatFcntl.LOCK_UN:
                        msvcrt.locking(fd, msvcrt.LK_UNLCK, 1)
                    else:
                        msvcrt.locking(fd, msvcrt.LK_LOCK, 1)

            fcntl = _CompatFcntl()
        except ImportError:

            class _CompatFcntl:
                LOCK_EX = 1
                LOCK_UN = 2

                @staticmethod
                def flock(fileobj, operation):
                    return None

            fcntl = _CompatFcntl()
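
A lighter alternative is the filelock package the comment mentions; a minimal sketch (the next_seq helper and the lock-file path are illustrative, not the PR's actual API):

    from pathlib import Path
    from filelock import FileLock  # cross-platform advisory file lock

    _SEQ_LOCK = FileLock(str(Path.home() / ".mempalace" / "seq.lock"))

    def next_seq(seq_path: Path) -> int:
        """Atomically read, increment, and persist the sequence counter."""
        with _SEQ_LOCK:
            current = int(seq_path.read_text() or "0") if seq_path.exists() else 0
            seq_path.write_text(str(current + 1))
            return current + 1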

Comment thread mempalace/palace.py
Comment on lines +39 to +52
    def get_collection(
        palace_path: str, collection_name: str = "mempalace_drawers", backend: str = None, embedder=None
    ):
        """Get or create the palace collection.

        This is the main entry point for all palace database access.
        Auto-detects the backend (LanceDB or ChromaDB) based on existing data.
        """
        return open_collection(
            palace_path=palace_path,
            collection_name=collection_name,
            backend=backend,
            embedder=embedder,
        )

Copilot AI Apr 10, 2026

The new MempalaceConfig.backend / MEMPALACE_BACKEND setting is not currently applied by the default get_collection() path (most callers pass backend=None, which triggers auto-detection in open_collection). This diverges from the upgrade docs that suggest MEMPALACE_BACKEND can force a backend. Consider defaulting backend to MempalaceConfig().backend when backend is None, or having open_collection() check MEMPALACE_BACKEND before auto-detecting.
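
A sketch of the resolution order the comment suggests (hedged: MempalaceConfig and MEMPALACE_BACKEND are the names the review refers to, and the exact lookup shown is illustrative):

    import os

    def get_collection(
        palace_path: str, collection_name: str = "mempalace_drawers", backend: str = None, embedder=None
    ):
        if backend is None:
            # Env var wins, then config.json; if both are unset, backend stays
            # None and open_collection() falls back to auto-detection.
            backend = os.environ.get("MEMPALACE_BACKEND") or MempalaceConfig().backend
        return open_collection(
            palace_path=palace_path,
            collection_name=collection_name,
            backend=backend,
            embedder=embedder,
        )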

Comment thread tests/test_embeddings.py
Comment on lines +51 to +57
    def test_st_embedder_embed():
        """Integration test — actually loads the default model."""
        e = SentenceTransformerEmbedder()
        result = e.embed(["hello world", "test sentence"])
        assert len(result) == 2
        assert len(result[0]) == 384  # MiniLM dimension
        assert all(isinstance(x, float) for x in result[0])

Copilot AI Apr 10, 2026

test_st_embedder_embed loads a real sentence-transformers model and runs embeddings. This makes the unit test suite dependent on large downloads and network/cache state, and can be slow/flaky in CI. Consider mocking sentence_transformers.SentenceTransformer (returning a small deterministic embedding) or marking this as an optional/slow integration test that’s skipped by default.

Comment thread tests/test_embeddings.py

    def test_ollama_embedder_connection_error():
        """OllamaEmbedder raises ConnectionError when server unreachable."""
        e = OllamaEmbedder(base_url="http://localhost:99999", timeout=1.0)

Copilot AI Apr 10, 2026

This test uses an invalid port (99999), which can raise a ValueError (“port out of range”) before any network call, and won’t be caught by the URLError handler that wraps it into ConnectionError. Use a valid but likely-closed port (e.g. 65534 or 9999) so the code path under test is exercised reliably.

Suggested change
    e = OllamaEmbedder(base_url="http://localhost:65534", timeout=1.0)

@web3guru888

This is a significant piece of work — a phased, well-structured rewrite with benchmarks included. The 6-phase commit history makes the review story clear, which I appreciate. Sharing observations from running a production MemPalace integration with heavy KG use:

What's good:

The abstraction layer in db.py is the right call architecturally. Wrapping both backends behind an identical Collection API means existing callers don't need to know which backend is active. The _chroma_where_to_sql() translator is pragmatic and covers the operators that actually matter in practice ($and/$or, comparison operators).

The embeddings protocol design in embeddings.py is clean. Using a Protocol with @runtime_checkable means third-party embedders work without subclassing, which is important for extensibility.

The sync architecture (node_id + monotonic seq + version vectors) is the standard approach for this kind of eventually-consistent replication and should work well for the hub-and-spoke case.

Real concerns:

  1. fcntl in sync_meta.py — Copilot already flagged this and it's real. Windows is a significant platform for MemPalace users (see #526 "Bug: MCP Tools diary_write and kg_add Return Internal Error -32000", #503 "Bug: MCP server fails with surrogate error on Windows when writing CJK content", and #512 "fix: comprehensive Windows and CJK support for MCP server"). The fcntl module doesn't exist on Windows, so the import itself fails and breaks the library/tests there. A cross-platform lock (e.g. the filelock library, or a threading.Lock + explicit O_EXCL retry loop) is needed before this can merge.

  2. get_changes_since() only returns records from our_node — The Copilot review caught this. In a hub-and-spoke setup where multiple clients sync through a server, the server holds records from many nodes. If node A pushes to the hub and node B then pulls from it, node B needs the hub's full changeset filtered by node B's version vector — not just records the hub itself wrote. The filter should be rec_seq > remote_vv.get(rec_node, 0) across all nodes, not just rec_node == our_node.

  3. Backward compatibility surface — Making LanceDB the default and ChromaDB optional [chroma] is a breaking change for all existing users. The UPGRADE.md covers the migration path, but the plugin marketplace version (which bmaltais and others use) won't auto-migrate. Users who installed via pip install mempalace without the [chroma] extra after this lands will have a broken install even if they have existing ChromaDB palaces. Worth thinking through whether to keep ChromaDB as the default until a migration wizard is in the CLI or whether to add an auto-detection fallback that runs migrate on first startup when it finds a chroma/ directory.

  4. Embedding dimension mismatch on re-open — If a user switches embedders (e.g. MiniLM → BGE-small), the existing LanceDB table schema has the old vector dimension baked in. The reindex command addresses this for deliberate switches, but there's no guard against accidentally opening with the wrong embedder. Adding a dimension check at open_collection() time and failing clearly (RuntimeError: expected 384d but collection has 768d) would prevent subtle corruption; see the sketch after this list.

  5. The port-99999 test issue — Copilot flagged this in test_embeddings.py; confirm the test uses a valid (but likely-closed) port.
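
For point 4, a minimal sketch of an open-time dimension guard (hedged: the "vector" column name, the embedder.dimension attribute, and where this hook lives are all assumptions about the PR's internals):

    def _check_dimensions(table, embedder) -> None:
        """Fail fast when the configured embedder doesn't match stored vectors."""
        # LanceDB stores vectors as a pyarrow fixed-size list column.
        stored_dim = table.schema.field("vector").type.list_size
        if stored_dim != embedder.dimension:
            raise RuntimeError(
                f"Embedder produces {embedder.dimension}d vectors but this "
                f"collection was built with {stored_dim}d vectors; "
                f"run 'mempalace reindex' to switch models."
            )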

Our integration perspective:

We run 710 KG entities and 1,014 triples through MemPalace. The KG-in-LanceDB approach (Phase 5) is interesting — our current setup uses the SQLite KG path with a custom bridge layer on top. The main question for our use case is whether LanceDB KG tables round-trip the entity_facts JSON reliably. The metadata_json blob approach in db.py should handle it if the serialization is consistent, but worth a round-trip test with a non-trivial entity fact dict.
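
A minimal sketch of such a round-trip test (hedged: the kg fixture and the add_entity/query_entity signatures are inferred from the commit messages above, not checked against the code):

    def test_entity_facts_round_trip(kg):
        facts = {
            "aliases": ["MemPalace", "mem-palace"],
            "scores": {"recall_at_5": 0.966, "ndcg_at_5": 0.888},
            "notes": "unicode: żółć; nested: [1, [2, {'k': None}]]",
        }
        kg.add_entity("project:mempalace", facts=facts)
        loaded = kg.query_entity("project:mempalace")
        # The metadata_json blob must survive LanceDB storage intact.
        assert loaded.facts == facts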

The benchmark numbers are honest and the methodology is sound. 1.8x query speedup with zero retrieval regression is a real result worth having.

The Windows portability issue is the blocker for us. Once that's resolved and the get_changes_since scope bug is fixed, this is a serious and mergeable proposal.



web3guru888 mentioned this pull request Apr 10, 2026
@nanoscopic

@dekoza Could you split this PR up into different pieces of functionality? As it stands right now it is mixing a bunch of feature additions together.

@web3guru888

@nanoscopic — the split request is reasonable for the features that stand alone, but the storage abstraction (#529 core) and the current-storage fixes (#520, #523, etc.) aren't necessarily in competition here.

@dekoza — this is worth thinking through. A few features in this PR could land independently: the db.py abstraction layer could ship even with only ChromaDB behind it, and the sync capabilities could be a follow-up. The LanceDB backend then becomes an opt-in addition rather than a required migration.

That said, sequencing matters: landing the ChromaDB chunking fix (#520) before refactoring storage isn't hackily adding code — it's keeping the current userbase unbroken while the abstraction layer matures. Both can be true.

dekoza (Author) commented Apr 10, 2026

@nanoscopic I'll split it into two separate PRs - one with the change from Chroma to Lance, and a second with the synchronization stack on top of the first, since sync requires Lance's features.

@web3guru888

@dekoza That split makes sense — Lance backend first, then sync on top. That ordering also reflects the dependency: sync needs Lance's features, so it can't reasonably be reviewed independently anyway. Looking forward to the first PR.

dekoza closed this Apr 10, 2026
dekoza (Author) commented Apr 10, 2026

Superseded by #574 and #575
