feat: add multilingual support via embedding-based semantic classification#488

Open
EndeavorYen wants to merge 1 commit into MemPalace:develop from EndeavorYen:feat/multilingual-support

Conversation

@EndeavorYen EndeavorYen commented Apr 10, 2026

Summary

Replaces per-language keyword/regex heuristics with embedding-based semantic classification, enabling MemPalace to work with 50+ languages using zero per-language configuration. One new optional dependency (sentence-transformers), fully backward-compatible.

What changed

Room classification → embedding-based

Replaced TOPIC_KEYWORDS string matching with cosine similarity against room description embeddings. Any language works — Chinese, French, Korean, etc. — with zero keyword lists.
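
The mechanism can be sketched in a few lines. This is illustrative only, not the PR's actual code: the toy vectors stand in for sentence-transformers `model.encode(...)` output, and the threshold value and `None` fallback are assumptions.

```python
import numpy as np

# Pick the room whose description embedding is most similar to the text
# embedding. With real embeddings, room_vecs would be computed once at
# startup from the room descriptions.
def classify_room(text_vec, room_vecs, room_names, threshold=0.3):
    t = text_vec / np.linalg.norm(text_vec)
    r = room_vecs / np.linalg.norm(room_vecs, axis=1, keepdims=True)
    sims = r @ t  # cosine similarity via unit-norm dot product
    best = int(np.argmax(sims))
    return room_names[best] if sims[best] >= threshold else None
```

Because similarity is computed in the embedding space rather than over surface strings, the same code handles text in any language the embedding model covers.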

Memory extraction → embedding-based

Replaced per-language regex markers with cosine similarity against memory type description embeddings (decision, preference, milestone, problem, emotional). Each paragraph classified independently.

Entity detection → Chinese name support

Added 百家姓 (Baijiaxing) surname tables for Chinese (simplified + traditional) + Chinese verb/dialogue patterns. Stays rule-based because NER is fundamentally pattern matching.
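
A hypothetical sketch of the surname-table idea (only a tiny subset of the 百家姓 table is shown, and the naive 2-3 character segmentation is an assumption; the PR ships full simplified + traditional tables plus verb/dialogue patterns):

```python
import re

# Tiny illustrative subset of the Baijiaxing surname table.
SURNAMES = {"王", "李", "张", "陈", "林", "黄", "趙", "錢", "孫"}
CJK_RUN = re.compile(r"[\u4e00-\u9fff]{2,3}")  # surname + 1-2 char given name

def find_chinese_names(text):
    # Greedy 2-3 char windows, kept only when the first character is a
    # known surname.
    return [m.group() for m in CJK_RUN.finditer(text)
            if m.group()[0] in SURNAMES]
```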

Spellcheck → CJK-safe

Auto-skips non-English text via Unicode detection. 14 regression tests verify CJK text is never corrupted.
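
The auto-skip idea reduces to a Unicode range check; a minimal sketch (function names are illustrative, and the exact ranges the PR checks may differ):

```python
def contains_cjk(text: str) -> bool:
    # Any codepoint in the CJK ideograph, kana, or Hangul ranges means the
    # text must not be fed to an English spellchecker.
    return any(
        0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
        or 0x3040 <= cp <= 0x30FF   # Hiragana + Katakana
        or 0xAC00 <= cp <= 0xD7AF   # Hangul syllables
        for cp in map(ord, text)
    )

def safe_spellcheck(text: str, correct) -> str:
    # Leave CJK text untouched; only pure non-CJK text gets corrected.
    return text if contains_cjk(text) else correct(text)
```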

Embedding provider → pluggable

Centralized get_embedding_function() in config.py — all ChromaDB consumers route through it. Supports the ChromaDB default embedder, the sentence-transformers multilingual model, and Ollama models via the ollama:&lt;model&gt; prefix.
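
A hedged sketch of what such a pluggable entrypoint looks like. The real config.py returns actual embedding-function objects; the tuples below are placeholders, and only the ollama:&lt;model&gt; prefix and env var name come from the PR text.

```python
import functools
import os

@functools.lru_cache(maxsize=None)  # load once, not per call
def get_embedding_function(model_spec=None):
    spec = model_spec or os.environ.get("MEMPALACE_EMBEDDING_MODEL", "default")
    if spec.startswith("ollama:"):
        # Would construct ChromaDB's built-in OllamaEmbeddingFunction here.
        return ("ollama", spec.split(":", 1)[1])
    if spec == "default":
        # English-only fallback: ChromaDB's default embedder.
        return ("builtin", None)
    # Otherwise treat the spec as a sentence-transformers model name.
    return ("sentence-transformers", spec)
```

The lru_cache is what makes the "loaded once" behavior possible: callers can hit this on every operation without reloading a model.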

Knowledge graph → auto-extraction

Added kg_extraction.py with hybrid NER+LLM triple extraction (spaCy + optional Claude Haiku + co-occurrence analysis). 7 new MCP tools for KG operations (query, add, invalidate, timeline, traverse, find_path, extract).

Dialect → CJK support

Added Chinese stop words and CJK bigram extraction for topic keywords.
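
A naive sketch of the bigram idea: CJK text has no word boundaries, so overlapping character bigrams approximate words. The stop-word set here is a tiny illustrative stub, not the PR's list.

```python
CJK_STOPWORDS = {"的", "了", "是", "在"}

def cjk_bigrams(text):
    # Keep only CJK ideographs that are not stop words, then pair
    # neighbors into overlapping bigrams.
    chars = [c for c in text
             if 0x4E00 <= ord(c) <= 0x9FFF and c not in CJK_STOPWORDS]
    return [a + b for a, b in zip(chars, chars[1:])]
```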

Benchmark results — 173 test cases, 8 languages

Five dimensions (Language Detection, Room Classification, Memory Extraction, Entity Detection, Search Quality) were evaluated across eight languages: zh-Hans, zh-Hant, en, fr, es, de, ja, ko.

Overall: 173/173 (100%), Grade A

Test coverage

652 tests passing, 0 failures. Core multilingual module coverage:

| Module | Coverage | What it tests |
| --- | --- | --- |
| language_detect.py | 100% | Unicode-based language detection |
| searcher.py | 100% | Memory search |
| entity_detector.py | 97% | English + Chinese entity detection |
| spellcheck.py | 97% | CJK-safe spell correction |
| hooks_cli.py | 98% | Session hooks + wake-up context |
| general_extractor.py | 89% | Embedding-based memory extraction |
| knowledge_graph.py | 93% | Temporal KG with SQLite |
| config.py | 82% | Config + Ollama + embedding caching |

Multilingual tests skip gracefully in CI when sentence-transformers is not installed — zero impact on existing CI pipeline.

Installation

```shell
# English-only (existing behavior, no change)
pip install mempalace

# Multilingual support (~120MB model, CPU-only); quoted so the brackets
# survive shells like zsh
pip install "mempalace[multilingual]"

# Ollama embedding (no extra dependency — uses ChromaDB's built-in OllamaEmbeddingFunction)
MEMPALACE_EMBEDDING_MODEL=ollama:qwen3-embedding-8b mempalace mine <dir>
```

Design decisions

  1. Embedding over regex: Cosine similarity against description embeddings generalizes to any language with zero per-language code
  2. Optional dependency: sentence-transformers>=2.0 only needed for multilingual. English regex fallback preserved — existing behavior unchanged
  3. Rule-based NER: Entity detection stays regex/pattern-based — NER is fundamentally pattern matching, not classification
  4. Ollama pluggable: get_embedding_function() supports the ollama:<model> prefix, so Qwen3-8B (#273: Domain-scoped collections + local embedding model = better retrieval at scale) is a config change, not a code change
  5. Zero external API calls: Everything runs locally, consistent with MemPalace's design philosophy

Coordination with #273

Per maintainer request: the embedding function is pluggable enough that Qwen3-Embedding-8B via Ollama drops in as a config change:

{"embedding_model": "ollama:qwen3-embedding-8b", "embedding_endpoint": "http://localhost:11434"}

Or via env var: MEMPALACE_EMBEDDING_MODEL=ollama:qwen3-embedding-8b

Test plan

  • 173 multilingual benchmark tests (8 languages, 100%)
  • 652 unit tests passing, 0 failures
  • 14 CJK spellcheck corruption regression tests
  • Ollama config parsing + caching tests
  • Lint clean (ruff check + ruff format)
  • CI-compatible (multilingual tests skip without sentence-transformers)
  • Reviewer: verify pip install mempalace[multilingual] works from clean env
  • Reviewer: test with non-Latin content in a real palace



@web3guru888 web3guru888 left a comment

Really solid PR — this is a meaningful upgrade from the old regex-heuristic approach, and the design choices are mostly right. A few things worth discussing:

KG extraction (kg_extraction.py)

The hybrid NER + co-occurrence fallback is the right architecture — pure spaCy NER misses a lot of implicit relationships (especially in scientific/technical domains), so having the co-occurrence layer as a baseline is good. The Claude Haiku path is nice when available.

My concern: the PR says Claude Haiku is "optional", but what does the quality curve actually look like in a fully local setup? In my experience running triple extraction without an LLM, spaCy alone tends to do fine on explicit subject-verb-object structures but drops off sharply on:

  • Nominalized predicates (e.g. "the influence of X on Y")
  • Multi-hop implicit relationships across sentences
  • Domain-specific compound entities

Has the 173 multilingual benchmark been run with only the spaCy+co-occurrence path (no Haiku)? Would be useful to see precision/recall split by extraction path so users know what they're signing up for in the purely-local case. Not a blocker, but worth documenting.

Global embedding cache in multi-process MCP scenarios

get_embedding_function() cached at module level is fine for single-process use, but MCP servers can be forked (e.g. gunicorn-style or when the host spawns workers). After fork, the child inherits the cache reference but the underlying model state (CUDA contexts, tokenizer threads) may not be fork-safe depending on the backend.

Concrete case: if someone runs this with the ollama backend and the HTTP session gets copied into forked children, you can get connection errors that are hard to diagnose. Worth either:

  1. Documenting the "single-process only" assumption explicitly, or
  2. Keying the cache on os.getpid() so each child re-initializes cleanly
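
A minimal sketch of option 2, assuming a factory-style initializer (all names here are illustrative, not the PR's API):

```python
import os

# Key the module-level cache by pid so a forked child rebuilds its own
# embedding function instead of inheriting possibly fork-unsafe state
# (CUDA contexts, HTTP sessions, tokenizer threads).
_EMBEDDER_CACHE = {}

def get_cached_embedder(model_spec, factory):
    key = (os.getpid(), model_spec)
    if key not in _EMBEDDER_CACHE:
        _EMBEDDER_CACHE[key] = factory(model_spec)  # fresh init per process
    return _EMBEDDER_CACHE[key]
```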

Optional dependency design

The graceful CI skip is exactly right — I've seen too many multilingual PRs that just unconditionally install sentence-transformers and break CI for users without GPU access. The way this is structured keeps the core install lightweight. Good call.

Coverage gap in config.py (82%)

Config is the highest-traffic surface in any integration — mishandled config is almost always the root cause when users report weird behavior. What's the missing 18%? Guessing it might be the Ollama model-name parsing path or the fallback behavior when sentence-transformers isn't installed. Those paths tend to only get exercised in production, which is exactly when you want them covered.


Overall this is a significant quality addition and the backward-compat story is clean. The above are discussion points, not blockers — marking COMMENT rather than requesting changes. Nice work on the benchmark suite especially.

@web3guru888 web3guru888 left a comment

This is a substantial and well-thought-out PR — excited to see it finally land. The architectural direction here is exactly right, and the scope of the changes reflects genuine engineering effort.

What's excellent:

The shift from TOPIC_KEYWORDS string matching to embedding-based cosine similarity for room classification is the correct long-term solution. String heuristics break the moment you introduce a second language — or even domain-specific vocabulary in your primary language. We've run into this ourselves working with scientific content (astrophysics, epidemiology, etc.) where existing keyword lists were completely useless for domain classification. Embedding similarity against room description embeddings is semantically principled in a way that keyword lists never can be.

The abstraction in config.py is the piece I'm most excited about architecturally. Previously every ChromaDB consumer was implicitly hardcoded to whatever the default was — this brings all of that under a single configurable, cacheable entrypoint. The global model cache (loaded once, not per-call) is also essential; loading sentence-transformers on every call would add 2-3s of latency per operation which would break interactive workflows. These two things together are a real improvement to the codebase independent of the multilingual story.

The CJK-safe spellcheck auto-skip via Unicode range detection is a nicely practical fix. Corrupted CJK from English spellcheckers was a real issue — glad it's solved simply.

One concern I think needs a warning before merge:

If a user deploys with the default embedding model, builds up their memory palace, and then switches models via env var, the existing ChromaDB vectors are in an incompatible embedding space. Querying with new-model embeddings against old-model vectors gives semantically meaningless cosine scores — you get garbage retrieval without any error or warning. This is a well-known footgun in vector DB systems.

I don't think the PR needs to solve full migration, but it should at minimum:

  • Persist the embedding model name in a config file or ChromaDB metadata on first use
  • On startup, detect if the configured model differs from what existing vectors were built with
  • If mismatch, emit a clear warning (or fail loudly, with instructions to re-embed or similar)

Without this, users who experiment with different embedding models will silently corrupt their retrieval quality and have no idea why.
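
The suggested guard can be sketched as follows; the sidecar file name, JSON key, and return-value convention are assumptions for illustration, not proposed API:

```python
import json
import pathlib
import warnings

def check_embedding_model(store_dir: str, configured_model: str) -> bool:
    # Persist the model name next to the vector store on first use, then
    # compare against it on every subsequent startup.
    meta = pathlib.Path(store_dir) / "embedding_model.json"
    if not meta.exists():
        meta.write_text(json.dumps({"model": configured_model}))
        return True
    recorded = json.loads(meta.read_text())["model"]
    if recorded != configured_model:
        warnings.warn(
            f"Vectors were built with {recorded!r} but config now says "
            f"{configured_model!r}; retrieval scores will be meaningless "
            "until the palace is re-embedded."
        )
        return False
    return True
```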

On the 7 new KG MCP tools:

Can you clarify how these relate to the existing memory tools? If these are additive (covering operations that didn't exist before), that's great. If any of them overlap with existing tools with slightly different names or signatures, it will create confusion for MCP clients that enumerate available tools — they'll see two tools that look similar and won't know which to use. A short note in the PR description on what's net-new vs what (if anything) supersedes existing tools would help reviewers and future readers.

On dependency complexity:

The optional extras approach (mempalace[multilingual]) …
Downloading asgiref-3.11.1-py3-none-any.whl (24 kB)
Collecting backoff>=1.10.0
Downloading backoff-2.2.1-py3-none-any.whl (15 kB)
Collecting distro>=1.5.0
Downloading distro-1.9.0-py3-none-any.whl (20 kB)
Collecting pydantic-core==2.41.5
Downloading pydantic_core-2.41.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.1 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.1/2.1 MB 90.4 MB/s eta 0:00:00
Collecting annotated-types>=0.6.0
Downloading annotated_types-0.7.0-py3-none-any.whl (13 kB)
Requirement already satisfied: markdown-it-py>=2.2.0 in /usr/local/lib/python3.10/dist-packages (from rich>=10.11.0->chromadb<0.7,>=0.5.0->mempalace[multilingual]) (4.0.0)
Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /usr/local/lib/python3.10/dist-packages (from rich>=10.11.0->chromadb<0.7,>=0.5.0->mempalace[multilingual]) (2.19.2)
Requirement already satisfied: huggingface-hub<2.0,>=0.16.4 in /usr/local/lib/python3.10/dist-packages (from tokenizers>=0.13.2->chromadb<0.7,>=0.5.0->mempalace[multilingual]) (1.10.1)
Requirement already satisfied: shellingham>=1.3.0 in /usr/local/lib/python3.10/dist-packages (from typer>=0.9.0->chromadb<0.7,>=0.5.0->mempalace[multilingual]) (1.5.4)
Requirement already satisfied: click>=8.2.1 in /usr/local/lib/python3.10/dist-packages (from typer>=0.9.0->chromadb<0.7,>=0.5.0->mempalace[multilingual]) (8.3.1)
Collecting watchfiles>=0.20
Downloading watchfiles-1.1.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (455 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 455.6/455.6 KB 78.0 MB/s eta 0:00:00
Collecting httptools>=0.6.3
Downloading httptools-0.7.1-cp310-cp310-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl (440 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 440.9/440.9 KB 83.6 MB/s eta 0:00:00
Collecting uvloop>=0.15.1
Downloading uvloop-0.22.1-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl (3.7 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.7/3.7 MB 33.5 MB/s eta 0:00:00
Collecting websockets>=10.4
Downloading websockets-16.0-cp310-cp310-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl (183 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 183.8/183.8 KB 60.7 MB/s eta 0:00:00
Collecting python-dotenv>=0.13
Downloading python_dotenv-1.2.2-py3-none-any.whl (22 kB)
Requirement already satisfied: hf-xet<2.0.0,>=1.4.3 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub<2.0,>=0.16.4->tokenizers>=0.13.2->chromadb<0.7,>=0.5.0->mempalace[multilingual]) (1.4.3)
Requirement already satisfied: fsspec>=2023.5.0 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub<2.0,>=0.16.4->tokenizers>=0.13.2->chromadb<0.7,>=0.5.0->mempalace[multilingual]) (2026.3.0)
Requirement already satisfied: filelock>=3.10.0 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub<2.0,>=0.16.4->tokenizers>=0.13.2->chromadb<0.7,>=0.5.0->mempalace[multilingual]) (3.25.2)
Collecting zipp>=3.20
Downloading zipp-3.23.0-py3-none-any.whl (10 kB)
Requirement already satisfied: mdurl~=0.1 in /usr/local/lib/python3.10/dist-packages (from markdown-it-py>=2.2.0->rich>=10.11.0->chromadb<0.7,>=0.5.0->mempalace[multilingual]) (0.1.2)
Requirement already satisfied: charset_normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->kubernetes>=28.1.0->chromadb<0.7,>=0.5.0->mempalace[multilingual]) (3.4.6)
Requirement already satisfied: exceptiongroup>=1.0.2 in /usr/local/lib/python3.10/dist-packages (from anyio->httpx>=0.27.0->chromadb<0.7,>=0.5.0->mempalace[multilingual]) (1.3.1)
Collecting humanfriendly>=9.1
Downloading humanfriendly-10.0-py2.py3-none-any.whl (86 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 86.8/86.8 KB 37.7 MB/s eta 0:00:00
Collecting oauthlib>=3.0.0
Downloading oauthlib-3.3.1-py3-none-any.whl (160 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 160.1/160.1 KB 56.5 MB/s eta 0:00:00
Requirement already satisfied: mpmath<1.4,>=1.1.0 in /usr/local/lib/python3.10/dist-packages (from sympy->onnxruntime>=1.14.1->chromadb<0.7,>=0.5.0->mempalace[multilingual]) (1.3.0)
Installing collected packages: flatbuffers, durationpy, zipp, wrapt, websockets, websocket-client, uvloop, uvicorn, typing-inspection, tenacity, python-dotenv, pyproject_hooks, pypika, pydantic-core, protobuf, overrides, orjson, opentelemetry-util-http, oauthlib, mmh3, importlib-resources, humanfriendly, httptools, grpcio, distro, chroma-hnswlib, bcrypt, backoff, asgiref, annotated-types, requests-oauthlib, pydantic, posthog, opentelemetry-proto, importlib-metadata, googleapis-common-protos, coloredlogs, build, watchfiles, starlette, opentelemetry-exporter-otlp-proto-common, opentelemetry-api, onnxruntime, kubernetes, opentelemetry-semantic-conventions, fastapi, opentelemetry-sdk, opentelemetry-instrumentation, opentelemetry-instrumentation-asgi, opentelemetry-exporter-otlp-proto-grpc, opentelemetry-instrumentation-fastapi, chromadb, mempalace
Successfully installed annotated-types-0.7.0 asgiref-3.11.1 backoff-2.2.1 bcrypt-5.0.0 build-1.4.2 chroma-hnswlib-0.7.6 chromadb-0.6.3 coloredlogs-15.0.1 distro-1.9.0 durationpy-0.10 fastapi-0.135.3 flatbuffers-25.12.19 googleapis-common-protos-1.74.0 grpcio-1.80.0 httptools-0.7.1 humanfriendly-10.0 importlib-metadata-8.7.1 importlib-resources-6.5.2 kubernetes-35.0.0 mempalace-3.1.0 mmh3-5.2.1 oauthlib-3.3.1 onnxruntime-1.23.2 opentelemetry-api-1.41.0 opentelemetry-exporter-otlp-proto-common-1.41.0 opentelemetry-exporter-otlp-proto-grpc-1.41.0 opentelemetry-instrumentation-0.62b0 opentelemetry-instrumentation-asgi-0.62b0 opentelemetry-instrumentation-fastapi-0.62b0 opentelemetry-proto-1.41.0 opentelemetry-sdk-1.41.0 opentelemetry-semantic-conventions-0.62b0 opentelemetry-util-http-0.62b0 orjson-3.11.8 overrides-7.7.0 posthog-7.10.3 protobuf-6.33.6 pydantic-2.12.5 pydantic-core-2.41.5 pypika-0.51.1 pyproject_hooks-1.2.0 python-dotenv-1.2.2 requests-oauthlib-2.0.0 starlette-1.0.0 tenacity-9.1.4 typing-inspection-0.4.2 uvicorn-0.44.0 uvloop-0.22.1 watchfiles-1.1.1 websocket-client-1.9.0 websockets-16.0 wrapt-2.1.2 zipp-3.23.0


@web3guru888 web3guru888 left a comment


This is a substantial and well-thought-out PR — excited to see it finally land. The architectural direction here is exactly right, and the scope of the changes reflects genuine engineering effort.

What's excellent:

The shift from TOPIC_KEYWORDS string matching to embedding-based cosine similarity for room classification is the correct long-term solution. String heuristics break the moment you introduce a second language — or even domain-specific vocabulary in your primary language. We've run into this ourselves working with scientific content (astrophysics, epidemiology, etc.) where existing keyword lists were completely useless for domain classification. Embedding similarity against room description embeddings is semantically principled in a way that keyword lists never can be.

The get_embedding_function() abstraction in config.py is the piece I'm most excited about architecturally. Previously every ChromaDB consumer was implicitly hardcoded to whatever the default was — this brings all of that under a single configurable, cacheable entrypoint. The global model cache (loaded once, not per-call) is also essential; loading sentence-transformers on every call would add 2-3s of latency per operation which would break interactive workflows. These two things together are a real improvement to the codebase independent of the multilingual story.

The CJK-safe spellcheck auto-skip via Unicode range detection is a nicely practical fix. Corrupted CJK from English spellcheckers was a real issue — glad it's solved simply.
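The Unicode-range gate can be sketched in a few lines — this is an illustrative version, not MemPalace's actual implementation, and the exact ranges it checks may differ:

```python
import re

# Hypothetical sketch: skip spellcheck whenever text contains CJK characters.
# Ranges cover CJK Unified Ideographs (+ Extension A), kana, and Hangul.
_CJK_RE = re.compile(
    r"[\u4e00-\u9fff"   # CJK Unified Ideographs
    r"\u3400-\u4dbf"    # CJK Extension A
    r"\u3040-\u30ff"    # Hiragana + Katakana
    r"\uac00-\ud7af]"   # Hangul syllables
)

def should_spellcheck(text: str) -> bool:
    """Only run the English spellchecker on text with no CJK characters."""
    return _CJK_RE.search(text) is None
```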

One concern I think needs a warning before merge:

If a user deploys with the default embedding model, builds up their memory palace, and then switches to a different model via env var or config, the existing ChromaDB vectors are in an incompatible embedding space. Querying with new-model embeddings against old-model vectors gives semantically meaningless cosine scores — you get garbage retrieval without any error or warning. This is a well-known footgun in vector DB systems.

I don't think the PR needs to solve full migration, but it should at minimum:

  • Persist the embedding model name in a config file or ChromaDB metadata on first use
  • On startup, detect if the configured model differs from what existing vectors were built with
  • If mismatch, emit a clear warning (or fail loudly with instructions to run a rebuild command)

Without this, users who experiment with different embedding models will silently corrupt their retrieval quality and have no idea why.
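A sketch of the minimum check being suggested — the function names and metadata key here are illustrative, not MemPalace's actual API:

```python
import logging

log = logging.getLogger("mempalace")

def stamp_model(metadata: dict, model_name: str) -> dict:
    """On first use, record which embedding model built this collection."""
    metadata.setdefault("embedding_model", model_name)
    return metadata

def check_model(metadata: dict, configured_model: str) -> bool:
    """On open, compare the stored model against the configured one."""
    stored = metadata.get("embedding_model")
    if stored is None or stored == configured_model:
        return True
    log.warning(
        "Embedding model mismatch: collection was built with %s but %s is "
        "configured. Retrieval quality will degrade; run a rebuild/re-mine.",
        stored, configured_model,
    )
    return False
```

In practice the metadata dict would live in ChromaDB collection metadata, so the check costs one read on startup.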

On the 7 new KG MCP tools:

Can you clarify how these relate to the existing mempalace_kg_query, mempalace_kg_add, etc. tools? If these are additive (covering operations that didn't exist before), that's great. If any of them overlap with existing tools under slightly different names or signatures, it will create confusion for MCP clients that enumerate available tools — they'll see two tools that look similar and won't know which to use. A short note in the PR description on what's net-new vs what (if anything) supersedes existing tools would help reviewers and future readers.

On dependency complexity:

The optional extras approach (pip install mempalace[multilingual]) handles the sentence-transformers + spaCy + optional LLM API key chain well — users who don't need multilingual don't pull in the weight. One thing worth documenting clearly: what's the behavior when the multilingual extra is not installed but multilingual content is encountered? The fallback to English regex is mentioned, but making that degradation path explicit in the README would help users understand what they're getting without the optional install.

173/173 on the benchmark suite is strong. Would be curious how those benchmarks handle edge cases like code-switching (mixed language within a single entry) or very short entries where there's not enough signal for the embedding to land confidently — but that can be a follow-on issue rather than a blocker here.

Overall: this is a meaningful contribution. The embedding model migration concern is the one I'd want resolved before merging — everything else is either already well handled or addressable in follow-on issues. Nice work @EndeavorYen.

@EndeavorYen
Author

Thanks for the thorough review, really helpful feedback. Addressing each point:

1. Embedding model mismatch detection

This is already implemented; I should have called it out in the PR description, that's on me. check_embedding_model_mismatch() in config.py (L275-294):

  • On collection create: persists model name in ChromaDB metadata
  • On collection open: compares stored model vs current config
  • On mismatch: logs warning with specific model names and re-mine instructions

Called in both palace.py:get_collection() and mcp_server.py:_get_collection().

Open question: currently it's a log warning. Would a louder signal (stderr or hard failure) be more appropriate?

2. KG MCP tools

All 7 are net-new. Upstream had no KG-related MCP tools:

  • mempalace_kg_query: query entity facts
  • mempalace_kg_add: add triples manually
  • mempalace_kg_invalidate: mark facts as no longer valid
  • mempalace_kg_timeline: chronological fact timeline
  • mempalace_kg_traverse: multi-hop graph traversal
  • mempalace_kg_find_path: shortest path between entities
  • mempalace_kg_extract: auto-extract triples from text

No overlap with existing tool names or signatures.

To be transparent: the KG extraction is a complementary addition that got bundled into this PR rather than being strictly within the multilingual scope.

Will keep PR scope tighter going forward.

3. Fallback behavior documentation

Good call. The degradation path:

| Component | With `[multilingual]` | Without (fallback) |
| --- | --- | --- |
| Room classification | Embedding similarity (any language) | English keyword matching |
| Memory extraction | Embedding classification (any language) | English regex markers |
| Entity detection | English + Chinese patterns | English patterns only |
| Spellcheck | CJK auto-skip + English correction | English correction only |

Will add to README. Happy to do it in this PR or as a follow-up.

4. KG extraction quality without Haiku

Honest answer: the 173 benchmark covers room classification, memory extraction, entity detection, and search, but not KG triple extraction.

In the purely-local path (spaCy + co-occurrence, no LLM):

  • Explicit SVO structures: works well via dependency parsing
  • Co-occurrence: catches entities in the same paragraph
  • Nominalized predicates and implicit multi-hop: drops off significantly

A KG-specific benchmark with precision/recall by extraction path would be valuable.

Will open as a follow-on issue.

5. Multi-process fork safety

Valid concern. I'll add os.getpid() to the cache key in a follow-up push. Negligible cost and eliminates the issue for forked workers.

6. Config coverage (82%)

Mostly Ollama error paths and OS-specific permission branches.

Will add in a follow-up.

@web3guru888

@EndeavorYen — thanks for the detailed follow-up. This is much clearer now.

Embedding model mismatch detection — Good, glad check_embedding_model_mismatch() is already there and wired into both palace.py and mcp_server.py. On the question of warning vs hard failure: I'd vote for a loud stderr warning (not a hard failure) on first mismatch, with a clear re-mine instruction. Hard failure is too aggressive for workflows that do a rolling model upgrade and want to query while re-mining completes. But a log-only warning is too quiet — users will miss it and wonder why results regressed.

KG MCP tools — Appreciate the transparency on scope. The 7 tools being net-new with no naming conflicts is the key thing. mempalace_kg_traverse and mempalace_kg_find_path are the ones we'd use immediately — multi-hop traversal is where KG earns its overhead. The scope expansion into KG is substantively useful even if it's not strictly multilingual.

Fallback degradation table — This is exactly what reviewers need. Adding it to the README (or even the PR description) would be a good addition, either now or as a follow-up.

KG extraction quality — The honest answer is the right answer. Co-occurrence works for co-located entities but drops off for implicit multi-hop relationships, and that's the hard case. A KG-specific benchmark issue makes sense — if you open it, we can contribute some test triples from our astrophysics and cryptography wings where the relationship structure is well-defined.

Fork safety (os.getpid() cache key) — Good, low-cost fix, worth landing before merge rather than as a follow-up since forked workers are a real deployment pattern with gunicorn and uvicorn.

Overall this is in good shape. The fork safety change is the one I'd land before merge; the rest can be follow-ups.

@EndeavorYen EndeavorYen force-pushed the feat/multilingual-support branch from 8218d2b to 0e6340a on April 10, 2026 05:35
feat: add multilingual support via embedding-based semantic classification

Replace per-language keyword/regex heuristics with embedding-based semantic
classification, enabling MemPalace to work with 50+ languages using zero
per-language configuration.

Changes:
- Room classification: cosine similarity against room description embeddings
- Memory extraction: embedding-based classification (5 types, any language)
- Entity detection: add Chinese name patterns (百家姓 surnames)
- Spellcheck: auto-skip CJK text via Unicode detection
- Embedding provider: pluggable via get_embedding_function() with caching
  - Default: paraphrase-multilingual-MiniLM-L12-v2 (sentence-transformers)
  - Ollama: "ollama:<model>" prefix (e.g., ollama:qwen3-embedding-8b)
  - Configurable via MEMPALACE_EMBEDDING_MODEL env var or config.json
- Knowledge graph: temporal triples, multi-hop traversal, auto-extraction
- Dialect: CJK bigram extraction for topic keywords
- All ChromaDB consumers route through centralized embedding function

New optional dependency: sentence-transformers>=2.0
Install: pip install mempalace[multilingual]
Without it: English regex fallback (existing behavior unchanged)

Benchmark: 173/173 (100%) across 8 languages
(zh-Hans, zh-Hant, en, fr, es, de, ja, ko)

652 tests passing, 0 failures. CI-compatible (multilingual tests
skip gracefully when sentence-transformers is not installed).

Closes MemPalace#231. Related: MemPalace#37, MemPalace#50, MemPalace#92, MemPalace#117, MemPalace#156, MemPalace#273.
@EndeavorYen EndeavorYen force-pushed the feat/multilingual-support branch from 0e6340a to 2f90412 on April 10, 2026 05:40
@EndeavorYen
Author

Thanks for the follow-up. Pushed a few updates based on your feedback:

1. Fork safety (os.getpid()): Added to the embedding cache key.
Forked workers now re-initialize cleanly. (config.py L275)

2. Scope cleanup: Found and removed 3 session checkpoint/restore/list MCP tools that were fork-specific and accidentally included. They would have bypassed upstream's diary_write path, writing truncated content directly to ChromaDB (only the first 200 chars of progress, dropping decisions and memory triggers entirely).

The upstream save workflow (hook_stop/hook_precompact triggering AI-driven diary_write with full content) is now preserved as intended.

Added 2 storage integrity regression tests to guard against this class of bug:

  • diary and drawer write/read roundtrips now verify full content is preserved without truncation. (test_mcp_server.py::TestStorageIntegrity)
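For illustration, the shape of such a roundtrip regression test might look like this — the in-memory store here is a stand-in, not MemPalace's ChromaDB-backed storage, and the real test lives in test_mcp_server.py::TestStorageIntegrity:

```python
# Hypothetical sketch of a storage-integrity roundtrip test.
def make_store():
    """Minimal write/read pair standing in for diary/drawer storage."""
    db = {}
    def write(text: str) -> int:
        key = len(db)
        db[key] = text
        return key
    def read(key: int) -> str:
        return db[key]
    return write, read

def test_no_truncation():
    write, read = make_store()
    # Well past 200 chars, the truncation length of the removed tools.
    long_entry = "decision: switch to e5-base. " * 100
    stored = read(write(long_entry))
    assert stored == long_entry   # full content preserved, no 200-char cut
    assert len(stored) > 200
```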

3. Mismatch warning: Noted on stderr vs hard failure. Will make it louder (stderr, non-blocking) in a follow-up, keeping the rolling upgrade behavior you described.

The diff is now strictly multilingual + KG + embedding infrastructure.
654 tests passing, lint clean.

4. Fallback behavior table:

Fallback behavior (without [multilingual] extra)

| Component | With `[multilingual]` | Without (English fallback) |
| --- | --- | --- |
| Room classification | Embedding similarity (any language) | English keyword matching |
| Memory extraction | Embedding classification (any language) | English regex markers |
| Entity detection | English + Chinese patterns (百家姓) | English patterns only |
| Spellcheck | CJK auto-skip + English correction | English correction only |
| Embedding model | paraphrase-multilingual-MiniLM-L12-v2 | ChromaDB default (all-MiniLM-L6-v2) |

All existing English-only behavior is unchanged. The [multilingual] extra adds capabilities without modifying the default path.
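The gate for that degradation is the standard optional-extra pattern — a hedged sketch (the classifier callables here are injected stand-ins, not the real functions):

```python
# Illustrative optional-dependency gate: embedding classification when the
# [multilingual] extra is installed, English-only heuristics otherwise.
try:
    import sentence_transformers  # provided by: pip install mempalace[multilingual]
    HAS_MULTILINGUAL = True
except ImportError:
    HAS_MULTILINGUAL = False

def classify_room(text, rooms, by_embedding, by_keywords):
    """Route to embedding classification when available, else keyword fallback."""
    if HAS_MULTILINGUAL:
        return by_embedding(text, rooms)   # any language
    return by_keywords(text, rooms)        # English-only fallback
```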

Will open the KG extraction benchmark issue once this lands.
Happy to receive test triples from your astrophysics/cryptography wings for that.

@web3guru888

@EndeavorYen — the scope cleanup in this latest push is significant. Removing those 3 session checkpoint/restore/list MCP tools was the right call — they were silently truncating diary content to 200 chars, which would have been a hard-to-debug data loss bug in production (sessions look mined, retrieval returns partial content). The storage integrity regression tests catching that class of issue are worth keeping regardless of what else changes.

The fallback behavior table is exactly what the PR description needed. The per-component breakdown makes the no-[multilingual]-extra path completely clear.

On the KG test triples offer — yes, happy to provide some from our astrophysics and cryptography wings once this lands. A few representative examples of what we're extracting:

  • Astrophysics: (Gravitational wave event GW230422, detected_by, LIGO-Livingston), (GW230422, classification, binary neutron star merger)
  • Cryptography: (Lattice-based scheme XYZ, resists, Grover's algorithm), (post-quantum migration, blocked_by, key size constraints)

The interesting edge cases for your NER benchmark: multi-word proper nouns (observatory names, algorithm names with version numbers), entities that appear as both subject and object in different triples, and predicates that are semantically synonymous but lexically different ("detected_by" vs "observed_by"). Happy to structure a test set around those once you open the benchmark issue.

654 tests + the scope reduction to strictly multilingual + KG + embedding infra means this is in a good state to merge. LGTM from our side.

@web3guru888

The two new traversal tools are the ones I'm most excited about in this batch — wanted to share some notes from our own graph traversal work.

On mempalace_kg_traverse: The hop_limit parameter is critical — please make sure it's not optional or silently uncapped. In our integration (710 entities, 1,014 triples), unconstrained BFS at certain densely-connected nodes produced combinatorial explosion before we added a hard ceiling. We settled on 4–5 hops as the practical sweet spot for inter-domain connections — beyond that you're usually in noise territory unless the graph is very sparse. Depth-limited BFS with a visited-set guard has been rock-solid for us.
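The depth-limited BFS with a visited-set guard described above is a small amount of code — a sketch of the pattern (graph shape and names are assumptions, not MemPalace's API):

```python
from collections import deque

def traverse(graph: dict, start: str, hop_limit: int = 4) -> set:
    """Depth-limited BFS with a visited-set guard.

    graph maps entity -> list of (relation, neighbor) edges.
    hop_limit is a hard ceiling: required, never silently uncapped.
    """
    visited = {start}
    frontier = deque([(start, 0)])
    reached = set()
    while frontier:
        node, depth = frontier.popleft()
        if depth == hop_limit:
            continue  # hard ceiling: never expand past hop_limit
        for _relation, neighbor in graph.get(node, []):
            if neighbor not in visited:   # visited-set guard stops cycles
                visited.add(neighbor)
                reached.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return reached
```

The visited set is what prevents the combinatorial blow-up at densely connected nodes: each entity is expanded at most once regardless of how many paths reach it.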

On mempalace_kg_find_path: This one has been genuinely useful in ways I didn't anticipate. Beyond just navigation, it's become our go-to tool for explaining cross-domain discoveries — "how did we get from concept A to concept B" is a natural question when you're doing multi-hop reasoning, and shortest-path gives you a clean answer. We've validated A* across 9-hop chains in recent tests (astrophysics → economics → climate chains mostly settle at 4–6), and the explanability angle alone justifies having this as a first-class MCP tool.

One thing worth considering for kg_find_path: returning the full path as an ordered list of (entity, relation, entity) triples rather than just entity names makes it much more useful downstream — clients can render the reasoning chain without a follow-up query.
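Returning triples instead of bare entity names falls out naturally from BFS with parent pointers — a sketch under the same assumed graph shape as above (not MemPalace's actual signature):

```python
from collections import deque

def find_path(graph: dict, src: str, dst: str):
    """Shortest path as an ordered list of (subject, relation, object) triples."""
    parent = {src: None}            # node -> (previous node, relation used)
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            break
        for relation, neighbor in graph.get(node, []):
            if neighbor not in parent:
                parent[neighbor] = (node, relation)
                queue.append(neighbor)
    if dst not in parent:
        return None                 # no path between the entities
    triples = []
    node = dst
    while parent[node] is not None: # walk parent pointers back to src
        prev, relation = parent[node]
        triples.append((prev, relation, node))
        node = prev
    return list(reversed(triples))
```

A client can render the returned list directly as the reasoning chain, with no follow-up query per hop.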

@NickShtefan

Nice work @EndeavorYen — the embedding-based classification is solid.

Heads up: #442 includes infrastructure that your PR would benefit from but currently doesn't have:

  • Embedding model mismatch detection (commits from @FabioLissi) — stores model name in collection metadata, raises a hard error when the configured model doesn't match what the palace was created with. Without this, users who switch from MiniLM to e5-base (or vice versa) silently corrupt their search results.
  • mempalace re-mine — safe model switching: backup palace → drop collection → re-mine from original source files.
  • Device support: MEMPALACE_EMBEDDING_DEVICE for MPS/CUDA acceleration.
  • Configurable chunk size: MEMPALACE_CHUNK_SIZE / MEMPALACE_CHUNK_OVERLAP env vars, since different models have different token limits.

These are 3 cherry-pickable commits on top of the base get_embedding_function():

  • 6a23c8f — chunk size env vars (standalone)
  • 7aa96cc — mismatch detection (standalone)
  • 94412e8 — re-mine command (depends on detection)

Also: we tested paraphrase-multilingual-MiniLM-L12-v2 in production and found its 128-token window too short — switched to intfloat/multilingual-e5-base (512 tokens). Russian search scores went from 0.19-0.40 to 0.70-0.77. Worth benchmarking against your 173-test suite.

Happy to help with integration if useful.

@web3guru888

@NickShtefan — the Russian score jump from 0.19-0.40 → 0.70-0.77 with multilingual-e5-base is really useful data. That 128-token limit on paraphrase-multilingual-MiniLM-L12-v2 is a real ceiling — at 128 tokens you're truncating most substantive content, so you're comparing sentence openings rather than actual semantics.

The cherry-pick list from #442 is the right way to frame this. The embedding model mismatch detection (7aa96cc) is the most load-bearing of the three for #488 specifically — without it, any user who runs #488 on an existing palace built with MiniLM and then changes their embedding config will silently get cross-space similarity comparisons (MiniLM embeddings queried with e5-base vectors). The error would only surface as mysteriously degraded search quality, not a hard failure.

The re-mine command (94412e8) is the practical path to safely migrating a live palace to a new model — which is something users hitting the 128-token limit will immediately want to do.

One thing worth flagging to @EndeavorYen: if these commits cherry-pick cleanly into #488's branch, that's the lowest-friction path. If they don't, the alternative is a sequential merge order (442 → 488) which may need maintainer coordination.

Also cross-referencing #516 where @vincent067 and others are specifically hitting the Chinese-language case — the multilingual-e5-base scores you measured for Russian will likely generalize, since both Russian and Chinese are high-density morphological languages that suffer most from MiniLM's limited token budget.

@FabioLissi

Let me know if I can help in any way.

@EndeavorYen
Author

@NickShtefan @FabioLissi thanks for the work you've put into #442. The mismatch detection, re-mine command, and configurable chunk size are solid infrastructure, and this PR should build on top of that rather than duplicate it.

A few things that stood out to me:

  • The mismatch detection raising a hard error with recovery instructions is the right approach. My original log-only warning was too quiet, and @web3guru888's point about silent corruption being the worst failure mode really drove that home.
  • Having re-mine as a first-class command makes a lot of sense. Once users start switching models, they'll need a clean path to do that safely.
  • @NickShtefan your Russian production benchmarks (0.19-0.40 → 0.70-0.77) are exactly the kind of real-world data that settles the MiniLM-L12 vs e5-base question. I'll run e5-base against the 173-test suite to confirm, but honestly the 128 vs 512 token window gap alone makes the outcome pretty predictable.

Integration plan:

  1. Cherry-pick 6a23c8f, 7aa96cc, and 94412e8 onto this branch
  2. Drop my overlapping mismatch detection code in favor of @FabioLissi's implementation
  3. If conflicts are too heavy (both PRs touched config.py and palace.py a lot), we'll go with sequential merge: #442 first, then rebase #488 on top

Will push the result and report back here.

@NickShtefan

@EndeavorYen sounds good. Sequential merge (#442 first, then rebase #488 on top) is the cleanest path — both PRs touch config.py and palace.py heavily, and cherry-picking individual commits across diverged branches tends to create more conflicts than it saves.

#442 was just rebased on latest upstream/main and includes @FabioLissi's MCP mismatch propagation fix (e64cdd8). Ready to merge from our side.

One heads-up: we also added iter_all_metadatas() to palace.py — upstream's PR #371 replaced the inline usage in mcp_server.py, but cli.py (re-mine) and miner.py still use it. When you rebase, keep that function in palace.py but drop the import from mcp_server.py.

Looking forward to seeing e5-base vs MiniLM-L12 on your 173-test suite.

@EndeavorYen
Author

@NickShtefan Thanks! That makes sense. Given the overlap, sequential merge does seem like the cleanest path here.

I’m happy to follow that direction. If #442 lands first, I can rebase #488 on top afterward instead of forcing the cherry-pick route and resolving the same conflict surface twice.

If the preferred path is to update #488 independently before that, I can do that as well.

Also appreciate the iter_all_metadatas() note; I’ll keep that in mind for the rebase.

@bensig bensig changed the base branch from main to develop April 11, 2026 22:22
@NickShtefan

@EndeavorYen — heads up: #442 was just updated with a significant refactoring in response to maintainer feedback.

What changed in #442

@igorls requested that embedding model configuration should not live in env vars — it should be bound to the palace at init time and changeable only via a migrate/re-mine command. We've implemented this:

  • All embedding env vars removed: MEMPALACE_EMBEDDING_MODEL, MEMPALACE_EMBEDDING_DEVICE, MEMPALACE_CHUNK_SIZE, MEMPALACE_CHUNK_OVERLAP, MEMPALACE_FORCE_EMBEDDING
  • Model bound at init: mempalace init --model <name> stamps the model into ChromaDB collection metadata
  • Model changed via re-mine only: mempalace re-mine --model <new>
  • Device auto-detected: cuda > mps (arm64 only) > cpu — no configuration needed
  • get_embedding_function(model_name=, device=) takes explicit params instead of reading from config/env
  • read_collection_metadata() reads model from ChromaDB without instantiating the embedder

Impact on #488

Since #488 also uses MEMPALACE_EMBEDDING_MODEL and get_embedding_function(), you'll need to adapt when rebasing on top of #442:

  1. config.py: get_embedding_function() signature changed from (config=None) to (model_name=None, device=None). The global singleton cache is replaced by a dict keyed by (model_name, device).

  2. MempalaceConfig: embedding_model and force_embedding properties removed. embedding_device replaced by static detect_device().

  3. palace.py: get_collection() now accepts model=, chunk_size=, chunk_overlap= params for init/re-mine. It reads the model from collection metadata on normal opens.

  4. Your Ollama provider support (ollama:<model>): Should still work — just pass the model string through --model at init time instead of env var.

  5. Your MEMPALACE_EMBEDDING_MODEL references: Replace with read_collection_metadata(palace_path).get("embedding_model").
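Putting points 1 and 5 together, the post-#442 shape of `get_embedding_function()` might look roughly like this. `_make_embedder` and `DEFAULT_MODEL` are hypothetical placeholders here (the real code would build a ChromaDB/sentence-transformers embedding function); only the explicit-params signature and the `(model_name, device)` cache key come from the PR discussion:

```python
# Cache keyed by (model_name, device) — replaces the old global singleton.
_embedder_cache = {}

DEFAULT_MODEL = "all-MiniLM-L6-v2"  # hypothetical default, not from the PR


def _make_embedder(model_name, device):
    # Stand-in for constructing the real embedding function; a plain
    # tuple keeps this sketch runnable without sentence-transformers.
    return ("embedder", model_name, device)


def get_embedding_function(model_name=None, device=None):
    """Sketch of the new interface: explicit params, no config/env reads."""
    key = (model_name or DEFAULT_MODEL, device or "cpu")  # real code auto-detects device
    if key not in _embedder_cache:
        _embedder_cache[key] = _make_embedder(*key)
    return _embedder_cache[key]
```

On this interface, #488's former `MEMPALACE_EMBEDDING_MODEL` reads would become something like `get_embedding_function(read_collection_metadata(palace_path).get("embedding_model"))`, so the model stamped at init time always wins.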

The sequential merge path we discussed still holds — once #442 lands, rebasing #488 should be straightforward since the interfaces are cleaner now. Happy to help with the rebase if useful.

@EndeavorYen
Author

Thanks for the update, Nick. The latest changes in #442 make sense to me, especially binding the embedding model to palace metadata instead of configuring it through env vars.

Since #442 and #488 overlap in config.py, palace.py, and the embedding flow, I’m happy to wait for #442 to land first and then rebase #488 on top. I can adjust the remaining integration work to the new interfaces after that.

I think that will be cleaner than trying to resolve the same overlap in parallel.

@igorls added labels on Apr 14, 2026: area/hooks (Claude Code hook scripts: Stop, PreCompact, SessionStart), area/install (pip/uv/pipx/plugin install and packaging), area/kg (Knowledge graph), area/mcp (MCP server and tools), area/mining (File and conversation mining), area/search (Search and retrieval), enhancement (New feature or request)