feat: add multilingual support via embedding-based semantic classification #488
EndeavorYen wants to merge 1 commit into MemPalace:develop
Conversation
web3guru888
left a comment
Really solid PR — this is a meaningful upgrade from the old regex-heuristic approach, and the design choices are mostly right. A few things worth discussing:
KG extraction (kg_extraction.py)
The hybrid NER + co-occurrence fallback is the right architecture — pure spaCy NER misses a lot of implicit relationships (especially in scientific/technical domains), so having the co-occurrence layer as a baseline is good. The Claude Haiku path is nice when available.
My concern: the PR says Claude Haiku is "optional", but what does the quality curve actually look like in a fully local setup? In my experience running triple extraction without an LLM, spaCy alone tends to do fine on explicit subject-verb-object structures but drops off sharply on:
- Nominalized predicates ("the influence of X on Y")
- Multi-hop implicit relationships across sentences
- Domain-specific compound entities
Has the 173 multilingual benchmark been run with only the spaCy+co-occurrence path (no Haiku)? Would be useful to see precision/recall split by extraction path so users know what they're signing up for in the purely-local case. Not a blocker, but worth documenting.
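For readers wondering what the co-occurrence floor buys you, the baseline presumably amounts to pairing known entities that share a sentence. A rough sketch (the entity list, sentence splitter, and `related_to` relation label are illustrative, not the PR's actual logic):

```python
import itertools
import re

def cooccurrence_triples(text, entities):
    """Pair known entities that co-occur in a sentence into weak
    (subject, 'related_to', object) triples — a floor, not a ceiling."""
    triples = set()
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        present = [e for e in entities if e in sentence]
        for subj, obj in itertools.combinations(present, 2):
            triples.add((subj, "related_to", obj))
    return triples
```

Note the failure mode this illustrates: "the influence of methylation on expression" still only yields a generic `related_to` edge, which is exactly the nominalized-predicate cliff described above.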
Global embedding cache in multi-process MCP scenarios
get_embedding_function() cached at module level is fine for single-process use, but MCP servers can be forked (e.g. gunicorn-style or when the host spawns workers). After fork, the child inherits the cache reference but the underlying model state (CUDA contexts, tokenizer threads) may not be fork-safe depending on the backend.
Concrete case: if someone runs this with the ollama backend and the HTTP session gets copied into forked children, you can get connection errors that are hard to diagnose. Worth either:
- Documenting the "single-process only" assumption explicitly, or
- Keying the cache on os.getpid() so each child re-initializes cleanly
Optional dependency design
The graceful CI skip is exactly right — I've seen too many multilingual PRs that just unconditionally install sentence-transformers and break CI for users without GPU access. The way this is structured keeps the core install lightweight.
Coverage gap in config.py (82%)
Config is the highest-traffic surface in any integration — mishandled config is almost always the root cause when users report weird behavior. What's the missing 18%? Guessing it might be the Ollama model-name parsing path or the fallback behavior when sentence-transformers isn't installed. Those paths tend to get exercised only in production, which is exactly when you want them tested.
Overall this is a significant quality addition and the backward-compat story is clean. The above are discussion points, not blockers — marking COMMENT rather than requesting changes. Nice work on the benchmark suite especially.
web3guru888
left a comment
This is a substantial and well-thought-out PR — excited to see it finally land. The architectural direction here is exactly right, and the scope of the changes reflects genuine engineering effort.
What's excellent:
The shift from TOPIC_KEYWORDS string matching to embedding-based cosine similarity for room classification is the correct long-term solution. String heuristics break the moment you introduce a second language — or even domain-specific vocabulary in your primary language. We've run into this ourselves working with scientific content (astrophysics, epidemiology, etc.) where existing keyword lists were completely useless for domain classification. Embedding similarity against room description embeddings is semantically principled in a way that keyword lists never can be.
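For anyone unfamiliar with the approach, the classification step reduces to an argmax over cosine similarities between a note embedding and the room-description embeddings. A toy sketch with placeholder vectors (the PR's actual model and room representation may differ):

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def classify_room(note_vec, room_vecs):
    """Pick the room whose description embedding is most similar to the
    note embedding — language- and vocabulary-agnostic, unlike keyword lists."""
    return max(room_vecs, key=lambda room: cosine(note_vec, room_vecs[room]))
```

The key property: the note and the room description never need to share a token, only a region of embedding space, which is why this survives a second language where TOPIC_KEYWORDS cannot.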
The abstraction in config.py is the piece I'm most excited about architecturally. Previously every ChromaDB consumer was implicitly hardcoded to whatever the default was — this brings all of that under a single configurable, cacheable entrypoint. The global model cache (loaded once, not per-call) is also essential; loading sentence-transformers on every call would add 2-3s of latency per operation which would break interactive workflows. These two things together are a real improvement to the codebase independent of the multilingual story.
The CJK-safe spellcheck auto-skip via Unicode range detection is a nicely practical fix. Corrupted CJK from English spellcheckers was a real issue — glad it's solved simply.
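The detection itself is cheap — essentially a codepoint-range check. A sketch (the ranges here are the common CJK blocks, not necessarily the exact set the PR uses):

```python
def contains_cjk(text):
    """Return True if any codepoint falls in a common CJK block, so the
    English spellchecker can be skipped for that chunk."""
    cjk_ranges = [
        (0x4E00, 0x9FFF),  # CJK Unified Ideographs
        (0x3400, 0x4DBF),  # CJK Extension A
        (0x3040, 0x30FF),  # Hiragana + Katakana
        (0xAC00, 0xD7AF),  # Hangul Syllables
    ]
    return any(lo <= ord(ch) <= hi
               for ch in text
               for lo, hi in cjk_ranges)
```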
One concern I think needs a warning before merge:
If a user deploys with the default embedding model, builds up their memory palace, and then switches to a different model via env var, the existing ChromaDB vectors are in an incompatible embedding space. Querying with new-model embeddings against old-model vectors gives semantically meaningless cosine scores — you get garbage retrieval without any error or warning. This is a well-known footgun in vector DB systems.
I don't think the PR needs to solve full migration, but it should at minimum:
- Persist the embedding model name in a config file or ChromaDB metadata on first use
- On startup, detect if the configured model differs from what existing vectors were built with
- If mismatch, emit a clear warning (or fail loudly with re-indexing instructions, or similar)
Without this, users who experiment with different embedding models will silently corrupt their retrieval quality and have no idea why.
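The guard can be very small — a sketch of the persist-and-compare idea (the file path and JSON storage format are assumptions for illustration, not what the PR does; ChromaDB collection metadata would work equally well):

```python
import json
from pathlib import Path

def check_embedding_model(configured_model, meta_path):
    """Persist the model name on first use; fail loudly on mismatch so
    users don't silently query old vectors with a new model's embeddings."""
    meta_file = Path(meta_path)
    if not meta_file.exists():
        # First use: record which model built these vectors.
        meta_file.write_text(json.dumps({"embedding_model": configured_model}))
        return True
    stored = json.loads(meta_file.read_text())["embedding_model"]
    if stored != configured_model:
        raise RuntimeError(
            f"Embedding model mismatch: vectors were built with {stored!r} "
            f"but {configured_model!r} is configured. Re-index before querying."
        )
    return True
```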
On the 7 new KG MCP tools:
Can you clarify how these relate to the existing tools? If these are additive (covering operations that didn't exist before), that's great. If any of them overlap with existing tools with slightly different names or signatures, it will create confusion for MCP clients that enumerate available tools — they'll see two tools that look similar and won't know which to use. A short note in the PR description on what's net-new vs what (if anything) supersedes existing tools would help reviewers and future readers.
On dependency complexity:
The optional extras approach (Collecting mempalace[multilingual]
Downloading mempalace-3.1.0-py3-none-any.whl (110 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 110.7/110.7 KB 2.7 MB/s eta 0:00:00
Collecting chromadb<0.7,>=0.5.0
Downloading chromadb-0.6.3-py3-none-any.whl (611 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 611.1/611.1 KB 45.7 MB/s eta 0:00:00
Requirement already satisfied: pyyaml<7,>=6.0 in /usr/local/lib/python3.10/dist-packages (from mempalace[multilingual]) (6.0.3)
Collecting posthog>=2.4.0
Downloading posthog-7.10.3-py3-none-any.whl (217 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 217.0/217.0 KB 62.8 MB/s eta 0:00:00
Collecting pypika>=0.48.9
Downloading pypika-0.51.1-py2.py3-none-any.whl (60 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 60.6/60.6 KB 28.3 MB/s eta 0:00:00
Requirement already satisfied: numpy>=1.22.5 in /usr/local/lib/python3.10/dist-packages (from chromadb<0.7,>=0.5.0->mempalace[multilingual]) (2.2.6)
Collecting opentelemetry-exporter-otlp-proto-grpc>=1.2.0
Downloading opentelemetry_exporter_otlp_proto_grpc-1.41.0-py3-none-any.whl (20 kB)
Requirement already satisfied: typer>=0.9.0 in /usr/local/lib/python3.10/dist-packages (from chromadb<0.7,>=0.5.0->mempalace[multilingual]) (0.24.1)
Collecting grpcio>=1.58.0
Downloading grpcio-1.80.0-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (6.8 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.8/6.8 MB 71.9 MB/s eta 0:00:00
Collecting uvicorn[standard]>=0.18.3
Downloading uvicorn-0.44.0-py3-none-any.whl (69 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 69.4/69.4 KB 31.2 MB/s eta 0:00:00
Collecting opentelemetry-api>=1.2.0
Downloading opentelemetry_api-1.41.0-py3-none-any.whl (69 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 69.0/69.0 KB 30.5 MB/s eta 0:00:00
Requirement already satisfied: typing_extensions>=4.5.0 in /usr/local/lib/python3.10/dist-packages (from chromadb<0.7,>=0.5.0->mempalace[multilingual]) (4.15.0)
Collecting bcrypt>=4.0.1
Downloading bcrypt-5.0.0-cp39-abi3-manylinux_2_34_x86_64.whl (278 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 278.2/278.2 KB 70.2 MB/s eta 0:00:00
Collecting build>=1.0.3
Downloading build-1.4.2-py3-none-any.whl (24 kB)
Collecting chroma-hnswlib==0.7.6
Downloading chroma_hnswlib-0.7.6-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.4 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.4/2.4 MB 87.7 MB/s eta 0:00:00
Collecting opentelemetry-sdk>=1.2.0
Downloading opentelemetry_sdk-1.41.0-py3-none-any.whl (180 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 180.2/180.2 KB 59.7 MB/s eta 0:00:00
Collecting overrides>=7.3.1
Downloading overrides-7.7.0-py3-none-any.whl (17 kB)
Collecting mmh3>=4.0.1
Downloading mmh3-5.2.1-cp310-cp310-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl (101 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 101.2/101.2 KB 41.7 MB/s eta 0:00:00
Requirement already satisfied: tqdm>=4.65.0 in /usr/local/lib/python3.10/dist-packages (from chromadb<0.7,>=0.5.0->mempalace[multilingual]) (4.67.3)
Collecting kubernetes>=28.1.0
Downloading kubernetes-35.0.0-py2.py3-none-any.whl (2.0 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.0/2.0 MB 97.0 MB/s eta 0:00:00
Requirement already satisfied: httpx>=0.27.0 in /usr/local/lib/python3.10/dist-packages (from chromadb<0.7,>=0.5.0->mempalace[multilingual]) (0.28.1)
Collecting fastapi>=0.95.2
Downloading fastapi-0.135.3-py3-none-any.whl (117 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 117.7/117.7 KB 44.6 MB/s eta 0:00:00
Collecting opentelemetry-instrumentation-fastapi>=0.41b0
Downloading opentelemetry_instrumentation_fastapi-0.62b0-py3-none-any.whl (13 kB)
Requirement already satisfied: tokenizers>=0.13.2 in /usr/local/lib/python3.10/dist-packages (from chromadb<0.7,>=0.5.0->mempalace[multilingual]) (0.22.2)
Collecting orjson>=3.9.12
Downloading orjson-3.11.8-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (133 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 133.9/133.9 KB 2.6 MB/s eta 0:00:00
Collecting tenacity>=8.2.3
Downloading tenacity-9.1.4-py3-none-any.whl (28 kB)
Collecting importlib-resources
Downloading importlib_resources-6.5.2-py3-none-any.whl (37 kB)
Collecting pydantic>=1.9
Downloading pydantic-2.12.5-py3-none-any.whl (463 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 463.6/463.6 KB 82.4 MB/s eta 0:00:00
Collecting onnxruntime>=1.14.1
Downloading onnxruntime-1.23.2-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (17.4 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17.4/17.4 MB 70.0 MB/s eta 0:00:00
Requirement already satisfied: rich>=10.11.0 in /usr/local/lib/python3.10/dist-packages (from chromadb<0.7,>=0.5.0->mempalace[multilingual]) (14.3.3)
Collecting pyproject_hooks
Downloading pyproject_hooks-1.2.0-py3-none-any.whl (10 kB)
Requirement already satisfied: packaging>=24.0 in /usr/local/lib/python3.10/dist-packages (from build>=1.0.3->chromadb<0.7,>=0.5.0->mempalace[multilingual]) (26.0)
Requirement already satisfied: tomli>=1.1.0 in /usr/local/lib/python3.10/dist-packages (from build>=1.0.3->chromadb<0.7,>=0.5.0->mempalace[multilingual]) (2.4.0)
Collecting starlette>=0.46.0
Downloading starlette-1.0.0-py3-none-any.whl (72 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 72.7/72.7 KB 33.1 MB/s eta 0:00:00
Collecting typing-inspection>=0.4.2
Downloading typing_inspection-0.4.2-py3-none-any.whl (14 kB)
Requirement already satisfied: annotated-doc>=0.0.2 in /usr/local/lib/python3.10/dist-packages (from fastapi>=0.95.2->chromadb<0.7,>=0.5.0->mempalace[multilingual]) (0.0.4)
Requirement already satisfied: idna in /usr/local/lib/python3.10/dist-packages (from httpx>=0.27.0->chromadb<0.7,>=0.5.0->mempalace[multilingual]) (3.11)
Requirement already satisfied: certifi in /usr/local/lib/python3.10/dist-packages (from httpx>=0.27.0->chromadb<0.7,>=0.5.0->mempalace[multilingual]) (2026.2.25)
Requirement already satisfied: anyio in /usr/local/lib/python3.10/dist-packages (from httpx>=0.27.0->chromadb<0.7,>=0.5.0->mempalace[multilingual]) (4.13.0)
Requirement already satisfied: httpcore==1.* in /usr/local/lib/python3.10/dist-packages (from httpx>=0.27.0->chromadb<0.7,>=0.5.0->mempalace[multilingual]) (1.0.9)
Requirement already satisfied: h11>=0.16 in /usr/local/lib/python3.10/dist-packages (from httpcore==1.->httpx>=0.27.0->chromadb<0.7,>=0.5.0->mempalace[multilingual]) (0.16.0)
Collecting websocket-client!=0.40.0,!=0.41.,!=0.42.*,>=0.32.0
Downloading websocket_client-1.9.0-py3-none-any.whl (82 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 82.6/82.6 KB 38.4 MB/s eta 0:00:00
Requirement already satisfied: six>=1.9.0 in /usr/local/lib/python3.10/dist-packages (from kubernetes>=28.1.0->chromadb<0.7,>=0.5.0->mempalace[multilingual]) (1.17.0)
Requirement already satisfied: urllib3!=2.6.0,>=1.24.2 in /usr/local/lib/python3.10/dist-packages (from kubernetes>=28.1.0->chromadb<0.7,>=0.5.0->mempalace[multilingual]) (2.6.3)
Requirement already satisfied: python-dateutil>=2.5.3 in /usr/local/lib/python3.10/dist-packages (from kubernetes>=28.1.0->chromadb<0.7,>=0.5.0->mempalace[multilingual]) (2.9.0.post0)
Collecting requests-oauthlib
Downloading requests_oauthlib-2.0.0-py2.py3-none-any.whl (24 kB)
Collecting durationpy>=0.7
Downloading durationpy-0.10-py3-none-any.whl (3.9 kB)
Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from kubernetes>=28.1.0->chromadb<0.7,>=0.5.0->mempalace[multilingual]) (2.32.5)
Collecting coloredlogs
Downloading coloredlogs-15.0.1-py2.py3-none-any.whl (46 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 46.0/46.0 KB 13.9 MB/s eta 0:00:00
Collecting flatbuffers
Downloading flatbuffers-25.12.19-py2.py3-none-any.whl (26 kB)
Collecting protobuf
Downloading protobuf-7.34.1-cp310-abi3-manylinux2014_x86_64.whl (324 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 324.3/324.3 KB 78.0 MB/s eta 0:00:00
Requirement already satisfied: sympy in /usr/local/lib/python3.10/dist-packages (from onnxruntime>=1.14.1->chromadb<0.7,>=0.5.0->mempalace[multilingual]) (1.14.0)
Collecting importlib-metadata<8.8.0,>=6.0
Downloading importlib_metadata-8.7.1-py3-none-any.whl (27 kB)
Collecting opentelemetry-proto==1.41.0
Downloading opentelemetry_proto-1.41.0-py3-none-any.whl (72 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 72.1/72.1 KB 30.9 MB/s eta 0:00:00
Collecting googleapis-common-protos~=1.57
Downloading googleapis_common_protos-1.74.0-py3-none-any.whl (300 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 300.7/300.7 KB 74.3 MB/s eta 0:00:00
Collecting opentelemetry-exporter-otlp-proto-common==1.41.0
Downloading opentelemetry_exporter_otlp_proto_common-1.41.0-py3-none-any.whl (18 kB)
Collecting protobuf
Downloading protobuf-6.33.6-cp39-abi3-manylinux2014_x86_64.whl (323 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 323.4/323.4 KB 50.2 MB/s eta 0:00:00
Collecting opentelemetry-instrumentation-asgi==0.62b0
Downloading opentelemetry_instrumentation_asgi-0.62b0-py3-none-any.whl (17 kB)
Collecting opentelemetry-util-http==0.62b0
Downloading opentelemetry_util_http-0.62b0-py3-none-any.whl (9.3 kB)
Collecting opentelemetry-instrumentation==0.62b0
Downloading opentelemetry_instrumentation-0.62b0-py3-none-any.whl (34 kB)
Collecting opentelemetry-semantic-conventions==0.62b0
Downloading opentelemetry_semantic_conventions-0.62b0-py3-none-any.whl (231 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 231.6/231.6 KB 72.7 MB/s eta 0:00:00
Collecting wrapt<3.0.0,>=1.0.0
Downloading wrapt-2.1.2-cp310-cp310-manylinux1_x86_64.manylinux_2_28_x86_64.manylinux_2_5_x86_64.whl (113 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 113.6/113.6 KB 29.5 MB/s eta 0:00:00
Collecting asgiref~=3.0
Downloading asgiref-3.11.1-py3-none-any.whl (24 kB)
Collecting backoff>=1.10.0
Downloading backoff-2.2.1-py3-none-any.whl (15 kB)
Collecting distro>=1.5.0
Downloading distro-1.9.0-py3-none-any.whl (20 kB)
Collecting pydantic-core==2.41.5
Downloading pydantic_core-2.41.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.1 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.1/2.1 MB 90.4 MB/s eta 0:00:00
Collecting annotated-types>=0.6.0
Downloading annotated_types-0.7.0-py3-none-any.whl (13 kB)
Requirement already satisfied: markdown-it-py>=2.2.0 in /usr/local/lib/python3.10/dist-packages (from rich>=10.11.0->chromadb<0.7,>=0.5.0->mempalace[multilingual]) (4.0.0)
Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /usr/local/lib/python3.10/dist-packages (from rich>=10.11.0->chromadb<0.7,>=0.5.0->mempalace[multilingual]) (2.19.2)
Requirement already satisfied: huggingface-hub<2.0,>=0.16.4 in /usr/local/lib/python3.10/dist-packages (from tokenizers>=0.13.2->chromadb<0.7,>=0.5.0->mempalace[multilingual]) (1.10.1)
web3guru888
left a comment
This is a substantial and well-thought-out PR — excited to see it finally land. The architectural direction here is exactly right, and the scope of the changes reflects genuine engineering effort.
What's excellent:
The shift from TOPIC_KEYWORDS string matching to embedding-based cosine similarity for room classification is the correct long-term solution. String heuristics break the moment you introduce a second language — or even domain-specific vocabulary in your primary language. We've run into this ourselves working with scientific content (astrophysics, epidemiology, etc.) where existing keyword lists were completely useless for domain classification. Embedding similarity against room description embeddings is semantically principled in a way that keyword lists never can be.
The get_embedding_function() abstraction in config.py is the piece I'm most excited about architecturally. Previously every ChromaDB consumer was implicitly hardcoded to whatever the default was — this brings all of that under a single configurable, cacheable entrypoint. The global model cache (loaded once, not per-call) is also essential; loading sentence-transformers on every call would add 2-3s of latency per operation which would break interactive workflows. These two things together are a real improvement to the codebase independent of the multilingual story.
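To make the shape concrete, a cached, pluggable factory along these lines can be sketched as follows. The function body and the stub backends are illustrative assumptions, not the PR's actual config.py code — a real implementation would construct sentence-transformers or Ollama embedders where the lambdas stand in:

```python
from functools import lru_cache

@lru_cache(maxsize=None)  # module-level cache: each model spec is resolved once
def get_embedding_function(model_spec: str):
    """Return an embedding callable for the given model spec."""
    if model_spec.startswith("ollama:"):
        model_name = model_spec.split(":", 1)[1]
        # Real code would build an Ollama-backed embedder here; this stub
        # just records which backend was selected so the sketch is runnable.
        return lambda texts: [("ollama", model_name, t) for t in texts]
    # Default path: a sentence-transformers model (stubbed out here).
    return lambda texts: [("sentence-transformers", model_spec, t) for t in texts]

# Repeated lookups return the same cached callable, so the expensive
# model load would happen only once per process.
fn = get_embedding_function("ollama:qwen3-embedding-8b")
assert fn is get_embedding_function("ollama:qwen3-embedding-8b")
```

The `lru_cache` is what turns a 2-3s model load into a one-time cost per process.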
The CJK-safe spellcheck auto-skip via Unicode range detection is a nicely practical fix. Corrupted CJK from English spellcheckers was a real issue — glad it's solved simply.
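The Unicode-range detection being praised here can be sketched in a few lines. The specific ranges and the skip policy below are assumptions, not the PR's exact implementation:

```python
# If any character falls in a CJK range, skip spellcheck entirely rather
# than risk an English spellchecker "correcting" the text into garbage.
CJK_RANGES = [
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0x3400, 0x4DBF),  # CJK Extension A
    (0x3040, 0x30FF),  # Hiragana + Katakana
    (0xAC00, 0xD7AF),  # Hangul Syllables
]

def contains_cjk(text: str) -> bool:
    return any(lo <= ord(ch) <= hi for ch in text for lo, hi in CJK_RANGES)

def spellcheck(text: str) -> str:
    if contains_cjk(text):
        return text  # auto-skip: never run English correction on CJK
    return text  # placeholder for the real English correction path
```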
One concern I think needs a warning before merge:
If a user deploys with the default embedding model, builds up their memory palace, and then switches to a different model via env var or config, the existing ChromaDB vectors are in an incompatible embedding space. Querying with new-model embeddings against old-model vectors gives semantically meaningless cosine scores — you get garbage retrieval without any error or warning. This is a well-known footgun in vector DB systems.
I don't think the PR needs to solve full migration, but it should at minimum:
- Persist the embedding model name in a config file or ChromaDB metadata on first use
- On startup, detect if the configured model differs from what existing vectors were built with
- If mismatch, emit a clear warning (or fail loudly with instructions to run a rebuild command)
Without this, users who experiment with different embedding models will silently degrade their retrieval quality and have no idea why.
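A minimal sketch of the proposed safeguard, using a hypothetical marker-file layout (a real implementation might persist this in ChromaDB metadata instead, and the file name here is an assumption):

```python
import json
import warnings
from pathlib import Path

def check_embedding_model(palace_dir: Path, configured_model: str) -> bool:
    """Return True if the vector store matches the configured model."""
    marker = palace_dir / "embedding_model.json"
    if not marker.exists():
        # First use: record which model the vectors are built with.
        marker.write_text(json.dumps({"model": configured_model}))
        return True
    stored = json.loads(marker.read_text())["model"]
    if stored != configured_model:
        warnings.warn(
            f"Vectors were built with {stored!r} but {configured_model!r} is "
            f"configured; cosine scores across models are meaningless. "
            f"Re-embed the palace or restore the original model."
        )
        return False
    return True
```

The key property: mismatches surface loudly at startup instead of as silently garbage retrieval.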
On the 7 new KG MCP tools:
Can you clarify how these relate to the existing mempalace_kg_query, mempalace_kg_add, etc. tools? If these are additive (covering operations that didn't exist before), that's great. If any of them overlap with existing tools under slightly different names or signatures, it will create confusion for MCP clients that enumerate available tools — they'll see two tools that look similar and won't know which to use. A short note in the PR description on what's net-new vs what (if anything) supersedes existing tools would help reviewers and future readers.
On dependency complexity:
The optional extras approach (pip install mempalace[multilingual]) handles the sentence-transformers + spaCy + optional LLM API key chain well — users who don't need multilingual don't pull in the weight. One thing worth documenting clearly: what's the behavior when the multilingual extra is not installed but multilingual content is encountered? The fallback to English regex is mentioned, but making that degradation path explicit in the README would help users understand what they're getting without the optional install.
173/173 on the benchmark suite is strong. Would be curious how those benchmarks handle edge cases like code-switching (mixed language within a single entry) or very short entries where there's not enough signal for the embedding to land confidently — but that can be a follow-on issue rather than a blocker here.
Overall: this is a meaningful contribution. The embedding model migration concern is the one I'd want resolved before merging — everything else is either already well handled or addressable in follow-on issues. Nice work @EndeavorYen.
Thanks for the thorough review, really helpful feedback. Addressing each point:

1. Embedding model mismatch detection

Already implemented; I should have called it out in the PR description, that's on me. The check is called in both paths. Open question: currently it's a log warning. Would a louder signal (stderr or hard failure) be more appropriate?

2. KG MCP tools

All 7 are net-new. Upstream had no KG-related MCP tools, and there's no overlap with existing tool names or signatures. To be transparent: the KG extraction is a complementary addition that got bundled into this PR rather than being strictly multilingual scope. Will keep PR scope tighter going forward.

3. Fallback behavior documentation

Good call. Will add the degradation path to the README; happy to do it in this PR or as a follow-up.

4. KG extraction quality without Haiku

Honest answer: the 173 benchmark covers room classification, memory extraction, entity detection, and search, but not KG triple extraction. In the purely-local path (spaCy + co-occurrence, no LLM), extraction holds up on explicit structures but drops off on implicit relationships. A KG-specific benchmark with precision/recall by extraction path would be valuable; will open it as a follow-on issue.

5. Multi-process fork safety

Valid concern. I'll add os.getpid() to the embedding cache key.

6. Config coverage (82%)

Mostly Ollama error paths and OS-specific permission branches. Will add tests in a follow-up.
@EndeavorYen — thanks for the detailed follow-up. This is much clearer now.

Embedding model mismatch detection — Good, glad this is already in place.

KG MCP tools — Appreciate the transparency on scope. The 7 tools being net-new with no naming conflicts is the key thing.

Fallback degradation table — This is exactly what reviewers need. Adding it to the README (or even the PR description) would be a good addition, either now or as a follow-up.

KG extraction quality — The honest answer is the right answer. Co-occurrence works for co-located entities but drops off for implicit multi-hop relationships, and that's the hard case. A KG-specific benchmark issue makes sense — if you open it, we can contribute some test triples from our astrophysics and cryptography wings where the relationship structure is well-defined.

Fork safety — Keying the cache by process ID covers the forked-worker case.

Overall this is in good shape. The fork safety change is the one I'd land before merge; the rest can be follow-ups.
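For concreteness, the pid-keyed cache under discussion might look roughly like this — a sketch under assumptions (cache shape and function names are illustrative, not the PR's actual code):

```python
import os

_EMBED_CACHE: dict = {}  # keyed by (model_name, pid)

def _load_model(model_name: str):
    # Stand-in for an expensive sentence-transformers load.
    return {"model": model_name, "pid": os.getpid()}

def get_cached_model(model_name: str):
    # Including os.getpid() in the key means a forked child never reuses a
    # model object (CUDA context, HTTP session, ...) created in the parent;
    # the child's first lookup misses and triggers a fresh load.
    key = (model_name, os.getpid())
    if key not in _EMBED_CACHE:
        _EMBED_CACHE[key] = _load_model(model_name)
    return _EMBED_CACHE[key]
```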
Force-pushed 8218d2b to 0e6340a
…ation

Replace per-language keyword/regex heuristics with embedding-based semantic classification, enabling MemPalace to work with 50+ languages using zero per-language configuration.

Changes:
- Room classification: cosine similarity against room description embeddings
- Memory extraction: embedding-based classification (5 types, any language)
- Entity detection: add Chinese name patterns (百家姓 surnames)
- Spellcheck: auto-skip CJK text via Unicode detection
- Embedding provider: pluggable via get_embedding_function() with caching
  - Default: paraphrase-multilingual-MiniLM-L12-v2 (sentence-transformers)
  - Ollama: "ollama:<model>" prefix (e.g., ollama:qwen3-embedding-8b)
  - Configurable via MEMPALACE_EMBEDDING_MODEL env var or config.json
- Knowledge graph: temporal triples, multi-hop traversal, auto-extraction
- Dialect: CJK bigram extraction for topic keywords
- All ChromaDB consumers route through centralized embedding function

New optional dependency: sentence-transformers>=2.0
Install: pip install mempalace[multilingual]
Without it: English regex fallback (existing behavior unchanged)

Benchmark: 173/173 (100%) across 8 languages (zh-Hans, zh-Hant, en, fr, es, de, ja, ko)
652 tests passing, 0 failures. CI-compatible (multilingual tests skip gracefully when sentence-transformers is not installed).

Closes MemPalace#231. Related: MemPalace#37, MemPalace#50, MemPalace#92, MemPalace#117, MemPalace#156, MemPalace#273.
Force-pushed 0e6340a to 2f90412
Thanks for the follow-up. Pushed a few updates based on your feedback:

1. Fork safety (os.getpid()): Added to the embedding cache key.

2. Scope cleanup: Found and removed 3 session checkpoint/restore/list MCP tools that were fork-specific and accidentally included. These would have bypassed upstream's diary_write path with truncated content. The upstream save workflow (hook_stop/hook_precompact triggering AI-driven diary_write with full content) is now preserved as intended. Added 2 storage integrity regression tests to guard against this class of bug.

3. Mismatch warning: Noted on stderr vs hard failure. Will make it louder (stderr, non-blocking) in a follow-up, keeping the rolling upgrade behavior you described. The diff is now strictly multilingual + KG + embedding infrastructure.

4. Fallback behavior table: Fallback behavior without the [multilingual] extra:
| Component | With [multilingual] | Without (English fallback) |
|---|---|---|
| Room classification | Embedding similarity (any language) | English keyword matching |
| Memory extraction | Embedding classification (any language) | English regex markers |
| Entity detection | English + Chinese patterns (百家姓) | English patterns only |
| Spellcheck | CJK auto-skip + English correction | English correction only |
| Embedding model | paraphrase-multilingual-MiniLM-L12-v2 | ChromaDB default (all-MiniLM-L6-v2) |
All existing English-only behavior is unchanged. The [multilingual] extra adds capabilities without modifying the default path.
Will open the KG extraction benchmark issue once this lands.
Happy to receive test triples from your astrophysics/cryptography wings for that.
@EndeavorYen — the scope cleanup in this latest push is significant. Removing those 3 session checkpoint/restore/list MCP tools was the right call — they were silently truncating diary content to 200 chars, which would have been a hard-to-debug data loss bug in production (sessions look mined, retrieval returns partial content). The storage integrity regression tests catching that class of issue are worth keeping regardless of what else changes. The fallback behavior table is exactly what the PR description needed; the per-component breakdown makes the degradation path without the [multilingual] extra explicit. On the KG test triples offer — yes, happy to provide some from our astrophysics and cryptography wings once this lands.
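As a rough illustration of the kind of well-defined (subject, predicate, object) triples on offer from those domains — these are hypothetical stand-ins, not the reviewers' actual dataset:

```python
# Hypothetical test triples in the shape described above.
astro_triples = [
    ("GW170817", "detected_by", "LIGO"),
    ("GW170817", "associated_with", "GRB 170817A"),
    ("Vera C. Rubin Observatory", "located_in", "Chile"),
]
crypto_triples = [
    ("AES-256", "standardized_by", "NIST"),
    ("RSA", "relies_on", "integer factorization hardness"),
]
```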
The interesting edge cases for your NER benchmark: multi-word proper nouns (observatory names, algorithm names with version numbers), entities that appear as both subject and object in different triples, and predicates that are semantically synonymous but lexically different ("detected_by" vs "observed_by"). Happy to structure a test set around those once you open the benchmark issue.

654 tests + the scope reduction to strictly multilingual + KG + embedding infra means this is in a good state to merge. LGTM from our side.
The two new traversal tools (traverse and find_path) are the ones I'm most excited about in this batch — wanted to share some notes from our own graph traversal work.
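A find_path-style multi-hop traversal over a triple store can be sketched as a BFS over an adjacency map. The function name, signature, and directed-edge assumption below are illustrative, not the PR's actual MCP tool API:

```python
from collections import defaultdict, deque

def find_path(triples, start, goal, max_hops=4):
    """Shortest path from start to goal, as alternating nodes and predicates."""
    adj = defaultdict(list)
    for s, p, o in triples:
        adj[s].append((p, o))  # directed edge s -[p]-> o
    queue = deque([(start, [start], 0)])
    seen = {start}
    while queue:
        node, path, hops = queue.popleft()
        if node == goal:
            return path
        if hops == max_hops:
            continue  # don't expand beyond the hop budget
        for pred, nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [pred, nxt], hops + 1))
    return None  # no path within max_hops

triples = [
    ("alice", "works_on", "mempalace"),
    ("mempalace", "uses", "chromadb"),
    ("chromadb", "written_in", "python"),
]
```

BFS guarantees the returned path is minimal in hop count, and the `max_hops` cap keeps the search bounded on dense graphs.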
Nice work @EndeavorYen — the embedding-based classification is solid. Heads up: #442 includes infrastructure that your PR would benefit from but currently doesn't have: embedding model mismatch detection, a re-mine command, and a configurable chunk size. These are 3 cherry-pickable commits on top of the base branch.
Also: we tested multilingual-e5-base; Russian scores jumped from 0.19-0.40 to 0.70-0.77. Happy to help with integration if useful.
@NickShtefan — the Russian score jump from 0.19-0.40 → 0.70-0.77 with multilingual-e5-base is really useful data. That 128-token limit on paraphrase-multilingual-MiniLM-L12-v2 is a real ceiling — at 128 tokens you're truncating most substantive content, so you're comparing sentence openings rather than actual semantics. The cherry-pick list from #442 is the right way to frame this.

One thing worth flagging to @EndeavorYen: if these commits cherry-pick cleanly into #488's branch, that's the lowest-friction path. If they don't, the alternative is a sequential merge order (442 → 488), which may need maintainer coordination.

Also cross-referencing #516, where @vincent067 and others are specifically hitting the Chinese-language case — the multilingual-e5-base scores you measured for Russian will likely generalize, since both Russian and Chinese are high-density morphological languages that suffer most from MiniLM's limited token budget.
Let me know if I can help in any way.
@NickShtefan @FabioLissi thanks for the work you've put into #442. The mismatch detection, re-mine command, and configurable chunk size are solid infrastructure, and this PR should build on top of that rather than duplicate it. A few things that stood out to me:
Integration plan:
Will push the result and report back here.
@EndeavorYen sounds good. Sequential merge (#442 first, then rebase #488 on top) is the cleanest path — both PRs touch config.py and palace.py heavily, and cherry-picking individual commits across diverged branches tends to create more conflicts than it saves. #442 was just rebased on latest upstream/main and includes @FabioLissi's MCP mismatch propagation fix (e64cdd8). Ready to merge from our side. One heads-up: we also added an iter_all_metadatas() helper. Looking forward to seeing e5-base vs MiniLM-L12 on your 173-test suite.
@NickShtefan Thanks! That makes sense. Given the overlap, sequential merge does seem like the cleanest path here, and I'm happy to follow that direction. If #442 lands first, I can rebase #488 on top afterward instead of forcing the cherry-pick route and resolving the same conflict surface twice. If the preferred path is to update #488 independently before that, I can do that as well. Also appreciate the iter_all_metadatas() note; I'll keep that in mind for the rebase.
@EndeavorYen — heads up: #442 was just updated with a significant refactoring in response to maintainer feedback.

What changed in #442

@igorls requested that embedding model configuration should not live in env vars — it should be bound to the palace at init time and changeable only via a migrate/re-mine command. We've implemented this.
Impact on #488

Since #488 also uses the MEMPALACE_EMBEDDING_MODEL env-var configuration path, the rebase will need to adopt the new init-time binding.
The sequential merge path we discussed still holds — once #442 lands, rebasing #488 should be straightforward since the interfaces are cleaner now. Happy to help with the rebase if useful.
Thanks for the update, Nick. The latest changes in #442 make sense to me, especially binding the embedding model to palace metadata instead of configuring it through env vars. Since #442 and #488 overlap in config.py and palace.py, I'll wait for #442 to land and then rebase #488 on top. I think that will be cleaner than trying to resolve the same overlap in parallel.
Summary
Replaces per-language keyword/regex heuristics with embedding-based semantic classification, enabling MemPalace to work with 50+ languages using zero per-language configuration. One new optional dependency (sentence-transformers), fully backward-compatible.

What changed
Room classification → embedding-based
Replaced TOPIC_KEYWORDS string matching with cosine similarity against room description embeddings. Any language works — Chinese, French, Korean, etc. — with zero keyword lists.

Memory extraction → embedding-based
Replaced per-language regex markers with cosine similarity against memory type description embeddings (decision, preference, milestone, problem, emotional). Each paragraph classified independently.
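The scheme can be sketched end to end with a toy embedding standing in for the multilingual model. The type descriptions below are hypothetical wording, not the PR's actual strings, and the bag-of-characters embed() exists only so the sketch runs without a model download:

```python
import math

def embed(text: str) -> list:
    # Toy bag-of-characters embedding; a real deployment would call a
    # sentence-transformers model here instead.
    vec = [0.0] * 32
    for ch in text.lower():
        vec[ord(ch) % 32] += 1.0
    return vec

def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

TYPE_DESCRIPTIONS = {  # hypothetical wording of the five type descriptions
    "decision": "a choice that was made between alternatives",
    "preference": "something the user likes or prefers",
    "milestone": "a significant achievement or completed goal",
    "problem": "an issue, bug, or obstacle encountered",
    "emotional": "a feeling or emotional reaction",
}
# Description embeddings are computed once, then reused per paragraph.
TYPE_VECTORS = {name: embed(desc) for name, desc in TYPE_DESCRIPTIONS.items()}

def classify(paragraph: str) -> str:
    v = embed(paragraph)
    return max(TYPE_VECTORS, key=lambda name: cosine(v, TYPE_VECTORS[name]))
```

With a real multilingual encoder, the same nearest-description logic works for any input language because the descriptions and the paragraph land in a shared embedding space.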
Entity detection → Chinese name support
Added 百家姓 (Baijiaxing) surname tables for Chinese (simplified + traditional) + Chinese verb/dialogue patterns. Stays rule-based because NER is fundamentally pattern matching.
Spellcheck → CJK-safe
Auto-skips non-English text via Unicode detection. 14 regression tests verify CJK text is never corrupted.
Embedding provider → pluggable
Centralized get_embedding_function() in config.py — all ChromaDB consumers route through it. Supports:

- paraphrase-multilingual-MiniLM-L12-v2 (default, sentence-transformers)
- ollama:<model-name> for Ollama-hosted models (e.g., ollama:qwen3-embedding-8b, per #273 "Domain-scoped collections + local embedding model = better retrieval at scale")
- Configurable via MEMPALACE_EMBEDDING_MODEL env var or config.json
Added kg_extraction.py with hybrid NER+LLM triple extraction (spaCy + optional Claude Haiku + co-occurrence analysis). 7 new MCP tools for KG operations (query, add, invalidate, timeline, traverse, find_path, extract).

Dialect → CJK support
Added Chinese stop words and CJK bigram extraction for topic keywords.
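The bigram idea can be sketched as follows — since Chinese has no word boundaries, a 2-character sliding window over runs of CJK characters yields keyword candidates. The Unicode range and the stop-bigram list are illustrative assumptions, not the PR's actual tables:

```python
CJK_STOP_BIGRAMS = {"我们", "你们", "这个"}  # illustrative stop words

def is_cjk(ch: str) -> bool:
    return 0x4E00 <= ord(ch) <= 0x9FFF  # CJK Unified Ideographs

def cjk_bigrams(text: str) -> list:
    grams, run = [], []
    for ch in text + " ":  # trailing sentinel flushes the final run
        if is_cjk(ch):
            run.append(ch)
        else:
            # A non-CJK character ends the run; emit its bigrams.
            for i in range(len(run) - 1):
                gram = run[i] + run[i + 1]
                if gram not in CJK_STOP_BIGRAMS:
                    grams.append(gram)
            run = []
    return grams
```

For example, 记忆宫殿 ("memory palace") yields the candidates 记忆, 忆宫, 宫殿; ranking by frequency would then surface the real topic keywords.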
Benchmark results — 173 test cases, 8 languages
Overall: 173/173 (100%), Grade A
Test coverage
652 tests passing, 0 failures. Core multilingual module coverage:
Multilingual tests skip gracefully in CI when sentence-transformers is not installed — zero impact on existing CI pipeline.

Installation
Design decisions
- sentence-transformers>=2.0 only needed for multilingual. English regex fallback preserved — existing behavior unchanged
- get_embedding_function() supports the ollama:<model> prefix, so Qwen3-8B (#273 "Domain-scoped collections + local embedding model = better retrieval at scale") is a config change, not a code change

Coordination with #273
Per maintainer request: the embedding function is pluggable enough that Qwen3-Embedding-8B via Ollama drops in as a config change:
{"embedding_model": "ollama:qwen3-embedding-8b", "embedding_endpoint": "http://localhost:11434"}

Or via env var:

MEMPALACE_EMBEDDING_MODEL=ollama:qwen3-embedding-8b

Test plan
- pip install mempalace[multilingual] works from a clean env