Problem
When multiple rooms with very different content sizes exist, unscoped mempalace search returns irrelevant results from the largest room instead of matching content from the correct room.
Reproduction
Project has ~20 rooms. One room is a large work repo with hundreds of JSON schema files, YAML configs, and DB migration dumps. Another room (flights) has travel itineraries with explicit mentions of cities and countries.
Search: mempalace search "cambodia"
All 5 results come from the large work repo: swagger.yaml, schema.rb, and JSON migration files. There are zero results from the flights room, even though flights/TRIP-ITINERARY.md and flights/ROUTE-ANALYSIS.md contain explicit "Cambodia" text with Phnom Penh and Siem Reap references.
Similarity scores: -0.53 to -0.58 (negative).
Search: mempalace search "thailand"
All 5 results are work repo DB migration JSONs. No Thailand/Krabi/Bangkok content surfaced.
Search: mempalace search "cambodia" --room flights
Correct results returned immediately — city coordinates, visa info, bus routes. But scores are still weak (0.004 to -0.168).
Root Cause
- Default ChromaDB embedding model is all-MiniLM-L6-v2, a 384-dim general-purpose model that's weak at semantic search over mixed content (code, JSON, prose). Structural patterns in JSON/YAML cluster together and dominate the vector space.
- No hybrid search: retrieval is pure dense vector search with no BM25/keyword fallback. Exact string matches ("Cambodia", "Phnom Penh") get zero boost over structural noise.
- Room metadata is stored but not used for relevance boosting. Rooms are only filterable via --room, not factored into scoring. When one room has 10x the chunks of others, its vectors dominate nearest-neighbor results.
Suggested Fixes
- Upgrade the default embedding model to something like BAAI/bge-small-en-v1.5 or nomic-embed-text, which offer significantly better semantic quality at similar cost.
- Add hybrid search: combine vector similarity with BM25/keyword matching (ChromaDB supports where_document for content filtering). Exact keyword matches should boost relevance.
- Room-aware scoring: normalize or boost results per-room so a massive room doesn't drown out smaller ones. Could be as simple as returning top-N per room and merging.
- Allow a configurable embedding model: let users set the model in mempalace.yaml so they can pick domain-specific embeddings without forking.
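To make the hybrid-search and room-aware suggestions concrete, here is a rough sketch of both as post-processing steps over a list of raw vector hits. Everything in it is an assumption for illustration: `Hit`, `keyword_boost`, `top_n_per_room`, and `SAMPLE_HITS` are hypothetical names, not part of mempalace or ChromaDB, the distances are invented values rather than measurements, and a plain substring check stands in for real BM25 scoring.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Hit:
    room: str        # room name, assumed to be available from chunk metadata
    text: str        # chunk text
    distance: float  # vector distance from the store; lower means closer


def keyword_boost(hits, query, boost=0.5):
    """Re-rank hits so chunks containing the literal query string move up.

    A case-insensitive substring check stands in for BM25 here.
    """
    needle = query.lower()

    def adjusted(hit):
        bonus = boost if needle in hit.text.lower() else 0.0
        return hit.distance - bonus

    return sorted(hits, key=adjusted)


def top_n_per_room(hits, n=2, limit=5):
    """Keep at most n hits per room, then merge the survivors by distance."""
    by_room = defaultdict(list)
    for hit in sorted(hits, key=lambda h: h.distance):
        if len(by_room[hit.room]) < n:
            by_room[hit.room].append(hit)
    merged = [h for room_hits in by_room.values() for h in room_hits]
    return sorted(merged, key=lambda h: h.distance)[:limit]


# Invented sample data: three structural chunks from a large room and one
# prose chunk from a small room that actually mentions the query term.
SAMPLE_HITS = [
    Hit("work", '{"type": "object"}', 0.40),
    Hit("work", "create_table :countries", 0.42),
    Hit("work", "swagger: '2.0'", 0.45),
    Hit("flights", "Cambodia: Phnom Penh to Siem Reap by bus", 0.60),
]

if __name__ == "__main__":
    for hit in keyword_boost(SAMPLE_HITS, "cambodia"):
        print("boosted:", hit.room, hit.text)
    for hit in top_n_per_room(SAMPLE_HITS, n=2):
        print("per-room:", hit.room, hit.text)
```

With the sample data, the keyword boost moves the flights chunk to the top, while the per-room cap only guarantees that the large room cannot fill every slot. The two steps could be chained, and the `boost` and `n` values would need tuning against real queries.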
Environment
- mempalace: 3.1.0
- chromadb: (installed via pip, default)
- Python: 3.13
- OS: macOS Darwin 25.3.0