Default embedding model (all-MiniLM-L6-v2) produces poor cross-room search results — large rooms dominate

## Problem

When a project contains rooms of very different sizes, an unscoped `mempalace search` returns irrelevant results from the largest room instead of matching content from the room that actually contains the query term.

## Reproduction

The project has ~20 rooms. One room is a large work repo with hundreds of JSON schema files, YAML configs, and DB migration dumps. Another room (`flights`) holds travel itineraries with explicit mentions of cities and countries.

### Search: `mempalace search "cambodia"`

All 5 results come from the large work repo: swagger.yaml, schema.rb, and JSON migration files. There are zero results from the `flights` room, even though `flights/TRIP-ITINERARY.md` and `flights/ROUTE-ANALYSIS.md` contain explicit "Cambodia" text with Phnom Penh and Siem Reap references.

Similarity scores: -0.53 to -0.58 (negative).

### Search: `mempalace search "thailand"`

All 5 results are work repo DB migration JSONs. No Thailand/Krabi/Bangkok content surfaced.

### Search: `mempalace search "cambodia" --room flights`

Correct results are returned immediately (city coordinates, visa info, bus routes), but the scores are still weak (0.004 to -0.168).

## Root Cause

1. **Default ChromaDB embedding model is `all-MiniLM-L6-v2`** — a 384-dim general-purpose model that's weak at semantic search over mixed content (code, JSON, prose). Structural patterns in JSON/YAML cluster together and dominate the vector space.

2. **No hybrid search** — pure dense vector retrieval with no BM25/keyword fallback. Exact string matches ("Cambodia", "Phnom Penh") get zero boost over structural noise.

3. **Room metadata is stored but not used for relevance boosting** — rooms are only filterable via `--room`, not factored into scoring. When one room has 10x the chunks of others, its vectors dominate nearest-neighbor results.

## Suggested Fixes

1. **Upgrade the default embedding model** to something like `BAAI/bge-small-en-v1.5` or `nomic-embed-text`, which should give better retrieval quality on mixed content. `bge-small-en-v1.5` is also 384-dim, so the index size would not change.

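2. **Add hybrid search**: combine vector similarity with BM25/keyword matching so that exact keyword matches boost relevance. ChromaDB's `where_document` can filter on document content (e.g. `{"$contains": "Cambodia"}`), but it filters candidates rather than re-ranking them, so a boost would still need to be applied on top.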

3. **Room-aware scoring** — normalize or boost results per-room so a massive room doesn't drown out smaller ones. Could be as simple as returning top-N per room and merging.

4. **Allow configurable embedding model** — let users set model in `mempalace.yaml` so they can pick domain-specific embeddings without forking.
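
Fixes 2 and 3 could be prototyped as a post-processing step over the raw hits, without touching the embedding model. The sketch below is only an illustration under stated assumptions: `Hit`, `keyword_boost`, and `merge_per_room` are hypothetical names (not mempalace or ChromaDB APIs), the sample hits and scores are made up to resemble the ones reported above, and the boost value is arbitrary. It also assumes the candidate pool already contains hits from every room, for example from one room-filtered query per room, since a plain global top-5 would never include the `flights` chunks in the first place.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Hit:
    room: str     # room name taken from the chunk metadata
    text: str     # chunk text
    score: float  # similarity score, higher is better


def keyword_boost(hit: Hit, query: str, boost: float = 0.5) -> float:
    """Return the hit's score plus a fixed bonus if the query appears verbatim."""
    matched = query.lower() in hit.text.lower()
    return hit.score + (boost if matched else 0.0)


def merge_per_room(hits: list[Hit], query: str,
                   per_room: int = 2, limit: int = 5) -> list[Hit]:
    """Keep the best `per_room` hits of each room, then rank the survivors."""
    by_room: dict[str, list[Hit]] = defaultdict(list)
    for hit in hits:
        by_room[hit.room].append(hit)

    survivors: list[Hit] = []
    for room_hits in by_room.values():
        # Rank inside the room first so a large room can contribute
        # at most `per_room` results to the merged list.
        room_hits.sort(key=lambda h: keyword_boost(h, query), reverse=True)
        survivors.extend(room_hits[:per_room])

    survivors.sort(key=lambda h: keyword_boost(h, query), reverse=True)
    return survivors[:limit]


# Made-up candidates: three from the large room, two from `flights`.
hits = [
    Hit("work", "swagger.yaml: paths: /v1/users", -0.53),
    Hit("work", "schema.rb: create_table :accounts", -0.55),
    Hit("work", "migration.json: version 12", -0.58),
    Hit("flights", "Cambodia: Phnom Penh to Siem Reap by bus", -0.60),
    Hit("flights", "Visa on arrival for Cambodia", -0.62),
]
results = merge_per_room(hits, query="cambodia")
for hit in results:
    print(hit.room, hit.text)
```

With these inputs the two `flights` chunks rank first because of the keyword bonus, and the work repo is capped at two results. A real implementation would likely replace the fixed bonus with a proper BM25 score and tune `per_room` against actual room sizes.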

## Environment

- mempalace: 3.1.0
- chromadb: (installed via pip, default)
- Python: 3.13
- OS: macOS Darwin 25.3.0
