Skip to content

Fix: set cosine distance metadata on all collection creation sites#807

Merged
igorls merged 2 commits intoMemPalace:developfrom
sha2fiddy:fix/218-cosine-distance-metadata
Apr 14, 2026
Merged

Fix: set cosine distance metadata on all collection creation sites#807
igorls merged 2 commits intoMemPalace:developfrom
sha2fiddy:fix/218-cosine-distance-metadata

Conversation

@sha2fiddy
Copy link
Copy Markdown
Contributor

@sha2fiddy sha2fiddy commented Apr 13, 2026

Summary

  • Add metadata={"hnsw:space": "cosine"} to all 4 production create_collection/get_or_create_collection call sites that were missing it (backends/chroma.py, cli.py x2, migrate.py)
  • Fix 3 test fixtures to also create collections with cosine distance (conftest.py, test_mcp_server.py, test_miner.py)
  • Add explicit test verifying ChromaBackend.get_collection(create=True) produces a cosine-distance collection

Closes #218. Supersedes stale branch feature/fix-cosine-distance (170 files divergent, pre-backend-refactor).

Upgrading existing palaces

Existing palaces created before this fix use L2 (Euclidean) distance and will not self-heal from this merge. The hnsw:space setting is baked in at collection-creation time. After upgrading, run mempalace repair to rebuild the collection with cosine distance.

Context

ChromaDB defaults HNSW index to L2 (Euclidean) distance. MemPalace scoring computes 1 - distance, which only produces valid similarity scores under cosine distance (range 0–2). Under L2, distances are unbounded, producing negative scores. Only mcp_server.py and repair.py were passing the cosine metadata.

Note on metadata= vs configuration=

This PR uses the legacy metadata={"hnsw:space": "cosine"} syntax intentionally. ChromaDB 0.6.3+ introduced a newer configuration= API, but it requires a CollectionConfigurationInternal object — passing a plain dict doesn't work in 0.6.x. Since the repo pins chromadb>=0.5.0,<0.7, the legacy metadata= approach is the only one that works across the full supported range. When the chromadb pin is bumped (e.g. PR #664), these calls should be migrated to the configuration= API.

Test plan

  • pytest tests/ -x -v — 685 passed (2 pre-existing version-mismatch failures unrelated to this change)
  • ruff check mempalace/ tests/ — all checks passed (formatted with CI-pinned ruff 0.4.x)
  • Sweep: every create_collection/get_or_create_collection call (excluding benchmarks) now has metadata= or is read-only get_collection
  • New test test_chroma_backend_creates_collection_with_cosine_distance verifies metadata via raw chromadb.PersistentClient
  • mempalace repair confirmed to rebuild collections with metadata={"hnsw:space": "cosine"} (cli.py:247)

ChromaDB defaults HNSW index to L2 (Euclidean) distance, but
MemPalace scoring uses 1-distance which requires cosine (range 0-2).
Add metadata={"hnsw:space": "cosine"} to the 4 production and 3 test
call sites that were missing it.

Closes MemPalace#218
@sha2fiddy sha2fiddy marked this pull request as ready for review April 13, 2026 15:12
@milla-jovovich
Copy link
Copy Markdown
Collaborator

@sha2fiddy thank so much for catching this!

The fix is correct. All 4 missing sites verified: backends/chroma.py:88, cli.py:243 (repair), cli.py:389 (compressed),
migrate.py:210. The metadata= vs configuration= reasoning is solid for the current chromadb pin.

CI needs a small cleanup. The lint failures are just formatting — looks like a newer local ruff reformatted things the
CI-pinned version doesn't expect. Can you re-run with the pinned version?

uvx --from 'ruff>=0.4.0,<0.5' ruff format .
uvx --from 'ruff>=0.4.0,<0.5' ruff check .

Ideally the final diff is just the metadata={"hnsw:space": "cosine"} additions with no extra reformatting — keeps it
clean and easy to review. The version consistency test failures are pre-existing on develop, not yours.

One thing to call out for users: Since hnsw:space is set at collection-creation time, existing palaces on L2 won't
self-heal from this merge. They'll need mempalace repair or a re-mine. Worth adding a bolded "Existing palaces: run
mempalace repair after upgrading" note so people don't miss it.

Once the formatting is cleaned up, this is good to go. We'll close #304 as superseded after this lands.

Thank you from Milla and Lu ✨

@sha2fiddy
Copy link
Copy Markdown
Contributor Author

sha2fiddy commented Apr 13, 2026

@milla-jovovich Fixed the lint — reformatted with pinned ruff 0.4.x. The test failures are pre-existing on develop (version.py out of sync with pyproject.toml), not from this PR.

Added an upgrade note to the PR description since repair already rebuilds with cosine metadata.

@igorls igorls merged commit 5320246 into MemPalace:develop Apr 14, 2026
6 checks passed
@sha2fiddy sha2fiddy deleted the fix/218-cosine-distance-metadata branch April 18, 2026 16:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Collection created without hnsw:space=cosine causes negative similarity scores

3 participants