Skip to content

fix(mcp_server): log exception + retry once on _get_collection failure#1286

Closed
jphein wants to merge 283 commits intoMemPalace:developfrom
jphein:fix/get-collection-log-retry
Closed

fix(mcp_server): log exception + retry once on _get_collection failure#1286
jphein wants to merge 283 commits intoMemPalace:developfrom
jphein:fix/get-collection-log-retry

Conversation

@jphein
Copy link
Copy Markdown
Collaborator

@jphein jphein commented Apr 30, 2026

A small improvement to _get_collection: log the exception on failure (instead of returning None silently) and retry once after clearing the cached client and collection. Behaviour on the happy path is unchanged.

Why

The current shape (try: ... except Exception: return None) has two costs in practice on a long-running deployment:

  1. Invisible incidents. When a transient ChromaDB error poisons the cached _client_cache / _collection_cache, every subsequent call returns None but the operator never sees the underlying exception in the log. Debugging has to start from the symptoms — 404s, "no palace" responses — without the actual stack.
  2. No self-healing. Once the cache is stale, the only way back to a working state is a process restart. A single retry that clears the caches first resolves the most common shape (a transient error left bad state behind) without operator intervention.

What changes

for attempt in range(2):
    try:
        client = _get_client()
        # ...same body as today...
        return _collection_cache
    except Exception as e:
        logger.error(
            "_get_collection attempt %d failed (palace=%s): %s",
            attempt + 1, _config.palace_path, e,
        )
        if attempt == 0:
            _client_cache = None
            _collection_cache = None
            _metadata_cache = None
            _metadata_cache_time = 0
return None
  • One logger.error(...) per failed attempt, including the palace path so multi-palace deployments can tell which one is hurting.
  • After the first failure, all four caches reset before the second attempt — _client_cache is now in the global declaration so the second attempt actually reopens the client from scratch.
  • The function still returns None if both attempts fail. Existing callers that handle None (there are several throughout mcp_server.py) keep working unchanged.
  • Happy path: zero log lines, no behaviour change. The first iteration returns immediately on success.

What's intentionally not in this PR

  • The HNSW num_threads=1 retrofit and _pin_hnsw_threads() call inside the function are untouched. That work landed separately and is unrelated.
  • No new dependencies, no new config keys, no test changes. The retry path is timing-dependent (a transient failure followed by a recovery) and hard to fake reliably; happy to add a unit test if there's a preferred shape (mock.patch on _get_client raising once?).

Where it ran

This has been running as a local patch on top of the pipx install on a 151K-drawer palace fronting palace-daemon since 2026-04-25. The retry path fired twice during HNSW segment-quarantine windows after a daemon restart — both times the second attempt succeeded and traffic kept flowing without intervention. No false-positive retries observed (the happy path takes the return _collection_cache exit on the first iteration).

I've been carrying this as a sed-applied patch in palace-daemon/patches/; happy to retire that and just depend on a release line that has this in it once it lands. Equally happy to rework the shape (different log level, structured logging, no retry at all and just the log) — let me know what fits best.

Thanks for taking a look!

jphein and others added 30 commits April 10, 2026 12:05
- tool_add_drawer and tool_delete_drawer now invalidate _metadata_cache
  so status/taxonomy reflect changes immediately (review item 3)
- Document English-only stopword limitation in _extract_themes
- Document venv/ assumption in _mempalace_python editable-install path

Co-Authored-By: Claude Opus 4.6 <[email protected]>
…-8 config

- _mempalace_python: honor MEMPALACE_PYTHON env var as first priority
- searcher: clamp similarity to [0, 1] so L2 distances > 1 don't go negative
- config: use encoding="utf-8" + ensure_ascii=False for config writes

Co-Authored-By: Claude Opus 4.6 <[email protected]>
- Add SKIP_PATTERNS and JUNK_FILE_SIZE (500KB) to miner to exclude
  minified JS/CSS, bundles, source maps, lockfiles, and large generated
  files from palace mining
- Add `mempalace purge --wing/--room` CLI command for batch deletion
  with confirmation prompt and batched ChromaDB deletes
- Fix status display: use col.count() + paginated metadata fetch instead
  of hard-capped 10K get(), add thousands separators to output
- Update CLAUDE.md test count to 615

Co-Authored-By: Claude Opus 4.6 <[email protected]>
- repair: delete entire palace directory and create fresh PersistentClient
  instead of same-process delete_collection/create_collection which left
  corrupted HNSW ghost entries. Adds index verification query after rebuild.
- mcp_server: track chroma.sqlite3 inode in _get_client() to detect palace
  rebuilds on disk. Automatically reconnects when the database file changes,
  eliminating stale cache after repair/nuke operations.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
…skill docs

- Remove overly broad `\*[^*]+\*` regex from EMOTION_MARKERS that matched
  all markdown bold/italic, routing 66% of technical content to emotional
  room (MemPalace#536)
- Replace Unicode checkmark (U+2713) with ASCII '+' in convo_miner and
  split_mega_files progress output — crashes Windows cp1251/cp1252 (MemPalace#535)
- Fix KG path mismatch: MCP server now always uses palace_path for
  knowledge_graph.sqlite3 instead of diverging default path (MemPalace#538)
- Fix SKILL.md: mempalace --version doesn't exist, use mempalace status (MemPalace#534)
- Fix init instructions: add --yes flag for agent-friendly non-interactive
  init (MemPalace#534)

Co-Authored-By: Claude Opus 4.6 <[email protected]>
…compress keys

- Default new collections to cosine distance (hnsw:space=cosine) instead
  of L2 — fixes negative similarity scores (MemPalace#568)
- Filter unexpected MCP tool args before dispatch — prevents TypeError
  crash from extra params like top_k (MemPalace#572)
- Rotate WAL at 10 MB with one backup to prevent unbounded growth (MemPalace#573)
- Fix cmd_compress KeyError: align dict keys with compression_stats()
  return values (MemPalace#569)
- Fix spellcheck _load_known_names: read from "people" key, use dict
  keys as canonical names (MemPalace#570)
- Repair command uses cosine distance for rebuilt collection

Co-Authored-By: Claude Opus 4.6 <[email protected]>
- Purge now extracts drawers to keep, nukes the palace directory, and
  rebuilds with a fresh PersistentClient + cosine HNSW index. This
  prevents ghost entries from ChromaDB's in-place collection.delete()
  which caused segfaults on subsequent queries/inserts.
- Add FAT/exFAT inode caveat comment on _get_client() detection
- Clarify purge --room help: without --wing, purges across ALL wings

Co-Authored-By: Claude Opus 4.6 <[email protected]>
_extract_content now handles tool_use and tool_result content blocks
via _format_tool_use and _format_tool_result helpers. Join changed
from space to newline for proper tool block separation.

_try_claude_code_jsonl tracks tool_use_map, merges tool-result-only
user messages into the previous assistant turn, and merges consecutive
assistant messages from multi-turn tool loops.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Fixes ARM64 HNSW segfaults, Python 3.13/3.14 compat.
Auto-migrates existing databases, zero API breakage.
648 tests pass, 50K palace verified.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Both save and precompact hooks now auto-mine the JSONL transcript
directly into the palace before blocking the AI. This captures raw
tool output (Bash results, search findings, build errors) that the
AI would otherwise summarize away during its save cycle.

- Add MP_PYTHON auto-detection (MEMPAL_PYTHON env → repo venv → system)
- Add inline normalize → chunk → upsert pipeline to both hooks
- Skip file_already_mined — transcript grows, upsert is idempotent
- Update block reason messages to explicitly request verbatim tool output
- Precompact hook: parse transcript_path from input, fallback to session_id lookup
- Update README with two-layer capture docs and MEMPAL_PYTHON config

Co-Authored-By: Claude Opus 4.6 <[email protected]>
…b upgrade

- CLAUDE.md: test count 648, fork changes 8-13, PR MemPalace#562, hook descriptions
- README.md: hooks section reflects two-layer capture, chromadb >=1.5.4,
  normalize.py captures tool blocks, MEMPAL_PYTHON env var documented
- AGENTS.md: add normalize.py and hooks to key files section
- HOOKS_TUTORIAL.md: new sections for two-layer capture and configuration
- mempalace/README.md: normalize.py description updated, version.py added
- normalize.py: docstring updated for tool_use/tool_result capture

Co-Authored-By: Claude Opus 4.6 <[email protected]>
… PRs

Resolves merge conflict in mcp_server.py by keeping our improvements
(cached metadata, inode detection, WAL rotation, max_distance) and
integrating upstream's query_sanitizer (MemPalace#385) and context parameter.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
The old block reason said "save to your memory system" which Claude
interpreted as its internal ~/.claude/memory/ system instead of the
mempalace. Now explicitly names mempalace_add_drawer and
mempalace_diary_write MCP tools and includes a negative instruction
to NOT use Claude's internal memory.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
The negative instruction referenced ~/.claude/memory/ but Claude's
auto-memory actually lives at ~/.claude/projects/.../memory/.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Two bugs found when mining ~/.claude/projects/ transcripts:
- message field can be a string (not dict) in some JSONL entries
- Read tool offset/limit can be strings, causing TypeError on addition

Co-Authored-By: Claude Opus 4.6 <[email protected]>
…compress keys

- params null guard: request.get("params") or {} handles explicit null
- L2→cosine in search tool descriptions (collection uses hnsw:space=cosine)
- set_hook_setting mkdir before config write
- Guard empty tool_use_map key in normalize.py
- Correct searcher.py docstring: cosine distance range 0–2, useful range 0.3–1.0

Co-Authored-By: Claude Opus 4.6 <[email protected]>
…emPalace#637)

_SAFE_NAME_RE was ASCII-only ([a-zA-Z0-9]), rejecting valid Unicode
names like Pēteris, 日本語, Москва in KG writes and MCP tools.

Changed to \w which matches Unicode letters/digits in Python 3.
Path traversal, null bytes, and control chars still blocked by
the explicit checks before the regex.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
…ssertion

- miner.py: move chromadb/palace imports above logger to fix E402
- test_normalize.py: rename ambiguous var 'l' to 'line' (E741)
- test_hooks_cli.py: fix Windows assertion — check basename contains
  'python' instead of asserting venv-specific path components

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Local ruff 0.15 formats differently from CI's pinned 0.4.x range.
Reformatted all files to match CI expectations.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Users installing hooks don't know ~/.claude/projects/ contains
past JSONL transcripts that can be mined. Added prominent
"Backfill Past Conversations" section to hooks README and tutorial.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Hybrid search, graph cache, L1 optimization, decay/forgetting,
feedback loops, KG entity resolution, input sanitization.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
- New TODO #0: convo_miner files everything to 'sessions' wing,
  losing project identity from Claude Code transcript paths
- Added survey context: why staying on MemPalace, patterns to steal
  from Karta/Gigabrain/Codex/ByteRover
- Future section: contradiction detection, foresight signals, query
  classification (simpler versions of Karta's mechanisms)
- Updated "not pursuing" with dream engine, dual ANN, event sourcing

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Hybrid search (TODO #1): when vector results are poor (best distance
> 1.0), automatically falls back to keyword text-match via ChromaDB
where_document.$contains. Extracts most distinctive non-stopword token
from query, or accepts explicit keyword param. Results merged, deduped,
sorted by distance. MCP server exposes new keyword parameter.

Wing fix (TODO #0): _ingest_transcript() now derives project wing from
Claude Code transcript path (-Projects-<name>/) instead of hardcoding
"sessions". Per-project search now finds auto-mined content.

692 tests pass (22 new).

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Improves the resilience and observability of the MCP server’s ChromaDB collection accessor (_get_collection) by logging failures and attempting a one-time self-healing retry after clearing cached state.

Changes:

  • Log _get_collection exceptions instead of silently returning None.
  • Retry once after clearing cached client/collection/metadata state to recover from transient failures.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread mempalace/mcp_server.py Outdated
Comment on lines +314 to +317
except Exception as e:
logger.error(
"_get_collection attempt %d failed (palace=%s): %s",
attempt + 1, _config.palace_path, e,
Copy link

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error log here only includes str(e) and omits the traceback, which makes it much harder to debug the underlying ChromaDB failure. Consider using logger.exception(...) (or logger.error(..., exc_info=True)) so the full stack trace is emitted with the message.

Suggested change
except Exception as e:
logger.error(
"_get_collection attempt %d failed (palace=%s): %s",
attempt + 1, _config.palace_path, e,
except Exception:
logger.exception(
"_get_collection attempt %d failed (palace=%s)",
attempt + 1, _config.palace_path,

Copilot uses AI. Check for mistakes.
Comment thread mempalace/mcp_server.py
Comment on lines +284 to +324
for attempt in range(2):
try:
client = _get_client()
if create:
# hnsw:num_threads=1 disables ChromaDB's multi-threaded ParallelFor
# HNSW insert path, which has a race in repairConnectionsForUpdate /
# addPoint (see issues #974, #965). Set via metadata on fresh
# collections and re-applied via _pin_hnsw_threads() for legacy
# palaces whose collections were created before this fix (the
# runtime config does not persist cross-process in chromadb 1.5.x,
# so the retrofit runs every time _get_collection opens a cache).
raw = client.get_or_create_collection(
_config.collection_name,
metadata={
"hnsw:space": "cosine",
"hnsw:num_threads": 1,
**_HNSW_BLOAT_GUARD,
},
)
_pin_hnsw_threads(raw)
_collection_cache = ChromaCollection(raw)
_metadata_cache = None
_metadata_cache_time = 0
elif _collection_cache is None:
raw = client.get_collection(_config.collection_name)
_pin_hnsw_threads(raw)
_collection_cache = ChromaCollection(raw)
_metadata_cache = None
_metadata_cache_time = 0
return _collection_cache
except Exception as e:
logger.error(
"_get_collection attempt %d failed (palace=%s): %s",
attempt + 1, _config.palace_path, e,
)
_pin_hnsw_threads(raw)
_collection_cache = ChromaCollection(raw)
_metadata_cache = None
_metadata_cache_time = 0
elif _collection_cache is None:
raw = client.get_collection(_config.collection_name)
_pin_hnsw_threads(raw)
_collection_cache = ChromaCollection(raw)
_metadata_cache = None
_metadata_cache_time = 0
return _collection_cache
except Exception:
return None
if attempt == 0:
_client_cache = None
_collection_cache = None
_metadata_cache = None
_metadata_cache_time = 0
return None
Copy link

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This introduces new retry/self-healing behavior (clearing caches and retrying once) but there’s no unit test to lock in the expected semantics (e.g., first call raises, caches are cleared, second call succeeds; and the both-attempts-fail case still returns None). Since tests/test_mcp_server.py already covers related cache invalidation behavior, adding a small monkeypatch-based test would help prevent regressions.

Copilot uses AI. Check for mistakes.
… main

The "fix: _get_client() get-then-create guard" row in the fork-ahead
table was misleading — the guard landed in fork as d3a2d22 on
2026-04-18 but was silently dropped during the develop merge-conflict
resolution in 7e18a70 on 2026-04-25. Current fork main does not carry
it; we lean on metadata consistency (hnsw:space + hnsw:num_threads:1
+ _HNSW_BLOAT_GUARD applied uniformly across opens) to sidestep the
ChromaDB 1.5.x metadata-mismatch SIGSEGV without the defensive guard.

Issue MemPalace#1089 documented the pattern on 2026-04-21. PR MemPalace#1262 (filed
2026-04-28 by @Legion345) implements the canonical version — try
get_collection / except chromadb.errors.NotFoundError → create — which
is an improvement over our older InvalidCollectionException reference
(NotFoundError is the renamed-in-1.5.x exception class).

Once MemPalace#1262 merges, this row clears via develop sync. Until then the
fork is exposed if upstream ever changes default hnsw:* metadata
between releases — the same drift shape that caused the original
crash.
jphein added a commit to jphein/palace-daemon that referenced this pull request Apr 30, 2026
…nding queue; trim patch

Filed four upstream PRs on 2026-04-30:
- rboarescu#15  feat: GET /viz status dashboard (stacks on rboarescu#13)
- rboarescu#16  feat: GET /list — query-free metadata browse
- rboarescu#17  feat: DELETE /memory + PATCH /memory
- rboarescu#18  feat(lifespan): auto-migrate Stop-hook checkpoints on startup

Also rebased PR rboarescu#13 onto upstream/main to clear a CHANGELOG conflict
left by upstream's b4aee82 (patch sync) — state went CONFLICTING ->
MERGEABLE / CLEAN.

README:
- Open upstream PRs table: four new rows (rboarescu#15-rboarescu#18) plus a 2026-04-30
  note covering today's rebase + new PRs in one breath.
- Pending PRs queue: now empty. Replaced the four stale rows
  (event-log-frame and graph-endpoint were already in flight via
  rboarescu#11 and rboarescu#13; mempal-fast.py was already upstream via the merged
  PR #4 omnibus; /viz is now PR rboarescu#15) with a brief empty-state note.

CLAUDE.md:
- Patch description trimmed to reflect that the hnsw:num_threads
  enforcement landed upstream via _pin_hnsw_threads(); only the
  log + retry-once slice remains.

patches/mcp_server_get_collection.patch:
- Regenerated against current mempalace develop. The patch is now
  just the "log exception + retry once on cache failure" change.
  Filed upstream as MemPalace/mempalace#1286; once that merges this
  patch retires.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
jphein added a commit to jphein/palace-daemon that referenced this pull request Apr 30, 2026
README:
- Intro paragraph: appended GET /list (PR rboarescu#16), DELETE/PATCH /memory
  (PR rboarescu#17), and lifespan auto-migrate (PR rboarescu#18) to "what this fork
  adds" since they're already on fork main even though the PRs are
  awaiting review upstream.
- Links bar: added docs/typescript-port-plan.md.
- Added a "Cross-repo coordination" subsection under "Recently landed
  in upstream" — calls out the local patches/ directory and the
  in-flight MemPalace/mempalace#1286 that retires it. Also references
  #1142 (RELEASING.md) for completeness.
- Requirements: dropped the stale "kind= searcher filter" justification
  for recommending the fork (kind= was retired in 1.7.1); replaced
  with daemon-strict hook mode + warnings/sqlite-fallback search path
  as the actual current reasons. Added the patches/ re-apply step.
- API table: added /list, DELETE /memory/{id}, PATCH /memory/{id}
  rows so the table reflects what the fork main actually exposes.
- Sources: cross-repo PR note for #1286.

CHANGELOG:
- Added [Unreleased] section: patch trim (Maintenance), TS port plan
  doc, hook-routing-fix.md status header, and a Docs entry covering
  today's README updates + PR rboarescu#13 rebase context.

docs/hook-routing-fix.md:
- Added a Status: SHIPPED header pointing at 62425e3 (2026-04-24
  clients/hook.py introduction) and clarifying mempal-fast.py is the
  simpler successor — same shape as docs/graph-endpoint.md's status
  header.

No daemon behaviour changes.
@jphein jphein force-pushed the fix/get-collection-log-retry branch from 4b2dfd6 to 2038269 Compare April 30, 2026 20:42
jphein added a commit to jphein/mempalace that referenced this pull request Apr 30, 2026
Two issues:

1. logger.error("...", str(e)) omitted the traceback, making the
   underlying ChromaDB failure hard to debug. Switch to logger.exception
   so the full stack trace lands in the log alongside the message.
   Same context lines, just with exc_info coming from sys.exc_info()
   automatically.

2. No unit test pinned the new retry/self-healing semantics. Added
   TestGetCollectionRetry (tests/test_mcp_server.py) with three cases:

   - test_retry_succeeds_on_second_attempt: first _get_client raises,
     caches are cleared between attempts (verified by inspecting
     planted sentinels), second attempt returns the wrapped collection.
   - test_returns_none_when_both_attempts_fail: two failures in a row
     return None and stop after exactly two attempts.
   - test_happy_path_does_not_retry: a successful first attempt
     returns immediately and never invokes _get_client a second time.

   All three use monkeypatch on _get_client / ChromaCollection /
   _pin_hnsw_threads so they don't need a real ChromaDB; they run in
   ~0.4s end-to-end and don't touch disk.
@jphein jphein requested a review from igorls as a code owner April 30, 2026 20:48
…call sites

The earlier 2026-04-30 wording said current fork main calls
get_or_create_collection with stable/consistent metadata, sidestepping
the ChromaDB 1.5.x metadata-mismatch SIGSEGV without carrying the
defensive guard. That's not actually true.

Audited the four get_or_create_collection call sites on current fork
main:

- backends/chroma.py:1040     full hnsw:* set (space + num_threads + bloat-guard)
- mcp_server.py:295           full hnsw:* set
- migrate.py:235              no metadata
- cli.py:905                  no metadata

The trigger condition for the 1.5.x metadata-mismatch SIGSEGV (call-
site metadata diverging from stored collection metadata) is
structurally present. The fork's apparent stability is presumably
because the no-metadata sites operate on different collections
(mempalace_compressed, the migrate temp collection) than the full-
metadata sites (mempalace_drawers) — so the divergence never reaches
the same row in the same metadata table.

That's a load-bearing assumption worth flagging: it's circumstantial,
not structural. Once MemPalace#1262 (Legion345) lands, audit all four sites
together so the guard wraps every site rather than just chroma.py.
@jphein
Copy link
Copy Markdown
Collaborator Author

jphein commented Apr 30, 2026

Pushed 757eb7a addressing both Copilot notes:

  • Switched logger.error("...", str(e)) to logger.exception("...") so the traceback lands in the daemon log alongside the message instead of being dropped. Same context lines, just with exc_info from sys.exc_info() automatically.

  • Added TestGetCollectionRetry (tests/test_mcp_server.py) with three monkeypatch tests that pin the new semantics:

    • test_retry_succeeds_on_second_attempt — first _get_client raises, caches are cleared between attempts (verified by inspecting planted sentinels in _client_cache / _collection_cache), second attempt returns the wrapped collection.
    • test_returns_none_when_both_attempts_fail — two failures in a row return None and stop after exactly two attempts.
    • test_happy_path_does_not_retry — a successful first attempt returns immediately and never invokes _get_client a second time.

    All three use monkeypatch on _get_client / ChromaCollection / _pin_hnsw_threads so they don't need a real ChromaDB; they run in ~0.4s end-to-end and don't touch disk.

Also fixed a ruff format lint failure from the previous push (multi-arg logger.exception call needed each arg on its own line under the project's ruff config). Full CI matrix now green — test-linux 3.9/3.11/3.13, test-macos, test-windows, lint. Thanks for the careful review.

jphein and others added 3 commits April 30, 2026 14:19
The 2026-04-30 wordings in 393ac89 ("metadata is consistent across opens
sidesteps the SIGSEGV") and 260e2a2 ("non-uniform metadata across four
call sites, trigger condition structurally present") both missed the
mark. Third try here, traced from the code:

- chromadb's get_or_create_collection is called directly from two
  places: backends/chroma.py:1061 and mcp_server.py:295. Both pass
  identical metadata (hnsw:space, hnsw:num_threads: 1, bloat-guard).

- The fork's ChromaBackend.get_or_create_collection(path, name) at
  chroma.py:1101 is a 2-arg legacy shim that delegates via
  self.get_collection(palace_path, name, create=True) to the same
  metadata-passing line.

- migrate.py:237 and cli.py:1002 call that 2-arg shim, so they reach
  chromadb with full metadata too.

So metadata IS consistent across all five reachable call sites, not
divergent. The earlier "non-uniform" claim mistook the fork's API
surface for chromadb's call signature.

The residual SIGSEGV exposure is legacy palaces created under older
chromadb whose stored metadata differs from the current expanded set —
a per-palace-history risk, not a structural call-graph one.

Live MemPalace#1262 comment and MemPalace#1089 issue body have been re-PATCHed to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
… update sync status

MemPalace#1262 (Legion345) is path 1 of MemPalace#1089's "interim guard PR" — adds
get-then-create wrapping in chromadb backend. Shepherding via review
comment posted 2026-04-30. Once it merges, fork-ahead Row 15 clears
via develop sync.

MemPalace#1286 (our PR, filed 2026-04-30) is the _get_collection retry-once +
log-on-failure improvement. Adjacent to Row 15 — when both MemPalace#1262 and
MemPalace#1286 land, the _get_collection path is both crash-resilient and
self-healing.

Status line refreshed: develop has moved to fdfaf01 (Gemini CLI
normalize MemPalace#1234, privacy consent MemPalace#1233, both 2026-04-27); next sync
will clear those plus row 15 once MemPalace#1262 merges.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Fork main has been failing CI lint since 2026-04-30 04:04 UTC.
Three real issues + 6 format-needs files. All pre-existing on main
before today's doc work but inherited by every commit since. Cleared
in one pass:

- F401 mempalace/miner.py:19 — unused `typing.Optional` import,
  auto-fixed by `ruff --fix`.
- F811 tests/test_searcher.py — `class TestBM25NoneSafety` was
  literally duplicated (lines 205 and 243, identical name + docstring
  + body). Likely copy-paste during 7ba28dc (kind= filter retirement).
  Deleted the second copy; verified the surviving class covers the
  same 3 None-safety tests.
- C901 mempalace/searcher.py:721 — `search_memories` complexity
  27 vs 25 ceiling. Added `# noqa: C901` with reason: this fork-only
  function is a fallback orchestration with several legitimate
  branches (BM25 top-up, warnings, closet-boost) and 27 isn't
  pathological; refactor would just split for the linter, not for
  readability.
- ruff format applied to 6 files (mcp_server, miner, searcher,
  render-docs, test_mcp_server, test_migrate) — mechanical reformat
  from prior commits that didn't run format.

`ruff check .` and `ruff format --check .` both clean. Searcher test
suite green (29/29).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
@igorls igorls added area/mcp MCP server and tools bug Something isn't working storage labels May 2, 2026
@igorls igorls added this to the v3.3.5 milestone May 2, 2026
@igorls
Copy link
Copy Markdown
Member

igorls commented May 2, 2026

Thanks @jphein — retry-once with cache-clear on _get_collection is correct and complementary to the recent #1303/#1299/#1289 work (those fixed correctness on first open; this adds resilience after a transient failure). Local tests for TestGetCollectionRetry pass.

One blocker: rebase against develop — the diff was authored against the pre-#1303 body of _get_collection. Develop's current implementation (mcp_server.py ~L276-338) does EF resolution via ChromaBackend._resolve_embedding_function() and a get_collection/create_collection split guarded by _ChromaNotFoundError. The retry loop needs to wrap that shape, not get_or_create_collection. CONFLICTING right now.

Once rebased and CI is green we can merge for v3.3.5.

jphein and others added 4 commits May 3, 2026 09:33
# Conflicts:
#	mempalace/mcp_server.py
#	mempalace/searcher.py
#	tests/test_mcp_server.py
…Palace#1289/MemPalace#1303

After the 2026-05-03 develop sync to commit 1888b67:
- Row 15 (_get_client get-then-create guard) clears: MemPalace#1262 fixed
  ChromaBackend; MemPalace#1289 fixed mcp_server's parallel call site; MemPalace#1303
  plumbed embedding_function= through the reopen path.
- Trailing tracking paragraph updated (sync target, count of merged/
  open/closed PRs, list of features brought in).
- Upstream-PRs table: MemPalace#1262 → merged, MemPalace#1286 flagged CONFLICTING with
  rebase queued, MemPalace#1289 + MemPalace#1303 added as merged.

Note: fork-only _get_session_recovery_collection still uses the older
get_or_create_collection pattern; theoretical SIGSEGV exposure on
legacy recovery collections only. Tracked as follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…attern to recovery collection

`_get_session_recovery_collection` (fork-only, introduced for the
checkpoint collection split in row 23) used the older
`get_or_create_collection` pattern that MemPalace#1262/MemPalace#1289/MemPalace#1303 just
fixed in `_get_collection`. Same theoretical SIGSEGV class on
chromadb 1.5.x when stored metadata diverges from the call site,
plus the `embedding_function=` omission MemPalace#1303 caught.

Both branches now resolve EF via `ChromaBackend._resolve_embedding_function()`
and the create-path uses try `get_collection` / except
`NotFoundError` → `create_collection`. Recovery collection's
metadata is intentionally lighter than the main collection's
(no `_HNSW_BLOAT_GUARD`); preserved that asymmetry — recovery is
small enough that segment bloat isn't a concern.

Existing `test_session_recovery.py` covers happy paths; 83/83 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
A small improvement to `_get_collection`: log the exception on failure
(instead of returning `None` silently) and retry once after clearing
the cached client and collection. Behaviour on the happy path is
unchanged.

## Why

The current shape (`try: ... except Exception: return None`) has two
costs in practice on a long-running deployment:

1. Invisible incidents. When a transient ChromaDB error poisons the
   cached `_client_cache` / `_collection_cache`, every subsequent
   call returns `None` but the operator never sees the underlying
   exception in the log. Debugging has to start from the symptoms —
   404s, "no palace" responses — without the actual stack.
2. No self-healing. Once the cache is stale, the only way back to a
   working state is a process restart. A single retry that clears
   the caches first resolves the most common shape (a transient
   error left bad state behind) without operator intervention.

## What changes

- One `logger.exception(...)` per failed attempt, including the
  palace path so multi-palace deployments can disambiguate.
- One retry after the first failure, with `_client_cache`,
  `_collection_cache`, `_metadata_cache`, `_metadata_cache_time`
  reset before the second attempt so the retry rebuilds from
  scratch rather than reusing a poisoned handle.
- Happy path (first attempt succeeds) is unchanged: same return,
  same caching, no extra log line.

## Why this is still useful after MemPalace#1289 / MemPalace#1303

MemPalace#1289 split `get_or_create_collection` into `get_collection` +
`create_collection` (the metadata-mismatch SIGSEGV path); MemPalace#1303
plumbed `embedding_function=` through both reopen branches. Those
fix specific *crash* shapes. This PR addresses the *aftermath* —
when an error of any class poisons the client/collection cache and
every subsequent call returns `None` invisibly. Retry-once
self-heals; logging makes the underlying exception visible. The
retry loop sits *above* the get-then-create logic, so retries
benefit from MemPalace#1289's correct call shape and MemPalace#1303's EF binding.

## Rebased on top of MemPalace#1289 + MemPalace#1303

This PR was filed before MemPalace#1289/MemPalace#1303 merged. The original two
commits (2038269 + 757eb7a) wrapped the older `get_or_create`
body. Rebased onto the new `try get_collection / except
NotFoundError → create_collection` body — the retry/log envelope
is unchanged, only the inner body picks up the merged fixes.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
@jphein jphein force-pushed the fix/get-collection-log-retry branch from 757eb7a to 51780ce Compare May 3, 2026 17:56
@jphein
Copy link
Copy Markdown
Collaborator Author

jphein commented May 3, 2026

Rebased onto develop after #1289 and #1303 merged. The retry/log envelope is unchanged; the inner body now sits on top of the get-then-create + embedding_function= plumbing those PRs landed, so retries benefit from the corrected call shape rather than re-attempting a known-bad pattern. CI green locally, ready for another look when convenient.

@igorls
Copy link
Copy Markdown
Member

igorls commented May 6, 2026

Hi @jphein — thanks for carrying this; the underlying retry+log idea is sound and composes cleanly with #1322 (clearing _client_cache on attempt 1 forces a fresh make_client() which re-runs quarantine_stale_hnsw).

I'd like to land that piece for v3.3.5 (due 2026-05-08), but this PR can't merge as-is — it carries fork-only files (FORK_CHANGELOG.md, docs/fork-changes.yaml, docs/superpowers/**, scripts/{deploy,preflight,rebase-on-develop}.sh), bumps version.py / pyproject.toml to 3.3.4 without the matching marketplace.json / plugin.json / codex-plugin/plugin.json (which is what check-versions is catching), and the convo_miner.py rewrite regresses test_convo_mining + test_mine_convos_rebuilds_stale_drawers_after_schema_bump on all three OSes (Linux/macOS/Windows go red on those two tests where develop is green).

I opened #1377 with just the surgical _get_collection retry, credited as Co-authored-by: Jeffrey Hein <[email protected]>. It includes two unit tests for the retry path and ships the cache reset between attempts so a stale handle self-heals on the second call.

Closing this one in favour of #1377 — sorry for the friction. The other genuinely-useful pieces buried in this PR are very welcome as separate follow-ups against current develop:

If you'd rather I cherry-pick any of those into surgical PRs the way I did with the retry, ping me and I will. Either way thanks for the diagnosis on the underlying bug — easy to miss.

@igorls igorls closed this May 6, 2026
igorls added a commit that referenced this pull request May 6, 2026
…xception

fix(mcp): retry _get_collection once on transient failure (#1286)
YulelogPagoda pushed a commit to YulelogPagoda/mempalace that referenced this pull request May 6, 2026
…1286)

A transient chromadb exception inside `_get_collection` was swallowed by
the bare `except Exception: return None`, leaving every subsequent tool
call hitting the same poisoned cache silently. The fix wraps the body
in a `for attempt in range(2)` loop: on attempt 0 failure, log via
`logger.exception(...)` and clear `_client_cache` / `_collection_cache`
/ `_metadata_cache` so the next iteration forces `_get_client()` to
rebuild from scratch — that path now re-runs `quarantine_stale_hnsw`
(per MemPalace#1322), so the second attempt heals the common stale-handle case
automatically. If both attempts fail, return `None` (matches the prior
contract for permanent failures).

Two new tests in `tests/test_mcp_server.py::TestCacheInvalidation`:
- `test_get_collection_retries_once_on_exception` — first attempt raises
  via a monkeypatched `_get_client`, second attempt succeeds; assert the
  caller gets the collection back, not None.
- `test_get_collection_returns_none_after_two_failures` — both attempts
  fail, assert we exhaust the loop and return None (no infinite retry).

Surgical extraction from PR MemPalace#1286, which carried the same fix idea
(plus a fork-sync bundle that couldn't be merged); credit to the
original author below.

Co-authored-by: Jeffrey Hein <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/mcp MCP server and tools bug Something isn't working storage

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants