Skip to content

fix(searcher): tolerate None documents in BM25 reranker#1198

Merged
igorls merged 2 commits intoMemPalace:developfrom
jphein:fix/tokenize-none-guard
Apr 26, 2026
Merged

fix(searcher): tolerate None documents in BM25 reranker#1198
igorls merged 2 commits intoMemPalace:developfrom
jphein:fix/tokenize-none-guard

Conversation

@jphein
Copy link
Copy Markdown
Collaborator

@jphein jphein commented Apr 25, 2026

Problem

searcher._tokenize calls text.lower() unconditionally. When ChromaDB returns a drawer with documents containing None, the hybrid-rerank path raises AttributeError mid-search, dropping the entire query response.

Observed in production on 2026-04-24 21:07:05:

File "mempalace/searcher.py", line 81, in _bm25_scores
    tokenized = [_tokenize(d) for d in documents]
File "mempalace/searcher.py", line 52, in _tokenize
    return _TOKEN_RE.findall(text.lower())
AttributeError: 'NoneType' object has no attribute 'lower'

This closes the gap left by #999's None-metadata audit, which fixed metadata read loops but did not extend to BM25 reranker helpers.

Fix

_tokenize short-circuits to [] when text is falsy (None or ""). A None doc gets BM25 score 0.0; the rest of the corpus reranks normally.

Tests

Three regression tests in TestBM25NoneSafety:

  1. test_tokenize_handles_none — direct unit guard
  2. test_tokenize_handles_empty_string — covers the empty-string case at the same boundary
  3. test_bm25_scores_does_not_crash_on_none_documents — load-bearing: mixed corpus with a None mid-list, asserts score 0.0 for that doc and finite scores for the rest. This is the exact production failure shape.

All existing searcher and hybrid_search tests still pass (pytest tests/test_searcher.py tests/test_hybrid_search.py → 30/30 against develop after cherry-pick).

🤖 Generated with Claude Code

`_tokenize` calls `text.lower()` unconditionally; when ChromaDB returns a
drawer with `documents` containing `None`, the hybrid-rerank path raises
`AttributeError: 'NoneType' object has no attribute 'lower'`.

Observed in production daemon log (2026-04-24 21:07:05) during a search
that triggered `_hybrid_rank → _bm25_scores → _tokenize`:

    File "mempalace/searcher.py", line 81, in _bm25_scores
        tokenized = [_tokenize(d) for d in documents]
    File "mempalace/searcher.py", line 52, in _tokenize
        return _TOKEN_RE.findall(text.lower())
    AttributeError: 'NoneType' object has no attribute 'lower'

Closes the gap left by the upstream None-metadata audit (MemPalace#999), which
covered metadata loops but not BM25 helpers. Returns `[]` for falsy input
so a None doc gets score 0.0 while the rest of the corpus reranks normally.

Three regression tests in TestBM25NoneSafety lock the behavior and reference
the production trace.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Copilot AI review requested due to automatic review settings April 25, 2026 13:28
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens the hybrid BM25 re-ranker against None documents returned by ChromaDB so searches don’t crash mid-rerank in production.

Changes:

  • Add a None/empty guard in searcher._tokenize so missing documents produce no tokens (and thus BM25 score 0.0).
  • Add regression tests covering _tokenize(None), _tokenize(""), and _bm25_scores with a mixed corpus containing a None.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
mempalace/searcher.py Makes _tokenize tolerate falsy input to prevent AttributeError during BM25 scoring.
tests/test_searcher.py Adds regression tests to ensure BM25 helpers don’t crash when Chroma returns None documents.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread mempalace/searcher.py
Comment on lines 48 to +52
def _tokenize(text: str) -> list:
"""Lowercase + strip to alphanumeric tokens of length ≥ 2."""
"""Lowercase + strip to alphanumeric tokens of length ≥ 2.

Tolerates ``None`` documents — Chroma can return ``None`` in the
``documents`` field for drawers without text content, which would
Copy link

Copilot AI Apr 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_tokenize now intentionally accepts None (and tests cover it), but the type annotation still says text: str. Please update the signature/return typing to reflect the supported inputs (e.g., str | None) and consider narrowing the guard to the intended cases (None/empty string) so unexpected non-string falsy values don’t get silently treated as empty text.

Copilot uses AI. Check for mistakes.
Comment thread tests/test_searcher.py
Comment on lines +150 to +154
from mempalace.searcher import _bm25_scores
scores = _bm25_scores("postgres migration", ["postgres migration done", None, "kafka rebalance"])
assert len(scores) == 3
assert scores[1] == 0.0
assert scores[0] > 0.0
Copy link

Copilot AI Apr 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This regression test claims “finite scores for the rest”, but it only asserts on scores[0] and the None doc. Consider also asserting that the other score(s) are finite (and/or the expected value, e.g. 0.0 for the non-matching doc) to make the test match its stated guarantee and catch potential NaN/inf regressions.

Copilot uses AI. Check for mistakes.
jphein added a commit to jphein/mempalace that referenced this pull request Apr 25, 2026
After today's work:

- Merged upstream/develop (29 commits), absorbing MemPalace#976 (HNSW + mine_global_lock + PreCompact attempt-cap), MemPalace#1179 (CLI BM25), MemPalace#1180/MemPalace#1182/MemPalace#1183/MemPalace#1185 features.
- Filed PR MemPalace#1198 (`_tokenize` None guard) upstream.
- Closed PR MemPalace#1171 (adapter-level flock) — superseded by MemPalace#976 + daemon-strict.
- palace-daemon migration commits (c09582c + 0e97b19) make the daemon the
  single canonical writer; in-tree adapter flock no longer carries its weight.

README:
- Lead paragraph: 2026-04-21 → 2026-04-25 sync date, 165,632 → 151,420
  drawer count, daemon-on-disks topology, listed absorbed upstream PRs.
- Fork change queue: dropped MemPalace#1171 row, added MemPalace#1198 row.
- "Multi-client coordination" section: rewrote — palace-daemon is now the
  fork's primary answer (was "deferred"), MemPalace#1171 retired, narrative shifted
  to defense-in-depth around MemPalace#976 + daemon.
- Two-layer memory model: storage path is now palace-daemon HTTP, not
  ~/.mempalace/palace.
- Ecosystem palace-daemon row: marks integration as complete (commits
  cited).
- Open upstream PRs table: refreshed to MemPalace org URLs, added MemPalace#1198
  row, removed MemPalace#1171, added 2026-04-25 develop sync to merged list, added
  MemPalace#1171 to closed list with rationale.
- Test count 1096 → 1334.

CLAUDE.md:
- Row 20 (MemPalace#1198): "Upstream PR pending" → "Filed as MemPalace#1198 on 2026-04-24".
- Row MemPalace#1171 in PR table: open → closed (with MemPalace#976 supersession).
- Added MemPalace#1198 row to PR table.
- Top-of-section count: "14 merged, 7 open, 9 closed" → "14 merged, 7 open
  (including MemPalace#1198), 10 closed (added MemPalace#1171)" + sync date 2026-04-25.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
CI lint job runs `ruff format --check`; the new tests in TestBM25NoneSafety
needed the standard "blank line after import-inside-function" + line-length
wrap. No logic change — formatter pass only.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
jphein added a commit to jphein/mempalace that referenced this pull request Apr 25, 2026
Two new scripts plus legitimate fixes they surfaced.

scripts/preflight.sh — runs the same checks CI does (ruff check, ruff
format --check, pytest). Prevents the class of "I ran ruff format locally
but CI runs ruff check too" bugs we hit on PR MemPalace#1024 and PR MemPalace#1198 today.

scripts/rebase-on-develop.sh — rebases a fork PR branch onto
upstream/develop, then auto-restores fork-only collateral
(.claude-plugin/hooks/*.sh, .claude-plugin/.mcp.json) to upstream's
state. The collateral cleanup was the recurring step I had to remember
manually for each of MemPalace#1005/MemPalace#1024/MemPalace#1177; this codifies it. Supports
--finish for the "after I resolved conflicts manually" continuation
path. Never auto-pushes — prints the push command for confirmation.

Lint debt the new preflight caught on main:

- tests/test_palace_graph.py: F811 — `invalidate_graph_cache` was
  imported at line 9, then re-imported inside the chromadb-mock
  patch.dict block at line 44. Removed the duplicate; the line-9
  import is sufficient since the autouse fixture at line 12 already
  uses it.
- mempalace/backends/chroma.py: ruff format wanted a one-line
  `collection.modify(configuration=...)` call instead of the wrapped
  multi-line form.
- mempalace/hooks_cli.py: ruff format wanted blank line after import
  in the daemon-URL try block, and dict-literal style for the
  json.dumps call.

CI on jphein/mempalace runs both ruff check and ruff format --check;
without these fixes the next push to main would have shown red.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Copy link
Copy Markdown
Collaborator

@bensig bensig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approve, no asks. Defensive single-line fix with a real production trace driving it.

What landed

`_tokenize` now returns `[]` for falsy input instead of raising `AttributeError` on `None.lower()`. The trace in the test docstring is genuine (daemon log 2026-04-24 21:07:05) — Chroma was returning `None` for drawers without text content during a hybrid-rerank pass, and the BM25 path crashed mid-search.

Test coverage

The new `TestBM25NoneSafety` class covers both the `None` and empty-string paths and includes the failure trace in the docstring so a future reader knows exactly what symptom this prevents. `pytest tests/test_searcher.py` — 27 passed on this branch.

Minor

Could also have hardened `_bm25_scores` upstream of `_tokenize` to filter `None` documents before iterating. The chosen approach (defend at the leaf function) is more defensive and catches any other call site that might pass `None` — better choice for a small surface like this.

Ship it.

jphein added a commit to jphein/mempalace that referenced this pull request Apr 26, 2026
…#1173 status

CLAUDE.md row 28 documents the canonical-source pipeline (5a01aec)
that landed earlier today. PR table updated to reflect bensig's
2026-04-26 approvals on four of our open PRs:

  - MemPalace#1173 (HNSW quarantine wire): approved on the original 1-commit
    shape, then force-pushed two safety commits (cold-start gate +
    integrity sniff-test) after a production cold-start incident
    destroyed three healthy 253MB segments. Now mergeable=CONFLICTING
    against develop (which moved with MemPalace#1210, MemPalace#1205, MemPalace#976) — needs
    rebase + re-review.
  - MemPalace#1177, MemPalace#1198, MemPalace#1201: approved, mergeable, awaiting merge.

YAML manifest gets a new entry for the doc pipeline itself; FORK_CHANGELOG
regenerated to match. promises.md (in claude-config, not this repo)
got a long log entry covering today's full output.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
@jphein
Copy link
Copy Markdown
Collaborator Author

jphein commented Apr 26, 2026

Thanks @bensig — ready for merge whenever convenient. Tests stable on the fork's daily-driver palace + smoke pass on the develop branch.

jphein added a commit to jphein/mempalace that referenced this pull request Apr 26, 2026
…rry-pick of MemPalace#1094)

Fork main was carrying the per-site `meta = meta or {}` guards from MemPalace#999
in eight read paths but didn't have the boundary coercion that closes
the issue once for all callers.

Cherry-picked MemPalace#1094 (open upstream, jp-authored) so the typed
QueryResult/GetResult contract — both declare `metadatas: list[dict]`,
never `list[Optional[dict]]` — is honored at the chromadb adapter
boundary. Three same-family fork-ahead PRs we filed since (MemPalace#1198
_tokenize None-doc, MemPalace#1201 palace_graph None metadata, MemPalace#1199 review)
all pointed at gaps that would have been impossible if this pattern
had been in place. Future None-metadata bugs of the same shape are
now catchable at one site instead of N.

Per-site guards are kept as belt-and-suspenders for paths that might
bypass the typed wrappers (e.g. legacy `import chromadb` direct calls).
A future cleanup PR can remove them once we're confident no caller
reaches chromadb directly.

6 new tests in test_backends.py (MemPalace#1094 originals); composes cleanly with
fork main's _segment_appears_healthy + _quarantined_paths from earlier
today. Full suite 1383/1383.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Copy link
Copy Markdown
Member

@igorls igorls left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@igorls igorls merged commit 414aa3e into MemPalace:develop Apr 26, 2026
6 checks passed
jphein added a commit to jphein/mempalace that referenced this pull request Apr 26, 2026
Brings in upstream's corpus-origin + privacy-warning track (PRs MemPalace#1211
MemPalace#1221 MemPalace#1223 MemPalace#1224 MemPalace#1225) plus the canonical merged versions of our four
PRs that landed today (21:22-21:41 UTC):

  MemPalace#1173  quarantine_stale_hnsw on make_client + cold-start gate +
         integrity sniff-test (PROTO/STOP byte check, no deserialization)
  MemPalace#1177  .blob_seq_ids_migrated marker guard, closes MemPalace#1090
  MemPalace#1198  _tokenize None-document guard in BM25 reranker
  MemPalace#1201  palace_graph.build_graph skips None metadata

Conflict resolution:

* mempalace/backends/chroma.py — took upstream as base (it has the
  igorls-review pickle-protocol docstring, thread-safety paragraph, and
  Path(marker).touch() style nit), then re-applied MemPalace#1094's _coerce_none_metas
  in query()/get() since MemPalace#1094 is still open and not yet in develop.

* mempalace/mcp_server.py — took upstream's clean form. Dropped the
  fork-only `palace_path=` kwarg from four ChromaCollection() call sites:
  the kwarg was load-bearing for MemPalace#1171's per-collection write lock, but
  MemPalace#1171 closed in favor of MemPalace#976's mine_global_lock + daemon-strict, so
  the kwarg has no remaining consumer. ChromaCollection.__init__ in
  upstream/develop is back to (self, collection); calling it with
  palace_path= raised TypeError → silently swallowed by the broad except
  in _get_collection() → returned None → tool_status() returned _no_palace().
  41 mcp_server tests went from failing-with-KeyError to passing.

* mempalace/cli.py — dropped fork-only `workers=args.workers` from the
  cmd_mine -> miner.mine() call. Pre-existing fork-side bug: the
  `--workers` argparse arg landed in 5cd14bd but miner.mine() never
  accepted a workers param, so production `mempalace mine` TypeError'd
  on every invocation. Removed the broken plumbing; tests/test_cli.py
  updated to match.

* CHANGELOG.md — took upstream verbatim. Fork-specific changelog lives
  in FORK_CHANGELOG.md (canonical: docs/fork-changes.yaml).

* CLAUDE.md — kept ours. Fork's CLAUDE.md is operational; upstream's
  added a "Design Principles / Contributing" charter, which lives in
  README.md on the fork.

* tests/test_backends.py — took upstream's ruff-formatted line widths.

docs/fork-changes.yaml flips the two MemPalace#1173 entries (hnsw-integrity-gate,
hnsw-cold-start-gate) and the MemPalace#1201 entry (palace-graph-none-guard) from
OPEN to MERGED 2026-04-26. MemPalace#1173 MemPalace#1177 MemPalace#1198 MemPalace#1201 added to the
merged_upstream archive at the bottom. FORK_CHANGELOG.md regenerated.

scripts/check-docs.sh: 4/4 clean.
Test suite: 1460/1460.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
jphein added a commit to jphein/mempalace that referenced this pull request Apr 27, 2026
Upstream merged MemPalace#1173, MemPalace#1177, MemPalace#1198, MemPalace#1201 into develop on 2026-04-26
(picked up by the 2026-04-27 develop sync). Strike out their fork-ahead
rows in CLAUDE.md and add to the "Merged into upstream" section. Update
PR table accordingly: 18 merged (was 14), 7 open (was 8). Bump version
note: upstream shipped v3.3.3 on 2026-04-24 (carries our MemPalace#659/MemPalace#1021);
v3.3.4 prep branch in flight.

README:
- "tracks upstream/develop through the 2026-04-27 sync" (was 2026-04-26)
- 17 fork-ahead changes (was 22)
- 1,510 tests pass on main (was 1,395 — upstream brought new suites)
- "Open upstream PRs" 7 entries (was 11)
- Drop merged rows from "Fork-ahead — open or pending" table; keep
  PreCompact recovery-marker row (still fork-only)

scripts/check-docs.sh: clean (90 PR refs match upstream state, 13 fork
hashes resolve, FORK_CHANGELOG renders clean from YAML).

docs/fork-changes.yaml: no edit needed — already had merged_date on the
4 entries from the 2026-04-26 commit `5fd15db`.
Scorpion1221 pushed a commit to Scorpion1221/mempalace that referenced this pull request May 2, 2026
…lace#1198)

Merged upstream None guard into our CJK-aware tokenizer.
Scorpion1221 pushed a commit to Scorpion1221/mempalace that referenced this pull request May 2, 2026
Sync with upstream/develop via cherry-pick of 9 critical bugfixes:
- HNSW index bloat prevention (MemPalace#1191)
- SIGSEGV guards on collection reopen (MemPalace#1262, MemPalace#1289)
- blob-seq marker fast-path (MemPalace#1177)
- palace_graph None-metadata guard (MemPalace#1201)
- palace_graph security tunnels (MemPalace#1168)
- hyphenated wing name normalization (MemPalace#1197)
- searcher _tokenize None guard (MemPalace#1198)
- mcp_server diary topic sanitize (MemPalace#936)

Tests: 1459 default + 113 benchmark + 6/7 stress all pass.
(random-kill stress test was failing on dev pre-merge; not regressed.)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants