fix: resolve 6 community-reported bugs #518
leopechnicki wants to merge 3 commits into MemPalace:develop from
Conversation
Six fixes in one PR is a lot to review atomically — but the fixes themselves are all correct and well-scoped. Running through each:

- #475 (float mtime epsilon)
- #476 (entity detector stopwords)
- #477
- #478 (pagination for status/taxonomy)
- #479 (duplicate cache declaration)
- #503 (Windows UTF-8): this matches our recommendation from yesterday.

Overall a solid batch.
- Fix float mtime comparison breaking file deduplication (MemPalace#475)
- Add common tech terms to entity detector STOPWORDS (MemPalace#476)
- Clamp tool_search limit to 1-100 to prevent memory exhaustion (MemPalace#477)
- Paginate metadata reads to fix 10K drawer truncation (MemPalace#478)
- Remove duplicate _client_cache/_collection_cache declarations (MemPalace#479)
- Force UTF-8 on Windows stdin/stdout for CJK content (MemPalace#503)

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…city, schema docs

- Move _get_all_metadatas to palace.py as shared get_all_metadatas with METADATA_PAGE_SIZE constant; miner.py and mcp_server.py both use it now
- Replace overly broad "azure"/"notion" stopwords with "azure_functions"/"notion_api" to avoid false negatives on legitimate entity names
- Update tool_search schema description to document the 1-100 limit range

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
c5fe95b to b78e858
web3guru888
left a comment
This is a solid multi-fix PR — six distinct community-reported bugs, all treated correctly. Let me walk through each fix and flag a few things worth reviewing.
#475 — Float mtime epsilon comparison (palace.py)
Correct fix. float(stored_mtime) == current_mtime was failing because ChromaDB serializes metadata through JSON (which has no float64 precision guarantees), so round-tripped mtime values lose sub-millisecond bits. abs(diff) < 0.01 is a practical threshold — OS mtimes typically have 100ns resolution so 10ms epsilon is many orders of magnitude above noise but well below the 1-second granularity that would cause false matches.
One edge case: if a file is re-mined within the same 10ms window as a previous mine (e.g., automated tooling touching files in rapid succession), this could produce a false "already mined" result. That's an extremely unlikely race condition but worth noting. Epsilon of 0.001 would be tighter without any practical downside on real file timestamps.
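The epsilon comparison under discussion can be sketched as follows. The function name and the 10ms constant are modeled on the PR description, not copied from the project's actual code:

```python
# Sketch of the epsilon-based mtime comparison (MTIME_EPSILON and
# mtime_matches are illustrative names, not the project's real API).
MTIME_EPSILON = 0.01  # 10 ms, per the PR; 0.001 would also work per the note above

def mtime_matches(stored_mtime, current_mtime, epsilon=MTIME_EPSILON):
    """Compare mtimes with a tolerance, since values round-tripped
    through JSON metadata can lose sub-millisecond precision."""
    return abs(float(stored_mtime) - float(current_mtime)) < epsilon

# A value that lost sub-millisecond bits in serialization still matches:
assert mtime_matches(1714066800.1234567, 1714066800.123456)
# A file modified a second later does not:
assert not mtime_matches(1714066800.0, 1714066801.0)
```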
#476 — STOPWORDS expansion for entity detector (entity_detector.py)
Good direction. Handler, Node, Service, Manager, Client, Server are all heavily overloaded in codebases and should absolutely not be extracted as named entities.
A few observations:

- "one" in STOPWORDS: presumably to catch "One" being detected as an entity (as in "MemPalace One" or similar)? Fine, but this will also suppress legitimate single-word entity detection if "One" is part of a proper name. Seems low-risk.
- "azure_functions" and "notion_api" are interesting additions — these are specific product names. The rationale for stopping them isn't clear. If the concern is over-detection of platform names, a broader approach (like a product-name list) would be more principled. This could quietly suppress valid named entities in documentation that discusses Azure Functions or Notion API as subjects.
- The comment says "20+ common technical terms" but I count 22. Minor, but the PR description says "20+" which is consistent.
Missing: The issue referenced (#476) mentions entity detection specifically for code files. Are there tests verifying that handler in prose (e.g., "The EventHandler processes...") is no longer extracted? Would be useful to see a test case covering the new stopwords.
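The kind of test asked for above could look like this. The STOPWORDS contents and the extract_candidates tokenizer here are simplified stand-ins, not the real entity_detector module:

```python
# Minimal sketch of a stopword-suppression test; the regex tokenizer and
# the STOPWORDS set are assumptions standing in for entity_detector.
import re

STOPWORDS = {"the", "handler", "node", "service", "manager", "client",
             "server", "one"}

def extract_candidates(text):
    # Capitalized words are candidate entities; the stopword lookup is
    # case-insensitive, mirroring the .lower() normalization in the PR.
    words = re.findall(r"\b[A-Z][A-Za-z_]+\b", text)
    return [w for w in words if w.lower() not in STOPWORDS]

# Bare "Handler" in prose is suppressed...
assert "Handler" not in extract_candidates("The Handler processes events.")
# ...but a compound name like EventHandler still survives.
assert "EventHandler" in extract_candidates("The EventHandler processes events.")
```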
#477 — Unbounded limit in tool_search (mcp_server.py)
max(1, min(limit, 100)) is the right pattern. Without this, a malicious or buggy MCP client could trigger a col.query(n_results=999999) which loads all embeddings into memory. The JSON schema description update ("Max results, 1-100 (default 5)") correctly documents the new constraint.
Minor: The floor of 1 is fine for correctness but if a caller passes limit=0, they presumably want no results — clamping to 1 would return one result unexpectedly. max(1, ...) is conventional though, so this is a style point rather than a bug.
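The clamp pattern discussed above, shown in isolation (the function name and defaults are illustrative, not necessarily the server's):

```python
# Bound an MCP client's requested result count to a safe range,
# preventing unbounded col.query(n_results=...) calls.
def clamp_limit(limit, lo=1, hi=100):
    return max(lo, min(int(limit), hi))

assert clamp_limit(999999) == 100  # malicious or buggy client
assert clamp_limit(0) == 1         # floor of 1, per the style note above
assert clamp_limit(5) == 5         # sane values pass through
```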
#478/#479 — Pagination via get_all_metadatas + duplicate cache declarations (palace.py, mcp_server.py, miner.py)
The pagination fix is the most impactful change here. Replacing all col.get(limit=10000) calls with get_all_metadatas(col) correctly handles palaces with >10,000 drawers. The implementation is clean: 1,000-record pages, terminates when a partial page is returned, re-uses a shared constant.
The removal of the duplicate _client_cache = None / _collection_cache = None declaration at lines 103-104 in mcp_server.py is a genuine bug fix. That second declaration was resetting the singleton caches to None after the WAL setup block, meaning the first call to _get_client() after WAL init would always create a new client instead of returning the cached one. This could cause subtle multi-client collision issues in environments where the WAL was written between the two initializations. Nice catch.
Note on get_all_metadatas: The function accepts **extra_kwargs which get merged into every page's col.get() call. The where filter via extra_kwargs is correct for filtering by wing — but if extra_kwargs contained limit or offset, it would override the pagination values and silently break. A defensive check or explicit documentation of the accepted kwargs would help.
Also: get_all_metadatas loads all metadatas into memory. For a 1M-drawer palace this becomes a multi-GB Python list. Fine for current use cases, but if status() / taxonomy() end up being called on very large palaces, a generator-based approach would be more memory-efficient. Non-blocking for now.
#503 — Windows UTF-8 / surrogate fix in MCP main() (mcp_server.py)
Correct and well-considered. The MCP server communicates over stdin/stdout as JSON-RPC, so any CJK content in a drawer document that passes through a Windows codepage-encoded pipe (cp936, cp932, cp1252) will corrupt the JSON framing. The surrogateescape on stdin and replace on stdout is the right combination: it preserves round-trippable binary on input while preventing corrupted surrogates from poisoning the JSON output.
One subtle point: wrapping sys.stdout with io.TextIOWrapper and errors="replace" means that malformed characters in outgoing JSON responses will be silently replaced with ?. This is safe for the MCP framing (JSON stays parseable) but could cause MCP clients to see ? in drawer content that had non-UTF-8 bytes. This is expected and acceptable for the stated goal (prevent server crash), but worth noting in the PR description for downstream debugging.
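The error-handler combination described above can be demonstrated on in-memory buffers rather than the real sys.stdin/stdout (the PR's exact wrapping code is not reproduced here):

```python
# Sketch of the surrogateescape-in / replace-out pattern, using BytesIO
# so the behavior is observable without touching real process streams.
import io

# Input side: surrogateescape round-trips arbitrary bytes losslessly.
raw_in = io.BytesIO("中文".encode("utf-8") + b"\xff")  # trailing invalid byte
reader = io.TextIOWrapper(raw_in, encoding="utf-8", errors="surrogateescape")
text = reader.read()
# Re-encoding with surrogateescape recovers the original bytes exactly.
assert text.encode("utf-8", errors="surrogateescape").endswith(b"\xff")

# Output side: replace keeps the stream writable even for lone surrogates,
# at the cost of substituting "?" (the behavior noted in the review).
raw_out = io.BytesIO()
writer = io.TextIOWrapper(raw_out, encoding="utf-8", errors="replace")
writer.write(text)  # contains the lone surrogate from the input side
writer.flush()
assert "中文".encode("utf-8") in raw_out.getvalue()
```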
Overall
Six solid fixes, each addressing the root cause. The pagination fix (#478) and duplicate cache declaration fix (#479) are the most impactful — both affect correctness on real-world deployments. The mtime epsilon and Windows UTF-8 fixes are the most likely to be hit in the field.
One test coverage concern: the PR test plan lists manual verification steps but doesn't add automated tests for the new get_all_metadatas pagination logic or the tool_search limit clamping. The pagination helper in particular deserves a unit test with a mock collection that returns exactly 1,000 records on page 1 and 500 on page 2.
Verdict: COMMENT — would approve after addressing the azure_functions/notion_api stopword rationale and confirming test coverage for pagination + limit clamping.
Reviewed by MemPalace-AGI — autonomous research system with perfect memory
- Split STOPWORDS into GENERIC_STOPWORDS and TECHNICAL_STOPWORDS so projects can override technical terms without losing common words. Cloud/SaaS entries use compound forms (azure_functions, azure_devops, notion_api, notion_workspace) to avoid suppressing bare names (MemPalace#476).
- Verified case normalization: extract_candidates already applies .lower() before stopword lookup, so "Handler" matches "handler" (MemPalace#476).
- Added unit tests:
  * Stopwords suppress "Handler" in prose and via case-insensitive lookup
  * GENERIC_STOPWORDS / TECHNICAL_STOPWORDS are separate sets
  * get_all_metadatas pagination (mock 1000+500 across two pages)
  * get_all_metadatas defensive check strips caller limit/offset/include
  * tool_search limit clamping (<1 -> 1, >100 -> 100)
- Updated MCP tool_search schema with minimum/maximum constraints (MemPalace#477).
- Added defensive stripping of limit/offset/include in get_all_metadatas extra_kwargs to prevent silent override (MemPalace#478).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Summary
Fixes 6 open bugs reported by the community:
- #475: file_already_mined() used strict == on floats that lose precision through ChromaDB's JSON serialization. Changed to epsilon comparison (abs(diff) < 0.01).
- #476: Added common technical terms (Handler, Node, Service, Manager, Client, Server, etc.) to the STOPWORDS set to prevent false positive entity detection.
- #477: tool_search has no upper bound on its limit parameter. Clamped to max(1, min(limit, 100)) to prevent memory exhaustion from unbounded ChromaDB queries.
- #478: Replaced col.get(limit=10000) calls with a paginated helper _get_all_metadatas() that iterates in batches of 1,000 (same pattern already used in palace_graph.py).
- #479: Removed the second _client_cache/_collection_cache declaration at lines 103-104 that silently resets cache state after the WAL setup block.
- #503: Wrapped stdin/stdout in main() with surrogateescape for input and replace for output (prevents corrupted JSON-RPC responses).

Also applied the same pagination fix to miner.status(), which had the same 10K truncation issue.

Test plan

- mempalace mine skips unchanged files on re-run (mtime fix)
- mempalace status shows correct counts for palaces with >10K drawers
- Entity detector no longer extracts Handler, Node, Service etc. as projects
- tool_search with limit=999999 gets clamped to 100

Closes #475, #476, #477, #478, #479, #503
🤖 Generated with Claude Code