Skip to content

fix: resolve 6 community-reported bugs#518

Open
leopechnicki wants to merge 3 commits intoMemPalace:developfrom
leopechnicki:fix/community-bugs
Open

fix: resolve 6 community-reported bugs#518
leopechnicki wants to merge 3 commits intoMemPalace:developfrom
leopechnicki:fix/community-bugs

Conversation

@leopechnicki
Copy link
Copy Markdown

Summary

Fixes 6 open bugs reported by the community:

Also applied the same pagination fix to miner.status() which had the same 10K truncation issue.

Test plan

  • Verify mempalace mine skips unchanged files on re-run (mtime fix)
  • Verify mempalace status shows correct counts for palaces with >10K drawers
  • Verify entity detection no longer flags Handler, Node, Service etc. as projects
  • Verify tool_search with limit=999999 gets clamped to 100
  • Test MCP server on Windows with CJK content input
  • Verify no regression on existing test suite

Closes #475, #476, #477, #478, #479, #503

🤖 Generated with Claude Code

@web3guru888
Copy link
Copy Markdown

Six fixes in one PR is a lot to review atomically — but the fixes themselves are all correct and well-scoped. Running through each:

#475 (float mtime epsilon): abs(diff) < 0.01 is right. Worth documenting why == breaks: ChromaDB stores metadata values as JSON, and JSON float round-trip through Python's json module can introduce sub-millisecond drift even for the same timestamp. 0.01 seconds is a safe epsilon — no file system has sub-10ms mtime resolution in practice.

#476 (entity detector stopwords): Glad to see Handler, Node, Service, Manager added. We hit this issue at Cycle 2 (getting Handler as a project name in every Python file). Two observations:

  1. "azure" and "notion" are interesting additions — they're legitimate entity names in some contexts (actual products). This might cause false negatives for people whose project is actually named AzureConnector or NotionSync. Might be worth being specific: azure_functions, notion_api rather than the bare product names.
  2. Lowercased entries won't match capitalized text if the detector checks for exact-case stopwords — double-check that the lookup is case-insensitive or that the detector normalizes before lookup.

#477 (tool_search limit clamp): max(1, min(limit, 100)) is correct. The silent unbounded query is a real memory risk with large palaces.

#478 (pagination for status/taxonomy): The _get_all_metadatas() helper is correct. Using len(batch["ids"]) < 1000 as the termination condition is fine — len(batch["metadatas"]) < 1000 would also work but ids is always present. Note: miner.py:status() gets the same fix inline rather than using the new helper — worth using the shared helper there too to keep the pattern consistent.

#479 (duplicate _client_cache): Correct removal. The double-declaration resets the cache on import, which could cause unexpected client re-initialization.

#503 (Windows UTF-8): This matches our recommendation from yesterday — surrogateescape on stdin, replace on stdout is the correct split. The import io inside main() is slightly unusual (normally at top of file) but non-blocking.

Overall a solid batch. The azure/notion stopwords are the only thing worth reconsidering — everything else looks good.

@web3guru888
Copy link
Copy Markdown

Six solid bug fixes in one PR — a few notes on each:

#475 (float mtime epsilon): The abs(diff) < 0.01 epsilon is correct. One small note: if we're using 0.01s as the tolerance, it's worth confirming this doesn't generate false-negatives on fast filesystems where two modifications can happen within 10ms. On most OS X / Linux systems, os.stat() mtime resolution is 1ns but ChromaDB's JSON round-trip is the actual precision loss — so the 0.01s margin is probably fine. PR #522 (prune) also does mtime comparisons; worth ensuring that PR picks up the same epsilon logic.

#476 (entity detector STOPWORDS): Adding Handler, Node, Service, etc. is the right fix and matches what we'd observed (#476 thread). One flag: the STOPWORDS set now mixes two categories — generic English words ("is", "the") and framework-specific identifiers (Manager, Client, Server). As the set grows, it might be worth splitting into GENERIC_STOPWORDS and TECHNICAL_STOPWORDS so they can be tuned independently (or domain-specific overrides can add to just one list). Not a blocker, just a maintenance concern for later.

#477 (limit clamp to 100): Correct fix. One thing to double-check: does the MCP tool_search description/schema also specify the 1-100 range? If the tool description says "specify any number" but the implementation clamps silently, that's confusing for callers. Schema and behavior should agree.

#478 (pagination for status/taxonomy): The batch-of-1,000 pagination is the right pattern — good catch that miner.status() had the same problem. Suggestion: expose the batch size as a constant (METADATA_PAGE_SIZE = 1000) so it's easy to find and tune, rather than repeated literal 1000s.

#479 (duplicate cache declaration): Silent cache reset is a nasty bug — glad this got caught. The fix is straightforward.

#503 (Windows UTF-8 surrogate): The surrogateescape + replace split (input vs output) is the right call — input should be permissive (don't crash on malformed), output should be safe (don't produce invalid JSON-RPC). Makes sense.

Overall this is a solid omnibus fix for real issues. Test plan items could be improved — most are manual checks, and for bugs this subtle, automated regression tests would be more valuable (especially for #479 which is easy to re-introduce).


MemPalace-AGI dashboard

leopechnicki and others added 2 commits April 10, 2026 23:33
- Fix float mtime comparison breaking file deduplication (MemPalace#475)
- Add common tech terms to entity detector STOPWORDS (MemPalace#476)
- Clamp tool_search limit to 1-100 to prevent memory exhaustion (MemPalace#477)
- Paginate metadata reads to fix 10K drawer truncation (MemPalace#478)
- Remove duplicate _client_cache/_collection_cache declarations (MemPalace#479)
- Force UTF-8 on Windows stdin/stdout for CJK content (MemPalace#503)

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…city, schema docs

- Move _get_all_metadatas to palace.py as shared get_all_metadatas with
  METADATA_PAGE_SIZE constant; miner.py and mcp_server.py both use it now
- Replace overly broad "azure"/"notion" stopwords with "azure_functions"/
  "notion_api" to avoid false negatives on legitimate entity names
- Update tool_search schema description to document the 1-100 limit range

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Copy link
Copy Markdown

@web3guru888 web3guru888 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a solid multi-fix PR — six distinct community-reported bugs, all treated correctly. Let me walk through each fix and flag a few things worth reviewing.


#475 — Float mtime epsilon comparison (palace.py)

Correct fix. float(stored_mtime) == current_mtime was failing because ChromaDB serializes metadata through JSON (which has no float64 precision guarantees), so round-tripped mtime values lose sub-millisecond bits. abs(diff) < 0.01 is a practical threshold — OS mtimes typically have 100ns resolution so 10ms epsilon is many orders of magnitude above noise but well below the 1-second granularity that would cause false matches.

One edge case: if a file is re-mined within the same 10ms window as a previous mine (e.g., automated tooling touching files in rapid succession), this could produce a false "already mined" result. That's an extremely unlikely race condition but worth noting. Epsilon of 0.001 would be tighter without any practical downside on real file timestamps.


#476 — STOPWORDS expansion for entity detector (entity_detector.py)

Good direction. Handler, Node, Service, Manager, Client, Server are all heavily overloaded in codebases and should absolutely not be extracted as named entities.

A couple of observations:

  • "one" in STOPWORDS: presumably to catch "One" being detected as an entity (as in "MemPalace One" or similar)? Fine, but this will also suppress legitimate single-word entity detection if "One" is part of a proper name. Seems low-risk.
  • "azure_functions" and "notion_api" are interesting additions — these are specific product names. The rationale for stopping them isn't clear. If the concern is over-detection of platform names, a broader approach (like a product-name list) would be more principled. This could quietly suppress valid named entities in documentation that discusses Azure Functions or Notion API as subjects.
  • The comment says "20+ common technical terms" but I count 22. Minor, but the PR description says "20+" which is consistent.

Missing: The issue referenced (#476) mentions entity detection specifically for code files. Are there tests verifying that handler in prose (e.g., "The EventHandler processes...") is no longer extracted? Would be useful to see a test case covering the new stopwords.


#477 — Unbounded limit in tool_search (mcp_server.py)

max(1, min(limit, 100)) is the right pattern. Without this, a malicious or buggy MCP client could trigger a col.query(n_results=999999) which loads all embeddings into memory. The JSON schema description update ("Max results, 1-100 (default 5)") correctly documents the new constraint.

Minor: The floor of 1 is fine for correctness but if a caller passes limit=0, they presumably want no results — clamping to 1 would return one result unexpectedly. max(1, ...) is conventional though, so this is a style point rather than a bug.


#478/#479 — Pagination via get_all_metadatas + duplicate cache declarations (palace.py, mcp_server.py, miner.py)

The pagination fix is the most impactful change here. Replacing all col.get(limit=10000) calls with get_all_metadatas(col) correctly handles palaces with >10,000 drawers. The implementation is clean: 1,000-record pages, terminates when a partial page is returned, re-uses a shared constant.

The removal of the duplicate _client_cache = None / _collection_cache = None declaration at lines 103-104 in mcp_server.py is a genuine bug fix. That second declaration was resetting the singleton caches to None after the WAL setup block, meaning the first call to _get_client() after WAL init would always create a new client instead of returning the cached one. This could cause subtle multi-client collision issues in environments where the WAL was written between the two initializations. Nice catch.

Note on get_all_metadatas: The function accepts **extra_kwargs which get merged into every page's col.get() call. The where filter via extra_kwargs is correct for filtering by wing — but if extra_kwargs contained limit or offset, it would override the pagination values and silently break. A defensive check or explicit documentation of the accepted kwargs would help.

Also: get_all_metadatas loads all metadatas into memory. For a 1M-drawer palace this becomes a multi-GB Python list. Fine for current use cases, but if status() / taxonomy() end up being called on very large palaces, a generator-based approach would be more memory-efficient. Non-blocking for now.


#503 — Windows UTF-8 / surrogate fix in MCP main() (mcp_server.py)

Correct and well-considered. The MCP server communicates over stdin/stdout as JSON-RPC, so any CJK content in a drawer document that passes through a Windows codepage-encoded pipe (cp936, cp932, cp1252) will corrupt the JSON framing. The surrogateescape on stdin and replace on stdout is the right combination: it preserves round-trippable binary on input while preventing corrupted surrogates from poisoning the JSON output.

One subtle point: wrapping sys.stdout with io.TextIOWrapper and errors="replace" means that malformed characters in outgoing JSON responses will be silently replaced with ?. This is safe for the MCP framing (JSON stays parseable) but could cause MCP clients to see ? in drawer content that had non-UTF-8 bytes. This is expected and acceptable for the stated goal (prevent server crash), but worth noting in the PR description for downstream debugging.


Overall

Six solid fixes, each addressing the root cause. The pagination fix (#478) and duplicate cache declaration fix (#479) are the most impactful — both affect correctness on real-world deployments. The mtime epsilon and Windows UTF-8 fixes are the most likely to be hit in the field.

One test coverage concern: the PR test plan lists manual verification steps but doesn't add automated tests for the new get_all_metadatas pagination logic or the tool_search limit clamping. The pagination helper in particular deserves a unit test with a mock collection that returns exactly 1,000 records on page 1 and 500 on page 2.

Verdict: COMMENT — would approve after addressing the azure_functions/notion_api stopword rationale and confirming test coverage for pagination + limit clamping.


Reviewed by MemPalace-AGI — autonomous research system with perfect memory

@bensig bensig changed the base branch from main to develop April 11, 2026 22:22
- Split STOPWORDS into GENERIC_STOPWORDS and TECHNICAL_STOPWORDS so
  projects can override technical terms without losing common words.
  Cloud/SaaS entries use compound forms (azure_functions, azure_devops,
  notion_api, notion_workspace) to avoid suppressing bare names (MemPalace#476).

- Verified case normalization: extract_candidates already applies
  .lower() before stopword lookup, so "Handler" matches "handler" (MemPalace#476).

- Added unit tests:
  * Stopwords suppress "Handler" in prose and via case-insensitive lookup
  * GENERIC_STOPWORDS / TECHNICAL_STOPWORDS are separate sets
  * get_all_metadatas pagination (mock 1000+500 across two pages)
  * get_all_metadatas defensive check strips caller limit/offset/include
  * tool_search limit clamping (<1 -> 1, >100 -> 100)

- Updated MCP tool_search schema with minimum/maximum constraints (MemPalace#477).

- Added defensive stripping of limit/offset/include in
  get_all_metadatas extra_kwargs to prevent silent override (MemPalace#478).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@leopechnicki leopechnicki requested a review from igorls as a code owner April 13, 2026 11:56
@igorls igorls added area/kg Knowledge graph area/mcp MCP server and tools area/mining File and conversation mining bug Something isn't working labels Apr 14, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/kg Knowledge graph area/mcp MCP server and tools area/mining File and conversation mining bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

BUG: Float mtime comparison breaks file deduplication — every file re-mined on each run

3 participants