While reviewing the v3.0.0→v3.3.0 diff I found 7 issues that are not covered by any existing issue or PR. I checked #883, #809, #893, #851, #852, and #661 before filing this.
All code references are against the current develop branch (commit 4aa7e1e, v3.3.0 tag).
Data Integrity
These three violate CLAUDE.md's design principle: "Incremental only — A crash mid-operation must leave the existing palace untouched."
1. repair.py:265-275 — rebuild_index deletes collection before upsert completes
rebuild_index calls backend.delete_collection() at line 265, then upserts drawers in a loop (lines 269-275) with no try/except. If the upsert crashes partway through (OOM, ChromaDB error), the remaining drawers are lost. The backup copy is made before deletion (line 260), but the code never restores from it on failure — the user must do it manually.
```python
# Current (lines 265-275)
backend.delete_collection(palace_path, COLLECTION_NAME)
new_col = backend.create_collection(palace_path, COLLECTION_NAME)
for i in range(0, len(all_ids), batch_size):
    # no try/except — crash here = partial data loss
    new_col.upsert(...)
```
Fix: Wrap the upsert loop in try/except and restore chroma.sqlite3 from the backup file on failure.
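A minimal sketch of the suggested shape. The function and parameter names (`rebuild_with_restore`, `backup_path`, `db_path`) are illustrative, not the project's actual API; the upsert call stands in for the existing batch loop:

```python
import shutil

def rebuild_with_restore(backend, palace_path, collection_name,
                         all_ids, batch_size, backup_path, db_path):
    """Sketch: recreate the collection, restoring the SQLite file from the
    pre-deletion backup if any upsert batch fails (names are illustrative)."""
    backend.delete_collection(palace_path, collection_name)
    new_col = backend.create_collection(palace_path, collection_name)
    try:
        for i in range(0, len(all_ids), batch_size):
            batch_ids = all_ids[i:i + batch_size]
            new_col.upsert(ids=batch_ids)  # plus documents/metadatas as today
    except Exception:
        shutil.copy2(backup_path, db_path)  # roll chroma.sqlite3 back
        raise
```

Re-raising after the restore keeps the failure visible to the caller while leaving the palace in its pre-repair state.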
2. migrate.py:234-235 — non-atomic palace swap
```python
shutil.rmtree(palace_path)             # line 234 — old palace deleted
shutil.move(temp_palace, palace_path)  # line 235 — new palace moved in
```
If the process dies between lines 234 and 235, both the old palace (deleted) and the new palace (not yet moved) are gone. The backup at line 204 mitigates this but recovery is manual.
Fix: Use os.rename() to swap atomically: rename old to .old, rename temp to palace, then remove .old.
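A sketch of that rename sequence (assuming `temp_palace` lives on the same filesystem as the palace, since `os.rename()` is only atomic within one filesystem; the helper name is illustrative):

```python
import os
import shutil

def swap_palace(palace_path, temp_palace):
    """Sketch of a crash-safe swap: at every step either the old or the new
    palace exists under some name, so nothing is lost mid-operation."""
    old = palace_path + ".old"
    os.rename(palace_path, old)              # atomic on the same filesystem
    try:
        os.rename(temp_palace, palace_path)  # atomic: new palace appears
    except OSError:
        os.rename(old, palace_path)          # roll back to the old palace
        raise
    shutil.rmtree(old)                       # only now discard the old copy
```

A crash between the two renames leaves the old palace intact at the `.old` path, which a startup check could detect and roll back automatically.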
3. cli.py:276 — cmd_repair extraction uses fixed offset increment

```python
# Current (line 276)
offset += batch_size  # always 5000, regardless of actual batch size
```
If a batch returns fewer than batch_size items (e.g. due to concurrent modification or the last page), subsequent batches skip batch_size - actual_count drawers. The extracted data is then used to rebuild the collection, so any skipped drawers are permanently lost. Compare with exporter.py:70 which correctly uses if not batch["ids"]: break.
Fix: offset += len(batch["ids"]) and add if not batch["ids"]: break.
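The corrected pagination loop might look like the following sketch. `extract_all` is an illustrative name and `collection.get(limit=..., offset=...)` stands in for however cmd_repair pages through the collection:

```python
def extract_all(collection, batch_size=5000):
    """Sketch: advance the offset by what was actually returned and stop on
    an empty page, mirroring the pattern already used in exporter.py:70."""
    offset, ids = 0, []
    while True:
        batch = collection.get(limit=batch_size, offset=offset)
        if not batch["ids"]:
            break
        ids.extend(batch["ids"])
        offset += len(batch["ids"])  # not a fixed batch_size
    return ids
```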
Input Validation
4. mcp_server.py:917,926-927,967 — tool_diary_write stores topic without sanitization
agent_name is passed through sanitize_name() (line 926) and entry through sanitize_content() (line 927), but topic is stored raw into ChromaDB metadata at line 967. Any string — including path traversal sequences, null bytes, or megabyte-length payloads — passes through.
```python
def tool_diary_write(agent_name: str, entry: str, topic: str = "general"):
    agent_name = sanitize_name(agent_name, "agent_name")  # ✓ sanitized
    entry = sanitize_content(entry)                       # ✓ sanitized
    # topic — NOT sanitized, stored raw at line 967
```
Fix: Add topic = sanitize_name(topic, "topic") after line 927.
Resilience

5. closet_llm.py:162-163 — JSONDecodeError exits retry loop immediately
The function has a for attempt in range(3) retry loop (line 150). HTTP 429/503 errors correctly continue to the next attempt (line 168), but json.JSONDecodeError from malformed LLM output immediately return None, None (line 163), exiting the loop. LLM responses are nondeterministic — a single malformed response should not permanently skip the source file.
Fix: Change return None, None to continue on line 163.
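The intended control flow, sketched in isolation (the `fetch_response` callable and the `parse_with_retry` name are illustrative stand-ins for the function's actual request/parse steps):

```python
import json

def parse_with_retry(fetch_response, attempts=3):
    """Sketch: treat malformed JSON like a transient HTTP error and retry,
    rather than returning early on the first bad LLM output."""
    for attempt in range(attempts):
        raw = fetch_response()
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # was: return None, None (gave up after one bad output)
        return parsed, raw
    return None, None  # all attempts exhausted
```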
6. mcp_server.py:1121-1124 — tool_reconnect does not reset metadata cache
tool_reconnect resets _collection_cache, _palace_db_inode, and _palace_db_mtime (lines 1122-1124), but does not reset _metadata_cache or _metadata_cache_time. After reconnect, _get_cached_metadata may return stale metadata from the old connection for up to 5 seconds (the TTL). Compare with _get_client() at lines 174-180 which resets all 6 globals.
```python
# Current (line 1121)
global _collection_cache, _palace_db_inode, _palace_db_mtime
# Missing: _metadata_cache, _metadata_cache_time
```
Fix: Add _metadata_cache and _metadata_cache_time to the global declaration and reset them.
Defense-in-Depth

7. exporter.py:51,92 — output path follows symlinks
os.makedirs(output_dir) and os.makedirs(wing_dir) follow symlinks. If an attacker pre-creates a symlink at the output location, exported markdown files land in attacker-controlled directories. The miner correctly skips symlinks for input files, but the exporter does not validate output paths.
Fix: Resolve the output path with os.path.realpath() and verify it stays within the intended directory.
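One way to sketch that check (the helper name `safe_output_dir` is illustrative; the exporter would apply the same idea before each `os.makedirs` call):

```python
import os

def safe_output_dir(base_dir, *parts):
    """Sketch: resolve the requested output directory and refuse to create
    it if symlinks would carry it outside base_dir."""
    base = os.path.realpath(base_dir)
    target = os.path.realpath(os.path.join(base, *parts))
    if os.path.commonpath([base, target]) != base:
        raise ValueError(f"output path escapes {base_dir}: {target}")
    os.makedirs(target, exist_ok=True)
    return target
```

Because `os.path.realpath()` resolves symlinks in every existing path component, a pre-planted symlink at the output location is detected before any file is written.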
Relationship to Existing Issues/PRs

- #883 (query_relationship(), timeline(), stats() missing self._lock in KnowledgeGraph) / PR #884 (fix: add missing self._lock to query_relationship, timeline, and stats in KnowledgeGraph) — KG thread safety. Already merged. Not duplicated here.

PRs

Submitting focused PRs for these findings, targeting develop:

- fix/repair-crash-safety — Findings 1, 2, 3 (data integrity)
- fix/diary-topic-sanitize — Finding 4 (input validation)
- fix/closet-llm-resilience — Findings 5, 6, 7 (resilience + defense-in-depth)