
bug: 7 issues in v3.0.0→v3.3.0 — data loss in repair/migrate, missing input validation, retry gaps #934

@shaun0927

Description


Context

While reviewing the v3.0.0→v3.3.0 diff I found 7 issues that are not covered by any existing issue or PR. I checked #883, #809, #893, #851, #852, and #661 before filing this.

All code references are against the current develop branch (commit 4aa7e1e, v3.3.0 tag).


Data Integrity

These three findings violate the design principle stated in CLAUDE.md: "Incremental only — A crash mid-operation must leave the existing palace untouched."

1. repair.py:265-275 — rebuild_index deletes collection before upsert completes

rebuild_index calls backend.delete_collection() at line 265, then upserts drawers in a loop (lines 269-275) with no try/except. If the upsert crashes partway through (OOM, ChromaDB error), the remaining drawers are lost. The backup copy is made before deletion (line 260), but the code never restores from it on failure — the user must do it manually.

# Current (lines 265-275)
backend.delete_collection(palace_path, COLLECTION_NAME)
new_col = backend.create_collection(palace_path, COLLECTION_NAME)
for i in range(0, len(all_ids), batch_size):
    # no try/except — crash here = partial data loss
    new_col.upsert(...)

Fix: Wrap the upsert loop in try/except and restore chroma.sqlite3 from the backup file on failure.
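A minimal sketch of the proposed rollback, assuming a hypothetical `fetch_batch` callable and a `backup_path` pointing at the `chroma.sqlite3` snapshot taken at line 260 (the real `rebuild_index` signature differs):

```python
import shutil

def rebuild_index_safely(backend, palace_path, collection_name,
                         all_ids, fetch_batch, backup_path, batch_size=5000):
    """Hypothetical crash-safe rebuild: restore the backup if any upsert fails."""
    backend.delete_collection(palace_path, collection_name)
    new_col = backend.create_collection(palace_path, collection_name)
    try:
        for i in range(0, len(all_ids), batch_size):
            new_col.upsert(**fetch_batch(i, batch_size))
    except Exception:
        # Roll back: put the pre-deletion snapshot back before re-raising,
        # so a crash mid-upsert no longer strands the user with partial data.
        shutil.copy2(backup_path, f"{palace_path}/chroma.sqlite3")
        raise
```

The user-visible behavior on failure becomes "operation failed, palace restored" instead of "operation failed, restore manually from the backup".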

2. migrate.py:234-235 — non-atomic palace swap

shutil.rmtree(palace_path)           # line 234 — old palace deleted
shutil.move(temp_palace, palace_path) # line 235 — new palace moved in

If the process dies between lines 234 and 235, both the old palace (deleted) and the new palace (not yet moved) are gone. The backup at line 204 mitigates this but recovery is manual.

Fix: Use os.rename() to swap atomically: rename old to .old, rename temp to palace, then remove .old.
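A sketch of that sequence (function name hypothetical). Each individual `os.rename` is atomic on the same filesystem, and the old palace is never deleted until the new one is in place:

```python
import os
import shutil

def swap_palace(palace_path: str, temp_palace: str) -> None:
    """Hypothetical crash-safe swap: the old palace survives until the new one is live."""
    old = palace_path + ".old"
    os.rename(palace_path, old)               # set old palace aside, nothing deleted yet
    try:
        os.rename(temp_palace, palace_path)   # atomic on the same filesystem
    except OSError:
        os.rename(old, palace_path)           # roll back if the move-in fails
        raise
    shutil.rmtree(old)                        # delete only after the new palace is live
```

A crash between the two renames leaves the palace recoverable at `<palace>.old` rather than gone.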

3. cli.py:276 — cmd_repair extraction uses fixed offset increment

offset += batch_size  # always 5000, regardless of actual batch size

If a batch returns fewer than batch_size items (e.g. due to concurrent modification or the last page), subsequent batches skip batch_size - actual_count drawers. The extracted data is then used to rebuild the collection, so any skipped drawers are permanently lost. Compare with exporter.py:70 which correctly uses if not batch["ids"]: break.

Fix: offset += len(batch["ids"]) and add if not batch["ids"]: break.
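A sketch of the corrected paging loop (standalone function for illustration; the real code is inline in `cmd_repair`):

```python
def extract_all(collection, batch_size: int = 5000) -> list:
    """Hypothetical paging loop: advance by the number of items actually returned."""
    ids, offset = [], 0
    while True:
        batch = collection.get(limit=batch_size, offset=offset)
        if not batch["ids"]:          # empty page: we have everything
            break
        ids.extend(batch["ids"])
        offset += len(batch["ids"])   # NOT offset += batch_size
    return ids
```

With `offset += batch_size`, a short page anywhere before the end silently skips `batch_size - actual_count` drawers; advancing by the returned count makes short pages harmless.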


Input Validation

4. mcp_server.py:917,926-927,967 — tool_diary_write stores topic without sanitization

agent_name is passed through sanitize_name() (line 926) and entry through sanitize_content() (line 927), but topic is stored raw into ChromaDB metadata at line 967. Any string — including path traversal sequences, null bytes, or megabyte-length payloads — passes through.

def tool_diary_write(agent_name: str, entry: str, topic: str = "general"):
    agent_name = sanitize_name(agent_name, "agent_name")  # ✓ sanitized
    entry = sanitize_content(entry)                         # ✓ sanitized
    # topic — NOT sanitized, stored raw at line 967

Fix: Add topic = sanitize_name(topic, "topic") after line 927.
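For illustration, a minimal sanitizer of the kind `sanitize_name` presumably implements (the project's actual validation rules may be stricter; the checks below are assumptions):

```python
def sanitize_name(value: str, field: str, max_len: int = 128) -> str:
    """Minimal sketch of a name sanitizer: reject null bytes, path
    separators, traversal sequences, and over-long values."""
    if "\x00" in value or "/" in value or "\\" in value or ".." in value:
        raise ValueError(f"invalid {field}: forbidden characters")
    if not value or len(value) > max_len:
        raise ValueError(f"{field} must be 1-{max_len} characters")
    return value

# With this in place, the fix is one line after the existing entry sanitization:
#   topic = sanitize_name(topic, "topic")
```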


Resilience

5. closet_llm.py:162-163 — JSONDecodeError exits retry loop immediately

The function has a for attempt in range(3) retry loop (line 150). HTTP 429/503 errors correctly continue to the next attempt (line 168), but a json.JSONDecodeError from malformed LLM output immediately returns None, None (line 163), exiting the loop. LLM responses are nondeterministic, so a single malformed response should not permanently skip the source file.

for attempt in range(3):
    try:
        # ... make request, parse JSON ...
    except json.JSONDecodeError:
        return None, None    # ← exits loop, should be: continue
    except urllib.error.HTTPError as e:
        if e.code in (429, 503) and attempt < 2:
            continue         # ← correctly retries

Fix: Change return None, None to continue on line 163.
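Distilled to a standalone sketch (function and parameter names hypothetical), the intended behavior is to treat malformed JSON like any other transient failure:

```python
import json

def parse_with_retry(fetch, max_attempts: int = 3):
    """Sketch: retry on malformed JSON instead of giving up on the first attempt."""
    for attempt in range(max_attempts):
        try:
            return json.loads(fetch())
        except json.JSONDecodeError:
            continue  # LLM output is nondeterministic: the next attempt may parse
    return None       # all attempts exhausted
```

Returning None only after all attempts fail preserves the existing caller contract while no longer abandoning a source file on one bad response.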

6. mcp_server.py:1121-1124 — tool_reconnect does not reset metadata cache

tool_reconnect resets _collection_cache, _palace_db_inode, and _palace_db_mtime (lines 1122-1124), but does not reset _metadata_cache or _metadata_cache_time. After reconnect, _get_cached_metadata may return stale metadata from the old connection for up to 5 seconds (the TTL). Compare with _get_client() at lines 174-180 which resets all 6 globals.

# Current (line 1121)
global _collection_cache, _palace_db_inode, _palace_db_mtime
# Missing: _metadata_cache, _metadata_cache_time

Fix: Add _metadata_cache and _metadata_cache_time to the global declaration and reset them.


Defense-in-Depth

7. exporter.py:51,92 — output path follows symlinks

os.makedirs(output_dir) and os.makedirs(wing_dir) follow symlinks. If an attacker pre-creates a symlink at the output location, exported markdown files land in attacker-controlled directories. The miner correctly skips symlinks for input files, but the exporter does not validate output paths.

Fix: Resolve the output path with os.path.realpath() and verify it stays within the intended directory.
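A sketch of that check (function name hypothetical; `root` is the directory exports are supposed to stay under). Note this is defense-in-depth, not a complete TOCTOU fix, since a symlink could still appear between the check and the write:

```python
import os

def safe_makedirs(output_dir: str, root: str) -> str:
    """Sketch: resolve symlinks, then refuse paths that escape the intended root."""
    resolved = os.path.realpath(output_dir)   # follows any pre-planted symlinks
    root = os.path.realpath(root)
    if os.path.commonpath([resolved, root]) != root:
        raise ValueError(f"output path escapes {root}: {resolved}")
    os.makedirs(resolved, exist_ok=True)
    return resolved
```

Callers then write into the returned `resolved` path, so a symlink pointing outside the export root is rejected instead of silently followed.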


Relationship to Existing Issues/PRs

PRs

I'm submitting focused PRs for these findings, all targeting develop:

  • fix/repair-crash-safety — Findings 1, 2, 3 (data integrity)
  • fix/diary-topic-sanitize — Finding 4 (input validation)
  • fix/closet-llm-resilience — Findings 5, 6, 7 (resilience + defense-in-depth)

Metadata

Labels: P1, high, bug
