
bug: 7 issues in v3.0.0→v3.3.0 — data loss in repair/migrate, missing input validation, retry gaps #934

@shaun0927

Description


Context

While reviewing the v3.0.0→v3.3.0 diff I found 7 issues that are not covered by any existing issue or PR. I checked #883, #809, #893, #851, #852, and #661 before filing this.

All code references are against the current develop branch (commit 4aa7e1e, v3.3.0 tag).


Data Integrity

These three findings violate the design principle stated in CLAUDE.md: "Incremental only — A crash mid-operation must leave the existing palace untouched."

1. repair.py:265-275 — rebuild_index deletes collection before upsert completes

rebuild_index calls backend.delete_collection() at line 265, then upserts drawers in a loop (lines 269-275) with no try/except. If the upsert crashes partway through (OOM, ChromaDB error), the remaining drawers are lost. The backup copy is made before deletion (line 260), but the code never restores from it on failure — the user must do it manually.

# Current (lines 265-275)
backend.delete_collection(palace_path, COLLECTION_NAME)
new_col = backend.create_collection(palace_path, COLLECTION_NAME)
for i in range(0, len(all_ids), batch_size):
    # no try/except — crash here = partial data loss
    new_col.upsert(...)

Fix: Wrap the upsert loop in try/except and restore chroma.sqlite3 from the backup file on failure.
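A minimal sketch of the proposed rollback, assuming a hypothetical `fetch_batch` callable and a `backup_path` pointing at the `chroma.sqlite3` snapshot taken at line 260 (the real `rebuild_index` signature differs):

```python
import shutil

def rebuild_index_safely(backend, palace_path, collection_name,
                         all_ids, fetch_batch, backup_path, batch_size=5000):
    """Hypothetical crash-safe rebuild: restore the backup if any upsert fails."""
    backend.delete_collection(palace_path, collection_name)
    new_col = backend.create_collection(palace_path, collection_name)
    try:
        for i in range(0, len(all_ids), batch_size):
            new_col.upsert(**fetch_batch(i, batch_size))
    except Exception:
        # Roll back: put the pre-deletion snapshot back before re-raising,
        # so a crash mid-upsert no longer strands the user with partial data.
        shutil.copy2(backup_path, f"{palace_path}/chroma.sqlite3")
        raise
```

The user-visible behavior on failure becomes "operation failed, palace restored" instead of "operation failed, restore manually from the backup".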

2. migrate.py:234-235 — non-atomic palace swap

shutil.rmtree(palace_path)           # line 234 — old palace deleted
shutil.move(temp_palace, palace_path) # line 235 — new palace moved in

If the process dies between lines 234 and 235, both the old palace (deleted) and the new palace (not yet moved) are gone. The backup at line 204 mitigates this but recovery is manual.

Fix: Use os.rename() to swap atomically: rename old to .old, rename temp to palace, then remove .old.
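A sketch of that sequence (function name hypothetical). Each individual `os.rename` is atomic on the same filesystem, and the old palace is never deleted until the new one is in place:

```python
import os
import shutil

def swap_palace(palace_path: str, temp_palace: str) -> None:
    """Hypothetical crash-safe swap: the old palace survives until the new one is live."""
    old = palace_path + ".old"
    os.rename(palace_path, old)               # set old palace aside, nothing deleted yet
    try:
        os.rename(temp_palace, palace_path)   # atomic on the same filesystem
    except OSError:
        os.rename(old, palace_path)           # roll back if the move-in fails
        raise
    shutil.rmtree(old)                        # delete only after the new palace is live
```

A crash between the two renames leaves the palace recoverable at `<palace>.old` rather than gone.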

3. cli.py:276 — cmd_repair extraction uses fixed offset increment

offset += batch_size  # always 5000, regardless of actual batch size

If a batch returns fewer than batch_size items (e.g. due to concurrent modification or the last page), subsequent batches skip batch_size - actual_count drawers. The extracted data is then used to rebuild the collection, so any skipped drawers are permanently lost. Compare with exporter.py:70 which correctly uses if not batch["ids"]: break.

Fix: offset += len(batch["ids"]) and add if not batch["ids"]: break.
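A sketch of the corrected paging loop (standalone function for illustration; the real code is inline in `cmd_repair`):

```python
def extract_all(collection, batch_size: int = 5000) -> list:
    """Hypothetical paging loop: advance by the number of items actually returned."""
    ids, offset = [], 0
    while True:
        batch = collection.get(limit=batch_size, offset=offset)
        if not batch["ids"]:          # empty page: we have everything
            break
        ids.extend(batch["ids"])
        offset += len(batch["ids"])   # NOT offset += batch_size
    return ids
```

With `offset += batch_size`, a short page anywhere before the end silently skips `batch_size - actual_count` drawers; advancing by the returned count makes short pages harmless.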


Input Validation

4. mcp_server.py:917,926-927,967 — tool_diary_write stores topic without sanitization

agent_name is passed through sanitize_name() (line 926) and entry through sanitize_content() (line 927), but topic is stored raw into ChromaDB metadata at line 967. Any string — including path traversal sequences, null bytes, or megabyte-length payloads — passes through.

def tool_diary_write(agent_name: str, entry: str, topic: str = "general"):
    agent_name = sanitize_name(agent_name, "agent_name")  # ✓ sanitized
    entry = sanitize_content(entry)                         # ✓ sanitized
    # topic — NOT sanitized, stored raw at line 967

Fix: Add topic = sanitize_name(topic, "topic") after line 927.
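For illustration, a minimal sanitizer of the kind `sanitize_name` presumably implements (the project's actual validation rules may be stricter; the checks below are assumptions):

```python
def sanitize_name(value: str, field: str, max_len: int = 128) -> str:
    """Minimal sketch of a name sanitizer: reject null bytes, path
    separators, traversal sequences, and over-long values."""
    if "\x00" in value or "/" in value or "\\" in value or ".." in value:
        raise ValueError(f"invalid {field}: forbidden characters")
    if not value or len(value) > max_len:
        raise ValueError(f"{field} must be 1-{max_len} characters")
    return value

# With this in place, the fix is one line after the existing entry sanitization:
#   topic = sanitize_name(topic, "topic")
```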


Resilience

5. closet_llm.py:162-163 — JSONDecodeError exits retry loop immediately

The function has a for attempt in range(3) retry loop (line 150). HTTP 429/503 errors correctly continue to the next attempt (line 168), but a json.JSONDecodeError from malformed LLM output immediately returns None, None (line 163), exiting the loop. LLM responses are nondeterministic, so a single malformed response should not permanently skip the source file.

for attempt in range(3):
    try:
        # ... make request, parse JSON ...
    except json.JSONDecodeError:
        return None, None    # ← exits loop, should be: continue
    except urllib.error.HTTPError as e:
        if e.code in (429, 503) and attempt < 2:
            continue         # ← correctly retries

Fix: Change return None, None to continue on line 163.
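Distilled to a standalone sketch (function and parameter names hypothetical), the intended behavior is to treat malformed JSON like any other transient failure:

```python
import json

def parse_with_retry(fetch, max_attempts: int = 3):
    """Sketch: retry on malformed JSON instead of giving up on the first attempt."""
    for attempt in range(max_attempts):
        try:
            return json.loads(fetch())
        except json.JSONDecodeError:
            continue  # LLM output is nondeterministic: the next attempt may parse
    return None       # all attempts exhausted
```

Returning None only after all attempts fail preserves the existing caller contract while no longer abandoning a source file on one bad response.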

6. mcp_server.py:1121-1124 — tool_reconnect does not reset metadata cache

tool_reconnect resets _collection_cache, _palace_db_inode, and _palace_db_mtime (lines 1122-1124), but does not reset _metadata_cache or _metadata_cache_time. After reconnect, _get_cached_metadata may return stale metadata from the old connection for up to 5 seconds (the TTL). Compare with _get_client() at lines 174-180 which resets all 6 globals.

# Current (line 1121)
global _collection_cache, _palace_db_inode, _palace_db_mtime
# Missing: _metadata_cache, _metadata_cache_time

Fix: Add _metadata_cache and _metadata_cache_time to the global declaration and reset them.


Defense-in-Depth

7. exporter.py:51,92 — output path follows symlinks

os.makedirs(output_dir) and os.makedirs(wing_dir) follow symlinks. If an attacker pre-creates a symlink at the output location, exported markdown files land in attacker-controlled directories. The miner correctly skips symlinks for input files, but the exporter does not validate output paths.

Fix: Resolve the output path with os.path.realpath() and verify it stays within the intended directory.
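A sketch of that check (function name hypothetical; `root` is the directory exports are supposed to stay under). Note this is defense-in-depth, not a complete TOCTOU fix, since a symlink could still appear between the check and the write:

```python
import os

def safe_makedirs(output_dir: str, root: str) -> str:
    """Sketch: resolve symlinks, then refuse paths that escape the intended root."""
    resolved = os.path.realpath(output_dir)   # follows any pre-planted symlinks
    root = os.path.realpath(root)
    if os.path.commonpath([resolved, root]) != root:
        raise ValueError(f"output path escapes {root}: {resolved}")
    os.makedirs(resolved, exist_ok=True)
    return resolved
```

Callers then write into the returned `resolved` path, so a symlink pointing outside the export root is rejected instead of silently followed.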


Relationship to Existing Issues/PRs

PRs

I'm submitting focused PRs for these findings, all targeting develop:

  • fix/repair-crash-safety — Findings 1, 2, 3 (data integrity)
  • fix/diary-topic-sanitize — Finding 4 (input validation)
  • fix/closet-llm-resilience — Findings 5, 6, 7 (resilience + defense-in-depth)

Metadata

Labels: P1, high, bug
