security: harden inputs, fix shell injection, optimize DB access by bensig · Pull Request #387 · MemPalace/mempalace

bensig · 2026-04-09T15:06:54Z

Summary

Security hardening across the codebase, cherry-picked from #252 by @anthonyonazure with lint fixes and a bug fix applied.

Shell injection fix in hook script — paths passed via sys.argv instead of string interpolation
Input validation — sanitize_name() and sanitize_content() block path traversal, null bytes, oversized strings at all MCP entry points
10MB file size guard + symlink skip in both miners (fixes Missing Input Validation Across File Processing Pipeline #250)
SQLite connection leak fix in knowledge_graph.py — single reusable connection with proper with conn: transactions
Shared palace.py module — deduplicates get_collection() and file_already_mined() across miners
Write-ahead log for audit trail on all write/delete operations
Metadata cache with 30s TTL for status/taxonomy calls
md5 → sha256 for drawer and triple IDs
File permissions hardened to 0o700/0o600
chromadb pinned to >=0.5.0,<0.7

Fixes applied on top of #252

Removed unused chromadb import in convo_miner.py (F401)
Removed duplicate os import in mcp_server.py (F811)
Removed unused client variable in mcp_server.py (F841)
Restored SKIP_FILENAMES constant in miner.py (F821 — undefined name)
Removed stale md5 drawer_id line that wasn't cleaned up
Applied ruff format to match CI (ruff 0.4.x)

Closes #250. Supersedes #252.

Test plan

CI passes (lint + tests on Linux/macOS/Windows)
Verify mempalace mine skips files >10MB and symlinks
Verify MCP tools reject path traversal in wing/room names

Credit: @anthonyonazure for the original security audit and implementation.

@anthonyonazure

- Fix command injection in hook script (pass paths via sys.argv) - Add sanitize_name/sanitize_content validators in config.py - Add 10MB file size guard + symlink skip in miners - Fix SQLite connection leak in knowledge_graph.py (reuse connection) - Use `with conn:` for proper transaction handling - Consolidate shared palace operations into palace.py - Add write-ahead log for audit trail on writes/deletes - Add metadata cache with 30s TTL for status/taxonomy calls - Upgrade md5 → sha256 for drawer/triple IDs - Harden file permissions (0o700/0o600) - Pin chromadb>=0.5.0,<0.7 Based on PR #252 by @anthonyonazure with lint fixes applied. Co-Authored-By: anthonyonazure <[email protected]>

…_DIRS - _client → _client_cache to match conftest.py reset fixture - _get_collection now uses _get_client() return value instead of stale ref - Restore .pytest_cache and other dirs missing from palace.py SKIP_DIRS

The 30s TTL metadata cache returned stale data between test runs and after write operations. Reverted to direct col.get() reads which match the original behavior and pass all tests.

Remove datetime.now() from drawer_id hash so same content + wing + room always produces the same ID. This enables the idempotency check that returns "already_exists" on duplicate writes.

… harden security Review fixes (from Sage's review): - Restore mtime check in file_already_mined (check_mtime=True for miner) - Restore limit=10000 on MCP metadata fetches to prevent OOM on large palaces - Apply _SAFE_NAME_RE regex in sanitize_name (was dead code) - Drop raw_aaak metadata duplication in diary_write - chmod 0o700 on WAL dir, 0o600 on WAL file - Add check_same_thread=False on KnowledgeGraph SQLite connection - Remove __del__ (unreliable) and dead PRAGMA foreign_keys=ON

Covers the check_mtime=True path in palace.py to meet 85% coverage threshold.

milla-jovovich

Hey Ben — I've taken a look and ran it through CLI. Here are the results.

Two regressions I caught

1. miner.py loses its smart re-mining check. The old miner.py::file_already_mined compared stored source_mtime against the current file's mtime and returned False (re-mine) if the file had changed. The new palace.py::file_already_mined is the dumb existence-only version from convo_miner.py. The write side still stores mtime on miner.py:389 (metadata["source_mtime"] = os.path.getmtime(source_file)) but the read side no longer uses it. Modified files will be silently skipped on mempalace mine. This contradicts the old function's docstring ("Returns False (needs re-mining) when the file has been modified") and undoes the incremental re-mining story.

Fix: either inline the mtime comparison into palace.py::file_already_mined with an opt-in flag, or keep miner.py's smart version and only share the dumb one with convo_miner.

2. MCP metadata fetches are now unbounded, and the "30s TTL cache" the PR description promises is not in the code. The diff removes limit=10000 from tool_status, tool_list_wings, tool_list_rooms, and tool_get_taxonomy — that's a regression against merged #137 and against cktang88's #159 audit.

I grepped the full post-PR mcp_server.py for ttl and cache — only _client_cache (ChromaDB client handle reuse) and _collection_cache (collection handle reuse) exist. Neither is a metadata cache. There is no TTL. Every tool_status call still does a fresh unbounded col.get(include=["metadatas"]). On a 40K+ drawer palace (e.g. #211's Obsidian vault or our own palaces at ~80K drawers), this will OOM or hang on every MCP status call.

Fix: either restore limit=10000, actually implement the 30s TTL cache, or both.

Medium-severity items

3. Hook script eval $(python3 -c ...) reintroduces shell injection via stop_hook_active. sid and tp go through safe() sanitization; sha doesn't. If JSON supplies "stop_hook_active": "$(whatever)", Python f-string prints it literal, and bash eval evaluates command substitution inside the double quotes. Ironic in a PR titled "fix shell injection." Sanitize sha too, or replace eval with a safer parsing pattern (tempfile + source, or os.execvp chain).

4. sanitize_name's _SAFE_NAME_RE regex is dead code. Defined but never applied. The function only checks .., /, \, \x00. A name like <script>alert(1)</script> passes sanitization because none of those patterns match. Either apply the regex (if not _SAFE_NAME_RE.match(value): raise ValueError(...)) or delete it.

5. raw_aaak metadata field in tool_diary_write stores the full entry twice. ChromaDB metadata has a per-field size limit. Long diary entries will fail col.add. The TODO comment acknowledges the real fix is AAAK expansion before embedding — storing raw AAAK in metadata doesn't improve retrieval because the embedding still sees the compressed version. Drop the field, or only store it when entry is under ~4KB.

6. WAL file ~/.mempalace/wal/write_log.jsonl has no chmod. Config gets 0o600; WAL doesn't. The WAL contains content_preview: content[:200] for every add_drawer / delete_drawer / diary_write — can be PII or private content. Security-hardening PR that creates a new world-readable file with drawer content previews is a contradiction. Add:

_WAL_DIR.mkdir(parents=True, exist_ok=True, mode=0o700)
# after first write:
os.chmod(_WAL_FILE, 0o600)

7. KnowledgeGraph single-connection pattern breaks multi-threaded access. Python's sqlite3.connect() defaults to check_same_thread=True. The old code (open/close per call) was thread-safe. The new code will raise ProgrammingError on the second thread's first call. Add check_same_thread=False on the connect line (WAL mode handles the concurrency), and drop the __del__ — it's a Python anti-pattern for cleanup because it may not run at interpreter shutdown.

Low-severity cleanup (not blocking)

PRAGMA foreign_keys=ON in _init_db is dead — no FOREIGN KEY constraints in the schema.
Duplicate _collection_cache = None declarations in mcp_server.py at lines 66–67 and 95–96.
md5 → sha256 for non-crypto IDs is cargo cult — usedforsecurity=False was the correct annotation. Not harmful, just theater.
WAL has no rotation / size limit. Future ops work.
sanitize_name rejects : which would block RDF-style predicates like foaf:knows. Future flag.

Good parts I want to keep

sanitize_name / sanitize_content at MCP entry points — right idea, right location, right defaults
10 MB file size guard + symlink skip in both miners — correctly added to scan_project and scan_convos
palace.py shared module — real dedupe of get_collection / SKIP_DIRS, clean new file
with conn: transactions in knowledge_graph.py — replaces the leak-prone conn.execute(...); conn.commit(); conn.close() pattern
Hook script python3 - "$TRANSCRIPT_PATH" <<'PYEOF' — correct sys.argv pattern, no interpolation
SQLite row_factory = sqlite3.Row + named column access — cleaner than positional row[10]
0o700 / 0o600 file perms on config dir/file
pyyaml upper bound pin

Thanks for all the cleanup on top of #252 — the sanitize_name/content pattern and the palace.py shared module are exactly the right shape. 💜

milla-jovovich

Approving — overriding my earlier request-changes review. Ben, the two items I flagged (mtime check loss in palace.py::file_already_mined and the unbounded metadata fetch + missing TTL cache in mcp_server.py) are both real and worth a follow-up pass whenever you have a spare minute, but they shouldn't block the merge. Nice work on the sanitize_name/content + palace.py dedupe — clean shape. 💜

PyPI release cut covering 39 merged PRs since v3.0.0 on 2026-04-06. Highlights: Claude/Codex plugin packaging (#270), security hardening (#387), honest AAAK stats + benchmark corrections (#147), Windows compatibility fixes, Knowledge Graph WAL mode + batching, 10K limit safety caps, and much more. See GitHub release notes for full changelog.

PyPI release cut covering 39 merged PRs since v3.0.0 on 2026-04-06. Highlights: Claude/Codex plugin packaging (#270), security hardening (#387), honest AAAK stats + benchmark corrections (#147), Windows compatibility fixes, Knowledge Graph WAL mode + batching, 10K limit safety caps, and much more. See GitHub release notes for full changelog. Co-authored-by: milla-jovovich <[email protected]>

PyPI release cut covering 39 merged PRs since v3.0.0 on 2026-04-06. Highlights: Claude/Codex plugin packaging (MemPalace#270), security hardening (MemPalace#387), honest AAAK stats + benchmark corrections (MemPalace#147), Windows compatibility fixes, Knowledge Graph WAL mode + batching, 10K limit safety caps, and much more. See GitHub release notes for full changelog. Co-authored-by: milla-jovovich <[email protected]>

bensig mentioned this pull request Apr 9, 2026

security+perf: harden inputs, fix command injection, optimize DB access #252

Closed

3 tasks

bensig added 5 commits April 9, 2026 08:13

fix: remove metadata cache that broke test isolation

32297fd

The 30s TTL metadata cache returned stale data between test runs and after write operations. Reverted to direct col.get() reads which match the original behavior and pass all tests.

fix: make drawer_id deterministic for idempotent writes

0717cae

Remove datetime.now() from drawer_id hash so same content + wing + room always produces the same ID. This enables the idempotency check that returns "already_exists" on duplicate writes.

test: add coverage for file_already_mined mtime check

2448ac0

Covers the check_mtime=True path in palace.py to meet 85% coverage threshold.

milla-jovovich requested changes Apr 9, 2026

View reviewed changes

bensig requested a review from milla-jovovich April 9, 2026 16:11

milla-jovovich approved these changes Apr 9, 2026

View reviewed changes

bensig merged commit 39855df into main Apr 9, 2026
5 of 6 checks passed

bensig mentioned this pull request Apr 9, 2026

fix: Windows mtime test compatibility #392

Merged

2 tasks

milla-jovovich mentioned this pull request Apr 9, 2026

chore: bump version to 3.1.0 #409

Merged

6 tasks

This was referenced Apr 9, 2026

Multiple issues between README claims and codebase #27

Open

chore: disable broken auto-bump workflow #414

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

security: harden inputs, fix shell injection, optimize DB access#387

security: harden inputs, fix shell injection, optimize DB access#387
bensig merged 6 commits intomainfrom
ben/security-hardening

bensig commented Apr 9, 2026

Uh oh!

milla-jovovich left a comment

Uh oh!

milla-jovovich left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

bensig commented Apr 9, 2026

Summary

Fixes applied on top of #252

Test plan

Uh oh!

milla-jovovich left a comment

Choose a reason for hiding this comment

Two regressions I caught

Medium-severity items

Low-severity cleanup (not blocking)

Good parts I want to keep

Uh oh!

milla-jovovich left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants