feat: add sensitive content scanner for palace drawers by MehdiJenab-oti · Pull Request #389 · MemPalace/mempalace

MehdiJenab-oti · 2026-04-09T15:31:17Z

Summary

Users routinely paste sensitive data — API keys, bearer tokens, passwords,
private key blocks — into conversations and project files that get mined
into the palace. Once stored, these secrets sit unencrypted in ChromaDB
drawers where any MCP client or backup process can read them. A stolen
laptop or compromised backup leaks every secret ever filed.

This PR adds a regex-based scanner that warns at every ingestion point
(MCP, miner, CLI audit) so users can act before secrets accumulate.

scanner.py: detects OpenAI/AWS/GitHub API keys, bearer tokens, password assignments, private keys
MCP integration: add_drawer returns a warnings field when secrets are detected (advisory, never blocks)
Miner integration: prints stderr warning during file ingestion
CLI: mempalace scan [--wing NAME] audits all existing drawers

Next steps (future PRs)

Automatic redaction option (replace matched patterns with [REDACTED])
Soft-delete with purgatory (recovery window before permanent removal)
mempalace_delete_drawer currently does a hard delete with no undo

Test plan

18 new tests in test_scanner.py covering all patterns and false-positive checks
Full test suite passes (135 tests, 0 failures)
Lint clean via ruff

Detect API keys, tokens, passwords, and private keys in content before storage. Advisory warnings only — does not block ingestion. - New scanner.py module with regex-based detection - MCP add_drawer returns warnings field when secrets found - Miner prints stderr warning during file ingestion - New CLI command: mempalace scan [--wing NAME] to audit existing drawers - 18 tests covering all patterns and false-positive checks

…b-oti/mempalace into feat/sensitive-content-scanner

bgauryy · 2026-04-09T17:41:40Z

PR Review: feat: add sensitive content scanner for palace drawers

Executive Summary

Aspect	Value
PR Goal	Add regex-based scanner that warns about sensitive content (API keys, tokens, passwords, private keys) at every ingestion point
Files Changed	5
Risk Level	HIGH — Security feature that scans for secrets; regex gaps create false confidence, and warnings leak partial secrets
Review Mode	Full
Review Effort	3/5
Recommendation	REQUEST_CHANGES

Affected Areas: Scanner core (scanner.py), MCP ingestion (mcp_server.py), file mining (miner.py), CLI audit (cli.py), tests (test_scanner.py)

Business Impact: Users trust this scanner to flag sensitive data before it's stored. False negatives create a false sense of security. Warning messages that include the matched secret content undermine the scanner's purpose.

Flow Changes: Scanner hooks into three existing ingestion paths — MCP tool_add_drawer, miner process_file, and a new cmd_scan CLI command. All advisory-only (content still stored). MCP path adds a warnings field to the return dict.

Ratings

Aspect	Score
Correctness	3/5
Security	3/5
Performance	4/5
Maintainability	3/5

PR Health

Has clear description
References ticket/issue (if applicable) — no issue linked
Appropriate size (236 additions across 5 files)
Has relevant tests (124 lines of test coverage)

Guidelines Compliance

No project guidelines loaded.

High Priority Issues

(Must fix before merge)

[Security] #1: Scanner Warnings Leak Partial Secret Content to MCP Callers

Location: mempalace/scanner.py:33-38, mempalace/mcp_server.py:382 | Confidence: HIGH

scan_content() stores the full matched secret in findings["match"]. format_warnings() truncates to 40 chars but still includes a significant portion of the actual secret. In the MCP path, result["warnings"] = format_warnings(findings) sends this back to the MCP caller — an AI agent that may log it, display it, or forward it to an AI provider's API. The CLI cmd_scan also prints up to 50 chars of matched secrets to stdout. A scanner designed to protect against secret leakage should not itself propagate secrets in its output.

The miner.py integration is the only caller that handles this correctly — it prints only pattern names, not matched content.

 def format_warnings(findings):
     """Format findings into a human-readable warning string."""
     if not findings:
         return ""
     lines = ["WARNING: Sensitive content detected:"]
     for f in findings:
-        preview = f["match"][:40] + "..." if len(f["match"]) > 40 else f["match"]
-        lines.append(f"  - {f['pattern_name']}: {preview}")
+        lines.append(f"  - {f['pattern_name']} (chars {f['start']}-{f['end']})")
     return "\n".join(lines)

And in cli.py:cmd_scan, use position instead of content:

             for f in findings:
-                preview = f["match"][:50] + "..." if len(f["match"]) > 50 else f["match"]
-                print(f"    - {f['pattern_name']}: {preview}")
+                print(f"    - {f['pattern_name']} (chars {f['start']}-{f['end']})")

[Security] #2: Incomplete Pattern Coverage Creates False Confidence

Location: mempalace/scanner.py:10-23 | Confidence: HIGH

The scanner covers 4 pattern families (OpenAI/AWS/GitHub keys, Bearer tokens, password assignments, private keys) but misses many common secret formats. Users who see "0 sensitive patterns found" may trust that result when secrets exist. For a security feature, false negatives are worse than missing the feature entirely — they create a false sense of security.

Missing patterns include:

Stripe: sk_live_, pk_live_, rk_live_ (note: underscore not dash — current sk- won't match)
Slack: xoxb-, xoxp-, xoxo-, xapp-
npm: npm_
Database connection strings: postgres://user:password@, mongodb+srv://user:password@
Generic secret assignment: SECRET_KEY=, API_SECRET=, CLIENT_SECRET=

 PATTERNS = {
     "api_key": re.compile(
-        r"(?:sk-|AKIA|ghp_|gho_|github_pat_)[A-Za-z0-9_-]{20,}"
+        r"(?:sk-|sk_live_|pk_live_|rk_live_|AKIA|ghp_|gho_|github_pat_|xoxb-|xoxp-|xapp-|npm_)[A-Za-z0-9_-]{20,}"
     ),
     "bearer_token": re.compile(
         r"Bearer\s+[A-Za-z0-9_-]{20,}"
     ),
     "password_assignment": re.compile(
         r"""(?:password|passwd|pwd)["']?\s*[=:]\s*['"][^'"]+['"]""", re.IGNORECASE
     ),
     "private_key": re.compile(
         r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"
     ),
+    "connection_string": re.compile(
+        r"(?:postgres|mysql|mongodb\+srv|redis)://[^:\s]+:[^@\s]+@"
+    ),
+    "generic_secret": re.compile(
+        r"""(?:SECRET|PRIVATE)[_-]?(?:KEY|TOKEN)["']?\s*[=:]\s*['"][^'"]{8,}['"]""", re.IGNORECASE
+    ),
 }

Medium Priority Issues

(Should fix, not blocking)

[Bug] #3: `scan_content` Has No Null/Type Guard

Location: mempalace/scanner.py:27 | Confidence: HIGH

If content is None (possible from ChromaDB col.get() when a document has been deleted or corrupted), pattern.finditer(None) raises TypeError. The CLI cmd_scan iterates over data["documents"] from ChromaDB, which can contain None entries.

 def scan_content(content):
+    if not content:
+        return []
     findings = []
     for name, pattern in PATTERNS.items():

[Architecture] #4: Inconsistent Finding Display Across Three Integration Points

Location: mempalace/scanner.py:42, mempalace/cli.py:255, mempalace/miner.py:435 | Confidence: HIGH

Three callers format scanner findings three different ways:

Caller	Shows Match Content	Truncation	Output
`format_warnings()` (MCP)	Yes	40 chars	MCP response dict
`cmd_scan()` (CLI)	Yes	50 chars	stdout
`process_file()` (miner)	No — names only	N/A	stderr

The miner's approach (names only) is the safest and should be the standard. All callers should use format_warnings() for consistency, and format_warnings() should not include match content (see Issue #1).

 # cli.py — cmd_scan
-            for f in findings:
-                preview = f["match"][:50] + "..." if len(f["match"]) > 50 else f["match"]
-                print(f"    - {f['pattern_name']}: {preview}")
+            print(f"    {format_warnings(findings)}")

[Code Quality] #5: `bearer_token` Regex Case-Sensitivity Inconsistency

Location: mempalace/scanner.py:14-16 | Confidence: MED

password_assignment uses re.IGNORECASE but bearer_token does not. HTTP Authorization headers use Bearer (capital B), but in conversation logs, code comments, and config files, both bearer and BEARER appear. Since the scanner processes conversation content and code files (not raw HTTP headers), case-insensitive matching is more appropriate.

     "bearer_token": re.compile(
-        r"Bearer\s+[A-Za-z0-9_-]{20,}"
+        r"Bearer\s+[A-Za-z0-9_-]{20,}", re.IGNORECASE
     ),

Low Priority Issues

(Nice to have)

[Performance] #6: CLI Scan Loads All Drawers Into Memory At Once

Location: mempalace/cli.py:241 | Confidence: MED

col.get(limit=10000) loads up to 10,000 documents into memory simultaneously. For large palaces with many drawers, this could cause significant memory pressure. Consider batching with offset/limit pagination.

-    kwargs = {"include": ["documents", "metadatas"], "limit": 10000}
-    ...
-    data = col.get(**kwargs)
+    batch_size = 500
+    offset = 0
+    while True:
+        kwargs = {"include": ["documents", "metadatas"], "limit": batch_size, "offset": offset}
+        if args.wing:
+            kwargs["where"] = {"wing": args.wing}
+        data = col.get(**kwargs)
+        if not data["ids"]:
+            break
+        # ... process batch ...
+        offset += batch_size

Flow Impact Analysis

                    INGESTION PATHS (all advisory-only)
                    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

  MCP Client              CLI Mine              CLI Scan (NEW)
      │                      │                       │
      ▼                      ▼                       ▼
  tool_add_drawer()     process_file()          cmd_scan()
      │                      │                       │
      ├─ scan_content()      ├─ scan_content()       ├─ scan_content()
      │    ↓                 │    ↓                  │    ↓
      │  findings            │  findings             │  findings
      │    ↓                 │    ↓                  │    ↓
      ├─ format_warnings()   ├─ print names only     ├─ print match[:50]
      │    ↓                 │    (stderr) ✓          │    (stdout) ⚠
      │  [LEAKS secret       │                       │  [LEAKS secret
      │   to MCP response]   │                       │   to terminal]
      │    ⚠                 │                       │
      ▼                      ▼                       ▼
  Content STORED         Content STORED          (read-only scan)
  regardless             regardless

Blast Radius: scan_content() is called from 3 locations. Changes to the function signature or return format affect all 3 callers. The format_warnings() function is used by MCP only — CLI has its own inline formatting.

Breaking Changes: None — the scanner is purely additive. MCP callers that don't check for "warnings" in the response are unaffected.

Relationship to RFC: The existing RFC-mempalace-security-hardening identifies "Sensitive data in WAL" (Finding 5) and "Internal error leakage" (Finding 8) as security concerns. This PR's warning leakage (Issue #1) is a new instance of the same class of problem — the scanner that detects secrets also propagates them through its output.

Created by Octocode MCP https://octocode.ai

web3guru888

Nice security feature. Advisory-only is the right design — blocking storage would be worse than a warning. A few observations:

Strengths:

Pattern coverage is solid: OpenAI, AWS, GitHub (pat/oauth/personal), bearer tokens, password assignments, private keys. The 18 tests include both detection and false-positive checks.
MCP integration is non-intrusive: warnings field in the return dict only appears when there are findings. Existing clients won't break.
The mempalace scan CLI command is useful for auditing existing palaces.

Potential issues:

Slack/Stripe/Anthropic tokens — You cover OpenAI (sk-) and AWS (AKIA), but xoxb-, xoxp- (Slack), sk_live_/sk_test_ (Stripe), sk-ant- (Anthropic) are equally common in conversation transcripts. Worth adding in a follow-up.
sk- prefix collision — The sk-[A-Za-z0-9_-]{20,} pattern will match any 20+ char string starting with sk-. In code discussions, variable names like sk-session-key-manager-helper could false-positive. Consider tightening to require sk-proj- or minimum 30+ chars for bare sk-.
Performance on large palaces — The scan CLI does limit: 10000 in a single col.get() call. On palaces with 50k+ drawers, this will OOM or at least be very slow. Consider paginating with offset/limit or using ChromaDB's offset parameter.
password_assignment false positives — The regex will match config file discussions like password: "${DB_PASSWORD}" (environment variable references, not actual secrets). Consider excluding patterns where the value contains ${ or os.environ.

The scanner module itself is cleanly structured and easy to extend. Good foundation for the redaction feature you mentioned.

🔭 Reviewed as part of the MemPalace-AGI integration project — autonomous research with perfect memory. Community interaction updates are posted regularly on the dashboard.

- Remove secret content from warnings — show position only (chars N-M) - Add patterns: Stripe (sk_live_, sk_test_), Slack (xoxb-, xoxp-), npm (npm_), and database connection strings (postgres://, mongodb://, etc.) - Add None guard to scan_content() for robustness - Add re.IGNORECASE to bearer_token pattern - Unify display format — all paths use format_warnings() - Batch CLI scan with pagination (500 drawers per batch) - Expand tests from 18 to 29

- Require sk-proj-, sk-ant-, or sk-or- prefix instead of bare sk- to avoid false positives on variable names like sk-session-key-manager - Add Anthropic (sk-ant-) to detected patterns - Exclude password values starting with $ (env var references like ${DB_PASSWORD}) from password_assignment pattern - Add 4 new tests (33 total): bare sk- false positive, env var password false positives, Anthropic key detection

release: sync develop → main (v3.3.0 manifest, SECURITY.md, version guard, Pages CNAME)

@Perseusxrltd

…ry, PalaceContext Lands the read-side contract so third-party adapter authors (@Perseusxrltd, @JakobSachs, @adv3nt3, @zendesk-thittesdorf, @mfhens, @roip, @MrDys) have a stable target matching what RFC 001 §10 landed on the write side in #995. Scope (this PR): - mempalace/sources/base.py: BaseSourceAdapter ABC with kwargs-only ingest() / describe_schema() and default is_current() / source_summary() / close() (§1.1–1.2). Typed records: SourceRef, SourceItemMetadata, DrawerRecord, RouteHint, SourceSummary, AdapterSchema, FieldSpec (§1.3, §5.2). Error classes: SourceNotFoundError, AuthRequiredError, AdapterClosedError, TransformationViolationError, SchemaConformanceError (§2.7). Class-level identity contract: name / adapter_version / capabilities / supported_modes / declared_transformations / default_privacy_class (§2.1, §1.4, §1.5, §6). - mempalace/sources/transforms.py: reference implementations of the 13 reserved transformations (§1.4) — utf8_replace_invalid, newline_normalize, whitespace_trim, whitespace_collapse_internal, line_trim, line_join_spaces, blank_line_drop — as pure functions, plus identity shims for the six adapter-specific ones (strip_tool_chrome, tool_result_truncate, tool_result_omitted, spellcheck_user, synthesized_marker, speaker_role_assignment) that the conversations adapter will override when migrated. get_transformation(name) resolves by reserved name. - mempalace/sources/registry.py: entry-point discovery via importlib.metadata.entry_points(group="mempalace.sources") + explicit register()/unregister() surface (§3.1–3.2). resolve_adapter_for_source() implements the §3.3 priority order; crucially, no auto-detection on the read side (§3.3 is explicit about that — user intent never inferred from on-disk artifacts). - mempalace/sources/context.py: PalaceContext facade (§9) bundling the drawer/closet collections, knowledge graph, palace path, adapter identity, and progress hooks core passes into adapter.ingest(). upsert_drawer() applies the spec-mandated adapter_name/adapter_version stamps from §5.1. skip_current_item() signals laziness; emit() dispatches to hooks and swallows hook exceptions. - mempalace/knowledge_graph.py: add_triple() gains optional source_drawer_id and adapter_name kwargs (§5.5). Backwards-compatible column migration auto-adds the new columns on open of a pre-RFC 002 palace (PRAGMA table_info then ALTER TABLE ADD COLUMN), matching the pattern used for any new palace-side provenance fields. - pyproject.toml: mempalace.sources entry-point group declared. Empty on the first-party side for now — miners migrate in a follow-up; the group being present means third-party packages can begin registering today. Out of scope (explicit follow-ups): - miner.py → mempalace/sources/filesystem.py. Behavior-preserving rename that also moves READABLE_EXTENSIONS, detect_room(), detect_hall() into the adapter (§9). Larger refactor; lands separately. - convo_miner.py + normalize.py → mempalace/sources/conversations.py. The format-detection if-chain in normalize.py becomes per-format plugins; declared_transformations enumerates what the current pipeline already does to source bytes (§1.4 existing-code mapping). - Closet post-step wired into the conversations adapter (§1.7). - CLI --source flag + --mode deprecation alias (§3.3). - MCP mempalace_mine tool source parameter. - AbstractSourceAdapterContractSuite (§7.1–7.3): byte-preservation round- trip and declared-transformation round-trip tests. - Privacy-class floor enforcement (§6.2); depends on #389 for secrets_possible scanning. Tests: 1018 passed (up from ~990 on develop), +27 targeted tests covering the ABC instantiation rules, typed records, all reserved transformations, the registry register/get/unregister surface, PalaceContext upsert + skip + emit semantics, and both the new KG provenance kwargs and backwards- compatible legacy-schema migration. Refs: #989 (RFC 002 tracking), #990 (RFC 002 spec), #995 (RFC 001 §10 cleanup — sibling PR on the write side).

MehdiJenab-oti and others added 4 commits April 9, 2026 11:24

Merge branch 'main' into feat/sensitive-content-scanner

3d090e0

merge: resolve conflict in miner.py with upstream main

2a4b748

Merge branch 'feat/sensitive-content-scanner' of github.com:MehdiJena…

2e5c2ed

…b-oti/mempalace into feat/sensitive-content-scanner

web3guru888 reviewed Apr 10, 2026

View reviewed changes

MehdiJenab-oti and others added 4 commits April 9, 2026 23:31

Merge branch 'main' into feat/sensitive-content-scanner

e563a77

Merge branch 'main' into feat/sensitive-content-scanner

695eaac

bensig changed the base branch from main to develop April 11, 2026 22:22

bensig requested review from bensig, igorls and milla-jovovich as code owners April 11, 2026 22:22

igorls added area/cli CLI commands area/mcp MCP server and tools area/mining File and conversation mining enhancement New feature or request security Security related labels Apr 14, 2026

igorls and others added 2 commits April 14, 2026 12:35

Merge pull request MemPalace#878 from MemPalace/develop

73f038c

release: sync develop → main (v3.3.0 manifest, SECURITY.md, version guard, Pages CNAME)

merge: resolve conflict in miner.py with upstream main

3386270

bensig mentioned this pull request Apr 18, 2026

RFC: Source adapter plugin specification #989

Open

igorls mentioned this pull request Apr 18, 2026

refactor(sources): RFC 002 §9 scaffolding — BaseSourceAdapter, registry, PalaceContext #1014

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add sensitive content scanner for palace drawers#389

feat: add sensitive content scanner for palace drawers#389
MehdiJenab-oti wants to merge 10 commits intoMemPalace:developfrom
MehdiJenab-oti:feat/sensitive-content-scanner

MehdiJenab-oti commented Apr 9, 2026

Uh oh!

bgauryy commented Apr 9, 2026

Uh oh!

web3guru888 left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

MehdiJenab-oti commented Apr 9, 2026

Summary

Next steps (future PRs)

Test plan

Uh oh!

bgauryy commented Apr 9, 2026

PR Review: feat: add sensitive content scanner for palace drawers

Executive Summary

Ratings

PR Health

Guidelines Compliance

High Priority Issues

[Security] #1: Scanner Warnings Leak Partial Secret Content to MCP Callers

[Security] #2: Incomplete Pattern Coverage Creates False Confidence

Medium Priority Issues

[Bug] #3: scan_content Has No Null/Type Guard

[Architecture] #4: Inconsistent Finding Display Across Three Integration Points

[Code Quality] #5: bearer_token Regex Case-Sensitivity Inconsistency

Low Priority Issues

[Performance] #6: CLI Scan Loads All Drawers Into Memory At Once

Flow Impact Analysis

Uh oh!

web3guru888 left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

[Bug] #3: `scan_content` Has No Null/Type Guard

[Code Quality] #5: `bearer_token` Regex Case-Sensitivity Inconsistency