feat: add sensitive content scanner for palace drawers#389
feat: add sensitive content scanner for palace drawers#389MehdiJenab-oti wants to merge 10 commits intoMemPalace:developfrom
Conversation
Detect API keys, tokens, passwords, and private keys in content before storage. Advisory warnings only — does not block ingestion. - New scanner.py module with regex-based detection - MCP add_drawer returns warnings field when secrets found - Miner prints stderr warning during file ingestion - New CLI command: mempalace scan [--wing NAME] to audit existing drawers - 18 tests covering all patterns and false-positive checks
…b-oti/mempalace into feat/sensitive-content-scanner
PR Review: feat: add sensitive content scanner for palace drawersExecutive Summary
Affected Areas: Scanner core ( Business Impact: Users trust this scanner to flag sensitive data before it's stored. False negatives create a false sense of security. Warning messages that include the matched secret content undermine the scanner's purpose. Flow Changes: Scanner hooks into three existing ingestion paths — MCP Ratings
PR Health
Guidelines ComplianceNo project guidelines loaded. High Priority Issues(Must fix before merge) [Security] #1: Scanner Warnings Leak Partial Secret Content to MCP CallersLocation:
The def format_warnings(findings):
"""Format findings into a human-readable warning string."""
if not findings:
return ""
lines = ["WARNING: Sensitive content detected:"]
for f in findings:
- preview = f["match"][:40] + "..." if len(f["match"]) > 40 else f["match"]
- lines.append(f" - {f['pattern_name']}: {preview}")
+ lines.append(f" - {f['pattern_name']} (chars {f['start']}-{f['end']})")
return "\n".join(lines)And in for f in findings:
- preview = f["match"][:50] + "..." if len(f["match"]) > 50 else f["match"]
- print(f" - {f['pattern_name']}: {preview}")
+ print(f" - {f['pattern_name']} (chars {f['start']}-{f['end']})")[Security] #2: Incomplete Pattern Coverage Creates False ConfidenceLocation: The scanner covers 4 pattern families (OpenAI/AWS/GitHub keys, Bearer tokens, password assignments, private keys) but misses many common secret formats. Users who see "0 sensitive patterns found" may trust that result when secrets exist. For a security feature, false negatives are worse than missing the feature entirely — they create a false sense of security. Missing patterns include:
PATTERNS = {
"api_key": re.compile(
- r"(?:sk-|AKIA|ghp_|gho_|github_pat_)[A-Za-z0-9_-]{20,}"
+ r"(?:sk-|sk_live_|pk_live_|rk_live_|AKIA|ghp_|gho_|github_pat_|xoxb-|xoxp-|xapp-|npm_)[A-Za-z0-9_-]{20,}"
),
"bearer_token": re.compile(
r"Bearer\s+[A-Za-z0-9_-]{20,}"
),
"password_assignment": re.compile(
r"""(?:password|passwd|pwd)["']?\s*[=:]\s*['"][^'"]+['"]""", re.IGNORECASE
),
"private_key": re.compile(
r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"
),
+ "connection_string": re.compile(
+ r"(?:postgres|mysql|mongodb\+srv|redis)://[^:\s]+:[^@\s]+@"
+ ),
+ "generic_secret": re.compile(
+ r"""(?:SECRET|PRIVATE)[_-]?(?:KEY|TOKEN)["']?\s*[=:]\s*['"][^'"]{8,}['"]""", re.IGNORECASE
+ ),
}Medium Priority Issues(Should fix, not blocking) [Bug] #3:
|
| Caller | Shows Match Content | Truncation | Output |
|---|---|---|---|
format_warnings() (MCP) |
Yes | 40 chars | MCP response dict |
cmd_scan() (CLI) |
Yes | 50 chars | stdout |
process_file() (miner) |
No — names only | N/A | stderr |
The miner's approach (names only) is the safest and should be the standard. All callers should use format_warnings() for consistency, and format_warnings() should not include match content (see Issue #1).
# cli.py — cmd_scan
- for f in findings:
- preview = f["match"][:50] + "..." if len(f["match"]) > 50 else f["match"]
- print(f" - {f['pattern_name']}: {preview}")
+ print(f" {format_warnings(findings)}")[Code Quality] #5: bearer_token Regex Case-Sensitivity Inconsistency
Location: mempalace/scanner.py:14-16 | Confidence: MED
password_assignment uses re.IGNORECASE but bearer_token does not. HTTP Authorization headers use Bearer (capital B), but in conversation logs, code comments, and config files, both bearer and BEARER appear. Since the scanner processes conversation content and code files (not raw HTTP headers), case-insensitive matching is more appropriate.
"bearer_token": re.compile(
- r"Bearer\s+[A-Za-z0-9_-]{20,}"
+ r"Bearer\s+[A-Za-z0-9_-]{20,}", re.IGNORECASE
),Low Priority Issues
(Nice to have)
[Performance] #6: CLI Scan Loads All Drawers Into Memory At Once
Location: mempalace/cli.py:241 | Confidence: MED
col.get(limit=10000) loads up to 10,000 documents into memory simultaneously. For large palaces with many drawers, this could cause significant memory pressure. Consider batching with offset/limit pagination.
- kwargs = {"include": ["documents", "metadatas"], "limit": 10000}
- ...
- data = col.get(**kwargs)
+ batch_size = 500
+ offset = 0
+ while True:
+ kwargs = {"include": ["documents", "metadatas"], "limit": batch_size, "offset": offset}
+ if args.wing:
+ kwargs["where"] = {"wing": args.wing}
+ data = col.get(**kwargs)
+ if not data["ids"]:
+ break
+ # ... process batch ...
+ offset += batch_sizeFlow Impact Analysis
INGESTION PATHS (all advisory-only)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
MCP Client CLI Mine CLI Scan (NEW)
│ │ │
▼ ▼ ▼
tool_add_drawer() process_file() cmd_scan()
│ │ │
├─ scan_content() ├─ scan_content() ├─ scan_content()
│ ↓ │ ↓ │ ↓
│ findings │ findings │ findings
│ ↓ │ ↓ │ ↓
├─ format_warnings() ├─ print names only ├─ print match[:50]
│ ↓ │ (stderr) ✓ │ (stdout) ⚠
│ [LEAKS secret │ │ [LEAKS secret
│ to MCP response] │ │ to terminal]
│ ⚠ │ │
▼ ▼ ▼
Content STORED Content STORED (read-only scan)
regardless regardless
Blast Radius: scan_content() is called from 3 locations. Changes to the function signature or return format affect all 3 callers. The format_warnings() function is used by MCP only — CLI has its own inline formatting.
Breaking Changes: None — the scanner is purely additive. MCP callers that don't check for "warnings" in the response are unaffected.
Relationship to RFC: The existing RFC-mempalace-security-hardening identifies "Sensitive data in WAL" (Finding 5) and "Internal error leakage" (Finding 8) as security concerns. This PR's warning leakage (Issue #1) is a new instance of the same class of problem — the scanner that detects secrets also propagates them through its output.
Created by Octocode MCP https://octocode.ai
web3guru888
left a comment
There was a problem hiding this comment.
Nice security feature. Advisory-only is the right design — blocking storage would be worse than a warning. A few observations:
Strengths:
- Pattern coverage is solid: OpenAI, AWS, GitHub (pat/oauth/personal), bearer tokens, password assignments, private keys. The 18 tests include both detection and false-positive checks.
- MCP integration is non-intrusive:
warningsfield in the return dict only appears when there are findings. Existing clients won't break. - The
mempalace scanCLI command is useful for auditing existing palaces.
Potential issues:
-
Slack/Stripe/Anthropic tokens — You cover OpenAI (
sk-) and AWS (AKIA), butxoxb-,xoxp-(Slack),sk_live_/sk_test_(Stripe),sk-ant-(Anthropic) are equally common in conversation transcripts. Worth adding in a follow-up. -
sk-prefix collision — Thesk-[A-Za-z0-9_-]{20,}pattern will match any 20+ char string starting withsk-. In code discussions, variable names likesk-session-key-manager-helpercould false-positive. Consider tightening to requiresk-proj-or minimum 30+ chars for baresk-. -
Performance on large palaces — The scan CLI does
limit: 10000in a singlecol.get()call. On palaces with 50k+ drawers, this will OOM or at least be very slow. Consider paginating with offset/limit or using ChromaDB'soffsetparameter. -
password_assignmentfalse positives — The regex will match config file discussions likepassword: "${DB_PASSWORD}"(environment variable references, not actual secrets). Consider excluding patterns where the value contains${oros.environ.
The scanner module itself is cleanly structured and easy to extend. Good foundation for the redaction feature you mentioned.
🔭 Reviewed as part of the MemPalace-AGI integration project — autonomous research with perfect memory. Community interaction updates are posted regularly on the dashboard.
- Remove secret content from warnings — show position only (chars N-M) - Add patterns: Stripe (sk_live_, sk_test_), Slack (xoxb-, xoxp-), npm (npm_), and database connection strings (postgres://, mongodb://, etc.) - Add None guard to scan_content() for robustness - Add re.IGNORECASE to bearer_token pattern - Unify display format — all paths use format_warnings() - Batch CLI scan with pagination (500 drawers per batch) - Expand tests from 18 to 29
- Require sk-proj-, sk-ant-, or sk-or- prefix instead of bare sk-
to avoid false positives on variable names like sk-session-key-manager
- Add Anthropic (sk-ant-) to detected patterns
- Exclude password values starting with $ (env var references like
${DB_PASSWORD}) from password_assignment pattern
- Add 4 new tests (33 total): bare sk- false positive, env var
password false positives, Anthropic key detection
release: sync develop → main (v3.3.0 manifest, SECURITY.md, version guard, Pages CNAME)
…ry, PalaceContext Lands the read-side contract so third-party adapter authors (@Perseusxrltd, @JakobSachs, @adv3nt3, @zendesk-thittesdorf, @mfhens, @roip, @MrDys) have a stable target matching what RFC 001 §10 landed on the write side in #995. Scope (this PR): - mempalace/sources/base.py: BaseSourceAdapter ABC with kwargs-only ingest() / describe_schema() and default is_current() / source_summary() / close() (§1.1–1.2). Typed records: SourceRef, SourceItemMetadata, DrawerRecord, RouteHint, SourceSummary, AdapterSchema, FieldSpec (§1.3, §5.2). Error classes: SourceNotFoundError, AuthRequiredError, AdapterClosedError, TransformationViolationError, SchemaConformanceError (§2.7). Class-level identity contract: name / adapter_version / capabilities / supported_modes / declared_transformations / default_privacy_class (§2.1, §1.4, §1.5, §6). - mempalace/sources/transforms.py: reference implementations of the 13 reserved transformations (§1.4) — utf8_replace_invalid, newline_normalize, whitespace_trim, whitespace_collapse_internal, line_trim, line_join_spaces, blank_line_drop — as pure functions, plus identity shims for the six adapter-specific ones (strip_tool_chrome, tool_result_truncate, tool_result_omitted, spellcheck_user, synthesized_marker, speaker_role_assignment) that the conversations adapter will override when migrated. get_transformation(name) resolves by reserved name. - mempalace/sources/registry.py: entry-point discovery via importlib.metadata.entry_points(group="mempalace.sources") + explicit register()/unregister() surface (§3.1–3.2). resolve_adapter_for_source() implements the §3.3 priority order; crucially, no auto-detection on the read side (§3.3 is explicit about that — user intent never inferred from on-disk artifacts). - mempalace/sources/context.py: PalaceContext facade (§9) bundling the drawer/closet collections, knowledge graph, palace path, adapter identity, and progress hooks core passes into adapter.ingest(). upsert_drawer() applies the spec-mandated adapter_name/adapter_version stamps from §5.1. skip_current_item() signals laziness; emit() dispatches to hooks and swallows hook exceptions. - mempalace/knowledge_graph.py: add_triple() gains optional source_drawer_id and adapter_name kwargs (§5.5). Backwards-compatible column migration auto-adds the new columns on open of a pre-RFC 002 palace (PRAGMA table_info then ALTER TABLE ADD COLUMN), matching the pattern used for any new palace-side provenance fields. - pyproject.toml: mempalace.sources entry-point group declared. Empty on the first-party side for now — miners migrate in a follow-up; the group being present means third-party packages can begin registering today. Out of scope (explicit follow-ups): - miner.py → mempalace/sources/filesystem.py. Behavior-preserving rename that also moves READABLE_EXTENSIONS, detect_room(), detect_hall() into the adapter (§9). Larger refactor; lands separately. - convo_miner.py + normalize.py → mempalace/sources/conversations.py. The format-detection if-chain in normalize.py becomes per-format plugins; declared_transformations enumerates what the current pipeline already does to source bytes (§1.4 existing-code mapping). - Closet post-step wired into the conversations adapter (§1.7). - CLI --source flag + --mode deprecation alias (§3.3). - MCP mempalace_mine tool source parameter. - AbstractSourceAdapterContractSuite (§7.1–7.3): byte-preservation round- trip and declared-transformation round-trip tests. - Privacy-class floor enforcement (§6.2); depends on #389 for secrets_possible scanning. Tests: 1018 passed (up from ~990 on develop), +27 targeted tests covering the ABC instantiation rules, typed records, all reserved transformations, the registry register/get/unregister surface, PalaceContext upsert + skip + emit semantics, and both the new KG provenance kwargs and backwards- compatible legacy-schema migration. Refs: #989 (RFC 002 tracking), #990 (RFC 002 spec), #995 (RFC 001 §10 cleanup — sibling PR on the write side).
Summary
Users routinely paste sensitive data — API keys, bearer tokens, passwords,
private key blocks — into conversations and project files that get mined
into the palace. Once stored, these secrets sit unencrypted in ChromaDB
drawers where any MCP client or backup process can read them. A stolen
laptop or compromised backup leaks every secret ever filed.
This PR adds a regex-based scanner that warns at every ingestion point
(MCP, miner, CLI audit) so users can act before secrets accumulate.
add_drawerreturns awarningsfield when secrets are detected (advisory, never blocks)mempalace scan [--wing NAME]audits all existing drawersNext steps (future PRs)
[REDACTED])mempalace_delete_drawercurrently does a hard delete with no undoTest plan
test_scanner.pycovering all patterns and false-positive checks