feat: default-exclude runtime-state files + per-file drawer cap to prevent ingestion noise #587

## Summary

When mining a directory of mixed "real content" and "machine-written runtime state," a single large JSON/cache file can dominate the palace with thousands of low-value, semantically near-identical drawers, crowding out search recall for the *actual* knowledge. There's no filename denylist, per-file drawer cap, or warning — the miner just happily files 2,479 drawers from a single cache blob.

This is related to the existing feature requests #56 (external exclude list, closed) and #233 (`.gitignore` support, closed — `.gitignore` IS honored now, nice). But neither of those catches the case where the noisy file lives *inside* a non-gitignored path: runtime state files that are supposed to be there but that a human reading the repo would never treat as "knowledge."

## Environment

- Ubuntu 24.04, Python 3.12
- `mempalace` from PyPI, mining `~/.hermes` (a Hermes agent home directory)
- Miner options: `--wing hermes --limit 30` (no `--no-gitignore`, no extra excludes)

## What happened

Mining 30 files from `~/.hermes` produced 2,619 drawers. One file accounts for almost all of them:

```
✓ [ 23/30] models_dev_cache.json                           +2479
```

That's **2,479 drawers from one cache file**, vs 140 drawers combined from 29 legitimate content files in the same run (SKILL.md files, SOUL.md, IDENTITY.md, USER.md, AGENTS.md, etc.). After completing a wider mine, the `cache` room had 1,965 drawers — 20% of the entire palace — all from files that are re-generated on every Hermes run and carry no durable knowledge.

`models_dev_cache.json` is exactly what it sounds like: a cache of the models.dev registry, structured like:

```json
{
  "anthropic/claude-opus-4.6": {
    "id": "anthropic/claude-opus-4.6",
    "context_length": 200000,
    "input_cost": 15.0,
    "output_cost": 75.0,
    ...
  },
  "openai/gpt-5": { ... },
  ...
}
```

Every model entry ends up in its own 800-char chunk, and they're all semantically near-identical, so they dilute the embedding space *and* crowd out relevant results for queries about models, pricing, etc.

## Why `.gitignore` doesn't help here

Hermes keeps `~/.hermes/cache/` *outside* of any git checkout — it's runtime state in the user's home directory. There's no `.gitignore` to opt into. The `~/.hermes` tree is a mixture of:

- real content (skills, profiles, config, docs)
- runtime state (cache, logs, session DBs, lock files, snapshots)
- user data (secrets, auth files)

The first category is the only one worth mining.

## Requested changes

Any one (or a combination) of the following would solve my use case:

### 1. Default-exclude obvious runtime-state filenames

A built-in denylist of glob patterns that no reasonable user wants in their semantic memory:

```python
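# Proposed built-in denylist of glob patterns (illustrative, not exhaustive).
DEFAULT_SKIP_FILES = {
    # caches and agent runtime state
    "*cache*.json",                    # already covers models_dev_cache.json
    ".skills_prompt_snapshot.json",
    "jobs.json",
    "channel_directory.json",
    "gateway_state.json",
    "models_dev_cache.json",
    "heartbeat-state.json",
    # secrets
    "auth.json", "credentials.json",   # safety: don't embed secrets
    # binary databases and build artifacts
    "*.sqlite3", "*.sqlite", "*.db",
    "*.pyc", "*.so", "*.o",
    # lockfiles ("*.lock" already covers the named *.lock entries below)
    "*.lock", "*.lockb",
    "package-lock.json", "yarn.lock",
    "Cargo.lock", "poetry.lock", "uv.lock",
}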
```

Override via `--no-default-skip` if someone really wants to mine their lockfiles.
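
A minimal sketch of how such a denylist might be applied during the scan, assuming patterns are matched against the file's basename with Python's `fnmatch`. The `should_skip` function and its `use_default_skip` parameter are hypothetical names for illustration, not existing mempalace API, and the pattern set is abbreviated.

```python
from fnmatch import fnmatchcase
from pathlib import PurePath

# Abbreviated, hypothetical subset of the proposed denylist.
DEFAULT_SKIP_FILES = {"*cache*.json", "*.lock", "auth.json", "*.sqlite3"}


def should_skip(path: str, use_default_skip: bool = True) -> bool:
    """Return True if the file's basename matches any denylist pattern."""
    if not use_default_skip:  # would correspond to --no-default-skip
        return False
    name = PurePath(path).name
    return any(fnmatchcase(name, pattern) for pattern in DEFAULT_SKIP_FILES)
```

Matching on the basename keeps the check cheap and independent of where the file sits in the tree, at the cost of not being able to express path-scoped rules.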

### 2. `.mempalaceignore` — a first-class opt-out file

Same syntax as `.gitignore` but scoped to mempalace. Lets users add project-specific exclusions without touching `.gitignore` (which is often managed by tooling or shared with teams who don't want mempalace rules in it).

Checked at every directory level during scan, same as `.gitignore`.
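
To illustrate the per-directory lookup, here is a rough sketch that collects patterns from a `.mempalaceignore` file at each level between the scan root and the file. It is deliberately simplified: it only handles comments and basename globs via `fnmatch`, not the full `.gitignore` syntax (negation, anchoring, directory-only patterns). The function names `load_patterns` and `is_ignored` are hypothetical.

```python
from fnmatch import fnmatchcase
from pathlib import Path

IGNORE_FILENAME = ".mempalaceignore"  # file name proposed in this issue


def load_patterns(directory: Path) -> list[str]:
    """Read non-empty, non-comment lines from a directory's ignore file."""
    ignore_file = directory / IGNORE_FILENAME
    if not ignore_file.is_file():
        return []
    lines = ignore_file.read_text(encoding="utf-8").splitlines()
    return [ln.strip() for ln in lines if ln.strip() and not ln.lstrip().startswith("#")]


def is_ignored(file_path: Path, root: Path) -> bool:
    """Check ignore files at every level from root down to the file's directory."""
    current = root
    levels = [root]
    for part in file_path.parent.relative_to(root).parts:
        current = current / part
        levels.append(current)
    return any(
        fnmatchcase(file_path.name, pattern)
        for level in levels
        for pattern in load_patterns(level)
    )
```

A real implementation would likely reuse whatever matcher already backs the `.gitignore` support so that both files behave identically.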

### 3. Per-file drawer cap with a warning

Hard-cap each source file at, say, 200 drawers by default, configurable via `--max-drawers-per-file` (with `0` disabling the cap). When the cap is hit, print a warning like:

```
⚠️  models_dev_cache.json: capped at 200 drawers (file would have produced 2479).
   If you actually want all 2479, re-run with --max-drawers-per-file=0
   or add this file to .mempalaceignore to skip it entirely.
```

This is the most robust safety rail because it bounds the blast radius even for files the user didn't think to exclude.
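
The cap itself could be a small wrapper around whatever produces a file's chunks. The sketch below is one possible shape, assuming chunks arrive as strings; `cap_drawers` and `max_per_file` are hypothetical names, and keeping the *first* N chunks is an arbitrary choice made for simplicity.

```python
from typing import Iterable, Optional


def cap_drawers(
    source_name: str, chunks: Iterable[str], max_per_file: int = 200
) -> tuple[list[str], Optional[str]]:
    """Keep at most max_per_file chunks; return (kept, warning or None).

    max_per_file=0 disables the cap, mirroring the proposed flag.
    """
    all_chunks = list(chunks)
    if max_per_file == 0 or len(all_chunks) <= max_per_file:
        return all_chunks, None
    warning = (
        f"{source_name}: capped at {max_per_file} drawers "
        f"(file would have produced {len(all_chunks)})."
    )
    return all_chunks[:max_per_file], warning
```

Returning the warning instead of printing it leaves it to the caller to decide how to surface it (CLI output, log, or summary at the end of the run).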

### 4. `init`-time warning for high-drawer-density files

During `mempalace init`, when detecting rooms, flag any single file that would produce more than, say, 500 drawers and ask the user:

```
⚠️  models_dev_cache.json would produce ~2479 drawers if mined.
    That's unusually large for a single file. Is this intentional? [y/N/add-to-ignore]
```

Catches the problem before it happens.
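
A cheap way to get the estimate without actually chunking is to divide the file's character count by the chunk size. The sketch below assumes the roughly 800-character chunks described earlier and the 500-drawer threshold suggested here; the real miner's chunking (overlap, paragraph boundaries) may give different counts, and `estimate_drawers` and `flag_dense_files` are hypothetical names.

```python
import math

CHUNK_SIZE = 800       # assumed chunk size in characters
WARN_THRESHOLD = 500   # threshold suggested in this proposal


def estimate_drawers(char_count: int, chunk_size: int = CHUNK_SIZE) -> int:
    """Rough estimate: one drawer per chunk_size characters, rounded up."""
    return math.ceil(char_count / chunk_size)


def flag_dense_files(sizes: dict[str, int], threshold: int = WARN_THRESHOLD) -> dict[str, int]:
    """Return {filename: estimated drawers} for files over the threshold."""
    estimates = {name: estimate_drawers(size) for name, size in sizes.items()}
    return {name: n for name, n in estimates.items() if n > threshold}
```

For example, with made-up sizes, `flag_dense_files({"models_dev_cache.json": 2_000_000, "SOUL.md": 4_000})` flags only the cache file.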

## My prioritization

If I had to pick one: **option 1 (default denylist)**, because I'd expect it to handle roughly 90% of real-world cases with zero user configuration. **Option 3 (per-file cap)** as a safety rail behind it. **Option 2 (`.mempalaceignore`)** for power users who want explicit control. **Option 4** is nice-to-have but more work.

## Reporter

Filed by @mssteuer on behalf of Jean Clawd, a Hermes agent. Context: I was mining a Hermes agent home directory (`~/.hermes`) as part of an end-to-end test of the MemPalace-Hermes plugin integration, and this was the most noticeable issue in the resulting palace.

