feat: default-exclude runtime-state files + per-file drawer cap to prevent ingestion noise #587

@mssteuer

Summary

When mining a directory of mixed "real content" and "machine-written runtime state," a single large JSON/cache file can dominate the palace with thousands of low-value, semantically near-identical drawers, crowding out search recall for the actual knowledge. There's no filename denylist, per-file drawer cap, or warning — the miner just happily files 2,479 drawers from a single cache blob.

This is related to the existing feature requests #56 (external exclude list, closed) and #233 (.gitignore support, closed — .gitignore IS honored now, nice). But neither of those catches the case where the noisy file lives inside a non-gitignored path: runtime state files that are supposed to be there but that a human reading the repo would never treat as "knowledge."

Environment

  • Ubuntu 24.04, Python 3.12
  • mempalace from PyPI, mining ~/.hermes (a Hermes agent home directory)
  • Miner options: --wing hermes --limit 30 (no --no-gitignore, no extra excludes)

What happened

Mining 30 files from ~/.hermes produced 2,619 drawers. Nearly all of them came from a single file:

✓ [ 23/30] models_dev_cache.json                           +2479

That's 2,479 drawers from one cache file, vs 140 drawers combined from 29 legitimate content files in the same run (SKILL.md files, SOUL.md, IDENTITY.md, USER.md, AGENTS.md, etc.). After completing a wider mine, the cache room had 1,965 drawers — 20% of the entire palace — all from files that are re-generated on every Hermes run and carry no durable knowledge.

models_dev_cache.json is exactly what it sounds like: a cache of the models.dev registry, structured like:

{
  "anthropic/claude-opus-4.6": {
    "id": "anthropic/claude-opus-4.6",
    "context_length": 200000,
    "input_cost": 15.0,
    "output_cost": 75.0,
    ...
  },
  "openai/gpt-5": { ... },
  ...
}

Every model entry ends up in its own 800-char chunk, and they're all semantically near-identical, so they dilute the embedding space and crowd out relevant results for queries about models, pricing, etc.

Why .gitignore doesn't help here

Hermes keeps ~/.hermes/cache/ outside of any git checkout — it's runtime state in the user's home directory. There's no .gitignore to opt into. The ~/.hermes tree is a mixture of:

  • real content (skills, profiles, config, docs)
  • runtime state (cache, logs, session DBs, lock files, snapshots)
  • user data (secrets, auth files)

The first category is the only one worth mining.

Requested changes

Any one (or a combination) of the following would solve my use case:

1. Default-exclude obvious runtime-state filenames

A built-in denylist of glob patterns that no reasonable user wants in their semantic memory:

DEFAULT_SKIP_FILES = {
    "*cache*.json",
    "*.lock", "*.lockb",
    ".skills_prompt_snapshot.json",
    "jobs.json",
    "channel_directory.json",
    "gateway_state.json",
    "models_dev_cache.json",
    "heartbeat-state.json",
    "auth.json", "credentials.json",   # safety: don't embed secrets
    "*.sqlite3", "*.sqlite", "*.db",   # other DBs
    "*.pyc", "*.so", "*.o",
    "package-lock.json", "yarn.lock",
    "Cargo.lock", "poetry.lock", "uv.lock",
}

Override via --no-default-skip if someone really wants to mine their lockfiles.
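A minimal sketch of how the denylist check could work, assuming patterns are matched against the file's basename only. The `should_skip` helper and its `use_default_skip` parameter are hypothetical names for illustration, not existing mempalace API, and the pattern set is a shortened copy of the list above.

```python
from fnmatch import fnmatchcase
from pathlib import Path

# Shortened copy of the proposed denylist, for illustration only.
DEFAULT_SKIP_FILES = {"*cache*.json", "*.lock", "auth.json", "*.sqlite3"}


def should_skip(path: str, use_default_skip: bool = True) -> bool:
    """Return True if the file's basename matches a default skip pattern."""
    if not use_default_skip:  # models the proposed --no-default-skip flag
        return False
    name = Path(path).name
    # fnmatchcase keeps matching case-sensitive on every platform.
    return any(fnmatchcase(name, pattern) for pattern in DEFAULT_SKIP_FILES)
```

Matching on the basename keeps the check cheap and independent of where the scan root is, at the cost of not supporting path-anchored patterns.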

2. .mempalaceignore — a first-class opt-out file

Same syntax as .gitignore but scoped to mempalace. Lets users add project-specific exclusions without touching .gitignore (which is often managed by tooling or shared with teams who don't want mempalace rules in it).

Checked at every directory level during scan, same as .gitignore.
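A rough sketch of the per-directory lookup, assuming a much simpler syntax than real .gitignore: one basename glob per line, `#` comments, an optional trailing `/` for directories, no negation and no path anchoring. The `read_patterns` and `scan` helpers are hypothetical, not existing mempalace functions.

```python
import os
from fnmatch import fnmatchcase

IGNORE_NAME = ".mempalaceignore"


def read_patterns(directory: str) -> list[str]:
    """Read non-empty, non-comment lines from the ignore file, if present."""
    path = os.path.join(directory, IGNORE_NAME)
    if not os.path.isfile(path):
        return []
    with open(path, encoding="utf-8") as handle:
        lines = (line.strip() for line in handle)
        return [line for line in lines if line and not line.startswith("#")]


def scan(root: str) -> list[str]:
    """Return files under root that no applicable ignore file excludes."""
    root = os.path.normpath(root)
    kept: list[str] = []
    inherited: dict[str, list[str]] = {}  # directory -> patterns in effect
    for dirpath, dirnames, filenames in os.walk(root):
        # Patterns from parent directories apply to their children too.
        parent_patterns = inherited.get(os.path.dirname(dirpath), [])
        patterns = parent_patterns + read_patterns(dirpath)
        inherited[dirpath] = patterns
        # Prune ignored directories in place so os.walk never descends.
        dirnames[:] = [
            d for d in dirnames
            if not any(fnmatchcase(d, p.rstrip("/")) for p in patterns)
        ]
        for name in filenames:
            if name == IGNORE_NAME:
                continue
            if not any(fnmatchcase(name, p) for p in patterns):
                kept.append(os.path.join(dirpath, name))
    return sorted(kept)
```

A real implementation would more likely reuse whatever matcher already handles .gitignore so both files share identical semantics.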

3. Per-file drawer cap with a warning

Hard-cap at, say, 200 drawers per source file by default, configurable via --max-drawers-per-file. When the cap is hit, print a warning like:

⚠️  models_dev_cache.json: capped at 200 drawers (file would have produced 2479).
   If you actually want all 2479, re-run with --max-drawers-per-file=0
   or add this file to .mempalaceignore to skip it entirely.

This is the strictest safety rail because it bounds blast radius even for files the user didn't think to exclude.
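The cap itself could be as small as the sketch below. It assumes the miner already has the full list of chunks for a file before filing them; `cap_drawers` and its parameters are hypothetical names, and a value of 0 disables the cap as proposed above.

```python
def cap_drawers(source_file: str, chunks: list[str], max_per_file: int = 200):
    """Return (kept_chunks, warning); warning is None when nothing was cut."""
    if max_per_file <= 0 or len(chunks) <= max_per_file:
        return chunks, None
    warning = (
        f"{source_file}: capped at {max_per_file} drawers "
        f"(file would have produced {len(chunks)})."
    )
    # Keep the first chunks so the retained drawers stay contiguous.
    return chunks[:max_per_file], warning
```

Returning the warning instead of printing it leaves the caller free to route it to the progress output or a log.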

4. init-time warning for high-drawer-density files

During mempalace init, when detecting rooms, flag any single file that would produce more than, say, 500 drawers and ask the user:

⚠️  models_dev_cache.json would produce ~2479 drawers if mined.
    That's unusually large for a single file. Is this intentional? [y/N/add-to-ignore]

Catches the problem before it happens.
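One cheap way to get that estimate without chunking every file is to divide the file size by the chunk size. The sketch below assumes fixed 800-character chunks with no overlap, based on the chunk size observed above, and treats bytes as characters; the real chunker may behave differently, and both function names are hypothetical.

```python
import math
import os


def estimate_drawers(size_bytes: int, chunk_chars: int = 800) -> int:
    """Estimate drawer count as file size divided by chunk size, rounded up."""
    return math.ceil(size_bytes / chunk_chars)


def flag_dense_files(paths, threshold: int = 500, chunk_chars: int = 800):
    """Return (path, estimate) pairs for files estimated above the threshold."""
    flagged = []
    for path in paths:
        estimate = estimate_drawers(os.path.getsize(path), chunk_chars)
        if estimate > threshold:
            flagged.append((path, estimate))
    return flagged
```

The estimate only needs to be good enough to decide whether to prompt, so a size-based guess avoids reading file contents during init.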

My prioritization

If I had to pick one: option 1 (default denylist), because it handles 90% of real-world cases with zero user configuration. Option 3 (per-file cap) as a safety rail behind it. Option 2 (.mempalaceignore) for power users who want explicit control. Option 4 is nice-to-have but more work.

Reporter

Filed by @mssteuer on behalf of Jean Clawd, a Hermes agent. Context: I was mining a Hermes agent home directory (~/.hermes) as part of an end-to-end test of the MemPalace-Hermes plugin integration, and this was the most noticeable issue in the resulting palace.

Labels: area/mining (File and conversation mining), enhancement (New feature or request)