feat: path-level mining descriptions — mine metadata instead of content for matched paths #981

@roip

Problem

MemPalace currently has two modes for any given file: mine everything (chunk all content into drawers), or exclude entirely (via .gitignore / SKIP_DIRS). There's no middle ground.

Real projects have directories where the palace needs to know what files are and what they do, but embedding every line of content is wasteful or harmful:

  • Experiment result dumps: hundreds of KB of JSON numbers. The palace should know 'this is centroid tracking data from Phase 2', not embed 2000 chunks of float arrays.
  • Test directories: the palace should know 'pytest suite for the miner and MCP server', not embed every assert statement.
  • Data/content folders: static assets, CSV datasets, model checkpoints. Descriptions matter, content doesn't.
  • Generated output: build artifacts, compiled files, log directories.

Today the only workaround is: .gitignore the heavy files, then manually write a separate markdown doc describing them and hope it gets mined. This is fragile and doesn't scale.

Proposed solution

Add a `descriptions` section to `mempalace.yaml`:

```yaml
wing: myproject
rooms:
  - name: experiments
    keywords: [data, results]

descriptions:
  'data/results/*.json': 'Experiment result JSONs from task-geometry Phase 2: centroid tracking, behavioral validation, spectral features'
  'data/checkpoints/': 'Model checkpoints saved during DPO fine-tuning runs'
  'tests/': 'Pytest test suite covering mining, MCP server, room detection, and ignore logic'
  '**/*_responses.jsonl': 'Raw model response dumps from steering/calibration experiments'
```

Mining behavior

When `scan_project` encounters a file matching a description pattern:

  1. Skip normal content chunking (no 800-char chunks of raw data)
  2. Create one drawer with the description text as content
  3. Set `source_file` metadata to the matched path (so search can surface it)
  4. Respect mtime — only re-create the drawer if the description changed

This gives the palace semantic knowledge about what's in a path without the noise of embedding raw content.
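
To make the four steps above concrete, here is a rough sketch of the per-file decision. Everything in it is hypothetical and not taken from the MemPalace codebase: the `Drawer` dataclass, the `description_drawer` helper, and the `_matches` stand-in matcher are illustrative names only. It assumes one drawer per matched file, and it assumes "description changed" can be detected by storing a hash of the description text on the drawer.

```python
import hashlib
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Dict, Optional


@dataclass
class Drawer:
    """Hypothetical drawer record; the real drawer schema may differ."""
    content: str
    source_file: str
    description_hash: str


def _matches(pattern: str, rel_path: str) -> bool:
    # Stand-in matcher, NOT full .gitignore semantics:
    # a trailing slash means "everything under this directory",
    # anything else is handed to fnmatch (whose * also crosses slashes).
    if pattern.endswith("/"):
        return rel_path.startswith(pattern)
    return fnmatchcase(rel_path, pattern)


def description_drawer(
    rel_path: str,
    descriptions: Dict[str, str],
    existing: Optional[Drawer] = None,
) -> Optional[Drawer]:
    """Return a description drawer for rel_path, or None if no pattern matches.

    None means the caller should fall back to normal content chunking.
    """
    for pattern, text in descriptions.items():
        if _matches(pattern, rel_path):
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            # Step 4: keep the existing drawer if the description is unchanged.
            if existing is not None and existing.description_hash == digest:
                return existing
            # Steps 1-3: one drawer, description as content, path as source_file.
            return Drawer(content=text, source_file=rel_path, description_hash=digest)
    return None
```

In this sketch the first matching pattern wins, so pattern order in `mempalace.yaml` would matter. Whether that is the right precedence rule is an open design question.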

Pattern matching

Use the same glob syntax as .gitignore — users already know it. Patterns are matched against the project-relative path.
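
As a rough illustration of what matching against the project-relative path could look like, the sketch below translates a simplified subset of .gitignore glob syntax into a regular expression using only the standard library. The function names `glob_to_regex` and `describe` are hypothetical. The subset covers `*`, `?`, `**`, trailing-slash directory patterns, and root anchoring for patterns that contain a slash. It deliberately leaves out negation (`!`), character classes, and escapes, so a real implementation would probably want an existing gitignore-compatible matcher instead.

```python
import re
from typing import Dict, Optional


def glob_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a simplified .gitignore-style glob into a compiled regex."""
    dir_only = pattern.endswith("/")
    body = pattern.rstrip("/")
    # A pattern with a slash in the middle is anchored to the project root;
    # a bare name such as "tests/" may match at any depth.
    anchored = "/" in body
    body = body.lstrip("/")

    parts = []
    i = 0
    while i < len(body):
        if body.startswith("**/", i):
            parts.append("(?:.*/)?")  # zero or more leading directories
            i += 3
        elif body.startswith("**", i):
            parts.append(".*")  # anything, including slashes
            i += 2
        elif body[i] == "*":
            parts.append("[^/]*")  # anything within one path segment
            i += 1
        elif body[i] == "?":
            parts.append("[^/]")  # one character within a segment
            i += 1
        else:
            parts.append(re.escape(body[i]))
            i += 1

    prefix = "" if anchored else "(?:.*/)?"
    # Directory patterns match only paths below the directory; other
    # patterns match the path itself or anything below it.
    suffix = "/.*" if dir_only else "(?:/.*)?"
    return re.compile(prefix + "".join(parts) + suffix)


def describe(rel_path: str, descriptions: Dict[str, str]) -> Optional[str]:
    """Return the description of the first pattern matching rel_path, if any."""
    for pattern, text in descriptions.items():
        if glob_to_regex(pattern).fullmatch(rel_path):
            return text
    return None
```

Paths are assumed to use forward slashes, so on Windows the relative path would need normalizing (for example with `PurePath.as_posix()`) before matching.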

Use cases

  • Any ML project with large result files, activation dumps, or model outputs
  • Web projects with asset directories, uploaded content, or build output
  • Monorepos where some subdirectories should be described, not indexed
  • Projects with test suites that are useful to reference but not embed line-by-line


Labels: area/mining, enhancement
