docs(config): document indexing edge cases (exclude_patterns, cloud-sync watcher, MMR, namespace) #251

Merged
memtomem merged 1 commit into main from docs/indexing-edge-cases on Apr 18, 2026

@memtomem
Owner

Summary

Five documentation gaps surfaced from a real-world setup walkthrough in which a user's index silently accumulated subagent metadata, cloud-sync files, and OAuth credentials. None of these pitfalls was obvious from the existing docs.

  • `exclude_patterns` had no dedicated section. Added env var row, `config.d/` fragment example, and the two caveats users hit: (a) not retroactive (`mem_do(action="delete", source_file=...)` is required to prune existing entries; `mem_index force=true` does not), and (b) matched against memory_dir-relative paths, so built-in patterns assuming `**/.claude/...` miss when `~/.claude/projects` itself is the auto-discovered root.
  • Auto-discovered roots note now warns about Claude subagent metadata (`*.meta.json` files under `<UUID>/subagents/`) and `~/.gemini` browser-profile JSON that the built-in denylist does not cover.
  • The cloud-sync watcher caveat in `google-drive.md` was vague ("the file watcher may miss changes"). Strengthened to spell out that cloud-sync mounts (Drive Stream, OneDrive Files-On-Demand ON, iCloud Optimize Storage) generally don't emit fs watcher events at all on macOS/Linux — so manual `mem_index` is required.
  • MMR section now explains when to enable it (overview/detail overlap, e.g. `MEMORY.md` + `feedback_*.md`) and links to `mem_dedup_scan` / `mem_dedup_merge` for accumulated overlap.
  • Namespace section explains the `enable_auto_ns` immediate-parent-folder limitation and shows the `mem_index namespace="<prefix>:<scope>"` pattern (e.g. `gdrive:team`, `claude:memory`) for richer source/tool encoding that groups well in the Web UI Sources view.
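
To make caveat (b) of the `exclude_patterns` item concrete, here is a minimal sketch of why a memory_dir-relative path loses the `.claude/` component. The example paths and the `rel_key` helper are illustrative assumptions, not code from the project.

```python
from pathlib import PurePosixPath


def rel_key(path: str, memory_dir: str) -> str:
    """Return the path as seen relative to its memory_dir root (hypothetical helper)."""
    return PurePosixPath(path).relative_to(memory_dir).as_posix()


# Hypothetical layout: ~/.claude/projects is itself the auto-discovered root.
memory_dir = "/home/user/.claude/projects"
meta_file = "/home/user/.claude/projects/1234/subagents/agent.meta.json"

rel = rel_key(meta_file, memory_dir)
print(rel)  # 1234/subagents/agent.meta.json

# The ".claude" component belongs to the root, not to the relative path, so a
# pattern that requires a ".claude/" segment has nothing to match against.
print(".claude" in PurePosixPath(rel).parts)  # False
```

Any pattern written as `**/.claude/...` therefore needs either an absolute path to match against or a variant that does not depend on the `.claude/` segment.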

Out of scope (separate PRs)

  • The built-in exclude pattern bug itself (`_BUILTIN_EXCLUDE_SPEC` matches against memory_dir-relative paths, so `**/oauth_creds.json` and `**/.claude/**/*.meta.json` miss matches under their respective auto-discovered roots). This PR documents the workaround; the actual fix in `engine.py` will be a separate code PR.
  • Rule-based namespace policy (`NamespacePolicyRule` mapping path globs to `{prefix}:{scope}` namespaces). Documented as a manual `mem_index namespace=...` pattern for now.
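
The difference between an auto-derived namespace and the manual `<prefix>:<scope>` pattern can be sketched as follows. This assumes that `enable_auto_ns` uses only the name of the file's immediate parent folder, as the limitation above describes; `auto_namespace`, `explicit_namespace`, and the example paths are hypothetical names for illustration, not the project's API.

```python
from pathlib import PurePosixPath


def auto_namespace(path: str) -> str:
    """Assumed auto behaviour: only the immediate parent folder name is kept."""
    return PurePosixPath(path).parent.name


def explicit_namespace(prefix: str, scope: str) -> str:
    """Manual pattern: encode source/tool as the prefix and the topic as the scope."""
    return f"{prefix}:{scope}"


# Hypothetical files from two different sources.
drive_file = "/home/user/gdrive/team/notes/plan.md"
claude_file = "/home/user/.claude/projects/1234/memory/MEMORY.md"

# The auto form drops everything above the parent folder, including the source.
print(auto_namespace(drive_file))   # notes
print(auto_namespace(claude_file))  # memory

# The explicit form keeps the source, and the part before the colon can group entries.
namespaces = [explicit_namespace("gdrive", "team"), explicit_namespace("claude", "memory")]
print([ns.split(":", 1)[0] for ns in namespaces])  # ['gdrive', 'claude']
```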

Test plan

  • `docs/guides/configuration.md` renders cleanly on GitHub (table + jsonc fence + nested blockquote with bullets)
  • Anchor link `exclude_patterns` resolves
  • `docs/guides/google-drive.md` watcher note still flows after the rewrite
  • Verified all referenced symbols exist in source: `indexing.exclude_patterns` (config.py:154), `mem_dedup_scan`/`mem_dedup_merge` (server/tools/dedup_decay.py:17,62), `MEMTOMEM_INDEXING__EXCLUDE_PATTERNS` env var convention

…ync watcher, MMR, namespace)

Five gaps surfaced from a real-world setup walkthrough:

1. `exclude_patterns` had no dedicated section — added env var row,
   config.d fragment example, and the two caveats users hit:
   not-retroactive (must `mem_do delete` to prune existing entries) and
   matched against memory_dir-relative paths (built-in patterns assuming
   `**/.claude/...` miss when `~/.claude/projects` itself is the auto-
   discovered root).

2. Auto-discovered roots note now warns about Claude subagent metadata
   and `~/.gemini` browser-profile noise that the built-in denylist does
   not cover.

3. Cloud-sync watcher caveat (already in google-drive.md as a one-liner)
   was vague — strengthened to make explicit that cloud-sync mounts
   generally do not emit fs watcher events at all on macOS/Linux, so
   manual `mem_index` is required.

4. MMR section now explains *when* to enable it (overview/detail
   overlap, e.g. `MEMORY.md` + `feedback_*.md`) and points to
   `mem_dedup_scan` for accumulated overlap.

5. Namespace section explains the `enable_auto_ns` immediate-parent
   limitation and shows the `mem_index namespace="<prefix>:<scope>"`
   pattern for richer source/tool encoding (groups in Web UI by colon).
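
The overlap case in item 4 can be illustrated with a generic maximal marginal relevance (MMR) sketch. This is the textbook greedy formulation, not memtomem's implementation; the `mmr` function, the `lam` parameter, and all scores are made up for illustration.

```python
def mmr(relevance, similarity, k, lam=0.5):
    """Greedy MMR: pick k items, trading query relevance against redundancy."""
    selected = []
    candidates = list(range(len(relevance)))
    while candidates and len(selected) < k:

        def score(i):
            # Redundancy is the highest similarity to anything already selected.
            redundancy = max((similarity[i][j] for j in selected), default=0.0)
            return lam * relevance[i] - (1 - lam) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Chunk 0: overview (think MEMORY.md); chunk 1: a near-duplicate detail
# (think feedback_*.md); chunk 2: a less relevant but distinct chunk.
relevance = [0.9, 0.85, 0.6]
similarity = [
    [1.0, 0.95, 0.1],
    [0.95, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]

print(mmr(relevance, similarity, k=2, lam=0.5))  # [0, 2]
print(mmr(relevance, similarity, k=2, lam=1.0))  # [0, 1]
```

With `lam=1.0` the ranking is pure relevance and the near-duplicate takes the second slot; with `lam=0.5` the distinct chunk wins it instead.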
memtomem pushed a commit that referenced this pull request Apr 18, 2026
…nt and against absolute paths

The built-in exclude denylist had two related gaps that let credentials
and noise into the index in real-world setups:

1. ``IndexEngine.index_file`` (the file watcher's reindex entry point)
   bypassed all exclude checks. ``_discover_files`` was the only guard,
   so a watchdog event for ``~/.gemini/oauth_creds.json`` would happily
   index the credential.

2. Patterns were matched only against the memory_dir-relative path. When
   ``~/.claude/projects`` itself is the auto-discovered memory_dir root,
   the rel path drops the ``.claude/`` token, so
   ``**/.claude/**/*.meta.json`` never matched the subagent metadata that
   Claude Code drops under ``<UUID>/subagents/``.

Centralize the policy into two helpers — ``_exclude_match_keys`` (builds
the candidate keys: absolute path + rel-to-each-memory_dir) and
``_path_is_excluded`` — and call them from both ``_discover_files`` and
``index_file``. ``**/subagents/*.meta.json`` is added as a built-in noise
pattern so the rel-path form also matches when ``.claude/`` is missing.
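
A minimal sketch of the two-helper idea, under stated assumptions: the function names mirror the ones above without the leading underscore, but the bodies are illustrative and not the code in ``engine.py``. The standard-library ``fnmatchcase`` stands in for whatever matcher the project uses, and its ``*`` crosses ``/`` boundaries, so its semantics differ from gitignore-style globs. All paths are hypothetical.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def exclude_match_keys(path, memory_dirs):
    """Candidate keys: the absolute path plus the path relative to each root that contains it."""
    p = PurePosixPath(path)
    keys = [p.as_posix()]
    for root in memory_dirs:
        try:
            keys.append(p.relative_to(root).as_posix())
        except ValueError:
            continue  # path is not under this root
    return keys


def path_is_excluded(path, memory_dirs, patterns):
    """True if any candidate key matches any exclude pattern."""
    return any(
        fnmatchcase(key, pattern)
        for key in exclude_match_keys(path, memory_dirs)
        for pattern in patterns
    )


root = "/home/user/.claude/projects"
meta = "/home/user/.claude/projects/1234/subagents/agent.meta.json"
rel = "1234/subagents/agent.meta.json"

# The rel-path key alone misses the .claude pattern; the absolute key catches it.
print(fnmatchcase(rel, "**/.claude/**/*.meta.json"))                 # False
print(path_is_excluded(meta, [root], ["**/.claude/**/*.meta.json"]))  # True

# The added noise pattern matches the rel-path form directly.
print(fnmatchcase(rel, "**/subagents/*.meta.json"))                  # True
```

Calling the same predicate from both the discovery walk and the single-file entry point is what closes the watcher bypass described in gap 1.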

Three regression tests cover the two scenarios that previously slipped
through and the new entry-point guard.

Companion to PR #251 (docs) which documented the workaround.
@memtomem memtomem merged commit 0f63fee into main Apr 18, 2026
7 checks passed
@github-actions github-actions Bot locked and limited conversation to collaborators Apr 18, 2026
@memtomem memtomem deleted the docs/indexing-edge-cases branch April 20, 2026 14:56