feat(chunking): promote per-entry blockquote tags to ChunkMetadata.tags #463

Merged: memtomem merged 2 commits into main from feat/blockquote-tags-reader on Apr 24, 2026
@memtomem
Owner

Summary

  • mem_add(tags=[...]) now round-trips through mem_search(tag_filter=...). Previously mem_add wrote tags into a per-entry blockquote header (> created: ... / tags: [...]) but the chunker only inspected file-level YAML frontmatter, so ChunkMetadata.tags stayed empty and tag_filter (set membership at search/pipeline.py:365–366) silently missed anything added that way.
  • Reader-only PR per planning/mem-add-tags-blockquote-promote-rfc.md. Writer (json.dumps + explicit > prefix) follows in a stacked PR.
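To make the failure mode concrete, here is a minimal sketch of a set-membership tag filter. The `ChunkMetadata` stand-in and the `matches_tag_filter` helper are hypothetical, and the "any requested tag matches" semantics are an assumption; the real check lives in search/pipeline.py and may differ.

```python
from dataclasses import dataclass


@dataclass
class ChunkMetadata:
    # Hypothetical stand-in for the real class; only the tags field matters here.
    tags: tuple = ()


def matches_tag_filter(meta: ChunkMetadata, tag_filter: set) -> bool:
    # Assumed semantics: a chunk passes if it carries at least one requested tag.
    return bool(tag_filter & set(meta.tags))


# Before this PR: tags sat in the blockquote header, so metadata.tags was empty.
before = ChunkMetadata(tags=())
# After this PR: the chunker promotes the blockquote tags into metadata.
after = ChunkMetadata(tags=("project-x",))

print(matches_tag_filter(before, {"project-x"}))  # False
print(matches_tag_filter(after, {"project-x"}))   # True
```

With empty metadata the intersection is always empty, which is why the miss was silent rather than an error.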

What changed

  • New MarkdownChunker._extract_section_blockquote_tags(text) parses a section-leading blockquote group, extracts the tags: value, and, when a tags: key is present, strips the entire blockquote from chunk content so it no longer leaks into BM25 / embedding inputs. It recognizes both the canonical > tags: ["a"] form (post-RFC writer) and the legacy lazy-continuation tags: ['a'] form (current writer).
  • Frontmatter and section-leading blockquote tags compose by union into ChunkMetadata.tags. Mid-section blockquotes (quoted paragraphs in body prose) are left alone — section-leading only.
  • The frontmatter parser was refactored to share its value parser via a new _parse_tags_value(value, trailing_lines) helper; both call sites now go through it.
  • Multi-agent e2e test_share_copies_chunk_with_audit_tag previously asserted the shared-from=<src> audit tag in copy.content. The chunker now strips that header from content; the test asserts on copy.metadata.tags instead. Body text stays in copy.content unchanged.
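The behaviour listed above can be sketched roughly as follows. This is not the merged implementation: the function name `extract_section_blockquote_tags`, the regex, the use of `ast.literal_eval` to accept both JSON-style and repr()-style string lists, and the order-preserving union are all assumptions made for illustration. The sketch also assumes the input is a section body without its heading line.

```python
import ast
import re

# Hypothetical pattern: a "tags:" key followed by a bracketed list literal.
_TAGS_RE = re.compile(r"^tags:\s*(\[.*\])$")


def extract_section_blockquote_tags(text: str):
    """Return (tags, content) for one section body. Illustrative sketch only."""
    lines = text.splitlines()
    # The blockquote must be the first non-blank block of the section.
    start = 0
    while start < len(lines) and not lines[start].strip():
        start += 1
    if start == len(lines) or not lines[start].lstrip().startswith(">"):
        return [], text  # mid-section or absent blockquote: no-op

    # Consume the group up to the first blank line. Lines without a leading
    # ">" are treated as lazy continuation of the same blockquote.
    end = start
    tags = None
    while end < len(lines) and lines[end].strip():
        stripped = lines[end].strip()
        if stripped.startswith(">"):
            stripped = stripped[1:].strip()
        match = _TAGS_RE.match(stripped)
        if match and tags is None:
            try:
                value = ast.literal_eval(match.group(1))
            except (ValueError, SyntaxError):
                value = None
            if isinstance(value, list) and all(isinstance(t, str) for t in value):
                tags = value
        end += 1

    if tags is None:
        return [], text  # no tags: key, so content is left unchanged
    return tags, "\n".join(lines[end:]).lstrip("\n")


canonical = '> created: 2026-04-24\n> tags: ["a", "b"]\n\nBody text.'
legacy = "> created: 2026-04-24\ntags: ['a', 'b']\n\nBody text."
mid_section = 'Intro paragraph.\n\n> tags: ["x"]\n'
no_tags = "> created: 2026-04-24\n\nBody text."

# Union with file-level frontmatter tags, keeping first-seen order.
frontmatter_tags = ["b", "c"]
section_tags, body = extract_section_blockquote_tags(canonical)
merged = list(dict.fromkeys(frontmatter_tags + section_tags))
```

Both header shapes yield the same tags and the same stripped body, while the mid-section and no-tags inputs come back untouched.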

Why this is safe

  • Backward compatibility: existing files written by older mem_add (Python repr(), lazy continuation) parse correctly. Reindex is enough — no on-disk rewrite.
  • No-tags case: section-leading blockquote without a tags: key (e.g. just > created:) leaves the chunk content unchanged.
  • Mid-section blockquote false-positive prevented: parser bails out unless the blockquote is the first non-blank block of the section.
  • Existing files (frontmatter only, no per-entry blockquote): zero behavior change.

Test plan

  • 6 chunker unit tests in tests/test_chunking_blockquote_tags.py covering: canonical / lazy continuation / frontmatter+blockquote union / mid-section noop / no-tags noop / single-vs-double-quote parity
  • tests/test_multi_agent_integration.py test_share_copies_chunk_with_audit_tag re-targets metadata.tags
  • Full suite: uv run pytest -m "not ollama" → 2390 passed, 0 failed
  • uv run ruff check packages/memtomem/src && uv run ruff format --check ...
  • uv run mypy packages/memtomem/src/memtomem/chunking/markdown.py — no issues

Out of scope (follow-ups)

  • Writer change (canonical > tags: ["a"] JSON form) — stacked PR.
  • chunk_links table for true mem_agent_share provenance — separate RFC.

🤖 Generated with Claude Code

pandas-studio and others added 2 commits April 25, 2026 08:20
mem_add(tags=[...]) advertises tags as a first-class API parameter, but
they never reached ChunkMetadata.tags — only file-level YAML frontmatter
was inspected. mem_search(tag_filter=) is set membership against
ChunkMetadata.tags (search/pipeline.py:365), so anything added through
mem_add was silently invisible to tag_filter. Discovered during PR #462
multi-agent e2e: shared-from=<src-uuid> audit tags lived in chunk
content, not metadata.

The chunker now detects a section-leading blockquote group, extracts a
tags: key from it, and unions the result with file-level frontmatter
tags into ChunkMetadata.tags. Both shapes the writer has emitted are
recognized:

- canonical "> tags: [\"a\", \"b\"]" (every line carries `> `)
- legacy lazy continuation: "> created:" followed by a bare
  "tags: [\"a\", \"b\"]" line, which CommonMark glues onto the same
  blockquote at render time

Both single- and double-quoted list literals are parsed (Python repr()
is the legacy form; JSON is the canonical form). The frontmatter
parser was refactored to share its value parser with the new
section-leading parser via a single _parse_tags_value() helper.

The blockquote group is stripped from chunk content so it no longer
leaks into BM25 / embedding inputs. Mid-section blockquotes (a quoted
paragraph in body prose) are left alone — section-leading only. A
section-leading blockquote without a tags: key is also left untouched
so the rendered "> created:" line still survives in chunk content.

Reindex backfills tags onto memories added by older mem_add(tags=)
calls without rewriting them on disk.

Multi-agent e2e test_share_copies_chunk_with_audit_tag now asserts on
metadata.tags rather than content for the shared-from=<src> audit
trail; the body content stays in copy.content unchanged.

Reader-only PR per planning/mem-add-tags-blockquote-promote-rfc.md.
The writer (json.dumps + explicit > prefix) follows in a stacked PR.

Co-Authored-By: Claude <[email protected]>
Stripping the per-entry blockquote header from chunk content changes
content_hash = sha256(content) (models.py:97), which the differ treats
as a new chunk and assigns a fresh uuid4(). After reindex, any
external pin of chunk_id (notebooks, scripts, cross-LTM refs) misses,
and shared-from=<old-uuid> audit chains reference UUIDs that no
longer exist.

Caught in PR review. The chunk_links follow-up RFC will close the
audit-chain gap permanently with FK + cascade; until then the
release note is the only protection.
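The identity change described in this commit can be illustrated with a small sketch. Only content_hash = sha256(content) comes from the commit message; the UTF-8 encoding, the `index` dictionary, and the `chunk_id_for` helper are hypothetical simplifications of what the differ does.

```python
import hashlib
import uuid


def content_hash(content: str) -> str:
    # sha256 over the chunk content; the UTF-8 encoding is an assumption.
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def chunk_id_for(content: str, index: dict) -> str:
    # Hypothetical differ: an unseen hash is treated as a new chunk
    # and receives a fresh uuid4().
    digest = content_hash(content)
    if digest not in index:
        index[digest] = str(uuid.uuid4())
    return index[digest]


index = {}
old_content = "> created: 2026-04-24\ntags: ['a']\n\nBody text."
new_content = "Body text."  # same entry after the header is stripped

old_id = chunk_id_for(old_content, index)
new_id = chunk_id_for(new_content, index)

print(content_hash(old_content) == content_hash(new_content))  # False
print(old_id == new_id)                                        # False
```

Anything that stored `old_id` externally now points at a chunk that no longer exists after reindex, which is the gap the release note warns about.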

Co-Authored-By: Claude <[email protected]>
@memtomem memtomem merged commit 32e88b0 into main Apr 24, 2026
7 checks passed
@github-actions github-actions Bot locked and limited conversation to collaborators Apr 24, 2026
@memtomem memtomem deleted the feat/blockquote-tags-reader branch April 27, 2026 14:56