feat(chunking): promote per-entry blockquote tags to ChunkMetadata.tags by memtomem · Pull Request #463 · memtomem/memtomem

memtomem · 2026-04-24T23:20:37Z

Summary

mem_add(tags=[...]) now round-trips through mem_search(tag_filter=...). Previously mem_add wrote tags into a per-entry blockquote header (> created: ... / tags: [...]) but the chunker only inspected file-level YAML frontmatter, so ChunkMetadata.tags stayed empty and tag_filter (set membership at search/pipeline.py:365–366) silently missed anything added that way.
Reader-only PR per planning/mem-add-tags-blockquote-promote-rfc.md. Writer (json.dumps + explicit > prefix) follows in a stacked PR.
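The failure mode described above can be sketched in a few lines. This is an illustrative model only: `Chunk` and `filter_by_tags` are hypothetical names, and the any-match semantics is an assumption, since the PR only says the filter is set membership against ChunkMetadata.tags.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    # Hypothetical stand-in for a chunk plus its ChunkMetadata.tags.
    content: str
    tags: set[str] = field(default_factory=set)


def filter_by_tags(chunks: list[Chunk], tag_filter: list[str]) -> list[Chunk]:
    """Keep chunks whose metadata tags intersect tag_filter (assumed any-match)."""
    wanted = set(tag_filter)
    return [c for c in chunks if wanted & c.tags]


# Before this PR: the tags sit in the content, metadata tags stay empty.
before = Chunk(content='> created: 2026-04-24\n> tags: ["a"]\n\nbody text')
# After this PR: the header is stripped and the tags are promoted.
after = Chunk(content="body text", tags={"a"})

missed = filter_by_tags([before], ["a"])  # empty: the tag is invisible
found = filter_by_tags([after], ["a"])    # contains the promoted chunk
```

The filter never looks at `content`, so a tag that only appears in the blockquote header cannot match regardless of how it is spelled.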

What changed

New MarkdownChunker._extract_section_blockquote_tags(text) parses a section-leading blockquote group, extracts the tags: value, and strips the entire blockquote from chunk content so it no longer leaks into BM25 / embedding inputs. Recognizes the canonical > tags: ["a"] form (post-RFC writer) AND the legacy lazy-continuation tags: ['a'] form (current writer).
Frontmatter and section-leading blockquote tags compose by union into ChunkMetadata.tags. Mid-section blockquotes (quoted paragraphs in body prose) are left alone — section-leading only.
The frontmatter parser was refactored to share its value parser via a new _parse_tags_value(value, trailing_lines) helper; both call sites now go through it.
Multi-agent e2e test_share_copies_chunk_with_audit_tag previously asserted the shared-from=<src> audit tag in copy.content. The chunker now strips that header from content; the test asserts on copy.metadata.tags instead. Body text stays in copy.content unchanged.
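A rough sketch of the reader behavior listed above, under stated assumptions: the function below is a hypothetical standalone stand-in, not the actual MarkdownChunker._extract_section_blockquote_tags, and it assumes the input is the section body (heading already removed) with a blank line between the header and the body text.

```python
import ast


def extract_leading_blockquote_tags(text: str) -> tuple[list[str], str]:
    """Return (tags, content). Hypothetical sketch, not the real method."""
    lines = text.splitlines()
    # The blockquote must be the first non-blank block of the section.
    start = 0
    while start < len(lines) and not lines[start].strip():
        start += 1
    if start == len(lines) or not lines[start].lstrip().startswith(">"):
        return [], text  # mid-section blockquotes are left alone
    # Consume "> ..." lines plus a bare "tags:" lazy-continuation line.
    end = start
    while end < len(lines) and lines[end].lstrip().startswith((">", "tags:")):
        end += 1
    tags = None
    for raw in lines[start:end]:
        line = raw.lstrip().removeprefix(">").strip()
        if not line.startswith("tags:"):
            continue
        try:
            # literal_eval accepts both ['a'] and ["a"] list literals.
            parsed = ast.literal_eval(line[len("tags:"):].strip())
        except (ValueError, SyntaxError):
            continue
        if isinstance(parsed, list):
            tags = [str(t) for t in parsed]
    if tags is None:
        return [], text  # no tags: key, so content is unchanged
    return tags, "\n".join(lines[end:]).lstrip("\n")


canonical = '> created: 2026-04-24\n> tags: ["a", "b"]\n\nbody text'
legacy = "> created: 2026-04-24\ntags: ['a', 'b']\n\nbody text"
frontmatter_tags = ["project"]  # assumed to come from file-level YAML

tags, content = extract_leading_blockquote_tags(legacy)
merged = sorted(set(frontmatter_tags) | set(tags))  # union of both sources
```

Recognizing only a bare `tags:` line as lazy continuation is a deliberately conservative choice in this sketch: it avoids swallowing body text that directly follows the header without a blank line.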

Why this is safe

Backward compatibility: existing files written by older mem_add (Python repr(), lazy continuation) parse correctly. Reindex is enough — no on-disk rewrite.
No-tags case: section-leading blockquote without a tags: key (e.g. just > created:) leaves the chunk content unchanged.
Mid-section blockquote false-positive prevented: parser bails out unless the blockquote is the first non-blank block of the section.
Existing files (frontmatter only, no per-entry blockquote): zero behavior change.
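The backward-compatibility claim rests on both writer forms decoding to the same list. A minimal sketch of that parity, assuming a JSON-first parse with a Python-literal fallback (`parse_tag_list` is a hypothetical helper, not the PR's _parse_tags_value):

```python
import ast
import json


def parse_tag_list(value: str) -> list[str]:
    """Parse a tags list written either as JSON or as Python repr()."""
    try:
        parsed = json.loads(value)  # canonical form: json.dumps output
    except json.JSONDecodeError:
        try:
            parsed = ast.literal_eval(value)  # legacy form: repr() output
        except (ValueError, SyntaxError):
            return []
    if not isinstance(parsed, list):
        return []
    return [str(t) for t in parsed]


tags = ["alpha", "shared-from=1234"]
legacy_form = repr(tags)           # single-quoted list literal
canonical_form = json.dumps(tags)  # double-quoted JSON array

same = parse_tag_list(legacy_form) == parse_tag_list(canonical_form) == tags
```

Trying JSON first keeps JSON string escapes decoded by a JSON parser; the literal fallback only runs for the older single-quoted form.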

Test plan

6 chunker unit tests in tests/test_chunking_blockquote_tags.py covering: canonical / lazy continuation / frontmatter+blockquote union / mid-section noop / no-tags noop / single-vs-double-quote parity
tests/test_multi_agent_integration.py test_share_copies_chunk_with_audit_tag re-targets metadata.tags
Full suite: uv run pytest -m "not ollama" → 2390 passed, 0 failed
uv run ruff check packages/memtomem/src && uv run ruff format --check ...
uv run mypy packages/memtomem/src/memtomem/chunking/markdown.py — no issues

Out of scope (follow-ups)

Writer change (canonical > tags: ["a"] JSON form) — stacked PR.
chunk_links table for true mem_agent_share provenance — separate RFC.

🤖 Generated with Claude Code

mem_add(tags=[...]) advertises tags as a first-class API parameter, but they never reached ChunkMetadata.tags: only file-level YAML frontmatter was inspected. mem_search(tag_filter=) is set membership against ChunkMetadata.tags (search/pipeline.py:365), so anything added through mem_add was silently invisible to tag_filter. Discovered during PR #462 multi-agent e2e: shared-from=<src-uuid> audit tags lived in chunk content, not metadata.

The chunker now detects a section-leading blockquote group, extracts a tags: key from it, and unions the result with file-level frontmatter tags into ChunkMetadata.tags. Both shapes the writer has emitted are recognized:

- canonical "> tags: [\"a\", \"b\"]" (every line carries `> `)
- legacy lazy continuation: "> created:" followed by a bare "tags: [\"a\", \"b\"]" line, which CommonMark glues onto the same blockquote at render time

Both single- and double-quoted list literals are parsed (Python repr() is the legacy form; JSON is the canonical form). The frontmatter parser was refactored to share its value parser with the new section-leading parser via a single _parse_tags_value() helper.

The blockquote group is stripped from chunk content so it no longer leaks into BM25 / embedding inputs. Mid-section blockquotes (a quoted paragraph in body prose) are left alone; only section-leading blockquotes are parsed. A section-leading blockquote without a tags: key is also left untouched so the rendered "> created:" line still survives in chunk content.

Reindex backfills tags onto memories added by older mem_add(tags=) calls without rewriting them on disk.

Multi-agent e2e test_share_copies_chunk_with_audit_tag now asserts on metadata.tags rather than content for the shared-from=<src> audit trail; the body content stays in copy.content unchanged.

Reader-only PR per planning/mem-add-tags-blockquote-promote-rfc.md. The writer (json.dumps + explicit > prefix) follows in a stacked PR.

Co-Authored-By: Claude <[email protected]>

Stripping the per-entry blockquote header from chunk content changes content_hash = sha256(content) (models.py:97), which the differ treats as a new chunk and assigns a fresh uuid4(). After reindex, any external pin of chunk_id (notebooks, scripts, cross-LTM refs) misses, and shared-from=<old-uuid> audit chains reference UUIDs that no longer exist. Caught in PR review.

The chunk_links follow-up RFC will close the audit-chain gap permanently with FK + cascade; until then the release note is the only protection.

Co-Authored-By: Claude <[email protected]>
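The ID churn described in this commit message can be reproduced with a small model. This is a sketch under assumptions: `content_hash` mirrors the sha256(content) formula quoted above, while `assign_ids` is a hypothetical simplification of the differ that reuses an ID only when the hash is unchanged.

```python
import hashlib
import uuid


def content_hash(content: str) -> str:
    # Mirrors content_hash = sha256(content) as quoted from models.py.
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def assign_ids(known: dict[str, str], contents: list[str]) -> dict[str, str]:
    """Map content hash to chunk ID; unseen hashes get a fresh uuid4 (assumed)."""
    result = {}
    for content in contents:
        digest = content_hash(content)
        result[digest] = known.get(digest) or str(uuid.uuid4())
    return result


with_header = '> created: 2026-04-24\n> tags: ["a"]\n\nbody text'
stripped = "body text"

first_index = assign_ids({}, [with_header])       # index built before this PR
reindexed = assign_ids(first_index, [stripped])   # reindex after this PR

old_id = next(iter(first_index.values()))
new_id = next(iter(reindexed.values()))
# old_id != new_id: anything pinned to old_id no longer resolves.
```

Under this model, reindexing content whose hash did not change keeps its ID, so only chunks that carried a tagged blockquote header are affected.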

pandas-studio and others added 2 commits April 25, 2026 08:20

memtomem merged commit 32e88b0 into main Apr 24, 2026
7 checks passed

github-actions Bot locked and limited conversation to collaborators Apr 24, 2026

memtomem deleted the feat/blockquote-tags-reader branch April 27, 2026 14:56

memtomem merged 2 commits into main from feat/blockquote-tags-reader
