feat(chunking): promote per-entry blockquote tags to ChunkMetadata.tags #463
Merged
mem_add(tags=[...]) advertises tags as a first-class API parameter, but they never reached ChunkMetadata.tags: only file-level YAML frontmatter was inspected. mem_search(tag_filter=) is set membership against ChunkMetadata.tags (search/pipeline.py:365), so anything added through mem_add was silently invisible to tag_filter. Discovered during PR #462 multi-agent e2e: shared-from=<src-uuid> audit tags lived in chunk content, not metadata.

The chunker now detects a section-leading blockquote group, extracts a tags: key from it, and unions the result with file-level frontmatter tags into ChunkMetadata.tags. Both shapes the writer has emitted are recognized:

- canonical `> tags: ["a", "b"]` (every line carries `> `)
- legacy lazy continuation: `> created:` followed by a bare `tags: ["a", "b"]` line, which CommonMark glues onto the same blockquote at render time

Both single- and double-quoted list literals are parsed (Python repr() is the legacy form; JSON is the canonical form). The frontmatter parser was refactored to share its value parser with the new section-leading parser via a single _parse_tags_value() helper.

The blockquote group is stripped from chunk content so it no longer leaks into BM25 / embedding inputs. Mid-section blockquotes (a quoted paragraph in body prose) are left alone; only section-leading groups are handled. A section-leading blockquote without a tags: key is also left untouched, so the rendered "> created:" line still survives in chunk content.

Reindex backfills tags onto memories added by older mem_add(tags=) calls without rewriting them on disk.

Multi-agent e2e test_share_copies_chunk_with_audit_tag now asserts on metadata.tags rather than content for the shared-from=<src> audit trail; the body content stays in copy.content unchanged.

Reader-only PR per planning/mem-add-tags-blockquote-promote-rfc.md. The writer (json.dumps + explicit > prefix) follows in a stacked PR.

Co-Authored-By: Claude <[email protected]>
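The reader behaviour described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: the function names `parse_tags_value`, `extract_leading_blockquote_tags`, and `merge_tags` are hypothetical, the section body is assumed to arrive without its heading line, and the leading blockquote group is assumed to end at the first blank line (which is how a lazy continuation line ends up in the same group).

```python
import ast
import json
import re

# Hypothetical sketch; names and boundaries are assumptions, not the memtomem API.
# Matches "tags: ..." with or without the "> " blockquote prefix.
_TAGS_LINE = re.compile(r"^(?:>\s?)?tags:\s*(.*)$")


def parse_tags_value(value: str) -> list[str]:
    """Parse a list literal in JSON (canonical) or Python repr() (legacy) form."""
    value = value.strip()
    try:
        parsed = json.loads(value)  # double-quoted, canonical form
    except json.JSONDecodeError:
        try:
            parsed = ast.literal_eval(value)  # single-quoted repr() form
        except (ValueError, SyntaxError):
            return []
    if not isinstance(parsed, list):
        return []
    return [str(tag) for tag in parsed]


def extract_leading_blockquote_tags(section: str) -> tuple[list[str], str]:
    """Return (tags, content) for one section body.

    Only a blockquote group at the very start of the section is considered.
    If that group has no tags: key, the section is returned unchanged.
    """
    lines = section.split("\n")
    start = 0
    while start < len(lines) and not lines[start].strip():
        start += 1  # skip leading blank lines
    if start == len(lines) or not lines[start].startswith(">"):
        return [], section  # no section-leading blockquote

    end = start
    while end < len(lines) and lines[end].strip():
        end += 1  # the group runs to the first blank line

    tags = None
    for line in lines[start:end]:
        match = _TAGS_LINE.match(line.strip())
        if match:
            tags = parse_tags_value(match.group(1))
            break
    if tags is None:
        return [], section  # e.g. only "> created:"; leave content untouched

    content = "\n".join(lines[end:]).lstrip("\n")
    return tags, content


def merge_tags(frontmatter_tags: list[str], blockquote_tags: list[str]) -> list[str]:
    """Union file-level and per-entry tags, sorted for stable output."""
    return sorted(set(frontmatter_tags) | set(blockquote_tags))
```

With tags in metadata, a set-membership filter of the kind the message describes reduces to a check such as `"shared-from=<src>" in set(chunk_tags)`, which could never succeed while the tags lived only in chunk content.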
Stripping the per-entry blockquote header from chunk content changes content_hash = sha256(content) (models.py:97), so the differ treats the stripped chunk as a new chunk and assigns it a fresh uuid4(). After reindex, any external pin of chunk_id (notebooks, scripts, cross-LTM refs) misses, and shared-from=<old-uuid> audit chains reference UUIDs that no longer exist.

Caught in PR review. The chunk_links follow-up RFC will close the audit-chain gap permanently with FK + cascade; until then the release note is the only protection.

Co-Authored-By: Claude <[email protected]>
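The ID churn described above follows directly from hashing the content. The sketch below assumes a differ that keys chunks purely by content hash; `reconcile` and its dictionary shape are hypothetical and only illustrate why removing the header yields a new UUID.

```python
import hashlib
import uuid


def content_hash(content: str) -> str:
    """SHA-256 of the chunk content, as in content_hash = sha256(content)."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def reconcile(known: dict[str, str], contents: list[str]) -> dict[str, str]:
    """Map content_hash -> chunk_id for the current contents.

    Hypothetical differ: a hash already in `known` keeps its chunk_id,
    any unseen hash is treated as a new chunk and gets a fresh uuid4().
    """
    result: dict[str, str] = {}
    for content in contents:
        digest = content_hash(content)
        result[digest] = known.get(digest) or str(uuid.uuid4())
    return result


# Before this PR the header was part of the content; after it, only the body is.
before = '> created: 2024-01-01\n> tags: ["a"]\n\nBody'
after = "Body"

ids_before = reconcile({}, [before])
ids_after = reconcile(ids_before, [after])  # hash differs, so a new chunk_id is minted
```

Under these assumptions nothing in `ids_after` matches `ids_before`, which is why anything pinned to the old chunk_id stops resolving after reindex.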
Summary
- `mem_add(tags=[...])` now round-trips through `mem_search(tag_filter=...)`. Previously `mem_add` wrote tags into a per-entry blockquote header (`> created: ...` / `tags: [...]`), but the chunker only inspected file-level YAML frontmatter, so `ChunkMetadata.tags` stayed empty and `tag_filter` (set membership at `search/pipeline.py:365–366`) silently missed anything added that way.
- Reader-only PR per `planning/mem-add-tags-blockquote-promote-rfc.md`. Writer (`json.dumps` + explicit `> ` prefix) follows in a stacked PR.

What changed
- `MarkdownChunker._extract_section_blockquote_tags(text)` parses a section-leading blockquote group, extracts the `tags:` value, and strips the entire blockquote from chunk content so it no longer leaks into BM25 / embedding inputs. It recognizes both the canonical `> tags: ["a"]` form (post-RFC writer) and the legacy lazy-continuation `tags: ['a']` form (current writer).
- Extracted tags are unioned with file-level frontmatter tags into `ChunkMetadata.tags`. Mid-section blockquotes (quoted paragraphs in body prose) are left alone; only section-leading groups are handled.
- The frontmatter parser and the section-leading parser share a single `_parse_tags_value(value, trailing_lines)` helper; both call sites now go through it.
- `test_share_copies_chunk_with_audit_tag` previously asserted the `shared-from=<src>` audit tag in `copy.content`. The chunker now strips that header from content, so the test asserts on `copy.metadata.tags` instead. Body text stays in `copy.content` unchanged.

Why this is safe
- Memories written by older `mem_add` calls (Python `repr()`, lazy continuation) parse correctly. Reindex is enough; no on-disk rewrite is needed.
- A section-leading blockquote without a `tags:` key (e.g. just `> created:`) leaves the chunk content unchanged.

Test plan
- `tests/test_chunking_blockquote_tags.py` covering: canonical / lazy continuation / frontmatter+blockquote union / mid-section noop / no-tags noop / single-vs-double-quote parity
- `tests/test_multi_agent_integration.py`: `test_share_copies_chunk_with_audit_tag` re-targets `metadata.tags`
- `uv run pytest -m "not ollama"` → 2390 passed, 0 failed
- `uv run ruff check packages/memtomem/src && uv run ruff format --check ...`
- `uv run mypy packages/memtomem/src/memtomem/chunking/markdown.py`: no issues

Out of scope (follow-ups)
- Writer change (canonical `> tags: ["a"]` JSON form): stacked PR.
- `chunk_links` table for true `mem_agent_share` provenance: separate RFC.

🤖 Generated with Claude Code