fix(chunking): seed _split_section line counter past stripped blockquote header by memtomem · Pull Request #466 · memtomem/memtomem

memtomem · 2026-04-25T00:26:35Z

Closes RFC planning/mem-add-tags-blockquote-promote-rfc.md §Follow-ups #2. Independent of #465 (broadcast cleanup) — different files.

Summary

When chunk_file strips a section-leading > created: / > tags: blockquote before passing text to _split_section, base_line still points at the heading line. line_offset started at 0 and counted only body lines, so sub-chunks 2..N reported start_line 3–5 lines earlier than the body they actually covered.

Practical impact: mem_edit on a non-first sub-chunk reads file_lines[start_line..end_line] via replace_chunk_body. With drift, those lines included real body content from the previous sub-chunk — which _find_body_start_index does not classify as a header (no heading, no leading blockquote at the range start), so preserved == [] and the replacement silently dropped that content on save.

What changed

_extract_section_blockquote_tags returns a third value lines_consumed (count of input lines the strip consumed). Backwards-incompatible at the helper API surface, but it's a private classmethod with a single in-tree caller.
chunk_file computes body_offset = 1 (heading) + leading_strip_lines + blockquote_strip_lines and passes it to _split_section as a kwarg. leading_strip_lines is the count of \n consumed by the existing .strip() on raw section text.
_split_section accepts body_offset (default 0) and seeds line_offset with it. Default 0 = pre-strip behavior preserved for callers that don't strip.

Sub-chunk 1's start_line still anchors at the heading (preserves mem_edit header-preservation for the first sub-chunk; current_start = base_line at init is only consulted at the first emit boundary, by which time line_offset has accumulated past the seed).

Test plan

New tests/test_chunking_blockquote_tags.py::TestOversizeSectionLineDrift:
- test_blockquote_header_does_not_drag_sub_chunk_start_lines — controlled oversize section with explicit > created: / > tags: header; sub-chunks 2..N must report start_line >= body-start-file-line. Fails pre-fix.
- test_oversize_section_without_blockquote_unchanged — no header to strip → body_offset = 0 → pre-strip behaviour preserved.
Full suite: uv run pytest -m "not ollama" → 2405 passed, 0 failed (PR feat(mem_add,mem_edit): canonical writer + mem_edit header preservation #464 baseline 2403 + 2 new).
uv run ruff check && uv run ruff format --check clean.
uv run mypy packages/memtomem/src/memtomem/chunking/markdown.py — clean.

RFC §Follow-ups status after this PR

✅ chore: prepare for open-source release #1 — mem_edit header preservation (PR feat(mem_add,mem_edit): canonical writer + mem_edit header preservation #464)
✅ refactor(stm): decouple surfacing via remote-only MCP #2 — _split_section line drift (this PR)
✅ chore(core): strip STM integration residue #3 — block-list shape test (PR feat(mem_add,mem_edit): canonical writer + mem_edit header preservation #464)
⏳ chore: drop packages/memtomem-stm from monorepo (Phase 3b) #4 — add_entries_batch broadcast cleanup (PR fix(mem_batch_add): drop post-index tag broadcast; per-entry tags isolated #465, in CI)

After this + #465 merge, the only remaining item from the blockquote-tags RFC sequence is the chunk_links table — separate RFC.

🤖 Generated with Claude Code

…ote header When chunk_file strips a section-leading ``> created:`` / ``> tags:`` blockquote before passing text to _split_section, base_line still points at the heading. line_offset in _split_section started at 0 and counted only body lines, so sub-chunks 2..N reported start_line 3-5 lines earlier than the body they actually covered. Practical impact: mem_edit on a non-first sub-chunk reads file lines [start_line..end_line] via replace_chunk_body. With drift, those lines included real body content from the previous sub-chunk — which _find_body_start_index does not detect as a header (no heading, no leading blockquote at the start of the range), so preserved == [] and the replacement silently dropped that content on save. Fix: _extract_section_blockquote_tags returns a third value, lines_consumed, counting how many input lines the strip consumed. chunk_file adds it to leading_strip_lines (lines stripped by .strip() on the raw section text) and passes the total + 1 (for the heading line itself) as a body_offset kwarg to _split_section, which seeds line_offset with body_offset. Sub-chunks 1..N now align with file lines as well as the pre-strip code did. Sub-chunk 1's start_line still anchors at the heading (line_offset is only consulted on the first emit boundary, not at init), so mem_edit's header-preservation contract for sub-chunk 1 is preserved. Single-chunk sections and sections without any blockquote header are unchanged: body_offset stays 0 in those paths. Two new tests in test_chunking_blockquote_tags.py: - test_blockquote_header_does_not_drag_sub_chunk_start_lines: a controlled oversize section with an explicit blockquote header, sub-chunks 2..N must report start_line >= body-start file line. Fails pre-fix. - test_oversize_section_without_blockquote_unchanged: pre-PR-A path is preserved when there's no header to strip. Closes RFC ``planning/mem-add-tags-blockquote-promote-rfc.md`` §Follow-ups #2. Co-Authored-By: Claude <[email protected]>

Reviewer N3: gap in coverage — `_extract_section_blockquote_tags` short-circuits to ([], text, 0) when no `tags:` key is found, so the blockquote stays in content and `blockquote_strip_lines` is 0. The existing tests covered (a) tags present + strip and (b) no blockquote at all, but not (c) blockquote present without tags. The new case verifies for an oversize section with `> created:` only: - sub-chunks inherit no tags (no promotion happened) - first sub-chunk's content still carries `> created:` (mem_edit can preserve it on body-only edits) - sub-chunks 2..N start_line still aligns with the body — no drift from the leading_strip path even when the blockquote isn't consumed by the strip helper Also tracks reviewer N1 (paragraph-gap +2 imprecision in _split_section) in the RFC follow-ups: separate from this PR; cosmetic end_line overshoot, not data loss. Co-Authored-By: Claude <[email protected]>

pandas-studio and others added 2 commits April 25, 2026 09:37

memtomem force-pushed the fix/split-section-line-drift branch from 5c6cd5c to 061531a Compare April 25, 2026 00:39

memtomem merged commit d15ce37 into main Apr 25, 2026
7 checks passed

github-actions Bot locked and limited conversation to collaborators Apr 25, 2026

memtomem deleted the fix/split-section-line-drift branch April 27, 2026 14:56

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(chunking): seed _split_section line counter past stripped blockquote header#466

fix(chunking): seed _split_section line counter past stripped blockquote header#466
memtomem merged 2 commits intomainfrom
fix/split-section-line-drift

memtomem commented Apr 25, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

memtomem commented Apr 25, 2026

Summary

What changed

Test plan

RFC §Follow-ups status after this PR

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants