fix(chunking): seed _split_section line counter past stripped blockquote header#466
Merged
fix(chunking): seed _split_section line counter past stripped blockquote header#466
Conversation
…ote header When chunk_file strips a section-leading ``> created:`` / ``> tags:`` blockquote before passing text to _split_section, base_line still points at the heading. line_offset in _split_section started at 0 and counted only body lines, so sub-chunks 2..N reported start_line 3-5 lines earlier than the body they actually covered. Practical impact: mem_edit on a non-first sub-chunk reads file lines [start_line..end_line] via replace_chunk_body. With drift, those lines included real body content from the previous sub-chunk — which _find_body_start_index does not detect as a header (no heading, no leading blockquote at the start of the range), so preserved == [] and the replacement silently dropped that content on save. Fix: _extract_section_blockquote_tags returns a third value, lines_consumed, counting how many input lines the strip consumed. chunk_file adds it to leading_strip_lines (lines stripped by .strip() on the raw section text) and passes the total + 1 (for the heading line itself) as a body_offset kwarg to _split_section, which seeds line_offset with body_offset. Sub-chunks 1..N now align with file lines as well as the pre-strip code did. Sub-chunk 1's start_line still anchors at the heading (line_offset is only consulted on the first emit boundary, not at init), so mem_edit's header-preservation contract for sub-chunk 1 is preserved. Single-chunk sections and sections without any blockquote header are unchanged: body_offset stays 0 in those paths. Two new tests in test_chunking_blockquote_tags.py: - test_blockquote_header_does_not_drag_sub_chunk_start_lines: a controlled oversize section with an explicit blockquote header, sub-chunks 2..N must report start_line >= body-start file line. Fails pre-fix. - test_oversize_section_without_blockquote_unchanged: pre-PR-A path is preserved when there's no header to strip. Closes RFC ``planning/mem-add-tags-blockquote-promote-rfc.md`` §Follow-ups #2. Co-Authored-By: Claude <[email protected]>
Reviewer N3: gap in coverage — `_extract_section_blockquote_tags`
short-circuits to ([], text, 0) when no `tags:` key is found, so the
blockquote stays in content and `blockquote_strip_lines` is 0. The
existing tests covered (a) tags present + strip and (b) no blockquote
at all, but not (c) blockquote present without tags.
The new case verifies for an oversize section with `> created:`
only:
- sub-chunks inherit no tags (no promotion happened)
- first sub-chunk's content still carries `> created:` (mem_edit
can preserve it on body-only edits)
- sub-chunks 2..N start_line still aligns with the body — no drift
from the leading_strip path even when the blockquote isn't
consumed by the strip helper
Also tracks reviewer N1 (paragraph-gap +2 imprecision in
_split_section) in the RFC follow-ups: separate from this PR; cosmetic
end_line overshoot, not data loss.
Co-Authored-By: Claude <[email protected]>
5c6cd5c to
061531a
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to subscribe to this conversation on GitHub.
Already have an account?
Sign in.
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes RFC
planning/mem-add-tags-blockquote-promote-rfc.md§Follow-ups #2. Independent of #465 (broadcast cleanup) — different files.Summary
When
chunk_filestrips a section-leading> created:/> tags:blockquote before passing text to_split_section,base_linestill points at the heading line.line_offsetstarted at 0 and counted only body lines, so sub-chunks 2..N reportedstart_line3–5 lines earlier than the body they actually covered.Practical impact:
mem_editon a non-first sub-chunk readsfile_lines[start_line..end_line]viareplace_chunk_body. With drift, those lines included real body content from the previous sub-chunk — which_find_body_start_indexdoes not classify as a header (no heading, no leading blockquote at the range start), sopreserved == []and the replacement silently dropped that content on save.What changed
_extract_section_blockquote_tagsreturns a third valuelines_consumed(count of input lines the strip consumed). Backwards-incompatible at the helper API surface, but it's a private classmethod with a single in-tree caller.chunk_filecomputesbody_offset = 1 (heading) + leading_strip_lines + blockquote_strip_linesand passes it to_split_sectionas a kwarg.leading_strip_linesis the count of\nconsumed by the existing.strip()on raw section text._split_sectionacceptsbody_offset(default 0) and seedsline_offsetwith it. Default 0 = pre-strip behavior preserved for callers that don't strip.Sub-chunk 1's
start_linestill anchors at the heading (preservesmem_editheader-preservation for the first sub-chunk;current_start = base_lineat init is only consulted at the first emit boundary, by which timeline_offsethas accumulated past the seed).Test plan
tests/test_chunking_blockquote_tags.py::TestOversizeSectionLineDrift:test_blockquote_header_does_not_drag_sub_chunk_start_lines— controlled oversize section with explicit> created:/> tags:header; sub-chunks 2..N must reportstart_line >= body-start-file-line. Fails pre-fix.test_oversize_section_without_blockquote_unchanged— no header to strip →body_offset = 0→ pre-strip behaviour preserved.uv run pytest -m "not ollama"→ 2405 passed, 0 failed (PR feat(mem_add,mem_edit): canonical writer + mem_edit header preservation #464 baseline 2403 + 2 new).uv run ruff check && uv run ruff format --checkclean.uv run mypy packages/memtomem/src/memtomem/chunking/markdown.py— clean.RFC §Follow-ups status after this PR
mem_editheader preservation (PR feat(mem_add,mem_edit): canonical writer + mem_edit header preservation #464)_split_sectionline drift (this PR)add_entries_batchbroadcast cleanup (PR fix(mem_batch_add): drop post-index tag broadcast; per-entry tags isolated #465, in CI)After this + #465 merge, the only remaining item from the blockquote-tags RFC sequence is the chunk_links table — separate RFC.
🤖 Generated with Claude Code