Skip to content

fix(chunking): seed _split_section line counter past stripped blockquote header#466

Merged
memtomem merged 2 commits intomainfrom
fix/split-section-line-drift
Apr 25, 2026
Merged

fix(chunking): seed _split_section line counter past stripped blockquote header#466
memtomem merged 2 commits intomainfrom
fix/split-section-line-drift

Conversation

@memtomem
Copy link
Copy Markdown
Owner

Closes RFC planning/mem-add-tags-blockquote-promote-rfc.md §Follow-ups #2. Independent of #465 (broadcast cleanup) — different files.

Summary

When chunk_file strips a section-leading > created: / > tags: blockquote before passing text to _split_section, base_line still points at the heading line. line_offset started at 0 and counted only body lines, so sub-chunks 2..N reported start_line 3–5 lines earlier than the body they actually covered.

Practical impact: mem_edit on a non-first sub-chunk reads file_lines[start_line..end_line] via replace_chunk_body. With drift, those lines included real body content from the previous sub-chunk — which _find_body_start_index does not classify as a header (no heading, no leading blockquote at the range start), so preserved == [] and the replacement silently dropped that content on save.

What changed

  • _extract_section_blockquote_tags returns a third value lines_consumed (count of input lines the strip consumed). Backwards-incompatible at the helper API surface, but it's a private classmethod with a single in-tree caller.
  • chunk_file computes body_offset = 1 (heading) + leading_strip_lines + blockquote_strip_lines and passes it to _split_section as a kwarg. leading_strip_lines is the count of \n consumed by the existing .strip() on raw section text.
  • _split_section accepts body_offset (default 0) and seeds line_offset with it. Default 0 = pre-strip behavior preserved for callers that don't strip.

Sub-chunk 1's start_line still anchors at the heading (preserves mem_edit header-preservation for the first sub-chunk; current_start = base_line at init is only consulted at the first emit boundary, by which time line_offset has accumulated past the seed).

Test plan

  • New tests/test_chunking_blockquote_tags.py::TestOversizeSectionLineDrift:
    • test_blockquote_header_does_not_drag_sub_chunk_start_lines — controlled oversize section with explicit > created: / > tags: header; sub-chunks 2..N must report start_line >= body-start-file-line. Fails pre-fix.
    • test_oversize_section_without_blockquote_unchanged — no header to strip → body_offset = 0 → pre-strip behaviour preserved.
  • Full suite: uv run pytest -m "not ollama"2405 passed, 0 failed (PR feat(mem_add,mem_edit): canonical writer + mem_edit header preservation #464 baseline 2403 + 2 new).
  • uv run ruff check && uv run ruff format --check clean.
  • uv run mypy packages/memtomem/src/memtomem/chunking/markdown.py — clean.

RFC §Follow-ups status after this PR

After this + #465 merge, the only remaining item from the blockquote-tags RFC sequence is the chunk_links table — separate RFC.

🤖 Generated with Claude Code

pandas-studio and others added 2 commits April 25, 2026 09:37
…ote header

When chunk_file strips a section-leading ``> created:`` / ``> tags:``
blockquote before passing text to _split_section, base_line still
points at the heading. line_offset in _split_section started at 0 and
counted only body lines, so sub-chunks 2..N reported start_line 3-5
lines earlier than the body they actually covered.

Practical impact: mem_edit on a non-first sub-chunk reads file lines
[start_line..end_line] via replace_chunk_body. With drift, those
lines included real body content from the previous sub-chunk —
which _find_body_start_index does not detect as a header (no
heading, no leading blockquote at the start of the range), so
preserved == [] and the replacement silently dropped that content
on save.

Fix: _extract_section_blockquote_tags returns a third value,
lines_consumed, counting how many input lines the strip consumed.
chunk_file adds it to leading_strip_lines (lines stripped by .strip()
on the raw section text) and passes the total + 1 (for the heading
line itself) as a body_offset kwarg to _split_section, which seeds
line_offset with body_offset. Sub-chunks 1..N now align with file
lines as well as the pre-strip code did. Sub-chunk 1's start_line
still anchors at the heading (line_offset is only consulted on the
first emit boundary, not at init), so mem_edit's header-preservation
contract for sub-chunk 1 is preserved.

Single-chunk sections and sections without any blockquote header are
unchanged: body_offset stays 0 in those paths.

Two new tests in test_chunking_blockquote_tags.py:
- test_blockquote_header_does_not_drag_sub_chunk_start_lines: a
  controlled oversize section with an explicit blockquote header,
  sub-chunks 2..N must report start_line >= body-start file line.
  Fails pre-fix.
- test_oversize_section_without_blockquote_unchanged: pre-PR-A path
  is preserved when there's no header to strip.

Closes RFC ``planning/mem-add-tags-blockquote-promote-rfc.md``
§Follow-ups #2.

Co-Authored-By: Claude <[email protected]>
Reviewer N3: gap in coverage — `_extract_section_blockquote_tags`
short-circuits to ([], text, 0) when no `tags:` key is found, so the
blockquote stays in content and `blockquote_strip_lines` is 0. The
existing tests covered (a) tags present + strip and (b) no blockquote
at all, but not (c) blockquote present without tags.

The new case verifies for an oversize section with `> created:`
only:
  - sub-chunks inherit no tags (no promotion happened)
  - first sub-chunk's content still carries `> created:` (mem_edit
    can preserve it on body-only edits)
  - sub-chunks 2..N start_line still aligns with the body — no drift
    from the leading_strip path even when the blockquote isn't
    consumed by the strip helper

Also tracks reviewer N1 (paragraph-gap +2 imprecision in
_split_section) in the RFC follow-ups: separate from this PR; cosmetic
end_line overshoot, not data loss.

Co-Authored-By: Claude <[email protected]>
@memtomem memtomem force-pushed the fix/split-section-line-drift branch from 5c6cd5c to 061531a Compare April 25, 2026 00:39
@memtomem memtomem merged commit d15ce37 into main Apr 25, 2026
7 checks passed
@github-actions github-actions Bot locked and limited conversation to collaborators Apr 25, 2026
@memtomem memtomem deleted the fix/split-section-line-drift branch April 27, 2026 14:56
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants