feat(indexing): parse temporal-validity frontmatter into chunk metadata#533

Merged
memtomem merged 2 commits into main from
feat/temporal-validity-frontmatter
Apr 29, 2026

@memtomem
Owner

Summary

  • First-cut of the temporal-validity RFC Goals 1+2+3: frontmatter parser, schema migration, and indexer threading.
  • Adds valid_from / valid_to parsing for YYYY-MM-DD (date-only) and YYYY-QN (quarter) formats; surfaces the parsed window on ChunkMetadata and persists it to two new chunks columns (valid_from_unix, valid_to_unix).
  • No user-visible behaviour change yet — chunks without frontmatter validity stay always-valid (RFC §Goal 4 backward-compat). The pipeline validity_filter stage, mem_search(as_of=...), the mm search --as-of CLI flag, and the Web UI badge land in follow-up PRs.

Design choices worth noting

  • Inclusive-end semantics: lower bound = unit start (00:00:00 UTC), upper bound = unit's last day end (23:59:59 UTC). Matches the user mental model in the RFC ("from Aug 15" = the whole day Aug 15 onward).
  • Liberal parser: a malformed value on one side returns None for that side without poisoning the other or aborting indexing. A typo in valid_from should not block a file from being indexed.
  • UPDATE includes the new columns: same pattern as tags (also frontmatter-derived). This means reindex propagates frontmatter validity edits to chunks whose content_hash did not change — otherwise an editor changing only the validity window would see no effect on body chunks.

RFC reference

memtomem-docs/memtomem/planning/temporal-validity-rfc.md (commit 297470a on memtomem-docs main). Goals 1+2+3 only.

Test plan

  • uv run pytest packages/memtomem/tests/test_temporal_validity.py — 24 / 24 pass.
  • Touched-module regression sweep (test_chunking, test_chunking_blockquote_tags, test_chunkers_extended, test_storage, test_storage_extended, test_storage_noop, test_sqlite_schema, test_chunk_links_schema, test_temporal_validity) — 143 / 143 pass.
  • uv run ruff check + uv run ruff format --check clean.
  • CI green.

🤖 Generated with Claude Code

pandas-studio and others added 2 commits April 29, 2026 11:45
First-cut implementation of Goals 1+2+3 of the temporal-validity RFC
(planning/temporal-validity-rfc.md, commit 297470a in memtomem-docs).

- ChunkMetadata gains valid_from_unix / valid_to_unix (int | None).
  NULL on either side means unbounded; both NULL = always-valid, the
  RFC §Goal 4 backward-compat default for existing chunks.
- Markdown chunker reads valid_from / valid_to from YAML frontmatter,
  accepting YYYY-MM-DD (date-only) and YYYY-QN (quarter) per RFC.
  Lower bound resolves to the unit's start (00:00:00 UTC), upper bound
  to the unit's last day end (23:59:59 UTC) — inclusive both ends.
- Malformed values on one side return None for that side without
  poisoning the other or aborting indexing — a single typo should not
  prevent the file from being indexed.
- chunks table gains valid_from_unix / valid_to_unix INTEGER columns,
  added via the existing idempotent ALTER TABLE pattern. INSERT,
  UPDATE, and _row_to_chunk thread the values end to end. UPDATE is
  included so reindex propagates frontmatter validity edits to chunks
  whose content_hash did not change.
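
As a rough illustration of the schema step, the sketch below adds the two columns idempotently and then updates a row whose `content_hash` is unchanged. It assumes a SQLite backend and a column check via `PRAGMA table_info`; the project's actual migration helper is not shown in this PR, so `add_validity_columns` and the three-column `chunks` table are hypothetical.

```python
import sqlite3

# Column names come from the commit message; the helper itself is hypothetical.
NEW_COLUMNS = {"valid_from_unix": "INTEGER", "valid_to_unix": "INTEGER"}


def add_validity_columns(db: sqlite3.Connection) -> None:
    """Add the validity columns only if they are missing, so reruns are no-ops."""
    existing = {row[1] for row in db.execute("PRAGMA table_info(chunks)")}
    for name, sql_type in NEW_COLUMNS.items():
        if name not in existing:
            db.execute(f"ALTER TABLE chunks ADD COLUMN {name} {sql_type}")
    db.commit()


conn = sqlite3.connect(":memory:")
# Simplified stand-in for the real chunks table.
conn.execute("CREATE TABLE chunks (id TEXT PRIMARY KEY, content TEXT, content_hash TEXT)")
conn.execute("INSERT INTO chunks VALUES ('a', 'body', 'h1')")

add_validity_columns(conn)
add_validity_columns(conn)  # second run finds both columns and changes nothing

# Existing rows read back as NULL on both sides, i.e. always-valid.
legacy = conn.execute(
    "SELECT valid_from_unix, valid_to_unix FROM chunks WHERE id = 'a'"
).fetchone()

# A reindex that only changed the frontmatter still writes the new window,
# even though content_hash stays 'h1'.
conn.execute(
    "UPDATE chunks SET valid_from_unix = ?, valid_to_unix = ? WHERE id = ?",
    (1767225600, None, "a"),
)
updated = conn.execute(
    "SELECT valid_from_unix, valid_to_unix, content_hash FROM chunks WHERE id = 'a'"
).fetchone()
```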

Pipeline filter (RFC Goal 4 — validity_filter stage), mem_search
as_of=, CLI --as-of, and the Web UI badge land in follow-up PRs.

24 new tests cover parser format edge cases, frontmatter extraction,
chunker → metadata propagation, schema column shape, and migration
idempotency. ruff check + format pass; touched-module suite (chunking,
storage, schema, chunk_links) is 143/143 green.

Co-Authored-By: Claude <[email protected]>
…arch

Review feedback on PR #533 caught that ``bm25_search`` and ``dense_search``
were SELECTing only the 13 core chunk columns and slicing ``row[:13]``
into ``_row_to_chunk``, so the new ``len(row) >= 21`` validity guard
never tripped — every search-derived chunk carried
``valid_from_unix=None`` regardless of what was stored. The same shape
also silently zeroed ``overlap_before`` / ``overlap_after`` and dropped
``importance_score`` from search results, a latent bug independent of
the temporal-validity work.

Switch both queries to ``SELECT c.*, sub.<score>`` so the full row layout
reaches ``_row_to_chunk`` and every defensive guard activates uniformly.
Slice with ``row[:-1]`` / ``row[-1]`` to stay index-agnostic if future
ALTER TABLE additions extend the chunks table further.
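
A minimal sketch of the index-agnostic slicing described above, assuming a SQLite backend. The `scores` table is a hypothetical stand-in for the BM25 or dense ranking subquery, and the four-column `chunks` table is a simplified placeholder for the real layout.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE chunks ("
    "id TEXT PRIMARY KEY, content TEXT, "
    "valid_from_unix INTEGER, valid_to_unix INTEGER)"
)
# Hypothetical stand-in for the ranking subquery (BM25 or dense scores).
conn.execute("CREATE TABLE scores (chunk_id TEXT, score REAL)")
conn.execute("INSERT INTO chunks VALUES ('a', 'hello', 1767225600, NULL)")
conn.execute("INSERT INTO scores VALUES ('a', 0.75)")


def search(db: sqlite3.Connection) -> list[tuple[tuple, float]]:
    """Return (chunk_row, score) pairs without hard-coding the column count."""
    rows = db.execute(
        "SELECT c.*, sub.score FROM chunks AS c "
        "JOIN (SELECT chunk_id, score FROM scores) AS sub ON sub.chunk_id = c.id "
        "ORDER BY sub.score DESC"
    ).fetchall()
    # The score is always the last field; everything before it is the chunk row.
    return [(row[:-1], row[-1]) for row in rows]


before = search(conn)
# A later migration appends a column; the slicing above needs no change.
conn.execute("ALTER TABLE chunks ADD COLUMN importance_score REAL")
after = search(conn)
```

Because the score is read from the end of the row, the chunk portion grows with the table while the score stays in place, which a fixed `row[:13]` slice could not guarantee.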

Three new round-trip tests in ``test_temporal_validity.py`` lock the
behaviour: BM25 + dense each preserve the validity window for a chunk
written with bounds, and BM25 returns the always-valid ``(None, None)``
pair for chunks written without bounds (so ``c.*`` is not silently
fabricating values for the backward-compat default).

ruff clean; touched-module sweep covers test_storage*, test_chunking*,
test_sqlite_schema, test_chunk_links_schema, test_dedup, test_conflict,
test_expansion, test_indexing_engine, test_context_window — 285/285 pass.

Co-Authored-By: Claude <[email protected]>
@memtomem
Owner Author

Review feedback addressed in follow-up commit 460592b.

🟡 (addressed) The bm25_search / dense_search queries were SELECTing only the 13 core chunk columns and slicing row[:13] into _row_to_chunk, so the new len(row) >= 21 validity guard never tripped — every search-derived chunk carried valid_from_unix=None regardless of what was stored. The same shape also silently zeroed overlap_before / overlap_after and dropped importance_score, a latent bug independent of this RFC.

Switched both queries to SELECT c.*, sub.<score> so the full row layout reaches _row_to_chunk and every defensive guard activates uniformly. Sliced with row[:-1] / row[-1] to stay index-agnostic if future ALTER TABLE additions extend the chunks table further.

Three new round-trip tests in test_temporal_validity.py lock the behaviour:

  • test_bm25_search_preserves_validity_window
  • test_dense_search_preserves_validity_window
  • test_bm25_search_returns_none_pair_for_unset_chunks (confirms c.* did not silently fabricate values for the always-valid backward-compat default)

Regression sweep is 285/285 across storage / chunking / schema and the search-pipeline-adjacent suite (test_context_window included, since context expansion reads overlap_before/after — the side-effect restoration is safe).

🟢 (deferred) The remaining nits:

  • Nested YAML key (e.g. metadata: { valid_from: ... }) matching — adding a leading-space check is not worth the extra branch right now. Frontmatter in this codebase is overwhelmingly flat.
  • Inline # comment unsupported — better captured as a small RFC doc note in a follow-up PR than as parser code.
  • valid_from > valid_to cross-validation — naturally lives with the pipeline filter PR (Goal 4); a reverse window is unconditionally excluded by the filter, so risk is bounded.
  • Double _FRONT_MATTER_RE.match (in _extract_frontmatter_tags + _extract_validity_window) — intentional separation kept; cost is trivial. If frontmatter fields keep growing, a single parse → dict refactor is the right consolidation moment.

@memtomem memtomem merged commit 93e1ef3 into main Apr 29, 2026
7 checks passed
@github-actions github-actions Bot locked and limited conversation to collaborators Apr 29, 2026