feat(indexing): parse temporal-validity frontmatter into chunk metadata by memtomem · Pull Request #533 · memtomem/memtomem

memtomem · 2026-04-29T02:47:31Z

Summary

First cut of Goals 1+2+3 of the temporal-validity RFC: frontmatter parser, schema migration, and indexer threading.
Adds valid_from / valid_to parsing for the YYYY-MM-DD (date-only) and YYYY-QN (quarter) formats, surfaces the parsed window on ChunkMetadata, and persists it to two new columns on the chunks table (valid_from_unix, valid_to_unix).
No user-visible behaviour change yet — chunks without frontmatter validity stay always-valid (RFC §Goal 4 backward-compat). The pipeline validity_filter stage, mem_search(as_of=...), the mm search --as-of CLI flag, and the Web UI badge land in follow-up PRs.

Design choices worth noting

Inclusive-end semantics: the lower bound resolves to the start of the unit (00:00:00 UTC on its first day) and the upper bound to the end of the unit's last day (23:59:59 UTC). This matches the user mental model in the RFC ("from Aug 15" = the whole day Aug 15 onward).
Liberal parser: a malformed value on one side returns None for that side without poisoning the other or aborting indexing. A typo in valid_from should not block a file from being indexed.
UPDATE includes the new columns: same pattern as tags (also frontmatter-derived). This means reindex propagates frontmatter validity edits to chunks whose content_hash did not change — otherwise an editor changing only the validity window would see no effect on body chunks.
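
As an illustration of the inclusive-end and liberal-parser choices above, here is a minimal sketch of how one bound could be resolved. The names parse_validity_bound and parse_validity_window are hypothetical and are not taken from the PR's code; only the two formats and the UTC boundary rules come from the description.

```python
from __future__ import annotations

import calendar
import re
from datetime import datetime, timezone

_DATE_RE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")
_QUARTER_RE = re.compile(r"(\d{4})-Q([1-4])")


def parse_validity_bound(value: str, *, upper: bool) -> int | None:
    """Resolve one side of the window to a Unix timestamp, or None if malformed.

    Hypothetical helper: the real parser in the PR may be structured differently.
    """
    text = value.strip()
    if m := _DATE_RE.fullmatch(text):
        year, month, day = (int(g) for g in m.groups())
    elif m := _QUARTER_RE.fullmatch(text):
        year, quarter = int(m.group(1)), int(m.group(2))
        # Quarter N spans months 3N-2 through 3N.
        month = quarter * 3 if upper else quarter * 3 - 2
        day = calendar.monthrange(year, month)[1] if upper else 1
    else:
        return None
    try:
        if upper:
            # Upper bound: last second of the unit's last day.
            moment = datetime(year, month, day, 23, 59, 59, tzinfo=timezone.utc)
        else:
            # Lower bound: midnight at the start of the unit.
            moment = datetime(year, month, day, tzinfo=timezone.utc)
    except ValueError:
        # Impossible calendar dates such as 2026-02-30 are treated as malformed.
        return None
    return int(moment.timestamp())


def parse_validity_window(
    valid_from: str | None, valid_to: str | None
) -> tuple[int | None, int | None]:
    """Parse each side independently so a typo on one side cannot poison the other."""
    lower = parse_validity_bound(valid_from, upper=False) if valid_from else None
    upper = parse_validity_bound(valid_to, upper=True) if valid_to else None
    return lower, upper
```

With this shape, a file carrying a malformed valid_from and valid_to: 2026-Q2 still indexes, with an unbounded lower side and an upper bound at the end of June 30.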

RFC reference

memtomem-docs/memtomem/planning/temporal-validity-rfc.md (commit 297470a on memtomem-docs main). Goals 1+2+3 only.

Test plan

uv run pytest packages/memtomem/tests/test_temporal_validity.py — 24 / 24 pass.
Touched-module regression sweep (test_chunking, test_chunking_blockquote_tags, test_chunkers_extended, test_storage, test_storage_extended, test_storage_noop, test_sqlite_schema, test_chunk_links_schema, test_temporal_validity) — 143 / 143 pass.
uv run ruff check + uv run ruff format --check clean.
CI green.

🤖 Generated with Claude Code

First-cut implementation of Goals 1+2+3 of the temporal-validity RFC (planning/temporal-validity-rfc.md, commit 297470a in memtomem-docs).

- ChunkMetadata gains valid_from_unix / valid_to_unix (int | None). NULL on either side means unbounded; both NULL = always-valid, the RFC §Goal 4 backward-compat default for existing chunks.
- Markdown chunker reads valid_from / valid_to from YAML frontmatter, accepting YYYY-MM-DD (date-only) and YYYY-QN (quarter) per RFC. Lower bound resolves to the unit's start (00:00:00 UTC), upper bound to the unit's last day end (23:59:59 UTC): inclusive both ends.
- Malformed values on one side return None for that side without poisoning the other or aborting indexing; a single typo should not prevent the file from being indexed.
- chunks table gains valid_from_unix / valid_to_unix INTEGER columns, added via the existing idempotent ALTER TABLE pattern. INSERT, UPDATE, and _row_to_chunk thread the values end to end. UPDATE is included so reindex propagates frontmatter validity edits to chunks whose content_hash did not change.

Pipeline filter (RFC Goal 4, validity_filter stage), mem_search as_of=, CLI --as-of, and the Web UI badge land in follow-up PRs.

24 new tests cover parser format edge cases, frontmatter extraction, chunker → metadata propagation, schema column shape, and migration idempotency. ruff check + format pass; touched-module suite (chunking, storage, schema, chunk_links) is 143/143 green.

Co-Authored-By: Claude <[email protected]>
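
For readers unfamiliar with idempotent column migrations in SQLite, the following is a minimal sketch of one common way to do it. It assumes a column check via PRAGMA table_info; the repository's existing pattern may differ (for example, catching the duplicate-column error instead), and ensure_validity_columns is a hypothetical name.

```python
import sqlite3

# Column names and types as described in the commit message.
NEW_COLUMNS = {"valid_from_unix": "INTEGER", "valid_to_unix": "INTEGER"}


def ensure_validity_columns(conn: sqlite3.Connection) -> list[str]:
    """Add the validity columns to chunks if missing; return the names added.

    Safe to call on every startup: a second call finds the columns and adds nothing.
    """
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
    existing = {row[1] for row in conn.execute("PRAGMA table_info(chunks)")}
    added = []
    for name, sql_type in NEW_COLUMNS.items():
        if name not in existing:
            conn.execute(f"ALTER TABLE chunks ADD COLUMN {name} {sql_type}")
            added.append(name)
    conn.commit()
    return added
```

Rows that existed before the migration read back NULL in both columns, which is the always-valid default described above.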

…arch

Review feedback on PR #533 caught that ``bm25_search`` and ``dense_search`` were SELECTing only the 13 core chunk columns and slicing ``row[:13]`` into ``_row_to_chunk``, so the new ``len(row) >= 21`` validity guard never tripped: every search-derived chunk carried ``valid_from_unix=None`` regardless of what was stored. The same shape also silently zeroed ``overlap_before`` / ``overlap_after`` and dropped ``importance_score`` from search results, a latent bug independent of the temporal-validity work.

Switch both queries to ``SELECT c.*, sub.<score>`` so the full row layout reaches ``_row_to_chunk`` and every defensive guard activates uniformly. Slice with ``row[:-1]`` / ``row[-1]`` to stay index-agnostic if future ALTER TABLE additions extend the chunks table further.

Three new round-trip tests in ``test_temporal_validity.py`` lock the behaviour: BM25 + dense each preserve the validity window for a chunk written with bounds, and BM25 returns the always-valid ``(None, None)`` pair for chunks written without bounds (so ``c.*`` is not silently fabricating values for the backward-compat default).

ruff clean; touched-module sweep covers test_storage*, test_chunking*, test_sqlite_schema, test_chunk_links_schema, test_dedup, test_conflict, test_expansion, test_indexing_engine, test_context_window: 285/285 pass.

Co-Authored-By: Claude <[email protected]>

memtomem · 2026-04-29T02:57:30Z

Review feedback addressed in follow-up commit 460592b.

🟡 (addressed) The bm25_search / dense_search queries were SELECTing only the 13 core chunk columns and slicing row[:13] into _row_to_chunk, so the new len(row) >= 21 validity guard never tripped — every search-derived chunk carried valid_from_unix=None regardless of what was stored. The same shape also silently zeroed overlap_before / overlap_after and dropped importance_score, a latent bug independent of this RFC.

Switched both queries to SELECT c.*, sub.<score> so the full row layout reaches _row_to_chunk and every defensive guard activates uniformly. Sliced with row[:-1] / row[-1] to stay index-agnostic if future ALTER TABLE additions extend the chunks table further.
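
The slicing change can be illustrated with a toy schema. This is a sketch only: the four-column chunks table, the scores table standing in for the BM25 / dense subquery, and the score column name are all assumptions, not the real layout.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Build a toy database; the real chunks table has many more columns."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE chunks (
            id TEXT PRIMARY KEY,
            content TEXT,
            valid_from_unix INTEGER,
            valid_to_unix INTEGER
        );
        CREATE TABLE scores (chunk_id TEXT, score REAL);
        INSERT INTO chunks VALUES ('a', 'hello', 1000, 2000);
        INSERT INTO scores VALUES ('a', 0.75);
        """
    )
    return conn


def search(conn: sqlite3.Connection) -> list[tuple[tuple, float]]:
    """Return (chunk_row, score) pairs without hard-coding the column count."""
    rows = conn.execute(
        "SELECT c.*, sub.score FROM chunks c "
        "JOIN (SELECT chunk_id, score FROM scores) sub ON sub.chunk_id = c.id "
        "ORDER BY sub.score DESC"
    ).fetchall()
    # The score is always the last column, so everything before it is the
    # full chunk row, however many columns later migrations append.
    return [(row[:-1], row[-1]) for row in rows]
```

If a later ALTER TABLE appends another column to chunks, search keeps returning the complete chunk row and the correct score, whereas a fixed row[:13]-style slice would silently drop the new column.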

Three new round-trip tests in test_temporal_validity.py lock the behaviour:

test_bm25_search_preserves_validity_window
test_dense_search_preserves_validity_window
test_bm25_search_returns_none_pair_for_unset_chunks (confirms c.* did not silently fabricate values for the always-valid backward-compat default)

Regression sweep is 285/285 across storage / chunking / schema and the search-pipeline-adjacent suite (test_context_window included, since context expansion reads overlap_before/after — the side-effect restoration is safe).

🟢 (deferred) The remaining nits:

Nested YAML keys (e.g. metadata: { valid_from: ... }) are still matched by the parser. Adding a leading-space check is not worth the extra branch right now, since frontmatter in this codebase is overwhelmingly flat.
Inline # comments after a value are unsupported. This is better captured as a small RFC doc note in a follow-up PR than as parser code.
valid_from > valid_to cross-validation — naturally lives with the pipeline filter PR (Goal 4); a reverse window is unconditionally excluded by the filter, so risk is bounded.
Double _FRONT_MATTER_RE.match (in _extract_frontmatter_tags + _extract_validity_window) — intentional separation kept; cost is trivial. If frontmatter fields keep growing, a single parse → dict refactor is the right consolidation moment.
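
To make the "reverse window is unconditionally excluded" argument concrete, here is a sketch of an inclusive as-of check. The filter itself is not part of this PR, so is_valid_at is a hypothetical function that assumes the Goal 4 stage compares each bound independently.

```python
from __future__ import annotations


def is_valid_at(
    valid_from_unix: int | None,
    valid_to_unix: int | None,
    as_of_unix: int,
) -> bool:
    """Inclusive window check; None on a side means unbounded on that side."""
    if valid_from_unix is not None and as_of_unix < valid_from_unix:
        return False
    if valid_to_unix is not None and as_of_unix > valid_to_unix:
        return False
    return True
```

For a reversed window such as valid_from_unix=200, valid_to_unix=100, any as_of below 200 fails the first check and any as_of of 200 or more fails the second, so no query time can match it.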

pandas-studio and others added 2 commits April 29, 2026 11:45

memtomem merged commit 93e1ef3 into main Apr 29, 2026
7 checks passed

github-actions Bot locked and limited conversation to collaborators Apr 29, 2026

memtomem merged 2 commits into main from feat/temporal-validity-frontmatter
