feat(indexing): parse temporal-validity frontmatter into chunk metadata #533
First-cut implementation of Goals 1+2+3 of the temporal-validity RFC (planning/temporal-validity-rfc.md, commit 297470a in memtomem-docs).

- ChunkMetadata gains valid_from_unix / valid_to_unix (int | None). NULL on either side means unbounded; both NULL = always-valid, the RFC §Goal 4 backward-compat default for existing chunks.
- Markdown chunker reads valid_from / valid_to from YAML frontmatter, accepting YYYY-MM-DD (date-only) and YYYY-QN (quarter) per RFC. The lower bound resolves to the unit's start (00:00:00 UTC) and the upper bound to the end of the unit's last day (23:59:59 UTC), so the window is inclusive at both ends.
- Malformed values on one side return None for that side without poisoning the other or aborting indexing: a single typo should not prevent the file from being indexed.
- chunks table gains valid_from_unix / valid_to_unix INTEGER columns, added via the existing idempotent ALTER TABLE pattern. INSERT, UPDATE, and _row_to_chunk thread the values end to end. UPDATE is included so reindex propagates frontmatter validity edits to chunks whose content_hash did not change.

Pipeline filter (RFC Goal 4, the validity_filter stage), mem_search as_of=, CLI --as-of, and the Web UI badge land in follow-up PRs.

24 new tests cover parser format edge cases, frontmatter extraction, chunker → metadata propagation, schema column shape, and migration idempotency. ruff check + format pass; touched-module suite (chunking, storage, schema, chunk_links) is 143/143 green.

Co-Authored-By: Claude <[email protected]>
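The bound-resolution rules in this commit message can be sketched as follows. This is a minimal illustration, not the PR's actual parser: the names `parse_validity_bound` and `parse_validity_window` are hypothetical, and the sketch assumes the frontmatter has already been loaded into a dict (some YAML loaders return unquoted dates as `datetime.date` objects, which the sketch converts back to text).

```python
from __future__ import annotations

import calendar
import re
from datetime import date, datetime, timezone

_DATE_RE = re.compile(r"^(\d{4})-(\d{2})-(\d{2})$")
_QUARTER_RE = re.compile(r"^(\d{4})-Q([1-4])$")


def parse_validity_bound(value: object, *, upper: bool) -> int | None:
    """Hypothetical helper: Unix timestamp for one bound, or None if malformed."""
    # Some YAML loaders turn an unquoted 2024-01-15 into a date object.
    if isinstance(value, date) and not isinstance(value, datetime):
        value = value.isoformat()
    if not isinstance(value, str):
        return None
    text = value.strip()
    try:
        if m := _DATE_RE.match(text):
            year, month, day = (int(g) for g in m.groups())
        elif m := _QUARTER_RE.match(text):
            year, quarter = int(m.group(1)), int(m.group(2))
            if upper:
                # Last day of the quarter's final month.
                month = quarter * 3
                day = calendar.monthrange(year, month)[1]
            else:
                # First day of the quarter's first month.
                month, day = quarter * 3 - 2, 1
        else:
            return None
        clock = (23, 59, 59) if upper else (0, 0, 0)
        moment = datetime(year, month, day, *clock, tzinfo=timezone.utc)
    except ValueError:
        # Impossible dates such as 2024-13-40 count as malformed.
        return None
    return int(moment.timestamp())


def parse_validity_window(frontmatter: dict) -> tuple[int | None, int | None]:
    """Each side is parsed independently, so one typo cannot poison the other."""
    return (
        parse_validity_bound(frontmatter.get("valid_from"), upper=False),
        parse_validity_bound(frontmatter.get("valid_to"), upper=True),
    )
```

Under these assumptions, `parse_validity_window({"valid_from": "2024-Q1", "valid_to": "2024-13-40"})` keeps the lower bound (2024-01-01 00:00:00 UTC) and returns `None` only for the malformed upper bound.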
…arch

Review feedback on PR #533 caught that ``bm25_search`` and ``dense_search`` were SELECTing only the 13 core chunk columns and slicing ``row[:13]`` into ``_row_to_chunk``, so the new ``len(row) >= 21`` validity guard never tripped: every search-derived chunk carried ``valid_from_unix=None`` regardless of what was stored. The same shape also silently zeroed ``overlap_before`` / ``overlap_after`` and dropped ``importance_score`` from search results, a latent bug independent of the temporal-validity work.

Switch both queries to ``SELECT c.*, sub.<score>`` so the full row layout reaches ``_row_to_chunk`` and every defensive guard activates uniformly. Slice with ``row[:-1]`` / ``row[-1]`` to stay index-agnostic if future ALTER TABLE additions extend the chunks table further.

Three new round-trip tests in ``test_temporal_validity.py`` lock the behaviour: BM25 + dense each preserve the validity window for a chunk written with bounds, and BM25 returns the always-valid ``(None, None)`` pair for chunks written without bounds (so ``c.*`` is not silently fabricating values for the backward-compat default).

ruff clean; touched-module sweep covers test_storage*, test_chunking*, test_sqlite_schema, test_chunk_links_schema, test_dedup, test_conflict, test_expansion, test_indexing_engine, test_context_window: 285/285 pass.

Co-Authored-By: Claude <[email protected]>
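The index-agnostic slicing described in this commit can be illustrated with a toy schema. Everything here is an assumption made for the sketch: the four-column `chunks` table, the `build_demo_db` and `search` helpers, and the `1.0 / id` expression standing in for a real BM25 or dense score. Only the `SELECT c.*, sub.<score>` shape and the `row[:-1]` / `row[-1]` split come from the text above.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Toy stand-in for the real chunks table (which has many more columns)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE chunks (id INTEGER PRIMARY KEY, content TEXT)")
    # Columns added later by migration land at the end of the row layout.
    conn.execute("ALTER TABLE chunks ADD COLUMN valid_from_unix INTEGER")
    conn.execute("ALTER TABLE chunks ADD COLUMN valid_to_unix INTEGER")
    conn.execute("INSERT INTO chunks VALUES (1, 'bounded', 1704067200, 1711929599)")
    conn.execute("INSERT INTO chunks (id, content) VALUES (2, 'unbounded')")
    return conn


def search(conn: sqlite3.Connection) -> list[tuple[tuple, float]]:
    """Return (full chunk row, score) pairs, best score first."""
    rows = conn.execute(
        """
        SELECT c.*, sub.score
        FROM (SELECT id AS chunk_id, 1.0 / id AS score FROM chunks) AS sub
        JOIN chunks AS c ON c.id = sub.chunk_id
        ORDER BY sub.score DESC
        """
    ).fetchall()
    # The score is always the last column, so this split keeps working
    # no matter how many columns future ALTER TABLE statements append.
    return [(row[:-1], row[-1]) for row in rows]
```

Because the chunk portion of each row is whatever `c.*` returns, the bounded chunk keeps its stored window and the unbounded chunk comes back with `(None, None)` in the validity columns.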
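Review feedback addressed in follow-up commit

🟡 (addressed) `bm25_search` and `dense_search` selected only the 13 core chunk columns, so search-derived chunks lost their stored validity window. Switched both queries to `SELECT c.*, sub.<score>`. Three new round-trip tests in `test_temporal_validity.py` lock the behaviour.

Regression sweep is 285/285 across storage / chunking / schema and the search-pipeline-adjacent suite (`test_dedup`, `test_conflict`, `test_expansion`, `test_indexing_engine`, `test_context_window`).

🟢 (deferred) The remaining nits: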
Summary
- `valid_from` / `valid_to` parsing for `YYYY-MM-DD` (date-only) and `YYYY-QN` (quarter) formats; surfaces the parsed window on `ChunkMetadata` and persists it to two new `chunks` columns (`valid_from_unix`, `valid_to_unix`).
- `validity_filter` stage, `mem_search(as_of=...)`, the `mm search --as-of` CLI flag, and the Web UI badge land in follow-up PRs.

Design choices worth noting
- Malformed values on one side return `None` for that side without poisoning the other or aborting indexing. A typo in `valid_from` should not block a file from being indexed.
- `UPDATE` includes the new columns: same pattern as `tags` (also frontmatter-derived). This means reindex propagates frontmatter validity edits to chunks whose `content_hash` did not change; otherwise an editor changing only the validity window would see no effect on body chunks.

RFC reference
`memtomem-docs/memtomem/planning/temporal-validity-rfc.md` (commit `297470a` on `memtomem-docs` main). Goals 1+2+3 only.

Test plan
- `uv run pytest packages/memtomem/tests/test_temporal_validity.py`: 24 / 24 pass.
- Touched-module suite (`test_chunking`, `test_chunking_blockquote_tags`, `test_chunkers_extended`, `test_storage`, `test_storage_extended`, `test_storage_noop`, `test_sqlite_schema`, `test_chunk_links_schema`, `test_temporal_validity`): 143 / 143 pass.
- `uv run ruff check` + `uv run ruff format --check` clean.

🤖 Generated with Claude Code
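For readers unfamiliar with the idempotent ALTER TABLE pattern that the migration-idempotency tests exercise, one common way to write it with SQLite is sketched below. This is an assumption about the general technique, not a copy of the project's migration code; `ensure_validity_columns` is a hypothetical name.

```python
import sqlite3


def ensure_validity_columns(conn: sqlite3.Connection) -> None:
    """Add the two validity columns only if they are not already present."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    existing = {row[1] for row in conn.execute("PRAGMA table_info(chunks)")}
    for column in ("valid_from_unix", "valid_to_unix"):
        if column not in existing:
            # New INTEGER columns default to NULL, i.e. "unbounded" for old rows.
            conn.execute(f"ALTER TABLE chunks ADD COLUMN {column} INTEGER")
```

Running the helper twice leaves the schema unchanged on the second call, and rows written before the migration read back as NULL on both sides, which matches the always-valid default described in the PR.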