feat(storage): add chunk_links table + one-shot back-fill from shared-from tags #469

Merged: memtomem merged 1 commit into main from feat/chunk-links-schema-and-backfill on Apr 25, 2026

@memtomem
Owner

Summary

PR-1 of the chunk_links series (private RFC
mem-agent-share-chunk-links-rfc.md). Storage-only change: no
public API surface yet, no behavior change for existing callers.
PR-2 wires the writer / reader Python API; PR-3 (optional) exposes a
mem_share_lineage MCP tool.

Why

mem_agent_share historically encoded provenance as a
shared-from=<uuid> audit tag in the destination chunk's tags array.
Tag-based provenance:

  • doesn't benefit from an index — fanout queries are
    tags LIKE '%shared-from=%', a full table scan;
  • breaks on UUID churn — reindex re-issues chunk ids and the
    shared-from=<old-uuid> chain breaks at the gap;
  • can't enforce a relationship across delete: a tag is plain text,
    so the database cannot null it out or cascade when either chunk
    is deleted.

This PR adds the storage substrate so PR-2 can wire structured
provenance without a coordinated big-bang change.

Schema

chunk_links is keyed on (target_id, link_type), so each destination
chunk has at most one link of each type. Indexes on
(source_id, link_type) and namespace_target cover fanout and
per-namespace queries. FK semantics:

  • ON DELETE SET NULL on source_id — preserves the existing
    copy-on-share durability (a teammate's delete doesn't yank yours).
    The markdown shared-from= tag stays in content for human-readable
    provenance after the join becomes NULL.
  • ON DELETE CASCADE on target_id — no dangling pointer when the
    destination chunk is deleted.
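
The FK semantics above can be sketched in SQLite as follows. This is a minimal illustration, not the DDL from this PR: the column types, the index names, the simplified `chunks` table, and the `shared_from` link type value are all assumptions.

```python
import sqlite3

# Hypothetical, simplified DDL. Column types and index names are assumptions;
# only the key, index columns, and ON DELETE actions come from the description.
SCHEMA = """
CREATE TABLE IF NOT EXISTS chunks (
    id        TEXT PRIMARY KEY,
    namespace TEXT NOT NULL,
    tags      TEXT NOT NULL DEFAULT '[]'
);
CREATE TABLE IF NOT EXISTS chunk_links (
    source_id        TEXT REFERENCES chunks(id) ON DELETE SET NULL,
    target_id        TEXT NOT NULL REFERENCES chunks(id) ON DELETE CASCADE,
    link_type        TEXT NOT NULL,
    namespace_target TEXT NOT NULL,
    PRIMARY KEY (target_id, link_type)
);
CREATE INDEX IF NOT EXISTS idx_chunk_links_source
    ON chunk_links (source_id, link_type);
CREATE INDEX IF NOT EXISTS idx_chunk_links_ns_target
    ON chunk_links (namespace_target);
"""


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    # SQLite does not enforce foreign keys unless this pragma is set per connection.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


conn = open_db()
conn.executemany(
    "INSERT INTO chunks (id, namespace) VALUES (?, ?)",
    [("src", "alice"), ("dst", "bob")],
)
conn.execute("INSERT INTO chunk_links VALUES ('src', 'dst', 'shared_from', 'bob')")

# Deleting the source keeps the link row but nulls out source_id.
conn.execute("DELETE FROM chunks WHERE id = 'src'")
after_source_delete = conn.execute(
    "SELECT source_id, target_id FROM chunk_links"
).fetchall()

# Deleting the destination removes the link row entirely.
conn.execute("DELETE FROM chunks WHERE id = 'dst'")
after_target_delete = conn.execute("SELECT COUNT(*) FROM chunk_links").fetchone()[0]
```

Both ON DELETE actions depend on `PRAGMA foreign_keys = ON`; without it SQLite parses the clauses but silently ignores them.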

link_type validation lives in Python (_VALID_LINK_TYPES), not a
CHECK constraint, so adding a new value (consolidated_from,
reflected_from) is one PR, not two.
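
A rough sketch of what Python-side validation might look like. Only the name `_VALID_LINK_TYPES` comes from this PR; the `shared_from` value and the `validate_link_type` helper are hypothetical. SQLite cannot alter a CHECK constraint in place, so keeping the allow-list in Python avoids a table rebuild when a value is added.

```python
# Assumed initial contents; the real set in the PR may differ.
_VALID_LINK_TYPES = frozenset({"shared_from"})


def validate_link_type(link_type: str) -> str:
    """Return link_type unchanged, or raise if it is not in the allow-list.

    Hypothetical helper: meant to run before any write to chunk_links.
    """
    if link_type not in _VALID_LINK_TYPES:
        allowed = ", ".join(sorted(_VALID_LINK_TYPES))
        raise ValueError(f"unknown link_type {link_type!r}; expected one of: {allowed}")
    return link_type
```

Under this sketch, adding `consolidated_from` would be a one-line change to the set with no schema migration.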

Back-fill

One-shot pass on first startup after upgrade:

  1. SELECT id, namespace, tags FROM chunks WHERE tags LIKE '%shared-from=%'.
  2. Parse the source UUID from the tag.
  3. Resolve against chunks.id; missing source → source_id=NULL
    (same end-state as a post-RFC share whose source was later
    deleted).
  4. INSERT OR IGNORE INTO chunk_links (...).
  5. Record completion in _memtomem_meta (chunk_links_backfill_v1).

Bumping the version key triggers a re-run if the parser ever needs to
widen.
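
The five steps above could be sketched roughly as below. This is not the code from the PR: the `_memtomem_meta (key, value)` layout, the `shared_from` link type, the tag parsing rule, and the function name `backfill_chunk_links` are assumptions, and FK clauses are omitted for brevity.

```python
import json
import sqlite3

MARKER_KEY = "chunk_links_backfill_v1"
TAG_PREFIX = "shared-from="

# Simplified, assumed table layouts (no FK clauses in this sketch).
SCHEMA = """
CREATE TABLE chunks (id TEXT PRIMARY KEY, namespace TEXT NOT NULL, tags TEXT);
CREATE TABLE chunk_links (
    source_id TEXT, target_id TEXT NOT NULL, link_type TEXT NOT NULL,
    namespace_target TEXT NOT NULL,
    PRIMARY KEY (target_id, link_type)
);
CREATE TABLE _memtomem_meta (key TEXT PRIMARY KEY, value TEXT);
"""


def backfill_chunk_links(conn: sqlite3.Connection) -> int:
    """Return the number of link rows inserted; 0 if the marker is already set."""
    done = conn.execute(
        "SELECT 1 FROM _memtomem_meta WHERE key = ?", (MARKER_KEY,)
    ).fetchone()
    if done:
        return 0

    inserted = 0
    rows = conn.execute(
        "SELECT id, namespace, tags FROM chunks WHERE tags LIKE '%shared-from=%'"
    ).fetchall()
    for chunk_id, namespace, tags_json in rows:
        try:
            tags = json.loads(tags_json)
        except (TypeError, ValueError):
            continue  # malformed tags JSON is skipped
        if not isinstance(tags, list):
            continue
        source = next(
            (t[len(TAG_PREFIX):] for t in tags
             if isinstance(t, str) and t.startswith(TAG_PREFIX)),
            None,
        )
        if not source:
            continue
        exists = conn.execute("SELECT 1 FROM chunks WHERE id = ?", (source,)).fetchone()
        cur = conn.execute(
            "INSERT OR IGNORE INTO chunk_links "
            "(source_id, target_id, link_type, namespace_target) "
            "VALUES (?, ?, 'shared_from', ?)",
            (source if exists else None, chunk_id, namespace),
        )
        inserted += cur.rowcount  # 0 when the row already existed

    conn.execute(
        "INSERT OR REPLACE INTO _memtomem_meta (key, value) VALUES (?, 'done')",
        (MARKER_KEY,),
    )
    conn.commit()
    return inserted


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany(
    "INSERT INTO chunks VALUES (?, ?, ?)",
    [
        ("a", "alice", "[]"),
        ("b", "bob", '["shared-from=a"]'),       # source still exists
        ("c", "bob", '["shared-from=gone"]'),    # source already deleted
        ("d", "bob", "not json shared-from=a"),  # malformed tags
    ],
)
first_run = backfill_chunk_links(conn)
second_run = backfill_chunk_links(conn)
links = conn.execute(
    "SELECT source_id, target_id FROM chunk_links ORDER BY target_id"
).fetchall()
```

In this sketch, changing `MARKER_KEY` to a new version string is what would make the next startup scan again, while `INSERT OR IGNORE` keeps rows written by an earlier pass.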

Test plan

  • test_chunk_links_schema.py — 10 cases:
    • schema: table + indexes created, create_tables idempotent.
    • FK behavior: SET NULL on source delete, CASCADE on target
      delete, PK uniqueness conflict.
    • back-fill: existing source resolved, missing source → NULL,
      unrelated tags ignored, marker prevents re-scan of
      post-migration rows, malformed tags JSON skipped.
  • Full suite: uv run pytest -m "not ollama" → 2421 passed
    (was 2411, +10 new cases). No regression.
  • ruff check + ruff format --check → clean.

🤖 Generated with Claude Code

…-from tags

PR-1 of the chunk_links series (private RFC
`mem-agent-share-chunk-links-rfc.md`). Storage-only change — no public
API surface yet, no behavior change for existing callers.

`mem_agent_share` historically encoded provenance as a
`shared-from=<uuid>` audit tag baked into the destination chunk's
`tags` array. Tag-based provenance has three problems: it doesn't
benefit from an index (fanout query is `tags LIKE '%shared-from=%'`,
a full table scan), it breaks on UUID churn (reindex re-issues chunk
ids and the audit chain breaks at the gap), and it can't enforce a
relationship across delete.

This PR adds the storage substrate. PR-2 wires the writer
(`mem_agent_share` records into `chunk_links` on share) and reader
Python API; PR-3 (optional) exposes a `mem_share_lineage` MCP tool.

## Schema

`chunk_links` has `PRIMARY KEY (target_id, link_type)` so every
destination chunk has at most one link of each type, plus indexes on
`(source_id, link_type)` and `namespace_target` for fanout / per-NS
audit queries. FK semantics:

- `ON DELETE SET NULL` on `source_id` — preserves the existing
  copy-on-share durability: a teammate deleting their note does not
  delete yours; the link row stays with `source_id=NULL` and the
  destination chunk lives on. Provenance is still recoverable from
  the markdown `shared-from=` tag (still written into content).
- `ON DELETE CASCADE` on `target_id` — destination delete drops the
  row, no dangling pointer.

`link_type` validation lives in Python (`_VALID_LINK_TYPES`) rather
than as a CHECK constraint so adding a new value (`consolidated_from`,
`reflected_from`) is one PR, not two.
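
To illustrate the fanout query the `(source_id, link_type)` index is meant to serve, the sketch below compares SQLite's query plan for the old tag scan with the plan for a lookup on `chunk_links`. The table layouts, the index name, and the `shared_from` value are assumptions, and the exact plan wording varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified, assumed layouts; only the key and index columns come from the PR.
conn.executescript("""
CREATE TABLE chunks (id TEXT PRIMARY KEY, namespace TEXT, tags TEXT);
CREATE TABLE chunk_links (
    source_id TEXT, target_id TEXT NOT NULL, link_type TEXT NOT NULL,
    namespace_target TEXT NOT NULL,
    PRIMARY KEY (target_id, link_type)
);
CREATE INDEX idx_chunk_links_source ON chunk_links (source_id, link_type);
""")


def plan(sql: str, params: tuple = ()) -> str:
    """Join the human-readable detail column of EXPLAIN QUERY PLAN rows."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)


# Old approach: a leading-wildcard LIKE cannot use an index.
tag_plan = plan("SELECT id FROM chunks WHERE tags LIKE '%shared-from=%'")

# New approach: "which chunks were shared from this source?"
link_plan = plan(
    "SELECT target_id FROM chunk_links WHERE source_id = ? AND link_type = ?",
    ("src", "shared_from"),
)

print(tag_plan)
print(link_plan)
```

The first plan should report a scan of `chunks`, and the second a search that uses `idx_chunk_links_source`.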

## Back-fill

Existing share copies (created before this PR ships) have a
`shared-from=<uuid>` tag in `chunks.tags` but no row in `chunk_links`.
A one-shot pass scans those rows, parses the source UUID, resolves it
against `chunks.id` (NULL if the source was already deleted), and
`INSERT OR IGNORE`s into `chunk_links`. Completion is recorded in
`_memtomem_meta` (`chunk_links_backfill_v1`) so subsequent startups
short-circuit. Bumping the version key triggers a re-run if the
parser ever needs to widen.

## Tests

- `test_chunk_links_schema.py`: table+indexes shape, idempotent re-run,
  FK SET NULL on source delete, FK CASCADE on target delete, PK
  uniqueness conflict.
- Back-fill cases: existing source resolved, missing source → NULL,
  unrelated tags ignored, marker prevents re-scan of post-migration
  rows, malformed `tags` JSON skipped.
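
As an illustration of the PK uniqueness case, here is a minimal sketch, not taken from `test_chunk_links_schema.py`; the column layout and the `shared_from` value are assumptions. A second link of the same type for the same destination is rejected by a plain `INSERT` and silently dropped by `INSERT OR IGNORE`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified, assumed layout; the composite primary key is from the PR.
conn.execute("""
CREATE TABLE chunk_links (
    source_id TEXT, target_id TEXT NOT NULL, link_type TEXT NOT NULL,
    namespace_target TEXT NOT NULL,
    PRIMARY KEY (target_id, link_type)
)
""")

INSERT = "INSERT INTO chunk_links VALUES (?, ?, ?, ?)"
conn.execute(INSERT, ("src-1", "dst", "shared_from", "bob"))

# Same (target_id, link_type) again: the primary key rejects it.
try:
    conn.execute(INSERT, ("src-2", "dst", "shared_from", "bob"))
    conflict_raised = False
except sqlite3.IntegrityError:
    conflict_raised = True

# INSERT OR IGNORE leaves the original row untouched instead of raising.
conn.execute(
    "INSERT OR IGNORE INTO chunk_links VALUES (?, ?, ?, ?)",
    ("src-2", "dst", "shared_from", "bob"),
)
kept_source = conn.execute(
    "SELECT source_id FROM chunk_links WHERE target_id = 'dst'"
).fetchone()[0]
```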

Full suite: 2421 passed (was 2411, +10). ruff clean.

Co-Authored-By: Claude <[email protected]>
@memtomem memtomem merged commit 95864c5 into main Apr 25, 2026
7 checks passed
@memtomem memtomem deleted the feat/chunk-links-schema-and-backfill branch April 25, 2026 00:59
@github-actions github-actions Bot locked and limited conversation to collaborators Apr 25, 2026