Skip to content

refactor(sources): RFC 002 §9 scaffolding — BaseSourceAdapter, registry, PalaceContext#1014

Merged
igorls merged 2 commits intodevelopfrom
refactor/rfc-002-sources-scaffolding
Apr 18, 2026
Merged

refactor(sources): RFC 002 §9 scaffolding — BaseSourceAdapter, registry, PalaceContext#1014
igorls merged 2 commits intodevelopfrom
refactor/rfc-002-sources-scaffolding

Conversation

@igorls
Copy link
Copy Markdown
Member

@igorls igorls commented Apr 18, 2026

Summary

Lands the read-side plugin contract so third-party source adapters can publish pip install mempalace-source-<name> packages against a stable target, matching what #995 did for storage backends on the write side.

Sibling to RFC 001 / #743 (write) and RFC 002 / #990 (spec). Tracking issue: #989.

What's in this PR

  • mempalace/sources/base.pyBaseSourceAdapter ABC with kwargs-only ingest() / describe_schema() and default implementations of is_current() / source_summary() / close() (§1.1–1.2). Typed records: SourceRef, SourceItemMetadata, DrawerRecord, RouteHint, SourceSummary, AdapterSchema, FieldSpec (§1.3, §5.2). Error classes (§2.7). Class-level identity contract: name, adapter_version, capabilities, supported_modes, declared_transformations, default_privacy_class (§2.1, §1.4, §1.5, §6).

  • mempalace/sources/transforms.py — reference implementations of the 13 reserved transformations (§1.4): utf8_replace_invalid, newline_normalize, whitespace_trim, whitespace_collapse_internal, line_trim, line_join_spaces, blank_line_drop as pure functions. The six adapter-specific ones (strip_tool_chrome, tool_result_truncate, tool_result_omitted, spellcheck_user, synthesized_marker, speaker_role_assignment) ship as identity shims the conversations adapter will override when migrated. get_transformation(name) resolves reserved names.

  • mempalace/sources/registry.py — entry-point discovery via importlib.metadata.entry_points(group="mempalace.sources") + register() / unregister() (§3.1–3.2). resolve_adapter_for_source() implements the §3.3 priority order. Crucially: no auto-detection on the read side (§3.3 is explicit about that — user intent never inferred from on-disk artifacts).

  • mempalace/sources/context.pyPalaceContext facade (§9) bundling drawer/closet collections, knowledge graph, palace path, adapter identity, and progress hooks. upsert_drawer() applies the spec-mandated adapter_name/adapter_version stamps from §5.1 so adapters don't need to populate them. skip_current_item() signals laziness. emit() dispatches to hooks and swallows hook exceptions.

  • mempalace/knowledge_graph.pyadd_triple() gains optional source_drawer_id and adapter_name kwargs (§5.5). Backwards-compatible schema migration auto-adds the new columns on open of a pre-RFC 002 palace (PRAGMA table_infoALTER TABLE ADD COLUMN), so existing palaces upgrade transparently.

  • pyproject.tomlmempalace.sources entry-point group declared (empty on the first-party side for now). Third-party packages can begin registering today; the group being declared is the enabling bit.

Explicitly out of scope (follow-up PRs)

  • miner.pymempalace/sources/filesystem.py. Behavior-preserving rename + READABLE_EXTENSIONS, detect_room(), detect_hall() moving into the adapter.
  • convo_miner.py + normalize.pymempalace/sources/conversations.py. Format-detection if-chain becomes per-format plugins; declared_transformations enumerates what the current pipeline already does to source bytes (§1.4 existing-code mapping).
  • Closet post-step wired into the conversations adapter (§1.7).
  • CLI --source flag + --mode deprecation alias (§3.3).
  • MCP mempalace_mine tool source parameter.
  • AbstractSourceAdapterContractSuite (§7.1–7.3): byte-preservation + declared-transformation round-trip tests.
  • Privacy-class floor enforcement (§6.2) — depends on feat: add sensitive content scanner for palace drawers #389 for secrets_possible scanning.

Test plan

  • uv run python -m pytest tests/ --ignore=tests/benchmarks1018 passed (+27 targeted tests for this PR).
  • uv run ruff check . — clean.
  • uv run ruff format --check . — clean.
  • Coverage in tests/test_sources.py:
    • ABC instantiation enforcement (missing required methods → TypeError)
    • Typed records + default values isolation (frozen dataclass field(default_factory=dict) doesn't share state)
    • All 13 reserved transformations present in the registry
    • Pure reserved transforms: correct input → output for each
    • get_transformation resolves reserved names, rejects unknown
    • Registry: register / get_adapter / get_adapter_class / unregister, caching semantics, unknown-name KeyError
    • resolve_adapter_for_source priority order; default = filesystem
    • PalaceContext.upsert_drawer stamps adapter_name / adapter_version / source_file / chunk_index
    • PalaceContext.skip_current_item sets flag; emit dispatches and swallows hook errors
    • KnowledgeGraph.add_triple accepts new kwargs; writes to new columns
    • Legacy palaces without the new columns auto-migrate on open
    • Backwards-compat: existing add_triple callers unchanged

Coordination

cc @Perseusxrltd @JakobSachs @adv3nt3 @zendesk-thittesdorf @mfhens @roip @MrDys — this is the §9 spec surface called out in #989. If you're working on Cursor/OpenCode/Pi/git/factory source adapters, this is the ABC to target. Once this merges, the ~next PR migrates miner.py / convo_miner.py onto the same contract (so we have two first-party reference adapters) and then the in-flight source-ingester PRs can align.

Refs: #989 (RFC 002 tracking), #990 (RFC 002 spec), #995 (RFC 001 §10 write-side sibling).

…ry, PalaceContext

Lands the read-side contract so third-party adapter authors (@Perseusxrltd,
@JakobSachs, @adv3nt3, @zendesk-thittesdorf, @mfhens, @roip, @MrDys) have a
stable target matching what RFC 001 §10 landed on the write side in #995.

Scope (this PR):

- mempalace/sources/base.py: BaseSourceAdapter ABC with kwargs-only
  ingest() / describe_schema() and default is_current() / source_summary()
  / close() (§1.1–1.2). Typed records: SourceRef, SourceItemMetadata,
  DrawerRecord, RouteHint, SourceSummary, AdapterSchema, FieldSpec (§1.3,
  §5.2). Error classes: SourceNotFoundError, AuthRequiredError,
  AdapterClosedError, TransformationViolationError, SchemaConformanceError
  (§2.7). Class-level identity contract: name / adapter_version /
  capabilities / supported_modes / declared_transformations /
  default_privacy_class (§2.1, §1.4, §1.5, §6).

- mempalace/sources/transforms.py: reference implementations of the 13
  reserved transformations (§1.4) — utf8_replace_invalid, newline_normalize,
  whitespace_trim, whitespace_collapse_internal, line_trim, line_join_spaces,
  blank_line_drop — as pure functions, plus identity shims for the six
  adapter-specific ones (strip_tool_chrome, tool_result_truncate,
  tool_result_omitted, spellcheck_user, synthesized_marker,
  speaker_role_assignment) that the conversations adapter will override
  when migrated. get_transformation(name) resolves by reserved name.

- mempalace/sources/registry.py: entry-point discovery via
  importlib.metadata.entry_points(group="mempalace.sources") + explicit
  register()/unregister() surface (§3.1–3.2). resolve_adapter_for_source()
  implements the §3.3 priority order; crucially, no auto-detection on the
  read side (§3.3 is explicit about that — user intent never inferred from
  on-disk artifacts).

- mempalace/sources/context.py: PalaceContext facade (§9) bundling the
  drawer/closet collections, knowledge graph, palace path, adapter identity,
  and progress hooks core passes into adapter.ingest(). upsert_drawer()
  applies the spec-mandated adapter_name/adapter_version stamps from §5.1.
  skip_current_item() signals laziness; emit() dispatches to hooks and
  swallows hook exceptions.

- mempalace/knowledge_graph.py: add_triple() gains optional source_drawer_id
  and adapter_name kwargs (§5.5). Backwards-compatible column migration
  auto-adds the new columns on open of a pre-RFC 002 palace (PRAGMA
  table_info then ALTER TABLE ADD COLUMN), matching the pattern used for
  any new palace-side provenance fields.

- pyproject.toml: mempalace.sources entry-point group declared. Empty on
  the first-party side for now — miners migrate in a follow-up; the group
  being present means third-party packages can begin registering today.

Out of scope (explicit follow-ups):

- miner.py → mempalace/sources/filesystem.py. Behavior-preserving rename
  that also moves READABLE_EXTENSIONS, detect_room(), detect_hall() into
  the adapter (§9). Larger refactor; lands separately.
- convo_miner.py + normalize.py → mempalace/sources/conversations.py. The
  format-detection if-chain in normalize.py becomes per-format plugins;
  declared_transformations enumerates what the current pipeline already
  does to source bytes (§1.4 existing-code mapping).
- Closet post-step wired into the conversations adapter (§1.7).
- CLI --source flag + --mode deprecation alias (§3.3).
- MCP mempalace_mine tool source parameter.
- AbstractSourceAdapterContractSuite (§7.1–7.3): byte-preservation round-
  trip and declared-transformation round-trip tests.
- Privacy-class floor enforcement (§6.2); depends on #389 for
  secrets_possible scanning.

Tests: 1018 passed (up from ~990 on develop), +27 targeted tests covering
the ABC instantiation rules, typed records, all reserved transformations,
the registry register/get/unregister surface, PalaceContext upsert + skip +
emit semantics, and both the new KG provenance kwargs and backwards-
compatible legacy-schema migration.

Refs: #989 (RFC 002 tracking), #990 (RFC 002 spec), #995 (RFC 001 §10
cleanup — sibling PR on the write side).
Copilot AI review requested due to automatic review settings April 18, 2026 19:06
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Introduces the RFC 002 “read-side” source adapter scaffolding so third-party adapters can plug into MemPalace via a stable mempalace.sources contract (mirroring the RFC 001 backend seam).

Changes:

  • Added mempalace.sources public API: BaseSourceAdapter ABC, typed ingest records, adapter registry, and PalaceContext facade.
  • Implemented reserved transformation reference functions + a resolver in mempalace/sources/transforms.py.
  • Extended KnowledgeGraph.add_triple() with optional provenance fields and added an on-open SQLite schema migration for legacy palaces.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
mempalace/sources/base.py Defines the adapter contract (ABC), typed records, and source-adapter error types.
mempalace/sources/context.py Adds the PalaceContext facade and drawer upsert helper used by adapters.
mempalace/sources/registry.py Adds registry + entry-point discovery for mempalace.sources adapters.
mempalace/sources/transforms.py Adds reference implementations for reserved transformations + lookup helper.
mempalace/sources/__init__.py Exposes the new sources subsystem public surface.
mempalace/knowledge_graph.py Adds provenance kwargs to add_triple() and schema auto-migration for new columns.
pyproject.toml Declares the mempalace.sources entry-point group.
tests/test_sources.py Adds targeted tests for the new source adapter scaffolding and KG migration.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread mempalace/sources/transforms.py Outdated
Comment on lines +149 to +153
RESERVED_TRANSFORMATIONS: dict[str, Callable[..., str]] = {
"utf8_replace_invalid": utf8_replace_invalid,
"newline_normalize": newline_normalize,
"whitespace_trim": whitespace_trim,
"whitespace_collapse_internal": whitespace_collapse_internal,
Copy link

Copilot AI Apr 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Type annotations for the reserved transformation registry imply every function is callable like (...)->str, but the registry includes both a bytes->str transform (utf8_replace_invalid) and str->str transforms. Consider introducing a dedicated Transformation type/Protocol (or using overloads / Callable[[Any], str]) so static type checkers and adapter authors don’t get misleading signatures.

Copilot uses AI. Check for mistakes.
Comment thread mempalace/sources/context.py Outdated
Comment on lines +129 to +139
"""Deterministic drawer id: ``<sha1(source_file)>_<chunk_index>``.

Matches the shape existing miners rely on (``source_file`` + chunk index
pair) while keeping the id chroma-safe (no separators that collide with
existing metadata values). Adapters that need a different id scheme can
bypass :meth:`PalaceContext.upsert_drawer` and write through
``drawer_collection.upsert`` directly.
"""
import hashlib

digest = hashlib.sha1(record.source_file.encode("utf-8")).hexdigest()[:16]
Copy link

Copilot AI Apr 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drawer IDs are derived from sha1(source_file) truncated to 16 hex chars (64 bits). That increases the (small but real) risk of collisions and overwriting drawers compared to the existing miners’ sha256-based IDs (typically 24 hex chars). Consider switching to sha256 and a longer prefix (or reusing the existing drawer-id scheme) to keep collision risk negligible.

Suggested change
"""Deterministic drawer id: ``<sha1(source_file)>_<chunk_index>``.
Matches the shape existing miners rely on (``source_file`` + chunk index
pair) while keeping the id chroma-safe (no separators that collide with
existing metadata values). Adapters that need a different id scheme can
bypass :meth:`PalaceContext.upsert_drawer` and write through
``drawer_collection.upsert`` directly.
"""
import hashlib
digest = hashlib.sha1(record.source_file.encode("utf-8")).hexdigest()[:16]
"""Deterministic drawer id: ``<sha256(source_file)[:24]>_<chunk_index>``.
Matches the shape existing miners rely on (``source_file`` + chunk index
pair) while keeping the id chroma-safe (no separators that collide with
existing metadata values). Using a longer SHA-256 prefix keeps collision
risk negligible while preserving the existing id layout. Adapters that
need a different id scheme can bypass :meth:`PalaceContext.upsert_drawer`
and write through ``drawer_collection.upsert`` directly.
"""
import hashlib
digest = hashlib.sha256(record.source_file.encode("utf-8")).hexdigest()[:24]

Copilot uses AI. Check for mistakes.
Comment thread tests/test_sources.py Outdated


def test_source_ref_options_default_is_empty_dict():
# Frozen dataclass must not share default_factory=list instance across instances.
Copy link

Copilot AI Apr 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test comment mentions default_factory=list, but SourceRef.options uses default_factory=dict. Updating the comment avoids confusion about what’s being validated (non-shared dict state across instances).

Suggested change
# Frozen dataclass must not share default_factory=list instance across instances.
# Frozen dataclass must not share a default_factory=dict instance across instances.

Copilot uses AI. Check for mistakes.
Comment on lines 95 to 97
""")
self._migrate_schema(conn)
conn.commit()
Copy link

Copilot AI Apr 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On a brand-new DB, _init_db creates triples without the new RFC 002 columns and then immediately runs _migrate_schema to ALTER TABLE them in. Consider adding source_drawer_id and adapter_name directly to the CREATE TABLE IF NOT EXISTS triples definition (keeping _migrate_schema for legacy palaces) so the canonical schema is self-contained and avoids extra ALTERs on fresh installs.

Copilot uses AI. Check for mistakes.
Comment thread mempalace/sources/transforms.py Outdated
Comment on lines +93 to +98
# now, we provide identity shims that raise if invoked without adapter-supplied
# context. Adapters that declare these MUST either override with a concrete
# implementation or provide a namespaced reference under
# ``mempalace.sources.transforms.<adapter_name>_<transform_name>`` (per the
# module docstring). The conformance suite looks up the adapter-specific
# implementation first, falling back to these only when none exists.
Copy link

Copilot AI Apr 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment above these adapter-specific transformations says the identity shims "raise if invoked" without adapter context, but the implementations currently just return the input unchanged. Please align the comment with the actual behavior (identity shims), or make the functions raise so the conformance suite can’t silently accept missing adapter-specific references.

Suggested change
# now, we provide identity shims that raise if invoked without adapter-supplied
# context. Adapters that declare these MUST either override with a concrete
# implementation or provide a namespaced reference under
# ``mempalace.sources.transforms.<adapter_name>_<transform_name>`` (per the
# module docstring). The conformance suite looks up the adapter-specific
# implementation first, falling back to these only when none exists.
# now, we provide identity shims that leave the input unchanged when no
# adapter-specific implementation is available. Adapters that declare these
# MUST either override with a concrete implementation or provide a namespaced
# reference under
# ``mempalace.sources.transforms.<adapter_name>_<transform_name>`` (per the
# module docstring). The conformance suite looks up the adapter-specific
# implementation first, falling back to these identity shims only when none
# exists.

Copilot uses AI. Check for mistakes.
Five findings from the automated review, fixed with targeted tests where
behavior changed:

1. Transformation Protocol (transforms.py). The registry mixed a bytes-to-str
   transform (utf8_replace_invalid) with str-to-str transforms under a single
   Callable[..., str] type, misleading static type checkers and adapter
   authors. Introduced a Transformation Protocol with __call__(data: bytes|str)
   -> str and retyped the registry + get_transformation return.

2. Drawer-id collision risk (context.py). Switched _build_drawer_id from
   sha1[:16]=64 bits to sha256[:24]=96 bits. 64 bits sits uncomfortably
   close to the birthday bound for palace-sized corpora; 96 bits keeps the
   collision probability negligible while preserving the existing
   <prefix>_<chunk> layout adapters rely on.

3. Fresh-schema KG columns (knowledge_graph.py). source_drawer_id and
   adapter_name now live in the canonical CREATE TABLE so new palaces don't
   take an ALTER round-trip on first open. _migrate_schema stays for legacy
   palaces (SQLite has no ADD COLUMN IF NOT EXISTS, so PRAGMA introspection
   is still needed there).

4. Identity-shim comment (transforms.py). Comment said the adapter-specific
   transforms "raise if invoked without adapter context" but they return
   the input unchanged. Updated the comment to match the actual identity-
   shim behavior Copilot suggested.

5. Test docstring (test_sources.py). Comment mentioned default_factory=list
   but SourceRef.options uses default_factory=dict. Corrected.

Tests: 1020 passed (up from 1018), +2 new tests for the sha256 id shape
and the fresh-schema column presence on new palaces.
@igorls
Copy link
Copy Markdown
Member Author

igorls commented Apr 18, 2026

Thanks @copilot — all five review items addressed in 89904ed. Recap:

  1. Mixed signature types in RESERVED_TRANSFORMATIONS (transforms.py). Introduced a Transformation Protocol with __call__(data: bytes | str, /) -> str and retyped both the registry (dict[str, Transformation]) and get_transformation return. Static type checkers now accept the utf8_replace_invalid bytes→str shape alongside the str→str transforms cleanly.

  2. Drawer-id collision risk (context.py). Switched _build_drawer_id from sha1[:16] (64 bits) to sha256[:24] (96 bits) per your suggestion. 64 bits sits too close to the birthday bound for palace-sized corpora; 96 keeps collision probability negligible. Added a regression test that asserts both the sha256 shape and that the old sha1 scheme is no longer produced.

  3. Fresh-schema columns moved into CREATE TABLE (knowledge_graph.py). source_drawer_id and adapter_name now live in the canonical CREATE TABLE IF NOT EXISTS triples so new palaces get them directly — no ALTER round-trip on first open. _migrate_schema stays for legacy palaces (SQLite still has no ADD COLUMN IF NOT EXISTS). Added a test confirming fresh palaces have the columns without relying on the migration path.

  4. Identity-shim comment (transforms.py). Comment said the six adapter-specific transforms "raise if invoked without adapter context" but they return the input unchanged. Rewrote the comment per your suggested wording so docs and behavior agree.

  5. Test docstring typo (test_sources.py). default_factory=listdefault_factory=dict (matches SourceRef.options).

Tests: 1020 passed (up from 1018), +2 targeted tests. Lint + format clean.

Copy link
Copy Markdown
Contributor

Copilot AI commented Apr 18, 2026

Just as a heads up, I was blocked by some firewall rules while working on your feedback. Expand below for details.

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • chroma-onnx-models.s3.amazonaws.com
    • Triggering command: /usr/bin/python3 python3 -m pytest tests/ --ignore=tests/benchmarks -q (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

@igorls igorls merged commit 66090b2 into develop Apr 18, 2026
8 checks passed
jphein added a commit to jphein/mempalace that referenced this pull request Apr 19, 2026
…uard

Merges MemPalace#990 (RFC 002 spec), MemPalace#1014 (BaseSourceAdapter/PalaceContext scaffolding),
MemPalace#1013 (Layer3.search_raw None guard), MemPalace#1012 (docs), MemPalace#1010 (chromadb >=1.5.4),
and MemPalace#998 (sweeper/tandem transcript safety net).

Fork changes preserved:
- quarantine_stale_hnsw() in chroma.py (guards HNSW/sqlite drift segfault)
- get-then-create instead of get_or_create (guards ChromaDB 1.5.x metadata segfault)
- paginated status() loop (guards SQLite variable limit on large palaces)
- searcher hits-loop, BM25 fallback, _count_in_scope
- .jsonl exempt from JUNK_FILE_SIZE cap (Claude Code transcripts can be large)
- _validate_where() + operator constants taken from upstream

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
igorls added a commit that referenced this pull request Apr 19, 2026
Version bumps across pyproject.toml, mempalace/version.py, README badge,
uv.lock, and plugin manifests (.claude-plugin/*, .codex-plugin/*).

CHANGELOG aligned with main (post-3.3.1) and a new [3.3.2] section added
covering the 11 PRs merged on develop since v3.3.1 — silent-transcript-drop
fix + tandem sweeper (#998), None-metadata guards (#999, #1013),
chromadb ≥1.5.4 for Py 3.13/3.14 (#1010), Windows Unicode (#681),
HNSW quarantine recovery (#1000), PID stacking guard (#1023), doc-path
cleanup (#996, #1012), and RFC 001/002 internal scaffolding (#995, #1014, #990).
@igorls igorls mentioned this pull request Apr 19, 2026
8 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants