Conversation
…ry, PalaceContext Lands the read-side contract so third-party adapter authors (@Perseusxrltd, @JakobSachs, @adv3nt3, @zendesk-thittesdorf, @mfhens, @roip, @MrDys) have a stable target matching what RFC 001 §10 landed on the write side in #995. Scope (this PR): - mempalace/sources/base.py: BaseSourceAdapter ABC with kwargs-only ingest() / describe_schema() and default is_current() / source_summary() / close() (§1.1–1.2). Typed records: SourceRef, SourceItemMetadata, DrawerRecord, RouteHint, SourceSummary, AdapterSchema, FieldSpec (§1.3, §5.2). Error classes: SourceNotFoundError, AuthRequiredError, AdapterClosedError, TransformationViolationError, SchemaConformanceError (§2.7). Class-level identity contract: name / adapter_version / capabilities / supported_modes / declared_transformations / default_privacy_class (§2.1, §1.4, §1.5, §6). - mempalace/sources/transforms.py: reference implementations of the 13 reserved transformations (§1.4) — utf8_replace_invalid, newline_normalize, whitespace_trim, whitespace_collapse_internal, line_trim, line_join_spaces, blank_line_drop — as pure functions, plus identity shims for the six adapter-specific ones (strip_tool_chrome, tool_result_truncate, tool_result_omitted, spellcheck_user, synthesized_marker, speaker_role_assignment) that the conversations adapter will override when migrated. get_transformation(name) resolves by reserved name. - mempalace/sources/registry.py: entry-point discovery via importlib.metadata.entry_points(group="mempalace.sources") + explicit register()/unregister() surface (§3.1–3.2). resolve_adapter_for_source() implements the §3.3 priority order; crucially, no auto-detection on the read side (§3.3 is explicit about that — user intent never inferred from on-disk artifacts). - mempalace/sources/context.py: PalaceContext facade (§9) bundling the drawer/closet collections, knowledge graph, palace path, adapter identity, and progress hooks core passes into adapter.ingest(). upsert_drawer() applies the spec-mandated adapter_name/adapter_version stamps from §5.1. skip_current_item() signals laziness; emit() dispatches to hooks and swallows hook exceptions. - mempalace/knowledge_graph.py: add_triple() gains optional source_drawer_id and adapter_name kwargs (§5.5). Backwards-compatible column migration auto-adds the new columns on open of a pre-RFC 002 palace (PRAGMA table_info then ALTER TABLE ADD COLUMN), matching the pattern used for any new palace-side provenance fields. - pyproject.toml: mempalace.sources entry-point group declared. Empty on the first-party side for now — miners migrate in a follow-up; the group being present means third-party packages can begin registering today. Out of scope (explicit follow-ups): - miner.py → mempalace/sources/filesystem.py. Behavior-preserving rename that also moves READABLE_EXTENSIONS, detect_room(), detect_hall() into the adapter (§9). Larger refactor; lands separately. - convo_miner.py + normalize.py → mempalace/sources/conversations.py. The format-detection if-chain in normalize.py becomes per-format plugins; declared_transformations enumerates what the current pipeline already does to source bytes (§1.4 existing-code mapping). - Closet post-step wired into the conversations adapter (§1.7). - CLI --source flag + --mode deprecation alias (§3.3). - MCP mempalace_mine tool source parameter. - AbstractSourceAdapterContractSuite (§7.1–7.3): byte-preservation round- trip and declared-transformation round-trip tests. - Privacy-class floor enforcement (§6.2); depends on #389 for secrets_possible scanning. Tests: 1018 passed (up from ~990 on develop), +27 targeted tests covering the ABC instantiation rules, typed records, all reserved transformations, the registry register/get/unregister surface, PalaceContext upsert + skip + emit semantics, and both the new KG provenance kwargs and backwards- compatible legacy-schema migration. Refs: #989 (RFC 002 tracking), #990 (RFC 002 spec), #995 (RFC 001 §10 cleanup — sibling PR on the write side).
There was a problem hiding this comment.
Pull request overview
Introduces the RFC 002 “read-side” source adapter scaffolding so third-party adapters can plug into MemPalace via a stable mempalace.sources contract (mirroring the RFC 001 backend seam).
Changes:
- Added
mempalace.sourcespublic API:BaseSourceAdapterABC, typed ingest records, adapter registry, andPalaceContextfacade. - Implemented reserved transformation reference functions + a resolver in
mempalace/sources/transforms.py. - Extended
KnowledgeGraph.add_triple()with optional provenance fields and added an on-open SQLite schema migration for legacy palaces.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
mempalace/sources/base.py |
Defines the adapter contract (ABC), typed records, and source-adapter error types. |
mempalace/sources/context.py |
Adds the PalaceContext facade and drawer upsert helper used by adapters. |
mempalace/sources/registry.py |
Adds registry + entry-point discovery for mempalace.sources adapters. |
mempalace/sources/transforms.py |
Adds reference implementations for reserved transformations + lookup helper. |
mempalace/sources/__init__.py |
Exposes the new sources subsystem public surface. |
mempalace/knowledge_graph.py |
Adds provenance kwargs to add_triple() and schema auto-migration for new columns. |
pyproject.toml |
Declares the mempalace.sources entry-point group. |
tests/test_sources.py |
Adds targeted tests for the new source adapter scaffolding and KG migration. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| RESERVED_TRANSFORMATIONS: dict[str, Callable[..., str]] = { | ||
| "utf8_replace_invalid": utf8_replace_invalid, | ||
| "newline_normalize": newline_normalize, | ||
| "whitespace_trim": whitespace_trim, | ||
| "whitespace_collapse_internal": whitespace_collapse_internal, |
There was a problem hiding this comment.
Type annotations for the reserved transformation registry imply every function is callable like (...)->str, but the registry includes both a bytes->str transform (utf8_replace_invalid) and str->str transforms. Consider introducing a dedicated Transformation type/Protocol (or using overloads / Callable[[Any], str]) so static type checkers and adapter authors don’t get misleading signatures.
| """Deterministic drawer id: ``<sha1(source_file)>_<chunk_index>``. | ||
|
|
||
| Matches the shape existing miners rely on (``source_file`` + chunk index | ||
| pair) while keeping the id chroma-safe (no separators that collide with | ||
| existing metadata values). Adapters that need a different id scheme can | ||
| bypass :meth:`PalaceContext.upsert_drawer` and write through | ||
| ``drawer_collection.upsert`` directly. | ||
| """ | ||
| import hashlib | ||
|
|
||
| digest = hashlib.sha1(record.source_file.encode("utf-8")).hexdigest()[:16] |
There was a problem hiding this comment.
Drawer IDs are derived from sha1(source_file) truncated to 16 hex chars (64 bits). That increases the (small but real) risk of collisions and overwriting drawers compared to the existing miners’ sha256-based IDs (typically 24 hex chars). Consider switching to sha256 and a longer prefix (or reusing the existing drawer-id scheme) to keep collision risk negligible.
| """Deterministic drawer id: ``<sha1(source_file)>_<chunk_index>``. | |
| Matches the shape existing miners rely on (``source_file`` + chunk index | |
| pair) while keeping the id chroma-safe (no separators that collide with | |
| existing metadata values). Adapters that need a different id scheme can | |
| bypass :meth:`PalaceContext.upsert_drawer` and write through | |
| ``drawer_collection.upsert`` directly. | |
| """ | |
| import hashlib | |
| digest = hashlib.sha1(record.source_file.encode("utf-8")).hexdigest()[:16] | |
| """Deterministic drawer id: ``<sha256(source_file)[:24]>_<chunk_index>``. | |
| Matches the shape existing miners rely on (``source_file`` + chunk index | |
| pair) while keeping the id chroma-safe (no separators that collide with | |
| existing metadata values). Using a longer SHA-256 prefix keeps collision | |
| risk negligible while preserving the existing id layout. Adapters that | |
| need a different id scheme can bypass :meth:`PalaceContext.upsert_drawer` | |
| and write through ``drawer_collection.upsert`` directly. | |
| """ | |
| import hashlib | |
| digest = hashlib.sha256(record.source_file.encode("utf-8")).hexdigest()[:24] |
|
|
||
|
|
||
| def test_source_ref_options_default_is_empty_dict(): | ||
| # Frozen dataclass must not share default_factory=list instance across instances. |
There was a problem hiding this comment.
This test comment mentions default_factory=list, but SourceRef.options uses default_factory=dict. Updating the comment avoids confusion about what’s being validated (non-shared dict state across instances).
| # Frozen dataclass must not share default_factory=list instance across instances. | |
| # Frozen dataclass must not share a default_factory=dict instance across instances. |
| """) | ||
| self._migrate_schema(conn) | ||
| conn.commit() |
There was a problem hiding this comment.
On a brand-new DB, _init_db creates triples without the new RFC 002 columns and then immediately runs _migrate_schema to ALTER TABLE them in. Consider adding source_drawer_id and adapter_name directly to the CREATE TABLE IF NOT EXISTS triples definition (keeping _migrate_schema for legacy palaces) so the canonical schema is self-contained and avoids extra ALTERs on fresh installs.
| # now, we provide identity shims that raise if invoked without adapter-supplied | ||
| # context. Adapters that declare these MUST either override with a concrete | ||
| # implementation or provide a namespaced reference under | ||
| # ``mempalace.sources.transforms.<adapter_name>_<transform_name>`` (per the | ||
| # module docstring). The conformance suite looks up the adapter-specific | ||
| # implementation first, falling back to these only when none exists. |
There was a problem hiding this comment.
The comment above these adapter-specific transformations says the identity shims "raise if invoked" without adapter context, but the implementations currently just return the input unchanged. Please align the comment with the actual behavior (identity shims), or make the functions raise so the conformance suite can’t silently accept missing adapter-specific references.
| # now, we provide identity shims that raise if invoked without adapter-supplied | |
| # context. Adapters that declare these MUST either override with a concrete | |
| # implementation or provide a namespaced reference under | |
| # ``mempalace.sources.transforms.<adapter_name>_<transform_name>`` (per the | |
| # module docstring). The conformance suite looks up the adapter-specific | |
| # implementation first, falling back to these only when none exists. | |
| # now, we provide identity shims that leave the input unchanged when no | |
| # adapter-specific implementation is available. Adapters that declare these | |
| # MUST either override with a concrete implementation or provide a namespaced | |
| # reference under | |
| # ``mempalace.sources.transforms.<adapter_name>_<transform_name>`` (per the | |
| # module docstring). The conformance suite looks up the adapter-specific | |
| # implementation first, falling back to these identity shims only when none | |
| # exists. |
Five findings from the automated review, fixed with targeted tests where behavior changed: 1. Transformation Protocol (transforms.py). The registry mixed a bytes-to-str transform (utf8_replace_invalid) with str-to-str transforms under a single Callable[..., str] type, misleading static type checkers and adapter authors. Introduced a Transformation Protocol with __call__(data: bytes|str) -> str and retyped the registry + get_transformation return. 2. Drawer-id collision risk (context.py). Switched _build_drawer_id from sha1[:16]=64 bits to sha256[:24]=96 bits. 64 bits sits uncomfortably close to the birthday bound for palace-sized corpora; 96 bits keeps the collision probability negligible while preserving the existing <prefix>_<chunk> layout adapters rely on. 3. Fresh-schema KG columns (knowledge_graph.py). source_drawer_id and adapter_name now live in the canonical CREATE TABLE so new palaces don't take an ALTER round-trip on first open. _migrate_schema stays for legacy palaces (SQLite has no ADD COLUMN IF NOT EXISTS, so PRAGMA introspection is still needed there). 4. Identity-shim comment (transforms.py). Comment said the adapter-specific transforms "raise if invoked without adapter context" but they return the input unchanged. Updated the comment to match the actual identity- shim behavior Copilot suggested. 5. Test docstring (test_sources.py). Comment mentioned default_factory=list but SourceRef.options uses default_factory=dict. Corrected. Tests: 1020 passed (up from 1018), +2 new tests for the sha256 id shape and the fresh-schema column presence on new palaces.
|
Thanks @copilot — all five review items addressed in 89904ed. Recap:
Tests: 1020 passed (up from 1018), +2 targeted tests. Lint + format clean. |
|
Just as a heads up, I was blocked by some firewall rules while working on your feedback. Expand below for details. Warning Firewall rules blocked me from connecting to one or more addresses (expand for details)I tried to connect to the following addresses, but was blocked by firewall rules:
If you need me to access, download, or install something from one of these locations, you can either:
|
…uard Merges MemPalace#990 (RFC 002 spec), MemPalace#1014 (BaseSourceAdapter/PalaceContext scaffolding), MemPalace#1013 (Layer3.search_raw None guard), MemPalace#1012 (docs), MemPalace#1010 (chromadb >=1.5.4), and MemPalace#998 (sweeper/tandem transcript safety net). Fork changes preserved: - quarantine_stale_hnsw() in chroma.py (guards HNSW/sqlite drift segfault) - get-then-create instead of get_or_create (guards ChromaDB 1.5.x metadata segfault) - paginated status() loop (guards SQLite variable limit on large palaces) - searcher hits-loop, BM25 fallback, _count_in_scope - .jsonl exempt from JUNK_FILE_SIZE cap (Claude Code transcripts can be large) - _validate_where() + operator constants taken from upstream Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Version bumps across pyproject.toml, mempalace/version.py, README badge, uv.lock, and plugin manifests (.claude-plugin/*, .codex-plugin/*). CHANGELOG aligned with main (post-3.3.1) and a new [3.3.2] section added covering the 11 PRs merged on develop since v3.3.1 — silent-transcript-drop fix + tandem sweeper (#998), None-metadata guards (#999, #1013), chromadb ≥1.5.4 for Py 3.13/3.14 (#1010), Windows Unicode (#681), HNSW quarantine recovery (#1000), PID stacking guard (#1023), doc-path cleanup (#996, #1012), and RFC 001/002 internal scaffolding (#995, #1014, #990).
Summary
Lands the read-side plugin contract so third-party source adapters can publish
pip install mempalace-source-<name>packages against a stable target, matching what #995 did for storage backends on the write side.Sibling to RFC 001 / #743 (write) and RFC 002 / #990 (spec). Tracking issue: #989.
What's in this PR
mempalace/sources/base.py—BaseSourceAdapterABC with kwargs-onlyingest()/describe_schema()and default implementations ofis_current()/source_summary()/close()(§1.1–1.2). Typed records:SourceRef,SourceItemMetadata,DrawerRecord,RouteHint,SourceSummary,AdapterSchema,FieldSpec(§1.3, §5.2). Error classes (§2.7). Class-level identity contract:name,adapter_version,capabilities,supported_modes,declared_transformations,default_privacy_class(§2.1, §1.4, §1.5, §6).mempalace/sources/transforms.py— reference implementations of the 13 reserved transformations (§1.4):utf8_replace_invalid,newline_normalize,whitespace_trim,whitespace_collapse_internal,line_trim,line_join_spaces,blank_line_dropas pure functions. The six adapter-specific ones (strip_tool_chrome,tool_result_truncate,tool_result_omitted,spellcheck_user,synthesized_marker,speaker_role_assignment) ship as identity shims the conversations adapter will override when migrated.get_transformation(name)resolves reserved names.mempalace/sources/registry.py— entry-point discovery viaimportlib.metadata.entry_points(group="mempalace.sources")+register()/unregister()(§3.1–3.2).resolve_adapter_for_source()implements the §3.3 priority order. Crucially: no auto-detection on the read side (§3.3 is explicit about that — user intent never inferred from on-disk artifacts).mempalace/sources/context.py—PalaceContextfacade (§9) bundling drawer/closet collections, knowledge graph, palace path, adapter identity, and progress hooks.upsert_drawer()applies the spec-mandatedadapter_name/adapter_versionstamps from §5.1 so adapters don't need to populate them.skip_current_item()signals laziness.emit()dispatches to hooks and swallows hook exceptions.mempalace/knowledge_graph.py—add_triple()gains optionalsource_drawer_idandadapter_namekwargs (§5.5). Backwards-compatible schema migration auto-adds the new columns on open of a pre-RFC 002 palace (PRAGMAtable_info→ALTER TABLE ADD COLUMN), so existing palaces upgrade transparently.pyproject.toml—mempalace.sourcesentry-point group declared (empty on the first-party side for now). Third-party packages can begin registering today; the group being declared is the enabling bit.Explicitly out of scope (follow-up PRs)
miner.py→mempalace/sources/filesystem.py. Behavior-preserving rename +READABLE_EXTENSIONS,detect_room(),detect_hall()moving into the adapter.convo_miner.py+normalize.py→mempalace/sources/conversations.py. Format-detectionif-chain becomes per-format plugins;declared_transformationsenumerates what the current pipeline already does to source bytes (§1.4 existing-code mapping).--sourceflag +--modedeprecation alias (§3.3).mempalace_minetoolsourceparameter.AbstractSourceAdapterContractSuite(§7.1–7.3): byte-preservation + declared-transformation round-trip tests.secrets_possiblescanning.Test plan
uv run python -m pytest tests/ --ignore=tests/benchmarks— 1018 passed (+27 targeted tests for this PR).uv run ruff check .— clean.uv run ruff format --check .— clean.tests/test_sources.py:TypeError)field(default_factory=dict)doesn't share state)get_transformationresolves reserved names, rejects unknownregister/get_adapter/get_adapter_class/unregister, caching semantics, unknown-nameKeyErrorresolve_adapter_for_sourcepriority order; default =filesystemPalaceContext.upsert_drawerstampsadapter_name/adapter_version/source_file/chunk_indexPalaceContext.skip_current_itemsets flag;emitdispatches and swallows hook errorsKnowledgeGraph.add_tripleaccepts new kwargs; writes to new columnsadd_triplecallers unchangedCoordination
cc @Perseusxrltd @JakobSachs @adv3nt3 @zendesk-thittesdorf @mfhens @roip @MrDys — this is the §9 spec surface called out in #989. If you're working on Cursor/OpenCode/Pi/git/factory source adapters, this is the ABC to target. Once this merges, the ~next PR migrates
miner.py/convo_miner.pyonto the same contract (so we have two first-party reference adapters) and then the in-flight source-ingester PRs can align.Refs: #989 (RFC 002 tracking), #990 (RFC 002 spec), #995 (RFC 001 §10 write-side sibling).