docs: RFC 002 — Source adapter plugin specification #990
Draft plugin specification for source adapters, mirroring RFC 001's role for storage backends. It formalizes the contract that six community ingester PRs (#274, #23, #169, #232, #567/#98, #702) plus #981's metadata-only mode have been reinventing ad hoc, so adapter authors can build to a stable surface.

Key decisions:
- Single ingest() method; lazy adapters yield SourceItemMetadata ahead of drawers, eager adapters interleave.
- Declared-transformation model (§1.4) replaces the informal verbatim promise with a verifiable one; byte_preserving adapters declare the empty set, declared_lossy adapters enumerate. The existing miner.py and the convo_miner + normalize pipeline map cleanly.
- The palace is the incremental cursor via is_current(item, metadata); no sidecar persistence.
- Routing is adapter-owned; detect_room/detect_hall move into the filesystem adapter.
- Flat metadata per ChromaDB (RFC 001 §1.4): entity hints as a json_string field, KG triples route to the SQLite knowledge graph.
- Closets stay core-built as a post-step; adapters may emit flat closet_hints. Closes an existing gap where convo drawers get no closets.
- No per-drawer field renames: source_file, filed_at, source_mtime, added_by, normalize_version, entities, ingest_mode are all preserved. The spec adds adapter_name, adapter_version, privacy_class.

§9 enumerates the cleanup PR prerequisites (mempalace/sources/ module, PalaceContext facade, KnowledgeGraph.add_triple gaining backwards-compatible source_drawer_id + adapter_name params).

Tracking issue: #989
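To make the single-ingest() and palace-as-cursor decisions concrete, here is a minimal sketch of what a lazy adapter could look like. It is not the spec's actual code: the record fields, the `existing` mapping passed to ingest(), the version-equality rule inside is_current(), and the InMemoryAdapter class are all assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import ClassVar, Iterator, Mapping, Optional, Union


@dataclass(frozen=True)
class SourceItemMetadata:
    # Illustrative fields only; the spec defines the real record.
    source_file: str
    version: str


@dataclass(frozen=True)
class DrawerRecord:
    # Illustrative fields only.
    source_file: str
    content: str


class BaseSourceAdapter(ABC):
    name: ClassVar[str]
    adapter_version: ClassVar[str]
    # byte_preserving adapters keep the empty set; declared_lossy ones enumerate.
    declared_transformations: ClassVar[frozenset[str]] = frozenset()

    @abstractmethod
    def ingest(self, *, source, existing) -> Iterator[Union[SourceItemMetadata, DrawerRecord]]:
        """Yield metadata and drawers for every item in the source."""

    def is_current(self, *, item: SourceItemMetadata,
                   existing_metadata: Optional[Mapping]) -> bool:
        # Assumed rule: an item is current when the palace already holds
        # metadata with the same version string.
        return (existing_metadata is not None
                and existing_metadata.get("version") == item.version)


class InMemoryAdapter(BaseSourceAdapter):
    """Hypothetical lazy adapter over a dict of path -> (version, text)."""
    name = "in_memory"
    adapter_version = "0.1"

    def ingest(self, *, source, existing):
        for path, (version, text) in source.items():
            meta = SourceItemMetadata(source_file=path, version=version)
            yield meta  # metadata goes out ahead of any drawer
            if self.is_current(item=meta, existing_metadata=existing.get(path)):
                continue  # palace is the cursor: nothing to re-extract
            yield DrawerRecord(source_file=path, content=text)


source = {"a.txt": ("v1", "alpha"), "b.txt": ("v2", "beta")}
existing = {"a.txt": {"version": "v1"}}  # stands in for palace metadata
records = list(InMemoryAdapter().ingest(source=source, existing=existing))
drawers = [r for r in records if isinstance(r, DrawerRecord)]
```

In this sketch only b.txt produces a drawer, because the stand-in palace metadata already records a.txt at the same version; no sidecar cursor is consulted.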
Thanks for the thorough spec, Ben. Two things to surface for the spec:

level: ignore alongside metadata_only: in PR #986 I implemented three tiers: ignore (skip entirely), describe (one drawer), full (normal). The ignore tier is useful for lock files, node_modules artifacts, etc. In adapter terms, the adapter yields zero items for that path. It is worth making this explicit in the spec so adapters have a clean "I've seen this file, deliberately producing nothing" signal, as opposed to "I failed to process it." The overlap with .gitignore is incomplete, which is why a separate tier is needed.

Config-driven routing: our paths: section in mempalace.yaml uses fnmatch glob patterns with per-pattern level + description + optional room override. This fits §2.5's "config match" tier. A proposed schema is available in PR #986.
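A rough sketch of how the three tiers and the fnmatch-based paths: rules might resolve is below. The rule dictionaries, the first-match-wins order, and the helper names are assumptions for illustration; the actual schema is the one proposed in PR #986.

```python
from fnmatch import fnmatch

# Hypothetical in-memory form of a paths: section (first match wins).
PATH_RULES = [
    {"pattern": "*/node_modules/*", "level": "ignore"},
    {"pattern": "*.lock", "level": "ignore"},
    {"pattern": "docs/*.md", "level": "describe",
     "description": "Project documentation", "room": "docs"},
]


def resolve_rule(path, rules=PATH_RULES, default_level="full"):
    """Return the first rule whose glob matches, else a default 'full' rule."""
    for rule in rules:
        if fnmatch(path, rule["pattern"]):
            return rule
    return {"pattern": None, "level": default_level}


def drawers_for(path, text):
    """Map a level onto the number of drawers an adapter would yield."""
    rule = resolve_rule(path)
    if rule["level"] == "ignore":
        return []  # seen, deliberately producing nothing (not a failure)
    if rule["level"] == "describe":
        return [rule.get("description", path)]  # exactly one drawer
    return [text]  # "full": normal ingestion (simplified to one drawer here)
```

The empty list for ignore is the point of the comment above: it is a deliberate result that the caller can tell apart from an exception raised while processing the file.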
…ry, PalaceContext

Lands the read-side contract so third-party adapter authors (@Perseusxrltd, @JakobSachs, @adv3nt3, @zendesk-thittesdorf, @mfhens, @roip, @MrDys) have a stable target matching what RFC 001 §10 landed on the write side in #995.

Scope (this PR):
- mempalace/sources/base.py: BaseSourceAdapter ABC with kwargs-only ingest() / describe_schema() and default is_current() / source_summary() / close() (§1.1–1.2). Typed records: SourceRef, SourceItemMetadata, DrawerRecord, RouteHint, SourceSummary, AdapterSchema, FieldSpec (§1.3, §5.2). Error classes: SourceNotFoundError, AuthRequiredError, AdapterClosedError, TransformationViolationError, SchemaConformanceError (§2.7). Class-level identity contract: name / adapter_version / capabilities / supported_modes / declared_transformations / default_privacy_class (§2.1, §1.4, §1.5, §6).
- mempalace/sources/transforms.py: reference implementations of the 13 reserved transformations (§1.4). Seven are pure functions (utf8_replace_invalid, newline_normalize, whitespace_trim, whitespace_collapse_internal, line_trim, line_join_spaces, blank_line_drop); the six adapter-specific ones (strip_tool_chrome, tool_result_truncate, tool_result_omitted, spellcheck_user, synthesized_marker, speaker_role_assignment) are identity shims that the conversations adapter will override when migrated. get_transformation(name) resolves by reserved name.
- mempalace/sources/registry.py: entry-point discovery via importlib.metadata.entry_points(group="mempalace.sources") + explicit register()/unregister() surface (§3.1–3.2). resolve_adapter_for_source() implements the §3.3 priority order; crucially, there is no auto-detection on the read side (§3.3 is explicit about that: user intent is never inferred from on-disk artifacts).
- mempalace/sources/context.py: PalaceContext facade (§9) bundling the drawer/closet collections, knowledge graph, palace path, adapter identity, and progress hooks that core passes into adapter.ingest(). upsert_drawer() applies the spec-mandated adapter_name/adapter_version stamps from §5.1. skip_current_item() signals laziness; emit() dispatches to hooks and swallows hook exceptions.
- mempalace/knowledge_graph.py: add_triple() gains optional source_drawer_id and adapter_name kwargs (§5.5). A backwards-compatible column migration auto-adds the new columns on open of a pre-RFC 002 palace (PRAGMA table_info then ALTER TABLE ADD COLUMN), matching the pattern used for any new palace-side provenance fields.
- pyproject.toml: mempalace.sources entry-point group declared. Empty on the first-party side for now (miners migrate in a follow-up); the group being present means third-party packages can begin registering today.

Out of scope (explicit follow-ups):
- miner.py → mempalace/sources/filesystem.py. Behavior-preserving rename that also moves READABLE_EXTENSIONS, detect_room(), detect_hall() into the adapter (§9). Larger refactor; lands separately.
- convo_miner.py + normalize.py → mempalace/sources/conversations.py. The format-detection if-chain in normalize.py becomes per-format plugins; declared_transformations enumerates what the current pipeline already does to source bytes (§1.4 existing-code mapping).
- Closet post-step wired into the conversations adapter (§1.7).
- CLI --source flag + --mode deprecation alias (§3.3).
- MCP mempalace_mine tool source parameter.
- AbstractSourceAdapterContractSuite (§7.1–7.3): byte-preservation round-trip and declared-transformation round-trip tests.
- Privacy-class floor enforcement (§6.2); depends on #389 for secrets_possible scanning.

Tests: 1018 passed (up from ~990 on develop), +27 targeted tests covering the ABC instantiation rules, typed records, all reserved transformations, the registry register/get/unregister surface, PalaceContext upsert + skip + emit semantics, and both the new KG provenance kwargs and the backwards-compatible legacy-schema migration.
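The column migration described for knowledge_graph.py (PRAGMA table_info, then ALTER TABLE ADD COLUMN) can be sketched as follows. This is an illustration under assumptions, not the PR's code: the table name triples, its legacy columns, the TEXT column type, and both function signatures are hypothetical.

```python
import sqlite3

# Assumed column types; the PR names the columns but not their SQL types.
NEW_COLUMNS = {"source_drawer_id": "TEXT", "adapter_name": "TEXT"}


def ensure_provenance_columns(conn, table="triples"):
    """Add any missing provenance columns; safe to call on every open."""
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    added = []
    for column, sql_type in NEW_COLUMNS.items():
        if column not in existing:
            conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {sql_type}")
            added.append(column)
    conn.commit()
    return added


def add_triple(conn, subject, predicate, obj, *,
               source_drawer_id=None, adapter_name=None):
    """Legacy callers omit the keyword arguments and get NULL provenance."""
    conn.execute(
        "INSERT INTO triples "
        "(subject, predicate, object, source_drawer_id, adapter_name) "
        "VALUES (?, ?, ?, ?, ?)",
        (subject, predicate, obj, source_drawer_id, adapter_name),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
# Stand-in for a pre-RFC 002 palace schema (hypothetical columns).
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
first_run = ensure_provenance_columns(conn)
second_run = ensure_provenance_columns(conn)  # idempotent: nothing left to add
add_triple(conn, "a", "knows", "b")
add_triple(conn, "a", "cites", "c", source_drawer_id="d1", adapter_name="filesystem")
```

Because the new parameters are keyword-only with None defaults, existing positional call sites keep working unchanged, which is what makes the change backwards-compatible.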
Refs: #989 (RFC 002 tracking), #990 (RFC 002 spec), #995 (RFC 001 §10 cleanup — sibling PR on the write side).
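As an illustration of the reserved transformations in the commit above being pure functions resolved by name, a sketch might look like this. The function bodies are guesses at plausible behavior based on the reserved names alone, and apply_declared() is a hypothetical helper; only the names newline_normalize, whitespace_trim, blank_line_drop, strip_tool_chrome, and get_transformation come from the commit message.

```python
def newline_normalize(text: str) -> str:
    # Assumed behavior: convert CRLF and bare CR to LF.
    return text.replace("\r\n", "\n").replace("\r", "\n")


def whitespace_trim(text: str) -> str:
    # Assumed behavior: strip leading and trailing whitespace.
    return text.strip()


def blank_line_drop(text: str) -> str:
    # Assumed behavior: remove lines that are empty or whitespace-only.
    return "\n".join(line for line in text.split("\n") if line.strip())


def strip_tool_chrome(text: str) -> str:
    # Identity shim: adapter-specific, to be overridden by the adapter.
    return text


_RESERVED = {
    "newline_normalize": newline_normalize,
    "whitespace_trim": whitespace_trim,
    "blank_line_drop": blank_line_drop,
    "strip_tool_chrome": strip_tool_chrome,
}


def get_transformation(name: str):
    """Resolve a reserved transformation by name."""
    try:
        return _RESERVED[name]
    except KeyError:
        raise ValueError(f"unknown transformation: {name}") from None


def apply_declared(text: str, names) -> str:
    """Hypothetical helper: apply transformations in the order given."""
    for name in names:
        text = get_transformation(name)(text)
    return text


cleaned = apply_declared("a\r\n\r\nb\r", ["newline_normalize", "blank_line_drop"])
```

Keeping each transformation pure (text in, text out) is what would let a contract suite replay an adapter's declared set against the source bytes and compare the result with what the adapter emitted.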
…uard

Merges MemPalace#990 (RFC 002 spec), MemPalace#1014 (BaseSourceAdapter/PalaceContext scaffolding), MemPalace#1013 (Layer3.search_raw None guard), MemPalace#1012 (docs), MemPalace#1010 (chromadb >=1.5.4), and MemPalace#998 (sweeper/tandem transcript safety net).

Fork changes preserved:
- quarantine_stale_hnsw() in chroma.py (guards HNSW/sqlite drift segfault)
- get-then-create instead of get_or_create (guards ChromaDB 1.5.x metadata segfault)
- paginated status() loop (guards SQLite variable limit on large palaces)
- searcher hits-loop, BM25 fallback, _count_in_scope
- .jsonl exempt from JUNK_FILE_SIZE cap (Claude Code transcripts can be large)
- _validate_where() + operator constants taken from upstream

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Version bumps across pyproject.toml, mempalace/version.py, README badge, uv.lock, and plugin manifests (.claude-plugin/*, .codex-plugin/*). CHANGELOG aligned with main (post-3.3.1) and a new [3.3.2] section added covering the 11 PRs merged on develop since v3.3.1 — silent-transcript-drop fix + tandem sweeper (#998), None-metadata guards (#999, #1013), chromadb ≥1.5.4 for Py 3.13/3.14 (#1010), Windows Unicode (#681), HNSW quarantine recovery (#1000), PID stacking guard (#1023), doc-path cleanup (#996, #1012), and RFC 001/002 internal scaffolding (#995, #1014, #990).
… Cursor, Aider, Gemini CLI, Codex CLI, Warp)

Adds a new "What's next" bullet for first-class support across the AI coding agent ecosystem. Today's integration is Claude Code-specific (Stop / PreCompact hooks, ~/.claude/projects/*.jsonl mining); the roadmap target is the broader set of coding agents. The path is upstream's RFC 002 source-adapter spec (MemPalace#990): each agent ships a pip install mempalace-source-<agent> package mapping its session format onto the canonical drawer shape, with parity on session_id / agent / wing.

Frames the integration matrix as three cells:
* read: universal (MCP server is already agent-agnostic)
* mine: per-agent via RFC 002 adapters
* hook/event: wherever the host exposes a hook surface, falling back to mining-on-cron

The fork unblocks the pattern; adapter PRs land per-agent. Companion to the agent-shaped-CLI item already in What's next; together they cover both surfaces of the agent integration story.

https://claude.ai/code/session_01GvwducFnFtN8KYmfbWKMR6
…Code, MemPalace#274/MemPalace#232 Cursor, MemPalace#169 Pi, MemPalace#702 Cursor+factory.ai)

Updates the multi-agent-support bullet to cite the actual upstream work instead of just gesturing at it. RFC 002 itself is PR MemPalace#990 (tracking issue MemPalace#989). Existing third-party prototypes already proposed against the spec:
* OpenCode SQLite: PR MemPalace#23
* Cursor SQLite: issue MemPalace#274
* Cursor JSONL (earlier variant): PR MemPalace#232
* Pi agent JSONL: PR MemPalace#169
* Combined Cursor + factory.ai: PR MemPalace#702

Each becomes a mempalace-source-<agent> package once RFC 002 lands. Names the path explicitly: the fork unblocks the pattern by helping land RFC 002; per-agent adapter PRs land from their respective authors. Aider, Gemini CLI, Codex CLI, and Warp are roadmap targets without existing adapter PRs and are listed as such (no fabricated PR refs).

https://claude.ai/code/session_01GvwducFnFtN8KYmfbWKMR6
Summary
`pip install mempalace-source-<name>` packages via entry points.

Tracking issue: #989
Key decisions in the draft
- Single `ingest()` method: lazy adapters yield `SourceItemMetadata` ahead of drawers; eager adapters interleave. Avoids the complexity of a `discover`/`extract` split for v1 with no clear laziness benefit in current ingesters.
- Declared-transformation model (§1.4): `declared_transformations: ClassVar[frozenset[str]]` with 13 reserved names (`utf8_replace_invalid`, `line_join_spaces`, `strip_tool_chrome`, `spellcheck_user`, …). `byte_preserving` adapters declare the empty set; `declared_lossy` adapters enumerate. Existing `miner.py` and the `convo_miner.py` + `normalize.py` pipeline map cleanly onto two declared sets (§1.4 existing-code mapping table). The current pipeline is not byte-preserving: it strips tool chrome, collapses blank lines, joins AI response lines with spaces, applies spellcheck to user turns, truncates Bash output, and inserts synthesized markers. This spec makes that honest.
- The palace is the incremental cursor via `is_current(item, existing_metadata)` instead of a persisted sidecar cursor blob. Matches existing `file_already_mined()` semantics in `palace.py:313`.
- Routing is adapter-owned: `detect_room`/`detect_hall` move into the filesystem adapter. Three-tier precedence: CLI flags → config match → adapter fallback.
- Flat metadata per ChromaDB's `str | int | float | bool` constraint (RFC 001 §1.4). Entity hints go into a `json_string` field alongside the existing flat `entities`. KG triples route to the SQLite knowledge graph via `KnowledgeGraph.add_triple()`, not drawer metadata.
- Closets stay core-built as a `build_closet_lines` post-step. Adapters MAY hint via a flat `closet_hints` string. Closes an existing gap where `convo_miner.py` skips closet building entirely.
- … `privacy_floor`. Default floor: none (single-user frictionless); enterprise deployments set it explicitly.
- No per-drawer field renames (`source_file`, `filed_at`, `source_mtime`, `added_by`, `normalize_version`, `entities`, `ingest_mode`). Spec adds three: `adapter_name`, `adapter_version`, `privacy_class`. Existing queries keep working.

Impact on in-flight PRs
Each of #274, #23, #169, #232, #567/#98, #702 is called out in §10 with the specific alignment work required. #567 (git-mine) is closest to what the spec envisions; formally it becomes the reference first-party adapter for structured extraction. #981 (path-level descriptions) is absorbed as the `metadata_only` ingest mode. #591/#592 (Delphi Oracle, live-stream) are deferred to v1.1.

Gating cleanup (not in this PR)
§9 enumerates the cleanup PR that must land before enforcement:
- `mempalace/sources/` module with `BaseSourceAdapter`, typed records, registry
- `mempalace/sources/transforms.py` with reference implementations of every reserved transformation
- `miner.py` → `mempalace/sources/filesystem.py` (behavior preserved, `READABLE_EXTENSIONS` + `detect_room`/`detect_hall` move with it)
- `convo_miner.py` + `normalize.py` → `mempalace/sources/conversations.py` (format detection becomes per-format plugins, eliminating the `if source_type` chain)
- `PalaceContext` facade exposed by `palace.py` so adapters do not import palace internals
- `KnowledgeGraph.add_triple()` gains optional `source_drawer_id` + `adapter_name` params (backwards-compatible)
- `--mode {projects,convos}` becomes a deprecated alias for `--source {filesystem,conversations}`

Test plan
- … `canonical_source_bytes` for API-backed adapters, `adapter_version` bump semantics)
- … `convo_miner.py` + `normalize.py` confirmed by a code read; the §1.4 existing-code table is my read, second pair of eyes welcome