
docs: RFC 002 — Source adapter plugin specification#990

Merged
igorls merged 1 commit into develop from docs/rfc-source-adapter-plugin-spec
Apr 18, 2026

Conversation

bensig (Collaborator) commented Apr 18, 2026

Summary

Tracking issue: #989

Key decisions in the draft

  • Single ingest() method — lazy adapters yield SourceItemMetadata ahead of drawers; eager adapters interleave. Avoids the complexity of a discover / extract split for v1 with no clear laziness benefit in current ingesters.
  • Declared transformations (§1.4) — declared_transformations: ClassVar[frozenset[str]] with 13 reserved names (utf8_replace_invalid, line_join_spaces, strip_tool_chrome, spellcheck_user, …). byte_preserving adapters declare the empty set; declared_lossy adapters enumerate. Existing miner.py and the convo_miner.py + normalize.py pipeline map cleanly onto two declared sets (§1.4 existing-code mapping table). The current pipeline is not byte-preserving — it strips tool chrome, collapses blank lines, joins AI response lines with spaces, applies spellcheck to user turns, truncates Bash output, inserts synthesized markers. This spec makes that honest.
  • Palace-is-the-cursor incremental — adapter implements is_current(item, existing_metadata) instead of a persisted sidecar cursor blob. Matches existing file_already_mined() semantics in palace.py:313.
  • Adapter-owned routing (§2.5) — detect_room/detect_hall move into the filesystem adapter. Three-tier precedence: CLI flags → config match → adapter fallback.
  • Flat metadata — matches ChromaDB's str | int | float | bool constraint (RFC 001 §1.4). Entity hints go into a json_string field alongside the existing flat entities. KG triples route to the SQLite knowledge graph via KnowledgeGraph.add_triple(), not drawer metadata.
  • Closets stay core-built (§1.7) — adapters yield drawers only; core runs build_closet_lines post-step. Adapters MAY hint via a flat closet_hints string. Closes an existing gap where convo_miner.py skips closet building entirely.
  • Privacy class (§6) — declarable per adapter, enforced per-palace via privacy_floor. Default floor: none (single-user frictionless); enterprise deployments set explicitly.
  • No per-drawer field renames — §5.1 preserves every existing field (source_file, filed_at, source_mtime, added_by, normalize_version, entities, ingest_mode). Spec adds three: adapter_name, adapter_version, privacy_class. Existing queries keep working.
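The first three decisions above (single ingest(), declared transformations, palace-as-cursor) can be sketched roughly as follows. This is a minimal illustration, not the spec's actual definitions: the record fields, the `InMemoryAdapter` class, and the no-argument `ingest()` signature are all hypothetical; only the names `ingest`, `is_current`, `declared_transformations`, `SourceItemMetadata`, and `source_mtime` come from the description.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import ClassVar, Iterator, Optional, Union


@dataclass(frozen=True)
class SourceItemMetadata:
    # Hypothetical fields; the spec's real record may differ.
    source_file: str
    source_mtime: float


@dataclass(frozen=True)
class DrawerRecord:
    # Hypothetical fields; the spec's real record may differ.
    source_file: str
    content: str


class InMemoryAdapter:
    """Toy eager adapter over a dict of {path: (mtime, text)}."""

    name: ClassVar[str] = "in_memory_example"
    # A byte_preserving adapter declares the empty set.
    declared_transformations: ClassVar[frozenset[str]] = frozenset()

    def __init__(self, files: dict[str, tuple[float, str]]) -> None:
        self._files = files

    def ingest(self) -> Iterator[Union[SourceItemMetadata, DrawerRecord]]:
        # Eager style: interleave each item's metadata with its drawer.
        for path, (mtime, text) in sorted(self._files.items()):
            yield SourceItemMetadata(source_file=path, source_mtime=mtime)
            yield DrawerRecord(source_file=path, content=text)

    def is_current(self, item: SourceItemMetadata,
                   existing_metadata: Optional[dict]) -> bool:
        # The palace is the cursor: compare against what is already stored
        # instead of reading a persisted sidecar blob.
        if existing_metadata is None:
            return False
        return existing_metadata.get("source_mtime") == item.source_mtime
```

A lazy adapter would instead yield all `SourceItemMetadata` records first, letting core call `is_current()` before any drawer content is produced.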

Impact on in-flight PRs

Each of #274, #23, #169, #232, #567/#98, #702 is called out in §10 with the specific alignment work required. #567 (git-mine) is closest to what the spec envisions; formally it becomes the reference first-party adapter for structured extraction. #981 (path-level descriptions) is absorbed as the metadata_only ingest mode. #591/#592 (Delphi Oracle, live-stream) deferred to v1.1.

Gating cleanup (not in this PR)

§9 enumerates the cleanup PR that must land before enforcement:

  • mempalace/sources/ module with BaseSourceAdapter, typed records, registry
  • mempalace/sources/transforms.py with reference implementations of every reserved transformation
  • miner.py → mempalace/sources/filesystem.py (behavior preserved, READABLE_EXTENSIONS + detect_room/detect_hall move with it)
  • convo_miner.py + normalize.py → mempalace/sources/conversations.py (format detection becomes per-format plugins, eliminating the if source_type chain)
  • PalaceContext facade exposed by palace.py so adapters do not import palace internals
  • KnowledgeGraph.add_triple() gains optional source_drawer_id + adapter_name params (backwards-compatible)
  • --mode {projects,convos} becomes a deprecated alias for --source {filesystem,conversations}
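As an illustration of the last item, a deprecated `--mode` alias could be layered over `--source` with argparse along these lines. This is a sketch under assumptions: the parser name, the precedence of `--source` over `--mode`, and the error raised when neither flag is given are not taken from the spec.

```python
import argparse
import warnings

# Mapping from the bullet above: projects -> filesystem, convos -> conversations.
_MODE_TO_SOURCE = {"projects": "filesystem", "convos": "conversations"}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="mine-example")  # hypothetical name
    parser.add_argument("--source", choices=sorted(_MODE_TO_SOURCE.values()))
    parser.add_argument("--mode", choices=sorted(_MODE_TO_SOURCE),
                        help="deprecated alias for --source")
    return parser


def resolve_source(argv: list[str]) -> str:
    args = build_parser().parse_args(argv)
    if args.source is not None:
        return args.source  # assumption: an explicit --source wins
    if args.mode is not None:
        warnings.warn("--mode is deprecated; use --source",
                      DeprecationWarning, stacklevel=2)
        return _MODE_TO_SOURCE[args.mode]
    raise ValueError("one of --source or --mode is required")
```

Keeping the mapping in one dict means the alias can be removed later by deleting the `--mode` argument and the second branch.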

Test plan

Draft plugin specification for source adapters, mirroring RFC 001's
role for storage backends. Formalizes the contract six community
ingester PRs (#274, #23, #169, #232, #567/#98, #702) plus #981's
metadata-only mode have been reinventing ad-hoc, so adapter authors
can build to a stable surface.

Key decisions:
- Single ingest() method; lazy adapters yield SourceItemMetadata
  ahead of drawers, eager adapters interleave
- Declared-transformation model (§1.4) replaces informal verbatim
  promise with a verifiable one; byte_preserving adapters declare
  the empty set, declared_lossy adapters enumerate. Existing
  miner.py and the convo_miner+normalize pipeline map cleanly
- Palace is the incremental cursor via is_current(item, metadata);
  no sidecar persistence
- Routing is adapter-owned; detect_room/detect_hall move into the
  filesystem adapter
- Flat metadata per ChromaDB (RFC 001 §1.4) — entity hints as
  json_string field, KG triples route to SQLite knowledge graph
- Closets stay core-built as a post-step; adapters may emit flat
  closet_hints. Closes existing gap where convo drawers get no
  closets
- No per-drawer field renames: source_file, filed_at, source_mtime,
  added_by, normalize_version, entities, ingest_mode all preserved.
  Spec adds adapter_name, adapter_version, privacy_class

§9 enumerates the cleanup PR prerequisites (mempalace/sources/
module, PalaceContext facade, KnowledgeGraph.add_triple gaining
backwards-compatible source_drawer_id + adapter_name params).

Tracking issue: #989
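The flat-metadata decision described in this commit message can be sketched as below: scalar fields pass through unchanged, while structured entity hints are serialized into a single JSON string. The helper name `flatten_metadata` and the field name `entity_hints_json` are hypothetical; the spec only says hints go into a json_string field.

```python
import json
from typing import Union

FlatValue = Union[str, int, float, bool]


def flatten_metadata(base: dict, entity_hints: list) -> dict:
    """Return metadata whose values are all str, int, float, or bool."""
    flat = {}
    for key, value in base.items():
        if not isinstance(value, (str, int, float, bool)):
            raise TypeError(
                f"{key!r} is not a flat value: {type(value).__name__}"
            )
        flat[key] = value
    # Structured hints are packed into one string field (name is assumed).
    flat["entity_hints_json"] = json.dumps(entity_hints, sort_keys=True)
    return flat
```

Consumers that need the structured form call `json.loads()` on the field; nested values elsewhere are rejected early rather than silently dropped.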

roip commented Apr 18, 2026

Thanks for the thorough spec, Ben.
LGTM. In #981 my requirements were simple: the system should allow metadata-level granularity instead of full content, formalize declarative ignore rules separately from .gitignore, and support path-matching patterns for these features.

Two things to surface for the spec:

level: ignore alongside metadata_only: in PR #986 I implemented three tiers: ignore (skip entirely), describe (one drawer), and full (normal). The ignore tier is useful for lock files, node_modules artifacts, etc. In adapter terms, the adapter yields zero items for that path. This is worth making explicit in the spec so adapters have a clean "I've seen this file and am deliberately producing nothing" signal, as distinct from "I failed to process it." The overlap with .gitignore is not complete, which is why this is needed.
If the intent is to use an ignore adapter, I recommend shipping it with the feature as a built-in.

Config-driven routing: our paths: section in mempalace.yaml uses fnmatch glob patterns with a per-pattern level, description, and optional room override. This fits §2.5's "config match" tier. A proposed schema is available in PR #986.
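The three tiers and the fnmatch-based matching described in this comment might look roughly like this. It is a sketch only: `PathRule`, `match_rule`, `level_for`, the first-match-wins ordering, and the `full` default are assumptions, not the schema proposed in PR #986.

```python
from dataclasses import dataclass
from fnmatch import fnmatch
from typing import Optional


@dataclass(frozen=True)
class PathRule:
    pattern: str                 # fnmatch glob
    level: str                   # "ignore", "describe", or "full"
    description: str = ""
    room: Optional[str] = None   # optional room override


def match_rule(path: str, rules: list) -> Optional[PathRule]:
    # Assumption: rules are checked in order and the first match wins.
    for rule in rules:
        if fnmatch(path, rule.pattern):
            return rule
    return None


def level_for(path: str, rules: list, default: str = "full") -> str:
    # Paths with no matching rule fall back to normal ingestion (assumed).
    rule = match_rule(path, rules)
    return rule.level if rule else default
```

For example, with rules `[PathRule("*.lock", "ignore"), PathRule("docs/*", "describe", room="documentation")]`, a lock file resolves to the ignore tier, so an adapter would yield nothing for it, while unmatched paths stay at `full`.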

igorls added a commit that referenced this pull request Apr 18, 2026
…ry, PalaceContext

Lands the read-side contract so third-party adapter authors (@Perseusxrltd,
@JakobSachs, @adv3nt3, @zendesk-thittesdorf, @mfhens, @roip, @MrDys) have a
stable target matching what RFC 001 §10 landed on the write side in #995.

Scope (this PR):

- mempalace/sources/base.py: BaseSourceAdapter ABC with kwargs-only
  ingest() / describe_schema() and default is_current() / source_summary()
  / close() (§1.1–1.2). Typed records: SourceRef, SourceItemMetadata,
  DrawerRecord, RouteHint, SourceSummary, AdapterSchema, FieldSpec (§1.3,
  §5.2). Error classes: SourceNotFoundError, AuthRequiredError,
  AdapterClosedError, TransformationViolationError, SchemaConformanceError
  (§2.7). Class-level identity contract: name / adapter_version /
  capabilities / supported_modes / declared_transformations /
  default_privacy_class (§2.1, §1.4, §1.5, §6).

- mempalace/sources/transforms.py: reference implementations of the 13
  reserved transformations (§1.4) — utf8_replace_invalid, newline_normalize,
  whitespace_trim, whitespace_collapse_internal, line_trim, line_join_spaces,
  blank_line_drop — as pure functions, plus identity shims for the six
  adapter-specific ones (strip_tool_chrome, tool_result_truncate,
  tool_result_omitted, spellcheck_user, synthesized_marker,
  speaker_role_assignment) that the conversations adapter will override
  when migrated. get_transformation(name) resolves by reserved name.

- mempalace/sources/registry.py: entry-point discovery via
  importlib.metadata.entry_points(group="mempalace.sources") + explicit
  register()/unregister() surface (§3.1–3.2). resolve_adapter_for_source()
  implements the §3.3 priority order; crucially, no auto-detection on the
  read side (§3.3 is explicit about that — user intent never inferred from
  on-disk artifacts).

- mempalace/sources/context.py: PalaceContext facade (§9) bundling the
  drawer/closet collections, knowledge graph, palace path, adapter identity,
  and progress hooks core passes into adapter.ingest(). upsert_drawer()
  applies the spec-mandated adapter_name/adapter_version stamps from §5.1.
  skip_current_item() signals laziness; emit() dispatches to hooks and
  swallows hook exceptions.

- mempalace/knowledge_graph.py: add_triple() gains optional source_drawer_id
  and adapter_name kwargs (§5.5). Backwards-compatible column migration
  auto-adds the new columns on open of a pre-RFC 002 palace (PRAGMA
  table_info then ALTER TABLE ADD COLUMN), matching the pattern used for
  any new palace-side provenance fields.

- pyproject.toml: mempalace.sources entry-point group declared. Empty on
  the first-party side for now — miners migrate in a follow-up; the group
  being present means third-party packages can begin registering today.
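The column migration mentioned for knowledge_graph.py (PRAGMA table_info, then ALTER TABLE ADD COLUMN) follows a common SQLite pattern. A minimal sketch, assuming a table named `triples` and TEXT columns, neither of which is confirmed by the commit message:

```python
import sqlite3

# Column names come from the commit message; the TEXT type is an assumption.
NEW_COLUMNS = {"source_drawer_id": "TEXT", "adapter_name": "TEXT"}


def ensure_provenance_columns(conn: sqlite3.Connection,
                              table: str = "triples") -> list:
    """Add any missing provenance columns; return the names that were added."""
    # Each PRAGMA table_info row is (cid, name, type, notnull, dflt_value, pk).
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    added = []
    for column, sql_type in NEW_COLUMNS.items():
        if column not in existing:
            conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {sql_type}")
            added.append(column)
    conn.commit()
    return added
```

Running the function a second time is a no-op, which is what makes it safe to call on every open. Existing rows read back NULL for the new columns.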

Out of scope (explicit follow-ups):

- miner.py → mempalace/sources/filesystem.py. Behavior-preserving rename
  that also moves READABLE_EXTENSIONS, detect_room(), detect_hall() into
  the adapter (§9). Larger refactor; lands separately.
- convo_miner.py + normalize.py → mempalace/sources/conversations.py. The
  format-detection if-chain in normalize.py becomes per-format plugins;
  declared_transformations enumerates what the current pipeline already
  does to source bytes (§1.4 existing-code mapping).
- Closet post-step wired into the conversations adapter (§1.7).
- CLI --source flag + --mode deprecation alias (§3.3).
- MCP mempalace_mine tool source parameter.
- AbstractSourceAdapterContractSuite (§7.1–7.3): byte-preservation round-
  trip and declared-transformation round-trip tests.
- Privacy-class floor enforcement (§6.2); depends on #389 for
  secrets_possible scanning.
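The declared-transformation round-trip idea from the contract-suite item can be illustrated with a small check: stored content should equal the source after applying exactly the declared transformations. The two transformation bodies below are guesses based on the reserved names alone, and applying them in list order is an assumption (the spec declares a frozenset).

```python
from typing import Callable


def newline_normalize(text: str) -> str:
    # Assumed behavior: convert CRLF and bare CR to LF.
    return text.replace("\r\n", "\n").replace("\r", "\n")


def whitespace_trim(text: str) -> str:
    # Assumed behavior: strip leading and trailing whitespace.
    return text.strip()


TRANSFORMS: dict[str, Callable[[str], str]] = {
    "newline_normalize": newline_normalize,
    "whitespace_trim": whitespace_trim,
}


def apply_declared(source: str, declared: list) -> str:
    """Apply the named transformations in the given order."""
    for name in declared:
        source = TRANSFORMS[name](source)
    return source


def conforms(source: str, stored: str, declared: list) -> bool:
    """True if stored equals source after only the declared transformations."""
    return apply_declared(source, declared) == stored
```

With an empty declaration the check reduces to byte-for-byte equality, which corresponds to the byte_preserving case.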

Tests: 1018 passed (up from ~990 on develop), +27 targeted tests covering
the ABC instantiation rules, typed records, all reserved transformations,
the registry register/get/unregister surface, PalaceContext upsert + skip +
emit semantics, and both the new KG provenance kwargs and backwards-
compatible legacy-schema migration.

Refs: #989 (RFC 002 tracking), #990 (RFC 002 spec), #995 (RFC 001 §10
cleanup — sibling PR on the write side).
igorls merged commit 109d7f2 into develop Apr 18, 2026
6 checks passed
jphein added a commit to jphein/mempalace that referenced this pull request Apr 19, 2026
…uard

Merges MemPalace#990 (RFC 002 spec), MemPalace#1014 (BaseSourceAdapter/PalaceContext scaffolding),
MemPalace#1013 (Layer3.search_raw None guard), MemPalace#1012 (docs), MemPalace#1010 (chromadb >=1.5.4),
and MemPalace#998 (sweeper/tandem transcript safety net).

Fork changes preserved:
- quarantine_stale_hnsw() in chroma.py (guards HNSW/sqlite drift segfault)
- get-then-create instead of get_or_create (guards ChromaDB 1.5.x metadata segfault)
- paginated status() loop (guards SQLite variable limit on large palaces)
- searcher hits-loop, BM25 fallback, _count_in_scope
- .jsonl exempt from JUNK_FILE_SIZE cap (Claude Code transcripts can be large)
- _validate_where() + operator constants taken from upstream

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
igorls added a commit that referenced this pull request Apr 19, 2026
Version bumps across pyproject.toml, mempalace/version.py, README badge,
uv.lock, and plugin manifests (.claude-plugin/*, .codex-plugin/*).

CHANGELOG aligned with main (post-3.3.1) and a new [3.3.2] section added
covering the 11 PRs merged on develop since v3.3.1 — silent-transcript-drop
fix + tandem sweeper (#998), None-metadata guards (#999, #1013),
chromadb ≥1.5.4 for Py 3.13/3.14 (#1010), Windows Unicode (#681),
HNSW quarantine recovery (#1000), PID stacking guard (#1023), doc-path
cleanup (#996, #1012), and RFC 001/002 internal scaffolding (#995, #1014, #990).
igorls mentioned this pull request Apr 19, 2026
8 tasks
jphein pushed a commit to jphein/mempalace that referenced this pull request Apr 30, 2026
… Cursor, Aider, Gemini CLI, Codex CLI, Warp)

Adds a new "What's next" bullet for first-class support across the AI
coding agent ecosystem. Today's integration is Claude Code-specific
(Stop / PreCompact hooks, ~/.claude/projects/*.jsonl mining); the
roadmap target is the broader set of coding agents.

Path is upstream's RFC 002 source-adapter spec (MemPalace#990): each agent
ships a pip install mempalace-source-<agent> package mapping its
session format onto the canonical drawer shape, with parity on
session_id / agent / wing.

Frames the integration matrix as three cells:
* read — universal (MCP server is already agent-agnostic)
* mine — per-agent via RFC 002 adapters
* hook/event — wherever the host exposes a hook surface, falling
  back to mining-on-cron

Fork unblocks the pattern; adapter PRs land per-agent. Companion
to the agent-shaped-CLI item already in What's next; together they
cover both surfaces of the agent integration story.

https://claude.ai/code/session_01GvwducFnFtN8KYmfbWKMR6
jphein pushed a commit to jphein/mempalace that referenced this pull request Apr 30, 2026
…Code, MemPalace#274/MemPalace#232 Cursor, MemPalace#169 Pi, MemPalace#702 Cursor+factory.ai)

Updates the multi-agent-support bullet to cite the actual upstream
work instead of just gesturing at it. RFC 002 itself is PR MemPalace#990
(tracking issue MemPalace#989). Existing third-party prototypes already
proposed against the spec:

* OpenCode SQLite — PR MemPalace#23
* Cursor SQLite — issue MemPalace#274
* Cursor JSONL (earlier variant) — PR MemPalace#232
* Pi agent JSONL — PR MemPalace#169
* Combined Cursor + factory.ai — PR MemPalace#702

Each becomes a mempalace-source-<agent> package once RFC 002 lands.
Names the path explicitly: fork unblocks the pattern by helping land
RFC 002; per-agent adapter PRs land from their respective authors.

Aider, Gemini CLI, Codex CLI, and Warp are roadmap targets without
existing adapter PRs and are listed as such (no fabricated PR refs).

https://claude.ai/code/session_01GvwducFnFtN8KYmfbWKMR6