
RFC: Source adapter plugin specification #989

@bensig


Context

The storage backend seam (#413, formalized in #743) solved pluggable writes. The mirror problem exists on the read side: source adapters that mine content into the palace.

Six community PRs are now building source-specific ingesters on ad-hoc surfaces:

Plus three ingesters already grafted into core without a shared contract:

  • `mempalace/miner.py` — filesystem miner
  • `mempalace/convo_miner.py` — conversation miner
  • `mempalace/normalize.py` — format detection for four chat-export shapes

Plus one open proposal for a different ingest semantic:

Each PR reinvents source discovery, item identity, incremental-ingest bookkeeping, metadata shape, and chunking strategy. We need a formal plugin specification so adapter authors can build to a stable contract, matching what RFC 001 does for storage backends.

What the spec should define

  1. Adapter interface — a unified `ingest()` yielding `SourceItemMetadata | DrawerRecord`, a `describe_schema()`, and an optional `is_current()` for incremental ingest
  2. Declared transformations — every adapter publishes the set of transformations it applies to source bytes; `byte_preserving` adapters declare the empty set. This replaces the informal "verbatim always" promise with a verifiable one (the current `convo_miner.py` + `normalize.py` pipeline transforms content extensively; declaring that makes it honest)
  3. Registration — entry-point group `mempalace.sources` (third-party packages ship as `pip install mempalace-source-`)
  4. Metadata schema — universal required fields (no renames to existing drawer fields) + per-adapter declared schema; flat values only per ChromaDB constraints
  5. Privacy class — declarable per adapter; enforced per palace via `privacy_floor`; enables regulated-domain use cases
  6. Incremental ingest — palace IS the cursor via `is_current(item, existing_metadata)`; no sidecar
  7. Closet integration — core builds closets post-step; adapters MAY emit flat `closet_hints`; closes current gap where conversation drawers get no closets
  8. Routing — adapter-owned; `detect_room`/`detect_hall` move into the filesystem adapter
  9. Testing contract — abstract pytest suite including byte-preservation round-trip (for `byte_preserving` adapters) and declared-transformation round-trip (for `declared_lossy` adapters)
  10. Cleanup prerequisite — refactor existing `miner.py` / `convo_miner.py` onto the new contract before third-party adapter PRs merge; `KnowledgeGraph.add_triple()` gains backwards-compatible `source_drawer_id` + `adapter_name` params
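To make items 1, 2, and 6 concrete, here is a minimal sketch of what the contract might look like. Only the names `BaseSourceAdapter`, `ingest()`, `describe_schema()`, `is_current(item, existing_metadata)`, `SourceItemMetadata`, and `DrawerRecord` come from this RFC; every field, the `declared_transformations` attribute, the `existing` argument to `ingest()`, and the `InMemoryAdapter` class are assumptions made for illustration, not a proposed final shape.

```python
from __future__ import annotations

import hashlib
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Iterator, Mapping, Optional, Union


@dataclass(frozen=True)
class SourceItemMetadata:
    # Hypothetical fields: identity of one source item and the version seen.
    source_id: str
    version: str


@dataclass(frozen=True)
class DrawerRecord:
    # Hypothetical fields: content plus flat metadata for one drawer.
    source_id: str
    content: str
    metadata: dict = field(default_factory=dict)


class BaseSourceAdapter(ABC):
    name: str = "base"
    # Assumed attribute: the empty set means the adapter is byte_preserving.
    declared_transformations: frozenset = frozenset()

    @abstractmethod
    def ingest(
        self, existing: Mapping[str, dict]
    ) -> Iterator[Union[SourceItemMetadata, DrawerRecord]]:
        """Yield item metadata and drawer records for the source."""

    @abstractmethod
    def describe_schema(self) -> dict:
        """Return the adapter's declared per-item metadata schema."""

    def is_current(
        self, item: SourceItemMetadata, existing_metadata: Optional[dict]
    ) -> bool:
        # The palace is the cursor: compare against what is already stored.
        return (
            existing_metadata is not None
            and existing_metadata.get("version") == item.version
        )


class InMemoryAdapter(BaseSourceAdapter):
    """Toy byte-preserving adapter over a dict of {source_id: text}."""

    name = "in_memory"

    def __init__(self, items: dict[str, str]):
        self._items = items

    def describe_schema(self) -> dict:
        return {"version": "str"}

    def ingest(self, existing):
        for source_id, text in self._items.items():
            version = hashlib.sha256(text.encode("utf-8")).hexdigest()
            item = SourceItemMetadata(source_id, version)
            yield item
            if self.is_current(item, existing.get(source_id)):
                continue  # already in the palace at this version
            yield DrawerRecord(source_id, text, {"version": version})


adapter = InMemoryAdapter({"a.txt": "hello", "b.txt": "world"})
first = [r for r in adapter.ingest({}) if isinstance(r, DrawerRecord)]
palace = {r.source_id: r.metadata for r in first}  # stand-in for stored metadata
second = [r for r in adapter.ingest(palace) if isinstance(r, DrawerRecord)]
```

In this sketch the second run emits no drawer records because the stored metadata already matches each item's version, which is the "no sidecar" behavior item 6 describes.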

Why beyond developer tooling

The adapter pattern is source-agnostic. Beyond the current dev-focused ingesters, it covers Notion / Obsidian (knowledge work), Slack / email / iMessage (communications), Whisper transcripts (creator workflows), and regulated-domain sources (medical / legal / financial) gated on the privacy-class contract.
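One possible reading of the privacy-class gate, sketched under loud assumptions: the class names, their ordering, and the rule "a palace rejects any adapter whose declared class is below its `privacy_floor`" are all hypothetical. The RFC only states that the class is declared per adapter and enforced per palace.

```python
from enum import IntEnum


class PrivacyClass(IntEnum):
    # Hypothetical levels, ordered from least to most protective.
    PUBLIC = 0
    INTERNAL = 1
    RESTRICTED = 2


def adapter_allowed(adapter_class: PrivacyClass, privacy_floor: PrivacyClass) -> bool:
    """Assumed rule: the adapter must meet or exceed the palace's floor."""
    return adapter_class >= privacy_floor


# A palace configured for regulated data would refuse a public-class adapter.
ok = adapter_allowed(PrivacyClass.RESTRICTED, PrivacyClass.INTERNAL)
refused = not adapter_allowed(PrivacyClass.PUBLIC, PrivacyClass.INTERNAL)
```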

This is how "structured data for enterprise" reconciles with the ingest commitments: declared-transformation content in the drawer, structured fields in the adapter's declared schema, filtering handled by backends (RFC 001 §2.1).
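The "flat values only" constraint on structured fields could be checked with something like the sketch below. It assumes that "flat" means scalar `str`, `int`, `float`, or `bool` values; the helper name and the example field names are hypothetical.

```python
FLAT_TYPES = (str, int, float, bool)  # assumed set of allowed scalar types


def non_flat_keys(metadata: dict) -> list:
    """Return the keys whose values are not flat scalars, sorted by name."""
    return sorted(k for k, v in metadata.items() if not isinstance(v, FLAT_TYPES))


# Hypothetical drawer metadata: the list-valued field would be rejected.
bad = non_flat_keys({"room": "notes", "chunk_index": 3, "tags": ["a", "b"]})
good = non_flat_keys({"room": "notes", "chunk_index": 3, "tags": "a,b"})
```

An adapter that needs list-like data would have to encode it into a scalar, for example a delimited string, before emitting the record.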

Current state

No `BaseSourceAdapter` ABC exists. Each ingester is hand-written against palace internals. Format detection accumulates in `normalize.py` as an `if` chain. Contributors building the seventh adapter (CodeRabbit exports, and related technical-workspace mining flagged in recent user feedback) have no contract to build against.

Related

cc @Perseusxrltd @JakobSachs @adv3nt3 @zendesk-thittesdorf @mfhens @roip @MrDys — authors of the in-flight source-ingester work. Your input on whether this spec's shape fits what you've built is the most valuable thing we can get on this thread.
