
Feature: PII Guard — strip, map, and restore personal data before LLM calls #118

@bensig

Problem

Any AI memory system that stores and retrieves user context will inevitably handle personally identifiable information — names, emails, phone numbers, addresses, SSNs, medical info, financial data. When that context gets sent to an LLM for processing, the PII goes with it.

This is a liability for every production deployment, and there's no clean modular solution that plugs into existing memory pipelines.

Proposal: PII Guard Module

A lightweight, pluggable PII layer that:

1. Detection & Stripping

  • Scans text for PII (names, emails, phone numbers, addresses, government IDs, financial info, etc.)
  • Uses a combination of NER, regex patterns, and configurable rules (a regex-only sketch follows this list)
  • Runs before any text is sent to an LLM

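A minimal sketch of the regex half of detection, in Python. The pattern table, the specific regexes, and the detect_pii name are illustrative assumptions; a real implementation would layer an NER pass (e.g., spaCy) and user-defined rules on top:

```python
import re

# Hypothetical pattern table (illustrative, not exhaustive). A real module
# would pair these regexes with an NER model for names/addresses and let
# users register custom rules.
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "PHONE": re.compile(r"\b(?:\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b"),
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def detect_pii(text: str) -> list[tuple[str, str, int, int]]:
    """Return (entity_type, value, start, end) for every match."""
    hits = [
        (etype, m.group(), m.start(), m.end())
        for etype, pattern in PII_PATTERNS.items()
        for m in pattern.finditer(text)
    ]
    # Sort by position so a downstream replacer can walk left to right.
    return sorted(hits, key=lambda h: h[2])
```
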
2. Identity Mapping

  • Replaces detected PII with deterministic tokens (e.g., John Smith → [PERSON_A7x3], john@example.com → [EMAIL_A7x3]); a mapping sketch follows this list
  • Maintains a local-only identity map that never leaves the user's environment
  • Same entity always maps to the same token within a session, so the LLM can still reason about relationships ("PERSON_A7x3 emailed PERSON_B2k9")

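A sketch of the deterministic mapping, assuming a per-session salt: hashing the salted value yields the same short token for the same entity within a session, without linkability across sessions. The IdentityMap class and the 4-hex-character suffix are assumptions for illustration:

```python
import hashlib

class IdentityMap:
    """Local-only map between real PII values and placeholder tokens."""

    def __init__(self, session_salt: bytes):
        self._salt = session_salt  # fresh per session => tokens are ephemeral
        self.token_to_value: dict[str, str] = {}

    def tokenize(self, entity_type: str, value: str) -> str:
        # Same salt + same value => same digest, so "John Smith" becomes the
        # same [PERSON_xxxx] all session long and relationships stay legible.
        suffix = hashlib.sha256(self._salt + value.encode()).hexdigest()[:4]
        token = f"[{entity_type}_{suffix}]"
        self.token_to_value[token] = value
        return token
```

Feeding detect_pii hits through tokenize and splicing the tokens back into the text is one way to realize the sanitize(text) → (clean_text, map) operation described under Pluggable Architecture below.
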
3. Restoration on Request

  • After the LLM responds, tokens get rehydrated back to real identities before the user sees the output
  • User can control restore behavior: always restore, never restore, or ask-per-entity
  • Restoration map is encrypted at rest (a sketch follows this list)

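A sketch of restoration plus at-rest encryption. Rehydration is a plain regex substitution over the LLM reply; persistence here uses Fernet from the third-party cryptography package, which is an assumption (any authenticated encryption would do). TOKEN_RE, restore, and save_map are illustrative names:

```python
import json
import re

from cryptography.fernet import Fernet  # third-party: pip install cryptography

# Matches tokens like [PERSON_9f2c]; the entity list here is an assumption.
TOKEN_RE = re.compile(r"\[(?:PERSON|EMAIL|PHONE|SSN)_[0-9a-f]{4}\]")

def restore(text: str, token_to_value: dict[str, str]) -> str:
    """Swap tokens in the LLM reply back to real identities.

    Unknown tokens are left as-is, which is also where an ask-per-entity
    or never-restore policy could hook in.
    """
    return TOKEN_RE.sub(lambda m: token_to_value.get(m.group(), m.group()), text)

def save_map(token_to_value: dict[str, str], key: bytes, path: str) -> None:
    """Write the identity map to disk encrypted at rest.

    `key` comes from Fernet.generate_key() and must itself be stored
    securely (e.g., an OS keychain), which is outside this sketch's scope.
    """
    blob = Fernet(key).encrypt(json.dumps(token_to_value).encode())
    with open(path, "wb") as fh:
        fh.write(blob)
```
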
4. Pluggable Architecture

  • Works as middleware — sits between mempalace (or any memory system) and the LLM API call
  • Simple interface: sanitize(text) → (clean_text, map) and restore(text, map) → original_text (wired together in the sketch below)
  • Could also work standalone for any AI pipeline, not just mempalace

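Putting it together, the middleware contract can be as small as two callables threaded around the LLM call. A minimal sketch, assuming the sanitize/restore signatures from the bullet above; guarded_call is a hypothetical wrapper name:

```python
from typing import Callable

SanitizeFn = Callable[[str], tuple[str, dict[str, str]]]  # text -> (clean_text, map)
RestoreFn = Callable[[str, dict[str, str]], str]          # (text, map) -> original_text
LlmFn = Callable[[str], str]                              # prompt -> completion

def guarded_call(text: str, sanitize: SanitizeFn,
                 restore: RestoreFn, llm: LlmFn) -> str:
    clean_text, id_map = sanitize(text)  # strip PII; map never leaves this process
    reply = llm(clean_text)              # only tokenized text crosses the wire
    return restore(reply, id_map)        # rehydrate before the user sees the output
```

Because the guard only ever sees strings, it can wrap mempalace, a RAG pipeline, or a bare API client without caring which.
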
Why This Matters

  • Compliance: GDPR, CCPA, and HIPAA all impose requirements on how PII is handled
  • Trust: Users storing personal memories/docs need confidence their data isn't leaking to third parties
  • Universal need: This isn't mempalace-specific — every AI system sending user context to an LLM has this problem. Building it here as an open module benefits the entire ecosystem.

Open Questions

  • Should the identity map persist across sessions, or be ephemeral by default?
  • Best approach for multilingual PII detection?
  • Should there be a confidence threshold where low-confidence PII gets flagged for user review rather than auto-stripped?

Would love input from anyone working on privacy-preserving AI pipelines.
