RFC: Context Gateway error taxonomy (parse / permission / missing / internal) #762

@memtomem

Motivation

/api/context/overview returns a per-surface summary that collapses
every failure into a single boolean:

# packages/memtomem/src/memtomem/web/routes/context_gateway.py:42-44
except Exception:
    logger.exception("diff_skills failed")
    result["skills"] = {"total": 0, "error": True}

A user staring at "error" can't tell whether to:

  • fix a malformed .md frontmatter (parse error),
  • chmod a directory (permission error),
  • install a missing dependency (ImportError on the diff helpers),
  • or file an issue (server-side bug nobody caused).

All four end in the same toast. Logs have the traceback, but most
end-users never see logs.

Current state

  • Single overview route catches Exception four times (skills,
    commands, agents, settings) and returns error: True. No
    classification.
  • Detail routes (PUT /context/{type}/{name}) have slightly more
    granular errors (404, 409, 500) but no machine-readable
    sub-category.
  • Front-end consumes error as a boolean — flips a red badge in the
    Context Gateway summary card, no further action.

Proposed change

Introduce a small taxonomy returned alongside error: True:

{
  "skills": {
    "total": 0,
    "error": true,
    "error_kind": "parse" | "permission" | "missing" | "internal",
    "error_message": "<short user-facing string, optional>"
  }
}

Mapping rules (catch-then-classify, not classify-then-catch: the route
keeps its broad except as the boundary, and a separate classification
step on the caught exception is the test surface):

| Exception class | error_kind | UI guidance |
| --- | --- | --- |
| yaml.YAMLError, tomllib.TOMLDecodeError, our ParseError | parse | "Fix the malformed file at <path> and reload." |
| PermissionError, OSError(EACCES) | permission | "Read access denied to <path>." |
| FileNotFoundError, ModuleNotFoundError | missing | "Path <path> not found / dependency missing." |
| anything else | internal | "Server error: see logs / file an issue." |
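
The mapping rules above could be sketched as a single pure function, which is what makes the classification unit-testable without touching the route. This is a minimal sketch, not the project's code: the name classify_error is hypothetical, ParseError here is a local stand-in for the project's own class, and the yaml and tomllib imports are guarded in case either is unavailable.

```python
import errno

# Stand-in for the project's own ParseError (assumption: the real class
# lives elsewhere in memtomem and would be imported instead).
class ParseError(Exception):
    pass

# Collect parse-error classes defensively: PyYAML is third-party and
# tomllib only exists on Python 3.11+.
_parse_errors = [ParseError]
try:
    import yaml
    _parse_errors.append(yaml.YAMLError)
except ImportError:
    pass
try:
    import tomllib
    _parse_errors.append(tomllib.TOMLDecodeError)
except ImportError:
    pass
PARSE_ERRORS = tuple(_parse_errors)


def classify_error(exc: BaseException) -> str:
    """Map a caught exception to one of the four proposed error_kind values."""
    if isinstance(exc, PARSE_ERRORS):
        return "parse"
    # PermissionError covers EACCES raised by the OS; the errno check also
    # catches OSError instances whose errno was set to EACCES by hand.
    if isinstance(exc, PermissionError) or (
        isinstance(exc, OSError) and exc.errno == errno.EACCES
    ):
        return "permission"
    if isinstance(exc, (FileNotFoundError, ModuleNotFoundError)):
        return "missing"
    return "internal"
```

Note that under this mapping a plain ImportError (for example "cannot import name ...") is not a ModuleNotFoundError and would land in internal; whether that is the desired behavior is a choice for the RFC.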

Keep error: True for backwards compat (front-end and any external
callers can ignore error_kind and the existing red-badge flow keeps
working).
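
As an illustration of the backwards-compatible envelope, the sketch below builds a per-surface summary where each surface fails independently and error: True is always kept next to the new error_kind. The names build_overview and the surfaces mapping are hypothetical, and the classifier is passed in as a plain callable so the example stays self-contained; it is not the existing route code.

```python
import logging
from typing import Any, Callable, Dict

logger = logging.getLogger(__name__)


def build_overview(
    surfaces: Dict[str, Callable[[], Dict[str, Any]]],
    classify: Callable[[Exception], str],
) -> Dict[str, Any]:
    """Run each surface loader; one failure never hides the others."""
    result: Dict[str, Any] = {}
    for name, load in surfaces.items():
        try:
            result[name] = load()
        except Exception as exc:  # boundary: catch broadly, classify after
            logger.exception("%s failed", name)
            result[name] = {
                "total": 0,
                "error": True,  # kept so existing callers keep working
                "error_kind": classify(exc),
            }
    return result


# Usage: skills fails with a permission problem while agents succeeds.
def broken_skills() -> Dict[str, Any]:
    raise PermissionError("denied")


overview = build_overview(
    {"skills": broken_skills, "agents": lambda: {"total": 3}},
    classify=lambda exc: "permission"
    if isinstance(exc, PermissionError)
    else "internal",
)
```

A caller that only checks overview["skills"]["error"] sees the same boolean as before; a caller that knows about error_kind can branch on it.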

Alternatives considered

  • HTTP status codes per kind. Rejected — /context/overview
    returns a partial-success envelope (skills can fail while agents
    succeed); switching to per-surface 4xx breaks the aggregation.
  • Stream errors via WebSocket. Out of scope — current
    request/response contract is fine for a low-cardinality summary.
  • Classify in the front-end via regex on error_message. Rejected
    — classification on the source side is testable; UI regex would
    drift the moment we change a Python exception message.

Open questions

  • error_message UX: surface verbatim (good debug info, leaks paths)
    or redact to a stable phrase (clean UI, less actionable)?
  • Should the taxonomy extend to detail routes (PUT /context/...)?
    Probably yes for consistency, but the surface there is smaller and
    404/409/422/500 already cover most ground.
  • Worth adding a dedicated Counter metric per error_kind for
    watchdog visibility, or is the log line enough?
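
To make the first and third questions concrete, here is a rough sketch of both error_message options and a per-kind tally. Everything in it is an assumption for discussion: the stable phrases are placeholders, error_message and record_error are hypothetical names, and collections.Counter is only an in-process stand-in for whatever metrics backend would actually be used.

```python
from collections import Counter

# Placeholder phrases (assumption): one stable, path-free string per kind.
STABLE_MESSAGES = {
    "parse": "A file could not be parsed.",
    "permission": "Read access was denied.",
    "missing": "A path or dependency is missing.",
    "internal": "Server error.",
}

# In-process tally per error_kind; a real deployment would likely export
# this through its metrics library instead.
error_kind_counts = Counter()


def error_message(kind: str, exc: Exception, *, verbatim: bool) -> str:
    """Return either the raw exception text or a stable redacted phrase."""
    if verbatim:
        return str(exc)  # actionable, but may expose filesystem paths
    return STABLE_MESSAGES.get(kind, STABLE_MESSAGES["internal"])


def record_error(kind: str) -> None:
    """Count one occurrence of the given error_kind."""
    error_kind_counts[kind] += 1
```

With verbatim=True, a PermissionError built with a filename includes that path in the message, which is the leak the first question refers to; with verbatim=False the UI gets the same phrase regardless of which file failed.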

Out of scope

  • Localization of error messages (Python side stays English; UI maps
    error_kind → i18n key).
  • Front-end "click to expand traceback in dev mode" — separate UX.


Labels: enhancement