--extract general dumps Markdown-bold content into emotional room via overly broad \*[^*]+\* regex #536

@krugdenis

Summary

The EMOTION_MARKERS list in mempalace/general_extractor.py contains a wildcard regex that matches any text wrapped in single asterisks:

EMOTION_MARKERS = [
    r"\blove\b", r"\bscared\b", r"\bafraid\b", ...
    r"\*[^*]+\*",   # <-- this one
]

This pattern matches every Markdown *italic* span and the inner *...* of every **bold** span. Assistant responses in technical conversations use bold for emphasis, command names, variable names, section headers, etc., so practically every non-trivial paragraph gets a non-zero emotional score. Classification picks max(scores), and most technical paragraphs trigger only this one marker (no decision/problem/milestone keywords), so they all land in the emotional room.
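A minimal sketch of the effect described above. Only the `\*[^*]+\*` pattern and the `max(scores)` rule come from this report; the other marker lists, the `score` and `classify` helpers, and the hit-counting scheme are hypothetical stand-ins, not the actual code in general_extractor.py.

```python
import re

# Hypothetical, trimmed-down marker lists. Only the last emotional pattern
# is taken from the report; the real lists are longer.
MARKERS = {
    "emotional": [r"\blove\b", r"\bscared\b", r"\bafraid\b", r"\*[^*]+\*"],
    "decision": [r"\bdecided\b", r"\bwe chose\b"],
    "problem": [r"\bbug\b", r"\berror\b"],
}


def score(text):
    """Count marker hits per room (assumed scoring scheme)."""
    return {
        room: sum(len(re.findall(p, text, flags=re.IGNORECASE)) for p in patterns)
        for room, patterns in MARKERS.items()
    }


def classify(text):
    """Pick the room with the highest score, or None if nothing matched."""
    scores = score(text)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


paragraph = "**Namespace:** opentelemetry, installed via the Helm chart."

# The inner *Namespace:* of the bold span is what the wildcard picks up.
print(re.findall(r"\*[^*]+\*", paragraph))
print(score(paragraph))
print(classify(paragraph))
```

With no decision or problem keywords present, the single hit from the bold label is enough to make emotional the maximum.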

Reproduction

Mined 273 Claude Code session JSONLs (entirely technical DevOps content — Kubernetes, Helm, Elasticsearch, GitLab, Jira) with:

mempalace mine C:/Users/.../.claude/projects/<project>/ \
  --mode convos --extract general --wing vibeOps

Result:

=======================================================
  Done.
  Files processed: 273
  Drawers filed: 2443

  By room:
    emotional            1615 files   <-- 66% of all drawers
    milestone            387 files
    decision             322 files
    problem              114 files
    preference           5 files
=======================================================

Spot-check of mempalace search <any> --wing vibeOps --room emotional returns results like:

  • "Sampling: Type: parentbased_traceidratio - if the trace has a parent..."
  • "Namespace: opentelemetry · Name: instrumentation"
  • Tables of Jira statuses
  • OpenTelemetry collector endpoint lists

Zero emotional content. Every one of these paragraphs only matches \*[^*]+\* via Markdown bold in the source text.

Root cause

general_extractor.py line 160 (mempalace 3.0.14):

EMOTION_MARKERS = [
    ...
    r"\*[^*]+\*",
]

This regex is too broad for any text that contains Markdown formatting. It was presumably intended to catch emotes like *whispers* or *sighs* in personal chat logs, but in developer contexts it catches every *foo* / **bar**.

Proposed fixes (in order of preference)

  1. Remove the regex entirely. Users who want emote detection can opt in via a separate marker list. \blove\b, \bscared\b, etc. are already precise enough.
  2. Require a surrounding word boundary that excludes code-like content, e.g. r"(?<!\w)\*[a-z][^*]{2,}\*(?!\w)" plus a blacklist of common programmer terms — but this is brittle.
  3. Strip Markdown before scoring (_extract_prose already exists in the file; extend it to strip **bold** / *italic* / backtick code spans before running the regex scorer).
  4. Prefer non-emotional max when scores tie (low-effort bias fix): if a paragraph scores equally on emotional and another type, pick the other type. Doesn't fix the root cause but reduces false positives.
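Options 2 and 4 can be sketched as follows. The regex is the one quoted in option 2. The sample strings and the `pick_room` helper (including its `deprioritised` parameter) are hypothetical and only illustrate the idea; they are not part of mempalace.

```python
import re

# Pattern quoted in option 2 (without the programmer-term blacklist).
TIGHTER = re.compile(r"(?<!\w)\*[a-z][^*]{2,}\*(?!\w)")

samples = {
    "emote": "*sighs* fine, let's redeploy",
    "capitalised emote": "*Sighs* fine, let's redeploy",
    "bold label": "**Namespace:** opentelemetry",
    "bold command": "run **kubectl** apply",
}
hits = {name: TIGHTER.findall(text) for name, text in samples.items()}
for name, found in hits.items():
    print(f"{name:18} {found}")


def pick_room(scores, deprioritised="emotional"):
    """Option 4: return the top-scoring room; on a tie, prefer any room
    other than `deprioritised`. Returns None if nothing scored."""
    top = max(scores.values(), default=0)
    if top <= 0:
        return None
    winners = [room for room, value in scores.items() if value == top]
    others = [room for room in winners if room != deprioritised]
    return (others or winners)[0]


print(pick_room({"emotional": 1, "decision": 1}))
print(pick_room({"emotional": 2, "decision": 1}))
```

On these samples the tighter pattern skips the capitalised bold label but also skips the capitalised emote, and it still matches the inner part of a lowercase bold command, which illustrates why option 2 is brittle. The tie-break only helps when another room scores at least as high, so paragraphs whose only hit is the wildcard still end up in emotional.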

Option (3) is probably the right long-term answer: the problem is not specific to this \*[^*]+\* marker, and any future marker that collides with Markdown syntax will hit it too.
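A rough sketch of option 3, assuming a standalone helper that runs before scoring. `strip_markdown` is a hypothetical name; how it would be wired into the existing _extract_prose is not shown here, since that function's current behaviour is not quoted in this report.

```python
import re

EMOTE = r"\*[^*]+\*"  # the marker from EMOTION_MARKERS


def strip_markdown(text):
    """Drop inline code spans and unwrap bold/italic emphasis.

    Hypothetical helper for illustration; not the existing _extract_prose.
    """
    # Inline code spans: remove entirely, their content is not prose.
    text = re.sub(r"`[^`\n]*`", " ", text)
    # **bold** -> bold
    text = re.sub(r"\*\*(.+?)\*\*", r"\1", text)
    # *italic* -> italic; requires no whitespace just inside the asterisks,
    # so expressions like "2 * 3 * 4" are left alone.
    text = re.sub(r"(?<!\w)\*(?!\s)([^*\n]+?)(?<!\s)\*(?!\w)", r"\1", text)
    return text


raw = "**Namespace:** opentelemetry, then run `helm *upgrade*`"
clean = strip_markdown(raw)

print(bool(re.search(EMOTE, raw)))    # wildcard fires on the raw text
print(bool(re.search(EMOTE, clean)))  # and no longer fires after stripping
```

One consequence worth noting: unwrapping *italic* also unwraps emotes such as *sighs*, so with this variant the wildcard marker rarely fires at all. A version that strips only bold and code spans would keep emote detection while still removing most of the false positives.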

Environment

  • mempalace 3.0.14
  • Source: 273 Claude Code conversation JSONL files (technical DevOps)
  • Extractor mode: --extract general
