fix: strip Markdown before emotion scoring to prevent false classification #543

Open
Hybirdss wants to merge 2 commits into MemPalace:develop from Hybirdss:fix/536-emotion-markdown-false-positive

Conversation

@Hybirdss

Problem

The \*[^*]+\* regex in EMOTION_MARKERS matches every Markdown bold (**text**) and italic (*text*) span, causing technical content to be classified as "emotional." A DevOps corpus of 273 Claude Code sessions produced 1,615 emotional drawers out of 2,443 total — 66% — none actually emotional (#536).

Root cause: _score_markers() runs against raw text that still contains Markdown formatting. Any paragraph with a bold word scores non-zero on emotional, and since technical paragraphs rarely match decision/problem/milestone markers, they fall through to emotional as the max scorer.
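The fall-through described above can be reproduced with a small sketch. The marker table, `score_markers()` and `classify()` below are hypothetical stand-ins, not the code in mempalace/general_extractor.py; the only element taken from the PR is the old wildcard pattern. The sketch assumes the score is a plain count of regex hits and that the highest-scoring category wins.

```python
import re
from typing import Dict, Optional

# Hypothetical, heavily reduced marker table. The real lists in
# mempalace/general_extractor.py are longer and may be structured differently.
MARKERS = {
    "decision": [r"\bdecided\b", r"\bwe chose\b"],
    "problem": [r"\bbug\b", r"\bfailed\b"],
    "emotional": [r"\blove\b", r"\*[^*]+\*"],  # old wildcard emote pattern
}

def score_markers(text: str) -> Dict[str, int]:
    """Count regex hits per category (assumed scoring rule)."""
    return {
        name: sum(len(re.findall(p, text, re.IGNORECASE)) for p in patterns)
        for name, patterns in MARKERS.items()
    }

def classify(text: str) -> Optional[str]:
    """Return the highest-scoring category, or None if nothing matched."""
    scores = score_markers(text)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

text = "Run **kubectl** apply to roll out the new manifest."
print(score_markers(text))  # {'decision': 0, 'problem': 0, 'emotional': 1}
print(classify(text))       # emotional
```

The single hit comes from the inner `*kubectl*` of the bold span. Because no other category scores at all, one hit is enough to win.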

Fix

Two changes in mempalace/general_extractor.py:

1. Replace the wildcard emote regex with a precise one.

The old r"\*[^*]+\*" matched all Markdown emphasis. The new r"(?<!\*)\*([a-z][a-z ]{0,20})\*(?!\*)" matches only single-word/phrase emotes like *sighs*, *hugs*, *laughs nervously* — but not **bold** or ***bold-italic***. Roleplay/conversational emotes are preserved; Markdown formatting no longer triggers false positives.
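A quick comparison of the two patterns quoted above, using only Python's standard `re` module. The sample strings are illustrative. The last one is an extra case not mentioned in the PR: a lowercase single-asterisk italic still matches the new pattern.

```python
import re

OLD_EMOTE = re.compile(r"\*[^*]+\*")
NEW_EMOTE = re.compile(r"(?<!\*)\*([a-z][a-z ]{0,20})\*(?!\*)")

samples = [
    "*sighs*",             # emote
    "*laughs nervously*",  # multi-word emote
    "**kubectl**",         # Markdown bold
    "***bold-italic***",   # Markdown bold-italic
    "*Sighs*",             # capitalised emote
    "*kubectl*",           # lowercase Markdown italic
]

for s in samples:
    old_hit = bool(OLD_EMOTE.search(s))
    new_hit = bool(NEW_EMOTE.search(s))
    print(f"{s!r:24} old={old_hit!s:5} new={new_hit}")
```

The old pattern matches all six samples. The new one matches `*sighs*`, `*laughs nervously*` and `*kubectl*`, so lowercase single-asterisk italics remain indistinguishable from emotes at the regex level.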

2. Add _strip_markdown() to _extract_prose().

Strips headings (## Heading), bold-italic (***text***), bold (**text**), italic (*text*), inline code (`code`), and link syntax ([text](url)) before any marker scoring. This protects all marker types (not just emotional) from Markdown interference.

Stripping order: headings → triple-asterisk → double → single → code → links.
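A minimal sketch of a stripping pipeline in the order listed above. The pattern definitions and the `strip_markdown()` signature are assumptions made for illustration; the PR's actual `_strip_markdown()` may use different expressions. The second half shows one situation where the order is visible in the result: when the single-asterisk step is left out, stripping double before triple asterisks leaves `*kubectl*` behind.

```python
import re

# Hypothetical (pattern, replacement) pairs, one per Markdown construct.
HEADING = (re.compile(r"^#{1,6}\s+", re.MULTILINE), "")
BOLD_ITALIC = (re.compile(r"\*\*\*([^*]+)\*\*\*"), r"\1")
BOLD = (re.compile(r"\*\*([^*]+)\*\*"), r"\1")
ITALIC = (re.compile(r"\*([^*]+)\*"), r"\1")
CODE = (re.compile(r"`([^`]+)`"), r"\1")
LINK = (re.compile(r"\[([^\]]+)\]\([^)]+\)"), r"\1")

DEFAULT_STEPS = (HEADING, BOLD_ITALIC, BOLD, ITALIC, CODE, LINK)

def strip_markdown(text, steps=DEFAULT_STEPS):
    """Apply each substitution in sequence; the order of steps matters."""
    for pattern, repl in steps:
        text = pattern.sub(repl, text)
    return text

sample = "## Deploy\nRun ***kubectl*** with `--dry-run`, see [docs](https://example.com)."
print(strip_markdown(sample))

# Without the ITALIC step, the order of the two bold steps changes the result.
print(strip_markdown("***kubectl***", steps=(BOLD_ITALIC, BOLD)))  # kubectl
print(strip_markdown("***kubectl***", steps=(BOLD, BOLD_ITALIC)))  # *kubectl*
```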

Before / After

Before: **kubectl** → EMOTION_MARKERS matches \*[^*]+\* → emotional (wrong)
After:  kubectl (stripped) → no emotion match → correct classification
Before: *sighs* → EMOTION_MARKERS matches \*[^*]+\* → emotional (correct)
After:  *sighs* → new emote regex matches → emotional (still correct)

Testing

  • Technical text with bold/italic → no longer classified as emotional ✓
  • Real emotional text → still classified correctly ✓
  • Roleplay emotes (*sighs*, *hugs*, *laughs nervously*) → still detected ✓
  • ***bold-italic*** → stripped correctly ✓
  • ## Heading → stripped correctly ✓

Based on the analysis in #536. Thanks to @web3guru888 for confirming the root cause and recommending the Markdown-stripping approach (Option 3).

Fixes #536.

@web3guru888 web3guru888 left a comment

Two-pronged approach here is the right call — fixing the regex and stripping Markdown before scoring is more thorough than #542's deletion-only approach.

  • Regex improvement: (?<!\*)\*([a-z][a-z ]{0,20})\*(?!\*) is correct. The negative lookbehind/lookahead properly excludes **bold** while preserving emote-style markers like *sighs*, *hugs*. Lowercase + 20-char cap is a smart scope limiter — real emotes are short and lowercase, bold spans often aren't.
  • _strip_markdown() ordering: Bold-italic stripped before bold before italic is the right precedence — prevents ***text*** leaving behind orphan * that the italic sub would then misfire on.
  • Call site ordering: Stripping after _is_code_line() filtering but before returning is correct — code blocks get filtered at the line level first, then Markdown syntax is cleaned from remaining prose before scoring. No issue there.
  • Minor edge case: The italic strip re.sub(r"\*([^*]+)\*", ...) won't accidentally match bullet points like * item text (no closing *), so that's probably fine. However, * item text * (trailing asterisk with space) technically would. Low-risk, but worth a note in comments if the corpus ever includes that style.
  • The 66% false positive rate cited from #536 is a compelling real-world number that justifies both the regex change and the strip function. Strong LGTM on the approach overall — maintainer will just need to pick between this and #542's simpler delete.
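The bullet-point edge case from the review can be checked directly, assuming the italic strip has the form `r"\*([^*]+)\*"`. The third case goes beyond what the review mentions: `[^*]` also matches newlines, so if the substitution runs over multi-line text, two consecutive `*` bullets pair up with each other.

```python
import re

# Assumed form of the italic strip discussed in the review.
ITALIC = re.compile(r"\*([^*]+)\*")

# A lone bullet has no closing asterisk, so it is left untouched.
print(repr(ITALIC.sub(r"\1", "* item text")))        # '* item text'

# A trailing asterisk closes the span and both asterisks are removed.
print(repr(ITALIC.sub(r"\1", "* item text *")))      # ' item text '

# Across lines, the second bullet marker acts as the closing asterisk.
print(repr(ITALIC.sub(r"\1", "* first\n* second")))  # ' first\n second'
```

If single-asterisk stripping is applied at all, running it line by line or excluding `\n` from the character class would avoid the third case.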

Nice fix.

@web3guru888

Great to see Option 3 land properly — the Markdown-stripping approach is the right systemic fix here.

The regex (?<!\*)\*([a-z][a-z ]{0,20})\*(?!\*) is the right tool. The negative lookbehind/lookahead isolates single-asterisk emotes without matching **bold** or ***bold-italic***. The lowercase-only constraint (a-z start character) is a sensible heuristic: real roleplay emotes (*sighs*, *hugs*) are almost always lowercase, while Markdown emphasis tends to be mixed-case. Tight scope reduces false positive risk.

_strip_markdown() as defense-in-depth — Even if the emote regex isn't perfect, stripping headings, bold-italic, bold, inline code, and links before any scoring means the regex never encounters the artifacts. Two independent layers of protection rather than one clever pattern doing all the work.

One edge case to note: *Sighs* (uppercase first letter) would NOT be detected as an emote due to the [a-z] constraint. For most conversational corpora this is fine — but if the expected input includes sentence-initial emotes like *Nods slowly*, those would be silently missed. Probably an acceptable trade-off, but worth documenting in a code comment so future contributors don't wonder why.

Stripping order: verify that single-asterisk emotes like *sighs* are NOT caught and removed by _strip_markdown() before the emote regex gets to run. If the strip order is triple → double → single and single-asterisk stripping is unconditional, you'd be removing the very patterns the emote regex is supposed to find. The PR notes the stripping targets bold-italic, bold, and italic specifically — as long as the single-asterisk strip is scoped to non-emote patterns (e.g., not lowercase-only words), we're good.
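The ordering concern can be made concrete with a small experiment. The `strip()` helper and its `strip_single` switch are hypothetical and exist only to compare the two behaviours; the emote pattern is the one from the PR description.

```python
import re

EMOTE = re.compile(r"(?<!\*)\*([a-z][a-z ]{0,20})\*(?!\*)")
BOLD_ITALIC = re.compile(r"\*\*\*([^*]+)\*\*\*")
BOLD = re.compile(r"\*\*([^*]+)\*\*")
ITALIC = re.compile(r"\*([^*]+)\*")

def strip(text, strip_single):
    """Hypothetical strip step; strip_single toggles single-asterisk removal."""
    text = BOLD_ITALIC.sub(r"\1", text)
    text = BOLD.sub(r"\1", text)
    if strip_single:
        text = ITALIC.sub(r"\1", text)
    return text

line = "*sighs* the **deploy** failed again"

print(EMOTE.findall(strip(line, strip_single=True)))   # []
print(EMOTE.findall(strip(line, strip_single=False)))  # ['sighs']
```

With unconditional single-asterisk stripping the emote is gone before the emote regex runs. Leaving single asterisks in place keeps it detectable.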

Scale of the fix: 66% emotional misclassification in a DevOps corpus is severe — essentially the rooms assignment system is broken for any technical content with Markdown. Fixing this is high-leverage. Glad the root cause identification (Markdown bold triggering the emote regex) came through the #536 thread.

Thanks for implementing this cleanly, @Hybirdss.


[MemPalace-AGI integration — production stats at https://milla-jovovich.github.io/mempalace/integrations/mempalace-agi/]

@Hybirdss
Author

Thanks for the detailed review!

Good call on the strip order question — _strip_markdown() only removes triple/double asterisks, code spans, links, and headings. Single *word* is left intact so the emote regex can still pick it up. No conflict there.

Re: uppercase emotes — yeah, *Sighs* gets missed by design. In the corpus from #536 every emote was lowercase, and matching [a-zA-Z] would start catching *React* / *Docker* as false positives. Figured that's the safer trade-off. I'll add a quick comment in the code so it's not a mystery later.
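The trade-off is easy to see side by side. `MIXED` below is a hypothetical widened variant that swaps both character classes to `[a-zA-Z]`; it is not part of the PR, and the sample sentence is made up for illustration.

```python
import re

LOWER = re.compile(r"(?<!\*)\*([a-z][a-z ]{0,20})\*(?!\*)")
# Hypothetical widened variant, shown only to illustrate the false positives.
MIXED = re.compile(r"(?<!\*)\*([a-zA-Z][a-zA-Z ]{0,20})\*(?!\*)")

text = "We moved the *React* frontend into *Docker* and he *sighs* about it. *Sighs* again."

print(LOWER.findall(text))  # ['sighs']
print(MIXED.findall(text))  # ['React', 'Docker', 'sighs', 'Sighs']
```

The widened pattern recovers `*Sighs*` but also reports both proper nouns, which is the behaviour the lowercase-only scope is meant to avoid.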

@web3guru888 web3guru888 left a comment

Confirmed: leaving single asterisks intact for the emote regex is the right order, and the [a-z] vs [a-zA-Z] trade-off is well-reasoned. *React* / *Docker* being false positives is exactly the right thing to avoid in a DevOps corpus. Adding the code comment covers the future contributor question. Approved.

@web3guru888

Makes sense on both counts. The strip order being upstream of the emote regex was the key question — glad it's intentional and clean.

On uppercase: the React/Docker false positive risk is real. The lowercase-only scope is the right call for a regex approach. If uppercase emotes ever become a use-case (e.g., logs from specific LLM personas that capitalize them), a separate pass or opt-in flag would be the way to add it without broadening the main pattern.
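One way the opt-in idea could look, as a sketch only. `find_emotes()`, its `include_capitalised` parameter and the `EMOTE_CAPITALISED` pattern are all hypothetical names introduced here; nothing like them exists in the PR.

```python
import re

EMOTE_LOWER = re.compile(r"(?<!\*)\*([a-z][a-z ]{0,20})\*(?!\*)")
# Hypothetical opt-in pattern: one capital letter, then lowercase letters or spaces.
EMOTE_CAPITALISED = re.compile(r"(?<!\*)\*([A-Z][a-z ]{0,20})\*(?!\*)")

def find_emotes(text, include_capitalised=False):
    """Return emote bodies; capitalised emotes only when explicitly requested."""
    found = EMOTE_LOWER.findall(text)
    if include_capitalised:
        found += EMOTE_CAPITALISED.findall(text)
    return found

text = "*Nods slowly* then *sighs*"
print(find_emotes(text))                            # ['sighs']
print(find_emotes(text, include_capitalised=True))  # ['sighs', 'Nods slowly']
```

The default behaviour stays unchanged. Enabling the flag would also match `*React*` and `*Docker*`, so it only suits corpora where capitalised italics are known to be emotes.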

Adding the comment in code is good practice. LGTM.

@Hybirdss
Author

Appreciate the thorough review and the edge case callouts. Comment pushed, ready for maintainer review.

@Hybirdss
Author

Makes sense — removing it from EMOTION_MARKERS directly is simpler than fixing the regex. Closing in favor of #562.

@Hybirdss Hybirdss closed this Apr 10, 2026
@Hybirdss Hybirdss reopened this Apr 11, 2026
@bensig bensig changed the base branch from main to develop April 11, 2026 22:21
…ation

The `\*[^*]+\*` regex in EMOTION_MARKERS matched every Markdown bold
and italic span, causing 66% of technical drawers to land in the
"emotional" room. A DevOps corpus of 273 Claude Code sessions produced
1,615 emotional drawers out of 2,443 total — none actually emotional.

Two changes:

1. Remove the `\*[^*]+\*` wildcard regex from EMOTION_MARKERS. The
   remaining 17 word-boundary patterns (\blove\b, \bscared\b, etc.)
   are specific enough for genuine emotion detection.

2. Add `_strip_markdown()` to `_extract_prose()` so bold, italic,
   inline code, and link syntax are stripped before any marker scoring.
   This prevents Markdown formatting from interfering with all marker
   types, not just emotional.

Before: `**kubectl**` scores 1 on emotional via `\*[^*]+\*`.
After:  `kubectl` (stripped) scores 0 on emotional. Correct.

Real emotional text ("I love this project, I am so proud") still
classifies correctly — tested.

Fixes MemPalace#536.
Explains why *Sighs* (uppercase) is intentionally excluded:
matching [a-zA-Z] would false-positive on Markdown italic
around proper nouns like *React* and *Docker*.
@Hybirdss Hybirdss force-pushed the fix/536-emotion-markdown-false-positive branch from 8908eef to e8a2366 Compare April 11, 2026 22:27
@Hybirdss
Author

Reopened and rebased on main — this is the standalone fix for #536 (1 file, 26 lines). #562 was closed for splitting so this is the active PR for this issue.

@igorls igorls added the bug Something isn't working label Apr 14, 2026
@Hybirdss
Author

Active fix for #536 — 1 file, 26 lines in mempalace/general_extractor.py. Approved by web3guru888 on 2026-04-10. The 66% emotional misclassification rate on the DevOps corpus (1,615/2,443 drawers) means any technical content with Markdown bold/italic lands in the wrong room. Is there a maintainer who can take a look?

Development

Successfully merging this pull request may close these issues.

--extract general dumps Markdown-bold content into emotional room via overly broad \*[^*]+\* regex

3 participants