fix: strip Markdown before emotion scoring to prevent false classification by Hybirdss · Pull Request #543 · MemPalace/mempalace

Hybirdss · 2026-04-10T15:11:08Z

Problem

The \*[^*]+\* regex in EMOTION_MARKERS matches every Markdown bold (**text**) and italic (*text*) span, causing technical content to be classified as "emotional." A DevOps corpus of 273 Claude Code sessions produced 1,615 emotional drawers out of 2,443 total — 66% — none actually emotional (#536).

Root cause: _score_markers() runs against raw text that still contains Markdown formatting. Any paragraph with a bold word scores non-zero on emotional, and since technical paragraphs rarely match decision/problem/milestone markers, they fall through to emotional as the max scorer.

Fix

Two changes in mempalace/general_extractor.py:

1. Replace the wildcard emote regex with a precise one.

The old r"\*[^*]+\*" matched all Markdown emphasis. The new r"(?<!\*)\*([a-z][a-z ]{0,20})\*(?!\*)" matches only single-word/phrase emotes like *sighs*, *hugs*, *laughs nervously* — but not **bold** or ***bold-italic***. Roleplay/conversational emotes are preserved; Markdown formatting no longer triggers false positives.

2. Add _strip_markdown() to _extract_prose().

Strips headings (## Heading), bold-italic (***text***), bold (**text**), italic (*text*), inline code (`code`), and link syntax ([text](url)) before any marker scoring. This protects all marker types (not just emotional) from Markdown interference.

Stripping order: headings → triple-asterisk → double → single → code → links.

Before / After

Before: **kubectl** → EMOTION_MARKERS matches \*[^*]+\* → emotional (wrong)
After:  kubectl (stripped) → no emotion match → correct classification

Before: *sighs* → EMOTION_MARKERS matches \*[^*]+\* → emotional (correct)
After:  *sighs* → new emote regex matches → emotional (still correct)

Testing

Technical text with bold/italic → no longer classified as emotional ✓
Real emotional text → still classified correctly ✓
Roleplay emotes (*sighs*, *hugs*, *laughs nervously*) → still detected ✓
***bold-italic*** → stripped correctly ✓
## Heading → stripped correctly ✓

Based on the analysis in #536. Thanks to @web3guru888 for confirming the root cause and recommending the Markdown-stripping approach (Option 3).

Fixes #536.

web3guru888

Two-pronged approach here is the right call — fixing the regex and stripping Markdown before scoring is more thorough than #542's deletion-only approach.

Regex improvement: (?<!\*)\*([a-z][a-z ]{0,20})\*(?!\*) is correct. The negative lookbehind/lookahead properly excludes **bold** while preserving emote-style markers like *sighs*, *hugs*. Lowercase + 20-char cap is a smart scope limiter — real emotes are short and lowercase, bold spans often aren't.
_strip_markdown() ordering: Bold-italic stripped before bold before italic is the right precedence — prevents ***text*** leaving behind orphan * that the italic sub would then misfire on.
Call site ordering: Stripping after _is_code_line() filtering but before returning is correct — code blocks get filtered at the line level first, then Markdown syntax is cleaned from remaining prose before scoring. No issue there.
Minor edge case: The italic strip re.sub(r"\\*([^*]+)\\*", ...) won't accidentally match bullet points like * item text (no closing *), so that's probably fine — but * item text * (trailing asterisk with space) technically would. Low-risk but worth a note in comments if the corpus ever includes that style.
The 66% false positive rate cited from #536 is a compelling real-world number that justifies both the regex change and the strip function. Strong LGTM on the approach overall — maintainer will just need to pick between this and #542's simpler delete.

Nice fix.

web3guru888 · 2026-04-10T16:34:05Z

Great to see Option 3 land properly — the Markdown-stripping approach is the right systemic fix here.

The regex — (?<!\*)\*([a-z][a-z ]{0,20})\*(?!\*) is the right tool. The negative lookbehind/lookahead isolates single-asterisk emotes without matching **bold** or ***bold-italic***. The lowercase-only constraint (a-z start character) is a sensible heuristic: real roleplay emotes (*sighs*, *hugs*) are almost always lowercase, while Markdown emphasis tends to be mixed-case. Tight scope reduces false positive risk.

_strip_markdown() as defense-in-depth — Even if the emote regex isn't perfect, stripping headings, bold-italic, bold, inline code, and links before any scoring means the regex never encounters the artifacts. Two independent layers of protection rather than one clever pattern doing all the work.

One edge case to note: *Sighs* (uppercase first letter) would NOT be detected as an emote due to the [a-z] constraint. For most conversational corpora this is fine — but if the expected input includes sentence-initial emotes like *Nods slowly*, those would be silently missed. Probably an acceptable trade-off, but worth documenting in a code comment so future contributors don't wonder why.

Stripping order: verify that single-asterisk emotes like *sighs* are NOT caught and removed by _strip_markdown() before the emote regex gets to run. If the strip order is triple → double → single and single-asterisk stripping is unconditional, you'd be removing the very patterns the emote regex is supposed to find. The PR notes the stripping targets bold-italic, bold, and italic specifically — as long as the single-asterisk strip is scoped to non-emote patterns (e.g., not lowercase-only words), we're good.

Scale of the fix: 66% emotional misclassification in a DevOps corpus is severe — essentially the rooms assignment system is broken for any technical content with Markdown. Fixing this is high-leverage. Glad the root cause identification (Markdown bold triggering the emote regex) came through the #536 thread.

Thanks for implementing this cleanly, @Hybirdss.

[MemPalace-AGI integration — production stats at https://milla-jovovich.github.io/mempalace/integrations/mempalace-agi/]

Hybirdss · 2026-04-10T17:03:17Z

Thanks for the detailed review!

Good call on the strip order question — _strip_markdown() only removes triple/double asterisks, code spans, links, and headings. Single *word* is left intact so the emote regex can still pick it up. No conflict there.

Re: uppercase emotes — yeah, *Sighs* gets missed by design. In the corpus from #536 every emote was lowercase, and matching [a-zA-Z] would start catching *React* / *Docker* as false positives. Figured that's the safer trade-off. I'll add a quick comment in the code so it's not a mystery later.

web3guru888

Confirmed — single-asterisk left intact for the emote regex is the right order, and the vs trade-off is well-reasoned. / being false positives is exactly the right thing to avoid in a DevOps corpus. Adding the code comment covers the future contributor question. Approved.

web3guru888 · 2026-04-10T17:34:13Z

Makes sense on both counts. The strip order being upstream of the emote regex was the key question — glad it's intentional and clean.

On uppercase: the React/Docker false positive risk is real. The lowercase-only scope is the right call for a regex approach. If uppercase emotes ever become a use-case (e.g., logs from specific LLM personas that capitalize them), a separate pass or opt-in flag would be the way to add it without broadening the main pattern.

Adding the comment in code is good practice. LGTM.

Hybirdss · 2026-04-10T17:55:43Z

Appreciate the thorough review and the edge case callouts. Comment pushed, ready for maintainer review.

Hybirdss · 2026-04-10T22:08:21Z

Makes sense — removing it from EMOTION_MARKERS directly is simpler than fixing the regex. Closing in favor of #562.

…ation The `\*[^*]+\*` regex in EMOTION_MARKERS matched every Markdown bold and italic span, causing 66% of technical drawers to land in the "emotional" room. A DevOps corpus of 273 Claude Code sessions produced 1,615 emotional drawers out of 2,443 total — none actually emotional. Two changes: 1. Remove the `\*[^*]+\*` wildcard regex from EMOTION_MARKERS. The remaining 17 word-boundary patterns (\blove\b, \bscared\b, etc.) are specific enough for genuine emotion detection. 2. Add `_strip_markdown()` to `_extract_prose()` so bold, italic, inline code, and link syntax are stripped before any marker scoring. This prevents Markdown formatting from interfering with all marker types, not just emotional. Before: `**kubectl**` scores 1 on emotional via `\*[^*]+\*`. After: `kubectl` (stripped) scores 0 on emotional. Correct. Real emotional text ("I love this project, I am so proud") still classifies correctly — tested. Fixes MemPalace#536.

Explains why *Sighs* (uppercase) is intentionally excluded: matching [a-zA-Z] would false-positive on Markdown italic around proper nouns like *React* and *Docker*.

Hybirdss · 2026-04-11T22:27:51Z

Reopened and rebased on main — this is the standalone fix for #536 (1 file, 26 lines). #562 was closed for splitting so this is the active PR for this issue.

Hybirdss · 2026-04-24T07:28:52Z

Active fix for #536 — 1 file, 26 lines in mempalace/general_extractor.py. Approved by web3guru888 on 2026-04-10. The 66% emotional misclassification rate on the DevOps corpus (1,615/2,443 drawers) means any technical content with Markdown bold/italic lands in the wrong room. Is there a maintainer who can take a look?

Hybirdss mentioned this pull request Apr 10, 2026

--extract general dumps Markdown-bold content into emotional room via overly broad \*[^*]+\* regex #536

Closed

web3guru888 reviewed Apr 10, 2026

View reviewed changes

web3guru888 mentioned this pull request Apr 10, 2026

fix: security hardening, test expansion (640→719), and 5 open issue fixes (#535, #536, #521, #478, #531) #542

Closed

web3guru888 approved these changes Apr 10, 2026

View reviewed changes

Hybirdss requested review from bensig and milla-jovovich as code owners April 10, 2026 17:04

Hybirdss closed this Apr 10, 2026

Hybirdss reopened this Apr 11, 2026

bensig changed the base branch from main to develop April 11, 2026 22:21

Hybirdss added 2 commits April 12, 2026 07:26

Document lowercase-only constraint on emote regex (MemPalace#536)

e8a2366

Explains why *Sighs* (uppercase) is intentionally excluded: matching [a-zA-Z] would false-positive on Markdown italic around proper nouns like *React* and *Docker*.

Hybirdss force-pushed the fix/536-emotion-markdown-false-positive branch from 8908eef to e8a2366 Compare April 11, 2026 22:27

igorls added the bug Something isn't working label Apr 14, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: strip Markdown before emotion scoring to prevent false classification#543

fix: strip Markdown before emotion scoring to prevent false classification#543
Hybirdss wants to merge 2 commits intoMemPalace:developfrom
Hybirdss:fix/536-emotion-markdown-false-positive

Hybirdss commented Apr 10, 2026

Uh oh!

web3guru888 left a comment

Uh oh!

web3guru888 commented Apr 10, 2026

Uh oh!

Hybirdss commented Apr 10, 2026

Uh oh!

web3guru888 left a comment

Uh oh!

web3guru888 commented Apr 10, 2026

Uh oh!

Hybirdss commented Apr 10, 2026

Uh oh!

Hybirdss commented Apr 10, 2026

Uh oh!

Hybirdss commented Apr 11, 2026

Uh oh!

Hybirdss commented Apr 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

Hybirdss commented Apr 10, 2026

Problem

Fix

Before / After

Testing

Uh oh!

web3guru888 left a comment

Choose a reason for hiding this comment

Uh oh!

web3guru888 commented Apr 10, 2026

Uh oh!

Hybirdss commented Apr 10, 2026

Uh oh!

web3guru888 left a comment

Choose a reason for hiding this comment

Uh oh!

web3guru888 commented Apr 10, 2026

Uh oh!

Hybirdss commented Apr 10, 2026

Uh oh!

Hybirdss commented Apr 10, 2026

Uh oh!

Hybirdss commented Apr 11, 2026

Uh oh!

Hybirdss commented Apr 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants