fix: strip Markdown before emotion scoring to prevent false classification #543
Hybirdss wants to merge 2 commits into MemPalace:develop from
Conversation
web3guru888
left a comment
Two-pronged approach here is the right call — fixing the regex and stripping Markdown before scoring is more thorough than #542's deletion-only approach.
- Regex improvement: `(?<!\*)\*([a-z][a-z ]{0,20})\*(?!\*)` is correct. The negative lookbehind/lookahead properly excludes `**bold**` while preserving emote-style markers like `*sighs*`, `*hugs*`. Lowercase + 20-char cap is a smart scope limiter: real emotes are short and lowercase, bold spans often aren't.
- `_strip_markdown()` ordering: Bold-italic stripped before bold before italic is the right precedence. It prevents `***text***` from leaving behind an orphan `*` that the italic sub would then misfire on.
- Call site ordering: Stripping after `_is_code_line()` filtering but before returning is correct. Code blocks get filtered at the line level first, then Markdown syntax is cleaned from the remaining prose before scoring. No issue there.
- Minor edge case: The italic strip `re.sub(r"\*([^*]+)\*", ...)` won't accidentally match bullet points like `* item text` (no closing `*`), so that's probably fine, but `* item text *` (trailing asterisk with space) technically would. Low-risk but worth a note in comments if the corpus ever includes that style.
- The 66% false positive rate cited from #536 is a compelling real-world number that justifies both the regex change and the strip function.

Strong LGTM on the approach overall. The maintainer will just need to pick between this and #542's simpler delete.
Nice fix.
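The lookaround behaviour discussed in the review can be checked with a few lines of Python. This is a minimal sketch that uses only the two patterns quoted in this thread; the names `OLD_EMOTE` and `NEW_EMOTE` are illustrative and are not identifiers from the repository.

```python
import re

# Patterns quoted in this PR; the variable names are hypothetical.
OLD_EMOTE = re.compile(r"\*[^*]+\*")
NEW_EMOTE = re.compile(r"(?<!\*)\*([a-z][a-z ]{0,20})\*(?!\*)")

samples = [
    "*sighs*",             # lowercase emote
    "*laughs nervously*",  # short lowercase phrase
    "**kubectl**",         # Markdown bold
    "***bold-italic***",   # Markdown bold-italic
    "*React*",             # italic proper noun (uppercase first letter)
]

for text in samples:
    old_hit = bool(OLD_EMOTE.search(text))
    new_hit = bool(NEW_EMOTE.search(text))
    print(f"{text!r}: old={old_hit} new={new_hit}")
```

The old pattern hits every sample, while the new one hits only the two lowercase emotes.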
Great to see Option 3 land properly. The Markdown-stripping approach is the right systemic fix here.

The regex:

`_strip_markdown()` as defense-in-depth: Even if the emote regex isn't perfect, stripping headings, bold-italic, bold, inline code, and links before any scoring means the regex never encounters the artifacts. Two independent layers of protection rather than one clever pattern doing all the work.

One edge case to note: Stripping order: verify that single-asterisk emotes like

Scale of the fix: 66% emotional misclassification in a DevOps corpus is severe; essentially the rooms assignment system is broken for any technical content with Markdown. Fixing this is high-leverage.

Glad the root cause identification (Markdown bold triggering the emote regex) came through the #536 thread. Thanks for implementing this cleanly, @Hybirdss.

[MemPalace-AGI integration, production stats at https://milla-jovovich.github.io/mempalace/integrations/mempalace-agi/]
Thanks for the detailed review! Good call on the strip order question. Re: uppercase emotes: yeah,
web3guru888
left a comment
Confirmed: single-asterisk left intact for the emote regex is the right order, and the `[a-z]` vs `[a-zA-Z]` trade-off is well-reasoned. `*React*`/`*Docker*` being false positives is exactly the right thing to avoid in a DevOps corpus. Adding the code comment covers the future contributor question. Approved.
Makes sense on both counts. The strip order being upstream of the emote regex was the key question; glad it's intentional and clean. On uppercase: the React/Docker false positive risk is real. The lowercase-only scope is the right call for a regex approach. If uppercase emotes ever become a use-case (e.g., logs from specific LLM personas that capitalize them), a separate pass or opt-in flag would be the way to add it without broadening the main pattern. Adding the comment in code is good practice. LGTM.
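One way the opt-in flag mentioned above could look is sketched here. The helper `build_emote_pattern` and its `allow_uppercase` parameter are hypothetical and do not exist in the PR; the sketch only reuses the lowercase pattern quoted in this thread and toggles `re.IGNORECASE`.

```python
import re

def build_emote_pattern(allow_uppercase: bool = False) -> re.Pattern:
    """Hypothetical helper: the PR's emote regex, optionally case-insensitive."""
    flags = re.IGNORECASE if allow_uppercase else 0
    return re.compile(r"(?<!\*)\*([a-z][a-z ]{0,20})\*(?!\*)", flags)

default_pattern = build_emote_pattern()                     # lowercase only
opt_in_pattern = build_emote_pattern(allow_uppercase=True)  # also matches *Sighs*

for text in ["*sighs*", "*Sighs*", "*React*", "**Docker**"]:
    print(
        f"{text!r}: default={bool(default_pattern.search(text))} "
        f"opt_in={bool(opt_in_pattern.search(text))}"
    )
```

With the flag enabled, `*Sighs*` is detected but so is `*React*`, which is the trade-off that keeps the default lowercase-only.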
Appreciate the thorough review and the edge case callouts. Comment pushed, ready for maintainer review.
Makes sense. Removing it from EMOTION_MARKERS directly is simpler than fixing the regex. Closing in favor of #562.
…ation
The `\*[^*]+\*` regex in EMOTION_MARKERS matched every Markdown bold
and italic span, causing 66% of technical drawers to land in the
"emotional" room. A DevOps corpus of 273 Claude Code sessions produced
1,615 emotional drawers out of 2,443 total — none actually emotional.
Two changes:
1. Remove the `\*[^*]+\*` wildcard regex from EMOTION_MARKERS. The
remaining 17 word-boundary patterns (\blove\b, \bscared\b, etc.)
are specific enough for genuine emotion detection.
2. Add `_strip_markdown()` to `_extract_prose()` so bold, italic,
inline code, and link syntax are stripped before any marker scoring.
This prevents Markdown formatting from interfering with all marker
types, not just emotional.
Before: `**kubectl**` scores 1 on emotional via `\*[^*]+\*`.
After: `kubectl` (stripped) scores 0 on emotional. Correct.
Real emotional text ("I love this project, I am so proud") still
classifies correctly — tested.
Fixes MemPalace#536.
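
The before/after described in this commit message can be reproduced with a toy scorer. This is a minimal sketch under stated assumptions: the `MARKERS` table is a hypothetical, heavily reduced stand-in for the real marker lists, and counting regex hits per category is an assumed scoring rule, not the code in `general_extractor.py`.

```python
import re

# Hypothetical, reduced marker table. Only \blove\b, \bscared\b and the
# wildcard are taken from this PR; the other patterns are made up.
MARKERS = {
    "decision": [r"\bdecided\b", r"\bwe chose\b"],
    "problem": [r"\bbug\b", r"\berror\b"],
    "emotional": [r"\blove\b", r"\bscared\b", r"\*[^*]+\*"],  # old wildcard last
}

def score(text: str, markers: dict) -> dict:
    """Count regex hits per category (assumed scoring rule)."""
    return {
        name: sum(len(re.findall(p, text, re.IGNORECASE)) for p in patterns)
        for name, patterns in markers.items()
    }

text = "Run **kubectl** apply to roll out the new manifest."

before = score(text, MARKERS)
# Same table with the wildcard removed from the emotional markers.
fixed_markers = dict(MARKERS, emotional=MARKERS["emotional"][:-1])
after = score(text, fixed_markers)

print("before:", before, "-> top:", max(before, key=before.get))
print("after: ", after)
```

With the wildcard present, the only non-zero score for this technical sentence is the emotional one, so it wins by default; without it, every category scores zero.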
Explains why *Sighs* (uppercase) is intentionally excluded: matching [a-zA-Z] would false-positive on Markdown italic around proper nouns like *React* and *Docker*.
8908eef to e8a2366
Active fix for #536: 1 file, 26 lines in
Problem
The `\*[^*]+\*` regex in `EMOTION_MARKERS` matches every Markdown bold (`**text**`) and italic (`*text*`) span, causing technical content to be classified as "emotional." A DevOps corpus of 273 Claude Code sessions produced 1,615 emotional drawers out of 2,443 total (66%), none actually emotional (#536).

Root cause: `_score_markers()` runs against raw text that still contains Markdown formatting. Any paragraph with a bold word scores non-zero on emotional, and since technical paragraphs rarely match decision/problem/milestone markers, they fall through to emotional as the max scorer.

Fix
Two changes in `mempalace/general_extractor.py`:

1. Replace the wildcard emote regex with a precise one. The old `r"\*[^*]+\*"` matched all Markdown emphasis. The new `r"(?<!\*)\*([a-z][a-z ]{0,20})\*(?!\*)"` matches only single-word/phrase emotes like `*sighs*`, `*hugs*`, `*laughs nervously*`, but not `**bold**` or `***bold-italic***`. Roleplay/conversational emotes are preserved; Markdown formatting no longer triggers false positives.
2. Add `_strip_markdown()` to `_extract_prose()`. Strips headings (`## Heading`), bold-italic (`***text***`), bold (`**text**`), italic (`*text*`), inline code (`` `code` ``), and link syntax (`[text](url)`) before any marker scoring. This protects all marker types (not just emotional) from Markdown interference.

Stripping order: headings → triple-asterisk → double → single → code → links.
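
The stripping order listed above can be sketched as a chain of `re.sub` calls. This is an illustrative stand-in, not the PR's actual `_strip_markdown()`: the function name `strip_markdown` and every pattern below are assumptions chosen to follow the described order.

```python
import re

def strip_markdown(text: str) -> str:
    """Illustrative stand-in for _strip_markdown(); patterns are assumptions."""
    text = re.sub(r"^#{1,6}\s+", "", text, flags=re.MULTILINE)  # headings
    text = re.sub(r"\*\*\*([^*]+)\*\*\*", r"\1", text)          # bold-italic
    text = re.sub(r"\*\*([^*]+)\*\*", r"\1", text)              # bold
    text = re.sub(r"\*([^*]+)\*", r"\1", text)                  # italic
    text = re.sub(r"`([^`]+)`", r"\1", text)                    # inline code
    text = re.sub(r"\[([^\]]+)\]\([^)]+\)", r"\1", text)        # links
    return text

sample = "## Deploy **kubectl** with `helm` ([docs](https://example.com))"
print(strip_markdown(sample))
```

Applied literally, the single-asterisk step would also unwrap an emote such as `*sighs*`. The review thread indicates that single-asterisk emotes are left intact for the emote regex, which this sketch does not attempt to model.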
Before / After
Testing
- Emotes (`*sighs*`, `*hugs*`, `*laughs nervously*`) → still detected ✓
- `***bold-italic***` → stripped correctly ✓
- `## Heading` → stripped correctly ✓

Based on the analysis in #536. Thanks to @web3guru888 for confirming the root cause and recommending the Markdown-stripping approach (Option 3).
Fixes #536.