fix(sanitize): strip GLM/DeepSeek special tokens from user-facing text by imwyvern · Pull Request #40573 · openclaw/openclaw

imwyvern · 2026-03-09T04:16:51Z

Problem

Solution

Add MODEL_SPECIAL_TOKEN_RE (/<\|[^|]*\|>/g) to stripFinalTagsFromText() which is called by sanitizeUserFacingText() — the central sanitization function for all outbound messages across all surfaces.
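A minimal sketch of the approach described above. The wildcard pattern and the function body are taken from this PR; `FINAL_TAG_RE` is a stand-in written for illustration, since its real definition is not shown in this thread.

```typescript
// Wildcard pattern quoted in the PR description:
// "<|", then any run of non-pipe characters, then "|>".
const MODEL_SPECIAL_TOKEN_RE = /<\|[^|]*\|>/g;

// Assumption: placeholder for the existing <final> tag pattern.
// The real regex in errors.ts is not quoted in this thread.
const FINAL_TAG_RE = /<\s*\/?\s*final\s*>/gi;

function stripFinalTagsFromText(text: string): string {
  if (!text) {
    return text;
  }
  // Strip <final> tags first, then any leaked <|...|> delimiter tokens.
  return text.replace(FINAL_TAG_RE, "").replace(MODEL_SPECIAL_TOKEN_RE, "");
}

// Leaked delimiters are removed; ordinary angle brackets are left alone.
const cleaned = stripFinalTagsFromText("Hello<|user|> world");
const untouched = stripFinalTagsFromText("a < b and c > d");
```

Under these assumptions, `cleaned` is `"Hello world"` and `untouched` is returned unchanged.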

Changes

src/agents/pi-embedded-helpers/errors.ts: Added regex and applied it in stripFinalTagsFromText()
src/agents/pi-embedded-helpers.sanitizeuserfacingtext.test.ts: Added 5 test cases covering GLM tokens, DeepSeek tokens, multiple tokens, and ensuring normal angle brackets are preserved

Testing

All 66 tests pass (vitest run).

Fixes #40020

greptile-apps · 2026-03-09T04:18:41Z

Greptile Summary

The regex is safe from ReDoS: [^|]* cannot create ambiguous paths for the engine, so matching is always linear.
Using String.prototype.replace() with a module-level global regex is safe — replace() always resets lastIndex before scanning, so there is no statefulness hazard.
Five new test cases adequately cover the happy paths (GLM tokens, DeepSeek tokens, multiple tokens) and the negative cases (normal angle brackets and HTML-like tags are preserved).
Minor style note: the helper function stripFinalTagsFromText now serves a broader purpose than its name implies; a rename or an inline comment could help future readers understand its full scope.
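The `lastIndex` point above can be illustrated with a short sketch. The variable names are made up for this example; only the regex comes from the PR. `replace()` with a global regex always scans from index 0, whereas `test()` on the same regex object carries state between calls.

```typescript
const TOKEN_RE = /<\|[^|]*\|>/g;
const sample = "<|user|>hi";

// replace() ignores a stale lastIndex when the regex is global.
TOKEN_RE.lastIndex = 5;
const replaced = sample.replace(TOKEN_RE, "");
const lastIndexAfterReplace = TOKEN_RE.lastIndex;

// test() is stateful on a global regex: the second call resumes
// after the first match and therefore finds nothing.
const firstTest = TOKEN_RE.test(sample);
const lastIndexAfterFirstTest = TOKEN_RE.lastIndex;
const secondTest = TOKEN_RE.test(sample);
```

Here `replaced` is `"hi"` even though `lastIndex` was set to 5 beforehand, while `firstTest` is `true` and `secondTest` is `false`. This is why later revisions in this thread that call `.test()` also reset `lastIndex`.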

Confidence Score: 5/5

This PR is safe to merge — it adds a well-scoped regex strip with no side-effects on existing behaviour.
The change is two lines of production code (a new regex constant and one additional .replace() call) backed by five targeted test cases. The regex is not susceptible to ReDoS, is used only via .replace() (no lastIndex hazard), and is confirmed not to touch normal angle brackets or HTML tags. No existing tests are modified, and all 66 tests reportedly pass.
No files require special attention.

Last reviewed commit: d9d5054

greptile-apps · 2026-03-09T04:18:44Z

src/agents/pi-embedded-helpers/errors.ts

  if (!text) {
    return text;
  }
-  return text.replace(FINAL_TAG_RE, "");
+  return text.replace(FINAL_TAG_RE, "").replace(MODEL_SPECIAL_TOKEN_RE, "");


Function name no longer reflects full scope

stripFinalTagsFromText now strips both <final> tags and <|...|> model special tokens. The name stays accurate enough for internal use, but it may mislead future contributors who see a call to this function and assume it only deals with <final> tags. Consider renaming to something like stripSpecialTokensFromText or adding a brief comment at the call site to document the expanded responsibility.


chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d9d5054fa0

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-03-09T04:20:11Z

src/agents/pi-embedded-helpers/errors.ts

+ *
+ * @see https://github.com/openclaw/openclaw/issues/40020
+ */
+const MODEL_SPECIAL_TOKEN_RE = /<\|[^|]*\|>/g;


Narrow special-token regex to real model delimiters

MODEL_SPECIAL_TOKEN_RE now removes any <|...|> span, not just leaked model control tokens, so sanitizeUserFacingText() will silently delete legitimate content that uses <|/|> syntax (for example, code examples discussing F#/Elm pipeline operators or literal token strings). Because this sanitizer is applied to normal assistant output paths, users can get truncated or corrupted answers even when no model delimiter leak occurred.
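The false positive described above can be reproduced with a small sketch. The sample sentence is invented for illustration; the regex is the wildcard from the diff.

```typescript
const MODEL_SPECIAL_TOKEN_RE = /<\|[^|]*\|>/g;

// A legitimate explanation that happens to contain "<|" followed later by "|>".
const explanation = "In F#, f <| x applies f to x, and x |> f pipes x into f.";

// Everything between the backward pipe and the forward pipe is deleted.
const stripped = explanation.replace(MODEL_SPECIAL_TOKEN_RE, "");
```

`stripped` comes out as `"In F#, f  f pipes x into f."`, so the middle of the sentence is lost even though no model delimiter leaked.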

imwyvern · 2026-03-09T05:14:53Z

Thanks for the review feedback! Pushed fixes in 744aee2ec:

Narrowed regex — Replaced the broad <|...|> wildcard with an explicit allowlist of known model special tokens (endoftext, im_start, im_end, user, assistant, system, observation, EOT, begin▁of▁sentence, end▁of▁sentence, tool_call). This prevents false positives on legitimate content like F#/Elm pipeline operators.
Renamed function — stripFinalTagsFromText → stripSpecialMarkupFromText to reflect its expanded responsibility. Updated all call sites.
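A sketch of the narrowed pattern, built from the token names listed in the comment above. The helper name `stripAllowlistedTokens` is hypothetical and used only for this example; the `<final>` handling is left out.

```typescript
// Allowlist assembled from the tokens named in the comment above.
const MODEL_SPECIAL_TOKEN_RE =
  /<\|(?:endoftext|im_start|im_end|user|assistant|system|observation|EOT|begin▁of▁sentence|end▁of▁sentence|tool_call)\|>/g;

// Hypothetical helper covering only the special-token step.
function stripAllowlistedTokens(text: string): string {
  return text.replace(MODEL_SPECIAL_TOKEN_RE, "");
}

// Known tokens are removed.
const leaked = stripAllowlistedTokens("Hi<|im_end|>");

// Pipeline operators and unknown <|...|> spans are now preserved.
const fsharp = "In F#, f <| x applies f to x, and x |> f pipes x into f.";
const preserved = stripAllowlistedTokens(fsharp);
```

With this list, `leaked` is `"Hi"` and `preserved` equals the original sentence. The cost is that any token missing from the list passes through untouched.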

All tests pass. ✅

chatgpt-codex-connector

💡 Codex Review

Reviewed commit: 744aee2ec0

chatgpt-codex-connector · 2026-03-09T05:17:58Z

src/agents/pi-embedded-helpers/errors.ts

    return text;
  }
-  return text.replace(FINAL_TAG_RE, "");
+  return text.replace(FINAL_TAG_RE, "").replace(MODEL_SPECIAL_TOKEN_RE, "");


Avoid stripping token literals from valid assistant explanations

The new global replacement in stripSpecialMarkupFromText removes every allowlisted token literal from all outbound text, so normal responses that intentionally mention these strings (for example, explaining what <|assistant|> means) are silently corrupted. Because sanitizeUserFacingText runs on regular non-error reply paths, this introduces user-visible truncation even when no delimiter leak occurred; token stripping should be scoped to leak-like contexts rather than unconditional text replacement.

imwyvern · 2026-03-09T16:17:16Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Reviewed commit: 744aee2ec0

chatgpt-codex-connector · 2026-03-09T16:21:14Z

src/agents/pi-embedded-helpers/errors.ts

+ * @see https://github.com/openclaw/openclaw/issues/40020
+ */
+const MODEL_SPECIAL_TOKEN_RE =
+  /<\|(?:endoftext|im_start|im_end|user|assistant|system|observation|EOT|begin▁of▁sentence|end▁of▁sentence|tool_list_start|tool_list_end|tool_call_start|tool_call_end|tool_call_result_begin|tool_call_result_end|tool_content_start|tool_content_end)\|>/g;


Match full-width DeepSeek special tokens in sanitizer

MODEL_SPECIAL_TOKEN_RE only matches ASCII-delimited lowercase tokens (e.g. <|assistant|>), but DeepSeek commonly emits control tokens with full-width delimiters and role-case variants such as <｜Assistant｜> / <｜begin▁of▁sentence｜>. In that scenario sanitizeUserFacingText() will still leak raw model delimiters to end users, so the core regression this commit addresses remains unfixed for real DeepSeek outputs.

Models like GLM-5 and DeepSeek leak special delimiter tokens (e.g. <|tool_call_result_begin|>, <|user|>, <|observation|>) into assistant output. These tokens should never reach users. Add MODEL_SPECIAL_TOKEN_RE to stripFinalTagsFromText() which is called by sanitizeUserFacingText() — the central sanitization function for all outbound messages. Fixes openclaw#40020

…tion

Address review feedback:
1. MODEL_SPECIAL_TOKEN_RE: Replace wildcard `<|[^|]*|>` with explicit token allowlist to avoid false positives on legitimate content (e.g. F# pipe operators, code discussions about tokens).
2. Rename stripFinalTagsFromText → stripSpecialMarkupFromText to accurately reflect that the function now strips both <final> tags and model special tokens.

DeepSeek emits special tokens with full-width pipe (U+FF5C) delimiters like <｜Assistant｜> and <｜begin▁of▁sentence｜>. Updated regex to match both ASCII | and full-width ｜, and added case-insensitive flag. Added tests for full-width and mixed-case token variants.
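A sketch contrasting the ASCII-only pattern with the widened one described in this commit message. Both regexes use a shortened token list for readability, and the constant names are invented for this example.

```typescript
// ASCII pipes only, case-sensitive (shape of the earlier revision).
const ASCII_ONLY_RE = /<\|(?:assistant|user|begin▁of▁sentence)\|>/g;

// Accepts ASCII "|" or full-width "｜" (U+FF5C) on either side, case-insensitive.
const WIDENED_RE = /<[|｜](?:assistant|user|begin▁of▁sentence)[|｜]>/gi;

const leaked = "<｜begin▁of▁sentence｜><｜Assistant｜>Hello";

// The ASCII-only pattern misses the DeepSeek-style tokens entirely.
const withAsciiOnly = leaked.replace(ASCII_ONLY_RE, "");

// The widened pattern removes both tokens.
const withWidened = leaked.replace(WIDENED_RE, "");
```

`withAsciiOnly` is identical to the input, while `withWidened` is `"Hello"`. Note that a character class on each side also accepts mixed delimiters such as `<|assistant｜>`.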

chatgpt-codex-connector

💡 Codex Review

Reviewed commit: 52ac357f51

chatgpt-codex-connector · 2026-03-10T02:28:40Z

src/agents/pi-embedded-helpers/errors.ts

    return text;
  }
-  return text.replace(FINAL_TAG_RE, "");
+  return text.replace(FINAL_TAG_RE, "").replace(MODEL_SPECIAL_TOKEN_RE, "");


Gate special-token stripping behind leak detection

stripSpecialMarkupFromText now removes allowlisted token literals from every response, but sanitizeUserFacingText is also used on normal non-error assistant output paths (for example src/auto-reply/reply/normalize-reply.ts), so valid explanations that intentionally include strings like <|assistant|> or <｜Assistant｜> are silently corrupted. This is a user-visible regression introduced by the new unconditional replacement and should be limited to leak-like contexts instead of applied to all outbound text.

Codex review feedback: sanitizeUserFacingText strips model tokens from ALL text, which could mangle normal assistant explanations. Now stripSpecialMarkupFromText only applies MODEL_SPECIAL_TOKEN_RE when the text actually contains matching control tokens. Also reset lastIndex after .test() since the regex uses /g flag. FINAL_TAG_RE still runs unconditionally since <final> tags are never valid user-facing content.

imwyvern · 2026-03-10T08:27:07Z

Addressed Codex review feedback in bd2277f:

Gated special-token stripping behind leak detection: stripSpecialMarkupFromText now only applies MODEL_SPECIAL_TOKEN_RE when the text actually contains matching control tokens (via .test() guard). This prevents accidentally mangling normal assistant text that mentions token names in prose. Also resets lastIndex after .test() since the regex uses the /g flag.

FINAL_TAG_RE still runs unconditionally since <final> tags are never valid user-facing content.

Re: full-width DeepSeek tokens — already addressed in the prior commit with the [|｜] character class in the regex.

chatgpt-codex-connector

💡 Codex Review

Reviewed commit: bd2277f4d6

chatgpt-codex-connector · 2026-03-10T08:28:25Z

src/agents/pi-embedded-helpers/errors.ts

+  if (MODEL_SPECIAL_TOKEN_RE.test(result)) {
+    // Reset lastIndex since the regex has the /g flag.
+    MODEL_SPECIAL_TOKEN_RE.lastIndex = 0;
+    result = result.replace(MODEL_SPECIAL_TOKEN_RE, "");


Preserve token literals in normal assistant text

stripSpecialMarkupFromText removes every allowlisted token literal anywhere in the message, so ordinary responses that intentionally mention delimiters (for example, explaining <|assistant|> syntax) are silently corrupted. Because sanitizeUserFacingText is used on non-error output paths too, this affects normal user-visible replies rather than only leak cleanup; token stripping should be scoped to leak-like contexts instead of unconditional replacement.
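The concern above can be checked with a small sketch. It uses a shortened allowlist and two hypothetical helpers, `stripGuarded` (mirroring the diff) and `stripUnguarded` (the plain replace), and compares them on a few invented inputs.

```typescript
// Reduced allowlist for illustration; the PR's list is longer.
const MODEL_SPECIAL_TOKEN_RE = /<[|｜](?:assistant|user|system)[|｜]>/gi;

// Guarded variant, following the shape of the diff above.
function stripGuarded(text: string): string {
  let result = text;
  if (MODEL_SPECIAL_TOKEN_RE.test(result)) {
    // Reset lastIndex since the regex has the /g flag.
    MODEL_SPECIAL_TOKEN_RE.lastIndex = 0;
    result = result.replace(MODEL_SPECIAL_TOKEN_RE, "");
  }
  return result;
}

// Plain replace with no guard.
function stripUnguarded(text: string): string {
  return text.replace(MODEL_SPECIAL_TOKEN_RE, "");
}

const samples = [
  "The <|assistant|> token marks the start of a model turn.",
  "<｜Assistant｜>Hello there",
  "No tokens here, just a < b and c > d.",
];

// True when the guard makes no observable difference on these inputs.
const guardChangesNothing = samples.every(
  (s) => stripGuarded(s) === stripUnguarded(s),
);
```

On these inputs `guardChangesNothing` is `true`: the `.test()` guard uses the same pattern as the replacement, so it skips work when nothing matches but does not distinguish a leaked token from a token quoted in prose. The first sample still comes back as `"The  token marks the start of a model turn."`.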

imwyvern · 2026-03-10T10:39:58Z

👋 @obviyus Quick ping — this strips leaked GLM/DeepSeek special tokens (<|tool_call_result_begin|> etc.) from user-facing text. Small targeted fix with leak-detection guard. Tests pass.

odysseus0 · 2026-03-10T13:25:31Z

Thanks for the fix @imwyvern! I've reworked this into #42173 with a few architectural changes:

Generic pattern instead of allowlist — <[|｜]...[|｜]> catches all model control tokens (they all follow this convention), so we don't need to maintain an explicit list as new models add tokens
Moved to pi-embedded-utils.ts — next to stripMinimaxToolCallXml() which handles the same class of problem (provider-specific text leaks), keeping errors.ts focused on error rewriting
Added session replay path — sessions-helpers.ts has a parallel sanitization chain that also needed the fix
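One way the generic pattern mentioned above might look, as a rough sketch. The exact regex merged in #42173 is not quoted in this thread, so both the pattern and the helper name `stripModelControlTokens` are assumptions made for illustration.

```typescript
// Assumption: "<", a pipe (ASCII or full-width), any run of non-pipe
// characters, a pipe, then ">". Not necessarily the merged pattern.
const GENERIC_CONTROL_TOKEN_RE = /<[|｜][^|｜]*[|｜]>/g;

// Hypothetical helper name, for illustration only.
function stripModelControlTokens(text: string): string {
  return text.replace(GENERIC_CONTROL_TOKEN_RE, "");
}

// Tokens outside any allowlist are caught without maintaining a list.
const mixed = stripModelControlTokens("<|tool_call_result_begin|>ok<｜Assistant｜>");
const unseen = stripModelControlTokens("<|brand_new_token|>x");
```

Here `mixed` is `"ok"` and `unseen` is `"x"`. A pattern this broad trades list maintenance for the wildcard behaviour discussed earlier in the thread, so text between a `<|` and a later `|>` would also be removed.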

Your original work and investigation on #40020 made this possible — you're credited as co-author on the commit. 🙏

Superseded by #42173.

odysseus0 · 2026-03-10T13:25:37Z

Superseded by #42173

Models like GLM-5 and DeepSeek sometimes emit internal delimiter tokens in their responses. Uses generic pattern in the text extraction pipeline, following the same architecture as stripMinimaxToolCallXml. Closes #40020 Supersedes #40573 Co-authored-by: imwyvern <[email protected]>

odysseus0 · 2026-03-10T13:28:10Z

Superseded by #42173 (merged).

odysseus0 · 2026-03-10T13:30:14Z

Sorry about the separate PR — I had permission issues pushing to your branch when I was reworking it. You're credited as co-author on the merged commit. Next time I'll push directly to your branch. Thanks again for the fix!

…w#42173) Models like GLM-5 and DeepSeek sometimes emit internal delimiter tokens in their responses. Uses generic pattern in the text extraction pipeline, following the same architecture as stripMinimaxToolCallXml. Closes openclaw#40020 Supersedes openclaw#40573 Co-authored-by: imwyvern <[email protected]>

…w#42173) Models like GLM-5 and DeepSeek sometimes emit internal delimiter tokens in their responses. Uses generic pattern in the text extraction pipeline, following the same architecture as stripMinimaxToolCallXml. Closes openclaw#40020 Supersedes openclaw#40573 Co-authored-by: imwyvern <[email protected]> (cherry picked from commit 309162f)


openclaw-barnacle bot added agents Agent runtime and tooling size: XS labels Mar 9, 2026

greptile-apps bot reviewed Mar 9, 2026


chatgpt-codex-connector bot reviewed Mar 9, 2026


wes added 3 commits March 9, 2026 19:21

imwyvern force-pushed the fix/glm-special-token-sanitize branch from 744aee2 to 52ac357 Compare March 10, 2026 02:23

openclaw-barnacle bot added size: S and removed size: XS labels Mar 10, 2026

chatgpt-codex-connector bot reviewed Mar 10, 2026


odysseus0 mentioned this pull request Mar 10, 2026

fix: strip leaked model control tokens from user-facing text #42173

Merged

3 tasks

odysseus0 closed this Mar 10, 2026
