Uncommon characters such as '’' (U+2019) in system prompts can degrade LLM performance #2812

@ackalker

Describe the bug

Files such as crates/goose/src/prompts/plan.md and
crates/goose/src/prompts/system_gpt_4.1.md contain characters such as '’' (U+2019: RIGHT SINGLE QUOTATION MARK) and other non-ASCII punctuation. These characters are probably quite uncommon in LLM training data and tokenize differently from their ASCII equivalents, which can negatively (if subtly) influence the quality of LLM output, especially in system prompts or instructions for tool use.

To Reproduce
Steps to reproduce the behavior:

  1. Clone the Goose Git repository
  2. In a (Bourne-like) shell with its current directory at the root of the Goose Git repository clone, execute the following command:
    $ grep --color='always' -r -P -n "[^\x00-\x7F]" .|less -R
  3. Note the lines with uncommon quotation characters (ignore emoji).
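
For shells without GNU grep's `-P` option, the same check can be sketched in Python. This is only an illustration, not part of Goose: the function names `find_non_ascii` and `scan_tree` are hypothetical, and the `*.md` file pattern is an assumption based on the two prompt files named above. Unlike the grep command, it also prints the code point and Unicode name of each hit, which makes typographic quotes easy to tell apart from emoji.

```python
import unicodedata
from pathlib import Path


def find_non_ascii(text):
    """Return (line, column, char, code point, Unicode name) for each non-ASCII character."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 0x7F:
                name = unicodedata.name(ch, "UNKNOWN")
                hits.append((line_no, col, ch, f"U+{ord(ch):04X}", name))
    return hits


def scan_tree(root, pattern="*.md"):
    """Yield (path, hit) for every non-ASCII character in matching files under root.

    The "*.md" default is an assumption; adjust it to cover other prompt files.
    """
    for path in sorted(Path(root).rglob(pattern)):
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip directories, unreadable files, and non-UTF-8 files
        for hit in find_non_ascii(text):
            yield path, hit


if __name__ == "__main__":
    # Run from the root of the repository clone.
    for path, (line_no, col, ch, code, name) in scan_tree("crates/goose/src/prompts"):
        print(f"{path}:{line_no}:{col}: {code} {name}")
```

Each output line has the form `path:line:column: U+XXXX NAME`, so entries whose name contains `QUOTATION MARK` are the ones this issue is about.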

Expected behavior
Strictly ASCII characters in system prompts and tool instructions, especially when used as word prefixes or suffixes, except when using clearly separated emoji.
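
One possible way to get there is a small normalization pass over the prompt files. The sketch below is an assumption about how such a fix could look, not an existing Goose utility: the mapping table and the name `to_ascii_punctuation` are hypothetical, and the table only covers a few common typographic characters. Anything not in the table, including emoji, is left untouched.

```python
# Hypothetical mapping from typographic punctuation to ASCII equivalents.
# Extend it with whatever characters the scan of the prompt files turns up.
ASCII_REPLACEMENTS = {
    "\u2018": "'",    # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",    # RIGHT SINGLE QUOTATION MARK
    "\u201C": '"',    # LEFT DOUBLE QUOTATION MARK
    "\u201D": '"',    # RIGHT DOUBLE QUOTATION MARK
    "\u2026": "...",  # HORIZONTAL ELLIPSIS
    "\u00A0": " ",    # NO-BREAK SPACE
}

_TABLE = str.maketrans(ASCII_REPLACEMENTS)


def to_ascii_punctuation(text):
    """Replace the characters listed in ASCII_REPLACEMENTS; leave everything else as-is."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    sample = "Follow the user\u2019s instructions \u201Cexactly\u201D\u2026"
    print(to_ascii_punctuation(sample))
```

Using an explicit table rather than a blanket "strip everything above U+007F" rule keeps intentional non-ASCII content, such as clearly separated emoji, intact.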

Screenshots

Image
OpenAI online tokenizer output for text containing a right single quotation mark (U+2019). Notice that the important word user's is split into two tokens, which can influence LLM attention.

Image
OpenAI online tokenizer output for the same text, but this time containing the common ASCII apostrophe ' (U+0027). The important word user's is tokenized as a single token.

Please provide the following information:

  • OS & Arch: [WSL Ubuntu 24.04.2 LTS x86]
  • Interface: [CLI]
  • Version: [v1.0.24]
  • Extensions enabled: [Developer Tools, Computer Controller, Memory]
  • Provider & Model: [Ollama - qwen3:14b-16k-ctx]

Additional context
N/A

Labels: p3 (Priority 3 - Low)
