Uncommon characters such as '’' (U+2019) in system prompts can degrade LLM performance #2812
Description
Describe the bug
Files such as crates/goose/src/prompts/plan.md and crates/goose/src/prompts/system_gpt_4.1.md contain characters such as '’' (U+2019: RIGHT SINGLE QUOTATION MARK) and others. These characters are probably far less common in LLM training data than their ASCII counterparts and are tokenized differently, so they can degrade the quality of LLM output (however subtly), especially when used in system prompts or in instructions for tool use.
To Reproduce
Steps to reproduce the behavior:
- Clone the Goose Git repository
- In a (Bourne-like) shell with its current directory at the root of the Goose Git repository clone, execute the following command:
$ grep --color='always' -r -P -n "[^\x00-\x7F]" . | less -R
- Note the lines with uncommon quotation characters (ignore emoji).
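The search in the steps above can also be sketched in Python, which adds the Unicode name of each hit and makes the uncommon quotation characters easier to tell apart from emoji. This is a minimal sketch, not part of Goose: `find_non_ascii` and the sample text are hypothetical names chosen for illustration.

```python
import unicodedata


def find_non_ascii(text):
    """Return (line_number, column, character, unicode_name) for each non-ASCII character."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 0x7F:
                # Fall back to "UNKNOWN" for code points without a Unicode name.
                hits.append((line_no, col, ch, unicodedata.name(ch, "UNKNOWN")))
    return hits


# Hypothetical prompt text containing U+2019 instead of an ASCII apostrophe.
sample = "Follow the user\u2019s request.\nPlain ASCII line."

for line_no, col, ch, name in find_non_ascii(sample):
    print(f"{line_no}:{col}: U+{ord(ch):04X} {name}")
```

To scan a real file, the same function can be applied to the text read from it with an explicit `encoding="utf-8"`.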
Expected behavior
Strictly ASCII characters in system prompts and tool instructions, especially when used as word prefixes or suffixes, except when using clearly separated emoji.
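One way to reach this state is to replace typographic quotation marks with their ASCII equivalents and leave everything else, including emoji, untouched. The following is a minimal sketch under that assumption; `ASCII_QUOTES` and `normalize_quotes` are hypothetical names, and the mapping covers only the four most common curly quotes rather than every non-ASCII character the prompts may contain.

```python
# Map typographic quotation marks to their ASCII equivalents.
ASCII_QUOTES = str.maketrans({
    "\u2018": "'",   # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",   # RIGHT SINGLE QUOTATION MARK
    "\u201C": '"',   # LEFT DOUBLE QUOTATION MARK
    "\u201D": '"',   # RIGHT DOUBLE QUOTATION MARK
})


def normalize_quotes(text):
    """Replace curly quotes with ASCII quotes; all other characters are kept as-is."""
    return text.translate(ASCII_QUOTES)


print(normalize_quotes("Respect the user\u2019s \u201cgoal\u201d."))
```

Restricting the mapping to quotation marks keeps the change easy to review, since any other non-ASCII character would still need a manual decision.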
Screenshots

OpenAI online tokenizer output for text containing a right single quotation mark (U+2019). Notice that the important word user's is split into two tokens, which can influence LLM attention.

OpenAI online tokenizer output for the same text, but this time containing the common ASCII apostrophe ('). The important word user's is tokenized as a single token.
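The screenshots come from the actual tokenizer. The sketch below does not reproduce that tokenization; it only illustrates the input difference that a byte-level BPE tokenizer starts from: U+2019 is three bytes in UTF-8, while the ASCII apostrophe is one. The variable and function names are hypothetical.

```python
curly = "user\u2019s"   # with RIGHT SINGLE QUOTATION MARK
plain = "user's"        # with ASCII apostrophe


def utf8_hex(s):
    """Return the UTF-8 encoding of s as space-separated hex bytes."""
    return s.encode("utf-8").hex(" ")


print(len(curly.encode("utf-8")), utf8_hex(curly))
print(len(plain.encode("utf-8")), utf8_hex(plain))
```

Whether the extra bytes end up as one token or several depends on the merges a given tokenizer has learned, so the byte count is only a hint, not a prediction of the split.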
Please provide the following information:
- OS & Arch: [WSL Ubuntu 24.04.2 LTS x86]
- Interface: [CLI]
- Version: [v1.0.24]
- Extensions enabled: [Developer Tools, Computer Controller, Memory]
- Provider & Model: [Ollama - qwen3:14b-16k-ctx]
Additional context
N/A