Default system prompt degrades kimi-k2.5 performance on coding benchmarks #20258
Description
We are reporting this on behalf of Moonshot AI. In our internal evaluations, we found that the current default system prompt appears to degrade kimi-k2.5 performance on coding- and reasoning-oriented benchmarks.
Summary of observed impact
| Benchmark | With fine-tuned prompt | With default prompt |
|---|---|---|
| Benchmark A | 58.0 ± 2.4 | 54.1 ± 3.8 |
| Benchmark B | 67.1 ± 1.0 | 60.0 ± 2.4 |
Across both benchmarks, the default prompt is not neutral for Kimi. It appears to reduce both average performance and result stability.
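For clarity on how to read the table, assuming the ± figures denote the mean and sample standard deviation over independent evaluation runs (our assumption; the report does not state the aggregation method), they would be computed as:

```python
import statistics

def summarize(scores):
    """Return (mean, sample standard deviation) for a list of per-run scores."""
    return statistics.mean(scores), statistics.stdev(scores)

# Hypothetical per-run scores for illustration only; not the actual benchmark data.
runs = [56.0, 53.0, 58.5, 50.5]
mean, std = summarize(runs)
print(f"{mean:.1f} ± {std:.1f}")  # mean ± sample std across runs
```

Under this reading, the wider ± band in the default-prompt column reflects lower run-to-run stability, not just a lower average.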
Why the default prompt may be harmful
Based on prompt inspection, we believe there are at least three concrete issues.
1. Overly aggressive brevity constraints
The prompt repeatedly instructs the model to minimize response length, including guidance such as:
- “minimize output tokens as much as possible”
- “should NOT answer with unnecessary preamble or postamble”
- “MUST answer concisely with fewer than 4 lines”
- “One word answers are best”
For a reasoning-oriented coding model, these constraints appear too aggressive. They bias the model toward underspecified or shallow responses and may suppress useful planning, explanation, and intermediate reasoning behavior.
2. Misaligned few-shot examples
The few-shot examples in the default prompt are primarily trivial question-answer pairs, such as:
- “2+2”
- “How many golf balls fit inside a jetta?”
- “is 11 a prime number?”
These examples do not resemble the kinds of tasks the model is expected to perform in coding and engineering settings.
3. Internally conflicting instructions
The prompt also appears to contain contradictory guidance. For example, it instructs the model to explain what a command does and why it is being run, while also discouraging explanatory text before or after responses.
These competing instructions likely cause instability in response style and behavior, which may contribute to the higher variance we observe in the benchmark results.
Related issues:
- default system prompt materiallly freaks out / degrades reasoning of every high reasoning model #10927
- System prompt block Qwen3.5 natural thinking process #18799
Plugins
No response
OpenCode version
1.2.27
Steps to reproduce
We are unable to provide a public reproduction workflow. The underlying evaluation datasets and benchmark setup are internal only.
Screenshot and/or share link
No response
Operating System
Ubuntu 22.04
Terminal
No response