
Default system prompt degrades kimi-k2.5 performance on coding benchmarks #20258

@Yuxin-Dong

Description


We are reporting this on behalf of Moonshot AI. In our internal evaluations, we found that the current default system prompt appears to degrade kimi-k2.5 performance on coding- and reasoning-oriented benchmarks.

Summary of observed impact

| Benchmark | With fine-tuned prompt | With default prompt |
| --- | --- | --- |
| Benchmark A | 58.0 ± 2.4 | 54.1 ± 3.8 |
| Benchmark B | 67.1 ± 1.0 | 60.0 ± 2.4 |

Across both benchmarks, the default prompt is not neutral for Kimi. It appears to reduce both the average score and run-to-run stability.
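For clarity on how we read these figures: we take each ± value to be the sample standard deviation across repeated evaluation runs. A minimal sketch of that aggregation, using hypothetical per-run scores (not the actual evaluation data):

```python
import statistics

def summarize(scores: list[float]) -> tuple[float, float]:
    """Return (mean, sample standard deviation) over repeated benchmark runs."""
    return statistics.mean(scores), statistics.stdev(scores)

# Hypothetical per-run scores for illustration only.
default_prompt_runs = [50.2, 55.9, 56.3, 53.0, 55.1]

mean, std = summarize(default_prompt_runs)
print(f"{mean:.1f} ± {std:.1f}")  # e.g. "54.1 ± 2.5"
```

A wider spread across runs shows up directly as a larger ± term, which is what we mean by reduced result stability under the default prompt.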

Why the default prompt may be harmful

Based on prompt inspection, we believe there are at least three concrete issues.

1. Overly aggressive brevity constraints

The prompt repeatedly instructs the model to minimize response length, including guidance such as:

  • “minimize output tokens as much as possible”
  • “should NOT answer with unnecessary preamble or postamble”
  • “MUST answer concisely with fewer than 4 lines”
  • “One word answers are best”

For a reasoning-oriented coding model, these constraints appear too aggressive. They bias the model toward underspecified or shallow responses and may suppress useful planning, explanation, and intermediate reasoning.

2. Misaligned few-shot examples

The few-shot examples in the default prompt are primarily trivial question-answer pairs, such as:

  • 2+2
  • How many golf balls fit inside a jetta?
  • is 11 a prime number?

These examples do not resemble the kinds of tasks the model is expected to perform in coding and engineering settings.

3. Internally conflicting instructions

The prompt also appears to contain contradictory guidance. For example, it instructs the model to explain what a command does and why it is being run, while also discouraging explanatory text before or after responses.

These competing instructions likely create instability in response style and behavior, which may contribute to the higher variance we observe in benchmark results.
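Conflicts like this can be surfaced mechanically. As a rough sketch (the directive pairs below are our own illustrative examples, not an exhaustive audit of the actual default prompt), a check that flags a prompt containing both an "explain" directive and a brevity directive:

```python
# Hypothetical lint for mutually conflicting prompt directives.
# The phrase pairs are illustrative, not taken verbatim from a full audit.
CONFLICTING_PAIRS = [
    ("explain what the command does", "unnecessary preamble or postamble"),
    ("explain what the command does", "one word answers are best"),
]

def find_conflicts(prompt: str) -> list[tuple[str, str]]:
    """Return directive pairs that both occur in the prompt (case-insensitive)."""
    text = prompt.lower()
    return [(a, b) for a, b in CONFLICTING_PAIRS if a in text and b in text]

# Toy prompt combining an explanation directive with two brevity directives.
sample = (
    "Explain what the command does and why you are running it. "
    "You should NOT answer with unnecessary preamble or postamble."
)
conflicts = find_conflicts(sample)
```

Each flagged pair is a place where the model must trade one instruction off against the other, which plausibly shows up as unstable response style across runs.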

Related issues:

Plugins

No response

OpenCode version

1.2.27

Steps to reproduce

We are unable to provide a public reproduction workflow. The underlying evaluation datasets and benchmark setup are internal only.

Screenshot and/or share link

No response

Operating System

Ubuntu 22.04

Terminal

No response

Metadata

Labels: bug (Something isn't working), core (core functionality of the application, opencode server)
