Issue Checklist
Platform
macOS
Version
v1.8.4
Is your feature request related to an existing issue?
Related to #13831.
While attempting to fix the issue where `effort` was not correctly converted to `thinking_budget`, I discovered that the current `effort` → `thinking_budget` mapping strategy itself is fundamentally flawed for certain models.
Experiment: Using Qwen3.5-397B-A17B with `low` effort, `thinking_budget` was correctly converted to 4096 tokens. However, this actually made the model overthink significantly more compared to not passing `thinking_budget` at all:
| Scenario | Prompt | Thinking Time |
| --- | --- | --- |
| No `thinking_budget` (default) | "Hi" | ~11s |
| `thinking_budget: 4096` (low effort) | "Hi" | ~49s |
This reveals that the model tends to exhaust whatever thinking budget it receives, regardless of the actual complexity of the prompt. Even 4096 tokens (which is only ~5% of Qwen3.5's ~80k max thinking budget) is far too much for a trivial prompt like "Hi" — a budget of 100 would already be excessive in this case.
Desired Solution
We need a more nuanced approach to `effort` → `thinking_budget` mapping that accounts for model-specific behavior differences. Some potential directions:
- Model-family-specific scaling: Different model families (Qwen, Claude, Gemini, etc.) may need fundamentally different mapping curves. Qwen3.5 clearly needs much more aggressive reduction at lower effort levels compared to Claude models.
- Non-linear mapping: Instead of the current linear interpolation between `min` and `max`, consider exponential or logarithmic curves that provide finer granularity at the lower end.
- Provider-level overrides: Allow `THINKING_TOKEN_MAP` entries to optionally specify custom effort ratios or mapping functions per model family.
- Consider not sending `thinking_budget` at low effort: For models that behave better without an explicit thinking budget, the "low" effort setting could simply omit the parameter rather than sending a small value that paradoxically increases thinking.
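The family-specific, non-linear, and omit-at-low directions above could be combined into one small mapping layer. The sketch below is purely illustrative — `FamilyMapping`, `FAMILY_MAPPINGS`, `omitBelowRatio`, and `mapEffortToBudget` are hypothetical names, not existing code, and the tuning values are placeholders:

```typescript
type EffortLevel = 'minimal' | 'low' | 'medium' | 'high' | 'xhigh'

// Hypothetical per-family tuning; none of these names exist in the codebase.
interface FamilyMapping {
  // Exponent > 1 compresses the low end: budget = max * ratio^exponent
  exponent: number
  // At or below this effort ratio, omit thinking_budget entirely
  // and let the model use its default behavior.
  omitBelowRatio?: number
}

const FAMILY_MAPPINGS: Record<string, FamilyMapping> = {
  qwen: { exponent: 3, omitBelowRatio: 0.05 }, // aggressive reduction; omit at low/minimal
  claude: { exponent: 1 } // keep the current linear behavior
}

const EFFORT_RATIO: Record<EffortLevel, number> = {
  minimal: 0.01,
  low: 0.05,
  medium: 0.5,
  high: 0.95,
  xhigh: 1.0
}

// Returns a token budget, or undefined meaning "do not send thinking_budget".
function mapEffortToBudget(family: string, effort: EffortLevel, maxTokens: number): number | undefined {
  const mapping = FAMILY_MAPPINGS[family] ?? { exponent: 1 }
  const ratio = EFFORT_RATIO[effort]
  if (mapping.omitBelowRatio !== undefined && ratio <= mapping.omitBelowRatio) {
    return undefined
  }
  return Math.round(maxTokens * Math.pow(ratio, mapping.exponent))
}
```

With these example values, a Qwen model with an 80,000-token max would get no explicit budget at `low` or `minimal`, and 80,000 × 0.5³ = 10,000 tokens at `medium` instead of the linear 40,000.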
The ideal solution would be for models to dynamically decide thinking intensity based on prompt complexity, but since that's a model-level behavior we can't control from the client side, we need smarter client-side heuristics.
Alternative Solutions
- Per-model opt-out: Add a flag in model configuration to disable `thinking_budget` passthrough entirely, letting the model use its default behavior.
- User-configurable thinking budget: Expose the raw `thinking_budget` value as an advanced setting, allowing power users to fine-tune it manually.
- Adaptive approach: Start with a very low budget and increase it if the model indicates truncated reasoning (though this would require multiple API calls).
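The per-model opt-out could be as small as one optional flag on the model's reasoning config. The field and function names below are hypothetical, not existing Cherry Studio fields:

```typescript
// Hypothetical model-config shape; not an existing Cherry Studio interface.
interface ModelReasoningConfig {
  maxThinkingTokens: number
  // When true, never forward an explicit thinking_budget for this model.
  disableThinkingBudget?: boolean
}

// undefined means "omit the parameter and let the model use its default behavior".
function resolveThinkingBudget(config: ModelReasoningConfig, requestedBudget: number): number | undefined {
  if (config.disableThinkingBudget) return undefined
  return Math.min(requestedBudget, config.maxThinkingTokens)
}
```

This keeps the default pipeline untouched while letting a single config entry mark models (like Qwen3.5 here) that behave better without an explicit budget.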
Additional Information
The current `EFFORT_RATIO` mapping:

```
minimal: 0.01
low: 0.05
medium: 0.5
high: 0.95
xhigh: 1.0
```
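Assuming the mapping is simply ratio × max budget (a reasonable reading, though the actual `getReasoningEffort()`/`findTokenLimit()` implementation may round or clamp differently), the observed 4096-token budget falls out of a max of 81,920 tokens (~80k) at `low`:

```typescript
// Sketch of the current linear effort → budget mapping, for illustration only.
const EFFORT_RATIO = { minimal: 0.01, low: 0.05, medium: 0.5, high: 0.95, xhigh: 1.0 } as const

function linearBudget(effort: keyof typeof EFFORT_RATIO, maxTokens: number): number {
  return Math.round(maxTokens * EFFORT_RATIO[effort])
}

// With an assumed Qwen3.5 max of 81,920 tokens:
//   low     → 81920 * 0.05 = 4096  (matches the experiment above)
//   minimal → 81920 * 0.01 ≈ 819   (the ~800 tokens that still invites overthinking)
```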
For Qwen3.5 with a max of ~80k tokens, even `minimal` (0.01) would yield ~800 tokens, which may still cause overthinking for simple prompts. The root issue is that some models interpret any explicit thinking budget as a signal to think at least that much, rather than treating it as a maximum.
Code reference: `src/renderer/src/aiCore/utils/reasoning.ts` — `getReasoningEffort()` and the `EFFORT_RATIO` / `findTokenLimit()` mechanisms.