Issue Checklist
Platform
macOS
Version
v1.8.4
Is your feature request related to an existing issue?
Related to #13831.
While attempting to fix the issue where `effort` was not correctly converted to `thinking_budget`, I discovered that the current `effort` → `thinking_budget` mapping strategy itself is fundamentally flawed for certain models.
Experiment: Using Qwen3.5-397B-A17B with `low` effort, `thinking_budget` was correctly converted to 4096 tokens. However, this actually made the model overthink significantly more compared to not passing `thinking_budget` at all:
| Scenario | Prompt | Thinking Time |
| --- | --- | --- |
| No `thinking_budget` (default) | "Hi" | ~11s |
| `thinking_budget: 4096` (low effort) | "Hi" | ~49s |
This reveals that the model tends to exhaust whatever thinking budget it receives, regardless of the actual complexity of the prompt. Even 4096 tokens (which is only ~5% of Qwen3.5's ~80k max thinking budget) is far too much for a trivial prompt like "Hi" — a budget of 100 would already be excessive in this case.
Desired Solution
We need a more nuanced approach to `effort` → `thinking_budget` mapping that accounts for model-specific behavior differences. Some potential directions:
- Model-family-specific scaling: Different model families (Qwen, Claude, Gemini, etc.) may need fundamentally different mapping curves. Qwen3.5 clearly needs much more aggressive reduction at lower effort levels compared to Claude models.
- Non-linear mapping: Instead of the current linear interpolation between `min` and `max`, consider exponential or logarithmic curves that provide finer granularity at the lower end.
- Provider-level overrides: Allow `THINKING_TOKEN_MAP` entries to optionally specify custom effort ratios or mapping functions per model family.
- Consider not sending `thinking_budget` at low effort: For models that behave better without an explicit thinking budget, the "low" effort setting could simply omit the parameter rather than sending a small value that paradoxically increases thinking.
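The family-specific, non-linear, and omit-at-low directions above could be combined into one small mapping layer. The sketch below is purely illustrative — `FamilyMapping`, `FAMILY_MAPPINGS`, `omitBelowRatio`, and `mapEffortToBudget` are hypothetical names, not existing code, and the tuning values are placeholders:

```typescript
type EffortLevel = 'minimal' | 'low' | 'medium' | 'high' | 'xhigh'

// Hypothetical per-family tuning; none of these names exist in the codebase.
interface FamilyMapping {
  // Exponent > 1 compresses the low end: budget = max * ratio^exponent
  exponent: number
  // At or below this effort ratio, omit thinking_budget entirely
  // and let the model use its default behavior.
  omitBelowRatio?: number
}

const FAMILY_MAPPINGS: Record<string, FamilyMapping> = {
  qwen: { exponent: 3, omitBelowRatio: 0.05 }, // aggressive reduction; omit at low/minimal
  claude: { exponent: 1 } // keep the current linear behavior
}

const EFFORT_RATIO: Record<EffortLevel, number> = {
  minimal: 0.01,
  low: 0.05,
  medium: 0.5,
  high: 0.95,
  xhigh: 1.0
}

// Returns a token budget, or undefined meaning "do not send thinking_budget".
function mapEffortToBudget(family: string, effort: EffortLevel, maxTokens: number): number | undefined {
  const mapping = FAMILY_MAPPINGS[family] ?? { exponent: 1 }
  const ratio = EFFORT_RATIO[effort]
  if (mapping.omitBelowRatio !== undefined && ratio <= mapping.omitBelowRatio) {
    return undefined
  }
  return Math.round(maxTokens * Math.pow(ratio, mapping.exponent))
}
```

With these example values, a Qwen model with an 80,000-token max would get no explicit budget at `low` or `minimal`, and 80,000 × 0.5³ = 10,000 tokens at `medium` instead of the linear 40,000.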
The ideal solution would be for models to dynamically decide thinking intensity based on prompt complexity, but since that's a model-level behavior we can't control from the client side, we need smarter client-side heuristics.
Alternative Solutions
- Per-model opt-out: Add a flag in model configuration to disable `thinking_budget` passthrough entirely, letting the model use its default behavior.
- User-configurable thinking budget: Expose the raw `thinking_budget` value as an advanced setting, allowing power users to fine-tune it manually.
- Adaptive approach: Start with a very low budget and increase it if the model indicates truncated reasoning (though this would require multiple API calls).
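The per-model opt-out could be as small as one optional flag on the model's reasoning config. The field and function names below are hypothetical, not existing Cherry Studio fields:

```typescript
// Hypothetical model-config shape; not an existing Cherry Studio interface.
interface ModelReasoningConfig {
  maxThinkingTokens: number
  // When true, never forward an explicit thinking_budget for this model.
  disableThinkingBudget?: boolean
}

// undefined means "omit the parameter and let the model use its default behavior".
function resolveThinkingBudget(config: ModelReasoningConfig, requestedBudget: number): number | undefined {
  if (config.disableThinkingBudget) return undefined
  return Math.min(requestedBudget, config.maxThinkingTokens)
}
```

This keeps the default pipeline untouched while letting a single config entry mark models (like Qwen3.5 here) that behave better without an explicit budget.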
Additional Information
The current `EFFORT_RATIO` mapping:

```
minimal: 0.01
low: 0.05
medium: 0.5
high: 0.95
xhigh: 1.0
```
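Assuming the mapping is simply ratio × max budget (a reasonable reading, though the actual `getReasoningEffort()`/`findTokenLimit()` implementation may round or clamp differently), the observed 4096-token budget falls out of a max of 81,920 tokens (~80k) at `low`:

```typescript
// Sketch of the current linear effort → budget mapping, for illustration only.
const EFFORT_RATIO = { minimal: 0.01, low: 0.05, medium: 0.5, high: 0.95, xhigh: 1.0 } as const

function linearBudget(effort: keyof typeof EFFORT_RATIO, maxTokens: number): number {
  return Math.round(maxTokens * EFFORT_RATIO[effort])
}

// With an assumed Qwen3.5 max of 81,920 tokens:
//   low     → 81920 * 0.05 = 4096  (matches the experiment above)
//   minimal → 81920 * 0.01 ≈ 819   (the ~800 tokens that still invites overthinking)
```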
For Qwen3.5 with a max of ~80k tokens, even `minimal` (0.01) would yield ~800 tokens, which may still cause overthinking for simple prompts. The root issue is that some models interpret any explicit thinking budget as a signal to think at least that much, rather than treating it as a maximum.
Code reference: `src/renderer/src/aiCore/utils/reasoning.ts` — `getReasoningEffort()` and the `EFFORT_RATIO` / `findTokenLimit()` mechanisms.