
Conversation

@roomote roomote bot commented Oct 25, 2025

This PR attempts to address Issue #8821. Feedback and guidance are welcome.

Problem

GLM 4.6 Turbo via Chutes was failing with the error:

"Requested token count exceeds the model's maximum context length of 202752 tokens. You requested a total of 233093 tokens: 30341 tokens from the input messages and 202752 tokens for the completion."

The issue was that maxTokens was set to 202752, using the entire context window for output and leaving no room for input tokens.

Solution

  • Adjusted maxTokens from 202752 to 40960 (20% of the 200K context window); a sketch of the adjusted entry follows after this list
  • This allocation leaves sufficient room for input tokens while maintaining generous output capacity
  • Added clarifying comment about the 20% calculation
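For reference, here is a minimal sketch of what the adjusted model entry might look like; the object shape and field names are assumptions based on this discussion, not the actual contents of packages/types/src/providers/chutes.ts.

    // Hypothetical sketch only; the real chutes.ts entry has more fields and may differ in shape.
    const glm46TurboSketch = {
      contextWindow: 202752, // maximum context length reported in the error message
      maxTokens: 40960, // previously 202752, which reserved the whole window for output; 40960 ≈ 20% of a 200K (204800-token) window
    }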

Testing

  • All existing tests pass
  • Type checking passes
  • Linting passes

Fixes #8821


Important

Adjust GLM-4.6-turbo max output tokens to 40960 and implement a centralized 20% cap for max tokens to prevent context limit errors.

  • Behavior:
    • Adjust maxTokens for zai-org/GLM-4.6-turbo in chutes.ts from 202752 to 40960 to prevent context limit errors.
    • Implement centralized 20% cap for max tokens in base-openai-compatible-provider.ts (see the sketch after this list).
  • Testing:
    • Update tests in chutes.spec.ts and zai.spec.ts to verify the 20% cap logic.
    • Ensure tests pass for new max token logic.
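To illustrate the cap itself, here is a rough, self-contained sketch of the arithmetic; the real helper is getModelMaxOutputTokens() in the Roo Code source and handles more cases than shown.

    // Illustration of the 20% output cap; not the actual getModelMaxOutputTokens() implementation.
    function capOutputTokens(contextWindow: number, modelMaxTokens?: number): number {
      const cap = Math.ceil(contextWindow * 0.2) // reserve at most 20% of the window for output
      return Math.min(modelMaxTokens ?? cap, cap)
    }

    // GLM-4.6-turbo via Chutes: min(202752, ceil(202752 * 0.2)) = 40551
    console.log(capOutputTokens(202752, 202752))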

This description was created by Ellipsis for d461d3e.

Fixes issue where GLM-4.6-turbo was requesting the entire context window
(202752 tokens) for output, leaving no room for input tokens. Now set to
40960 tokens (20% of 200k context) to allow sufficient input space.

Fixes #8821
@roomote roomote bot requested review from cte, jr and mrubens as code owners October 25, 2025 11:15
@dosubot dosubot bot added the size:S (This PR changes 10-29 lines, ignoring generated files) and bug (Something isn't working) labels Oct 25, 2025

roomote bot commented Oct 25, 2025

✅ Code Review Complete

No issues found. The revert restores the raw Chutes API value (202752) while the centralized 20% cap in getModelMaxOutputTokens() ensures the effective max tokens will be ~40,551 (ceil(202752 * 0.2)), preventing the original token limit error.


@roomote roomote bot left a comment

No issues found.

@hannesrudolph hannesrudolph added the Issue/PR - Triage (New issue. Needs quick review to confirm validity and assign labels.) label Oct 25, 2025

mrubens commented Oct 25, 2025

@roomote there should already be logic to cap the output tokens to 20% of the context window. Please figure out why that’s not working correctly.

roomote bot commented Oct 25, 2025

Thanks for the heads up. Investigating why the 20% output cap did not apply to GLM-4.6-turbo via Chutes. I will trace the centralized capping logic, verify contextWindow and model metadata, and check provider specific overrides and request building. Will follow up with root cause and fix shortly.

…roviders (Chutes)

Root cause: BaseOpenAiCompatibleProvider sent model.info.maxTokens directly, bypassing getModelMaxOutputTokens, so the 20% context-window cap wasn't applied for Chutes (e.g., GLM-4.6-turbo). Fix applies getModelMaxOutputTokens with format="openai" before sending max_tokens, preventing context limit overruns.

Tests: update Chutes spec to expect clamped max_tokens; all affected tests pass.
@dosubot dosubot bot added the size:M (This PR changes 30-99 lines, ignoring generated files) label and removed the size:S (This PR changes 10-29 lines, ignoring generated files) label Oct 25, 2025
…e zai.spec to expect clamped max_tokens (min(model.maxTokens, ceil(contextWindow*0.2))) for OpenAI-compatible requests.
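As a rough illustration of that expectation (Vitest-style assertions assumed; the real chutes.spec.ts and zai.spec.ts exercise the provider and inspect the outgoing request):

    // Illustrative spec only; mirrors the formula min(model.maxTokens, ceil(contextWindow * 0.2)).
    import { describe, expect, it } from "vitest"

    const expectedMaxTokens = (maxTokens: number, contextWindow: number) =>
      Math.min(maxTokens, Math.ceil(contextWindow * 0.2))

    describe("OpenAI-compatible max_tokens clamp", () => {
      it("caps GLM-4.6-turbo output at 20% of its context window", () => {
        expect(expectedMaxTokens(202752, 202752)).toBe(40551)
      })
    })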

@roomote roomote bot left a comment

No issues found.

roomote bot commented Oct 25, 2025

Summary: I traced the 20 percent max output cap. The centralized logic already exists in getModelMaxOutputTokens(), but OpenAI-compatible providers were bypassing it.

Root cause: BaseOpenAiCompatibleProvider was passing model.info.maxTokens straight through to max_tokens in createStream, so providers like Chutes and Z AI did not get the 20 percent clamp.

Fix: BaseOpenAiCompatibleProvider now calls getModelMaxOutputTokens with format=openai before sending max_tokens, enforcing the 20 percent cap while preserving existing exceptions (for example GPT-5). The change is in BaseOpenAiCompatibleProvider, in the parameter calculation within createStream.

Tests: updated expectations to match the centralized clamp in chutes.spec and zai.spec. All local tests pass and CI is green across unit, integration, and compile.

This should prevent GLM-4.6-turbo via Chutes from requesting output tokens beyond the 20 percent cap and eliminate context limit errors.
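To make the before-and-after concrete, here is a self-contained sketch of the routing change described above; the types and the stand-in helper are simplified assumptions, not the real BaseOpenAiCompatibleProvider or getModelMaxOutputTokens code.

    // Self-contained sketch; names and shapes are simplified assumptions.
    interface SketchModelInfo {
      contextWindow: number
      maxTokens?: number
    }

    // Stand-in for the centralized helper: clamp output tokens to 20% of the context window.
    function getModelMaxOutputTokensSketch(info: SketchModelInfo): number {
      const cap = Math.ceil(info.contextWindow * 0.2)
      return Math.min(info.maxTokens ?? cap, cap)
    }

    const info: SketchModelInfo = { contextWindow: 202752, maxTokens: 202752 }

    // Before (bug): the raw metadata value was forwarded straight into max_tokens.
    const maxTokensBefore = info.maxTokens // 202752, leaving no room for input tokens

    // After (fix): route through the centralized clamp before building the request.
    const maxTokensAfter = getModelMaxOutputTokensSketch(info) // 40551

    console.log(maxTokensBefore, maxTokensAfter)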

mrubens commented Oct 25, 2025

@roomote can you undo the change in fc04a3a then?

@roomote roomote bot left a comment

No issues found.

roomote bot commented Oct 25, 2025

Thanks for the quick review. I will revert the change introduced in fc04a3a on this branch and keep the centralized 20% output cap in place. I will push an update shortly.

…2 in types; rely on centralized 20% cap at request time

roomote bot commented Oct 25, 2025

Update: reverted the change from fc04a3a. Restored packages/types/src/providers/chutes.ts entry for model zai-org/GLM-4.6-turbo to maxTokens 202752. The centralized 20 percent output cap remains enforced at request time via getModelMaxOutputTokens with format=openai in BaseOpenAiCompatibleProvider, so runtime behavior still prevents context limit overruns. All tests pass locally and CI checks are green.

@roomote roomote bot left a comment

No issues found.

@dosubot dosubot bot added the lgtm (This PR has been approved by a maintainer) label Oct 25, 2025
@mrubens mrubens merged commit 98b8d5b into main Oct 25, 2025
11 checks passed
@mrubens mrubens deleted the fix/glm-4.6-turbo-max-tokens branch October 25, 2025 18:08
@github-project-automation github-project-automation bot moved this from Triage to Done in Roo Code Roadmap Oct 25, 2025
@github-project-automation github-project-automation bot moved this from New to Done in Roo Code Roadmap Oct 25, 2025

Development

Successfully merging this pull request may close these issues.

[BUG] GLM 4.6 Turbo via Chutes doesn't work because of incorrect max output token count
