fix: adjust GLM-4.6-turbo max output tokens to prevent context limit errors #8822
Conversation
Fixes issue where GLM-4.6-turbo was requesting the entire context window (202752 tokens) for output, leaving no room for input tokens. Now set to 40960 tokens (20% of 200k context) to allow sufficient input space. Fixes #8821
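For reference, a minimal sketch of what this metadata change looks like; the field names follow a common model-info shape and are assumptions here, not copied from chutes.ts. Note that 40960 is 20% of 204800 (200 × 1024).

```typescript
// Hypothetical sketch of the chutes.ts entry after this change
// (field names assumed; only contextWindow/maxTokens matter here).
export const glm46TurboInfo = {
	contextWindow: 202_752, // value reported by the Chutes API
	// was 202_752, which reserved the entire window for output
	maxTokens: 40_960, // ~20% of 200K (204_800), leaving room for input
	supportsImages: false,
	supportsPromptCache: false,
} as const
```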
✅ Code Review Complete. No issues found. The revert restores the raw Chutes API value (202752) while the centralized 20% cap in getModelMaxOutputTokens() still limits output at request time.
@roomote there should already be logic to cap the output tokens to 20% of the context window. Please figure out why that’s not working correctly.
Thanks for the heads up. Investigating why the 20% output cap did not apply to GLM-4.6-turbo via Chutes. I will trace the centralized capping logic, verify contextWindow and model metadata, and check provider-specific overrides and request building. Will follow up with the root cause and a fix shortly.
…roviders (Chutes)

Root cause: BaseOpenAiCompatibleProvider sent model.info.maxTokens directly, bypassing getModelMaxOutputTokens, so the 20% context-window cap wasn’t applied for Chutes (e.g., GLM-4.6-turbo). Fix applies getModelMaxOutputTokens with format="openai" before sending max_tokens, preventing context limit overruns.

Tests: update Chutes spec to expect clamped max_tokens; all affected tests pass.
…e zai.spec to expect clamped max_tokens (min(model.maxTokens, ceil(contextWindow*0.2))) for OpenAI-compatible requests.
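A sketch of the expectation these commit messages describe, shown with vitest purely for illustration; the real chutes.spec.ts / zai.spec.ts exercise the provider and its mocked client rather than the raw formula.

```typescript
// Illustrative-only expectation mirroring the clamp described above:
// min(model.maxTokens, ceil(contextWindow * 0.2)).
import { describe, expect, it } from "vitest"

describe("20% output-token clamp", () => {
	it("clamps GLM-4.6-turbo's max_tokens to 20% of its context window", () => {
		const contextWindow = 202_752
		const modelMaxTokens = 202_752
		const clamped = Math.min(modelMaxTokens, Math.ceil(contextWindow * 0.2))
		expect(clamped).toBe(40_551) // far below the raw 202_752
	})
})
```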
No issues found.
Summary: I traced the 20 percent max output cap. The centralized logic already exists in getModelMaxOutputTokens(), but OpenAI-compatible providers were bypassing it.

Root cause: BaseOpenAiCompatibleProvider was passing model.info.maxTokens straight through to max_tokens in createStream, so providers like Chutes and Z AI never received the 20 percent clamp.

Fix: BaseOpenAiCompatibleProvider now calls getModelMaxOutputTokens with format=openai before sending max_tokens, enforcing the 20 percent cap while preserving existing exceptions (for example GPT-5). See BaseOpenAiCompatibleProvider and the param calculation in createStream near this call site.

Tests: updated expectations to match the centralized clamp in chutes.spec and zai.spec. All local tests pass and CI is green across unit, integration, and compile checks.

This should prevent GLM-4.6-turbo via Chutes from requesting output tokens beyond the 20 percent cap and eliminate context limit errors.
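A minimal, standalone sketch of the behavior described above. The real helper is getModelMaxOutputTokens(...) called with format "openai"; its exact options object is not reproduced here, so capOutputTokens below is a labeled stand-in (exceptions such as GPT-5 are omitted).

```typescript
// Stand-in model metadata shape for this sketch only.
interface ModelInfo {
	contextWindow: number
	maxTokens?: number
}

// Stand-in for the centralized 20% context-window cap.
function capOutputTokens(info: ModelInfo): number {
	const cap = Math.ceil(info.contextWindow * 0.2)
	return Math.min(info.maxTokens ?? cap, cap)
}

// Request params as an OpenAI-compatible provider might build them.
function buildCompletionParams(modelId: string, info: ModelInfo) {
	return {
		model: modelId,
		// Before the fix this was info.maxTokens, sent straight through.
		max_tokens: capOutputTokens(info),
		stream: true as const,
	}
}

// GLM-4.6-turbo via Chutes: max_tokens becomes 40_551 instead of 202_752.
console.log(buildCompletionParams("zai-org/GLM-4.6-turbo", { contextWindow: 202_752, maxTokens: 202_752 }))
```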
No issues found.
Thanks for the quick review. I will revert the change introduced in fc04a3a on this branch and keep the centralized 20% output cap in place. I will push an update shortly.
…2 in types; rely on centralized 20% cap at request time
Update: reverted the change from fc04a3a. Restored the packages/types/src/providers/chutes.ts entry for model zai-org/GLM-4.6-turbo to maxTokens 202752. The centralized 20 percent output cap remains enforced at request time via getModelMaxOutputTokens with format=openai in BaseOpenAiCompatibleProvider, so runtime behavior still prevents context limit overruns. All tests pass locally and CI checks are green.
No issues found.
This PR attempts to address Issue #8821. Feedback and guidance are welcome.
Problem
GLM 4.6 Turbo via Chutes was failing with a context limit error.
The issue was that maxTokens was set to 202752, using the entire context window for output and leaving no room for input tokens.
Solution
Adjusted maxTokens from 202752 to 40960 (20% of the 200K context window).
Testing
Fixes #8821
Important
Adjust GLM-4.6-turbo max output tokens to 40960 and implement a centralized 20% cap for max tokens to prevent context limit errors.
maxTokens for zai-org/GLM-4.6-turbo in chutes.ts changed from 202752 to 40960 to prevent context limit errors.
Centralized 20% cap applied in base-openai-compatible-provider.ts.
chutes.spec.ts and zai.spec.ts updated to verify the 20% cap logic.
This description was created by
for d461d3e. You can customize this summary. It will automatically update as commits are pushed.