Add benchmarks for Qwen3-235B-A22B and Qwen3-32B #3908
AlongWY wants to merge 1 commit into Aider-AI:main from AlongWY:patch-1
Conversation
Thanks for your interest in aider and for taking the time to make this PR. How did you benchmark this model? Where was it running? With what quantization or other inference settings?
Thank you for taking the time to review the PR! We loaded the Qwen3-235B-A22B and Qwen3-32B models using vLLM with bfloat16 precision, and tested them in non-thinking mode. Here were the settings we used:

- name: openai/${MODEL_NAME}
  use_temperature: 0.6
  extra_params:
    max_tokens: 24000
    top_p: 0.95
    top_k: 20
    temperature: 0.6

Aider is such a great AI Pair Programming Assistant! Thanks for everything!
Your parameters are for the thinking mode. Qwen provides these parameters for non-thinking mode:

It might affect the results if you run it with these settings.
Thank you for your recommendation regarding the inference settings! We have re-evaluated Qwen3's performance on the Aider benchmark using the suggested parameters and found that Qwen3-235B-A22B achieves better results under these settings (61.8 to 65.3 with the whole format). We have updated our results accordingly and resubmitted them with the recommended configuration.
Hi @paul-gauthier, We've re-evaluated Qwen3's performance on the Aider benchmark with the recommended configuration and updated our results accordingly.

- name: openai/${MODEL_NAME}
  use_temperature: 0.7
  extra_params:
    max_tokens: 24000
    top_p: 0.8
    top_k: 20
    min_p: 0.0
    temperature: 0.7

Would it be possible to merge this PR soon if everything looks good? Let us know if you need any further adjustments or clarifications. Thanks again for your time and support!
What about Qwen3 30B A3B? (Considering its throughput is quite impressive, and the cost is acceptable)
@AlongWY would it be possible to test in thinking mode as well?
Hi @paul-gauthier, Just wanted to follow up gently on this PR. I'm checking in to see if there's anything preventing it from being merged at the moment. If there are any issues or further changes needed, please let me know; we're happy to help address them! Thanks again for your time and consideration.
Why is the context so small? Also curious: what happened in diff mode?
Hi, I tested qwen3-32b with the same configuration, but I got different results. Can you provide me with the detailed log?
Any chance we could see these models with thinking enabled as well? I am really curious what the performance looks like then.
What about the thinking version with architect mode enabled?
I am benchmarking |
Seems like you are using |
I'm benchmarking |
Sorry for the mistake then. It seems you are enabling reasoning, while the PR author stated these results were with reasoning disabled. In my local testing the model actually performs better on coding without reasoning (by adding /nothink to the system prompt). In that case it would be good to set
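A minimal sketch of what "adding /nothink to the system prompt" amounts to, shown with the underscore spelling (`/no_think`) that Qwen's documentation uses. The helper name is hypothetical and not part of aider or any library:

```python
# Sketch: add Qwen3's documented "/no_think" soft switch to the system
# message so the model skips its reasoning block. Illustrative only.

def disable_thinking(messages):
    """Return a copy of the chat messages with /no_think appended to the
    system prompt (creating one if the conversation has none)."""
    out = [dict(m) for m in messages]
    for m in out:
        if m.get("role") == "system":
            m["content"] = m["content"].rstrip() + " /no_think"
            return out
    # No system message: prepend one containing only the directive.
    return [{"role": "system", "content": "/no_think"}] + out

msgs = [{"role": "user", "content": "Write a hello-world in C."}]
patched = disable_thinking(msgs)
```

With a local proxy or a custom client, this rewrite would run just before the request is sent to the OpenAI-compatible endpoint.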
Hello @paul-gauthier, It seems that you ran the models with the wrong parameters and in the wrong mode (non-thinking mode, not thinking mode). We force the model not to think by changing the chat template when deploying with vLLM:

change to

Add

And we reran the benchmark with the recommended parameters
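The chat-template trick described above presumably works along these lines (an illustrative fragment, not the literal Qwen3 template, whose exact text varies by release): opening the assistant turn with an already-closed, empty think block makes the model generate as if the reasoning phase had finished.

```jinja
{# Illustrative sketch only: after opening the assistant turn, emit an
   empty <think> block so generation starts past the reasoning phase. #}
{{ '<|im_start|>assistant\n' }}
{{ '<think>\n\n</think>\n\n' }}
```

This mirrors what `enable_thinking: false` does through the tokenizer's chat template, but baking it into the served template forces non-thinking mode for every request.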
These settings:

only get ~51% results...
I won't be able to merge this PR since I am unable to reproduce scores even close to those reported here. |
This was expected. Can you share your results? It would be interesting to see the actual performance.
An explanation for the low or inconsistent performance might be that you've made a mistake in your system prompt. You have:

But the correct way to disable thinking on qwen3 models is to use "/no_think" instead, with an underscore.
Welp, that's exactly what they did: QwenLM/Qwen3@2af51e0. I think I've also seen the variants with a
I wonder if OpenRouter deploys the model in bfloat16, similarly to how it was tested by the PR author. @AlongWY are you able to reproduce your results on OpenRouter or some other LLM cloud provider?
According to the OpenRouter FAQ, it is just a proxy for other LLM providers. @paul-gauthier do you know which provider was used for the verification?

Looking at the Qwen3 235B A22B provider list, all of them but one are listed as using FP8, which differs from the bfloat16 settings used by @AlongWY. Fireworks doesn't list the precision, but given the higher TPS I imagine it should be 4-bit.

@AlongWY my suggestion is to try reproducing the results using a public API provider and opening a new PR that can be verified independently.

Update: it seems Alibaba Cloud has an official API for Qwen, so maybe that should be used for reproducing the results @AlongWY: https://www.alibabacloud.com/help/en/model-studio/use-qwen-by-calling-api
I added the results from this PR to a new blog entry. Thanks for doing the benchmarking and sharing your results!
Hi @paul-gauthier @tarruda, thank you for your feedback! After trying OpenRouter myself, I indeed found that it fails to activate the no-thinking mode successfully (though this is related to the third-party providers, who deploy the thinking mode by default). Following @tarruda's suggestion, I tested Qwen's official API (https://dashscope.aliyuncs.com/compatible-mode/v1) with the following configuration:

export OPENAI_API_BASE="https://dashscope.aliyuncs.com/compatible-mode/v1"
export OPENAI_API_KEY="YOUR API KEY"

- name: openai/qwen3-235b-a22b
  use_temperature: 0.7
  streaming: false
  extra_params:
    stream: false
    max_tokens: 16384
    top_p: 0.8
    top_k: 20
    temperature: 0.7
    enable_thinking: false
    extra_body:
      enable_thinking: false

Here are the detailed test results:
- dirname: 2025-05-09-12-12-40--qwen3-235b-a22b.unthink_16k_whole
test_cases: 225
model: openai/qwen3-235b-a22b
edit_format: whole
commit_hash: 8159cbf-dirty
pass_rate_1: 26.2
pass_rate_2: 62.7
pass_num_1: 59
pass_num_2: 141
percent_cases_well_formed: 100.0
error_outputs: 0
num_malformed_responses: 0
num_with_malformed_responses: 0
user_asks: 153
lazy_comments: 0
syntax_errors: 0
indentation_errors: 0
exhausted_context_windows: 0
test_timeouts: 1
total_tests: 225
command: aider --model openai/qwen3-235b-a22b
date: 2025-05-09
versions: 0.82.4.dev
seconds_per_case: 49.7
total_cost: 0.0000
- dirname: 2025-05-09-10-44-53--qwen3-235b-a22b.unthink_16k_diff
test_cases: 225
model: openai/qwen3-235b-a22b
edit_format: diff
commit_hash: 8159cbf-dirty
pass_rate_1: 28.0
pass_rate_2: 59.1
pass_num_1: 63
pass_num_2: 133
percent_cases_well_formed: 92.9
error_outputs: 19
num_malformed_responses: 19
num_with_malformed_responses: 16
user_asks: 118
lazy_comments: 0
syntax_errors: 0
indentation_errors: 0
exhausted_context_windows: 0
test_timeouts: 3
total_tests: 225
command: aider --model openai/qwen3-235b-a22b
date: 2025-05-09
versions: 0.82.4.dev
seconds_per_case: 45.1
total_cost: 0.0000

The full log can be found here: The final results show some fluctuations, but the overall performance is broadly consistent with my previous results. And someone got close results using Fireworks.
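As a consistency check, the reported pass rates follow directly from the raw counts in the results above (pass rate = pass_num / test_cases × 100, rounded to one decimal):

```python
# Re-derive the pass rates reported above from the raw pass counts.
def pass_rate(pass_num, test_cases):
    return round(pass_num / test_cases * 100, 1)

print(pass_rate(141, 225))  # whole format, second attempt -> 62.7
print(pass_rate(133, 225))  # diff format, second attempt -> 59.1
```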
I didn't see any aider results for Qwen3-30B-A3B, so I executed the benchmark against a local llama-server + Q8_0 quant. Since it was not clear to me how to disable thinking with aider, I used an HTTP proxy that injected the

I don't understand the time it took to run:

and model_settings.yaml:

To use the

Later I might give it a shot at benchmarking
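On the run-time question: the elided numbers can't be reconstructed here, but as a rough sanity check, if the benchmark cases run serially (a single worker), wall-clock time is about seconds_per_case × test_cases. Using the 235B whole-format figures reported earlier in the thread:

```python
# Rough wall-clock estimate, assuming the 225 cases run one after another.
seconds_per_case = 49.7  # from the whole-format run reported above
test_cases = 225
total_hours = seconds_per_case * test_cases / 3600
print(f"~{total_hours:.1f} hours")  # ~3.1 hours
```

A slow local setup (large quant, long generations) can easily multiply that figure several times over.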
Thanks for running some tests for the smaller MoE model. Could you try running it with /no_think instead of /nothink? Not including the underscore can induce some negative variation in the test results, since it's not the spelling used in the documentation.
In my local testing |
Just FYI, people on the Discord server have long since run these configurations. 30B-A3B with whole and thinking enabled will score 44-45%. Unlike all other models including the 32B, it benefits from thinking, and these results hold even down to Q4. The setting

With diff the score drops to 31-32%, so for that you're better off with the 32B (38-40%).
Following the discussion in this comment, I am attempting to configure and test the

My current configuration is as follows:

In model-settings-file: ~/.aider.model.settings.yml

In

- name: openai/qwen3-235b-a22b
  use_temperature: 0.7
  streaming: false
  extra_params:
    stream: false
    max_tokens: 16384
    top_p: 0.8
    top_k: 20
    temperature: 0.7
    enable_thinking: false
    extra_body:
      enable_thinking: false

I have also ensured that the

litellm.APIError: APIError: OpenAIException - Connection error.

Could anyone provide suggestions or point out potential misconfigurations that might be leading to this connection issue? Any insights would be greatly appreciated!
Connection error suggests


