
Add benchmarks for Qwen3-235B-A22B and Qwen3-32B #3908

Closed
AlongWY wants to merge 1 commit into Aider-AI:main from AlongWY:patch-1

Conversation

@AlongWY

@AlongWY AlongWY commented Apr 28, 2025

Add benchmarks for Qwen3-235B-A22B and Qwen3-32B

@paul-gauthier
Collaborator

Thanks for your interest in aider and for taking the time to make this PR.

How did you benchmark this model? Where was it running? With what quantization or other inference settings?

@AlongWY
Author

AlongWY commented Apr 29, 2025

Thank you for taking the time to review the PR!

We loaded the Qwen3-235B-A22B and Qwen3-32B models using vLLM with bfloat16 precision, and tested them in non-thinking mode. These were the settings we used:

- name: openai/${MODEL_NAME}
  use_temperature: 0.6
  extra_params:
    max_tokens: 24000
    top_p: 0.95
    top_k: 20
    temperature: 0.6

Aider is such a great AI Pair Programming Assistant!
Big shoutout to the Aider leaderboard too — it's super helpful and gives a clear picture of how models are performing.

Thanks for everything!

@R-Dson

R-Dson commented Apr 30, 2025

Thank you for taking the time to review the PR!

We loaded the Qwen3-235B-A22B and Qwen3-32B models using vLLM with bfloat16 precision, and tested them in non-thinking mode. These were the settings we used:

- name: openai/${MODEL_NAME}
  use_temperature: 0.6
  extra_params:
    max_tokens: 24000
    top_p: 0.95
    top_k: 20
    temperature: 0.6

Aider is such a great AI Pair Programming Assistant! Big shoutout to the Aider leaderboard too — it's super helpful and gives a clear picture of how models are performing.

Thanks for everything!

Your parameters are for the thinking mode. Qwen provides these parameters for non-thinking:

For non-thinking mode (enable_thinking=False), we suggest using Temperature=0.7, TopP=0.8, TopK=20, and MinP=0.

It might affect the results if you run it with these settings.

@AlongWY
Author

AlongWY commented Apr 30, 2025

Thank you for taking the time to review the PR!
We loaded the Qwen3-235B-A22B and Qwen3-32B models using vLLM with bfloat16 precision, and tested them in non-thinking mode. These were the settings we used:

- name: openai/${MODEL_NAME}
  use_temperature: 0.6
  extra_params:
    max_tokens: 24000
    top_p: 0.95
    top_k: 20
    temperature: 0.6

Aider is such a great AI Pair Programming Assistant! Big shoutout to the Aider leaderboard too — it's super helpful and gives a clear picture of how models are performing.
Thanks for everything!

Your parameters are for the thinking mode. Qwen provides these parameters for non-thinking:

For non-thinking mode (enable_thinking=False), we suggest using Temperature=0.7, TopP=0.8, TopK=20, and MinP=0.

It might affect the results if you run it with these settings.

Thank you for your recommendation regarding the inference settings! We have re-evaluated Qwen3's performance on the Aider benchmark using the suggested parameters and found that Qwen3-235B-A22B achieves better results under these settings (61.8 to 65.3 with the whole format).

We have updated our results accordingly and resubmitted them with the recommended configuration.
Thanks again for your valuable suggestion!

@AlongWY
Author

AlongWY commented May 1, 2025

Hi @paul-gauthier,

We've re-evaluated Qwen3's performance on the Aider benchmark with the recommended configuration and updated our results accordingly.

- name: openai/${MODEL_NAME}
  use_temperature: 0.7
  extra_params:
    max_tokens: 24000
    top_p: 0.8
    top_k: 20
    min_p: 0.0
    temperature: 0.7

Would it be possible to merge this PR soon if everything looks good? Let us know if you need any further adjustments or clarifications.

Thanks again for your time and support!

@Yefori-Go

What about Qwen3 30B A3B? (Considering its throughput is quite impressive, and the cost is acceptable)

@pcfreak30

@AlongWY would it be possible to test in thinking mode as well?

@AlongWY
Author

AlongWY commented May 5, 2025

Hi @paul-gauthier

Just wanted to follow up gently on this PR — I'm checking in to see if there's anything preventing it from being merged at the moment.

If there are any issues or further changes needed, please let me know — we're happy to help address them!

Thanks again for your time and consideration.

@FellowTraveler

FellowTraveler commented May 6, 2025

Why is the context so small? Curious also: what happened in diff mode?

@kamuy-shennai

Hi, I tested qwen3-32b with the same configuration, but I got different results. Can you provide me with the detailed log?
pass_rate_1: 11.1 pass_rate_2: 24.4 pass_num_1: 25 pass_num_2: 55

@AlongWY
Author

AlongWY commented May 6, 2025

Hi, I tested qwen3-32b with the same configuration, but I got different results. Can you provide me with the detailed log? pass_rate_1: 11.1 pass_rate_2: 24.4 pass_num_1: 25 pass_num_2: 55

@Mushoz
Contributor

Mushoz commented May 6, 2025

Any chance we could see these models with thinking enabled as well? I am really curious what the performance looks like then.

@Emasoft

Emasoft commented May 7, 2025

what about the thinking version with architect mode enabled?

@paul-gauthier
Collaborator

I am benchmarking openrouter/qwen/qwen3-235b-a22b and don't seem to be getting anywhere close to the results from this PR.

- name: openrouter/qwen/qwen3-30b-a3b
  use_temperature: 0.6
  extra_params:
    max_tokens: 24000
    top_p: 0.95
    top_k: 20
    temperature: 0.6
    extra_body:
      reasoning: high

@tarruda

tarruda commented May 7, 2025

I am benchmarking openrouter/qwen/qwen3-235b-a22b and don't seem to be getting anywhere close to the results from this PR.

Seems like you are using qwen3-30b-a3b instead of qwen3-235b-a22b

@paul-gauthier
Collaborator

I'm benchmarking openrouter/qwen/qwen3-235b-a22b too, with those same settings.

@tarruda

tarruda commented May 7, 2025

I'm benchmarking openrouter/qwen/qwen3-235b-a22b too, with those same settings.

Sorry for the mistake then.

It seems you are enabling reasoning while the PR author stated these results were with reasoning disabled. The context size of 24k seems very small for Qwen 3 reasoning, which can be very verbose and, if it exceeds that length, will start to loop/hallucinate. (Ignore this, I can see now that 24k is "max tokens" and not "context length".)

In my local testing the model actually performs better on coding without reasoning (by adding /nothink to the system prompt). In that case it would be good to set Temperature=0.7, TopP=0.8, TopK=20, and MinP=0 as recommended by the model card.

@AlongWY
Author

AlongWY commented May 8, 2025

I am benchmarking openrouter/qwen/qwen3-235b-a22b and don't seem to be getting anywhere close to the results from this PR.

- name: openrouter/qwen/qwen3-30b-a3b
  use_temperature: 0.6
  extra_params:
    max_tokens: 24000
    top_p: 0.95
    top_k: 20
    temperature: 0.6
    extra_body:
      reasoning: high

Hello @paul-gauthier, it seems that you ran the model with the wrong parameters and in the wrong mode (it should be non-thinking mode, not thinking mode).

We force the model not to think by changing the chat template when deploying with vLLM:

change this in tokenizer_config.json (which is equivalent to enable_thinking=False)

    {%- if enable_thinking is defined and enable_thinking is false %}
        {{- '<think>\n\n</think>\n\n' }}
    {%- endif %}

to

    {{- '<think>\n\n</think>\n\n' }}

Adding /nothink to the system prompt should work too.

We then reran the benchmark with the parameters recommended by the model card: Temperature=0.7, TopP=0.8, TopK=20, and MinP=0.

- name: openai/qwen3-235b-a22b
  use_temperature: 0.7
  extra_params:
    max_tokens: 24000
    top_p: 0.8
    top_k: 20
    min_p: 0.0
    temperature: 0.7
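(For illustration only, not part of the PR: vLLM's OpenAI-compatible server can also disable Qwen3 thinking per request via a chat_template_kwargs field in the request body, without editing tokenizer_config.json. A minimal sketch, assuming a vLLM server that honors this field; the model name is a placeholder and no request is sent here:)

```python
import json

def build_request(model: str, prompt: str) -> dict:
    """Build a /v1/chat/completions payload with Qwen3 thinking disabled.

    Passing enable_thinking=False through chat_template_kwargs has the same
    effect as hard-coding '<think>\n\n</think>\n\n' in the chat template.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 24000,
        "temperature": 0.7,
        "top_p": 0.8,
        "top_k": 20,
        "min_p": 0.0,
        "chat_template_kwargs": {"enable_thinking": False},
    }

payload = build_request("qwen3-235b-a22b", "Write a hello world in C.")
print(json.dumps(payload)[:40])
```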

@paul-gauthier
Collaborator

paul-gauthier commented May 8, 2025

These settings:

- name: openrouter/qwen/qwen3-235b-a22b
  system_prompt_prefix: "/nothink"
  use_temperature: 0.7
  extra_params:
    max_tokens: 24000
    top_p: 0.8
    top_k: 20
    min_p: 0.0
    temperature: 0.7

Only get ~51% results...

- dirname: 2025-05-08-13-52-53--qwen3-235b-shash-nothink
  test_cases: 224
  model: openrouter/qwen/qwen3-235b-a22b
  edit_format: diff
  commit_hash: a39cec8
  pass_rate_1: 23.2
  pass_rate_2: 51.3
  pass_num_1: 52
  pass_num_2: 115
  percent_cases_well_formed: 89.3
  error_outputs: 39
  num_malformed_responses: 33
  num_with_malformed_responses: 24
  user_asks: 109
  lazy_comments: 0
  syntax_errors: 0
  indentation_errors: 0
  exhausted_context_windows: 0
  prompt_tokens: 2959015
  completion_tokens: 366247
  test_timeouts: 6
  total_tests: 225
  command: aider --model openrouter/qwen/qwen3-235b-a22b
  date: 2025-05-08
  versions: 0.82.4.dev
  seconds_per_case: 94.3
  total_cost: 0.6636

@paul-gauthier
Collaborator

I won't be able to merge this PR since I am unable to reproduce scores even close to those reported here.

@SymphonyNineth

I won't be able to merge this PR since I am unable to reproduce scores even close to those reported here.

This was expected. Can you share your results? It's interesting to see the actual performance

@new01

new01 commented May 8, 2025

These settings:

- name: openrouter/qwen/qwen3-235b-a22b
  system_prompt_prefix: "/nothink"
  use_temperature: 0.7
  extra_params:
    max_tokens: 24000
    top_p: 0.8
    top_k: 20
    min_p: 0.0
    temperature: 0.7

Only get ~51% results...

- dirname: 2025-05-08-13-52-53--qwen3-235b-shash-nothink
  test_cases: 224
  model: openrouter/qwen/qwen3-235b-a22b
  edit_format: diff
  commit_hash: a39cec8
  pass_rate_1: 23.2
  pass_rate_2: 51.3
  pass_num_1: 52
  pass_num_2: 115
  percent_cases_well_formed: 89.3
  error_outputs: 39
  num_malformed_responses: 33
  num_with_malformed_responses: 24
  user_asks: 109
  lazy_comments: 0
  syntax_errors: 0
  indentation_errors: 0
  exhausted_context_windows: 0
  prompt_tokens: 2959015
  completion_tokens: 366247
  test_timeouts: 6
  total_tests: 225
  command: aider --model openrouter/qwen/qwen3-235b-a22b
  date: 2025-05-08
  versions: 0.82.4.dev
  seconds_per_case: 94.3
  total_cost: 0.6636

An explanation for the low or inconsistent performance might be that you've made a mistake in your system prompt.

You have:

system_prompt_prefix: "/nothink"

But the correct way to disable thinking on Qwen3 models is "/no_think", with an underscore.

@gcp
Contributor

gcp commented May 8, 2025

But the correct way to disable thinking on Qwen3 models is "/no_think", with an underscore.

The Qwen 3 documentation has both formats. If only one works, they should fix the docs. (From testing, I think both work.)

Welp, that's exactly what they did: QwenLM/Qwen3@2af51e0

I think I've also seen the variants with a \, but given the above I'll assume those are the canonical recommended ones.

@tarruda

tarruda commented May 8, 2025

I wonder if openrouter deploys the model in bfloat16, similar to how it was tested by the PR author.

@AlongWY are you able to reproduce your results in openrouter or some other LLM cloud provider?

@tarruda

tarruda commented May 8, 2025

According to Openrouter FAQ, it is just a proxy for other LLM providers. @paul-gauthier do you know which provider was used for the verification?

Looking at the Qwen3 235B A22B provider list, all of them but one are listed as using FP8, which is different from the settings used by @AlongWY:

[screenshot: OpenRouter provider list for Qwen3 235B A22B, showing the quantization used by each provider]

Fireworks doesn't list the precision, but due to the higher TPS I imagine it should be 4-bit.

@AlongWY my suggestion is to try reproducing the results using a public API provider and opening a new PR that can be verified independently.

Update: seems like Alibaba cloud has an official API for Qwen, so maybe that should be used for reproducing the results @AlongWY : https://www.alibabacloud.com/help/en/model-studio/use-qwen-by-calling-api

@paul-gauthier
Collaborator

I added the results from this PR to a new blog entry:

https://aider.chat/2025/05/08/qwen3.html

Thanks for doing the benchmarking and sharing your results!

@paul-gauthier
Collaborator

[screenshot: benchmark results chart, 2025-05-08]

@AlongWY
Author

AlongWY commented May 9, 2025

Hi @paul-gauthier @tarruda, thank you for your feedback! After trying OpenRouter myself, I found that it indeed fails to activate no-thinking mode (this comes down to the third-party providers, who deploy thinking mode by default). Following @tarruda's suggestion, I tested Qwen's official API (https://dashscope.aliyuncs.com/compatible-mode/v1) with the following configuration:

export OPENAI_API_BASE="https://dashscope.aliyuncs.com/compatible-mode/v1"
export OPENAI_API_KEY="YOUR API KEY"
- name: openai/qwen3-235b-a22b
  use_temperature: 0.7
  streaming: false
  extra_params:
    stream: false
    max_tokens: 16384
    top_p: 0.8
    top_k: 20
    temperature: 0.7
    enable_thinking: false
    extra_body:
      enable_thinking: false
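(For anyone wanting to sanity-check outside aider: the settings above correspond roughly to a raw request like the following sketch. This is an assumption-laden illustration, not part of the PR; "YOUR_API_KEY" is a placeholder and no request is actually sent here.)

```python
import json
import urllib.request

BASE = "https://dashscope.aliyuncs.com/compatible-mode/v1"

def make_request(api_key: str) -> urllib.request.Request:
    """Build (but do not send) a chat-completions request to DashScope."""
    body = {
        "model": "qwen3-235b-a22b",
        "messages": [{"role": "user", "content": "ping"}],
        "stream": False,           # non-streaming, matching the settings above
        "max_tokens": 16384,
        "temperature": 0.7,
        "top_p": 0.8,
        "top_k": 20,
        "enable_thinking": False,  # disables Qwen3 reasoning on this API
    }
    return urllib.request.Request(
        f"{BASE}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = make_request("YOUR_API_KEY")
print(req.full_url)
```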

Here are the detailed test results:

whole format

- dirname: 2025-05-09-12-12-40--qwen3-235b-a22b.unthink_16k_whole
  test_cases: 225
  model: openai/qwen3-235b-a22b
  edit_format: whole
  commit_hash: 8159cbf-dirty
  pass_rate_1: 26.2
  pass_rate_2: 62.7
  pass_num_1: 59
  pass_num_2: 141
  percent_cases_well_formed: 100.0
  error_outputs: 0
  num_malformed_responses: 0
  num_with_malformed_responses: 0
  user_asks: 153
  lazy_comments: 0
  syntax_errors: 0
  indentation_errors: 0
  exhausted_context_windows: 0
  test_timeouts: 1
  total_tests: 225
  command: aider --model openai/qwen3-235b-a22b
  date: 2025-05-09
  versions: 0.82.4.dev
  seconds_per_case: 49.7
  total_cost: 0.0000

diff format

- dirname: 2025-05-09-10-44-53--qwen3-235b-a22b.unthink_16k_diff
  test_cases: 225
  model: openai/qwen3-235b-a22b
  edit_format: diff
  commit_hash: 8159cbf-dirty
  pass_rate_1: 28.0
  pass_rate_2: 59.1
  pass_num_1: 63
  pass_num_2: 133
  percent_cases_well_formed: 92.9
  error_outputs: 19
  num_malformed_responses: 19
  num_with_malformed_responses: 16
  user_asks: 118
  lazy_comments: 0
  syntax_errors: 0
  indentation_errors: 0
  exhausted_context_windows: 0
  test_timeouts: 3
  total_tests: 225
  command: aider --model openai/qwen3-235b-a22b
  date: 2025-05-09
  versions: 0.82.4.dev
  seconds_per_case: 45.1
  total_cost: 0.0000

The full log can be found here:

The final results show some fluctuations, but the overall performance is broadly consistent with my previous results.

And someone got close results using Fireworks.

@paul-gauthier
Collaborator

I was able to benchmark Qwen3 235B A22B via the official Alibaba API. It scored 60% using diff and 62% using the whole edit format. The leaderboard and Qwen3 article have both been updated.

[screenshot: updated leaderboard results, 2025-05-10]

@tarruda

tarruda commented May 11, 2025

I didn't see any aider results for Qwen3-30B-A3B, so I ran the benchmark against a local llama-server with a Q8_0 quant. Since it was not clear to me how to disable thinking with aider, I used an HTTP proxy that injected the /nothink system prompt into all requests. Here are the results:

- dirname: 2025-05-10-21-42-57--qwen3-30b-polyglot
  test_cases: 225
  model: openai/qwen3-30b
  edit_format: whole
  commit_hash: 3daf7d4-dirty
  pass_rate_1: 12.4
  pass_rate_2: 28.4
  pass_num_1: 28
  pass_num_2: 64
  percent_cases_well_formed: 99.6
  error_outputs: 8
  num_malformed_responses: 1
  num_with_malformed_responses: 1
  user_asks: 173
  lazy_comments: 0
  syntax_errors: 0
  indentation_errors: 0
  exhausted_context_windows: 1
  prompt_tokens: 2250144
  completion_tokens: 359857
  test_timeouts: 9
  total_tests: 225
  command: aider --model openai/qwen3-30b
  date: 2025-05-10
  versions: 0.83.1.dev
  seconds_per_case: 127.1
  total_cost: 0.0000

I don't understand the exhausted_context_windows: 1 stat, as I was running it with the full 128K context enabled. This is the command I ran:

time ./benchmark/docker.sh benchmark/benchmark.py qwen3-30b-polyglot --model openai/qwen3-30b --edit-format whole --threads 1 --exercises-dir polyglot-benchmark --read-model-settings model_settings.yaml

time it took to run:

real    514m30.995s
user    0m29.652s
sys     1m8.698s

and model_settings.yaml:

- name: openai/Qwen3
  edit_format: whole
  use_repo_map: true
  use_temperature: 0.7
  extra_params:
    top_p: 0.8
    top_k: 20
    min_p: 0.0
    temperature: 0.7

To use the time command and run the script directly from outside docker, I had to make some small tweaks to the docker.sh script: #3999

Later I might try benchmarking an IQ4_XS quant of Qwen3-235B-A22B on a Mac Studio to see how it compares with the Alibaba API.
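(A hypothetical sketch, not tarruda's actual proxy: the injection step such a proxy would perform on each chat-completions body before forwarding it upstream. The proxy plumbing itself, listening and forwarding, is omitted; "/no_think" follows the spelling in the Qwen3 docs.)

```python
def inject_no_think(body: dict) -> dict:
    """Prepend /no_think to the system message, adding one if absent."""
    messages = list(body.get("messages", []))
    for i, msg in enumerate(messages):
        if msg.get("role") == "system":
            # Prefix the existing system prompt with the no-think directive.
            messages[i] = {**msg, "content": "/no_think\n" + msg["content"]}
            break
    else:
        # No system message at all: insert one containing only the directive.
        messages.insert(0, {"role": "system", "content": "/no_think"})
    return {**body, "messages": messages}

rewritten = inject_no_think({"messages": [{"role": "user", "content": "hi"}]})
print(rewritten["messages"][0])
```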

@new01

new01 commented May 11, 2025

I didn't see any aider results for Qwen3-30B-A3B, so I ran the benchmark against a local llama-server with a Q8_0 quant. Since it was not clear to me how to disable thinking with aider, I used an HTTP proxy that injected the /nothink system prompt into all requests. Here are the results:

- dirname: 2025-05-10-21-42-57--qwen3-30b-polyglot
  test_cases: 225
  model: openai/qwen3-30b
  edit_format: whole
  commit_hash: 3daf7d4-dirty
  pass_rate_1: 12.4
  pass_rate_2: 28.4
  pass_num_1: 28
  pass_num_2: 64
  percent_cases_well_formed: 99.6
  error_outputs: 8
  num_malformed_responses: 1
  num_with_malformed_responses: 1
  user_asks: 173
  lazy_comments: 0
  syntax_errors: 0
  indentation_errors: 0
  exhausted_context_windows: 1
  prompt_tokens: 2250144
  completion_tokens: 359857
  test_timeouts: 9
  total_tests: 225
  command: aider --model openai/qwen3-30b
  date: 2025-05-10
  versions: 0.83.1.dev
  seconds_per_case: 127.1
  total_cost: 0.0000

I don't understand the exhausted_context_windows: 1 stat, as I was running it with the full 128K context enabled. This is the command I ran:

time ./benchmark/docker.sh benchmark/benchmark.py qwen3-30b-polyglot --model openai/qwen3-30b --edit-format whole --threads 1 --exercises-dir polyglot-benchmark --read-model-settings model_settings.yaml

time it took to run:

real    514m30.995s
user    0m29.652s
sys     1m8.698s

and model_settings.yaml:

- name: openai/Qwen3
  edit_format: whole
  use_repo_map: true
  use_temperature: 0.7
  extra_params:
    top_p: 0.8
    top_k: 20
    min_p: 0.0
    temperature: 0.7

To use the time command and run the script directly from outside docker, I had to make some small tweaks to the docker.sh script: #3999

Later I might try benchmarking an IQ4_XS quant of Qwen3-235B-A22B on a Mac Studio to see how it compares with the Alibaba API.

Thanks for running some tests for the smaller MoE model.

Could you try running it with /no_think instead of /nothink? Leaving out the underscore may cause some negative variation in the test results, since that is not the spelling used in the documentation.

@tarruda

tarruda commented May 11, 2025

Could you try running it with /no_think instead of /nothink as not including the underscore can induce some negative variation in the test results due to it not being the spelling used in documentation.

In my local testing /nothink seems to work, but I see now that the official documentation recommends /no_think, so I will switch to it and re-run the benchmarks later.

@gcp
Contributor

gcp commented May 11, 2025

Just FYI, people on the Discord server have long since run these configurations. 30B-A3B with whole and thinking enabled will score 44-45%. Unlike all other models, including the 32B, it benefits from thinking, and these results hold even down to Q4.

Aider's system_prompt_prefix setting makes it easy to inject /no_think with llama.cpp.

With diff the score drops to 31-32%, so for that you're better off with the 32B (38-40%).
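(A sketch of a model-settings entry along those lines, assuming a local llama.cpp server reached via the openai/ prefix; the model name and parameter values are illustrative, following the non-thinking recommendations discussed above, not a configuration confirmed in this thread:)

```yaml
- name: openai/qwen3-30b-a3b
  edit_format: whole
  use_repo_map: true
  system_prompt_prefix: "/no_think"
  use_temperature: 0.7
  extra_params:
    max_tokens: 24000
    top_p: 0.8
    top_k: 20
    min_p: 0.0
    temperature: 0.7
```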

@hensybex

hensybex commented May 13, 2025

Following the discussion in this comment, I am attempting to configure and test the Qwen3-235B-A22B model via its official API within Aider. However, I'm encountering difficulties with the setup.

My current configuration is as follows:

In ~/.aider.conf.yml:

model-settings-file: ~/.aider.model.settings.yml

In ~/.aider.model.settings.yml:

- name: openai/qwen3-235b-a22b
  use_temperature: 0.7
  streaming: false
  extra_params:
    stream: false
    max_tokens: 16384
    top_p: 0.8
    top_k: 20
    temperature: 0.7
    enable_thinking: false
  extra_body:
    enable_thinking: false

I have also ensured that the OPENAI_API_BASE and OPENAI_API_KEY environment variables are correctly exported. Despite these settings, attempts to use the model within Aider result in the following error:

litellm.APIError: APIError: OpenAIException - Connection error.

Could anyone provide suggestions or point out potential misconfigurations that might be leading to this connection issue? Any insights would be greatly appreciated.

@gcp
Contributor

gcp commented May 13, 2025

Connection error suggests OPENAI_API_BASE or OPENAI_API_KEY are wrong, or the provider is down.
