| Model | Organization | Global Average | Reasoning Average | aider | Mathematics Average | Data Analysis Average | Language Average | IF Average |
|---|---|---|---|---|---|---|---|---|
| o3 High | OpenAI | 81.19 | 93.33 | 79.6 | 85 | 67.02 | 76 | 86.17 |
| Gemini 2.5 Pro Preview (2025-05-06) | Google | 79.66 | 88.25 | 76.9 | 88.63 | 68.85 | 71.81 | 83.5 |
| Claude 4 Opus Thinking | Anthropic | 79.32 | 90.47 | 72 | 88.25 | 70.73 | 73.72 | 80.74 |
| o4-Mini High | OpenAI | 77.39 | 88.11 | 72 | 84.9 | 68.33 | 66.05 | 84.96 |
| DeepSeek R1 (2025-05-28) | DeepSeek | 77.38 | 91.08 | 71.6 | 85.26 | 71.54 | 64.82 | 79.95 |
| Claude 4 Sonnet Thinking | Anthropic | 77.04 | 95.25 | 61.3 | 85.25 | 69.84 | 70.19 | 80.43 |
| Gemini 2.5 Pro Preview (2025-03-25) | Google | 76.99 | 87.53 | 72.9 | 89.16 | 62.47 | 69.31 | 80.59 |
| Claude 3.7 Sonnet Thinking | Anthropic | 73.12 | 76.17 | 64.9 | 79 | 69.11 | 68.27 | 81.25 |
| Qwen 3 235B A22B | Alibaba | 72.27 | 78.61 | 59.6 | 78.78 | 68.31 | 60.61 | 87.73 |
| Claude 4 Opus | Anthropic | 71.16 | 56.44 | 70.7 | 78.79 | 66.51 | 76.11 | 78.38 |
| Gemini 2.5 Flash Preview (2025-05-20) | Google | 70.70 | 78.53 | 55.1 | 84.1 | 69.85 | 57.04 | 79.56 |
| DeepSeek R1 | DeepSeek | 69.48 | 77.17 | 56.9 | 77.91 | 69.63 | 54.77 | 80.51 |
| Grok 3 Mini Beta (High) | xAI | 69.38 | 87.61 | 49.3 | 77 | 64.58 | 59.09 | 78.7 |
| Gemini 2.5 Flash Preview | Google | 67.73 | 73.47 | 47.1 | 81.8 | 65.53 | 59.43 | 79.02 |
| Qwen 3 32B | Alibaba | 66.99 | 77.75 | 40.0 | 75.58 | 68.29 | 55.15 | 85.17 |
| Claude 4 Sonnet | Anthropic | 66.13 | 54.86 | 56.4 | 76.39 | 64.68 | 67.18 | 77.25 |
| QwQ 32B | Alibaba | 62.76 | 76.72 | 20.9 | 76.08 | 69.53 | 51.48 | 81.83 |
| Claude 3.7 Sonnet | Anthropic | 62.30 | 49.11 | 60.4 | 64.65 | 59.96 | 63.19 | 76.49 |
| GPT-4.5 Preview | OpenAI | 60.74 | 54.42 | 44.9 | 67.94 | 60.07 | 64.76 | 72.33 |
| DeepSeek V3.1 | DeepSeek | 60.52 | 44.28 | 55.1 | 71.44 | 64.02 | 46.82 | 81.47 |
| Grok 3 Beta | xAI | 59.79 | 48.53 | 53.3 | 62.75 | 55.63 | 53.8 | 84.74 |
| GPT-4.1 | OpenAI | 59.53 | 44.39 | 52.4 | 62.39 | 66.4 | 54.55 | 77.05 |
| ChatGPT-4o | OpenAI | 56.28 | 48.81 | 45.3 | 55.72 | 66.52 | 49.43 | 71.92 |
| Claude 3.5 Sonnet | Anthropic | 54.22 | 43.22 | 51.6 | 50.54 | 56.19 | 54.48 | 69.3 |
| Qwen2.5 Max | Alibaba | 52.53 | 38.53 | 21.8 | 56.87 | 64.27 | 58.37 | 75.35 |
| GPT-4.1 Mini | OpenAI | 52.44 | 53.78 | 32.4 | 58.78 | 61.34 | 38 | 70.31 |
| Llama 4 Maverick 17B 128E Instruct | Meta | 48.75 | 43.83 | 15.6 | 60.58 | 47.11 | 49.65 | 75.75 |
| GPT-4o | OpenAI | 45.43 | 39.75 | 18.2 | 41.48 | 63.53 | 44.68 | 64.94 |
| Gemma 3 27B | Google | 41.10 | 34.42 | 4.9 | 52.27 | 38.8 | 41.31 | 74.9 |
| Claude 3.5 Haiku | Anthropic | 40.79 | 26.19 | 28 | 34.84 | 54.12 | 39.71 | 61.88 |
| GPT-4.1 Nano | OpenAI | 37.53 | 35.58 | 8.9 | 42.39 | 49.82 | 30.96 | 57.54 |
| GPT-4o Mini | OpenAI | 34.85 | 25.64 | 3.6 | 38.05 | 55.1 | 29.88 | 56.8 |
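The Global Average column appears to be a plain arithmetic mean of the six category scores. A quick sanity check against the o3 High row (a sketch, assuming unweighted averaging):

```python
# Verify that Global Average is the mean of the six category scores,
# using the o3 High row from the table above.
scores = [93.33, 79.6, 85.0, 67.02, 76.0, 86.17]  # Reasoning, aider, Math, Data, Language, IF
global_avg = sum(scores) / len(scores)
print(round(global_avg, 2))  # 81.19, matching the table's Global Average for o3 High
```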
Qwen 3 235B A22B and Qwen 3 32B are really this strong?
I actually thought they scored low (
Qwen's official LiveBench 1120 results are actually quite a bit higher.
Still, considering the model sizes, it's alien technology indeed.
Gemini 2.5 Pro 0506 is on the board now, with stronger coding ability.
But its other abilities feel slightly weaker, even though LiveBench shows gains almost across the board…
Qwen3's aider numbers were synced with Aider's official scores and dropped somewhat.
Is this a thinking model?
The whole Qwen3 series has switchable modes; you toggle by adding /think or /no_think in the system prompt.
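The toggle described above can be sketched as follows. This assumes Qwen3's documented soft-switch convention of appending the tag to a chat message; the helper function name is my own, not an official API:

```python
# Sketch: toggling Qwen3's thinking mode via the /think and /no_think
# soft switches appended to a chat message (per Qwen3's convention).
def build_messages(user_text: str, thinking: bool) -> list[dict]:
    """Build a chat message list with the Qwen3 mode-switch tag appended."""
    tag = "/think" if thinking else "/no_think"
    return [{"role": "user", "content": f"{user_text} {tag}"}]

# Usage: these messages would then be sent to a Qwen3 endpoint
# via any OpenAI-compatible chat-completions client.
messages = build_messages("Explain this benchmark table.", thinking=False)
print(messages[0]["content"])  # ends with "/no_think"
```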
On the aider scores: the 235B is labeled non-thinking mode,
but the 32B doesn't say…
Claude 4's aider score regressed overall…
Took a look, it feels about the same as artificialanalysis.ai, which covers far more models.. will keep watching.
Pretty good indeed! The layout is just a bit hard on the eyes…
The model coverage is really thorough, and the results mostly match expectations.