[Continuously Updated] LiveBench 0425 + Aider Combined Leaderboard (DeepSeek R1 0528)

| Model | Organization | Global Average | Reasoning Average | aider | Mathematics Average | Data Analysis Average | Language Average | IF Average |
|---|---|---|---|---|---|---|---|---|
| o3 High | OpenAI | 81.19 | 93.33 | 79.6 | 85 | 67.02 | 76 | 86.17 |
| Gemini 2.5 Pro Preview (2025-05-06) | Google | 79.66 | 88.25 | 76.9 | 88.63 | 68.85 | 71.81 | 83.5 |
| Claude 4 Opus Thinking | Anthropic | 79.32 | 90.47 | 72 | 88.25 | 70.73 | 73.72 | 80.74 |
| o4-Mini High | OpenAI | 77.39 | 88.11 | 72 | 84.9 | 68.33 | 66.05 | 84.96 |
| DeepSeek R1 (2025-05-28) | DeepSeek | 77.38 | 91.08 | 71.6 | 85.26 | 71.54 | 64.82 | 79.95 |
| Claude 4 Sonnet Thinking | Anthropic | 77.04 | 95.25 | 61.3 | 85.25 | 69.84 | 70.19 | 80.43 |
| Gemini 2.5 Pro Preview (2025-03-25) | Google | 76.99 | 87.53 | 72.9 | 89.16 | 62.47 | 69.31 | 80.59 |
| Claude 3.7 Sonnet Thinking | Anthropic | 73.12 | 76.17 | 64.9 | 79 | 69.11 | 68.27 | 81.25 |
| Qwen 3 235B A22B | Alibaba | 72.27 | 78.61 | 59.6 | 78.78 | 68.31 | 60.61 | 87.73 |
| Claude 4 Opus | Anthropic | 71.16 | 56.44 | 70.7 | 78.79 | 66.51 | 76.11 | 78.38 |
| Gemini 2.5 Flash Preview (2025-05-20) | Google | 70.70 | 78.53 | 55.1 | 84.1 | 69.85 | 57.04 | 79.56 |
| DeepSeek R1 | DeepSeek | 69.48 | 77.17 | 56.9 | 77.91 | 69.63 | 54.77 | 80.51 |
| Grok 3 Mini Beta (High) | xAI | 69.38 | 87.61 | 49.3 | 77 | 64.58 | 59.09 | 78.7 |
| Gemini 2.5 Flash Preview | Google | 67.73 | 73.47 | 47.1 | 81.8 | 65.53 | 59.43 | 79.02 |
| Qwen 3 32B | Alibaba | 66.99 | 77.75 | 40.0 | 75.58 | 68.29 | 55.15 | 85.17 |
| Claude 4 Sonnet | Anthropic | 66.13 | 54.86 | 56.4 | 76.39 | 64.68 | 67.18 | 77.25 |
| QwQ 32B | Alibaba | 62.76 | 76.72 | 20.9 | 76.08 | 69.53 | 51.48 | 81.83 |
| Claude 3.7 Sonnet | Anthropic | 62.30 | 49.11 | 60.4 | 64.65 | 59.96 | 63.19 | 76.49 |
| GPT-4.5 Preview | OpenAI | 60.74 | 54.42 | 44.9 | 67.94 | 60.07 | 64.76 | 72.33 |
| DeepSeek V3.1 | DeepSeek | 60.52 | 44.28 | 55.1 | 71.44 | 64.02 | 46.82 | 81.47 |
| Grok 3 Beta | xAI | 59.79 | 48.53 | 53.3 | 62.75 | 55.63 | 53.8 | 84.74 |
| GPT-4.1 | OpenAI | 59.53 | 44.39 | 52.4 | 62.39 | 66.4 | 54.55 | 77.05 |
| ChatGPT-4o | OpenAI | 56.28 | 48.81 | 45.3 | 55.72 | 66.52 | 49.43 | 71.92 |
| Claude 3.5 Sonnet | Anthropic | 54.22 | 43.22 | 51.6 | 50.54 | 56.19 | 54.48 | 69.3 |
| Qwen2.5 Max | Alibaba | 52.53 | 38.53 | 21.8 | 56.87 | 64.27 | 58.37 | 75.35 |
| GPT-4.1 Mini | OpenAI | 52.44 | 53.78 | 32.4 | 58.78 | 61.34 | 38 | 70.31 |
| Llama 4 Maverick 17B 128E Instruct | Meta | 48.75 | 43.83 | 15.6 | 60.58 | 47.11 | 49.65 | 75.75 |
| GPT-4o | OpenAI | 45.43 | 39.75 | 18.2 | 41.48 | 63.53 | 44.68 | 64.94 |
| Gemma 3 27B | Google | 41.10 | 34.42 | 4.9 | 52.27 | 38.8 | 41.31 | 74.9 |
| Claude 3.5 Haiku | Anthropic | 40.79 | 26.19 | 28 | 34.84 | 54.12 | 39.71 | 61.88 |
| GPT-4.1 Nano | OpenAI | 37.53 | 35.58 | 8.9 | 42.39 | 49.82 | 30.96 | 57.54 |
| GPT-4o Mini | OpenAI | 34.85 | 25.64 | 3.6 | 38.05 | 55.1 | 29.88 | 56.8 |

Qwen 3 235B A22B and Qwen 3 32B are really this strong?

If anything, I think these scores undersell them (
Qwen's official LiveBench 1120 results are actually quite a bit higher.
Still, considering the model sizes, it's practically alien technology.

Gemini 2.5 Pro 0506 has been added; its coding ability is stronger.
But its other abilities feel slightly weaker to me, even though LiveBench shows gains almost across the board…

Synced Qwen3's official aider scores; they dropped somewhat.

Is this a thinking model?

The entire Qwen3 lineup supports switchable modes: add /think or /no_think to the system prompt to toggle.
As for the aider scores:
the 235B entry is labeled non-thinking mode;
the 32B one doesn't say…
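The soft switch described above can be sketched as a small helper that appends the toggle tag to a prompt. This is a minimal sketch: the function name `with_think_mode` is hypothetical, but the `/think` and `/no_think` tags are Qwen3's documented soft switch.

```python
def with_think_mode(prompt: str, think: bool) -> str:
    """Append Qwen3's thinking-mode soft-switch tag to a prompt.

    Qwen3 toggles thinking mode when /think or /no_think appears
    in the system or user prompt; per the docs, the most recently
    seen tag takes effect in multi-turn conversations.
    """
    tag = "/think" if think else "/no_think"
    return f"{prompt} {tag}"


# Disable thinking for a quick, cheap answer.
fast = with_think_mode("Summarize this changelog.", think=False)
# Force thinking for a hard reasoning problem.
slow = with_think_mode("Prove the inequality.", think=True)
print(fast)  # → Summarize this changelog. /no_think
print(slow)  # → Prove the inequality. /think
```

Note that hosted APIs may also expose an explicit parameter (e.g. an `enable_thinking` chat-template flag) instead of relying on prompt tags; which mechanism applies depends on the serving stack.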

Claude 4's aider scores regressed overall…

Took a look; it feels about the same as artificialanalysis.ai, which covers far more models. Will keep watching.

Really nice! Though the layout is a bit hard on the eyes…
The model coverage is very complete, and the results largely match expectations.
