Our advanced large model is already far ahead of Claude Sonnet-4 and other Chinese open-source models.jpg
Available to try at https://chat.qwen.ai
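Far ahead!!
CLIs are all the rage lately
Where can you use the Qwen CLI again?
Let's go, time to test it
Far ahead indeed
I don't know how Hyperbolic calculates its charges, so I'm leaving the cost out
I originally wanted to test how Kimi K2 differs across providers, and then…
This keeps the original project settings; tuning the prompt should push the score higher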
Aider Polyglot: 60.4
- dirname: 2025-07-22-19-36-56--qwen-qwen3-coder-hyperbolic-02
  test_cases: 225
  model: openai/Qwen/Qwen3-Coder-480B-A35B-Instruct
  edit_format: diff
  commit_hash: f38200c-dirty
  pass_rate_1: 32.4
  pass_rate_2: 60.4
  pass_num_1: 73
  pass_num_2: 136
  percent_cases_well_formed: 95.1
  error_outputs: 14
  num_malformed_responses: 14
  num_with_malformed_responses: 11
  user_asks: 97
  lazy_comments: 0
  syntax_errors: 0
  indentation_errors: 0
  exhausted_context_windows: 0
  prompt_tokens: 3002019
  completion_tokens: 429041
  test_timeouts: 4
  total_tests: 225
  command: aider --model openai/Qwen/Qwen3-Coder-480B-A35B-Instruct
  date: 2025-07-22
  versions: 0.85.3.dev
  seconds_per_case: 39.4
Aider Polyglot: 55.6
- dirname: 2025-07-22-18-22-34--kimi-k2-together-02
  test_cases: 225
  model: openrouter/moonshotai/kimi-k2
  edit_format: diff
  commit_hash: f38200c-dirty
  pass_rate_1: 20.9
  pass_rate_2: 55.6
  pass_num_1: 47
  pass_num_2: 125
  percent_cases_well_formed: 93.3
  error_outputs: 17
  num_malformed_responses: 15
  num_with_malformed_responses: 15
  user_asks: 72
  lazy_comments: 0
  syntax_errors: 0
  indentation_errors: 0
  exhausted_context_windows: 0
  prompt_tokens: 2465203
  completion_tokens: 367131
  test_timeouts: 5
  total_tests: 225
  command: aider --model openrouter/moonshotai/kimi-k2
  date: 2025-07-22
  versions: 0.85.3.dev
  seconds_per_case: 41.9
  total_cost: 3.5666
costs: $0.0159/test-case, $3.57 total, $3.57 projected
Aider Polyglot: 51.1
- dirname: 2025-07-22-17-55-27--kimi-k2-deepinfra-02
  test_cases: 225
  model: openrouter/moonshotai/kimi-k2
  edit_format: diff
  commit_hash: f38200c-dirty
  pass_rate_1: 20.4
  pass_rate_2: 51.1
  pass_num_1: 46
  pass_num_2: 115
  percent_cases_well_formed: 96.0
  error_outputs: 9
  num_malformed_responses: 9
  num_with_malformed_responses: 9
  user_asks: 50
  lazy_comments: 0
  syntax_errors: 0
  indentation_errors: 0
  exhausted_context_windows: 0
  prompt_tokens: 2197170
  completion_tokens: 366197
  test_timeouts: 6
  total_tests: 225
  command: aider --model openrouter/moonshotai/kimi-k2
  date: 2025-07-22
  versions: 0.85.3.dev
  seconds_per_case: 24.0
  total_cost: 2.0141
costs: $0.0090/test-case, $2.01 total, $2.01 projected
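For anyone who wants to sanity-check the three runs above, here is a rough sketch that recomputes the headline numbers from the raw counts. It assumes the pass rates are simply `pass_num / test_cases` and the per-case cost is `total_cost / test_cases`; that is inferred from the figures in this post, not taken from aider's source. The helper names `pass_rate` and `cost_per_case` are made up for illustration.

```python
# Raw counts copied from the three result blocks in this thread.
# total_cost is None for the Hyperbolic run because no cost was posted.
runs = {
    "qwen3-coder (hyperbolic)": {"test_cases": 225, "pass_num_1": 73, "pass_num_2": 136, "total_cost": None},
    "kimi-k2 (together)": {"test_cases": 225, "pass_num_1": 47, "pass_num_2": 125, "total_cost": 3.5666},
    "kimi-k2 (deepinfra)": {"test_cases": 225, "pass_num_1": 46, "pass_num_2": 115, "total_cost": 2.0141},
}


def pass_rate(passed, total):
    """Percentage of cases passed, rounded to one decimal (assumed formula)."""
    return round(100 * passed / total, 1)


def cost_per_case(total_cost, total):
    """Average dollars per test case, or None when no cost was reported."""
    if total_cost is None:
        return None
    return round(total_cost / total, 4)


if __name__ == "__main__":
    for name, r in runs.items():
        n = r["test_cases"]
        cost = cost_per_case(r["total_cost"], n)
        cost_text = "n/a" if cost is None else f"${cost:.4f}/test-case"
        print(
            f"{name}: pass_rate_1={pass_rate(r['pass_num_1'], n)} "
            f"pass_rate_2={pass_rate(r['pass_num_2'], n)} cost={cost_text}"
        )
```

Under those assumptions the recomputed values line up with the posted `pass_rate_1`, `pass_rate_2` and `costs` lines, so the 60.4 / 55.6 / 51.1 headline scores are just the second-attempt pass counts (136, 125, 115) out of 225.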
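Is this benchmark reliable?
The scores themselves are surely fine. Published benchmark numbers can generally be reproduced; they wouldn't set themselves up to be proven wrong
Next to Qwen's benchmark numbers, Kimi-chan looks quite respectable
Waiting for the experts to test it
Hmm, I'll find some channel to test it.. supporting domestic models
These are all far lower than DeepSeek: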
DeepSeek R1 (0528) 71.4%
o3-pro (high) 84.9%
gemini-2.5-pro-preview-06-05 (32k think) 83.1%
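Don't compare it with thinking models, the time they take is different
This model is pretty slow too, you can try it
Some R1 providers output tokens super fast, they're just a bit pricey
"Available to try at https://chat.qwen.ai": what does that mean, is it referring to 2api?
It felt decent to me in Qwen Chat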