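Gemini 2.5 quick test: significantly weaker than o3 mini high and grok 3R, stronger than R1

3-person hat puzzle :white_check_mark: (enough for a passing grade)

Combinatorial extremum :white_check_mark:
Simple graph theory :white_check_mark:
Linear Tower of Hanoi :cross_mark:

5-person hat puzzle :cross_mark: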
Five wise men sit on a bench, all facing the same direction, each wearing a hat. Each wise man can see only the hats of the people in front of him, not his own hat or the hats of the people behind him. They know there are 7 hats in total: 3 black, 1 white, and 3 red. Five of the hats are selected at random for the five wise men to wear. First, the fifth person (who can see the four people in front) is asked: "Can you determine the color of your hat?" He says yes. Then the fourth person, the third person, the second person, and the first person are asked in turn. What will they say? (Each can only answer yes or no.) Among the first 4 people, are there some who can determine the color of their hats no matter what? Please guess the color of their hats.
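For anyone who wants to check model answers on this one, here is a minimal brute-force sketch. It is not an official solution and rests on assumptions that the puzzle text does not spell out: person k sees every hat in front of him (persons 1 to k-1), every yes/no answer is heard by everyone, and all five reason perfectly. The names `all_worlds`, `knows`, `simulate` and `always_yes_given_fifth_yes` are made up for this illustration.

```python
from itertools import product

HATS = {"B": 3, "W": 1, "R": 3}  # 3 black, 1 white, 3 red
SEATS = 5  # index 0 is the first (front) person, index 4 is the fifth


def all_worlds():
    """Every hat assignment that the 7 available hats allow."""
    return [w for w in product(HATS, repeat=SEATS)
            if all(w.count(color) <= n for color, n in HATS.items())]


def knows(world, k, candidates):
    """True if person k (0-based) can name his own hat.

    Assumption: he sees world[:k] and knows the true assignment is one of
    `candidates` (the assignments consistent with all answers so far).
    """
    own = {w[k] for w in candidates if w[:k] == world[:k]}
    return len(own) == 1


def simulate(world):
    """Return {person number: answered yes?}, asking from the fifth to the first."""
    candidates = all_worlds()
    answers = {}
    for k in range(SEATS - 1, -1, -1):
        said_yes = knows(world, k, candidates)
        answers[k + 1] = said_yes
        # Public answer: keep only assignments where person k would answer the same.
        candidates = [w for w in candidates
                      if knows(w, k, candidates) == said_yes]
    return answers


def always_yes_given_fifth_yes():
    """Persons 1 to 4 who answer yes in every assignment where the fifth says yes."""
    runs = [simulate(w) for w in all_worlds()]
    runs = [r for r in runs if r[5]]
    return [p for p in range(1, SEATS) if all(r[p] for r in runs)]


if __name__ == "__main__":
    print(always_yes_given_fifth_yes())
```

Under these assumptions the enumeration reports only the fourth person as always able to answer yes; the others depend on where the white hat sits. A different reading of who sees what would change the result.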

Kevin's number guessing :cross_mark:
Kevin plays a computer game with the N consecutive natural numbers from 1 to N (the minimum is 1).
The program randomly selects 2 of the numbers at the start. Kevin guesses 1 number each round.
The program replies with one of:
1 The guess is correct
2 The guess is greater than both selected numbers
3 The guess is smaller than both selected numbers
4 The guess is between the two selected numbers
Kevin wants a strategy that guarantees both numbers are determined within M rounds in every case.
1 Find the optimal strategy
And M under that strategy?

When M=2, what is the maximum value of N allowed?
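The last question can be checked by exhaustive search. The sketch below is only one reading of the puzzle, with assumptions that are mine: "determined" means the feedback so far leaves exactly one possible pair (Kevin does not have to guess both numbers), and "correct" does not reveal which of the two numbers was hit. The function names `feedback`, `solvable` and `max_n` are hypothetical.

```python
from functools import lru_cache
from itertools import combinations


def feedback(guess, pair):
    """The program's reply for a guess against the hidden pair (low, high)."""
    low, high = pair
    if guess in pair:
        return "correct"
    if guess > high:
        return "greater"
    if guess < low:
        return "smaller"
    return "between"


def solvable(n, rounds):
    """True if some strategy pins down the pair within `rounds` guesses."""

    @lru_cache(maxsize=None)
    def can_resolve(candidates, left):
        if len(candidates) <= 1:
            return True
        # Each guess has at most 4 replies, so `left` guesses can tell apart
        # at most 4 ** left pairs.
        if left == 0 or len(candidates) > 4 ** left:
            return False
        for guess in range(1, n + 1):
            buckets = {}
            for pair in candidates:
                buckets.setdefault(feedback(guess, pair), []).append(pair)
            # The guess works only if every possible reply stays resolvable.
            if all(can_resolve(frozenset(b), left - 1)
                   for b in buckets.values()):
                return True
        return False

    return can_resolve(frozenset(combinations(range(1, n + 1), 2)), rounds)


def max_n(rounds):
    """Scan N upward from 2 and stop at the first N that is not solvable."""
    n = 2
    while solvable(n + 1, rounds):
        n += 1
    return n


if __name__ == "__main__":
    print(max_n(2))
```

Under this reading, `max_n(2)` comes out as 4: with N=5, any first guess can be "correct" for 4 different pairs, and no single second guess separates all four. If the intended rule is that Kevin must actually guess both numbers, the bound would be different.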

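Conclusion: significantly weaker than o3 mini high and grok 3R, stronger than R1.

13 likes

How did o3 mini high and grok 3t each do on these problems?

2 likes

On generalized reasoning, o3 mini high (100%) ≈ grok 3R (96%); the two frequently get the same problems right and the same ones wrong.

o3 mini high solves all of the problems listed here reliably; grok 3R is unstable on the hat puzzles.

On problems beyond o3 mini high's ability, gemini 2.5 also showed no sign at all of being able to solve them.

6 likes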

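Is it really that bad? Released this late and it still can't beat o3 high, let alone o1 pro.

3 likes

Hard to say.

1 like

It's not that bad. I feel it and o3 mini high each have their own strengths.

Has its coding ability caught up with Claude 3.7?

No idea, I've never written code with gemini :tieba_087:

Given o1-pro's compute and cost, nothing besides o3 deep research can match it right now, right? I doubt other companies will build this kind of brute-force model.

If it were really on par with o1-pro, they wouldn't let you use it for free. Gemini follows a multimodal, learning-assistant route, so don't expect too much from its coding.

5 likes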

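Aren't its benchmark scores very high?

I don't know much about pure logic puzzles, so I haven't tested those much :tieba_087:

But grok3thinking is relatively weak at coding and math.
Those two kinds of problems are the more "everyday" ones.
Compared with a model distilled on one particular domain, a model with broader world knowledge has more practical value.

In my experience, grok3thinking is very strong at analyzing code, not at all weaker than o3 mini high. It is weaker at software engineering that involves real-world concepts (user interface and interaction code is often buggy; claude is the best at this). Modifying existing code and outputting it again also tends to go wrong; according to their team, there is some kind of configuration issue?

2 likes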

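I tested it on math problems last night and it wasn't very strong :thinking:

Which problems? Can we see them?

It got the first three problems of this set all correct last night :tieba_016:

Especially the first one. It's the first model to get that one right.

No, Google's models aren't suited to writing code.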
