Gemini 2.5 quick test: significantly weaker than o3 mini high and grok 3R, stronger than R1

pwtramp123 · March 25, 2025, 18:15

3-person hat puzzle: earns a passing score

Extremal combinatorics
Simple graph theory
Linear Tower of Hanoi

5-person hat puzzle
Five wise men sit on a bench. They face the same direction, each wearing a hat. Each wise man can see only the hats of the people in front of him, not his own hat or the hats of the people behind him. They know there are 7 hats in total: 3 black, 1 white, and 3 red. Five of the hats are chosen at random for the five wise men to wear. First ask the fifth person (who can see the four people in front of him): "Can you determine the color of your hat?" He says yes. Then ask the fourth person, the third person, the second person, and the first person in turn. What will they say? (Each may answer only yes or no.) Among the first 4 people, are there any who can determine the color of their hat no matter what? Please guess the color of their hats.
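For anyone who wants to check a model's answer to the 5-person puzzle, here is a minimal brute-force sketch. It is not from the original post. It assumes each man sees every hat ahead of him (as the parenthetical about the fifth man suggests), that all answers are truthful and heard by everyone, and it treats each "yes"/"no" as a public announcement that prunes the set of possible hat arrangements. The names `all_worlds`, `knows`, and `announce` are made up for illustration.

```python
from itertools import permutations

HATS = "BBBWRRR"  # 3 black, 1 white, 3 red (from the puzzle statement)


def all_worlds():
    """Every distinct way to put 5 of the 7 hats on the bench.

    Index 0 is the first (front) man, index 4 is the fifth (back) man.
    """
    return set(permutations(HATS, 5))


def knows(world, k, possible):
    """True if the man at index k can deduce his own hat in `world`.

    Assumption: he sees indices 0..k-1 and knows the real arrangement
    is one of the arrangements in `possible`.
    """
    seen = world[:k]
    own_options = {w[k] for w in possible if w[:k] == seen}
    return len(own_options) == 1


def announce(possible, k, answer):
    """Keep only the arrangements in which man k would give `answer`."""
    return {w for w in possible if knows(w, k, possible) == answer}


# The fifth man (index 4) says yes.
after_fifth = announce(all_worlds(), 4, True)

# Does the fourth man (index 3) know his hat in every remaining arrangement?
fourth_always_knows = all(knows(w, 3, after_fifth) for w in after_fifth)
```

The same two helpers can be chained for the third, second, and first man by calling `announce` with each answer in turn and then checking `knows` on what is left.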

Kevin's number guessing
Kevin plays a computer game. There are N consecutive natural numbers from 1 to N (min = 1).
The program randomly selects 2 numbers at the start. Kevin guesses 1 number each round.
The program replies with one of:
1 The guess is correct
2 The guess is greater than both selected numbers
3 The guess is smaller than both selected numbers
4 The guess is between the two numbers
Kevin wants a strategy that guarantees both numbers are determined in no more than M rounds in every case.
1 Find the optimal strategy
2 What is M under that strategy?

When M = 2, what is the maximum value of N allowed?
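The M = 2 question can be checked by exhaustive search. The sketch below is not from the original post and rests on assumptions the puzzle leaves open: the two selected numbers are distinct, Kevin only guesses numbers in 1..N, a "correct" reply does not say which of the two numbers was hit, and "determined" means the replies leave exactly one possible pair (Kevin does not have to name both numbers as guesses). The function names are hypothetical.

```python
from functools import lru_cache
from itertools import combinations


def feedback(guess, pair):
    """Reply of the program for one guess against a hidden pair (low, high)."""
    low, high = pair
    if guess in pair:
        return "hit"
    if guess > high:
        return "above"
    if guess < low:
        return "below"
    return "between"


def solvable(n, rounds):
    """True if some strategy pins down the pair within `rounds` guesses."""
    start = frozenset(combinations(range(1, n + 1), 2))

    @lru_cache(maxsize=None)
    def can_finish(candidates, left):
        if len(candidates) <= 1:
            return True  # the pair is already determined
        if left == 0:
            return False
        for guess in range(1, n + 1):
            # Split the remaining candidate pairs by the reply they produce.
            groups = {}
            for pair in candidates:
                groups.setdefault(feedback(guess, pair), set()).add(pair)
            # The guess works only if every possible reply stays solvable.
            if all(can_finish(frozenset(g), left - 1) for g in groups.values()):
                return True
        return False

    return can_finish(start, rounds)


def largest_n(rounds, limit):
    """Largest n in 2..limit that is solvable within `rounds` guesses."""
    return max((n for n in range(2, limit + 1) if solvable(n, rounds)), default=None)
```

Calling `largest_n(2, 12)` gives the answer to the M = 2 question under these assumptions. A different reading of "determined" (for example, requiring Kevin to guess both numbers) would change the result.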

Conclusion: significantly weaker than o3 mini high and grok 3R, stronger than R1.

Biss · March 25, 2025, 18:51

How do o3 mini high and grok 3T each perform on these problems?

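pwtramp123 · March 25, 2025, 19:00

Generalized reasoning: o3 mini high (100%) ≈ grok 3R (96%). They frequently get the same problems right and the same problems wrong.

o3 mini high solves all of the problems listed here reliably; grok 3R is unstable on the hat puzzles.

On problems beyond o3 mini high's ability, Gemini 2.5 also showed no sign at all of being able to solve them.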

voi · March 25, 2025, 22:53

Is it really this weak after coming out this late? It can't even beat o3 high, let alone o1 pro.

lueluelue · March 25, 2025, 22:55

That's hard to say.

lueluelue · March 25, 2025, 22:55

It's not that bad. It feels like it and o3 mini high each have their own strengths.

voi · March 25, 2025, 22:56

Has its coding ability caught up with Claude 3.7?

lueluelue · March 25, 2025, 22:56

No idea, I've never used Gemini to write code.

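YU_TAKASAKI · March 25, 2025, 23:01

Given o1-pro's compute and cost, nothing apart from o3 deep research can match it right now, can it? I don't think other vendors are likely to build this kind of brute-force model.

If it were really on par with o1-pro, they wouldn't let you use it for free. Gemini is on a multimodal, learning-assistant track, so don't expect too much from it on code.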

handsome · March 26, 2025, 00:27

Aren't its benchmark scores very high?

liulapatuoni · March 26, 2025, 00:32

I don't know much about pure logic puzzles, so I haven't run many tests of this kind.

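liulapatuoni · March 26, 2025, 00:41

But grok3thinking is relatively weak at coding and math.
Those two kinds of problems are the more "everyday" ones.
Compared with a model distilled for one particular domain, a model with broader world knowledge has more practical value.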

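pwtramp123 · March 26, 2025, 04:54

In my experience, grok3thinking is very strong at analyzing code, not at all weaker than o3 mini high. It is worse at software engineering that involves real-world concepts [user interface/interaction code often has bugs; Claude is the best at this]. It also tends to run into problems when modifying existing code and outputting it again. According to their team, there is some kind of configuration issue?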

Hiccup_620 · March 26, 2025, 05:11

I tested it on math problems last night and it wasn't very strong.

liulapatuoni · March 26, 2025, 05:33

Could I see which problems?

Hiccup_620 · March 26, 2025, 06:31

liulapatuoni · March 26, 2025, 06:43

It got the first three problems in this set all correct last night.

Hiccup_620 · March 26, 2025, 06:44

Especially the first problem. It's the first model to get it right.

zeduwfd · March 26, 2025, 06:45

No, Google's models aren't suited to writing code.

system · April 25, 2025, 06:45

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.
