Waited a whole day and nobody ran it, so I went ahead and ran it myself.

| Model | Organization | Global Average | Reasoning Average | Coding Average | Mathematics Average | Data Analysis Average | Language Average | IF Average |
|---|---|---|---|---|---|---|---|---|
| claude-3-7-sonnet-thinking | Anthropic | 76.10 | 87.83 | 74.54 | 79.00 | 74.05 | 59.93 | 81.25 |
| o3-mini-2025-01-31-high | OpenAI | 75.88 | 89.58 | 82.74 | 77.29 | 70.64 | 50.68 | 84.36 |
| o1-2024-12-17-high | OpenAI | 75.67 | 91.58 | 69.69 | 80.32 | 65.47 | 65.39 | 81.55 |
| qwq-32b | Alibaba | 71.96 | 83.50 | 72.23 | 77.82 | 65.03 | 51.35 | 81.83 |
| deepseek-r1 | DeepSeek | 71.57 | 83.17 | 66.74 | 80.71 | 69.78 | 48.53 | 80.51 |
| deepseek-V3-0324 | DeepSeek | 70.2 | 75.3 | 73.5 | 73.7 | 60.1 | 50.1 | 88.5 |
| o3-mini-2025-01-31-medium | OpenAI | 70.01 | 86.33 | 65.38 | 72.37 | 66.56 | 46.26 | 83.16 |
| gpt-4.5-preview | OpenAI | 68.95 | 71.08 | 75.18 | 69.33 | 64.33 | 61.45 | 72.33 |
| gemini-2.0-flash-thinking-exp-01-21 | Google | 66.92 | 78.17 | 53.49 | 75.85 | 69.37 | 42.18 | 82.47 |
| claude-3-7-sonnet | Anthropic | 65.56 | 66.00 | 67.49 | 63.26 | 63.37 | 56.76 | 76.49 |
| gemini-2.0-pro-exp-02-05 | Google | 65.13 | 60.08 | 63.49 | 70.97 | 68.02 | 44.85 | 83.38 |
| gemini-exp-1206 | Google | 64.09 | 57.00 | 63.41 | 72.36 | 63.16 | 51.29 | 77.34 |
| o3-mini-2025-01-31-low | OpenAI | 62.45 | 69.83 | 61.46 | 63.06 | 62.04 | 38.25 | 80.06 |
| qwen2.5-max | Alibaba | 62.29 | 51.42 | 64.41 | 58.35 | 67.93 | 56.28 | 75.35 |
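
For anyone who wants to sanity-check a row: the Global Average column appears to be the plain unweighted mean of the six category averages. That is an assumption on my part, but it matches the rows I checked. A minimal sketch using the deepseek-V3-0324 numbers from the table (the `global_average` helper is just a name made up for illustration):

```python
# Category averages for the deepseek-V3-0324 row, copied from the table.
category_averages = {
    "Reasoning": 75.3,
    "Coding": 73.5,
    "Mathematics": 73.7,
    "Data Analysis": 60.1,
    "Language": 50.1,
    "IF": 88.5,
}


def global_average(scores):
    """Unweighted mean of the category averages (assumed aggregation rule)."""
    return sum(scores.values()) / len(scores)


# Round to one decimal place to match the precision of that row.
print(round(global_average(category_averages), 1))
```

The result can be compared against the Global Average cell of the same row.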
