We propose Sentient Agent as a Judge, the first fully-automated evaluation framework that simulates evolving human emotion and inner cognition to benchmark higher - order social reasoning in LLMs.
An illustration of our proposed SAGE, a novel framework to automatically assess higher-order social cognition in target LLMs
Here we presents the Sentient leaderboard using DeepSeek-V3 as the judge, alongside Arena rankings for comparison. We mainly focus on the top-10 models from the Arena leaderboard for which APIs are available (e.g., Grok-3 was excluded due to lack of access). Additionally, we include eight representative LLMs from four major families and two smaller-scale instruction-tuned models.
| Model | Sentient | Supportive Dialogue | Arena | ||||
| Name | Date | Rank | Score | Success | Failure | Rank | Score |
| Gemini2.5-Pro-Preview | 2025-06-05 | 1 | 82.4 | 55 | 4 | 1 | 1476 |
| GPT-4o | 2025-03-26 | 2 | 79.9 | 51 | 4 | 3 | 1417 |
| GPT-4.1 | 2025-04-14 | 3 | 68.2 | 35 | 13 | 9 | 1377 |
| Gemini2.5-Flash-Think | 2025-05-20 | 4 | 66.1 | 39 | 14 | 3 | 1420 |
| o3 | 2025-04-16 | 5 | 62.7 | 32 | 14 | 3 | 1420 |
| GPT-4.5-Preview | 2025-02-27 | 6 | 62.7 | 23 | 15 | 6 | 1406 |
| Claude4.0-Think | 2025-05-14 | 7 | 61.8 | 22 | 17 | 22 | 1339 |
| Claude3.7-Think | 2025-02-24 | 8 | 61.3 | 23 | 19 | 33 | 1307 |
| Doubao-1.5-Pro-Think | 2025-04-28 | 9 | 61.2 | 29 | 20 | - | - |
| Hunyuan-TurboS | 2025-04-16 | 10 | 58.9 | 23 | 21 | 9 | 1366 |
| Claude3.7 | 2025-02-24 | 11 | 54.8 | 19 | 24 | 37 | 1300 |
| DeepSeek-V3-0324 | 2025-03-24 | 12 | 54.4 | 19 | 23 | 9 | 1377 |
| Claude4.0 | 2025-05-23 | 13 | 53.8 | 16 | 64 | - | - |
| DeepSeek-R1 | 2025-01-21 | 14 | 53.7 | 31 | 28 | 11 | 1366 |
| Qwen3-235B-A22B | 2025-04-29 | 15 | 53.7 | 20 | 22 | 13 | 1356 |
| QwQ-32B | 2025-03-06 | 16 | 44.3 | 18 | 40 | 25 | 1324 |
| o4-mini | 2025-04-16 | 17 | 35.9 | 10 | 48 | 11 | 1357 |
| Llama3.3-70B | 2024-12-06 | 18 | 33.3 | 7 | 47 | 65 | 1265 |
| o1 | 2024-12-17 | 19 | 29.0 | 5 | 51 | 22 | 1343 |
| Qwen2.5-72B | 2024-09-19 | 20 | 19.1 | 4 | 70 | 64 | 1265 |
Sentient leaderboard using DeepSeek-V3 as the sentient agent. Arena scores are included for comparison.
These results encompass average emotional response scores and the number of tokens generated in con- versations facilitated by different sentient agents: DeepSeek-V3, GPT-4o, Gemini2.5, and Gemini2.5- Think.
DeepSeek-V3 GPT-4o
Gemini2.5 Gemini2.5-think
we conceptualize a two- dimensional “Social Cognition Coordinate”. The Y-axis represents the interaction focus, ranging from empathy-oriented (top) to solution-oriented (bottom). The X-axis captures the interaction style, from structured (left) to creative (right). We plot the models within this coordinate space based on qualitative analysis of their dialogue patterns.
we validate the reasonableness of SAGE by examining the correlation between user emotions – the primary output metric of our framework – and internal user thoughts and dialogue utterances.
Correlation between emotion and internaluser thought Correlation between emotion and dialogue utterance
Example dialogues of representative LLMs with the simulated user. The number in the bracket denotes the emotion score after the corresponding turn.
Configure your LLM API in the following files:
sentient-agent-as-a-judge/simulator_response.py: Modify thecall_llm()function.sentient-agent-as-a-judge/npc_response.py: Modify thecall_npc()function.
Test the LLM set in npc_response with the LLM-based simulator set in simulator_response. Use our preset simulator profiles located at sentient-agent-as-a-judge/profile/simulator_profile_withfirsttalk.jsonl.
To execute the test, run the following command:
sentient-agent-as-a-judge/test_npc.pyIf you wish to generate your own simulator profiles, follow these steps:
Create seed talking sets and seed topic sets for generating profiles. These should include typical conversations and topics relevant to a scene.
Example Talking: 今天去公园了,真开心!
Example Topic: 在学校成绩总是不好怎么办
Configure your LLM API in the following file:
sentient-agent-as-a-judge/profile/build_profile.py: Modify thecall_llm()function.
Run the following command to build profiles without the first talk:
sentient-agent-as-a-judge/profile/build_profile.pySet the profile path and store file path in sentient-agent-as-a-judge/build_profile_withfirsttalk.py. Then, run the following command:
sentient-agent-as-a-judge/build_profile_withfirsttalk.py







