Skip to content

Latest commit

 

History

History

README.md

Sentient Agent as a Judge

Overview

We propose Sentient Agent as a Judge, the first fully-automated evaluation framework that simulates evolving human emotion and inner cognition to benchmark higher - order social reasoning in LLMs.

framework

An illustration of our proposed SAGE, a novel framework to automatically assess higher-order social cognition in target LLMs

Main Result

Sentient leaderboard

Here we presents the Sentient leaderboard using DeepSeek-V3 as the judge, alongside Arena rankings for comparison. We mainly focus on the top-10 models from the Arena leaderboard for which APIs are available (e.g., Grok-3 was excluded due to lack of access). Additionally, we include eight representative LLMs from four major families and two smaller-scale instruction-tuned models.

ModelSentientSupportive Dialogue Arena
Name Date Rank Score Success Failure Rank Score
Gemini2.5-Pro-Preview 2025-06-05 1 82.4 55 4 1 1476
GPT-4o 2025-03-26 2 79.9 51 4 3 1417
GPT-4.1 2025-04-14 3 68.2 35 13 9 1377
Gemini2.5-Flash-Think 2025-05-20 4 66.1 39 14 3 1420
o3 2025-04-16 5 62.7 32 14 3 1420
GPT-4.5-Preview 2025-02-27 6 62.7 23 15 6 1406
Claude4.0-Think 2025-05-14 7 61.8 22 17 22 1339
Claude3.7-Think 2025-02-24 8 61.3 23 19 33 1307
Doubao-1.5-Pro-Think 2025-04-28 9 61.2 29 20 - -
Hunyuan-TurboS 2025-04-16 10 58.9 23 21 9 1366
Claude3.7 2025-02-24 11 54.8 19 24 37 1300
DeepSeek-V3-0324 2025-03-24 12 54.4 19 23 9 1377
Claude4.0 2025-05-23 13 53.8 16 64 - -
DeepSeek-R1 2025-01-21 14 53.7 31 28 11 1366
Qwen3-235B-A22B 2025-04-29 15 53.7 20 22 13 1356
QwQ-32B 2025-03-06 16 44.3 18 40 25 1324
o4-mini 2025-04-16 17 35.9 10 48 11 1357
Llama3.3-70B 2024-12-06 18 33.3 7 47 65 1265
o1 2024-12-17 19 29.0 5 51 22 1343
Qwen2.5-72B 2024-09-19 20 19.1 4 70 64 1265

Sentient leaderboard using DeepSeek-V3 as the sentient agent. Arena scores are included for comparison.

Results of different sentient agents

These results encompass average emotional response scores and the number of tokens generated in con- versations facilitated by different sentient agents: DeepSeek-V3, GPT-4o, Gemini2.5, and Gemini2.5- Think.

DeepSeek-V3                                                       GPT-4o

     Gemini2.5                                                  Gemini2.5-think

Social Cognition Coordinate

we conceptualize a two- dimensional “Social Cognition Coordinate”. The Y-axis represents the interaction focus, ranging from empathy-oriented (top) to solution-oriented (bottom). The X-axis captures the interaction style, from structured (left) to creative (right). We plot the models within this coordinate space based on qualitative analysis of their dialogue patterns.

BLRI and Utterance Quality Test

we validate the reasonableness of SAGE by examining the correlation between user emotions – the primary output metric of our framework – and internal user thoughts and dialogue utterances.

                

     Correlation between emotion and internaluser thought                   Correlation between emotion and dialogue utterance    

Case Study

Example dialogues of representative LLMs with the simulated user. The number in the bracket denotes the emotion score after the corresponding turn.

Getting Started

Step 1: Set Up Your LLM API

Configure your LLM API in the following files:

  • sentient-agent-as-a-judge/simulator_response.py: Modify the call_llm() function.
  • sentient-agent-as-a-judge/npc_response.py: Modify the call_npc() function.

Step 2: Run the Code

Test the LLM set in npc_response with the LLM-based simulator set in simulator_response. Use our preset simulator profiles located at sentient-agent-as-a-judge/profile/simulator_profile_withfirsttalk.jsonl.

To execute the test, run the following command:

sentient-agent-as-a-judge/test_npc.py

Prepare Simulator Profiles

If you wish to generate your own simulator profiles, follow these steps:

Step 1: Build Seed Sets

Create seed talking sets and seed topic sets for generating profiles. These should include typical conversations and topics relevant to a scene.

Example Talking: 今天去公园了,真开心!

Example Topic: 在学校成绩总是不好怎么办

Step 2: Set Up Your LLM API

Configure your LLM API in the following file:

  • sentient-agent-as-a-judge/profile/build_profile.py: Modify the call_llm() function.

Step 3: Build Profiles Without First Talk

Run the following command to build profiles without the first talk:

sentient-agent-as-a-judge/profile/build_profile.py

Step 4: Build Profiles With First Talk

Set the profile path and store file path in sentient-agent-as-a-judge/build_profile_withfirsttalk.py. Then, run the following command:

sentient-agent-as-a-judge/build_profile_withfirsttalk.py