SAGE

Sentient Agent as a Judge

Overview

We propose Sentient Agent as a Judge, the first fully-automated evaluation framework that simulates evolving human emotion and inner cognition to benchmark higher - order social reasoning in LLMs.

_{An illustration of our proposed SAGE, a novel framework to automatically assess higher-order social cognition in target LLMs}

Main Result

Sentient leaderboard

Here we presents the Sentient leaderboard using DeepSeek-V3 as the judge, alongside Arena rankings for comparison. We mainly focus on the top-10 models from the Arena leaderboard for which APIs are available (e.g., Grok-3 was excluded due to lack of access). Additionally, we include eight representative LLMs from four major families and two smaller-scale instruction-tuned models.

Model		Sentient		Supportive Dialogue		Arena
Name	Date	Rank	Score	Success	Failure	Rank	Score
Gemini2.5-Pro-Preview	2025-06-05	1	82.4	55	4	1	1476
GPT-4o	2025-03-26	2	79.9	51	4	3	1417
GPT-4.1	2025-04-14	3	68.2	35	13	9	1377
Gemini2.5-Flash-Think	2025-05-20	4	66.1	39	14	3	1420
o3	2025-04-16	5	62.7	32	14	3	1420
GPT-4.5-Preview	2025-02-27	6	62.7	23	15	6	1406
Claude4.0-Think	2025-05-14	7	61.8	22	17	22	1339
Claude3.7-Think	2025-02-24	8	61.3	23	19	33	1307
Doubao-1.5-Pro-Think	2025-04-28	9	61.2	29	20	-	-
Hunyuan-TurboS	2025-04-16	10	58.9	23	21	9	1366
Claude3.7	2025-02-24	11	54.8	19	24	37	1300
DeepSeek-V3-0324	2025-03-24	12	54.4	19	23	9	1377
Claude4.0	2025-05-23	13	53.8	16	64	-	-
DeepSeek-R1	2025-01-21	14	53.7	31	28	11	1366
Qwen3-235B-A22B	2025-04-29	15	53.7	20	22	13	1356
QwQ-32B	2025-03-06	16	44.3	18	40	25	1324
o4-mini	2025-04-16	17	35.9	10	48	11	1357
Llama3.3-70B	2024-12-06	18	33.3	7	47	65	1265
o1	2024-12-17	19	29.0	5	51	22	1343
Qwen2.5-72B	2024-09-19	20	19.1	4	70	64	1265

_{Sentient leaderboard using DeepSeek-V3 as the sentient agent. Arena scores are included for comparison.}

Results of different sentient agents

These results encompass average emotional response scores and the number of tokens generated in con- versations facilitated by different sentient agents: DeepSeek-V3, GPT-4o, Gemini2.5, and Gemini2.5- Think.

_DeepSeek-V3 _GPT-4o

_Gemini2.5 _{Gemini2.5-think}

Social Cognition Coordinate

we conceptualize a two- dimensional “Social Cognition Coordinate”. The Y-axis represents the interaction focus, ranging from empathy-oriented (top) to solution-oriented (bottom). The X-axis captures the interaction style, from structured (left) to creative (right). We plot the models within this coordinate space based on qualitative analysis of their dialogue patterns.

BLRI and Utterance Quality Test

we validate the reasonableness of SAGE by examining the correlation between user emotions – the primary output metric of our framework – and internal user thoughts and dialogue utterances.

_{Correlation between emotion and internaluser thought} _{Correlation between emotion and dialogue utterance}

Case Study

Example dialogues of representative LLMs with the simulated user. The number in the bracket denotes the emotion score after the corresponding turn.

Getting Started

Step 1: Set Up Your LLM API

Configure your LLM API in the following files:

sentient-agent-as-a-judge/simulator_response.py: Modify the call_llm() function.
sentient-agent-as-a-judge/npc_response.py: Modify the call_npc() function.

Step 2: Run the Code

Test the LLM set in npc_response with the LLM-based simulator set in simulator_response. Use our preset simulator profiles located at sentient-agent-as-a-judge/profile/simulator_profile_withfirsttalk.jsonl.

To execute the test, run the following command:

sentient-agent-as-a-judge/test_npc.py

Prepare Simulator Profiles

If you wish to generate your own simulator profiles, follow these steps:

Step 1: Build Seed Sets

Create seed talking sets and seed topic sets for generating profiles. These should include typical conversations and topics relevant to a scene.

Example Talking: 今天去公园了，真开心！

Example Topic: 在学校成绩总是不好怎么办

Step 2: Set Up Your LLM API

Configure your LLM API in the following file:

sentient-agent-as-a-judge/profile/build_profile.py: Modify the call_llm() function.

Step 3: Build Profiles Without First Talk

Run the following command to build profiles without the first talk:

sentient-agent-as-a-judge/profile/build_profile.py

Step 4: Build Profiles With First Talk

Set the profile path and store file path in sentient-agent-as-a-judge/build_profile_withfirsttalk.py. Then, run the following command:

sentient-agent-as-a-judge/build_profile_withfirsttalk.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

Sentient Agent as a Judge

Overview

Main Result

Sentient leaderboard

Results of different sentient agents

Social Cognition Coordinate

BLRI and Utterance Quality Test

Case Study

Getting Started

Step 1: Set Up Your LLM API

Step 2: Run the Code

Prepare Simulator Profiles

Step 1: Build Seed Sets

Step 2: Set Up Your LLM API

Step 3: Build Profiles Without First Talk

Step 4: Build Profiles With First Talk

Name		Name	Last commit message	Last commit date
parent directory ..
dataset		dataset
figures		figures
profile		profile
README.md		README.md
build_profile_withfirsttalk.py		build_profile_withfirsttalk.py
npc_response.py		npc_response.py
simulator_response.py		simulator_response.py
test_npc.py		test_npc.py

FilesExpand file tree

SAGE

Directory actions

More options

Directory actions

More options

Latest commit

History

SAGE

Folders and files

parent directory

README.md

Sentient Agent as a Judge

Overview

Main Result

Sentient leaderboard

Results of different sentient agents

Social Cognition Coordinate

BLRI and Utterance Quality Test

Case Study

Getting Started

Step 1: Set Up Your LLM API

Step 2: Run the Code

Prepare Simulator Profiles

Step 1: Build Seed Sets

Step 2: Set Up Your LLM API

Step 3: Build Profiles Without First Talk

Step 4: Build Profiles With First Talk