Welcome to the official repository for SocialMaze, a novel benchmark designed to comprehensively evaluate the social reasoning capabilities of Large Language Models (LLMs). SocialMaze systematically incorporates three core challenges inherent in real-world social interactions: deep reasoning, dynamic interaction, and information uncertainty. It features six diverse tasks across three key settings—social reasoning games, daily-life interactions, and digital community platforms.
While this repository contains the full code for data generation and model evaluation across all tasks, we strongly recommend using our pre-packaged dataset on Hugging Face for a streamlined model evaluation experience: SocialMaze on Hugging Face Datasets 👈
This allows you to directly test LLMs on curated data, particularly for our most challenging task, Hidden Role Deduction, which is provided in a convenient QA format.
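For example, here is a minimal sketch of pulling the curated data with the Hugging Face `datasets` library. The repository ID, configuration name, and split below are placeholders; check the dataset page linked above for the exact identifiers:

```python
# Minimal sketch: load the curated SocialMaze data from the Hugging Face Hub.
# NOTE: "your-org/SocialMaze", the config name, and the split are placeholders --
# see the Hugging Face dataset page for the actual identifiers.
from datasets import load_dataset

dataset = load_dataset("your-org/SocialMaze", name="hidden_role_deduction")

# Each Hidden Role Deduction record is provided in QA format.
example = dataset["train"][0]
print(example)
```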
To set up the environment for running the SocialMaze benchmark scripts, please follow these steps:
```bash
# 1. Create a new conda environment (Python 3.9 is recommended)
conda create -n socialmaze python=3.9

# 2. Activate the environment
conda activate socialmaze

# 3. Install the required dependencies
pip install -r requirements.txt
```

To use models that require API keys (e.g., OpenAI, DeepInfra) or to access specific data sources (e.g., OpenReview), you need to configure your credentials:

- Create a file named `.env` in the `utils/` directory of this project.
- Add the necessary API keys and credentials to this file. You only need to add keys for the models or data sources you intend to use.
Here is an example structure for your .env file:
```
# For OpenAI models
OPENAI_API_KEY=your_openai_api_key

# For models hosted on DeepInfra
DEEPINFRA_BASE_URL=https://api.deepinfra.com/v1/openai
DEEPINFRA_API_KEY=your_deepinfra_api_key

# For the "Review Decision Prediction" task (accessing ICLR data from OpenReview)
# These are ONLY required if you intend to generate or evaluate data for this specific task.
OPENREVIEW_USERNAME=your_openreview_username
OPENREVIEW_PASSWORD=your_openreview_password
```
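The benchmark scripts read these credentials at runtime. Below is a minimal, illustrative sketch of how such a loader could look, assuming the `python-dotenv` package and the `utils/.env` location described above; the repository's own loading code may differ:

```python
# Illustrative sketch (not necessarily the repository's actual loader):
# read credentials from utils/.env using the python-dotenv package.
import os
from dotenv import load_dotenv

load_dotenv("utils/.env")  # path assumed from the setup instructions above

openai_key = os.getenv("OPENAI_API_KEY")
if openai_key is None:
    print("OPENAI_API_KEY is not set; it is only needed if you use OpenAI models.")
```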
The SocialMaze benchmark includes scripts for both data generation (`*_gen.py`) and model evaluation (`*_eva.py`) for each task. Below is a list of tasks and their corresponding scripts:
- **Hidden Role Deduction**
  - Generate data: `python hidden_role_deduction/hrd_gen.py`
  - Evaluate models: `python hidden_role_deduction/hrd_eva.py`
- **Find the Spy**
  - Generate data: `python find_the_spy/fts_gen.py`
  - Evaluate models: `python find_the_spy/fts_eva.py`
- **Rating Estimation from Text**
  - Generate LLM-based data: `python rating_estimation_from_text/reft_gen_llm.py`
  - Evaluate models: `python rating_estimation_from_text/reft_eva.py`
- **Review Decision Prediction**
  - Fetch ICLR review data from OpenReview: `python review_decision_prediction/rdp_gen_iclr.py` (Note: requires `OPENREVIEW_USERNAME` and `OPENREVIEW_PASSWORD` in `.env`)
  - Evaluate models: `python review_decision_prediction/rdp_eva.py`
- **Social Graph Analysis**
  - Generate data: `python social_graph_analysis/sga_gen.py`
  - Evaluate models: `python social_graph_analysis/sga_eva.py`
- **User Profile Inference**
  - Generate data: `python user_profile_inference/upi_gen.py`
  - Evaluate models: `python user_profile_inference/upi_eva.py`
For Data Generation Scripts (`*_gen.py`):
You can customize various parameters, such as:
- `dataset_types`: Specify the types or variants of data to generate for a task.
- `num_scenarios`: Define the number of data instances to create.
- `--models`: (If applicable, e.g., for `reft_gen_llm.py`) Specify which LLMs to use for generating data.
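For example, a generation run might be invoked along these lines. This is illustrative only: the exact flag spellings, defaults, and accepted values are defined by each script's argument parser, so verify them there first.

```bash
# Illustrative only: check hidden_role_deduction/hrd_gen.py's argument parser
# for the exact flag names and the dataset types it accepts.
python hidden_role_deduction/hrd_gen.py --num_scenarios 100 --dataset_types <your_dataset_type>
```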
For Evaluation Scripts (`*_eva.py`):
You can customize parameters like:
- `--models`: Select the LLMs you want to evaluate.
- `num_scenarios`: Specify the number of data instances from the dataset to use for evaluation.
- `dataset_types`: Choose specific subsets or types of data to evaluate on.
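As a concrete illustration, an evaluation run could look like the following; the model identifier is a placeholder, and the exact flag spellings should be checked against the script's argument parser.

```bash
# Illustrative only: the model name is a placeholder; check hrd_eva.py's
# argument parser for the exact flags and supported model identifiers.
python hidden_role_deduction/hrd_eva.py --models gpt-4o --num_scenarios 50
```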
For detailed options and specific arguments for each script, please refer to the respective Python files and their argument parsers.
If you use the SocialMaze benchmark or its datasets in your research, we kindly ask you to cite our work:
```bibtex
@article{xu2025socialmaze,
  title={SocialMaze: A Benchmark for Evaluating Social Reasoning in Large Language Models},
  author={Xu, Zixiang and Wang, Yanbo and Huang, Yue and Ye, Jiayi and Zhuang, Haomin and Song, Zirui and Gao, Lang and Wang, Chenxi and Chen, Zhaorun and Zhou, Yujun and Li, Sixian and Pan, Wang and Zhao, Yue and Zhao, Jieyu and Zhang, Xiangliang and Chen, Xiuying},
  year={2025},
  note={Under review}
}
```