SalesforceAIResearch/personabench

👩‍🎓 👨‍⚕️PersonaBench 🏄‍♀️

Evaluating AI Models on Understanding Personal Information through Accessing (Synthetic) Private User Data

📃 Paper | 📚 Data | 🌐 Website

Note

  • This dataset was generated using GPT-4o and should not be used to develop models that compete with OpenAI.
  • This release is for research purposes only in support of an academic paper. Our models, datasets, and code are not specifically designed or evaluated for all downstream purposes. We strongly recommend users evaluate and address potential concerns related to accuracy, safety, and fairness before deploying this model. We encourage users to consider the common limitations of AI, comply with applicable laws, and leverage best practices when selecting use cases, particularly for high-risk scenarios where errors or misuse could significantly impact people’s lives, rights, or safety. For further guidance on use cases, refer to our AUP and AI AUP.

License

This project is licensed under the Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.

Project Introduction

💡 Overview

PersonaBench addresses the challenge of evaluating AI models’ ability to personalize interactions using private user data, such as conversation history and app usage, by introducing a synthetic data generation pipeline. This pipeline creates realistic user profiles and private documents simulating human activities, enabling safe and effective model evaluation.

The benchmark focuses on assessing Retrieval-Augmented Generation (RAG) pipelines through questions tied to user-specific information supported by synthetic documents. Initial results show that current RAG models struggle with extracting and interpreting personal information, highlighting the need for improved methodologies.

Features:

  • Synthetic User Documents: A synthetic data generation pipeline that creates private user documents, including user-user conversations, user-AI interactions, and purchase histories. These documents are grounded in confidential user profiles and designed to simulate realistic human activities.
  • Questions on Personal Information: For each user, the pipeline generates questions with ground-truth answers about their personal information. Each answer can be extracted from specific parts of the user's documents.
  • Evaluation Framework: Retrieval models are evaluated on their ability to extract relevant document chunks, and RAG systems are evaluated on end-to-end Q&A performance.
  • Result Analysis: Key results are presented along with fine-grained analyses.
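To make the retrieval side of the evaluation concrete, here is a minimal sketch of ranking document chunks against a question and scoring the result with recall@k. It is an illustration only: the bag-of-words similarity is a toy stand-in for the embedding retrievers, the chunk texts and ids are invented, and recall@k is one common retrieval metric rather than necessarily the one this benchmark reports.

```python
import math
import re
from collections import Counter


def bow_vector(text):
    """Lowercased bag-of-words counts; a toy stand-in for a learned embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_top_k(question, chunks, k):
    """Return the ids of the k chunks most similar to the question."""
    q = bow_vector(question)
    ranked = sorted(chunks, key=lambda cid: cosine(q, bow_vector(chunks[cid])), reverse=True)
    return ranked[:k]


def recall_at_k(retrieved_ids, gold_ids):
    """Fraction of ground-truth chunk ids that appear among the retrieved ids."""
    gold = set(gold_ids)
    if not gold:
        return 0.0
    return len(gold & set(retrieved_ids)) / len(gold)


# Hypothetical chunks, invented for illustration (not taken from the dataset).
chunks = {
    "c0": "I just bought new running shoes for the marathon.",
    "c1": "My favorite music artist is Hans Zimmer, his scores are amazing.",
    "c2": "Can you suggest a recipe for dinner tonight?",
}
top = retrieve_top_k("Who is the user's favorite music artist?", chunks, k=1)
score = recall_at_k(top, gold_ids=["c1"])
print(top, score)
```

In an end-to-end RAG setting, the retrieved chunks would then be passed to a language model as context for answering the question.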

Explore PersonaBench and contribute to advancing AI personalization research!

📚 Data Example (v1.0)

The current version releases two communities from the PersonaBench dataset, each containing personal documents and Q&As for three individuals. More data will be released after the necessary review. Please don't hesitate to report potential issues or share suggestions for improving the dataset.

Communities

Each node represents an individual with complex user profile attributes, private documents, and questions/answers about personal information.

Figure 1 Figure 2

Document Examples

An example conversation session between two socially connected individuals. The session reveals that one person's preferred music artist is Hans Zimmer.

Quick Start

🛠️ Installation

Create and activate a new conda environment:

conda create -n personabench python=3.11
conda activate personabench

Install the required packages:

pip install -r requirements.txt
pip install .

Set your OpenAI API key:

export OPENAI_API_KEY=<your_openai_api_key>
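Before launching a long evaluation run, it can help to confirm that the key is actually visible to Python. The helper below is a small sketch that is not part of the repository; `has_openai_key` is a hypothetical name, and the only thing taken from the instructions above is the `OPENAI_API_KEY` variable name.

```python
import os


def has_openai_key(env=None):
    """Return True if OPENAI_API_KEY is set to a non-empty value."""
    # Accept an explicit mapping so the check can be exercised without
    # touching the real process environment.
    env = os.environ if env is None else env
    return bool(env.get("OPENAI_API_KEY", "").strip())


if not has_openai_key():
    print("OPENAI_API_KEY is not set; export it before running the evaluation scripts.")
```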

📝 Evaluation

To test the RAG pipeline on the provided two-community evaluation datasets, simply run the following bash script:

bash scripts/evaluation/run_test.sh

All results will be saved to the logs folder.

To run only the retrieval-and-generation step, in which the RAG framework answers personal questions using the user documents as supporting context, use the following command:

python scripts/evaluation/retrieval_and_generation.py --seed 2024 \
    --log_dir logs \
    --data_dir eval_data/eval_data_v1 \
    --test_community_ids community_0,community_1 \
    --num_chunks 5 \
    --base_models gpt-4o-mini,gpt-4-0613,gpt-3.5-turbo,gpt-4o \
    --retrievers all-mpnet-base-v2,gt-context,all-MiniLM-L6-v2,BAAI/bge-m3 \
    --test_noises 0.0,0.3,0.5,0.7 \
    --verbose

The generation results will be automatically saved to the specified log_dir.
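The command above passes several comma-separated lists, so the number of runs can grow quickly. The sketch below parses the same values and enumerates their combinations, which is useful for estimating API cost up front. It assumes every (model, retriever, noise) combination is evaluated; the script may handle some combinations differently, and `parse_list` is a hypothetical helper rather than a function from the repository.

```python
from itertools import product


def parse_list(value, cast=str):
    """Split a comma-separated CLI value into a list, casting each item."""
    return [cast(item.strip()) for item in value.split(",") if item.strip()]


# Values copied from the example command.
base_models = parse_list("gpt-4o-mini,gpt-4-0613,gpt-3.5-turbo,gpt-4o")
retrievers = parse_list("all-mpnet-base-v2,gt-context,all-MiniLM-L6-v2,BAAI/bge-m3")
noises = parse_list("0.0,0.3,0.5,0.7", cast=float)

# Assumption: the full Cartesian product of settings is evaluated.
grid = list(product(base_models, retrievers, noises))
print(f"{len(grid)} configurations")
```

Trimming any one of the lists in the command reduces the run count proportionally, so a quick smoke test with a single model, retriever, and noise level is a cheap first step.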

Then, run the following command to calculate statistics and generate figures:

python scripts/evaluation/eval.py --log_dir logs \
    --data_dir eval_data/eval_data_v1

📊 Results

Retrieval evaluation results:

End-to-end evaluation results:

More detailed evaluation:

Citation

@misc{tan2025personabenchevaluatingaimodels,
      title={PersonaBench: Evaluating AI Models on Understanding Personal Information through Accessing (Synthetic) Private User Data}, 
      author={Juntao Tan and Liangwei Yang and Zuxin Liu and Zhiwei Liu and Rithesh Murthy and Tulika Manoj Awalgaonkar and Jianguo Zhang and Weiran Yao and Ming Zhu and Shirley Kokane and Silvio Savarese and Huan Wang and Caiming Xiong and Shelby Heinecke},
      year={2025},
      eprint={2502.20616},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2502.20616}, 
}
