- This dataset was generated using GPT-4o and should not be used to develop models that compete with OpenAI.
- This release is for research purposes only in support of an academic paper. Our models, datasets, and code are not specifically designed or evaluated for all downstream purposes. We strongly recommend users evaluate and address potential concerns related to accuracy, safety, and fairness before deploying this model. We encourage users to consider the common limitations of AI, comply with applicable laws, and leverage best practices when selecting use cases, particularly for high-risk scenarios where errors or misuse could significantly impact people’s lives, rights, or safety. For further guidance on use cases, refer to our AUP and AI AUP.
This project is licensed under the Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.
PersonaBench addresses the challenge of evaluating AI models’ ability to personalize interactions using private user data, such as conversation history and app usage, by introducing a synthetic data generation pipeline. This pipeline creates realistic user profiles and private documents simulating human activities, enabling safe and effective model evaluation.
The benchmark focuses on assessing Retrieval-Augmented Generation (RAG) pipelines through questions tied to user-specific information supported by synthetic documents. Initial results show that current RAG models struggle with extracting and interpreting personal information, highlighting the need for improved methodologies.
- Synthetic User Documents: A synthetic data generation pipeline that creates private user documents, including user-user conversations, user-AI interactions, and purchase histories. These documents are grounded in confidential user profiles and designed to simulate realistic human activities.
- Questions on Personal Information: For each user, generate questions with ground truth answers about their personal information. These answers can be accurately extracted from specific parts of the user documents.
- Evaluation Framework: Evaluate retrieval models for their ability to extract relevant document chunks and evaluate RAG systems for end-to-end Q&A performance.
- Result Analysis: Present key results along with detailed fine-grained analyses.
Explore PersonaBench and contribute to advancing AI personalization research!
Two communities of data from the PersonaBench dataset have been released in the current version, each giving personal documents and Q&As for three individuals. More data will be released after necessary review. Please don't hesitate to report potential issues or share your suggestions for improving the dataset.
Each node represents an individual with complex user profile attributes, private documents, and questions/answers about personal information.
An example conversation session between two socially connected individuals. The session reveals one person’s prefered music artist is Hans Zimmer.
Create and activate a new conda environment:
conda create -n personabench python=3.11
conda activate personabenchInstall the required packages:
pip install -r requirements.txt
pip install .Set up OpenAI API key:
export OPENAI_API_KEY=<your_openai_api_key>To test the RAG pipeline on the provided two-community evaluation datasets, simply run the following bash script:
bash scripts/evaluation/run_test.shAll results will be saved to the logs folder.
Specifically, for testing the RAG framework for answering personal questions using user documents as support context, use the following command:
python scripts/evaluation/retrieval_and_generation.py --seed 2024 \
--log_dir logs \
--data_dir eval_data/eval_data_v1 \
--test_community_ids community_0,community_1 \
--num_chunks 5 \
--base_models gpt-4o-mini,gpt-4-0613,gpt-3.5-turbo,gpt-4o \
--retrievers all-mpnet-base-v2,gt-context,all-MiniLM-L6-v2,BAAI/bge-m3 \
--test_noises 0.0,0.3,0.5,0.7 \
--verboseThe generation results will be automatically saved to the specified log_dir.
Then, run the following command to calculate statistics and generate figures:
python scripts/evaluation/eval.py --log_dir logs \
--data_dir eval_data/eval_data_v1Retrieval evaluation results:
End-to-end evaluation results: More detailed evaluation:@misc{tan2025personabenchevaluatingaimodels,
title={PersonaBench: Evaluating AI Models on Understanding Personal Information through Accessing (Synthetic) Private User Data},
author={Juntao Tan and Liangwei Yang and Zuxin Liu and Zhiwei Liu and Rithesh Murthy and Tulika Manoj Awalgaonkar and Jianguo Zhang and Weiran Yao and Ming Zhu and Shirley Kokane and Silvio Savarese and Huan Wang and Caiming Xiong and Shelby Heinecke},
year={2025},
eprint={2502.20616},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2502.20616},
}






