HippoCamp is a benchmark for evaluating contextual agents on realistic personal-computing environments. It covers multimodal file management across documents, images, audio, video, emails, calendars, and other everyday artifacts, with 42.4 GB of data across more than 2K files. On top of these environments, HippoCamp provides 581 QA pairs and 46.1K structured trajectory annotations for analyzing search, perception, and multi-step reasoning failures.
HippoCamp instantiates three archetypal personal-computing environments and evaluates two task families:
- Factual Retention: retrieve, comprehend, and reason over factual information grounded in multimodal files.
- Profiling: aggregate weak, distributed evidence across files and time to infer a coherent user model.
The current release includes:
- 42.4 GB of benchmark data
- 2K+ real-world files
- 581 QA pairs
- 46.1K structured trajectory annotations
- 3 user profiles
- 2 task families
The released annotation JSONs follow the hierarchy below.
| Asset | Location | Contents |
|---|---|---|
| GitHub Repository | This repository | Code, configs, docs, evaluation scripts |
| Paper | HuggingFace | HippoCamp paper |
| Dataset | HuggingFace | Raw environments, annotations, HippoCamp_Gold, metadata |
| Project Page | hippocamp-ai.github.io | Benchmark overview, examples, leaderboard |
| Data Visualization | hippocamp-ai.github.io/hippocamp | Interactive environment visualization |
| Docker Archives | Google Drive | Six prebuilt benchmark images |
| Demo Video | YouTube | End-to-end WebUI and agent demo |
The Hugging Face dataset is the authoritative data release. Its main structure is:
HippoCamp/
├── Adam/
│   ├── Subset/
│   │   ├── Adam_Subset/
│   │   ├── Adam_Subset.json
│   │   └── Adam_Subset.xlsx
│   └── Fullset/
│       ├── Adam/
│       ├── Adam.json
│       └── Adam_files.xlsx
├── Bei/
│   ├── Subset/
│   │   ├── Bei_Subset/
│   │   ├── Bei_Subset.json
│   │   └── Bei_Subset.xlsx
│   └── Fullset/
│       ├── Bei/
│       ├── Bei.json
│       └── Bei_files.xlsx
└── Victoria/
    ├── Subset/
    │   ├── Victoria_Subset/
    │   ├── Victoria_Subset.json
    │   └── Victoria_Subset.xlsx
    └── Fullset/
        ├── Victoria/
        ├── Victoria.json
        └── Victoria_files.xlsx
These artifacts serve different roles:
- The six source directories store the raw personal-computing files.
- The six annotation JSON files store released QA pairs together with annotations such as file_path, file_number, file_modality, file_type, evidence, rationale, agent_cap, QA_type, and profiling_type.
- HippoCamp_Gold stores parsed-text JSON files with the schema {file_info, summary, segments}.
- The *_files.xlsx spreadsheets store explicit metadata such as creation time, modification time, and location fields.
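The field names listed above are enough for a quick sanity check on downloaded files. The following is a minimal sketch rather than part of the release: `missing_fields` and `check_gold_file` are hypothetical helpers, and treating the annotation fields as top-level keys of a single QA record is an assumption, since the nesting inside the annotation JSONs is not shown here.

```python
import json
from pathlib import Path

# Field names as listed in this README. How they are nested inside the
# released annotation JSONs is an assumption of this sketch.
ANNOTATION_FIELDS = {
    "file_path", "file_number", "file_modality", "file_type",
    "evidence", "rationale", "agent_cap", "QA_type", "profiling_type",
}
# Top-level keys of one parsed-text JSON file in HippoCamp_Gold.
GOLD_FIELDS = {"file_info", "summary", "segments"}


def missing_fields(record: dict, expected: set) -> list:
    """Return the expected field names that are absent from a record, sorted."""
    return sorted(expected - set(record))


def check_gold_file(path) -> list:
    """Load one parsed-text JSON file and report missing top-level keys."""
    with Path(path).open(encoding="utf-8") as handle:
        return missing_fields(json.load(handle), GOLD_FIELDS)
```

An empty list means every expected key is present; anything else names the fields to investigate before indexing or evaluation.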
The Hugging Face Dataset Viewer exposes six configs, each with profiling and factual_retention splits:
| Config | Profile | Scope | Raw files | Total QA | Profiling | Factual retention |
|---|---|---|---|---|---|---|
| adam_fullset | Adam | Full | 344 | 123 | 20 | 103 |
| adam_subset | Adam | Subset | 158 | 18 | 6 | 12 |
| bei_fullset | Bei | Full | 875 | 235 | 20 | 215 |
| bei_subset | Bei | Subset | 147 | 27 | 4 | 23 |
| victoria_fullset | Victoria | Full | 711 | 223 | 20 | 203 |
| victoria_subset | Victoria | Subset | 137 | 11 | 6 | 5 |
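The counts in the config table can be cross-checked with a few lines of arithmetic. This sketch copies the table into a hypothetical `CONFIGS` mapping and confirms that each config's two splits add up to its total, and that the three fullset totals add up to the 581 QA pairs quoted in the overview. It does not load anything from Hugging Face.

```python
# Rows copied from the config table above:
# (raw files, total QA, profiling, factual retention)
CONFIGS = {
    "adam_fullset": (344, 123, 20, 103),
    "adam_subset": (158, 18, 6, 12),
    "bei_fullset": (875, 235, 20, 215),
    "bei_subset": (147, 27, 4, 23),
    "victoria_fullset": (711, 223, 20, 203),
    "victoria_subset": (137, 11, 6, 5),
}


def split_sizes(config: str) -> dict:
    """Return the expected number of rows in each split of one config."""
    _, total, profiling, factual = CONFIGS[config]
    # In the table, the two split counts add up to the config's total.
    assert profiling + factual == total, config
    return {"profiling": profiling, "factual_retention": factual}


fullset_total = sum(
    total for name, (_, total, _, _) in CONFIGS.items() if name.endswith("_fullset")
)
print(fullset_total)  # 123 + 235 + 223 = 581
```

The same mapping can serve as an expected-size check after loading a split, for example by comparing `split_sizes("bei_subset")` against the number of rows actually read.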
All public benchmark data is distributed from the Hugging Face dataset page:
On that page, open the Files and versions tab to browse and download the released directories and files.
| If you want to... | Download this | Why it is needed | Local destination |
|---|---|---|---|
| run the RAG / search-agent pipeline | HippoCamp_Gold/ | it stores the parsed-text JSON used for indexing and retrieval | benchmark/HippoCamp_Gold/ |
| run terminal-agent batch evaluation | one official annotation JSON such as Adam.json or Adam_Subset.json | it provides the released questions, answers, and evidence annotations used as --questions-file | any local path |
| reproduce the analysis figures | Adam.json, Bei.json, Victoria.json, Adam_files.xlsx, Bei_files.xlsx, Victoria_files.xlsx | the analysis scripts read the fullset annotations and metadata spreadsheets directly | benchmark/analysis/data/ |
| inspect or study the raw benchmark environments | the six source directories under Adam/, Bei/, and Victoria/ | they contain the original personal-computing files | any local path |
HippoCamp_Gold is not just an optional extra. It is the parsed-text release that powers the public RAG workflow and the Docker-side return_txt interface. If you only want to browse the raw files in Docker, you do not need it locally. If you want to run the released retrieval pipeline, you do.
.
├── README.md
├── .env.example
├── requirements.txt
├── evaluate.py
├── CITATION.cff
├── assets/
│   ├── figs/
│   └── tables/
├── docs/
│   ├── docker_api.md
│   ├── evaluation.md
│   └── reproduction.md
├── benchmark/
│   ├── README.md
│   ├── pyproject.toml
│   ├── sample_questions.json
│   ├── configs/
│   │   ├── evaluation.yaml
│   │   ├── providers.yaml
│   │   ├── retriever_server.yaml
│   │   ├── services.yaml.example
│   │   └── pipelines/
│   ├── scripts/
│   │   ├── run_offline.py
│   │   ├── run_query.py
│   │   ├── run_evaluation.py
│   │   └── retriever_server.py
│   ├── src/
│   │   ├── providers/
│   │   │   ├── generator/
│   │   │   └── retrieval/
│   │   ├── rag/
│   │   └── shared/
│   ├── analysis/
│   │   ├── README.md
│   │   └── data/README.md
│   └── HippoCamp_Gold/README.md
└── agent/
    ├── README.md
    ├── gemini.py
    ├── chatgpt.py
    ├── claude.py
    ├── vllm.py
    ├── gemini_batch.py
    ├── chatgpt_batch.py
    ├── claude_batch.py
    ├── vllm_batch.py
    └── prompt_modules/
        ├── config.py
        └── prompt_body.py
git clone https://github.com/Savannah-yz/HippoCamp.git
cd HippoCamp
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
Optional editable install for the benchmark subsystem:
pip install -e ./benchmark --no-deps
requirements.txt already includes the merged dependency set used by the public release.
export XDG_CACHE_HOME=$PWD/.cache
export MPLCONFIGDIR=$PWD/.cache/matplotlib
cp .env.example .env
The root .env covers terminal-agent keys, RAG provider keys, judge settings, and optional local-service configuration.
Use the Hugging Face dataset pieces as follows:
- RAG / search-agent pipeline: place the parsed-text release under benchmark/HippoCamp_Gold/.
- Terminal-agent batch evaluation: use an official annotation JSON such as Adam.json, Adam_Subset.json, Bei.json, or Victoria_Subset.json as --questions-file.
- Analysis reproduction: place the three fullset annotation JSON files and the three fullset metadata spreadsheets under benchmark/analysis/data/.
Concrete analysis-input placement:
mkdir -p benchmark/analysis/data
cp /path/to/HippoCamp/Adam/Fullset/Adam.json benchmark/analysis/data/Adam.json
cp /path/to/HippoCamp/Bei/Fullset/Bei.json benchmark/analysis/data/Bei.json
cp /path/to/HippoCamp/Victoria/Fullset/Victoria.json benchmark/analysis/data/Victoria.json
cp /path/to/HippoCamp/Adam/Fullset/Adam_files.xlsx benchmark/analysis/data/Adam_files.xlsx
cp /path/to/HippoCamp/Bei/Fullset/Bei_files.xlsx benchmark/analysis/data/Bei_files.xlsx
cp /path/to/HippoCamp/Victoria/Fullset/Victoria_files.xlsx benchmark/analysis/data/Victoria_files.xlsx
If you are unsure which Hugging Face asset corresponds to your workflow, use the What To Download And Why table above first.
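Before running the analysis scripts, it can help to confirm that all six inputs landed in the right place. This is a small sketch under that assumption only: `expected_analysis_inputs` and `missing_analysis_inputs` are hypothetical helpers, not part of the repository. They check file presence, not file contents.

```python
from pathlib import Path

PROFILES = ("Adam", "Bei", "Victoria")


def expected_analysis_inputs() -> list:
    """File names the analysis workflow expects, per the copy commands above."""
    names = []
    for profile in PROFILES:
        names.append(f"{profile}.json")        # fullset annotation JSON
        names.append(f"{profile}_files.xlsx")  # fullset metadata spreadsheet
    return names


def missing_analysis_inputs(data_dir="benchmark/analysis/data") -> list:
    """Return the expected input files that are not present in data_dir."""
    root = Path(data_dir)
    return [name for name in expected_analysis_inputs() if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_analysis_inputs()
    print("all inputs present" if not missing else f"missing: {missing}")
```

Run it from the repository root so that the default relative path resolves to benchmark/analysis/data/.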
- macOS / Windows: https://www.docker.com/products/docker-desktop/
- Linux: follow your distribution-specific Docker Engine setup
The public workflow uses six prebuilt Docker archives. All images are collected in the shared folder below, and each archive also has a direct Google Drive link:
Load an archive once you have it:
docker load -i hippocamp_adam_subset.tar
Start a container:
docker run -it -p 18082:8080 --name hippocamp-adam-subset hippocamp/adam_subset:latest
The docker run -it ... command gives you the interactive shell. Start the browser WebUI inside the container with:
webui
For detailed container, WebUI, and HTTP-route behavior, see docs/docker_api.md.
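If you switch between several of the six images, generating the two commands from a profile and scope avoids typos. The sketch below reproduces the Adam subset commands shown above exactly. Extending the same archive, container, and image naming pattern to the other five images is an assumption, so check the names of the archives you actually downloaded. `docker_commands` is a hypothetical helper and only builds strings; it does not invoke Docker.

```python
import shlex


def docker_commands(profile: str, scope: str, host_port: int = 18082) -> list:
    """Build the `docker load` and `docker run` command lines for one image.

    The naming pattern is taken from the Adam subset example; applying it to
    other profiles and scopes is an assumption of this sketch.
    """
    tag = f"{profile.lower()}_{scope.lower()}"        # e.g. adam_subset
    container = "hippocamp-" + tag.replace("_", "-")  # e.g. hippocamp-adam-subset
    load = ["docker", "load", "-i", f"hippocamp_{tag}.tar"]
    run = [
        "docker", "run", "-it",
        "-p", f"{host_port}:8080",  # host port -> container port 8080
        "--name", container,
        f"hippocamp/{tag}:latest",
    ]
    return [shlex.join(load), shlex.join(run)]


for line in docker_commands("Adam", "Subset"):
    print(line)
```

Pick a different `host_port` for each container you run at the same time, since two containers cannot bind the same host port.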
The main workflows use different inputs and produce different artifacts:
| Workflow | Main inputs | Required external assets | Main outputs |
|---|---|---|---|
| RAG / search-agent pipeline | benchmark/sample_questions.json for smoke tests, or an official annotation JSON via --batch | benchmark/HippoCamp_Gold/ | per-query result JSONs in --output-dir, plus summary_*.json and evaluation_*.json |
| Terminal agent, single question | a Docker container plus --question | Docker image archive | one session log JSON via --log-json |
| Terminal agent, batch | --questions-file pointing to an official annotation JSON | Docker image archive | summary.jsonl, per-question result JSON files, aggregate.json, and stdout/stderr logs |
| Top-level evaluator | JSON or JSONL file via evaluate.py --input-dataset | none | per-query judge results JSON and aggregate metrics JSON |
| Analysis scripts | fullset annotation JSON files and *_files.xlsx spreadsheets | Hugging Face fullset assets | figures and reports under benchmark/analysis/outputs/ |
If you are unsure which files to feed into which script, start with docs/reproduction.md, benchmark/README.md, and agent/README.md.
HippoCamp exposes two complementary evaluation paths:
- a RAG / search-agent pipeline under benchmark/
- a terminal-agent pipeline under agent/
For complete step-by-step commands covering all methods and configurations, see docs/reproduction.md.
Run these commands from benchmark/.
- Copy the parsed-text release into benchmark/HippoCamp_Gold/.
- Copy and configure the service config and environment file:
  cp configs/services.yaml.example configs/services.yaml
  cp ../.env.example ../.env
- Start Qdrant if you use the default local vector-store setup.
- Build the local index.
- Run a baseline.
docker run -p 6333:6333 -p 6334:6334 \
-v "$PWD/data/qdrant_storage:/qdrant/storage" \
qdrant/qdrant
python3 scripts/run_offline.py HippoCamp_Gold/ --all -e hippo
python3 scripts/run_query.py --batch sample_questions.json -e hippo \
--retrieval standard_rag --generator gemini --evaluate
Use sample_questions.json only for smoke tests. For full evaluation, replace it with one of the official Hugging Face annotation JSON files.
Run the terminal-agent commands from the repository root.
Single-question example:
python3 agent/chatgpt.py \
--container hippocamp-adam-subset \
--question "What does the guide say about court dress code?" \
--ensure-webui \
--log-json result/chatgpt_docker_session.json
Batch example:
python3 agent/chatgpt_batch.py \
--container hippocamp-adam-subset \
--questions-file /path/to/Adam_Subset.json \
--ensure-webui \
--log-dir log/chatgpt_batch \
--result-dir result/chatgpt_batch
The canonical batch input is an official annotation JSON from Hugging Face, not HippoCamp_Gold.
The terminal-agent wrappers expose --prompt-config so you can control whether the agent may use:
| Config | return_ori | return_txt | return_img | Recommended use |
|---|---|---|---|---|
| config0 | on | on | on | Full auxiliary interface |
| config1 | on | off | off | Source-only setting |
| config2 | on | off | on | Image-enabled, text-disabled |
| config3 | on | on | off | Text-enabled, image-disabled |
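When scripting ablations over these settings, the table can be carried around as a small lookup. The mapping below is transcribed from the table and is only an illustration: `PROMPT_CONFIGS`, `enabled_commands`, and `configs_allowing` are hypothetical names, and the actual definitions used by the wrappers live in agent/prompt_modules/ and may be structured differently.

```python
# Transcribed from the --prompt-config table; True means the command is "on".
PROMPT_CONFIGS = {
    "config0": {"return_ori": True, "return_txt": True, "return_img": True},
    "config1": {"return_ori": True, "return_txt": False, "return_img": False},
    "config2": {"return_ori": True, "return_txt": False, "return_img": True},
    "config3": {"return_ori": True, "return_txt": True, "return_img": False},
}


def enabled_commands(config: str) -> list:
    """List the auxiliary commands an agent may use under one config."""
    return [name for name, on in PROMPT_CONFIGS[config].items() if on]


def configs_allowing(command: str) -> list:
    """List the configs under which a given command is enabled."""
    return [name for name, flags in PROMPT_CONFIGS.items() if flags[command]]


print(enabled_commands("config2"))     # ['return_ori', 'return_img']
print(configs_allowing("return_txt"))  # ['config0', 'config3']
```

Note that return_ori is enabled in every config, so the four settings differ only in whether parsed text and images are available.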
For terminal-agent outputs and other custom agent results, use evaluate.py:
python3 evaluate.py \
--input-dataset result/chatgpt_batch/aggregate.json \
--per-query-results-json result/chatgpt_batch/judge_results.json \
--aggregate-metrics-json result/chatgpt_batch/judge_summary.json
HippoCamp exposes two distinct evaluation entrypoints for different output formats.
| Entrypoint | Intended for | Metrics |
|---|---|---|
| benchmark/scripts/run_evaluation.py | RAG / search-agent outputs from run_query.py | ROUGE, BLEU, exact match, semantic similarity, BERTScore, retrieval P/R/F1, LLM judge |
| evaluate.py | Terminal-agent and custom agent outputs | LLM-as-a-judge answer quality, file-list P/R/F1 |
For detailed input/output schemas, JSON examples, and command options, see docs/evaluation.md.
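To make the file-list P/R/F1 metric concrete, here is a minimal set-based sketch. It assumes precision, recall, and F1 are computed over the set of file paths an agent cites versus the set of ground-truth supporting files, with exact string matching. The released evaluate.py may normalize paths or handle empty lists differently, so treat `file_list_prf` as an illustration rather than the reference implementation.

```python
def file_list_prf(predicted, gold):
    """Set-based precision, recall, and F1 between two lists of file paths."""
    pred_set, gold_set = set(predicted), set(gold)
    hits = len(pred_set & gold_set)  # files that are both cited and correct
    precision = hits / len(pred_set) if pred_set else 0.0
    recall = hits / len(gold_set) if gold_set else 0.0
    # F1 is the harmonic mean; define it as 0 when there is no overlap.
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# One of three cited files is correct, and one of two gold files is found.
p, r, f1 = file_list_prf(["a.pdf", "b.png", "c.mp3"], ["a.pdf", "d.eml"])
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.333 0.5 0.4
```

Using sets means duplicate citations of the same file do not inflate precision.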
The agent/ directory is designed to be extensible. The released wrappers use a tag-based interaction contract centered on <tool> and <answer>:
<think>...</think>
<tool>{"name":"terminal","arguments":{"command":"..."}}</tool>
<answer>...</answer>
To build your own prompt-based agent:
- Start from agent/gemini.py or agent/vllm.py.
- Keep the same terminal-tool contract and JSON command shape.
- Treat /hippocamp/data as the working directory root for benchmark file paths.
- Use the released Docker commands as the environment interface: list_files, return_txt, return_img, return_ori, return_metadata, set_flags, webui, webui_status, and webui_stop.
- Preserve the batch output schema so that evaluate.py can score your results without extra adapters.
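A custom agent loop needs to decide, for each model response, whether it is a terminal command to execute or a final answer. The sketch below parses the tag contract shown above using only the standard library. `parse_turn` and its return shape are hypothetical, and giving a `<tool>` tag precedence over an `<answer>` tag in the same response is a design assumption; the released wrappers may resolve that case differently.

```python
import json
import re

TOOL_RE = re.compile(r"<tool>(.*?)</tool>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def parse_turn(text: str) -> dict:
    """Classify one model response as a tool call, a final answer, or invalid."""
    tool = TOOL_RE.search(text)
    if tool:
        try:
            call = json.loads(tool.group(1))
        except json.JSONDecodeError:
            return {"kind": "invalid", "error": "tool payload is not valid JSON"}
        args = call.get("arguments") if isinstance(call, dict) else None
        # Expected shape: {"name": "terminal", "arguments": {"command": "..."}}
        if (
            isinstance(call, dict)
            and call.get("name") == "terminal"
            and isinstance(args, dict)
            and isinstance(args.get("command"), str)
        ):
            return {"kind": "tool", "command": args["command"]}
        return {"kind": "invalid", "error": "unexpected tool shape"}
    answer = ANSWER_RE.search(text)
    if answer:
        return {"kind": "answer", "text": answer.group(1).strip()}
    return {"kind": "invalid", "error": "no <tool> or <answer> tag"}


reply = '<think>look around</think>\n<tool>{"name":"terminal","arguments":{"command":"list_files"}}</tool>'
print(parse_turn(reply))  # {'kind': 'tool', 'command': 'list_files'}
```

Returning an explicit "invalid" result instead of raising lets the loop feed the error back to the model and retry, which is one common way to handle malformed turns.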
Main results on HippoCamp across user profiles. We evaluate representative MLLMs and agent methods on profiling and factual retention, reporting F1 and accuracy (Acc) for each profile and the overall average.
Agent capability-wise analysis on HippoCamp. We report F1 and LLM-judge accuracy aggregated by capability labels, decomposed into search, perception, and reasoning.
This figure shows how many ground-truth supporting files each question requires. It is the benchmark's direct view of evidence breadth.
This figure shows how many distinct file modalities each question spans, such as documents, images, audio, or other file types.
This figure shows the number of reasoning steps required by the released rationale annotations.
This figure summarizes the released scalar difficulty score, which combines evidence breadth, modality breadth, file types, evidence items, reasoning steps, question length, answer length, and time span.
This figure aligns question difficulty with per-question judge scores across released methods, showing how performance changes as questions become harder.
See benchmark/analysis/README.md for the scripts that reproduce these figures.
The public leaderboard is hosted on the project page:
If you evaluate a new prompt-based agent or baseline, email your result package to [email protected]. Include the method name, model name, settings summary, and either the result JSON or the aggregate evaluation output.
This video walks through the current public-facing visualization materials and interaction demos. It includes the data visualization view, the Docker-based environment visualization, and the agent's automated QA workflow inside the benchmark environment.
@misc{yang2026hippocampbenchmarkingcontextualagents,
title={HippoCamp: Benchmarking Contextual Agents on Personal Computers},
author={Zhe Yang and Shulin Tian and Kairui Hu and Shuai Liu and Hoang-Nhat Nguyen and Yichi Zhang and Zujin Guo and Mengying Yu and Zinan Zhang and Jingkang Yang and Chen Change Loy and Ziwei Liu},
year={2026},
eprint={2604.01221},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2604.01221},
}







