This repository contains the official code for generating YouTube watch histories using an LLM-based user simulator.
📖 Proposed in the COLM 2025 paper:
HIPPO-VIDEO: Simulating Watch Histories with Large Language Models for History-Driven Video Highlighting
You can access the full dataset generated by this simulation pipeline on Hugging Face.
git clone https://github.com/jeongeunnn-e/HIPPO-Video.git
cd HIPPO-Video
# Conda (recommended)
conda create -n hippo python=3.10 -y
conda activate hippo
pip install -r requirements.txt
⸻
- Prepare config and seed data
You need to provide a configuration and input data in JSON format. We include an example seed file: seed_data.json.
Example: config.json
{
"data_path": "your_path/seed_data.json",
"save_path": "your_path/outputs/",
"donwload_path": "your_path/downloads/",
"model_name": "gpt-4o",
"max_length": 10,
"OPENAI_API_KEY": "your_openai_key"
}
📑 Example: seed_data.json
[
{
"topic": "Clothes",
"sub_topic": "Shoes",
"feature": "informative",
"initial_query": "how shoes are made from start to finish"
},
{
"topic": "Music",
"sub_topic": "Jazz",
"feature": "emotional",
"initial_query": "best emotional jazz solos"
}
]
You can include multiple seeds to generate multiple simulated sessions.
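Before starting a long simulation, it can help to sanity-check both JSON files. The sketch below is not part of this repository. It copies the key and field names verbatim from the examples above, including the spelling of `donwload_path`. It assumes that every listed key is required, that `max_length` is a positive integer, and that every seed field is a non-empty string. The helper names `check_config`, `check_seeds`, and `check_files` are hypothetical.

```python
import json

# Keys copied verbatim from the example config.json (spelling included).
# Assumption: all of them are required by the pipeline.
CONFIG_KEYS = {
    "data_path", "save_path", "donwload_path",
    "model_name", "max_length", "OPENAI_API_KEY",
}
# Fields copied from the example seed_data.json.
SEED_FIELDS = ("topic", "sub_topic", "feature", "initial_query")


def check_config(config):
    """Return a list of problems found in a parsed config dict."""
    missing = sorted(CONFIG_KEYS - set(config))
    problems = [f"missing config key: {key}" for key in missing]
    max_length = config.get("max_length")
    # Assumption: max_length is meant to be a positive integer.
    if max_length is not None and (not isinstance(max_length, int) or max_length <= 0):
        problems.append("max_length should be a positive integer")
    return problems


def check_seeds(seeds):
    """Return a list of problems found in parsed seed data."""
    if not isinstance(seeds, list):
        return ["seed data should be a JSON array of objects"]
    problems = []
    for i, seed in enumerate(seeds):
        if not isinstance(seed, dict):
            problems.append(f"seed {i}: not an object")
            continue
        for field in SEED_FIELDS:
            value = seed.get(field)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"seed {i}: '{field}' should be a non-empty string")
    return problems


def check_files(config_path="config.json"):
    """Load the config and the seed file it points to, then check both."""
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)
    problems = check_config(config)
    if "data_path" in config:
        with open(config["data_path"], encoding="utf-8") as f:
            problems += check_seeds(json.load(f))
    return problems
```

Calling `check_files("config.json")` returns an empty list when nothing looks wrong, or one message per problem otherwise.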
⸻
- Run simulation
python run.py