You can create a virtual environment and install the required packages using the following commands:
```bash
conda create -n livevqa python=3.9.0 -y
conda activate livevqa
pip install -r requirements.txt
```
Please refer to liveVQA_benchmarks/README.md for detailed information.
This module can help you collect news from BBC, CNN, Forbes, AP and Variety.
Before collecting news, you need to configure the settings in collectors/config.py. Once the settings are in place, run the following command to collect news articles:
```bash
cd LIVEVQA
python run.py
```
Each time you run the command, it collects news articles and saves them in `hot_topics_{timestamp}.json`.
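Downstream stages repeatedly "read the latest" timestamped file; here is a minimal sketch of how that discovery can work (illustrative only, not necessarily the repo's exact code):

```python
import glob
import os

def latest_file(pattern):
    """Return the most recently modified file matching the glob pattern."""
    candidates = glob.glob(pattern)
    if not candidates:
        raise FileNotFoundError(f"No files match {pattern}")
    return max(candidates, key=os.path.getmtime)

# e.g. pick up the newest collected-news file
path = latest_file("hot_topics_*.json")
```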
This module ranks images and filters irrelevant ones out of the collected news articles.
You should set your API key and base path in ranking/config.py (an illustrative sketch follows the command below). After that, you can run the following command to filter images:
```bash
cd ranking
python Model_ranking.py
```
Each time you run the command, it reads the latest `hot_topics_{timestamp}.json` and filters images. The filtered file is saved as `modified_topics_{timestamp}.json`.
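Each config file in this pipeline holds at least an API key and a base path; a hypothetical `ranking/config.py` might look like this (the field names are assumptions, not the repo's actual names):

```python
# ranking/config.py -- hypothetical field names, for illustration only
API_KEY = "your-api-key"        # key for the model API used to rank images
BASE_PATH = "/path/to/LIVEVQA"  # project root containing hot_topics_*.json
```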
This module can generate and filter Level 1 QAs from the filtered news articles.
You should set your API key and base path in qa_makers/config.py and qa_Filter/config.py. After that, you can run the following commands to generate and filter Level 1 QAs:
Generate Level 1 QAs:
```bash
cd qa_makers
python main.py
```
Each time you run the command, it reads the latest `modified_topics_{timestamp}.json` and generates QAs. The output file is saved as `l1_topics_{timestamp}.json`.
Filter Level 1 QAs:
```bash
cd qa_Filter
python main.py
```
Each time you run the command, it reads the latest `l1_topics_{timestamp}.json` and filters QAs. The filtered file is saved as `l1_filtered_topics_{timestamp}.json`.
This module can generate Level 2 QAs from the filtered Level 1 QAs.
You should set your API key and base path in qa_makers_mh/config.py. After that, you can run the following command to generate Level 2 QAs:
```bash
cd qa_makers_mh
python main.py
```
Each time you run the command, it reads the latest `l1_filtered_topics_{timestamp}.json` and generates Level 2 QAs. The output file is saved as `l23_topics_{timestamp}.json`.
This module can filter and validate Level 2 QAs using the GPT-4.1 API to ensure answer quality and accuracy.
You should set your project root directory and OpenRouter API key in qa_L2_Filter/L2_Filter.py. After that, you can run the following command to filter Level 2 QAs:
```bash
cd qa_L2_Filter
python L2_Filter.py
```
Each time you run the command, it will:
- Find the latest `l23_topics_{timestamp}.json` file
- Skip entries that are already discarded
- Validate each Level 2 question by calling the GPT-4.1 API with the question, options, text context, and image
- Compare the API answers with the ground truth
- Remove questions that fail validation
- Discard entire entries if all Level 2 questions are removed
- Save the filtered results in `l23_filtered_topics_{timestamp}.json` (using the same timestamp as the input)
Before running the script, make sure to:
- Set `PROJECT_ROOT` to your LIVEVQA project directory
- Replace `OPENROUTER_API_KEY` with your actual OpenRouter API key
- Ensure all image files referenced in the JSON exist and are accessible
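For example, the two constants at the top of `L2_Filter.py` would be edited along these lines (values are placeholders):

```python
# qa_L2_Filter/L2_Filter.py -- placeholder values, replace with your own
PROJECT_ROOT = "/path/to/LIVEVQA"
OPENROUTER_API_KEY = "your-openrouter-api-key"
```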
The script provides:
- Automatic file detection: Finds the latest `l23_topics` file automatically
- Quality validation: Uses GPT-4.1 to verify answer correctness
- Consistent naming: Output file uses the same timestamp as input
- Progress tracking: Detailed logging of validation results
- Error handling: Graceful handling of missing images and API errors
- Rate limiting: Built-in delays to respect API limits
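Putting these pieces together, here is a condensed sketch of the validation loop (field names, prompt handling, and the model client are assumptions; the repo's `L2_Filter.py` is authoritative):

```python
import time

def validate_entry(entry, ask_model):
    """Drop Level 2 questions whose model answer disagrees with the ground truth.

    `ask_model` is any callable that sends (question, options, text, image_path)
    to the GPT-4.1 API and returns the model's answer string.
    """
    if entry.get("discarded"):
        return entry  # skip entries that are already discarded
    kept = []
    for qa in entry.get("level2_qas", []):  # field name is an assumption
        try:
            answer = ask_model(qa["question"], qa["options"],
                               entry.get("text"), entry.get("img_path"))
        except (OSError, RuntimeError):
            continue  # missing image or API error: drop the question gracefully
        if answer.strip().lower() == qa["answer"].strip().lower():
            kept.append(qa)  # model agrees with ground truth
        time.sleep(1)  # built-in delay to respect API rate limits
    entry["level2_qas"] = kept
    if not kept:
        entry["discarded"] = True  # discard the entry if every question failed
    return entry
```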
If you want to run the whole pipeline automatically, you can set your base path in start.py and run the following command:
```bash
python start.py
```
This will automatically:
- Collect news
- Filter images
- Generate Level 1 QAs
- Filter Level 1 QAs
- Generate Level 2 QAs
- Filter Level 2 QAs (Note: L2 filtering needs to be run separately due to API requirements)
The final output will be saved in `l23_topics_{timestamp}.json`.
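Conceptually, `start.py` chains the stages above in order; here is a minimal sketch of that kind of orchestration (directory layout assumed from the manual steps; the repo's `start.py` is authoritative):

```python
import subprocess

# (working directory, command) pairs mirroring the manual steps above
STAGES = [
    (".",            ["python", "run.py"]),            # collect news
    ("ranking",      ["python", "Model_ranking.py"]),  # filter images
    ("qa_makers",    ["python", "main.py"]),           # generate Level 1 QAs
    ("qa_Filter",    ["python", "main.py"]),           # filter Level 1 QAs
    ("qa_makers_mh", ["python", "main.py"]),           # generate Level 2 QAs
]

for cwd, cmd in STAGES:
    subprocess.run(cmd, cwd=cwd, check=True)  # abort the pipeline if a stage fails
```

Level 2 filtering is then run separately, as noted above.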
This module can help you collect videos from YouTube.
Before collecting videos, you need to:
- Configure settings in `video_code/video_pipeline.sh`
- Download and configure the following repositories according to their instructions:
- Modify the `demo.py` files in both folders based on the implementations in `uvd.py` and `doclayout.py`
The Torch version may conflict with your CUDA version. We recommend checking your CUDA version:
```bash
nvcc --version
nvidia-smi
```
Then install the corresponding torch version:
For CUDA 12.4:
```bash
pip uninstall torch torchvision torchaudio -y
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
```
For CUDA 11.8:
```bash
pip uninstall torch torchvision torchaudio -y
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
```
For CPU only:
```bash
pip uninstall torch torchvision torchaudio -y
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
```
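To confirm that the installed build matches your setup, a quick check:

```bash
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
```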
After configuration, run the following command to collect YouTube videos:
```bash
cd LIVEVQA/video_code
bash video_pipeline.sh
```
💡 Tips: Make sure to install both `ffprobe` and `ffmpeg`, otherwise the pipeline will fail with errors.
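Both tools should be available on your `PATH`; you can verify with:

```bash
ffmpeg -version
ffprobe -version
```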
This module includes:
- Downloading videos
- Splitting videos by text
- Extracting keyframes
- Deduplication
- Selecting final pictures
Finally, it processes a JSON file named `modified_{timestamp}.json`, and the QA generation follows the same process as for NEWS.
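For illustration, the deduplication step can be done with perceptual hashing; here is a minimal sketch using an average hash (an assumed approach, not necessarily the repo's exact implementation; requires Pillow):

```python
from pathlib import Path
from PIL import Image

def average_hash(path, size=8):
    """Downscale to a size x size grayscale image and threshold at the mean."""
    img = Image.open(path).convert("L").resize((size, size))
    pixels = list(img.getdata())
    mean = sum(pixels) / len(pixels)
    return tuple(p > mean for p in pixels)

def dedup_frames(frame_dir):
    """Keep only keyframes whose hash has not been seen before."""
    kept, seen = [], set()
    for path in sorted(Path(frame_dir).glob("*.jpg")):
        h = average_hash(path)
        if h not in seen:
            seen.add(h)
            kept.append(path)
    return kept
```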
📌 Note: We made a small modification to `qa_makers/main.py`: before generating QAs, the module now evaluates whether the associated text is meaningful enough for QA generation. Therefore, to generate QAs from videos, you should use the QA generation code provided in the `video_code` directory. Other components remain unchanged.
This section helps you collect arXiv data.
```bash
cd arxiv
```
First, configure the settings in `arxiv/config.py`. Specifically, change `BASE_DIR` to the directory where you want to save the downloaded papers.
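For example (the path is a placeholder):

```python
# arxiv/config.py
BASE_DIR = "/path/to/arxiv_papers"  # where downloaded papers will be saved
```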
Then run:
```bash
python direct_download.py --yearmonth 2504 --start-id 1 --end-id 100 --concurrent 5 --processes 4
```
You can see the crawled data in `data/raw`.
Process the downloaded papers to extract images and associations:
```bash
python get_article.py --dir /path/to/html/files --workers 4
```
Then you can see the processed data in `data/processed`.
Set the environment variable `OPENAI_API_KEY` to your OpenAI API key (e.g. `export OPENAI_API_KEY=your-key` on Linux/macOS). Then run the following command to select the best images from the processed papers:
```bash
python select_best_images.py --input_dir /path/to/processed/jsons --workers 4 --start_index 0 --end_index 100
```
When synthesizing QAs about the authors, we put all authors from all papers in `authors.json`.
Generate Level 1 QAs:
```bash
python construct_level1.py -i /path/to/processed/jsons -o /path/to/output/level1.jsonl --workers 4
```
Generate Level 2 QAs:
```bash
python construct_level2.py -i /path/to/output/level1.jsonl -o /path/to/output/level2.jsonl --processes 4
```
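The outputs are JSONL files (one JSON object per line); a minimal reader for inspecting them (record fields are not assumed, so this only counts records):

```python
import json

def load_jsonl(path):
    """Read one JSON object per line, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

records = load_jsonl("level2.jsonl")
print(f"{len(records)} Level 2 QA records")
```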