dominickrei/VisCoP

Demo video: ego_depth_robotics_viscop.mp4

⚙️ Installation

  1. Create a conda environment
conda create --name=viscop python=3.10
conda activate viscop
  2. Clone VisCoP and install the required Python packages (we use PyTorch 2.4.0 + CUDA 12.4 in our experiments)
git clone https://github.com/dominickrei/VisCoP.git
cd VisCoP
pip install -r requirements.txt

  3. Install FlashAttention
pip install flash-attn --no-build-isolation

💻 Inference

First, download the pre-trained VisCoP models from Hugging Face:

hf download dreilly/viscop-models --local-dir ./viscop_trained_models

from viscop import model_init, mm_infer
from viscop.mm_utils import load_video

## Load model
model_path = './viscop_trained_models/viscop_qwen2.5_7b_viscop-lora_egocentric-expert'

model, processor = model_init(
    model_path=model_path,
    device_map={"": "cuda"}
)

## Load video
video_path = './assets/ego_cut_carrot.mp4'

frames, timestamps = load_video(video_path, fps=1, max_frames=180)

## Create conversation
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "video", "timestamps": timestamps, "num_frames": len(frames)},
            {"type": "text", "text": "What vegetable is the person cutting in the video?"},
        ]
    }
]

## Perform inference
inputs = processor(
    images=[frames],
    text=conversation,
    merge_size=2,
    return_tensors="pt",
)

prediction = mm_infer(
    inputs,
    model=model,
    tokenizer=processor.tokenizer,
    do_sample=False,
    modal='video'
)

print(prediction)

🕸️ Gradio Demo

We provide a Gradio interface to interact with VisCoP models. After downloading the models following the instructions above, run the following command to launch the demo:

python inference/launch_gradio_demo.py --model-path ./viscop_trained_models/viscop_qwen2.5_7b_viscop-lora_egocentric-expert

Note: if you are running the demo on a remote machine, be sure to forward the demo's port (9999 by default) so you can access the web interface from your local machine. For example: ssh -L 9999:localhost:9999 user@host

🏋️ Training VisCoP

🎥 Preparing Training Data for Egocentric Viewpoint and Depth Modality

We provide the instruction pairs as well as the videos for training via Hugging Face. After downloading the data, update the following variables in scripts/train/ego_depth_video/train_viscop.sh:

  • DATA_DIR: Update with the path to either egocentric or depth videos
  • TRAINING_JSON: Update with the path to either egocentric or depth instructions
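For illustration, the two variables in scripts/train/ego_depth_video/train_viscop.sh might be set as follows — the paths below are placeholders, not actual repository paths; substitute wherever you extracted the downloaded data:

```shell
# Hypothetical paths — replace with your actual download locations
DATA_DIR=/data/viscop/egocentric_videos                   # or the depth videos directory
TRAINING_JSON=/data/viscop/egocentric_instructions.json   # or the depth instructions file
```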

🤖 Preparing Training Data for Simulated Robot Control

First, download and extract the VIMA data from Hugging Face. Next, generate the training instruction pairs using the conversion script provided by LLaRA. We use the D-inBC-text-multi-train-8k-front instructions for all of our simulated robot control experiments.

  • After extracting the VIMA data and generating the instructions, update DATA_DIR and TRAINING_JSON in scripts/train/robotic_control/train_viscop.sh

🤖 Preparing Training Data for Real-world Robot Control

Coming soon!

🔥 Update Training Script and Launch Training

In scripts/train/ego_depth_video/train_viscop.sh and scripts/train/robotic_control/train_viscop.sh, update the following arguments to match your system settings and paths:

  • INIT_MODEL: The path to the weights of the base VLM (VideoLLaMA3). Use the following command to download and save the weights: python scripts/save_basevlm_for_finetuning.py --save-path-for-local-basevlm /path/to/save/base_vlm
  • DATA_DIR: The path to your data directory containing the egocentric, depth, or robot control data
  • TRAINING_JSON: The path to a json file containing the egocentric, depth, or robot control instructions
  • (Optional) NUM_VISUAL_PROBES: The number of Visual Probes to use in VisCoP
  • (Optional) INTERACTION_MODULE_POS: The positions of the interaction modules. Acceptable values are all or a comma-separated list of integers (denoting zero-indexed layer indices of the vision encoder)
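As an illustration of the INTERACTION_MODULE_POS format, here is a minimal parser sketch — the function name and behavior are our own illustration of the accepted values described above, not code from the repository:

```python
def parse_interaction_module_pos(value: str, num_vision_layers: int) -> list[int]:
    """Resolve INTERACTION_MODULE_POS into zero-indexed vision-encoder layer indices.

    Accepts either the string "all" or a comma-separated list of integers,
    matching the accepted values described above.
    """
    if value.strip().lower() == "all":
        return list(range(num_vision_layers))
    indices = [int(tok) for tok in value.split(",")]
    for idx in indices:
        if not 0 <= idx < num_vision_layers:
            raise ValueError(
                f"Layer index {idx} out of range for a {num_vision_layers}-layer encoder"
            )
    return indices
```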

(Single node training) After updating the training scripts, initiate the training with the following command:

bash scripts/train/ego_depth_video/train_viscop.sh 1 <NUM_GPUS>

(Multi-node training with SLURM) After updating the training scripts, update the arguments in ex_multi_node_slurm_job.sh and submit the job:

sbatch ex_multi_node_slurm_job.sh

❄️ Evaluating VisCoP

💾 Preparing Source and Target Domain Data

Target domain        | Target domain data                                                   | Source domain data
---------------------|----------------------------------------------------------------------|-------------------
Egocentric Viewpoint | Ego-in-Exo PerceptionMCQ, EgoSchema                                  | NeXTQA, VideoMME, ADL-X
Depth Modality       | Ego-in-Exo PerceptionMCQ (Exo Depth) (contained in depth_videos.zip) | Ego-in-Exo PerceptionMCQ (Exo RGB), NeXTQA, VideoMME, ADL-X
Robot Control        | VIMA-Bench                                                           | Ego-in-Exo PerceptionMCQ (Exo RGB), NeXTQA, VideoMME, ADL-X
Click to view our evaluation directory structure
/path/to/vlm_eval_bench/
├── adlx
│   ├── Charades-AR.json
│   ├── Charades-Description.json
│   ├── LEMMA-TC.json
│   ├── Smarthome-AR.json
│   ├── TSU-Description.json
│   ├── TSU-TC.json
│   └── videos
│       ├── ADLMCQ-TC-TSU
│       ├── Charades_v1_480
│       ├── lemma_cropped
│       └── SH_cropped224x224_better
├── egoperceptionmcq
│   ├── all_category_qas.json
│   ├── keystep_segments
│   └── depth_videos
├── egoschema
│   ├── GENERATION
│   ├── MC
│   ├── MC_PPL
│   ├── questions.json
│   ├── Subset
│   ├── subset_answers.json
│   └── videos
├── nextqa
│   ├── NExTVideo
│   └── test.csv
└── videomme
    ├── subtitle
    ├── test-00000-of-00001.parquet
    └── videos

🏃🎥 Run the video understanding evaluations

After downloading the data, update the following variables in scripts/eval/eval_video.sh:

  • DATA_ROOT: Update with the path to your evaluation directory; ensure it follows the structure shown above

After updating the evaluation script, run the following command:

bash scripts/eval/eval_video.sh /path/to/trained/viscop/model <BENCHMARKS> 1 <NUM_GPUS>
e.g., bash scripts/eval/eval_video.sh /path/to/trained/viscop/model egoperceptionmcq,egoschema,nextqa,videomme,adlx_mcq 1 8
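The <BENCHMARKS> argument is a comma-separated list of benchmark names. A small sketch of how such a list can be validated before launching a long evaluation run — the set of names is taken from the example above, and the helper itself is ours, not part of the repository:

```python
# Benchmark names as used in the example command above (assumed complete)
KNOWN_BENCHMARKS = {"egoperceptionmcq", "egoschema", "nextqa", "videomme", "adlx_mcq"}

def parse_benchmarks(arg: str) -> list[str]:
    """Split a comma-separated benchmark string and reject unknown names early."""
    names = [tok.strip() for tok in arg.split(",") if tok.strip()]
    unknown = sorted(set(names) - KNOWN_BENCHMARKS)
    if unknown:
        raise ValueError(f"Unknown benchmark(s): {', '.join(unknown)}")
    return names
```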

NOTE: Evaluations on Ego-in-Exo PerceptionMCQ and ADL-X require Llama 3.1. You will need to install Ollama and download the Llama 3.1 model by running the command ollama run llama3.1 prior to running the evaluations.

  • If you are using an HPC environment and cannot install Ollama, you will need to run an Ollama server locally
    • To do this, download the Ollama server binary that matches your system architecture from their releases page, then update and uncomment lines 59-62 in scripts/eval/eval_video.sh

🏃🤖 Run the simulated robotics evaluations

We provide code to evaluate VisCoP models on simulated robotics tasks (VIMA-Bench) in robotics_evaluation, adapted from LLaRA.

  1. Install the VIMA-Bench dependencies from their official repository
  2. Run the following command
torchrun --nproc_per_node=<NUM_GPUS> ./robotics_evaluation/eval-viscop.py <OUTPUT NAME> --model-path /path/to/trained/viscop_model --output-path ./robotics_evaluation/results/ --prompt-mode hso
  3. After the evaluation is complete, run the cells in robotics_evaluation/viscop-results.ipynb to see the final success rates across each VIMA-Bench subset

🛠️ Building on VisCoP

If you would like to build on VisCoP, the following source files may be most helpful:

  • viscop/model/viscop_vision_encoder/modeling_viscop_vision_encoder.py contains the implementations of visual probes and interaction modules
  • viscop/model/viscop_arch.py contains the vision-language connector for the visual features and visual probes
  • viscop/model/viscop_qwen2.py contains the code where visual probes are passed to the LLM
  • viscop/model/processor.py contains the code to add placeholders for visual features and visual probes to the language instruction
  • viscop/train_viscop.py contains the training logic for VisCoP

Please consider citing VisCoP if it is helpful for your project!

@article{viscop2025,
  title={VisCoP: Visual Probing for Video Domain Adaptation of Vision Language Models}, 
  author={Dominick Reilly and Manish Kumar Govind and Le Xue and Srijan Das},
  journal={arXiv Preprint},
  year={2025}
}

🙏 Acknowledgements

We thank the researchers behind the following codebases and model releases for their great open source work which VisCoP builds upon! VideoLLaMA3, LLaVA-OneVision, Qwen2.5-VL, SigLIP, and Qwen2.5.
