[Demo video: ego_depth_robotics_viscop.mp4]
- Create a conda environment

```bash
conda create --name=viscop python=3.10
conda activate viscop
```

- Clone VisCoP and install the required Python packages (we use torch 2.4.0 + CUDA 12.4 in our experiments)

```bash
git clone https://github.com/dominickrei/VisCoP.git
cd VisCoP
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
```
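Optionally, verify that your installed versions match the setup above (a quick sanity check, not part of the official instructions):

```python
# Optional check that torch / CUDA match the versions used in our experiments
import torch

print(torch.__version__)          # expect 2.4.0
print(torch.version.cuda)         # expect 12.4
print(torch.cuda.is_available())  # should be True on a GPU machine
```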
First, download the pre-trained VisCoP models through HuggingFace:

```bash
hf download dreilly/viscop-models --local-dir ./viscop_trained_models
```
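If you prefer Python over the CLI, the same download can be done with `huggingface_hub` (an equivalent sketch):

```python
# Equivalent download using the huggingface_hub Python API
from huggingface_hub import snapshot_download

snapshot_download(repo_id="dreilly/viscop-models", local_dir="./viscop_trained_models")
```

Then run inference: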
```python
from viscop import model_init, mm_infer
from viscop.mm_utils import load_video

## Load model
model_path = './viscop_trained_models/viscop_qwen2.5_7b_viscop-lora_egocentric-expert'
model, processor = model_init(
    model_path=model_path,
    device_map={"": "cuda"}
)

## Load video
video_path = './assets/ego_cut_carrot.mp4'
frames, timestamps = load_video(video_path, fps=1, max_frames=180)

## Create conversation
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "video", "timestamps": timestamps, "num_frames": len(frames)},
            {"type": "text", "text": "What vegetable is the person cutting in the video?"},
        ]
    }
]

## Perform inference
inputs = processor(
    images=[frames],
    text=conversation,
    merge_size=2,
    return_tensors="pt",
)
prediction = mm_infer(
    inputs,
    model=model,
    tokenizer=processor.tokenizer,
    do_sample=False,
    modal='video'
)
print(prediction)
```

We provide a Gradio interface to interact with VisCoP models. After downloading the models following the instructions above, run the following command to launch the demo:
```bash
python inference/launch_gradio_demo.py --model-path ./viscop_trained_models/viscop_qwen2.5_7b_viscop-lora_egocentric-expert
```
Note: if you are running the demo on a remote machine, be sure to forward the client port (9999 by default) so you can access the web interface from your local machine. For example: `ssh -L 9999:localhost:9999 user@host`
We provide the instruction pairs as well as videos for training through HuggingFace. After downloading the data, update the following variables in `scripts/train/ego_depth_video/train_viscop.sh`:

- `DATA_DIR`: Update with the path to either egocentric or depth videos
- `TRAINING_JSON`: Update with the path to either egocentric or depth instructions
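For example (the paths below are placeholders; point them at wherever you downloaded the data):

```bash
# Hypothetical paths — replace with your own download locations
DATA_DIR=/data/viscop/egocentric_videos
TRAINING_JSON=/data/viscop/egocentric_instructions.json
```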
First, download and extract the VIMA data through HuggingFace. Next, generate the training instruction pairs using the conversion script provided by LLaRA. We use the `D-inBC-text-multi-train-8k-front` instructions for all of our simulated robot control experiments.
- After extracting the VIMA data and generating the instructions, update `DATA_DIR` and `TRAINING_JSON` in `scripts/train/robotic_control/train_viscop.sh`
Coming soon!
In `scripts/train/ego_depth_video/train_viscop.sh` and `scripts/train/robotic_control/train_viscop.sh`, update the following arguments to match your system settings and paths:

- `INIT_MODEL`: The path to the weights of the base VLM (VideoLLaMA3). Please use the following command to download and save the weights: `python scripts/save_basevlm_for_finetuning.py --save-path-for-local-basevlm /path/to/save/base_vlm`
- `DATA_DIR`: The path to your data directory containing the egocentric, depth, or robot control data
- `TRAINING_JSON`: The path to a JSON file containing the egocentric, depth, or robot control instructions
- (Optional) `NUM_VISUAL_PROBES`: The number of visual probes to use in VisCoP
- (Optional) `INTERACTION_MODULE_POS`: The positions of the interaction modules. Acceptable values are `all` or a comma-separated list of integers (denoting zero-indexed layer indices of the vision encoder)
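A filled-in example (all values below are illustrative placeholders, not recommended settings):

```bash
# Illustrative values for the variables in train_viscop.sh — adjust for your system
INIT_MODEL=/path/to/save/base_vlm       # produced by save_basevlm_for_finetuning.py
DATA_DIR=/data/viscop/egocentric_videos
TRAINING_JSON=/data/viscop/egocentric_instructions.json
NUM_VISUAL_PROBES=16                    # optional; placeholder value
INTERACTION_MODULE_POS=all              # optional; or e.g. "0,5,11,17"
```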
(Single node training) After updating the training scripts, initiate the training with the following command:

```bash
bash scripts/train/ego_depth_video/train_viscop.sh 1 <NUM_GPUS>
```

(Multi-node training with SLURM) After updating the training scripts, update the arguments in `ex_multi_node_slurm_job.sh` and submit the job:

```bash
sbatch ex_multi_node_slurm_job.sh
```

For each target domain, VisCoP is evaluated on the following target- and source-domain data:

| Target domain | Target Domain Data | Source Domain Data |
|---|---|---|
| Egocentric Viewpoint | Ego-in-Exo PerceptionMCQ, EgoSchema | NeXTQA, VideoMME, ADL-X |
| Depth Modality | Ego-in-Exo PerceptionMCQ (Exo Depth) (contained in `depth_videos.zip`) | Ego-in-Exo PerceptionMCQ (Exo RGB), NeXTQA, VideoMME, ADL-X |
| Robot Control | VIMA-Bench | Ego-in-Exo PerceptionMCQ (Exo RGB), NeXTQA, VideoMME, ADL-X |
Our evaluation directory structure:
```
/path/to/vlm_eval_bench/
├── adlx
│   ├── Charades-AR.json
│   ├── Charades-Description.json
│   ├── LEMMA-TC.json
│   ├── Smarthome-AR.json
│   ├── TSU-Description.json
│   ├── TSU-TC.json
│   └── videos
│       ├── ADLMCQ-TC-TSU
│       ├── Charades_v1_480
│       ├── lemma_cropped
│       └── SH_cropped224x224_better
├── egoperceptionmcq
│   ├── all_category_qas.json
│   ├── keystep_segments
│   └── depth_videos
├── egoschema
│   ├── GENERATION
│   ├── MC
│   ├── MC_PPL
│   ├── questions.json
│   ├── Subset
│   ├── subset_answers.json
│   └── videos
├── nextqa
│   ├── NExTVideo
│   └── test.csv
└── videomme
    ├── subtitle
    ├── test-00000-of-00001.parquet
    └── videos
```
After downloading the data, update the following variable in `scripts/eval/eval_video.sh`:

- `DATA_ROOT`: Update with the path to your evaluation directory; ensure it follows the same structure as shown above
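For example:

```bash
# Placeholder — set to your own evaluation directory
DATA_ROOT=/path/to/vlm_eval_bench
```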
After updating the evaluation script, run the following command:
```bash
bash scripts/eval/eval_video.sh /path/to/trained/viscop/model <BENCHMARKS> 1 <NUM_GPUS>
```
e.g., `bash scripts/eval/eval_video.sh /path/to/trained/viscop/model egoperceptionmcq,egoschema,nextqa,videomme,adlx_mcq 1 8`

NOTE: Evaluations on Ego-in-Exo PerceptionMCQ and ADL-X require Llama 3.1. You will need to install Ollama and download the Llama 3.1 model by running `ollama run llama3.1` prior to running the evaluations.
- If you are using an HPC environment and cannot install Ollama, you will need to run an Ollama server locally
  - To do this, download the Ollama server that matches your system architecture from their releases page. Then update and uncomment lines 59-62 in `scripts/eval/eval_video.sh`
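If you go this route, a minimal sketch of running the standalone server looks like the following (assuming the binary downloaded from the releases page; Ollama listens on localhost:11434 by default):

```bash
# Run a local Ollama server without a system-wide install (sketch)
./ollama serve &          # start the server in the background
./ollama pull llama3.1    # download the Llama 3.1 model
./ollama run llama3.1     # optional interactive check
```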
We provide code to evaluate VisCoP models on simulated robotics tasks (VIMA-Bench) in `robotics_evaluation`, adapted from LLaRA.

- Install the VIMA-Bench dependencies from their official repository
- Run the following command:

```bash
torchrun --nproc_per_node=<NUM_GPUS> ./robotics_evaluation/eval-viscop.py <OUTPUT NAME> --model-path /path/to/trained/viscop_model --output-path ./robotics_evaluation/results/ --prompt-mode hso
```

- After the evaluation is complete, run the cells in `robotics_evaluation/viscop-results.ipynb` to see the final success rates across each VIMA-Bench subset
If you would like to build on VisCoP, the following files are the most useful starting points:

- `viscop/model/viscop_vision_encoder/modeling_viscop_vision_encoder.py` contains the implementations of visual probes and interaction modules
- `viscop/model/viscop_arch.py` contains the vision-language connector for the visual features and visual probes
- `viscop/model/viscop_qwen2.py` contains the code where visual probes are passed to the LLM
- `viscop/model/processor.py` contains the code that adds placeholders for visual features and visual probes to the language instruction
- `viscop/train_viscop.py` contains the training logic for VisCoP
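To illustrate the core idea only (this is not VisCoP's actual implementation; see the files above for that), a visual probe can be thought of as a set of learnable query tokens, and an interaction module as a cross-attention block that lets those probes read from the vision encoder's features at selected layers. A minimal PyTorch sketch, with all names and dimensions hypothetical:

```python
import torch
import torch.nn as nn

class InteractionModule(nn.Module):
    """Sketch: visual probes cross-attend to vision-encoder features."""
    def __init__(self, dim=1152, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, probes, vision_feats):
        # probes: (B, num_probes, dim); vision_feats: (B, num_patches, dim)
        updated, _ = self.attn(query=probes, key=vision_feats, value=vision_feats)
        return self.norm(probes + updated)  # residual update of the probes

# Hypothetical usage: probes are learnable parameters, expanded per batch
num_visual_probes, dim = 16, 1152
probes = nn.Parameter(torch.randn(1, num_visual_probes, dim) * 0.02)
module = InteractionModule(dim)
vision_feats = torch.randn(2, 196, dim)  # dummy patch features for 2 samples
out = module(probes.expand(2, -1, -1), vision_feats)
print(out.shape)  # torch.Size([2, 16, 1152])
```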
Please consider citing VisCoP if it is helpful for your project!
```bibtex
@article{viscop2025,
  title={VisCoP: Visual Probing for Video Domain Adaptation of Vision Language Models},
  author={Dominick Reilly and Manish Kumar Govind and Le Xue and Srijan Das},
  journal={arXiv Preprint},
  year={2025}
}
```

We thank the researchers behind the following codebases and model releases for their great open-source work, which VisCoP builds upon: VideoLLaMA3, LLaVA-OneVision, Qwen2.5-VL, SigLIP, and Qwen2.5.
