Although some weights can be downloaded dynamically at runtime, we recommend pre-downloading them to speed up each run.
```shell
wget https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/eva_vit_g.pth
```
The path of the image encoder weights can be modified here.
```shell
# InstructBLIP (recommended)
wget https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/InstructBLIP/instruct_blip_vicuna7b_trimmed.pth

# MiniGPT4
wget https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/blip2_pretrained_flant5xxl.pth
wget https://huggingface.co/Vision-CAIR/MiniGPT-4/resolve/main/pretrained_minigpt4.pth
```
The paths of the Q-Former and linear weights can be modified via `q_former_model` and `ckpt` in each config here.
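Before launching a run, it can help to confirm the pre-downloaded weights are actually in place, so that nothing falls back to a slow dynamic download. A minimal Python sketch, assuming the files were saved under a `model_weights/` folder (the folder name and file list below are illustrative, not fixed by the repo):

```python
from pathlib import Path

# Hypothetical locations -- adjust to wherever you saved the wget downloads
# and to the paths referenced in your config files.
EXPECTED_WEIGHTS = [
    "model_weights/eva_vit_g.pth",
    "model_weights/instruct_blip_vicuna7b_trimmed.pth",  # InstructBLIP
]

def missing_weights(paths):
    """Return the subset of weight files that have not been downloaded yet."""
    return [p for p in paths if not Path(p).is_file()]

if __name__ == "__main__":
    missing = missing_weights(EXPECTED_WEIGHTS)
    if missing:
        print("Missing weight files:", ", ".join(missing))
    else:
        print("All weights found; runs will skip dynamic downloads.")
```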
- **From Vriptor-STLLM pretrained weights (inference with Vriptor-STLLM):** Please download the weights from Vriptor-STLLM.
- **From ST-LLM pretrained weights (training Vriptor-STLLM):** Please download the weights from ST-LLM.
- **From Vicuna weights (training from scratch):** Please first follow the instructions to prepare Vicuna v1.1 (for InstructBLIP) or Vicuna v1.0 (for MiniGPT4). Then modify `llama_model` in each config here to the folder that contains the Vicuna weights.
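When training from scratch, each config's `llama_model` entry has to point at your local Vicuna folder. A small sketch of that edit as a plain-text substitution (the config and Vicuna paths below are placeholders; the configs ship with their own defaults):

```python
import re
from pathlib import Path

def set_llama_model(config_path, vicuna_dir):
    """Rewrite the `llama_model:` line of a YAML config to point at vicuna_dir.

    A plain-text substitution, so no YAML library is needed; it assumes the
    config contains a line of the form `llama_model: <path>`.
    """
    text = Path(config_path).read_text()
    new_text = re.sub(r"(?m)^(\s*llama_model:\s*).*$",
                      lambda m: m.group(1) + vicuna_dir, text)
    Path(config_path).write_text(new_text)

# Hypothetical usage:
# set_llama_model("config/vriptor_stllm_stage1.yaml", "model_weights/vicuna-7b-v1.1")
```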
You can try it out using the videos in this folder.
- Download the Vript dataset from Huggingface or ModelScope.
- Extract the zip files in the `vript_long_videos_clips` and `vript_short_videos_clips` folders, and put the extracted videos into the `training_data/videos` folder. Extract the captions in the `vript_captions` folder and put them into the `training_data/vript_captions` folder.
- Run the following scripts to generate the training data. You can run the commands directly to try them out using the provided data in the `training_data` folder.
```shell
### Stage 1 ###
# Generate the training data for Vriptor-STLLM training stage 1 (Whole Video)
python build_vript_training_data/build_training_vript_stage1_single.py \
    --video_folder training_data/videos \
    --caption_dir training_data/vript_captions \
    --output_dir training_data/vriptor

# Generate the training data for Vriptor-STLLM training stage 1 (Multiple Scenes)
python build_vript_training_data/build_training_vript_stage1_concat.py \
    --video_folder training_data/videos \
    --caption_dir training_data/vript_captions \
    --output_dir training_data/vriptor

### Or Stage 2 ###
# Generate the training data for Vriptor-STLLM training stage 2 (Whole Video)
python build_vript_training_data/build_training_vript_stage2_single.py \
    --video_folder training_data/videos \
    --caption_dir training_data/vript_captions \
    --output_dir training_data/vriptor

# Generate the training data for Vriptor-STLLM training stage 2 (Multiple Scenes)
python build_vript_training_data/build_training_vript_stage2_concat.py \
    --video_folder training_data/videos \
    --caption_dir training_data/vript_captions \
    --output_dir training_data/vriptor
```
```shell
conda create -n vriptor python=3.8
conda activate vriptor
pip install -r requirements.txt
```
```shell
python demo.py \
    --video-path video_examples/emoji.mp4 \
    --cfg-path config/vriptor_stllm_stage2.yaml \
    --gpu-id 0 \
    --ckpt-path model_weights/vriptor_stllm_stage2
```
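To caption several clips in a row, the same demo command can be wrapped in a small driver. A sketch that builds one `demo.py` invocation per `.mp4` in `video_examples/` (the batching itself is our addition; the flags simply mirror the single-video command above):

```python
import subprocess
from pathlib import Path

def demo_commands(video_dir="video_examples",
                  cfg="config/vriptor_stllm_stage2.yaml",
                  ckpt="model_weights/vriptor_stllm_stage2"):
    """Build one demo.py command per .mp4 found in video_dir."""
    cmds = []
    for video in sorted(Path(video_dir).glob("*.mp4")):
        cmds.append(["python", "demo.py",
                     "--video-path", str(video),
                     "--cfg-path", cfg,
                     "--gpu-id", "0",
                     "--ckpt-path", ckpt])
    return cmds

if __name__ == "__main__":
    for cmd in demo_commands():
        subprocess.run(cmd, check=True)
```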
```shell
torchrun --nproc_per_node 8 train_hf.py \
    --cfg-path config/vriptor_stllm_stage1.yaml
    # or config/vriptor_stllm_stage2.yaml
```