This repository contains the code for our paper: Understanding Co-speech Gestures in-the-wild.
Authors: Sindhu Hegde, K R Prajwal, Taein Kwon, Andrew Zisserman
| 📝 Paper | 📑 Project Page | 📦 AVS-Spot Dataset | 🛠 Demo |
|---|---|---|---|
| Paper | Website | Dataset | Coming soon |
We present JEGAL, a Joint Embedding space for Gestures, Audio and Language. Our semantic gesture representations can be used to perform multiple downstream tasks such as cross-modal retrieval, spotting gestured words, and identifying who is speaking solely using gestures.
- [2025.08.20] 🔥 Inference code released - It is now possible to extract the gesture embeddings for any real-world video
- [2025.08.20] 🧬 JEGAL pre-trained checkpoints released
- [2025.07.24] 🏆 JEGAL is accepted as an ORAL paper at ICCV 2025!!!
- [2025.03.31] 📋 Paper released on arXiv
- [2025.03.29] 🤗 Our new gesture-spotting dataset: AVS-Spot is released!
We present three new evaluation datasets for the three tasks:
- Gesture word spotting: AVS-Spot dataset
- Cross-modal retrieval: AVS-Ret dataset
- Active speaker detection: AVS-Asd dataset
Refer to the dataset section for details on downloading and pre-processing these datasets.
Clone the repository
git clone https://github.com/Sindhu-Hegde/jegal.git
Install the required packages (it is recommended to create a new environment)
python -m venv env_jegal
source env_jegal/bin/activate
pip install -r requirements.txt
FFmpeg (tested with version 4.4.2) is also needed; install it if not already present using: sudo apt-get install ffmpeg
Note: The code has been tested with Python 3.12.7
Download the trained models and save them in the checkpoints folder
mkdir checkpoints
cd checkpoints
#### Visual Feature Extractor (GestSync)
wget https://www.robots.ox.ac.uk/~vgg/research/jegal/checkpoints/gestsync.pth
#### Gesture model (JEGAL)
wget https://www.robots.ox.ac.uk/~vgg/research/jegal/checkpoints/jegal.pth
Alternatively, the links are also displayed in the table below:
| Model | Download Link |
|---|---|
| Visual Feature Extractor (GestSync) | Link |
| Gesture model (JEGAL) | Link |
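After downloading, it is worth verifying that both files deserialize correctly before running inference. The snippet below is only a minimal sanity check, assuming the checkpoints are standard PyTorch `.pth` files; the exact contents (plain state dict vs. a wrapper dict, key names) may differ in practice.

```python
import torch

# Minimal sanity check: confirm both downloaded checkpoints can be loaded.
# Assumption: these are standard PyTorch .pth files; the structure of the
# loaded object (plain state_dict vs. a wrapper dict) is not specified here.
for path in ["checkpoints/gestsync.pth", "checkpoints/jegal.pth"]:
    ckpt = torch.load(path, map_location="cpu")
    if isinstance(ckpt, dict):
        print(f"{path}: loaded OK, top-level keys: {list(ckpt.keys())[:5]}")
    else:
        print(f"{path}: loaded OK, object type: {type(ckpt).__name__}")
```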
The first step is to preprocess the video and obtain gesture crops. Run the following command to pre-process the video:
cd preprocess
python inference_preprocess.py --video_file <path-to-video-file>
cd ..
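If several videos need to be processed, the same script can simply be called in a loop. Below is a rough sketch of such a batch run; the `my_videos` folder and the `.mp4` extension are placeholders, and only the `--video_file` and `--preprocessed_root` flags described in this section are used.

```python
import glob
import os
import subprocess

# Hypothetical batch pre-processing: invoke inference_preprocess.py once per video.
# Run this from the repository root; paths are made absolute because the script
# is executed from inside the preprocess/ folder (as in the commands above).
for video in sorted(glob.glob("my_videos/*.mp4")):
    subprocess.run(
        [
            "python", "inference_preprocess.py",
            "--video_file", os.path.abspath(video),
            "--preprocessed_root", os.path.abspath("results"),
        ],
        cwd="preprocess",
        check=True,
    )
```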
The processed gesture tracks (video and audio files) are saved in: <results/video_file/preprocessed/*.avi>. The default save directory is results; this can be changed by specifying --preprocessed_root in the above command. Once the gesture tracks are extracted, the script below can be used to extract gesture and/or content embeddings.
python inference_embs.py \
--checkpoint_path_gestsync <gestsync-checkpoint-path> \
--checkpoint_path_jegal <jegal-checkpoint-path> \
--modalities <vta/vt/va/ta/v/t/a> \
--video_path <path-to-preprocessed-video-file> \
--audio_path <path-to-preprocessed-audio-file> \
--text_path <path-to-text-file-with-word-boundaries> \
--res_dir <folder-path-to-save-the-extracted-embeddings>

By default, content embeddings are extracted using both text and audio modalities, which provides the best performance. However, JEGAL also supports missing modalities at inference time. You can control this using the --modalities flag and by providing the corresponding input paths.
- All three modalities (visual + text + audio)
  - Produces gesture embeddings and content embeddings
  - Use: `--modalities vta`
  - Requires: `--video_path`, `--audio_path`, and `--text_path`
- Two-modality combinations
  - Use: `--modalities vt` (visual + text-only), `--modalities va` (visual + audio-only), or `--modalities ta` (combines text and audio for content)
  - Requires the corresponding input paths
- Single-modality embeddings
  - Use: `--modalities v`, `t`, or `a`
  - Extracts embeddings for visual, text, or audio individually
The following pre-processed examples are available in the `samples` folder for a quick test:
- `sample1.avi`, `sample1.wav`, `sample1.txt`
- `sample2.avi`, `sample2.wav`, `sample2.txt`

Note: Step-1 should be skipped for these examples, since they are already pre-processed.
Example run:
python inference_embs.py \
--checkpoint_path_gestsync checkpoints/gestsync.pth \
--checkpoint_path_jegal checkpoints/jegal.pth \
--modalities vta \
--video_path samples/sample1.avi \
--audio_path samples/sample1.wav \
--text_path samples/sample1.txt \
--res_dir results/sample1

On running the above command, the extracted JEGAL embeddings are saved in:
results/sample1/sample1.pkl
The .pkl file contains:
- `gesture_emb`: `numpy array` of shape `(T, 512)`, where `T` = number of video frames; `None` if `v` is not included in `--modalities`
- `content_emb`: `numpy array` of shape `(W, 512)`, where `W` = number of words (from text/audio); `None` if `t` or `a` is not included in `--modalities`
- `info`: `dict` with keys `{fname, word_boundaries}`; `fname` is the input file name and `word_boundaries` contains the timestamped boundaries for each word (`None` if `t` or `a` is not included in `--modalities`)
Once the gesture and content embeddings are extracted, they can be used to plot a similarity heatmap, which helps in understanding which words have been gestured in the video. Run the command below to save the heatmap:
python utils/plot_heatmap.py --path <path-to-the-JEGAL-pkl-file> --fname <file-name-of-the-result-heatmap>
Example run:
python utils/plot_heatmap.py --path results/sample1/sample1.pkl --fname heatmap_sample1
For the examples provided, the following heatmaps are obtained:
(Heatmap images: Sample-1 | Sample-2)
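If you want to work with the raw scores rather than the saved figure, such a heatmap can be approximated as a cosine-similarity matrix between the per-word content embeddings and the per-frame gesture embeddings. The sketch below illustrates that idea; it mirrors what the plotting utility visualizes but is not its exact implementation, and the structure of the `word_boundaries` entries is assumed rather than specified.

```python
import pickle
import numpy as np

# Build a (W, T) word-by-frame cosine-similarity matrix from the saved embeddings.
with open("results/sample1/sample1.pkl", "rb") as f:
    data = pickle.load(f)

gesture = data["gesture_emb"]                 # (T, 512): one embedding per video frame
content = data["content_emb"]                 # (W, 512): one embedding per word
boundaries = data["info"]["word_boundaries"]  # one timestamped entry per word

# L2-normalise so that dot products become cosine similarities.
gesture = gesture / np.linalg.norm(gesture, axis=-1, keepdims=True)
content = content / np.linalg.norm(content, axis=-1, keepdims=True)
sim = content @ gesture.T                     # sim[w, t]: word w vs. frame t

# A rough spotting signal: rank words by their peak similarity across frames.
peaks = sim.max(axis=1)
for idx in np.argsort(-peaks)[:5]:
    print(boundaries[idx], float(peaks[idx]))
```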
The training scripts, along with detailed instructions to fine-tune on custom datasets, will be available soon. Until then, stay tuned and watch the repository for updates!
If you find this work useful for your research, please consider citing our paper:
@inproceedings{hegde2025jegal,
  title={Understanding Co-speech Gestures in-the-wild},
  author={Sindhu B Hegde and K R Prajwal and Taein Kwon and Andrew Zisserman},
  year={2025},
  booktitle={arXiv}
}

