Clone the repository and install dependencies:

```bash
git clone https://github.com/sophicle/sensory.git
cd sensory
conda env create -f environment.yml
conda activate sensory
```
### AudioCaps 2.0

An audio–caption corpus built on top of the balanced subset of AudioSet (Gemmeke et al., 2017). The original release comprises ten-second YouTube clips. We use the 975 test clips, with one instance of each audio clip.
- Download: Request access from the AudioCaps website.
- Usage:
  - Place `audiocaps_raw_audio`, `test.csv`, `train.csv`, and `val.csv` in `downloads/audiocaps_data`.
  - We use the `test.csv` set. Shard it using `src/datasets/shard_audiocaps.py` (see the sketch below).
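The actual sharding logic lives in `src/datasets/shard_audiocaps.py`. As a rough illustration of splitting a caption CSV into fixed-size shards (the output directory and shard size below are hypothetical, not the script's real defaults):

```python
import csv
from pathlib import Path

def shard_csv(csv_path: str, out_dir: str, shard_size: int = 100) -> None:
    """Split a caption CSV into fixed-size shard files (illustrative only)."""
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for i in range(0, len(rows), shard_size):
        shard = rows[i : i + shard_size]
        with open(out / f"shard_{i // shard_size:04d}.csv", "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=shard[0].keys())
            writer.writeheader()
            writer.writerows(shard)

# Example (paths are assumptions based on the layout above):
# shard_csv("downloads/audiocaps_data/test.csv", "downloads/audiocaps_data/shards")
```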
### BEATs Pre-trained Model
BEATs is a self-supervised audio transformer model developed by Microsoft.
- Download: Download `Tokenizer_iter3` and `BEATs_iter3` from the BEATs GitHub repository or the Hugging Face Hub, if available.
- Usage:
  - Place the pre-trained model files in `downloads/beats_models/`.
  - Clone the BEATs source code:

    ```bash
    git clone https://github.com/microsoft/unilm.git
    ```

  - Add the BEATs directory to your `PYTHONPATH` so it can be imported in the embedding script:

    ```bash
    export PYTHONPATH=$PYTHONPATH:/path/to/unilm/beats
    ```

    (Replace `/path/to/unilm/beats` with the actual path to the cloned BEATs repository.) This makes the `BEATs` module available for import in the embedding script; a loading sketch follows this list.
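With the checkpoint in `downloads/beats_models/` and `unilm/beats` on your `PYTHONPATH`, loading follows the pattern from the BEATs README (the checkpoint filename below is an assumption; use the file you downloaded):

```python
import torch
from BEATs import BEATs, BEATsConfig  # resolved via the PYTHONPATH entry above

# Load the pre-trained checkpoint (filename is an assumption).
checkpoint = torch.load("downloads/beats_models/BEATs_iter3.pt")

cfg = BEATsConfig(checkpoint["cfg"])
model = BEATs(cfg)
model.load_state_dict(checkpoint["model"])
model.eval()
```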
Some models require authentication via the Hugging Face Hub. To enable this, set your Hugging Face token as an environment variable:

```bash
export HF_TOKEN="your_token_here"
```
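Recent versions of `huggingface_hub` pick up `HF_TOKEN` from the environment automatically; if you prefer to authenticate explicitly in Python, a minimal sketch:

```python
import os
from huggingface_hub import login

# Log in with the token exported above.
login(token=os.environ["HF_TOKEN"])
```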
Run the embedding pipeline with `run_embed.py`:

```bash
# WIT dataset, Qwen3-32B, 128 tokens, imagine prompts
python run_embed.py --model Qwen/Qwen3-32B --dataset wit --tokensize 128 --prompt imagine

# AudioCaps dataset, Qwen3-14B, embedding only (no generation)
python run_embed.py --model Qwen/Qwen3-14B --dataset audiocaps
```

Outputs are saved under:
```
runs/<dataset>_<prompt>/<model_tokensX_promptKey>/
```

Examples:

```
runs/wit_imagine/qwen3_32b_tokens128_hear_prompt/
runs/wit_imagine/qwen3_32b_tokens128_see_prompt/
runs/audiocaps_caption/qwen3_14b_tokens0_caption/
```
To embed raw audio or images with the sensory encoders:

```bash
# BEATs on AudioCaps
python run_embed.py --model beats/BEATs_iter3 --dataset audiocaps

# DINOv2 on WIT
python run_embed.py --model facebook/dinov2-base --dataset wit
```

Outputs are saved under:
```
runs/<dataset>_sensory_encoders/<modelNameAfterSlash>/
```

Examples:

```
runs/audiocaps_sensory_encoders/BEATs_iter3/
runs/wit_sensory_encoders/dinov2-base/
```
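These embeddings are the inputs to `run_align.py` (see the repository layout below), which compares LLM and sensory-encoder representations. Purely as an illustration of one common alignment measure, here is a mutual k-nearest-neighbor score between two paired embedding sets; the actual metric and options are defined in `run_align.py`:

```python
import torch
import torch.nn.functional as F

def mutual_knn_alignment(a: torch.Tensor, b: torch.Tensor, k: int = 10) -> float:
    """Average overlap of k-NN sets between paired embeddings (illustrative)."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    sim_a, sim_b = a @ a.T, b @ b.T
    sim_a.fill_diagonal_(-torch.inf)  # exclude self-matches
    sim_b.fill_diagonal_(-torch.inf)
    knn_a = sim_a.topk(k, dim=-1).indices
    knn_b = sim_b.topk(k, dim=-1).indices
    overlaps = [
        len(set(knn_a[i].tolist()) & set(knn_b[i].tolist())) / k
        for i in range(a.shape[0])
    ]
    return sum(overlaps) / len(overlaps)
```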
### Language Models (LLMs)

- `Qwen/Qwen3-0.6B`
- `Qwen/Qwen3-1.7B`
- `Qwen/Qwen3-4B`
- `Qwen/Qwen3-14B`
- `Qwen/Qwen3-32B`

### Vision encoders

- `facebook/dinov2-base`

### Audio encoders

- `BEATs/BEATs_iter3`
- `BEATs/BEATs_iter3_plus_AS2M`

### Image–text datasets

- `wit`

### Audio–text datasets

- `audiocaps`
To extend support (a sketch of both steps follows the list):

- Add a model
  - Implement or reuse an embed function in `src/models/embedders.py`.
  - Add the model name and embed function to the correct map in `run_embed.py`.
- Add a dataset
  - Add a loader in `src/datasets/loaders.py`.
  - Register it in either `IMAGE_TEXT_DATASETS` or `AUDIO_TEXT_DATASETS`.
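A hedged sketch of the model step, assuming a Hugging Face text model; the embed function body and the registry name `MODEL_EMBEDDERS` are illustrative assumptions (check `run_embed.py` for the real map), while `IMAGE_TEXT_DATASETS` / `AUDIO_TEXT_DATASETS` come from the README itself:

```python
# src/models/embedders.py (sketch; real signatures may differ)
import torch
from transformers import AutoModel, AutoTokenizer

def embed_my_model(texts: list[str], name: str = "my-org/my-model") -> torch.Tensor:
    """Mean-pool last hidden states into one vector per text (illustrative)."""
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name)
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (batch, seq, dim)
    mask = batch["attention_mask"].unsqueeze(-1).float()
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# run_embed.py (sketch): MODEL_EMBEDDERS is a hypothetical registry name.
MODEL_EMBEDDERS = {"my-org/my-model": embed_my_model}
```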
Example `runs/` layout:

```
runs/
├── wit_imagine/
│   ├── qwen3_32b_tokens128_hear_prompt/
│   ├── qwen3_32b_tokens128_see_prompt/
│   └── qwen3_32b_tokens128_plain_prompt/
├── audiocaps_caption/
│   └── qwen3_14b_tokens0_caption/
├── audiocaps_sensory_encoders/
│   ├── BEATs_iter3/
│   └── BEATs_iter3_plus_AS2M/
└── wit_sensory_encoders/
    └── dinov2-base/
```
This project is licensed under the MIT License. See LICENSE for details.
If you use this codebase in your research, please cite:

```bibtex
@article{wang2025words,
  title={Words That Make Language Models Perceive},
  author={Wang, Sophie L. and Isola, Phillip and Cheung, Brian},
  journal={arXiv preprint arXiv:2510.02425},
  year={2025},
  url={https://arxiv.org/abs/2510.02425}
}
```
Repository layout:

```
sensory_prompting/
├── downloads
│   ├── audiocaps_raw_audio
│   └── beats_models
├── environment.yml
├── README.md
├── run_align.py
├── run_embed.py
├── runs
│   └── imagine_embeddings
└── src
    ├── datasets
    ├── __init__.py
    ├── models
    └── utils
```