
Words That Make Language Models Perceive

Paper · Project Page

Sophie L. Wang · Phillip Isola · Brian Cheung

Installation

Clone the repository and install dependencies:

git clone https://github.com/sophicle/sensory.git
cd sensory
conda env create -f environment.yml
conda activate sensory
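
After activating the environment, a quick check (assuming PyTorch is among the dependencies in environment.yml) confirms the install worked and a GPU is visible:

import torch
print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())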

Setup

  1. AudioCaps2.0

    An audio–caption corpus built on top of the balanced subset of AudioSet (Gemmeke et al., 2017). The original release comprises ten-second YouTube clips. We use the 975 test clips, with one instance of each audio clip.

    • Download: Request access from the AudioCaps website.
    • Usage:
      • Place audiocaps_raw_audio, test.csv, train.csv, val.csv in downloads/audiocaps_data
      • We use the test.csv split. Shard it with src/datasets/shard_audiocaps.py.
  2. BEATs Pre-trained Model

    BEATs is a self-supervised audio transformer model developed by Microsoft.

    • Download: Download Tokenizer_iter3 and BEATs_iter3 from the BEATs GitHub repository or the Hugging Face Hub if available.
    • Usage:
      • Place the pre-trained model files in downloads/beats_models/.
      • Clone the BEATs source code:
        git clone https://github.com/microsoft/unilm.git
      • Add the BEATs directory to your PYTHONPATH so the module can be imported by the embedding script:
        export PYTHONPATH=$PYTHONPATH:/path/to/unilm/beats
        (Replace /path/to/unilm/beats with the actual path to the cloned BEATs repository. A quick load check is sketched just after this list.)
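
As a quick sanity check of the BEATs setup above, the following Python sketch loads a checkpoint the way the upstream BEATs code does. The checkpoint path and filename are assumptions based on the layout described above; adjust them to the files you actually downloaded.

import torch
from BEATs import BEATs, BEATsConfig  # requires /path/to/unilm/beats on PYTHONPATH

# Path is an assumption; point it at the checkpoint placed in downloads/beats_models/.
checkpoint = torch.load("downloads/beats_models/BEATs_iter3.pt", map_location="cpu")
cfg = BEATsConfig(checkpoint["cfg"])
model = BEATs(cfg)
model.load_state_dict(checkpoint["model"])
model.eval()
print("Loaded BEATs with", sum(p.numel() for p in model.parameters()), "parameters")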

Hugging Face Authentication

Some models require authentication via the Hugging Face Hub. To enable this, set your Hugging Face token as an environment variable:

export HF_TOKEN="your_token_here"
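
Alternatively, you can authenticate programmatically with the huggingface_hub client. This is a minimal sketch that assumes the HF_TOKEN variable above has already been exported:

import os
from huggingface_hub import login

# Registers the token from the environment for subsequent gated-model downloads.
login(token=os.environ["HF_TOKEN"])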

Usage

Run the embedding pipeline with run_embed.py.

Language Model (Qwen3) examples

# WIT dataset, Qwen3-32B, 128 tokens, imagine prompts
python run_embed.py --model Qwen/Qwen3-32B --dataset wit --tokensize 128 --prompt imagine

# AudioCaps dataset, Qwen3-14B, embedding only (no generation)
python run_embed.py --model Qwen/Qwen3-14B --dataset audiocaps

Outputs are saved under:

runs/<dataset>_<prompt>/<model_tokensX_promptKey>/

Examples:

runs/wit_imagine/qwen3_32b_tokens128_hear_prompt/
runs/wit_imagine/qwen3_32b_tokens128_see_prompt/
runs/audiocaps_caption/qwen3_14b_tokens0_caption/

Sensory encoder (BEATs and DINOv2) examples

# BEATs on AudioCaps
python run_embed.py --model beats/BEATs_iter3 --dataset audiocaps

# DINOv2 on WIT
python run_embed.py --model facebook/dinov2-base --dataset wit

Outputs are saved under:

runs/<dataset>_sensory_encoders/<modelNameAfterSlash>/

Examples:

runs/audiocaps_sensory_encoders/BEATs_iter3/
runs/wit_sensory_encoders/dinov2-base/

Supported models

Language Models (LLMs)

  • Qwen/Qwen3-0.6B
  • Qwen/Qwen3-1.7B
  • Qwen/Qwen3-4B
  • Qwen/Qwen3-14B
  • Qwen/Qwen3-32B

Vision encoders

  • facebook/dinov2-base

Audio encoders

  • BEATs/BEATs_iter3
  • BEATs/BEATs_iter3_plus_AS2M

Supported datasets

Image–text datasets

  • wit

Audio–text datasets

  • audiocaps

Adding new models or datasets

To extend support:

  • Add a model

    1. Implement or reuse an embed function in src/models/embedders.py.
    2. Add the model name and embed function to the correct map in run_embed.py (a hypothetical embed-function sketch follows this list).
  • Add a dataset

    1. Add a loader in src/datasets/loaders.py.
    2. Register it in either IMAGE_TEXT_DATASETS or AUDIO_TEXT_DATASETS.
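
The exact signatures in src/models/embedders.py are not documented here, so the sketch below is hypothetical: it only illustrates the general shape of an embed function (a batch of inputs in, one vector per input out) using a Hugging Face vision encoder, not the repository's actual API.

import torch
from transformers import AutoImageProcessor, AutoModel

def embed_my_vision_encoder(images, model_name="facebook/dinov2-base", device="cuda"):
    """Hypothetical embed function: one mean-pooled embedding per input image."""
    processor = AutoImageProcessor.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).to(device).eval()
    inputs = processor(images=images, return_tensors="pt").to(device)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).cpu().numpy()

A new dataset loader would be registered analogously, keyed by the name you pass to --dataset.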

Output structure

runs/
├── wit_imagine/
│   ├── qwen3_32b_tokens128_hear_prompt/
│   ├── qwen3_32b_tokens128_see_prompt/
│   └── qwen3_32b_tokens128_plain_prompt/
├── audiocaps_caption/
│   └── qwen3_14b_tokens0_caption/
├── audiocaps_sensory_encoders/
│   ├── BEATs_iter3/
│   └── BEATs_iter3_plus_AS2M/
└── wit_sensory_encoders/
    └── dinov2-base/
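
The per-file contents of each run directory are not specified above, so this sketch only enumerates the layout shown in the tree; it assumes you run it from the repository root where runs/ lives.

from pathlib import Path

# Print every <group>/<run_name> directory two levels below runs/.
for run_dir in sorted(Path("runs").glob("*/*")):
    if run_dir.is_dir():
        print(run_dir)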

License

This project is licensed under the MIT License. See LICENSE for details.

Citation

If you use this codebase in your research, please cite:

@article{wang2025words,
  title={Words That Make Language Models Perceive},
  author={Wang, Sophie L. and Isola, Phillip and Cheung, Brian},
  journal={arXiv preprint arXiv:2510.02425},
  year={2025},
  url={https://arxiv.org/abs/2510.02425}
}

Repository Structure

sensory_prompting/
├── downloads
│   ├── audiocaps_raw_audio
│   └── beats_models
├── environment.yml
├── README.md
├── run_align.py
├── run_embed.py
├── runs
│   └── imagine_embeddings
└── src
    ├── datasets
    ├── __init__.py
    ├── models
    └── utils
