
Words That Make Language Models Perceive

Paper · Project Page

Sophie L. Wang · Phillip Isola · Brian Cheung

Installation

Clone the repository and install dependencies:

git clone https://github.com/sophicle/sensory.git
cd sensory
conda env create -f environment.yml
conda activate sensory
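
After activating the environment, a quick check (assuming PyTorch is among the dependencies in environment.yml) confirms the install worked and a GPU is visible:

import torch
print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())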

Setup

  1. AudioCaps2.0

    An audio–caption corpus built on top of the balanced subset of AudioSet (Gemmeke et al., 2017). The original release comprises ten-second YouTube clips. We use the 975 test clips, with one instance of each audio clip.

    • Download: Request access from the AudioCaps website.
    • Usage:
      • Place audiocaps_raw_audio, test.csv, train.csv, val.csv in downloads/audiocaps_data
      • We use the test.csv split. Shard it with src/datasets/shard_audiocaps.py.
  2. BEATs Pre-trained Model

    BEATs is a self-supervised audio transformer model developed by Microsoft.

    • Download: Download Tokenizer_iter3 and BEATs_iter3 from the BEATs GitHub repository or the Hugging Face Hub if available.
    • Usage:
      • Place the pre-trained model files in downloads/beats_models/.
      • Clone the BEATs source code:
        git clone https://github.com/microsoft/unilm.git
      • Add the BEATs directory to your PYTHONPATH so the module can be imported by the embedding script:
        export PYTHONPATH=$PYTHONPATH:/path/to/unilm/beats
        (Replace /path/to/unilm/beats with the actual path to the cloned BEATs repository. A quick load check is sketched just after this list.)
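
As a quick sanity check of the BEATs setup above, the following Python sketch loads a checkpoint the way the upstream BEATs code does. The checkpoint path and filename are assumptions based on the layout described above; adjust them to the files you actually downloaded.

import torch
from BEATs import BEATs, BEATsConfig  # requires /path/to/unilm/beats on PYTHONPATH

# Path is an assumption; point it at the checkpoint placed in downloads/beats_models/.
checkpoint = torch.load("downloads/beats_models/BEATs_iter3.pt", map_location="cpu")
cfg = BEATsConfig(checkpoint["cfg"])
model = BEATs(cfg)
model.load_state_dict(checkpoint["model"])
model.eval()
print("Loaded BEATs with", sum(p.numel() for p in model.parameters()), "parameters")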

Hugging Face Authentication

Some models require authentication via the Hugging Face Hub. To enable this, set your Hugging Face token as an environment variable:

export HF_TOKEN="your_token_here"
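
Alternatively, you can authenticate programmatically with the huggingface_hub client. This is a minimal sketch that assumes the HF_TOKEN variable above has already been exported:

import os
from huggingface_hub import login

# Registers the token from the environment for subsequent gated-model downloads.
login(token=os.environ["HF_TOKEN"])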

Usage

Run the embedding pipeline with run_embed.py.

Language Model (Qwen3) examples

# WIT dataset, Qwen3-32B, 128 tokens, imagine prompts
python run_embed.py --model Qwen/Qwen3-32B --dataset wit --tokensize 128 --prompt imagine

# AudioCaps dataset, Qwen3-14B, embedding only (no generation)
python run_embed.py --model Qwen/Qwen3-14B --dataset audiocaps

Outputs are saved under:

runs/<dataset>_<prompt>/<model_tokensX_promptKey>/

Examples:

runs/wit_imagine/qwen3_32b_tokens128_hear_prompt/
runs/wit_imagine/qwen3_32b_tokens128_see_prompt/
runs/audiocaps_caption/qwen3_14b_tokens0_caption/

Sensory encoder (BEATs and DINOv2) examples

# BEATs on AudioCaps
python run_embed.py --model beats/BEATs_iter3 --dataset audiocaps

# DINOv2 on WIT
python run_embed.py --model facebook/dinov2-base --dataset wit

Outputs are saved under:

runs/<dataset>_sensory_encoders/<modelNameAfterSlash>/

Examples:

runs/audiocaps_sensory_encoders/BEATs_iter3/
runs/wit_sensory_encoders/dinov2-base/

Supported models

Language Models (LLMs)

  • Qwen/Qwen3-0.6B
  • Qwen/Qwen3-1.7B
  • Qwen/Qwen3-4B
  • Qwen/Qwen3-14B
  • Qwen/Qwen3-32B

Vision encoders

  • facebook/dinov2-base

Audio encoders

  • BEATs/BEATs_iter3
  • BEATs/BEATs_iter3_plus_AS2M

Supported datasets

Image–text datasets

  • wit

Audio–text datasets

  • audiocaps

Adding new models or datasets

To extend support:

  • Add a model

    1. Implement or reuse an embed function in src/models/embedders.py.
    2. Add the model name and embed function to the correct map in run_embed.py (a hypothetical embed-function sketch follows this list).
  • Add a dataset

    1. Add a loader in src/datasets/loaders.py.
    2. Register it in either IMAGE_TEXT_DATASETS or AUDIO_TEXT_DATASETS.
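
The exact signatures in src/models/embedders.py are not documented here, so the sketch below is hypothetical: it only illustrates the general shape of an embed function (a batch of inputs in, one vector per input out) using a Hugging Face vision encoder, not the repository's actual API.

import torch
from transformers import AutoImageProcessor, AutoModel

def embed_my_vision_encoder(images, model_name="facebook/dinov2-base", device="cuda"):
    """Hypothetical embed function: one mean-pooled embedding per input image."""
    processor = AutoImageProcessor.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).to(device).eval()
    inputs = processor(images=images, return_tensors="pt").to(device)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).cpu().numpy()

A new dataset loader would be registered analogously, keyed by the name you pass to --dataset.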

Output structure

runs/
├── wit_imagine/
│   ├── qwen3_32b_tokens128_hear_prompt/
│   ├── qwen3_32b_tokens128_see_prompt/
│   └── qwen3_32b_tokens128_plain_prompt/
├── audiocaps_caption/
│   └── qwen3_14b_tokens0_caption/
├── audiocaps_sensory_encoders/
│   ├── BEATs_iter3/
│   └── BEATs_iter3_plus_AS2M/
└── wit_sensory_encoders/
    └── dinov2-base/
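
The per-file contents of each run directory are not specified above, so this sketch only enumerates the layout shown in the tree; it assumes you run it from the repository root where runs/ lives.

from pathlib import Path

# Print every <group>/<run_name> directory two levels below runs/.
for run_dir in sorted(Path("runs").glob("*/*")):
    if run_dir.is_dir():
        print(run_dir)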

License

This project is licensed under the MIT License. See LICENSE for details.

Citation

If you use this codebase in your research, please cite:

@article{wang2025words,
  title={Words That Make Language Models Perceive},
  author={Wang, Sophie L. and Isola, Phillip and Cheung, Brian},
  journal={arXiv preprint arXiv:2510.02425},
  year={2025},
  url={https://arxiv.org/abs/2510.02425}
}

Repository Structure

sensory_prompting/
├── downloads
│   ├── audiocaps_raw_audio
│   └── beats_models
├── environment.yml
├── README.md
├── run_align.py
├── run_embed.py
├── runs
│   └── imagine_embeddings
└── src
    ├── datasets
    ├── __init__.py
    ├── models
    └── utils
