This repository accompanies the paper: Leveraging Whisper Embeddings for Audio-based Lyrics Matching by Eleonora Mancini, Joan Serrà, Paolo Torroni, and Yuki Mitsufuji [📄 Read the paper on arXiv]
This project introduces WEALY — Whisper Embeddings for Audio-based LYrics matching — a fully reproducible pipeline that leverages Whisper decoder embeddings for audio-based lyrics matching.
WEALY establishes transparent and reproducible baselines for version identification using:
- Pre-extracted Whisper decoder embeddings (hidden states)
- A learned transformer-based model trained with contrastive learning
- Support for multiple datasets (SHS100K, Discogs-VI, LyricCovers)
🎤 Multi-Modal Feature Extraction | 🚀 Training & Inference | 📊 Comprehensive Evaluation | 🤗 Pre-trained Models
This codebase is designed for large-scale experiments on high-performance computing (HPC) systems:
- Multi-GPU training via Lightning Fabric (distributed data parallel)
- Efficient data loading with caching and parallel workers
- Large datasets: ~100K-500K audio tracks per dataset
- Computationally intensive: Feature extraction and training require significant GPU resources
All scripts support distributed execution across multiple GPUs, making them suitable for both local multi-GPU setups and HPC cluster environments.
- Installation
- Data
- Dataset Organization
- Usage
- Pre-trained Models
- Configuration
- Code Organization
- Citation
- License
- Contact
- Python 3.11+
- CUDA-capable GPU(s) - Recommended: 4+ GPUs for training
- FFmpeg (for audio processing)
- ~1TB disk space (for datasets and embeddings)
- HPC environment (optional, but recommended for large-scale experiments)
```bash
# Create virtual environment
python -m venv venv
source venv/bin/activate
```

Install the custom Whisper fork:

```bash
# Install from GitHub
pip install git+https://github.com/helemanc/whisper.git

# Or force reinstall if already present
pip install --force-reinstall git+https://github.com/helemanc/whisper.git
```

For details about this fork, see: https://github.com/helemanc/whisper

```bash
# Clone repository
git clone https://github.com/yourusername/audio-based-lyrics-matching.git
cd audio-based-lyrics-matching

# Install dependencies
pip install -r requirements.txt
```

CLEWS (Contrastive Learning of Musical Embeddings with Weak Supervision) audio embeddings require a separate Python environment due to dependency conflicts with the main WEALY environment.
Clone our CLEWS fork at the same parent level as this repository:
```bash
# From the parent directory of audio-based-lyrics-matching
cd ..
git clone https://github.com/helemanc/clews.git

# Verify folder structure:
# Projects/
# ├── audio-based-lyrics-matching/
# └── clews/
```

```bash
# Create separate conda environment
conda create -n clews python=3.11
conda activate clews

# Install CLEWS dependencies
cd clews
pip install -r requirements.txt

# Install additional dependencies for feature extraction
pip install lightning omegaconf tqdm requests
```

Update configs/extraction/clews.yaml with your paths:
```yaml
clews:
  project_dir: "/path/to/clews"                  # Path to cloned CLEWS repository
  config_path: null                              # Auto-download from Zenodo
  checkpoint_path: "/path/to/clews_checkpoints"  # Where to store checkpoints
```

Note: CLEWS checkpoints are automatically downloaded from Zenodo on first use.
| Model | Dataset | Zenodo |
|---|---|---|
| `shs-clews` | SHS100K | zenodo.org/records/15045900 |
| `dvi-clews` | Discogs-VI | zenodo.org/records/15045900 |
We support three datasets for version identification research:
| Dataset | Cliques | Versions | Source | Collection Rate |
|---|---|---|---|---|
| SHS100K | ~10K | ~100K | Standard | 82% (YouTube) |
| Discogs-VI-YT | ~98K | ~493K | Standard | Full |
| LyricCovers 2.0 | ~24K | ~54K | Custom | Full |
Dataset Properties:
- All audio processed at 16 kHz mono with 5-minute maximum length
- SHS100K-v2: Established benchmark; YouTube dependencies limited collection to 82%
- Discogs-VI-YT: YouTube-available subset (~493K versions, ~98K cliques); addresses SHS limitations
- LyricCovers 2.0: Deduplicated version (54,301 covers, 24,561 originals, 80 languages)
Place all datasets in your data directory:
/path/to/data/
├── SHS100K/
│ ├── audio/
│ │ └── <clique_id>-<version_id>.mp3
│ └── metadata/
├── DiscogsVI/
│ ├── audio/
│ │ └── <artist>/<track>.mp3
│ └── metadata/
└── LyricCovers/
├── <version_id>/
│ ├── <version_id>_audio.mp3
│ └── <version_id>_lyrics.txt
└── metadata/
Required metadata files are provided in datasets/ (e.g., the LyricCovers metadata in datasets/lyric-covers/data.csv).
Expected Directory Structure:
/path/to/data/LyricCovers/
├── 1001/
│ ├── 1001_lyrics.txt
│ └── 1001_audio.mp3
├── 1002/
│ ├── 1002_lyrics.txt
│ └── 1002_audio.mp3
└── ...
Audio Format Requirements:
- Format: MP3
- Channels: Mono
- Sample rate: 16 kHz
- Naming: `{song_id}_audio.mp3`
Lyrics Format Requirements:
- Format: Plain text (.txt)
- Encoding: UTF-8
- Naming: `{song_id}_lyrics.txt`
- Content: Full lyrics text
Data Preparation: After obtaining the audio files, resample them to 16 kHz mono MP3 format and organize them following the directory structure above.
Cache location: cache/{dataset_name}/
What's cached:
- Audio metadata and file paths
- Clique/version ID mappings
- Split assignments
- Embedding path mappings (for training)
To regenerate cache: Delete cache/{dataset_name}/ directory
This codebase uses two complementary dataset classes:

**AudioDataset**
- Purpose: Load raw audio files for embedding extraction
- Used in: `scripts/feature_extraction.py`
- Returns: Audio waveforms + metadata
- When to use: When you need to extract features from audio

**EmbeddingDataset**
- Purpose: Load pre-extracted embeddings for model training
- Used in: `scripts/train.py`, `scripts/inference.py`
- Returns: Pre-computed embeddings + metadata
- When to use: When training or evaluating models
Workflow:
Audio Files → [AudioDataset] → Feature Extraction → Embeddings
↓
[EmbeddingDataset] → Training/Evaluation
Extract Whisper decoder hidden states from audio using AudioDataset.
Command Template:
```bash
python scripts/feature_extraction.py \
    jobname=<JOB_NAME> \
    conf=configs/extraction/whisper.yaml \
    data.dataset_name=<DATASET_NAME> \
    data.split=<SPLIT> \
    path.data=<PATH_TO_AUDIO_DATA> \
    path.base_path=<PATH_TO_BASE> \
    path.save_data_path=<PATH_TO_SAVE_EMBEDDINGS> \
    path.working_dir=<PATH_TO_PROJECT> \
    fabric.ngpus=<NUM_GPUS> \
    fabric.precision=<PRECISION>
```

Common Parameters:
- `jobname`: Descriptive name for this extraction job
- `data.dataset_name`: Dataset identifier (`shs`, `lyric-covers`, `discogs-vi`)
- `data.embedding_type`: Embedding type to extract (default: `last_hidden_states`)
  - `last_hidden_states`: Auto-detect language
  - `last_hidden_states_en`: Force English
  - `encoder`: Whisper encoder embeddings
- `path.data`: Root directory with audio files
- `path.base_path`: Base path for dataset-specific parameters (`<PATH_TO_WORKING_DIR>/datasets`)
- `path.save_data_path`: Where to save extracted embeddings
- `path.working_dir`: Working directory (`<PATH_TO_WORKING_DIR>/audio-based-lyrics-matching`)
- `fabric.ngpus`: Number of GPUs (recommended: 4-8 for faster extraction)
- `fabric.precision`: Computation precision (`bf16-mixed` for speed, `32` for accuracy)
Output Structure:
<PATH_TO_SAVE_EMBEDDINGS>/{Dataset}-hidden-states/
├── <clique_id>/
│ ├── <version_id>/
│ │ ├── hs_last_seq.pt # Hidden states embeddings
Example Output:
SHS100K-hidden-states/
├── 0/
│ ├── 0/
│ │ ├── hs_last_seq.pt # Shape: (seq_len, 1280)
Extract SBERT text embeddings from Whisper transcriptions:
```bash
python scripts/feature_extraction.py \
    jobname=<JOB_NAME> \
    conf=configs/extraction/sbert.yaml \
    data.dataset_name=<DATASET_NAME> \
    data.split=<SPLIT_NAME> \
    path.base_path=<PATH_TO_WORKING_DIR>/datasets \
    path.working_dir=<PATH_TO_WORKING_DIR>/audio-based-lyrics-matching \
    path.data=<PATH_TO_AUDIO_DATA> \
    path.save_data_path=<PATH_TO_SAVE_EMBEDDINGS> \
    path.transcriptions=<PATH_TO_WHISPER_TRANSCRIPTIONS> \
    fabric.ngpus=1
```

Notes:
- Requires pre-extracted Whisper transcriptions
- SBERT processing is fast (single GPU sufficient)
- Creates `hs_sbert.pt` files alongside Whisper embeddings
CLEWS extracts audio embeddings using a CNN-based architecture with shingle (overlapping window) processing. Follow this complete workflow for CLEWS feature extraction and distance matrix computation.
Before extracting CLEWS features, preprocess the dataset to create metadata files using data_preproc.py from the CLEWS repository.
Activate CLEWS environment and navigate to CLEWS directory:
```bash
conda activate clews
cd /path/to/clews
```

For SHS100K:

```bash
python data_preproc.py \
    --njobs=16 \
    --dataset=SHS100K \
    --path_meta=/path/to/audio-based-lyrics-matching/datasets/shs \
    --path_audio=/path/to/data/SHS100K/audio/ \
    --ext_in=mp3 \
    --fn_out=/path/to/audio-based-lyrics-matching/cache/clews/metadata-shs.pt
```

For Discogs-VI:

```bash
python data_preproc.py \
    --njobs=16 \
    --dataset=DiscogsVI \
    --path_meta=/path/to/audio-based-lyrics-matching/datasets/discogs-vi \
    --path_audio=/path/to/data/DiscogsVI/audio/ \
    --ext_in=mp3 \
    --fn_out=/path/to/audio-based-lyrics-matching/cache/clews/metadata-dvi.pt
```

Switch to the audio-based-lyrics-matching repository while keeping the CLEWS environment active:
```bash
conda activate clews
cd /path/to/audio-based-lyrics-matching

python scripts/feature_extraction.py \
    conf=configs/extraction/clews.yaml \
    jobname=shs_clews_extraction \
    data.dataset_name=shs \
    data.split=test \
    path.save_data_path=/path/to/data/SHS100K-hidden-states \
    path.clews_cache_dir=/path/to/audio-based-lyrics-matching/cache/clews \
    path.clews_audio_dir=/path/to/data/SHS100K/audio/ \
    clews.project_dir=/path/to/clews \
    clews.checkpoint_path=/path/to/clews/checkpoints \
    fabric.ngpus=4
```

Switch back to the CLEWS repository to compute distance matrices using CLEWS scripts:

```bash
conda activate clews
cd /path/to/clews

python compute_distance_matrix.py \
    checkpoint_dir=/path/to/clews/checkpoints \
    dataset=shs \
    output_dir=/path/to/audio-based-lyrics-matching/logs/clews-shs/distance_matrix \
    path_meta=/path/to/cache/clews/metadata-shs.pt \
    path_audio=/path/to/data/SHS100K/audio/ \
    partition=test \
    ngpus=4
```

CLEWS Parameters:
| Parameter | Default | Description |
|---|---|---|
| `maxlen` | 600 | Maximum audio length in seconds (10 min) |
| `qshop` | 5 | Shingle hop in seconds |
| `qslen` | null | Shingle length (null = use model default) |
| `clews.project_dir` | required | Path to cloned CLEWS repository |
| `clews.checkpoint_path` | required | Path to store/load CLEWS checkpoints |
| `checkpoint_dir` | required | Path to CLEWS checkpoints for distance computation |
| `output_dir` | required | Where to save distance matrices |
Output Structure:
logs/clews-shs/distance_matrix/
└── shs_clews_distance_matrix_test.pkl
Notes:
- CLEWS checkpoints are auto-downloaded from Zenodo on first use
- Extraction uses half-precision for storage efficiency
- Distance matrix includes metadata for fusion with other modalities
Train WEALY models on pre-extracted embeddings using EmbeddingDataset.
Command Template:
```bash
python scripts/train.py \
    jobname=<EXPERIMENT_NAME> \
    conf=configs/training/wealy.yaml \
    data.dataset_name=<DATASET_NAME> \
    training.batch_size=<BATCH_SIZE> \
    training.numepochs=<NUM_EPOCHS> \
    path.cache=<PATH_TO_CACHE> \
    path.logs=<PATH_TO_LOGS> \
    path.working_dir=<PATH_TO_PROJECT> \
    path.data=<PATH_TO_AUDIO_DATA> \
    path.save_data_path=<PATH_TO_SAVE_DATA> \
    path.hidden_states=<PATH_TO_EMBEDDINGS> \
    path.meta=<PATH_TO_CACHED_METADATA> \
    fabric.ngpus=<NUM_GPUS> \
    fabric.precision=<PRECISION>
```

Dataset-Specific Parameters:

For SHS100K, add:

```bash
path.shs_data=<PATH_TO_DATASETS>/shs/shs_data.csv \
path.shs_splits=<PATH_TO_DATASETS>/shs
```

For LyricCovers, add:

```bash
path.lyric_covers_data=<PATH_TO_DATASETS>/lyric-covers
```

For Discogs-VI, add:

```bash
path.discogs_vi_data=<PATH_TO_DATASETS>/discogs-vi
```

Common Training Parameters:
- `jobname`: Experiment name (creates `logs/<jobname>/` directory)
- `data.dataset_name`: Dataset to train on
- `path.hidden_states`: Pre-extracted embeddings directory
- `path.logs`: Base directory for checkpoints (will create `<logs>/<jobname>/`)
- `path.meta`: Cached metadata (auto-generated on first run)
- `fabric.ngpus`: Number of GPUs (recommended: 4 for optimal training speed)
- `fabric.precision`: `bf16-mixed` (faster) or `32` (more accurate)
- `training.batchsize`: Batch size per GPU (default: 64)
- `training.numepochs`: Maximum epochs (default: 1000)
📖 For all available parameters, see configs/training/wealy.yaml
Training Output:
<PATH_TO_LOGS>/<EXPERIMENT_NAME>/
├── configuration.yaml # Auto-saved config
├── checkpoint_last.ckpt # Latest epoch
├── checkpoint_best.ckpt # Best model (based on validation MAP)
└── checkpoint_epoch_N.ckpt # Periodic checkpoints (if enabled)
Example - Training on SHS100K:
```bash
python scripts/train.py \
    jobname=wealy_shs_baseline \
    conf=configs/training/wealy.yaml \
    data.dataset_name=shs \
    path.cache=/scratch/cache \
    path.logs=/scratch/logs \
    path.hidden_states=/scratch/embeddings/SHS100K-hidden-states \
    path.meta=/scratch/cache/shs/metadata-shs.pt \
    path.shs_data=/project/datasets/shs/shs_data.csv \
    path.shs_splits=/project/datasets/shs \
    fabric.ngpus=4
```

Expected Training Time (4 GPUs):
- SHS100K: ~24-48 hours
- LyricCovers: ~12-24 hours
- Discogs-VI: ~48-72 hours
Evaluate trained models on test sets using EmbeddingDataset.
The inference script supports two ways to load models:
- Local checkpoint: Use a checkpoint file from your filesystem
- Hugging Face Hub: Automatically download pre-trained models from HF
Local checkpoint:

```bash
python scripts/inference.py \
    checkpoint=logs/wealy-whisper-shs/checkpoint_best.ckpt \
    partition=test \
    chunk_size=1500 \
    overlap_percentage=0.9 \
    topk_distance=1 \
    use_overlapping_chunks=true \
    hidden_states=/path/to/hidden-states \
    ngpus=4 \
    disable_memory_logging=true
```

Hugging Face Hub:

```bash
python scripts/inference.py \
    model_name=wealy-whisper-shs \
    hidden_states=/path/to/hidden-states \
    partition=test \
    use_overlapping_chunks=true \
    ngpus=4
```

The model will be automatically downloaded to logs/<model-name>/ on first use and cached for subsequent runs.
For multi-GPU inference, use torchrun to spawn processes:
```bash
torchrun --nproc_per_node=4 --standalone scripts/inference.py \
    model_name=wealy-whisper-shs \
    hidden_states=/path/to/SHS100K-hidden-states \
    shs_data=/path/to/datasets/shs/shs_data.csv \
    shs_splits=/path/to/datasets/shs \
    cache=/path/to/cache \
    data=/path/to/data \
    partition=test \
    use_overlapping_chunks=true \
    ngpus=4
```

Since models downloaded from Hugging Face have sanitized configurations (personal paths removed), you must provide the required paths via CLI arguments.
For SHS100K models (wealy-*-shs):
```bash
hidden_states=/path/to/SHS100K-hidden-states \
shs_data=/path/to/datasets/shs/shs_data.csv \
shs_splits=/path/to/datasets/shs \
cache=/path/to/cache \
data=/path/to/data
```

For LyricCovers models (wealy-*-lyc):

```bash
hidden_states=/path/to/LyricCovers-hidden-states \
lyric_covers_data=/path/to/datasets/lyric-covers \
cache=/path/to/cache \
data=/path/to/data
```

For Discogs-VI models (wealy-*-dvi):

```bash
hidden_states=/path/to/DiscogsVI-hidden-states \
discogs_vi_data=/path/to/datasets/discogs-vi \
cache=/path/to/cache \
data=/path/to/data
```

When running inference, the following directory structure is created:
logs/<model-name>/ # Model directory (HF models stored here)
├── best.ckpt # Model checkpoint
├── configuration.yaml # Model configuration
└── eval_checkpoints/ # Created during inference
├── extraction_checkpoint_*.pkl # Intermediate extraction checkpoints
├── evaluation_checkpoint_*.pkl # Intermediate evaluation checkpoints
├── final_results.pkl # Final evaluation metrics
└── crash_report_rank_*.txt # Error logs (if any)
The eval_checkpoints/ folder enables resumable inference - if evaluation is interrupted, it will resume from the last checkpoint automatically.
Standard Evaluation (faster):

```bash
use_overlapping_chunks=false
```

Uses a single embedding per version. Suitable for quick testing.

Overlapping Chunks Evaluation (recommended for final results):

```bash
use_overlapping_chunks=true \
chunk_size=1500 \
overlap_percentage=0.9 \
topk_distance=1
```

Generates multiple overlapping chunks per audio file for more robust evaluation.
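To make the chunking parameters concrete, here is a minimal sketch of how chunk start positions could be derived from `chunk_size` and `overlap_percentage` (hop = chunk_size × (1 − overlap), last chunk aligned to the sequence end). The exact policy lives in the inference script and may differ:

```python
def chunk_starts(n_frames, chunk_size=1500, overlap_percentage=0.9):
    """Start indices of overlapping chunks over a sequence of n_frames.
    With the defaults, hop = 1500 * (1 - 0.9) = 150 frames; the final
    chunk is aligned to the end so no frames are dropped."""
    hop = max(1, int(round(chunk_size * (1 - overlap_percentage))))
    if n_frames <= chunk_size:
        return [0]  # sequence fits in a single chunk
    starts = list(range(0, n_frames - chunk_size + 1, hop))
    if starts[-1] != n_frames - chunk_size:
        starts.append(n_frames - chunk_size)  # end-aligned last chunk
    return starts
```

For example, a 1800-frame sequence with the defaults yields chunks starting at frames 0, 150, and 300.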
| Parameter | Default | Description |
|---|---|---|
| `model_name` | None | HF model name (e.g., `wealy-whisper-shs`) |
| `checkpoint` | None | Local checkpoint path (alternative to `model_name`) |
| `partition` | `test` | Dataset split (`test`, `val`, `train`) |
| `ngpus` | 1 | Number of GPUs |
| `precision` | `bf16-mixed` | Computation precision |
| `use_overlapping_chunks` | `false` | Enable overlapping chunk evaluation |
| `chunk_size` | 1500 | Size of overlapping chunks |
| `overlap_percentage` | 0.9 | Overlap between chunks (0.0-0.99) |
| `topk_distance` | 1 | Top-k distance aggregation |
| `disable_checkpointing` | `false` | Disable intermediate checkpoints |
| `disable_memory_logging` | `true` | Disable GPU memory logging |
| `force_download` | `false` | Force re-download from HF |
- MAP (Mean Average Precision): Primary metric for retrieval quality
- MR1 (Mean Rank of first correct match): Average rank at which the first correct version is retrieved (lower is better)
- ARP (Average Rank Percentile): Average rank position as percentile
Results are printed to console and saved to <model_dir>/eval_checkpoints/final_results.pkl
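For reference, MAP over a query-by-candidate distance matrix can be sketched as follows. This is an illustrative implementation of the standard definition, not the repository's code (which lives in lib/evaluation/eval.py and may differ in details such as clique handling):

```python
import numpy as np

def mean_average_precision(D, cliques):
    """MAP: for each query, rank all other versions by distance (self
    excluded) and average the precision at each rank where a same-clique
    version appears."""
    cliques = np.asarray(cliques)
    aps = []
    for i in range(len(D)):
        order = np.argsort(D[i])
        order = order[order != i]                      # drop the query itself
        rel = (cliques[order] == cliques[i]).astype(float)
        if rel.sum() == 0:
            continue                                   # singleton clique: skip
        hits = np.cumsum(rel)                          # correct results so far
        ranks = np.arange(1, len(rel) + 1)
        aps.append(float(np.sum(rel * hits / ranks) / rel.sum()))
    return float(np.mean(aps))
```

A perfect ranking (every same-clique version ahead of all others) gives MAP = 1.0; random rankings drift toward the proportion of relevant items.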
Compute pairwise distance matrices for trained models. These matrices can be used for:
- Detailed analysis and debugging
- Multimodal fusion (combining CLEWS + WEALY)
- Custom retrieval experiments
Compute WEALY distance matrix with overlapping chunks for robust evaluation:
```bash
python scripts/compute_distance_matrix.py \
    checkpoint=/path/to/logs/wealy-whisper-shs \
    partition=test \
    use_overlapping_chunks=true \
    overlap_percentage=0.9 \
    topk_distance=3 \
    ngpus=4
```

Output:
logs/wealy-whisper-shs/distance_matrix/
└── test_overlapping_topk3_distances.pkl
Compute distance matrix with single embedding per song:
```bash
python scripts/compute_distance_matrix.py \
    checkpoint=logs/wealy_shs/best.ckpt \
    partition=test \
    save_distance_matrix=distances/wealy_shs_test.pkl \
    ngpus=4
```

Distance matrices are saved as pickle files with metadata:
```python
{
    'distance_matrix': np.ndarray,  # (n_queries, n_candidates)
    'query_references': [
        {'clique': int, 'version': int, 'matrix_row': int},
        ...
    ],
    'candidate_references': [
        {'clique': int, 'version': int, 'matrix_col': int},
        ...
    ],
    'metadata': {
        'checkpoint': str,
        'partition': str,
        'use_overlapping_chunks': bool,
        'topk_distance': int,
        ...
    }
}
```

The `matrix_row` and `matrix_col` fields ensure exact correspondence between the distance matrix and the song metadata.
Combine distance matrices from different modalities (e.g., CLEWS audio + WEALY lyrics) to find optimal fusion weights. This allows combining the strengths of audio-based and lyrics-based approaches.
Prerequisite: Compute both CLEWS and WEALY distance matrices (see Step 3 in CLEWS workflow and Step 4 above).
Basic Fusion:

```bash
python scripts/multimodal_fusion.py \
    --matrix1 /path/to/logs/clews-shs/distance_matrix/shs_clews_distance_matrix_test.pkl \
    --matrix2 /path/to/logs/wealy-whisper-shs/distance_matrix/test_overlapping_topk3_distances.pkl \
    --output /path/to/logs/fusion-wealy-clews/test_results.csv
```

This evaluates the fusion `combined_distance = matrix1 + alpha * matrix2` across different alpha values.
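The fusion rule is simple enough to sketch in a few lines. The toy scorer below uses top-1 accuracy as a stand-in for the MAP/MR1/ARP metrics the script actually reports, so it is illustrative only:

```python
import numpy as np

def fuse_and_score(D1, D2, labels, alphas):
    """Grid-search the fusion weight: for each alpha, score
    D1 + alpha * D2 with a simple retrieval metric (top-1 accuracy).
    Self-matches on the diagonal are masked out with +inf."""
    labels = np.asarray(labels)
    results = {}
    for alpha in alphas:
        D = D1 + alpha * D2
        D = D + np.where(np.eye(len(D), dtype=bool), np.inf, 0.0)  # mask self
        top1 = labels[np.argmin(D, axis=1)]      # label of nearest neighbor
        results[alpha] = float(np.mean(top1 == labels))
    return results
```

Picking the alpha with the best score on a validation partition, then applying it on test, mirrors what the fusion script automates.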
Fine-grained search around a specific range:
```bash
python scripts/multimodal_fusion.py \
    --matrix1 distances/clews_shs_test.pkl \
    --matrix2 distances/wealy_shs_test.pkl \
    --alpha_range 0.0 2.0 0.1 \
    --output fusion_results.csv
```

Or specify exact values:

```bash
python scripts/multimodal_fusion.py \
    --matrix1 distances/clews_shs_test.pkl \
    --matrix2 distances/wealy_shs_test.pkl \
    --alphas 0.0 0.5 1.0 1.5 2.0 \
    --output fusion_results.csv
```

The script prints a table showing MAP/MR1/ARP for each alpha value.
Results are also saved to the CSV file for further analysis.
```bash
# 1. Extract CLEWS embeddings (separate environment)
conda activate clews
python scripts/feature_extraction.py \
    conf=configs/extraction/clews.yaml \
    data.dataset_name=shs

# 2. Extract WEALY embeddings (main environment)
conda activate wealy
python scripts/feature_extraction.py \
    conf=configs/extraction/whisper.yaml \
    data.dataset_name=shs

# 3. Train WEALY model
python scripts/train.py \
    conf=configs/training/wealy.yaml \
    data.dataset_name=shs

# 4. Compute CLEWS distance matrix (requires CLEWS preprocessing)
# (CLEWS distances are typically pre-computed during the CLEWS workflow)
python path/to/clews/compute_distance_matrix.py \
    checkpoint_dir=/path/to/clews/checkpoints \
    dataset=shs \
    output_dir=/path/to/audio-based-lyrics-matching/logs/clews-shs/distance_matrix \
    path_meta=/path/to/cache/clews/metadata-shs.pt \
    path_audio=/path/to/data/SHS100K/audio/ \
    partition=test \
    ngpus=4

# 5. Compute WEALY distance matrix
python scripts/compute_distance_matrix.py \
    checkpoint=logs/wealy_shs/best.ckpt \
    partition=test \
    save_distance_matrix=/path/to/audio-based-lyrics-matching/logs/wealy-whisper-shs/distance_matrix

# 6. Find optimal fusion
python scripts/multimodal_fusion.py \
    --matrix1 /path/to/audio-based-lyrics-matching/logs/clews-shs/clews_shs_test.pkl \
    --matrix2 /path/to/audio-based-lyrics-matching/logs/wealy-whisper-shs/distance_matrix/wealy_shs_test.pkl \
    --alpha_range 0.0 2.5 0.1 \
    --output /path/to/audio-based-lyrics-matching/logs/fusion-wealy-clews/fusion_results.csv
```

Evaluate transcription-based baselines (SBERT, TF-IDF) and theoretical bounds on Whisper transcriptions.
- Pre-extracted Whisper transcriptions (generated during feature extraction)
- Transcriptions stored in `<data_folder>/<Dataset>-transcriptions/transcriptions/`
```bash
python scripts/compute_baselines.py \
    jobname=<JOB_NAME> \
    conf=configs/evaluation/baselines.yaml \
    data.dataset_name=<DATASET_NAME> \
    data.split=<SPLIT> \
    data.whisper_set=<WHISPER_SET> \
    path.base_path=<PATH_TO_DATASETS> \
    path.data=<PATH_TO_DATA> \
    path.save_results_path=<PATH_TO_SAVE_RESULTS> \
    'baselines.compute=[sbert,tfidf-cosine,tfidf-lucene,ideal,random,modified_ideal,modified-random]'
```

Example:

```bash
python scripts/compute_baselines.py \
    jobname=shs_test_baselines \
    conf=configs/evaluation/baselines.yaml \
    data.dataset_name=shs \
    data.split=test \
    data.whisper_set=prompt_whisper_42_transcribe \
    path.base_path=/path/to/audio-based-lyrics-matching/datasets \
    path.data=/path/to/data \
    path.save_results_path=/path/to/logs/baselines \
    'baselines.compute=[sbert,tfidf-cosine,tfidf-lucene,ideal,random,modified_ideal,modified-random]'
```

| Parameter | Description | Example |
|---|---|---|
| `jobname` | Descriptive name for this baseline run | `shs_test_baselines` |
| `data.dataset_name` | Dataset identifier | `shs`, `lyric-covers`, `discogs-vi` |
| `data.split` | Dataset split to evaluate | `test`, `val`, `train` |
| `data.whisper_set` | Whisper configuration identifier (without dataset prefix) | `prompt_whisper_42_transcribe` |
| `path.base_path` | Parent directory containing dataset metadata folders | `/path/to/datasets/` |
| `path.data` | Root directory with audio files and transcriptions | `/path/to/data/` |
| `path.save_results_path` | Where to save baseline results | `/path/to/logs/baselines/` |
| `baselines.compute` | List of baselines to compute | See available baselines below |
- `sbert`: Sentence-BERT embeddings with cosine similarity
- `tfidf-cosine`: TF-IDF with cosine similarity
- `tfidf-lucene`: TF-IDF with Lucene-style scoring
- `ideal`: Theoretical upper bound (perfect clique matching)
- `random`: Random baseline (lower bound)
- `modified_ideal`: Perfect matching only for valid transcriptions
- `modified-random`: Upper bound for transcription-based methods
The whisper_set parameter should match your transcription files without the dataset prefix. For example:
- If files are named `shs_prompt_whisper_42_transcribe.txt`
- Use `data.whisper_set=prompt_whisper_42_transcribe`
The dataset name (shs_) is automatically prepended by the dataloader.
Results are saved to <save_results_path>/<jobname>/:
logs/baselines/shs_test_baselines/
├── baseline_results.json # All baseline metrics
├── detailed_results.pkl # Full result tensors (if enabled)
└── comparison_table.txt # Human-readable comparison
Metrics computed:
- MAP (Mean Average Precision): Primary ranking quality metric
- MR1 (Mean Rank of first correct match): Average rank of the first correct result (lower is better)
- ARP (Average Rank Percentile): Normalized ranking (0-100)
Config: See configs/evaluation/baselines.yaml for all configuration options.
We provide 8 pre-trained WEALY models on Hugging Face Hub trained on different datasets and embedding types.
| Model Name | Dataset | Embedding Type | HF Repository |
|---|---|---|---|
| `wealy-whisper-shs` | SHS100K | Whisper (auto-lang) | audio-based-lyrics-matching/wealy-whisper-shs |
| `wealy-sbert-shs` | SHS100K | SBERT | audio-based-lyrics-matching/wealy-sbert-shs |
| `wealy-whisper-en-shs` | SHS100K | Whisper (English) | audio-based-lyrics-matching/wealy-whisper-en-shs |
| `wealy-avgembmlp-shs` | SHS100K | Avg Embedding MLP | audio-based-lyrics-matching/wealy-avgembmlp-shs |
| `wealy-cls-shs` | SHS100K | CLS Token | audio-based-lyrics-matching/wealy-cls-shs |
| `wealy-whisper-lyc` | LyricCovers | Whisper (auto-lang) | audio-based-lyrics-matching/wealy-whisper-lyc |
| `wealy-sbert-lyc` | LyricCovers | SBERT | audio-based-lyrics-matching/wealy-sbert-lyc |
| `wealy-whisper-dvi` | Discogs-VI | Whisper (auto-lang) | audio-based-lyrics-matching/wealy-whisper-dvi |
SHS100K model:
```bash
torchrun --nproc_per_node=4 --standalone scripts/inference.py \
    model_name=wealy-whisper-shs \
    hidden_states=/path/to/SHS100K-hidden-states \
    shs_data=/path/to/datasets/shs/shs_data.csv \
    shs_splits=/path/to/datasets/shs \
    cache=/path/to/cache \
    data=/path/to/data \
    partition=test \
    use_overlapping_chunks=true \
    ngpus=4
```

LyricCovers model:

```bash
torchrun --nproc_per_node=4 --standalone scripts/inference.py \
    model_name=wealy-whisper-lyc \
    hidden_states=/path/to/LyricCovers-hidden-states \
    lyric_covers_data=/path/to/datasets/lyric-covers \
    cache=/path/to/cache \
    data=/path/to/data \
    partition=test \
    use_overlapping_chunks=true \
    ngpus=4
```

Discogs-VI model:

```bash
torchrun --nproc_per_node=4 --standalone scripts/inference.py \
    model_name=wealy-whisper-dvi \
    hidden_states=/path/to/DiscogsVI-hidden-states \
    discogs_vi_data=/path/to/datasets/discogs-vi \
    cache=/path/to/cache \
    data=/path/to/data \
    partition=test \
    use_overlapping_chunks=true \
    ngpus=4
```

To see all available models programmatically:

```python
from utils.hf_utils import print_available_models
print_available_models()
```

Or use the upload script with `--list`:

```bash
python scripts/upload_to_hf.py --list
```

All configurations are in configs/.
This file contains all training parameters with detailed documentation:
```yaml
# See configs/training/wealy.yaml for:
# - Path configurations
# - Dataset settings (chunk size, augmentation, etc.)
# - Model architecture (layers, dimensions, attention heads)
# - Training hyperparameters (learning rate, batch size, scheduler)
# - Monitoring and early stopping
# - Distributed training setup
```

To customize training, either:
- Edit the config file directly, or
- Override via command line:

```bash
python scripts/train.py \
    conf=configs/training/wealy.yaml \
    training.batchsize=128 \
    training.optim.lr=5e-4 \
    model.num_transformer_blocks=6
```

audio-based-lyrics-matching/
├── configs/ # Configuration files
│ ├── extraction/
│ │ ├── whisper.yaml # Whisper extraction config
│ │ ├── sbert.yaml # SBERT extraction config
│ │ └── clews.yaml # CLEWS extraction config
│ ├── training/
│ │ └── wealy.yaml # Complete WEALY training config
│ └── evaluation/
│ └── baselines.yaml # Baseline evaluation config
│
├── datasets/ # Dataset metadata
│ ├── shs/
│ │ ├── shs_data.csv # SHS100K metadata
│ │ ├── SHS100K-TRAIN # Train split clique list
│ │ ├── SHS100K-VAL # Validation split clique list
│ │ ├── SHS100K-TEST # Test split clique list
│ │ └── list # Full clique list
│ ├── discogs-vi/
│ │ ├── id-to-file-mapping.csv # Song ID to filename mapping
│ │ ├── DiscogsVI-YT-20240701-light.json.train # Train split
│ │ ├── DiscogsVI-YT-20240701-light.json.val # Validation split
│ │ └── DiscogsVI-YT-20240701-light.json.test # Test split
│ └── lyric-covers/
│ ├── data.csv # LyricCovers metadata
│ ├── train_no_dup.csv # Train split
│ ├── val_no_dup.csv # Validation split
│ └── test_no_dup.csv # Test split
│
├── lib/ # Core library
│ ├── audio_dataset/ # Audio data loading
│ │ ├── dataloader.py # DataLoader creation
│ │ ├── dataset.py # AudioDataset for feature extraction
│ │ ├── cache.py # TranscriptionCache
│ │ ├── validator.py # Transcription validation
│ │ ├── data_processing.py # Dataset-specific processing
│ │ └── utils.py # Audio dataset utilities
│ ├── embedding_dataset/ # Embedding data loading
│ │ ├── base_dataset.py # EmbeddingDataset (training/eval)
│ │ ├── multimodal_dataset.py # Multimodal dataset handling
│ │ ├── collate_functions.py # Batch collation functions
│ │ ├── cache_manager.py # Embedding cache management
│ │ ├── data_processing.py # Embedding data processing
│ │ └── utils.py # Embedding dataset utilities
│ ├── models/
│ │ └── wealy.py # WEALY model architecture
│ ├── evaluation/
│ │ ├── eval.py # Evaluation metrics (MAP, MR1, ARP)
│ │ ├── distances.py # Distance functions (cosine, euclidean)
│ │ ├── fusion.py # Multimodal fusion utilities
│ │ └── baselines.py # Baseline implementations
│ ├── extractors.py # Feature extraction classes
│ ├── layers.py # Neural network layers (documented)
│ ├── losses.py # Loss definitions
│ └── tensor_ops.py # Tensor operations
│
├── utils/ # Utility modules (all documented with type hints)
│ ├── training_utils.py # Training logic and loops
│ ├── inference_utils.py # Evaluation logic
│ ├── extraction_utils.py # Feature extraction helpers
│ ├── latents_extraction_utils.py # Latent representation extraction
│ ├── evaluation_utils.py # Metric computation (documented)
│ ├── distance_matrix_utils.py # Distance matrix computation (documented)
│ ├── baselines_utils.py # Baseline evaluation utilities
│ ├── hf_utils.py # HuggingFace Hub utilities
│ ├── clews_utils.py # CLEWS-specific utilities
│ ├── print_utils.py # Logging utilities (documented)
│ └── pytorch_utils.py # PyTorch helpers (documented)
│
├── scripts/ # Executable scripts
│ ├── feature_extraction.py # Extract embeddings (Whisper/SBERT/CLEWS)
│ ├── train.py # Train models
│ ├── inference.py # Evaluate models
│ ├── compute_distance_matrix.py # Compute pairwise distance matrices
│ ├── compute_baselines.py # Compute transcription baselines
│ ├── multimodal_fusion.py # Combine distance matrices from different modalities
│ └── upload_to_hf.py # Upload models to HuggingFace Hub
│
├── requirements.txt # Python dependencies
├── LICENSE # License file
└── README.md # This file (comprehensive documentation)
Configuration System: OmegaConf-based YAML configs with CLI overrides
Data Pipeline:
- `AudioDataset`: Raw audio loading for feature extraction
- `EmbeddingDataset`: Pre-extracted embedding loading for training
- `TranscriptionCache`: Efficient transcription caching with validation
Model Architecture:
- Transformer-based encoder with contrastive learning
- Support for multiple embedding types (Whisper, SBERT, CLEWS)
- Flexible pooling strategies (mean, max, attention)
Evaluation Framework:
- Standard metrics: MAP, MR1, ARP
- Baseline methods: SBERT, TF-IDF, theoretical bounds
- Multimodal fusion with grid search
Documentation:
- All utility modules have Google-style docstrings
- Type hints throughout the codebase
- Comprehensive README with detailed usage examples
If you use this code in your research, please cite our paper:
```bibtex
@article{mancini2025wealy,
  title={Leveraging Whisper Embeddings for Audio-based Lyrics Matching},
  author={Mancini, Eleonora and Serrà, Joan and Torroni, Paolo and Mitsufuji, Yuki},
  journal={arXiv preprint arXiv:2510.08176},
  year={2025}
}
```

[LICENSE TYPE] - See LICENSE file for details
For questions or issues:
- Open an issue: GitHub Issues
- Email: [email protected]
⭐ Star this repository if you find it useful!
Watch for updates as we continue adding features and improvements.