This folder provides scripts for extracting MERT-based audio features, the representation used by CLaMP 3’s audio encoder. The features are generated with the MERT-v1-95M model: audio is split into non-overlapping 5-second segments, and the model's outputs are averaged across all layers and time steps to produce a single feature vector per segment.
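The segmentation and pooling described above can be sketched with NumPy alone. This is a minimal illustration, not the code in `extract_mert.py`: the 24 kHz sample rate, the handling of a shorter trailing segment, and the array shapes are all assumptions, and random numbers stand in for real model outputs.

```python
import numpy as np

SAMPLE_RATE = 24_000   # assumption: input rate expected by MERT-v1-95M
SEGMENT_SECONDS = 5    # segment length stated in the description above


def split_into_segments(waveform: np.ndarray) -> list[np.ndarray]:
    """Cut a mono waveform into consecutive 5-second chunks with no overlap."""
    step = SAMPLE_RATE * SEGMENT_SECONDS
    # Assumption: a shorter trailing chunk is kept rather than dropped.
    return [waveform[start:start + step] for start in range(0, len(waveform), step)]


def pool_segment(hidden_states: np.ndarray) -> np.ndarray:
    """Average a (layers, time_steps, dim) array over layers and time steps."""
    return hidden_states.mean(axis=(0, 1))


# Toy data standing in for real audio: 12 seconds of silence.
waveform = np.zeros(SAMPLE_RATE * 12, dtype=np.float32)
segments = split_into_segments(waveform)

# Random arrays standing in for per-segment model outputs (hypothetical shapes).
rng = np.random.default_rng(0)
features = np.stack([
    pool_segment(rng.standard_normal((13, 50, 768)))
    for _ in segments
])

print(len(segments), features.shape)
```

Each row of `features` corresponds to one segment, which matches the "single feature per segment" behavior described above.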
Download the MERT-v1-95M model from Hugging Face.
Step 1: Extract MERT features from audio files.
- Execution:
  Run the script using the following command:

      python extract_mert.py --input_path <input_path> --output_path <output_path> --model_path m-a-p/MERT-v1-95M --mean_features
- Input: Audio files (.mp3, .wav).
- Output: MERT-extracted features (.npy).
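To check an output file, the `.npy` features can be read back with NumPy. The sketch below is only illustrative: the file name `example.npy` and the `(4, 768)` shape are hypothetical, and a temporary file is written first so the snippet runs without real extraction output.

```python
import tempfile
from pathlib import Path

import numpy as np


def load_features(path: Path) -> np.ndarray:
    """Load one extracted feature file from disk."""
    return np.load(path)


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical stand-in for a file produced by the extraction step.
    demo_path = Path(tmp) / "example.npy"
    np.save(demo_path, np.zeros((4, 768), dtype=np.float32))

    features = load_features(demo_path)

print(features.shape, features.dtype)
```

In practice, `demo_path` would be replaced by a file under the `<output_path>` passed to the script.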