Peize He1*,
Zichen Wen1,2*,
Yubo Wang1*,
Yuxuan Wang1,
Xiaoqian Liu1,3,
Jiajie Huang1,
Zehui Lei1,
Zhuangcheng Gu4,
Xiangqi Jin1,
Jiabing Yang5,
Kai Li6,
Zhifei Liu1,
Weijia Li7,2,
Cunxiang Wang6,
Conghui He2,
Linfeng Zhang1†
1Shanghai Jiao Tong University 2Shanghai AI Laboratory 3Northeastern University
4Carnegie Mellon University 5University of Chinese Academy of Sciences
6Tsinghua University 7Sun Yat-sen University
*Equal contribution †Corresponding author
2025.10.08: We release our latest work AudioMarathon, a comprehensive benchmark designed to evaluate the performance and efficiency of Audio-LLMs on long-form audio understanding tasks. Code is available!
- Overview
- Supported Tasks
- Pipeline
- Leader Board
- Visible Results
- Repository Structure
- Quick Start
- Configuration
- Performance Analysis
- Utility Tools
- Data Preparation
- Citation
- Contact
- Acknowledgments
- License
Quick Links: View Leader Board | View Results | Get Started
AudioMarathon is a comprehensive benchmark designed to evaluate Audio Large Language Models (Audio-LLMs) on long-form audio understanding tasks. This repository contains the evaluation code and tools for testing various state-of-the-art audio-language models across multiple challenging tasks.
- Multi-Task Evaluation: Supports 10+ diverse audio understanding tasks
- Long-Form Audio: Handles extended audio sequences up to several minutes
- Multiple Models: Evaluation scripts for Phi-4-MM, Qwen2.5-Omni, Voxtral, and Aero-1
- Audio Token Pruning: Built-in support for various KV-cache compression methods
- Comprehensive Metrics: Detailed performance analysis with timing statistics
AudioMarathon evaluates models across the following task categories with 6,563 samples:
| Task | Dataset | Samples | Description |
|---|---|---|---|
| Automatic Speech Recognition (ASR) | LibriSpeech | 204 (3.10%) | Transcribe and understand spoken content |
| Speech Content Reasoning (SCR) | RACE | 820 (12.49%) | Answer questions based on read-aloud passages |
| Speech Entity Recognition (SER) | SLUE | 490 (7.46%) | Recognize and extract entities from spoken language |
| Task | Dataset | Samples | Description |
|---|---|---|---|
| Audio Scene Classification (ASC) | TAU | 1,145 (17.44%) | Classify acoustic scenes (indoor/outdoor environments) |
| Music Classification (MC) | GTZAN | 120 (1.83%) | Classify music genres from audio clips |
| Sound Event Detection (SED) | DESED | 254 (3.87%) | Detect and classify sound events in domestic environments |
| Task | Dataset | Samples | Description |
|---|---|---|---|
| Emotion Recognition (ER) | VESUS | 185 (2.82%) | Recognize emotions from speech |
| Speech Detection (SD) | HAD | 776 (11.82%) | Distinguish between real and AI-generated speech |
| Speaker Age Recognition (SAR) | VoxCeleb | 959 (14.60%) | Classify speaker age groups from voice |
| Speaker Gender Recognition (SGR) | VoxCeleb | 1,614 (24.58%) | Classify speaker gender from voice |
```
AudioMarathon/
├── Phi4MM/
│   ├── DART/                 # DART (Dynamic Audio Reduction Technique) implementations
│   ├── Others/               # Standard evaluation scripts for Phi-4-MM
│   │   ├── DESED_test.py     # Sound event detection evaluation
│   │   ├── gtzan_test.py     # Music genre classification
│   │   ├── HAD_test.py       # Audio deepfake detection
│   │   ├── race_test.py      # Reading comprehension
│   │   ├── SLUE_test.py      # Spoken language understanding
│   │   ├── TAU_test.py       # Acoustic scene classification
│   │   ├── VESUS_test.py     # Emotion recognition
│   │   ├── Vox_age_test.py   # Age classification
│   │   └── Vox_test.py       # Gender classification
│   └── phi4_kvpress/         # KV-cache compression methods
│
├── Qwen_2.5_Omni/
│   ├── Dart/                 # DART implementations for Qwen
│   ├── Others/               # Standard evaluation scripts for Qwen2.5-Omni
│   └── qwen_kvpress/         # KV-cache compression methods
│
├── Voxtral/
│   ├── eval_DESED.py         # Voxtral evaluation scripts
│   ├── eval_GTZAN.py
│   ├── eval_HAD.py
│   ├── eval_LibriSpeech.py
│   ├── eval_RACE.py
│   ├── eval_SLUE.py
│   ├── eval_TAU.py
│   ├── eval_VESUS.py
│   ├── eval_Vox_Age.py
│   └── eval_Vox.py
│
├── Aero-1/                   # Aero-1 model evaluation scripts
│   ├── DART/
│   └── Others/
│
├── kvpress/                  # KV-cache compression implementations
│   ├── presses/              # Various compression strategies
│   ├── attention_patch.py
│   ├── audio_features.py
│   └── pipeline.py
│
├── Segment/                  # Audio segmentation tools
│   ├── GTZAN_task.py
│   ├── HAD_segment.py
│   ├── TAU_task.py
│   └── Vox2_task.py
│
└── analyse_audio_duration/   # Audio duration analysis utilities
```
- Clone the repository

```bash
git clone https://github.com/YourUsername/AudioMarathon.git
cd AudioMarathon
```

- Install dependencies

Choose the appropriate requirements file based on the model you want to evaluate:

```bash
# For Phi-4-MM
pip install -r Phi4_requirements.txt

# For Qwen2.5-Omni
pip install -r Qwen_requirements.txt

# For Aero-1
pip install -r Aero1_requirements.txt
```

Note: Each model has its own environment requirements. We recommend using separate virtual environments for different models to avoid dependency conflicts.
```bash
cd Phi4MM/Others

# Basic evaluation
export CUDA_VISIBLE_DEVICES=0
export PRUNE_RATIO=0
export PRUNE_METHOD=base
export SAMPLE_LIMIT=100
export RESULTS_DIR=./GTZAN_Results
python gtzan_test.py

# Using FastV pruning with 50% compression
export CUDA_VISIBLE_DEVICES=0
export PRUNE_LAYER_IDX=2
export PRUNE_RATIO=0.5
export PRUNE_METHOD=fast_v
export RESULTS_DIR=./GTZAN_Results_FastV50
python gtzan_test.py
```

For evaluating multiple sparsity ratios in a single run, use the batch testing scripts:
```bash
cd Qwen_2.5_Omni/Dart

# Basic usage - test HAD task with default ratios (0.1-0.9)
bash batch_test.sh HAD

# Test on a specific GPU
bash batch_test.sh --gpu-id 1 TAU

# Test with a sample limit (useful for quick testing)
bash batch_test.sh --sample-limit 100 SLUE

# Test with specific sparsity ratios
bash batch_test.sh --ratios 0.0,0.3,0.5,0.7 VESUS

# Test with custom pruned layers and output directory
bash batch_test.sh --pruned-layer 3 --output-dir ./my_results GTZAN

# Comprehensive test with all options
bash batch_test.sh \
  --gpu-id 0 \
  --sample-limit 200 \
  --pruned-layer 2 \
  --ratios 0.0,0.2,0.4,0.6,0.8 \
  --output-dir ./Qwen_DART_Results \
  race
```

Available Options:
- `-g, --gpu-id <id>`: Specify GPU device (default: `0`)
- `-s, --sample-limit <num>`: Limit number of samples (default: `0` for no limit)
- `-l, --pruned-layer <num>`: Number of layers to prune (default: `2`)
- `-r, --ratios <ratios>`: Comma-separated sparsity ratios (default: `0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9`)
- `-o, --output-dir <dir>`: Results output directory (default: `./Qwen_DART_Results`)
- `-h, --help`: Show help message
Supported Tasks: HAD, race, SLUE, TAU, VESUS, Vox, Vox_age, LibriSpeech, DESED, GTZAN
The batch script will:
- Automatically run tests for all specified sparsity ratios
- Generate logs for each test in `<output-dir>/logs/`
- Create a summary report with all results
- Display timing statistics and accuracy metrics
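Under the hood, such a sweep amounts to re-running the task script once per sparsity ratio with a different environment. The sketch below is a minimal Python equivalent, assuming the environment-variable interface documented in the Configuration section; the script name, method value, and output layout are illustrative placeholders, not the repository's actual batch logic.

```python
import os
import subprocess

def make_sweep(ratios, task_script="gtzan_test.py", method="dart",
               layer_idx=2, results_root="./Qwen_DART_Results"):
    """Build one (env, command) pair per sparsity ratio.

    The variable names mirror the Configuration table; everything
    else (script name, method string) is a hypothetical example.
    """
    jobs = []
    for r in ratios:
        env = dict(os.environ)  # inherit the current shell environment
        env.update({
            "PRUNE_RATIO": str(r),
            "PRUNE_METHOD": method,
            "PRUNE_LAYER_IDX": str(layer_idx),
            "RESULTS_DIR": f"{results_root}/ratio_{r}",
        })
        jobs.append((env, ["python", task_script]))
    return jobs

if __name__ == "__main__":
    for env, cmd in make_sweep([0.0, 0.3, 0.5, 0.7]):
        print(cmd, env["PRUNE_RATIO"], env["RESULTS_DIR"])
        # subprocess.run(cmd, env=env, check=True)  # uncomment to actually run
```

Each job gets its own `RESULTS_DIR`, so results from different ratios never overwrite each other.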
All evaluation scripts support the following environment variables:
| Variable | Description | Default | Options |
|---|---|---|---|
| `CUDA_VISIBLE_DEVICES` | GPU device ID | `0` | Any valid GPU ID |
| `PRUNE_LAYER_IDX` | Layer index for audio pruning | `2` | Integer >= 0 |
| `PRUNE_RATIO` | Ratio of audio tokens to prune | `0` | 0.0 - 1.0 |
| `PRUNE_METHOD` | Pruning method to use | `base` | `base`, `fast_v`, `random`, `frame` |
| `SAMPLE_LIMIT` | Limit number of samples | `0` (no limit) | Integer >= 0 |
| `RESULTS_DIR` | Output directory for results | Task-specific | Any valid path |
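Evaluation scripts read this configuration from the environment at startup. A sketch of the parsing, with defaults taken from the table above (the helper name is hypothetical and this is illustrative, not the repository's actual code):

```python
import os

def load_eval_config():
    """Read the evaluation settings documented in the table above.

    Defaults match the table; this is an illustrative sketch of the
    environment-variable interface, not the repo's parsing code.
    """
    return {
        "gpu": os.environ.get("CUDA_VISIBLE_DEVICES", "0"),
        "prune_layer_idx": int(os.environ.get("PRUNE_LAYER_IDX", "2")),
        "prune_ratio": float(os.environ.get("PRUNE_RATIO", "0")),
        "prune_method": os.environ.get("PRUNE_METHOD", "base"),
        "sample_limit": int(os.environ.get("SAMPLE_LIMIT", "0")),  # 0 = no limit
        "results_dir": os.environ.get("RESULTS_DIR", "./Results"),
    }
```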
AudioMarathon supports multiple KV-cache compression strategies:
- base: No pruning (baseline)
- fast_v: FastV attention-based pruning
- random: Random token pruning
- frame: Frame-based structured pruning
- DART: Prunes tokens based on their duplication with other tokens
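To illustrate the idea, here are simplified sketches of the two simplest strategies operating on a plain Python list of audio tokens. The repository's implementations work on model-internal KV caches and attention scores, so these stand-ins only show the selection logic, not the real mechanics:

```python
import random

def random_prune(tokens, ratio, seed=0):
    """`random`-style pruning: drop a random subset, keep original order."""
    keep = len(tokens) - int(len(tokens) * ratio)
    rng = random.Random(seed)
    kept_idx = sorted(rng.sample(range(len(tokens)), keep))
    return [tokens[i] for i in kept_idx]

def frame_prune(tokens, ratio):
    """`frame`-style structured pruning: keep tokens at a regular stride."""
    keep = max(1, len(tokens) - int(len(tokens) * ratio))
    step = len(tokens) / keep
    kept_idx = sorted({int(i * step) for i in range(keep)})
    return [tokens[i] for i in kept_idx]
```

Attention-based methods such as FastV and duplication-based methods such as DART replace these trivial selection rules with scores derived from the model itself.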
Each script tracks detailed timing metrics:
- Prefill Time: Time for initial audio encoding
- Decode Time: Time for generating responses
- Tokens per Second: Generation throughput
- Audio Duration: Input audio length
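These metrics come from wall-clock timestamps taken around generation: prefill ends when the first token arrives, and throughput divides the remaining tokens by the decode time. A minimal sketch, assuming a step-wise generator interface (the function and field names here are illustrative, not the repository's exact schema):

```python
import time

def timed_generate(generate_fn, prompt):
    """Measure prefill vs. decode time for a step-wise generator.

    `generate_fn(prompt)` is assumed to yield one token per step; the
    time to the first yield covers prefill, later yields cover decode.
    """
    t0 = time.perf_counter()
    gen = generate_fn(prompt)
    first = next(gen)                  # prefill + first token
    t1 = time.perf_counter()
    tokens = [first] + list(gen)       # remaining decode steps
    t2 = time.perf_counter()
    decode_time = t2 - t1
    return {
        "prefill_time": t1 - t0,
        "decode_time": decode_time,
        "tokens_per_second": (len(tokens) - 1) / decode_time if decode_time > 0 else 0.0,
        "num_tokens": len(tokens),
    }
```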
Based on average F1-scores across all 10 tasks (SER, SCR, ASR, SED, MC, ASC, SD, ER, SAR, SGR):
| Rank | Model | Avg. Score | SER | SCR | ASR | SED | MC | ASC | SD | ER | SAR | SGR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Qwen2.5-Omni-7B | 70.5 | 26.3 | 85.1 | 98.1 | 78.4 | 100.0 | 72.2 | 72.3 | 53.4 | 21.4 | 98.0 |
| 2 | Qwen2.5-Omni-3B | 67.2 | 25.2 | 82.3 | 94.7 | 70.2 | 97.4 | 69.3 | 67.3 | 39.6 | 29.1 | 97.2 |
| 3 | Audio-Flamingo-3 | 63.0 | 21.7 | 78.9 | 94.3 | 59.5 | 97.0 | 54.1 | 33.7 | 54.3 | 40.7 | 96.2 |
| 4 | Voxtral-Mini-3B-2507 | 57.4 | 24.3 | 71.1 | 96.8 | 71.0 | 83.8 | 27.2 | 68.0 | 29.7 | 30.7 | 71.0 |
| 5 | Gemma-3n-E4B-it | 49.3 | 19.0 | 56.9 | 93.2 | 50.2 | 71.9 | 31.7 | 35.9 | 18.9 | 21.8 | 93.0 |
| 6 | Phi-4-Multimodal | 47.7 | 18.4 | 69.3 | 92.7 | 55.1 | 46.7 | 23.4 | 26.4 | 27.3 | 26.6 | 91.1 |
| 7 | Gemma-3n-E2B-it | 45.5 | 22.5 | 51.6 | 91.3 | 50.2 | 56.8 | 28.2 | 35.1 | 15.2 | 12.2 | 91.6 |
| 8 | Aero-1-Audio | 42.8 | 17.9 | 56.6 | 43.7 | 55.0 | 83.9 | 39.9 | 33.7 | 32.0 | 17.8 | 47.5 |
| 9 | Baichuan-Omni-1.5 | 39.3 | 12.4 | 11.2 | 86.5 | 45.7 | 52.0 | 25.8 | 49.2 | 18.9 | 10.2 | 81.5 |
| 10 | Audio-Flamingo-2 | 35.6 | 26.8 | 39.8 | 1.0 | 27.1 | 66.8 | 29.7 | 45.9 | 13.1 | 20.3 | 85.1 |
| Rank | Model | Avg. Score | SER | SCR | ASR | SED | MC | ASC | SD | ER | SAR | SGR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Gemini-2.5-Flash | 59.6 | 28.1 | 83.6 | 96.8 | 69.2 | 79.3 | 40.8 | 33.1 | 31.9 | 34.3 | 99.3 |
| 2 | Gemini-2.0-Flash | 59.4 | 30.9 | 71.8 | 96.4 | 68.1 | 88.5 | 54.1 | 32.1 | 20.1 | 39.2 | 93.1 |
| 3 | Gemini-2.0-Flash-Lite | 53.1 | 23.7 | 65.6 | 97.4 | 60.9 | 86.9 | 43.4 | 34.5 | 17.3 | 19.0 | 82.1 |
| 4 | Gemini-2.5-Flash-Lite | 50.6 | 30.3 | 64.0 | 96.5 | 68.0 | 64.8 | 36.8 | 33.9 | 14.6 | 19.6 | 77.9 |
| 5 | GPT-4o-Audio (2024-12-17) | 48.7 | 25.7 | 60.2 | 94.7 | 51.2 | 67.6 | 41.9 | 30.8 | 21.8 | 19.9 | 73.1 |
| 6 | GPT-4o-Audio (2024-10-01) | 47.4 | 25.8 | 61.4 | 94.4 | 50.7 | 59.5 | 40.8 | 32.5 | 22.5 | 17.2 | 69.2 |
- SER: Speech Entity Recognition (SLUE)
- SCR: Speech Content Reasoning (RACE)
- ASR: Automatic Speech Recognition (LibriSpeech)
- SED: Sound Event Detection (DESED)
- MC: Music Classification (GTZAN)
- ASC: Audio Scene Classification (TAU)
- SD: Speech Detection/Deepfake Detection (HAD)
- ER: Emotion Recognition (VESUS)
- SAR: Speaker Age Recognition (VoxCeleb)
- SGR: Speaker Gender Recognition (VoxCeleb)
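The Avg. Score column in the tables above is the unweighted mean of the ten per-task scores, which can be checked directly from any row; e.g., for Qwen2.5-Omni-7B:

```python
# Per-task scores for Qwen2.5-Omni-7B, taken from the leaderboard above
scores = {
    "SER": 26.3, "SCR": 85.1, "ASR": 98.1, "SED": 78.4, "MC": 100.0,
    "ASC": 72.2, "SD": 72.3, "ER": 53.4, "SAR": 21.4, "SGR": 98.0,
}
avg = sum(scores.values()) / len(scores)
print(round(avg, 1))  # 70.5, matching the table
```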
| Category | Champion Model | Score | Runner-up | Score |
|---|---|---|---|---|
| Speech Content Extraction | Gemini-2.0-Flash | 30.9 (SER) | Gemini-2.5-Flash-Lite | 30.3 |
| | Qwen2.5-Omni-7B | 85.1 (SCR) | Gemini-2.5-Flash | 83.6 |
| | Qwen2.5-Omni-7B | 98.1 (ASR) | Gemini-2.0-Flash-Lite | 97.4 |
| Audio Classification | Qwen2.5-Omni-7B | 78.4 (SED) | Voxtral-Mini | 71.0 |
| | Qwen2.5-Omni-7B | 100.0 (MC) | Audio-Flamingo-3 | 97.0 |
| | Qwen2.5-Omni-7B | 72.2 (ASC) | Qwen2.5-Omni-3B | 69.3 |
| Speaker Information | Qwen2.5-Omni-7B | 72.3 (SD) | Voxtral-Mini | 68.0 |
| | Audio-Flamingo-3 | 54.3 (ER) | Qwen2.5-Omni-7B | 53.4 |
| | Audio-Flamingo-3 | 40.7 (SAR) | Gemini-2.0-Flash | 39.2 |
| | Gemini-2.5-Flash | 99.3 (SGR) | Qwen2.5-Omni-7B | 98.0 |
Analyze audio lengths in your datasets:

```bash
cd analyse_audio_duration
python GTZAN.py
python HAD.py
python TAU.py
```

Segment long audio files for processing:

```bash
cd Segment
python GTZAN_task.py
python HAD_segment.py
python TAU_task.py
```

Each task expects data in a specific format:
```json
[
  {
    "path": "audio/blues_001.wav",
    "question": "What is the genre of this music?",
    "choice_a": "Blues",
    "choice_b": "Classical",
    "choice_c": "Rock",
    "choice_d": "Jazz",
    "answer_gt": "A"
  }
]
```

```
HAD/
├── real/
│   ├── audio_001.wav
│   └── audio_002.wav
└── fake/
    ├── audio_001.wav
    └── audio_002.wav
```
```json
{
  "tasks": [
    {
      "path": "audio_001.wav",
      "task_type": "detection",
      "question": "What sound events are present?",
      "choices": {
        "A": "Dog barking",
        "B": "Car horn",
        "C": "Phone ringing",
        "D": "Door slamming"
      },
      "answer_gt": "C"
    }
  ]
}
```

If you use AudioMarathon in your research, please cite:
```bibtex
@article{he2025audiomarathon,
  title={AudioMarathon: A Comprehensive Benchmark for Long-Context Audio Understanding and Efficiency in Audio LLMs},
  author={He, Peize and Wen, Zichen and Wang, Yubo and Wang, Yuxuan and Liu, Xiaoqian and Huang, Jiajie and Lei, Zehui and Gu, Zhuangcheng and Jin, Xiangqi and Yang, Jiabing and Li, Kai and Liu, Zhifei and Li, Weijia and Wang, Cunxiang and He, Conghui and Zhang, Linfeng},
  journal={arXiv preprint arXiv:2510.07293},
  year={2025},
  url={https://arxiv.org/abs/2510.07293}
}
```

For questions or issues, please:
- Open an issue on GitHub
- Contact: [email protected]
We thank the following for their contributions:
- Microsoft for Phi-4-Multimodal
- Qwen team for Qwen2.5-Omni
- Mistral AI for Voxtral
- All dataset providers (DESED, GTZAN, HAD, LibriSpeech, RACE, SLUE, TAU, VESUS, VoxCeleb)
This project is released under the Apache 2.0 license.
Note: This benchmark is designed for research purposes. Please ensure you have the proper licenses and permissions for all datasets before use.


