Large audio-language models (LALMs), which extend large language models (LLMs) with auditory capabilities, are expected to demonstrate universal proficiency across diverse auditory tasks. While numerous benchmarks have emerged to assess LALMs' performance, they remain fragmented and lack a structured taxonomy. To bridge this gap, we conduct a comprehensive survey and propose a systematic taxonomy for LALM evaluations, categorizing them into four dimensions based on their objectives: (1) General Auditory Awareness and Processing, (2) Knowledge and Reasoning, (3) Dialogue-oriented Ability, and (4) Fairness, Safety, and Trustworthiness. We provide detailed overviews within each category, highlight challenges in the field, and offer insights into promising future directions. To the best of our knowledge, this is the first survey specifically focused on the evaluation of LALMs, providing clear guidelines for the community.
We will release the collection of surveyed papers and actively maintain it to support ongoing advancements in the field.
- [2025/07/17] Our paper collection is now available on Hugging Face! We will continue to actively maintain and update it. Stay tuned!
- [2025/05/23] Our paper is now available on arXiv!
Auditory Awareness
| Year | Authors | Venue | Paper |
|---|---|---|---|
| 2025 | Maimon et al. | ICASSP 2025 | SALMon: A Suite for Acoustic Language Model Evaluation |
| 2024 | de Seyssel et al. | EMNLP 2024 (Main) | EmphAssess: a Prosodic Benchmark on Assessing Emphasis Transfer in Speech-to-Speech Models |
| 2025 | Deshmukh et al. | ICLR 2025 | ADIFF: Explaining audio difference using natural language |
| 2024 | Bu et al. | Preprint | Roadmap towards Superhuman Speech Understanding using Large Language Models |
| 2025 | Guo et al. | Preprint | DEBATE: A Dataset for Disentangling Textual Ambiguity in Mandarin Through Speech |
| 2025 | Yosha et al. | Preprint | StressTest: Can YOUR Speech LM Handle the Stress? |
| 2025 | Yang et al. | ACL 2025 (Findings) | Who Can Withstand Chat-Audio Attacks? An Evaluation Benchmark for Large Audio-Language Models |
Auditory Processing
Linguistic Knowledge
| Year | Authors | Venue | Paper |
|---|---|---|---|
| 2020 | Nguyen et al. | Workshop@NeurIPS 2020 | The Zero Resource Speech Benchmark 2021: Metrics and baselines for unsupervised spoken language modeling |
| 2024 | Huang et al. | ICASSP 2024 | Zero Resource Code-Switched Speech Benchmark Using Speech Utterance Pairs for Multiple Spoken Languages |
| 2023 | Hassid et al. | NeurIPS 2023 | Textually Pretrained Speech Language Models |
| 2023 | Lavechin et al. | Interspeech 2023 | BabySLM: language-acquisition-friendly benchmark of self-supervised spoken language models |
| 2025 | Fang et al. | Preprint | S2SBench: A Benchmark for Quantifying Intelligence Degradation in Speech-to-Speech Large Language Models |
World Knowledge
Reasoning
Conversation Ability
Instruction Following
| Year | Authors | Venue | Paper |
|---|---|---|---|
| 2024 | Chen et al. | Preprint | VoiceBench: Benchmarking LLM-Based Voice Assistants |
| 2025 | Yan et al. | Preprint | URO-Bench: A Comprehensive Benchmark for End-to-End Spoken Dialogue Models |
| 2025 | Lu et al. | Interspeech 2025 | Speech-IFeval: Evaluating Instruction-Following and Quantifying Catastrophic Forgetting in Speech-Aware Language Models |
| 2025 | Jiang et al. | Preprint | S2S-Arena, Evaluating Speech2Speech Protocols on Instruction Following with Paralinguistic Information |
| 2025 | Pandey et al. | Preprint | SIFT-50M: A Large-Scale Multilingual Dataset for Speech Instruction Fine-Tuning |
| 2025 | Hou et al. | Preprint | SOVA-Bench: Benchmarking the Speech Conversation Ability for LLM-based Voice Assistant |
| 2025 | Ma et al. | ISMIR 2025 | CMI-Bench: A Comprehensive Benchmark for Evaluating Music Instruction Following |
Fairness and Bias
| Year | Authors | Venue | Paper |
|---|---|---|---|
| 2024 | Lin et al. | SLT 2024 | Listen and Speak Fairly: A Study on Semantic Gender Bias in Speech Integrated Large Language Models |
| 2024 | Lin et al. | SLT 2024 | Spoken Stereoset: On Evaluating Social Bias Toward Speaker in Speech Large Language Models |
| 2025 | Li et al. | Preprint | AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models |
Safety
Hallucination
| Year | Authors | Venue | Paper |
|---|---|---|---|
| 2024 | Kuan et al. | Interspeech 2024 | Understanding Sounds, Missing the Questions: The Challenge of Object Hallucination in Large Audio-Language Models |
| 2024 | Leng et al. | Preprint | The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio |
| 2025 | Kuan et al. | ICASSP 2025 | Can Large Audio-Language Models Truly Hear? Tackling Hallucinations with Multi-Task Assessment and Stepwise Audio Reasoning |
| 2025 | Li et al. | Preprint | AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models |
If you know of any interesting papers that aren’t listed yet, we welcome your contributions! Please open an issue using the format below:
| Year | Authors | Venue | Paper |
|---|---|---|---|
| 2025 | Yang et al. | Preprint | Towards Holistic Evaluation of Large Audio-Language Models: A Comprehensive Survey |
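For convenience, here is the raw Markdown for an entry in this format (the survey row above is used purely as a placeholder; substitute the details of the paper you are suggesting):

```markdown
| Year | Authors | Venue | Paper |
|---|---|---|---|
| 2025 | Yang et al. | Preprint | Towards Holistic Evaluation of Large Audio-Language Models: A Comprehensive Survey |
```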
We’ll review your suggestion and update the list as soon as possible. Thank you for helping us keep this resource up to date!
If you find this survey helpful for your research, please consider citing our paper:
@article{yang2025towards,
  title={Towards Holistic Evaluation of Large Audio-Language Models: A Comprehensive Survey},
  author={Yang, Chih-Kai and Ho, Neo S. and Lee, Hung-yi},
  journal={arXiv preprint arXiv:2505.15957},
  year={2025}
}