Towards Holistic Evaluation of Large Audio-Language Models: A Comprehensive Survey


Overview

Abstract

With advancements in large audio-language models (LALMs), which enhance large language models (LLMs) with auditory capabilities, these models are expected to demonstrate universal proficiency across various auditory tasks. While numerous benchmarks have emerged to assess LALMs' performance, they remain fragmented and lack a structured taxonomy. To bridge this gap, we conduct a comprehensive survey and propose a systematic taxonomy for LALM evaluations, categorizing them into four dimensions based on their objectives: (1) General Auditory Awareness and Processing, (2) Knowledge and Reasoning, (3) Dialogue-oriented Ability, and (4) Fairness, Safety, and Trustworthiness. We provide detailed overviews within each category and highlight challenges in this field, offering insights into promising future directions. To the best of our knowledge, this is the first survey specifically focused on the evaluations of LALMs, providing clear guidelines for the community.

We will release the collection of the surveyed papers and actively maintain it to support ongoing advancements in the field.

News

  • [2025/07/17] Our paper collection is now available on Hugging Face! We will continue to actively maintain and update it. Stay tuned!
  • [2025/05/23] Our paper is now available on arXiv.

Taxonomy and Paper List

🔊 General Auditory Awareness and Processing

Auditory Awareness

| Year | Authors | Venue | Paper |
|------|---------|-------|-------|
| 2025 | Maimon et al. | ICASSP 2025 | Salmon: A Suite for Acoustic Language Model Evaluation |
| 2023 | Seyssel et al. | EMNLP 2024 (Main) | EmphAssess: a Prosodic Benchmark on Assessing Emphasis Transfer in Speech-to-Speech Models |
| 2025 | Deshmukh et al. | ICLR 2025 | ADIFF: Explaining audio difference using natural language |
| 2024 | Bu et al. | Preprint | Roadmap towards Superhuman Speech Understanding using Large Language Models |
| 2025 | Guo et al. | Preprint | DEBATE: A Dataset for Disentangling Textual Ambiguity in Mandarin Through Speech |
| 2025 | Yosha et al. | Preprint | StressTest: Can YOUR Speech LM Handle the Stress? |
| 2025 | Yang et al. | ACL 2025 (Findings) | Who Can Withstand Chat-Audio Attacks? An Evaluation Benchmark for Large Audio-Language Models |

Auditory Processing

| Year | Authors | Venue | Paper |
|------|---------|-------|-------|
| 2023 | Huang et al. | ICASSP 2024 | Dynamic-SUPERB: Towards A Dynamic, Collaborative, and Comprehensive Instruction-Tuning Benchmark for Speech |
| 2024 | Huang et al. | ICLR 2025 | Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks |
| 2024 | Yang et al. | ACL 2024 (Main) | AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension |
| 2024 | Wang et al. | NAACL 2025 (Main) | AudioBench: A Universal Benchmark for Audio Large Language Models |
| 2024 | Weck et al. | ISMIR 2024 | MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models |
| 2025 | Cao et al. | Preprint | FinAudio: A Benchmark for Audio Large Language Models in Financial Applications |
| 2024 | Wu et al. | SLT 2024 | Just ASR + LLM? A Study on Speech Large Language Models' Ability to Identify and Understand Speaker in Spoken Dialogue |
| 2024 | Bu et al. | Preprint | Roadmap towards Superhuman Speech Understanding using Large Language Models |
| 2024 | Chen et al. | EMNLP 2024 (Findings) | Beyond Single-Audio: Advancing Multi-Audio Processing in Audio Large Language Models |
| 2025 | Zang et al. | Preprint | Are you really listening? Boosting Perceptual Awareness in Music-QA Benchmarks |
| 2024 | Zhao et al. | Preprint | OpenMU: Your Swiss Army Knife for Music Understanding |
| 2025 | Wang et al. | Preprint | Advancing Singlish Understanding: Bridging the Gap with Datasets and Multimodal Models |
| 2024 | Gong et al. | Preprint | AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information? |
| 2025 | Xue et al. | Preprint | Audio-FLAN: A Preliminary Release |
| 2025 | Wang et al. | Preprint | QualiSpeech: A Speech Quality Assessment Dataset with Natural Language Reasoning and Descriptions |
| 2025 | Pandey et al. | Preprint | SIFT-50M: A Large-Scale Multilingual Dataset for Speech Instruction Fine-Tuning |
| 2023 | Gong et al. | ICLR 2024 | Listen, Think, and Understand |
| 2022 | Lipping et al. | EUSIPCO 2022 | Clotho-AQA: A Crowdsourced Dataset for Audio Question Answering |
| 2025 | Huang et al. | ICASSP 2025 | SpeechCaps: Advancing Instruction-Based Universal Speech Models with Multi-Talker Speaking Style Captioning |
| 2024 | Wei et al. | Preprint | ASR-EC Benchmark: Evaluating Large Language Models on Chinese ASR Error Correction |
| 2024 | Li et al. | SLT 2024 | WHISMA: A Speech-LLM to Perform Zero-shot Spoken Language Understanding |
| 2025 | Robinson et al. | Preprint | NatureLM-audio: an Audio-Language Foundation Model for Bioacoustics |
| 2025 | Ma et al. | ISMIR 2025 | CMI-Bench: A Comprehensive Benchmark for Evaluating Music Instruction Following |
| 2025 | Beyene et al. | Preprint | mSTEB: Massively Multilingual Evaluation of LLMs on Speech and Text Tasks |
| 2025 | Wang et al. | Preprint | MMSU: A Massive Multi-task Spoken Language Understanding and Reasoning Benchmark |
| 2025 | Hou et al. | Preprint | SOVA-Bench: Benchmarking the Speech Conversation Ability for LLM-based Voice Assistant |
| 2025 | Ahia et al. | Preprint | BLAB: Brutally Long Audio Bench |
| 2025 | Wan et al. | ACL 2025 (Main) | SpeechIQ: Speech Intelligence Quotient Across Cognitive Levels in Voice Understanding Large Language Models |
| 2025 | Jiang et al. | Preprint | Advancing the Foundation Model for Music Understanding |

🧠 Knowledge and Reasoning

Linguistic Knowledge

| Year | Authors | Venue | Paper |
|------|---------|-------|-------|
| 2020 | Nguyen et al. | Workshop@NeurIPS 2020 | The Zero Resource Speech Benchmark 2021: Metrics and baselines for unsupervised spoken language modeling |
| 2024 | Huang et al. | ICASSP 2024 | Zero Resource Code-Switched Speech Benchmark Using Speech Utterance Pairs for Multiple Spoken Languages |
| 2023 | Hassid et al. | NeurIPS 2023 | Textually Pretrained Speech Language Models |
| 2023 | Lavechin et al. | Interspeech 2023 | BabySLM: language-acquisition-friendly benchmark of self-supervised spoken language models |
| 2025 | Fang et al. | Preprint | S2SBench: A Benchmark for Quantifying Intelligence Degradation in Speech-to-Speech Large Language Models |

World Knowledge Assessment

| Year | Authors | Venue | Paper |
|------|---------|-------|-------|
| 2025 | Sakshi et al. | ICLR 2025 | MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark |
| 2025 | Penamakuri et al. | ICASSP 2025 | Audiopedia: Audio QA with Knowledge |
| 2024 | Chen et al. | Preprint | VoiceBench: Benchmarking LLM-Based Voice Assistants |
| 2025 | Cui et al. | Preprint | VoxEval: Benchmarking the Knowledge Understanding Capabilities of End-to-End Spoken Language Models |
| 2025 | Yan et al. | Preprint | URO-Bench: A Comprehensive Benchmark for End-to-End Spoken Dialogue Models |
| 2024 | Gao et al. | Preprint | Benchmarking Open-ended Audio Dialogue Understanding for Large Audio-Language Models |
| 2024 | Bu et al. | Preprint | Roadmap towards Superhuman Speech Understanding using Large Language Models |
| 2024 | Weck et al. | ISMIR 2024 | MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models |
| 2025 | Zang et al. | Preprint | Are you really listening? Boosting Perceptual Awareness in Music-QA Benchmarks |
| 2024 | Zhao et al. | Preprint | OpenMU: Your Swiss Army Knife for Music Understanding |
| 2025 | Wang et al. | Preprint | MMSU: A Massive Multi-task Spoken Language Understanding and Reasoning Benchmark |
| 2025 | Hou et al. | Preprint | SOVA-Bench: Benchmarking the Speech Conversation Ability for LLM-based Voice Assistant |
| 2025 | Fang et al. | Preprint | S2SBench: A Benchmark for Quantifying Intelligence Degradation in Speech-to-Speech Large Language Models |
| 2025 | Ma et al. | Preprint | MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix |

Reasoning

| Year | Authors | Venue | Paper |
|------|---------|-------|-------|
| 2024 | Ghosh et al. | ICLR 2024 | CompA: Addressing the Gap in Compositional Reasoning in Audio-Language Models |
| 2025 | Sakshi et al. | ICLR 2025 | MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark |
| 2025 | Cui et al. | Preprint | VoxEval: Benchmarking the Knowledge Understanding Capabilities of End-to-End Spoken Language Models |
| 2025 | Yang et al. | Interspeech 2025 | SAKURA: On the Multi-hop Reasoning of Large Audio-Language Models Based on Speech and Audio Information |
| 2025 | Yan et al. | Preprint | URO-Bench: A Comprehensive Benchmark for End-to-End Spoken Dialogue Models |
| 2025 | Deshmukh et al. | AAAI 2025 | Audio Entailment: Assessing Deductive Reasoning for Audio Understanding |
| 2024 | Gao et al. | Preprint | Benchmarking Open-ended Audio Dialogue Understanding for Large Audio-Language Models |
| 2024 | Zhao et al. | Preprint | OpenMU: Your Swiss Army Knife for Music Understanding |
| 2024 | Gong et al. | Preprint | AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information? |
| 2024 | Ghosh et al. | EMNLP 2024 (Main) | GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities |
| 2023 | Gong et al. | ICLR 2024 | Listen, Think, and Understand |
| 2022 | Lipping et al. | EUSIPCO 2022 | Clotho-AQA: A Crowdsourced Dataset for Audio Question Answering |
| 2024 | Li et al. | SLT 2024 | WHISMA: A Speech-LLM to Perform Zero-shot Spoken Language Understanding |
| 2025 | Huang et al. | ICASSP 2025 | SpeechCaps: Advancing Instruction-Based Universal Speech Models with Multi-Talker Speaking Style Captioning |
| 2025 | Wang et al. | ICASSP 2025 | What Are They Doing? Joint Audio-Speech Co-Reasoning |
| 2025 | Deshmukh et al. | ICLR 2025 | ADIFF: Explaining audio difference using natural language |
| 2025 | Wang et al. | Preprint | MMSU: A Massive Multi-task Spoken Language Understanding and Reasoning Benchmark |
| 2025 | Hou et al. | Preprint | SOVA-Bench: Benchmarking the Speech Conversation Ability for LLM-based Voice Assistant |
| 2025 | Yosha et al. | Preprint | StressTest: Can YOUR Speech LM Handle the Stress? |
| 2025 | Wei et al. | Preprint | Towards Spoken Mathematical Reasoning: Benchmarking Speech-based Models over Multi-faceted Math Problems |
| 2025 | Fang et al. | Preprint | S2SBench: A Benchmark for Quantifying Intelligence Degradation in Speech-to-Speech Large Language Models |
| 2025 | Bhattacharya et al. | Interspeech 2025 | Benchmarking and Confidence Evaluation of LALMs For Temporal Reasoning |
| 2025 | Ma et al. | Preprint | MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix |
| 2025 | Yang et al. | DCASE 2025 Audio QA Challenge | Multi-Domain Audio Question Answering Toward Acoustic Content Reasoning in The DCASE 2025 Challenge |
| 2025 | Ahia et al. | Preprint | BLAB: Brutally Long Audio Bench |
| 2025 | Yang et al. | Preprint | SpeechR: A Benchmark for Speech Reasoning in Large Audio-Language Models |

🗣️ Dialogue-oriented Ability

Conversation Ability

| Year | Authors | Venue | Paper |
|------|---------|-------|-------|
| 2024 | Lin et al. | ACL 2024 (Main) | Advancing Large Language Models to Capture Varied Speaking Styles and Respond Properly in Spoken Conversations |
| 2024 | Ao et al. | NeurIPS 2024 | SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words |
| 2025 | Cheng et al. | ICLR 2025 | VoxDialogue: Can Spoken Dialogue Systems Understand Information Beyond Words? |
| 2025 | Arora et al. | ICLR 2025 | Talking Turns: Benchmarking Audio Foundation Models on Turn-Taking Dynamics |
| 2025 | Lin et al. | Preprint | Full-Duplex-Bench: A Benchmark to Evaluate Full-duplex Spoken Dialogue Models on Turn-taking Capabilities |
| 2025 | Li et al. | Preprint | Mind the Gap! Static and Interactive Evaluations of Large Audio Models |
| 2025 | Kim et al. | Preprint | Does Your Voice Assistant Remember? Analyzing Conversational Context Recall and Utilization in Voice Interaction Models |
| 2024 | Gao et al. | Preprint | Benchmarking Open-ended Audio Dialogue Understanding for Large Audio-Language Models |
| 2025 | Yan et al. | Preprint | URO-Bench: A Comprehensive Benchmark for End-to-End Spoken Dialogue Models |
| 2025 | Yang et al. | ACL 2025 (Findings) | Who Can Withstand Chat-Audio Attacks? An Evaluation Benchmark for Large Audio-Language Models |
| 2025 | Jiang et al. | Preprint | SpeechRole: A Large-Scale Dataset and Benchmark for Evaluating Speech Role-Playing Agents |

Instruction Following

| Year | Authors | Venue | Paper |
|------|---------|-------|-------|
| 2024 | Chen et al. | Preprint | VoiceBench: Benchmarking LLM-Based Voice Assistants |
| 2025 | Yan et al. | Preprint | URO-Bench: A Comprehensive Benchmark for End-to-End Spoken Dialogue Models |
| 2025 | Lu et al. | Interspeech 2025 | Speech-IFeval: Evaluating Instruction-Following and Quantifying Catastrophic Forgetting in Speech-Aware Language Models |
| 2025 | Jiang et al. | Preprint | S2S-Arena, Evaluating Speech2Speech Protocols on Instruction Following with Paralinguistic Information |
| 2025 | Pandey et al. | Preprint | SIFT-50M: A Large-Scale Multilingual Dataset for Speech Instruction Fine-Tuning |
| 2025 | Hou et al. | Preprint | SOVA-Bench: Benchmarking the Speech Conversation Ability for LLM-based Voice Assistant |
| 2025 | Ma et al. | ISMIR 2025 | CMI-Bench: A Comprehensive Benchmark for Evaluating Music Instruction Following |

🛡️ Fairness, Safety, and Trustworthiness

Fairness and Bias

| Year | Authors | Venue | Paper |
|------|---------|-------|-------|
| 2024 | Lin et al. | SLT 2024 | Listen and Speak Fairly: A Study on Semantic Gender Bias in Speech Integrated Large Language Models |
| 2024 | Lin et al. | SLT 2024 | Spoken Stereoset: On Evaluating Social Bias Toward Speaker in Speech Large Language Models |
| 2025 | Li et al. | Preprint | AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models |

Safety

| Year | Authors | Venue | Paper |
|------|---------|-------|-------|
| 2024 | Chen et al. | Preprint | VoiceBench: Benchmarking LLM-Based Voice Assistants |
| 2025 | Yang et al. | NAACL 2025 (Main) | Audio Is the Achilles' Heel: Red Teaming Audio Large Multimodal Models |
| 2025 | Roh et al. | Preprint | Multilingual and Multi-Accent Jailbreaking of Audio LLMs |
| 2025 | Kang et al. | ICLR 2025 | AdvWave: Stealthy Adversarial Jailbreak Attack against Large Audio-Language Models |
| 2025 | Xiao et al. | Preprint | Tune In, Act Up: Exploring the Impact of Audio Modality-Specific Edits on Large Audio Language Models in Jailbreak |
| 2025 | Gupta et al. | Preprint | "I am bad": Interpreting Stealthy, Universal and Robust Audio Jailbreaks in Audio-Language Models |
| 2024 | Hughes et al. | Preprint | Best-of-N Jailbreaking |
| 2025 | Yan et al. | Preprint | URO-Bench: A Comprehensive Benchmark for End-to-End Spoken Dialogue Models |
| 2025 | Li et al. | Preprint | AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models |
| 2025 | Song et al. | Preprint | Audio Jailbreak: An Open Comprehensive Benchmark for Jailbreaking Large Audio-Language Models |
| 2025 | Peng et al. | Preprint | JALMBench: Benchmarking Jailbreak Vulnerabilities in Audio Language Models |
| 2025 | Lin et al. | Preprint | Hidden in the Noise: Unveiling Backdoors in Audio LLMs Alignment through Latent Acoustic Pattern Triggers |
| 2025 | Yang et al. | ACL 2025 (Findings) | Who Can Withstand Chat-Audio Attacks? An Evaluation Benchmark for Large Audio-Language Models |

Hallucination

| Year | Authors | Venue | Paper |
|------|---------|-------|-------|
| 2024 | Kuan et al. | Interspeech 2024 | Understanding Sounds, Missing the Questions: The Challenge of Object Hallucination in Large Audio-Language Models |
| 2024 | Leng et al. | Preprint | The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio |
| 2025 | Kuan et al. | ICASSP 2025 | Can Large Audio-Language Models Truly Hear? Tackling Hallucinations with Multi-Task Assessment and Stepwise Audio Reasoning |
| 2025 | Li et al. | Preprint | AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models |

How to Contribute

If you know of any interesting papers that aren’t listed yet, we welcome your contributions! Please open an issue using the format below:

| Year | Authors | Venue | Paper |
|------|---------|-------|-------|
| 2025 | Yang et al. | Preprint | Towards Holistic Evaluation of Large Audio-Language Models: A Comprehensive Survey |

We’ll review your suggestion and update the list as soon as possible. Thank you for helping us keep this resource up to date!

Citations

If you find this survey helpful for your research, please consider citing our paper.

@article{yang2025towards,
  title={Towards holistic evaluation of large audio-language models: A comprehensive survey},
  author={Yang, Chih-Kai and Ho, Neo S and Lee, Hung-yi},
  journal={arXiv preprint arXiv:2505.15957},
  year={2025}
}
