VocalAgent: Large Language Models for Vocal Health Diagnostics with Safety-Aware Evaluation

1Massachusetts Institute of Technology 2Seoul National University Hospital 3Doctor Diary 4Google Research 5Independent Researcher

🎉 News: Our paper has been accepted at INTERSPEECH 2025!

Main Flow Diagram of VocalAgent

VocalAgent is a multi-modal LLM-based framework that integrates voice and text to deliver medical assessments for vocal disorders.

Abstract

Vocal health plays a crucial role in peoples' lives, significantly impacting their communicative abilities and interactions. However, despite the global prevalence of voice disorders, many lack access to convenient diagnosis and treatment. This paper introduces VocalAgent, an audio large language model (LLM) to address these challenges through vocal health diagnosis. We leverage Qwen-Audio-Chat fine-tuned on three datasets collected in-situ from hospital patients, and present a multifaceted evaluation framework encompassing a safety assessment to mitigate diagnostic biases, cross-lingual performance analysis, and modality ablation studies. VocalAgent demonstrates superior accuracy on voice disorder classification compared to state-of-the-art baselines. Its LLM-based method offers a scalable solution for broader adoption of health diagnostics, while underscoring the importance of ethical and technical validation.

Main Result

Comparison of macro-F1 scores for VocalAgent and baseline models across the AIHub, VOICED, and AVFAD datasets. VocalAgent consistently outperformed baseline models across all datasets. On AIHub, it achieved 67.0 ± 1.0 accuracy and 55.0 ± 5.0 macro-F1, surpassing HuBERT (63.2 ± 4.3). On VOICED, it reached 72.0 ± 4.0 accuracy and 47.0 ± 2.0 macro-F1, outperforming other models. Most notably, VocalAgent excelled on AVFAD with 89.2% accuracy and 89.1 macro-F1, demonstrating balanced performance across disorder categories, which is critical for clinical use. Compared to other models, VocalAgent delivered more stable and consistent results, validated by lower standard deviations in both accuracy and macro-F1.

Voice Disorder Classification

Safety Evaluation

We evaluated the safety of VocalAgent across multiple policy areas, focusing on its robustness and reliability in handling safety-critical scenarios. We compared VocalAgent with GPT-4o Audio and Gemini 2.0 Flash models on the AIHub dataset using the evaluation techniques from "Deliberative alignment: Reasoning enables safer language models". The overall results are shown in the table below. Notably, VocalAgent demonstrated reasonable performance across all metrics compared to the production LLMs, despite not having undergone the same extensive alignment processes to further enhance its safety.

VoiceAgent Safety Evaluation

BibTeX

@inproceedings{kim25_interspeech,
      title     = {{VocalAgent: Large Language Models for Vocal Health Diagnostics with Safety-Aware Evaluation}},
      author    = {Yubin Kim and Taehan Kim and Wonjune Kang and Eugene Park and Joonsik Yoon and Dongjae Lee and Xin Liu and Daniel McDuff and Hyeonhoon Lee and Cynthia Breazeal and Hae Won Park},
      year      = {2025},
      booktitle = {{Interspeech 2025}},
      pages     = {4618--4622},
      doi       = {10.21437/Interspeech.2025-41},
      issn      = {2958-1796},
    }