Please download the following datasets first.
Speech quality assessment datasets:
BVCC: How do Voices from Past Speech Synthesis Challenges Compare Today
NISQA: A Deep CNN-Self-Attention Model for Multidimensional Speech Quality Prediction with Crowdsourced Datasets
SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech Synthesis
(Note: you need to resample all audio to 16 kHz, since SALMONN expects 16 kHz input.)
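As a minimal sketch of the resampling step, the snippet below converts a PCM WAV file to 16 kHz using SciPy's polyphase resampler. It is not the repository's own preprocessing script: the helper names `resample_audio` and `resample_wav` are hypothetical, and it assumes the audio is stored as PCM WAV that `scipy.io.wavfile` can read. Other formats would need a different loader.

```python
from math import gcd

import numpy as np
from scipy.io import wavfile
from scipy.signal import resample_poly

TARGET_SR = 16000  # SALMONN's input sampling rate, as stated in the note above


def resample_audio(samples: np.ndarray, orig_sr: int, target_sr: int = TARGET_SR) -> np.ndarray:
    """Resample a (num_samples,) or (num_samples, channels) array along the time axis."""
    if orig_sr == target_sr:
        return samples
    # Reduce the rate ratio to the smallest integer up/down factors.
    g = gcd(orig_sr, target_sr)
    up, down = target_sr // g, orig_sr // g
    # resample_poly applies an anti-aliasing FIR filter while changing the rate.
    out = resample_poly(samples.astype(np.float64), up, down, axis=0)
    if np.issubdtype(samples.dtype, np.integer):
        # Round and clip so integer PCM does not wrap around on filter overshoot.
        info = np.iinfo(samples.dtype)
        return np.clip(np.rint(out), info.min, info.max).astype(samples.dtype)
    return out.astype(samples.dtype)


def resample_wav(src_path: str, dst_path: str, target_sr: int = TARGET_SR) -> None:
    """Read a PCM WAV file, resample it, and write the result to dst_path."""
    orig_sr, samples = wavfile.read(src_path)
    wavfile.write(dst_path, target_sr, resample_audio(samples, orig_sr, target_sr))
```

Calling `resample_wav("input.wav", "input_16k.wav")` (placeholder file names) writes a 16 kHz copy and leaves the original file untouched. Libraries such as torchaudio or librosa offer comparable resampling if they are already part of your environment.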
Speaker similarity dataset:
VoxSim: A perceptual voice similarity dataset
Then run the data processing scripts to generate annotations. You can also refer to our annotations.
Our finetuned SALMONN-7B checkpoint can also be downloaded here.
Just follow the salmonn branch.
Please refer to the salmonn branch for more details.
If you find this work useful, please cite our papers.
@inproceedings{wang2024enabling,
title={Enabling Auditory Large Language Models for Automatic Speech Quality Evaluation},
author={Wang, Siyin and Yu, Wenyi and Yang, Yudong and Tang, Changli and Li, Yixuan and Zhuang, Jimin and Chen, Xianzhao and Tian, Xiaohai and Zhang, Jun and Sun, Guangzhi and others},
booktitle={Proc. ICASSP},
address={Hyderabad},
year={2025}
}
@inproceedings{wang2025qualispeech,
title={QualiSpeech: A Speech Quality Assessment Dataset with Natural Language Reasoning and Descriptions},
author={Wang, Siyin and Yu, Wenyi and Chen, Xianzhao and Tian, Xiaohai and Zhang, Jun and Sun, Guangzhi and others},
booktitle={Proc. ACL},
address={Vienna},
year={2025}
}