A 10000+ Hours Multi-domain Mandarin Corpus for Speech Recognition
About
We release a 10,000+ hour multi-domain transcribed Mandarin speech corpus collected from YouTube and podcasts.
Optical character recognition (OCR) and automatic speech recognition (ASR) techniques are used to label the YouTube and podcast recordings, respectively.
To improve the quality of the corpus, we use a novel end-to-end label error detection method to further validate and filter the data.
10,000+
hours of high-label data
confidence >= 0.95, for supervised training.
2,400+
hours of weak-label data
0.6 < confidence < 0.95, for semi-supervised or noisy training, etc.
22,400+
hours of audio in total
consisting of both labeled and unlabeled data, for unsupervised training or pretraining, etc.
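The three subsets above are defined purely by the per-utterance label confidence produced during validation. A minimal sketch of that partition, assuming each utterance is a dict with an optional "confidence" field (an illustrative schema, not the official WenetSpeech manifest format):

```python
# Partition utterances into the three WenetSpeech-style pools by label
# confidence. The "confidence" key is an assumed field name for this sketch.

def partition(utts, hi=0.95, lo=0.6):
    """Split utterances into strong-label, weak-label, and unlabeled pools."""
    strong, weak, unlabeled = [], [], []
    for u in utts:
        c = u.get("confidence")
        if c is None:
            unlabeled.append(u)   # no transcript: pretraining / unsupervised pool
        elif c >= hi:
            strong.append(u)      # confidence >= 0.95: supervised training
        elif c > lo:
            weak.append(u)        # 0.6 < confidence < 0.95: semi-supervised
        else:
            unlabeled.append(u)   # low confidence: treat as unlabeled
    return strong, weak, unlabeled

utts = [{"confidence": 0.99}, {"confidence": 0.8}, {"confidence": 0.3}, {}]
strong, weak, unlabeled = partition(utts)
```

With the sample input above, the four utterances land in the strong, weak, and unlabeled pools (the last two both fall into the unlabeled pool).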
Diversity
WenetSpeech can be classified into 10 categories according to speaking styles and spoken scenarios.
License
The WenetSpeech dataset is available for download for non-commercial purposes under a Creative Commons Attribution 4.0 International License.
WenetSpeech does not own the copyright of the audio; the copyright remains with the original owners of each video or audio recording, and a public URL is provided for each original video or audio source.
DOWNLOAD
Please fill out the Google Form here, check your mailbox, and follow the instructions to download the WenetSpeech dataset. If you do not receive the email, please write to [email protected].
Schedule
Oct 08, 2021: Release paper
Oct 25, 2021: Release data
Nov 11, 2021: Release various ASR models trained on WenetSpeech
WenetSpeech 2.0
We are preparing WenetSpeech 2.0, which will contain more data as well as richer data.
If you would like to cooperate and contribute, please contact the authors via the WeChat account or email address below.
WenetSpeech draws heavily on the work of GigaSpeech. The authors would like to thank Jiayu Du and Guoguo Chen for their suggestions on this work.
The authors would also like to thank their colleagues, Lianhui Zhang and Yu Mao, for collecting some of the YouTube data.