A 10000+ Hours Multi-domain Mandarin Corpus for Speech Recognition
About
We release a 10,000+ hour multi-domain transcribed Mandarin speech corpus collected from YouTube and podcasts.
Optical character recognition (OCR) and automatic speech recognition (ASR) techniques are used to label the YouTube and podcast recordings, respectively.
To improve the quality of the corpus, we use a novel end-to-end label error detection method to further validate and filter the data.
10,000+
hours of high-label data
confidence >= 0.95, for supervised training.
2,400+
hours of weak-label data
0.6 < confidence < 0.95, for semi-supervised or noisy training, etc.
22,400+
hours of audio in total
consisting of both labeled and unlabeled data, for unsupervised training or pretraining, etc.
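The three subsets above are defined purely by the per-utterance label confidence produced during validation. A minimal sketch of that partition, assuming each utterance is a dict with an optional "confidence" field (an illustrative schema, not the official WenetSpeech manifest format):

```python
# Partition utterances into the three WenetSpeech-style pools by label
# confidence. The "confidence" key is an assumed field name for this sketch.

def partition(utts, hi=0.95, lo=0.6):
    """Split utterances into strong-label, weak-label, and unlabeled pools."""
    strong, weak, unlabeled = [], [], []
    for u in utts:
        c = u.get("confidence")
        if c is None:
            unlabeled.append(u)   # no transcript: pretraining / unsupervised pool
        elif c >= hi:
            strong.append(u)      # confidence >= 0.95: supervised training
        elif c > lo:
            weak.append(u)        # 0.6 < confidence < 0.95: semi-supervised
        else:
            unlabeled.append(u)   # low confidence: treat as unlabeled
    return strong, weak, unlabeled

utts = [{"confidence": 0.99}, {"confidence": 0.8}, {"confidence": 0.3}, {}]
strong, weak, unlabeled = partition(utts)
```

With the sample input above, the four utterances land in the strong, weak, and unlabeled pools (the last two both fall into the unlabeled pool).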
Diversity
WenetSpeech can be classified into 10 categories according to speaking styles and spoken scenarios.
License
The WenetSpeech dataset is available for download for non-commercial purposes under a Creative Commons Attribution 4.0 International License.
WenetSpeech does not own the copyright of the audio; the copyright remains with the original owners of each video or audio recording, and a public URL is provided for each original video or audio source.
DOWNLOAD
Please fill out the Google Form here, check your mailbox, and follow the instructions to download the WenetSpeech dataset. If you do not receive the email, please write to [email protected].
Schedule
Oct 08, 2021: Release paper
Oct 25, 2021: Release data
Nov 11, 2021: Release various ASR models trained on WenetSpeech
WenetSpeech 2.0
We are preparing WenetSpeech 2.0, which will contain more data as well as richer data.
If you would like to cooperate and contribute, please contact the authors via the WeChat account or email address below.
WenetSpeech draws heavily on the work of GigaSpeech. The authors would like to thank Jiayu Du and Guoguo Chen for their suggestions on this work.
The authors would also like to thank their colleagues, Lianhui Zhang and Yu Mao, for collecting some of the YouTube data.