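# Mission

1. As Chairman Mao said, without investigation there is no right to speak. We interviewed many developers and reached one conclusion: for companies with a closed data loop and a need to iterate, fine-tuning is a hard requirement. Our mission is to let **everyone train and deploy large speech models quickly and easily**. The emphasis is on the word **large**: training and deploying large models requires essentially a different set of solutions from small models. Given that fine-tuning is a hard requirement, and given where the large-model era is heading, how to conveniently **train/modify** (fine-tune) and **deploy** large speech models will be the pain point of the coming period. Solving this pain point is the task of next gen wenet, and the overall goal of wenet 3.0.
2. Data, models, and code will all be open source. Contributions are welcome: those with data can contribute data, those with opinions can contribute opinions, those with machines can contribute machines. Let's build this together.

# Goals

1. An open-source large speech model for Chinese: the best-performing open-source Chinese model, with streaming recognition capability. (Whether to cover other languages, and how far to go, is left open; we will decide as the work progresses.)
2. wepipe: build up a complete pipeline from data crawling (wecrawler) -> data processing (labels produced with wesubtitle, low-quality data filtered by the future wedata) -> incremental training of large models.

# Action

## Data

- [x] Summary of open-source data
  - https://github.com/wenet-e2e/wenet/issues/2094
  - https://github.com/coqui-ai/open-speech-corpora
  - https://openslr.org/
  - https://www.zhihu.com/question/401383501/answer/2641388711
  - https://yqli.tech/page/data.html
  - https://github.com/LAION-AI/audio-dataset/blob/main/data_collection/README.md
- [ ] Processing recipe for the summarized open-source data
- [x] Add paraformer support to wenet (currently the best Chinese model; can be used to generate labels) https://github.com/wenet-e2e/wenet/pull/2067
- [x] Add whisper support to wenet (currently the best English model; can be used to generate labels) https://github.com/wenet-e2e/wenet/pull/2141 , https://github.com/wenet-e2e/wenet/pull/2196 , https://github.com/wenet-e2e/wenet/pull/2238
- [ ] wesubtitle: Chinese and English support
- [ ] wedata: language identification, data filtering, and other data-processing operations; quality is more important than quantity
- [ ] wecrawler: crawler project, encouraging data co-creation and data sharing

## Training

- [x] deepspeed training refactor: ease of use, multi-node multi-GPU support https://github.com/wenet-e2e/wenet/pull/2055
- [x] IO refactor #2152, https://github.com/wenet-e2e/wenet/pull/2316
- [x] Activation checkpointing support, which allows larger batch sizes for large-model training and improves training efficiency https://github.com/wenet-e2e/wenet/pull/2173
- [x] Flash attention support, improving training efficiency https://github.com/wenet-e2e/wenet/pull/2191 , https://github.com/wenet-e2e/wenet/pull/2351
- [x] Support whisper-style decoder input construction to enable multi-task and multilingual training (the encoder keeps u2pp dynamic-chunk training to support streaming recognition; joint ctc/attention/timestamp/punc/itn training) https://github.com/wenet-e2e/wenet/pull/2141 , https://github.com/wenet-e2e/wenet/pull/2196 , https://github.com/wenet-e2e/wenet/pull/2342
- [x] Native torch fsdp support (overlaps in functionality with deepspeed) https://github.com/wenet-e2e/wenet/pull/2412

## Deployment

- [ ] int4 quantization, to reduce bandwidth requirements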
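
To illustrate why int4 quantization lowers bandwidth requirements, here is a minimal pure-Python sketch of symmetric per-tensor int4 weight quantization with two values packed per byte. It is not wenet code (the item above is still open), and the function names `quantize_int4` and `dequantize_int4` are hypothetical; a real deployment would more likely use per-channel or per-group scales and an optimized kernel.

```python
from typing import List, Tuple


def quantize_int4(weights: List[float]) -> Tuple[bytes, float]:
    """Symmetric per-tensor int4 quantization (hypothetical helper).

    Returns the packed bytes (two 4-bit values per byte) and the scale.
    """
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    # Round to the nearest integer and clamp to the symmetric range [-7, 7].
    q = [max(-7, min(7, round(w / scale))) for w in weights]
    if len(q) % 2:
        q.append(0)  # pad so that values pair up evenly
    packed = bytearray()
    for lo, hi in zip(q[0::2], q[1::2]):
        # Store each value as a 4-bit two's-complement nibble.
        packed.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(packed), scale


def dequantize_int4(packed: bytes, scale: float, count: int) -> List[float]:
    """Unpack the nibbles and rescale them back to floats."""
    values: List[float] = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            # Sign-extend the 4-bit nibble back to a Python int.
            signed = nibble - 16 if nibble >= 8 else nibble
            values.append(signed * scale)
    return values[:count]


if __name__ == "__main__":
    weights = [0.7, -0.35, 0.1, 0.0, -0.7]
    packed, scale = quantize_int4(weights)
    restored = dequantize_int4(packed, scale, len(weights))
    print(len(packed), "bytes for", len(weights), "weights")
    print(restored)
```

Each weight occupies half a byte instead of 4 bytes (fp32) or 2 bytes (fp16), so the weight traffic per forward pass drops by roughly 8x or 4x respectively, at the cost of a rounding error of at most half the scale per weight.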