Chinese Open-Source Large Speech Model Plan #2097

# Mission

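1. As Chairman Mao said, "No investigation, no right to speak." We interviewed many developers and concluded that, for companies with a closed data loop and iteration needs, fine-tuning is a hard requirement. Our mission is to let **everyone train and deploy large speech models quickly and easily**. The emphasis is on **large**: training and deploying a large model calls for an essentially different approach from the one used for small models. Given that fine-tuning is a hard requirement, and given where the large-model era is heading, how to conveniently **train/modify** (fine-tune) and **deploy** large speech models will be the pain point of the coming era. Solving that pain point is the task of next-gen wenet, i.e. the overall goal of wenet 3.0.
2. Data, models, and code will all be open-sourced. Contributions are welcome: contribute data if you have data, feedback if you have opinions, and machines if you have machines. Let's build this together.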

# Goals

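1. A Chinese open-source large speech model: the best-performing open-source Chinese model, with streaming recognition capability. (Should other languages be covered, and to what extent? This is open-ended; we will analyze it as we go.)
2. wepipe: build up a complete pipeline from [data crawling] (wecrawler) -> [data processing] (using wesubtitle to produce labels, and the future wedata to filter out low-quality data) -> [incremental training of the large model].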

# Action

## Data
- [x] Collect open-source datasets
    - https://github.com/wenet-e2e/wenet/issues/2094
    - https://github.com/coqui-ai/open-speech-corpora
    - https://openslr.org/
    - https://www.zhihu.com/question/401383501/answer/2641388711
    - https://yqli.tech/page/data.html
    - https://github.com/LAION-AI/audio-dataset/blob/main/data_collection/README.md
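- [ ] Processing recipe for the collected open-source data
- [x] Add paraformer support to wenet (currently the best Chinese model; can be used for labeling) https://github.com/wenet-e2e/wenet/pull/2067
- [x] Add whisper support to wenet (currently the best English model; can be used for labeling) https://github.com/wenet-e2e/wenet/pull/2141, https://github.com/wenet-e2e/wenet/pull/2196, https://github.com/wenet-e2e/wenet/pull/2238
- [ ] Chinese and English support in wesubtitle
- [ ] wedata: language identification, data filtering, and other data-processing operations; quality is more important than quantity
- [ ] wecrawler: crawler plan; encourage jointly created and shared data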
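To make the "quality is more important than quantity" idea concrete, here is a minimal sketch of one possible filtering rule: keep an utterance only if its duration is reasonable and its subtitle-derived label roughly agrees with the output of a strong ASR model. This is an assumption about how such a filter could work, not the actual wedata design; the `Utterance` schema, the `keep` function, and the thresholds are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    """One candidate training sample (hypothetical schema, for illustration only)."""
    key: str
    duration_s: float
    subtitle_text: str  # label obtained from subtitles
    asr_text: str       # label produced by a strong ASR model as a second opinion


def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance using a rolling one-row table."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate of hyp against ref."""
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


def keep(utt: Utterance, min_s: float = 0.5, max_s: float = 30.0,
         max_cer: float = 0.1) -> bool:
    """Assumed rule: reject odd durations and samples whose two labels disagree."""
    if not (min_s <= utt.duration_s <= max_s):
        return False
    return cer(utt.subtitle_text, utt.asr_text) <= max_cer


if __name__ == "__main__":
    samples = [
        Utterance("a", 3.2, "今天天气很好", "今天天气很好"),
        Utterance("b", 2.0, "今天天气很好", "明天有雨"),
        Utterance("c", 45.0, "你好", "你好"),
    ]
    print([u.key for u in samples if keep(u)])
```

Cross-checking two independent label sources is only one heuristic; a language-identification step, as mentioned in the wedata item, would be a separate filter applied before or after this one.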

## Training
- [x] Refactor deepspeed training for ease of use and multi-node, multi-GPU support https://github.com/wenet-e2e/wenet/pull/2055
- [x] IO refactoring #2152, https://github.com/wenet-e2e/wenet/pull/2316
- [x] Activation checkpointing support, to allow a larger batch size when training large models and improve training efficiency https://github.com/wenet-e2e/wenet/pull/2173
- [x] Flash attention support, to improve training efficiency https://github.com/wenet-e2e/wenet/pull/2191, https://github.com/wenet-e2e/wenet/pull/2351
- [x] Support whisper-style decoder input construction to enable multi-task and multilingual features (the encoder keeps u2pp dynamic-chunk training to support streaming recognition; joint ctc/attention/timestamp/punc/itn training) https://github.com/wenet-e2e/wenet/pull/2141, https://github.com/wenet-e2e/wenet/pull/2196, https://github.com/wenet-e2e/wenet/pull/2342
- [x] Native torch fsdp support (overlaps in functionality with deepspeed) https://github.com/wenet-e2e/wenet/pull/2412
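As an illustration of the whisper-style decoder input item, the sketch below builds a teacher-forcing input/target pair with a task prefix in front of the text tokens. The special-token strings follow Whisper's published convention; the function name `build_decoder_io` and the string-level token representation are hypothetical, and the actual implementation in the linked PRs may differ.

```python
from typing import List, Sequence, Tuple

SOT = "<|startoftranscript|>"
EOT = "<|endoftext|>"
NO_TIMESTAMPS = "<|notimestamps|>"


def build_decoder_io(text_tokens: Sequence[str],
                     language: str = "zh",
                     task: str = "transcribe",
                     with_timestamps: bool = False) -> Tuple[List[str], List[str]]:
    """Return (decoder_input, decoder_target) for teacher forcing.

    The prefix encodes language and task, so a single decoder can serve
    several tasks and languages; the target is the input shifted by one.
    """
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unsupported task: {task}")
    prefix = [SOT, f"<|{language}|>", f"<|{task}|>"]
    if not with_timestamps:
        prefix.append(NO_TIMESTAMPS)
    full = prefix + list(text_tokens) + [EOT]
    return full[:-1], full[1:]


if __name__ == "__main__":
    dec_in, dec_out = build_decoder_io(["你", "好"])
    print(dec_in)
    print(dec_out)
```

In a real setup the tokens would be integer ids from a tokenizer, and whether the loss is also computed on the prefix positions is a training choice that this sketch does not address.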

## Deployment
- [ ] int4 quantization, to reduce bandwidth requirements
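To show why int4 lowers bandwidth, here is a minimal sketch of symmetric per-tensor int4 quantization that packs two 4-bit values into each byte, so the weights occupy half the bytes of int8 and a quarter of fp16. The scheme (per-tensor scale, range [-8, 7]) and the function names are assumptions chosen for illustration; the issue does not specify which quantization method is planned.

```python
from typing import List, Tuple


def quantize_int4(weights: List[float]) -> Tuple[bytes, float]:
    """Symmetric per-tensor quantization to [-8, 7], two values per byte."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    if len(q) % 2:
        q.append(0)  # pad to an even count so values pair up into bytes
    packed = bytearray()
    for lo, hi in zip(q[0::2], q[1::2]):
        # Keep the low 4 bits of each value (two's complement nibble).
        packed.append((lo & 0x0F) | ((hi & 0x0F) << 4))
    return bytes(packed), scale


def dequantize_int4(packed: bytes, scale: float, n: int) -> List[float]:
    """Unpack n values and map them back to floats."""
    out: List[float] = []
    for byte in packed:
        for nibble in (byte & 0x0F, byte >> 4):
            signed = nibble - 16 if nibble >= 8 else nibble
            out.append(signed * scale)
    return out[:n]


if __name__ == "__main__":
    w = [0.7, -0.35, 0.1, 0.0, -0.7]
    packed, scale = quantize_int4(w)
    print(len(packed), "bytes for", len(w), "weights")
    print(dequantize_int4(packed, scale, len(w)))
```

Practical schemes usually use per-group or per-channel scales to limit the accuracy loss; the per-tensor scale here keeps the example short.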
