| Name | Release Date | Paper/Blog | Dataset | Tokens (T) | License |
|---|---|---|---|---|---|
| Anthropic HH | Anthropic HH | ||||
| HC3 | How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection | HC3 数据集 | |||
| koala-test-set | koala-test-set | ||||
| MTP(massive text pairs) | 2023/09 | 智源发布超3亿对面向中英文语义向量模型训练数据集 | BAAI-MTP | 1.3 | |
| OpenAI WebGPT | OpenAI WebGPT | ||||
| OpenAI Summarization | OpenAI Summarization | ||||
| RedPajama | 2023/04 | RedPajama, a project to create leading open-source models, starts by reproducing LLaMA training dataset of over 1.2 trillion tokens | RedPajama-Data | ||
| ShareGPT | ShareGPT | ||||
| starcoderdata | 2023/05 | StarCoder: A State-of-the-Art LLM for Code | starcoderdata | 0.25 | Apache 2.0 |
| Stanford Alpaca | Stanford Alpaca | Alpaca Dataset |
| Name | Release Date | Paper/Blog | Dataset | Tokens (T) | License |
|---|---|---|---|---|---|
| Baize | |||||
| Dolly | |||||
| databricks-dolly-15k | 2023/04 | Free Dolly: Introducing the World's First Truly Open Instruction-Tuned LLM | databricks-dolly-15k | 15 | CC BY-SA-3.0 |
| Evol-Instruct | |||||
| Flan 2021 | |||||
| LIMA | |||||
| MPT-7B-Instruct | 2023/05 | Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs | dolly_hhrlhf | 59 | CC BY-SA-3.0 |
| MetaMathQA | 2023/09 | MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models,MetaMathQA blog | MetaMathQA | --- | --- |
| Natural Instructions | |||||
| OIG (Open Instruction Generalist) | 2023/03 | THE OIG DATASET | OIG | 44,000 | Apache 2.0 |
| OpenAssistant Conversations | |||||
| P3 (Public Pool of Prompts) | |||||
| Self-Instruct | |||||
| Super-Natural Instructions | |||||
| Unnatural Instructions | |||||
| UltraFeedback:大规模、多样化、细粒度的偏好数据集 | |||||
| UltraFeedback Code | |||||
| UltraChat:高质量对话数据集,包含 150 余万条多轮指令数据 | UltraChat Code | ||||
| WildChat | 2024/05 | WILDCHAT: 1M CHATGPT INTERACTION LOGS IN THE WILD | allenai/WildChat-1M | 1M | AI2 ImpACT |
| xP3 |
| Name | Release Date | Paper/Blog | Dataset | Tokens (T) | License |
|---|---|---|---|---|---|
| OpenAssistant Conversations Dataset | 2023/04 | OpenAssistant Conversations - Democratizing Large Language Model Alignment | oasst1 | 161 | Apache 2.0 |
| Name | Paper/Blog | Dataset | Samples (K) | License |
|---|---|---|---|---|
| C-Eval | C-Eval | |||
| Gaokao | Gaokao | |||
| AGIEval | AGIEval | |||
| MMLU | MMLU | |||
| LawBench | LawBench: Benchmarking Legal Knowledge of Large Language Models | LawBench Code |
Some examples of DataSets as follows:
| Description | Paper | Code | Blog |
|---|---|---|---|
| 最全《大型语言模型数据集》全面综述pdf及444个数据集获取地址 | Awesome-LLMs-Datasets | ||
| 一篇关于LLM指令微调的综述 | paper | blog | |
| 智源研究院发布国内首个大规模、可商用中文开源指令数据集COIG:最大规模中文多任务指令集,上新千个中文数据集 | paper | blog,COIG-PC数据下载地址,COIG数据下载地址 | |
| 总结当前开源可用的Instruct/Prompt Tuning数据 | blog | ||
| GPT-4平替版:MiniGPT-4,支持图像理解和对话,现已开源 | dataset | ||
| 多模态C4:一个开放的、10亿规模的、与文本交错的图像语料库 | paper | code | |
| Mind2Web: 首个全面衡量大模型上网能力的数据集 | blog | ||
| 该数据集是一个由人工生成、人工注释的助理式对话语料库,覆盖了广泛的主题和写作风格,由 161443 条消息组成,分布在 66497 个会话树中,使用 35 种不同的语言。该语料库是全球众包工作的产物,涉及超过 13500 名志愿者。为了证明 OpenAssistant Conversations 数据集的有效性,该研究还提出了一个基于聊天的助手 OpenAssistant,其可以理解任务、与第三方系统交互、动态检索信息。 | paper | code | dataset |
| 为了让Panda LLM在中文数据集上获得强大的性能,作者使用了强大的指令微调instruction-tuning技术,将LLaMA基础模型在五个开源的中文数据集进行混合训练,其中包括来自各种语言领域的1530万个样本,例如维基百科语料,新闻语料,百科问答语料,社区问答语料,和翻译语料。 | blog | ||
| RedPajama开源项目|复制超过1.2万亿个令牌的LLaMA训练数据集 | code | 原始blog,中文blog,dataset |