| Instruction tuning | self-instruct: an instruction set generated automatically with GPT-3 and then filtered | https://github.com/yizhongw/self-instruct |
| Instruction tuning | Stanford Alpaca: a 52K self-instruct instruction dataset generated with text-davinci-003 | https://github.com/tatsu-lab/stanford_alpaca |
| Instruction tuning | GPT4-for-LLM: Chinese, English, and comparison instructions | https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM |
| Instruction tuning | GPTeacher: more diverse general-purpose instructions, plus role-play and code instructions | https://github.com/teknium1/GPTeacher/tree/main |
| Instruction tuning | Chinese translations of Alpaca, along with some other instruction datasets | https://github.com/hikariming/alpaca_chinese_dataset https://github.com/carbonz0/alpaca-chinese-dataset |
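| Instruction tuning | Alpaca instructions with responses generated by GPT-4; noticeably higher quality and longer responses than the versions listed above | https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM/tree/main |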
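| Instruction tuning | Guanaco: Alpaca instructions rewritten and regenerated in different languages, 534K samples in total; includes dialogue and non-dialogue types, plus supplementary generated QA samples | https://huggingface.co/datasets/JosephusCheung/GuanacoDataset |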
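| Instruction tuning | COIG: Chinese instructions, including translated Alpaca, Natural Instructions, and Unnatural Instructions, plus multi-turn dialogue, exam, and LeetCode instructions | https://github.com/BAAI-Zlab/COIG |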
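| Instruction tuning | Samples used to train Vicuna: user conversations with ChatGPT collected from ShareGPT through its API; community members have uploaded part of the data to Hugging Face | https://github.com/domeccleston/sharegpt https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/tree/main |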
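| Instruction tuning | HC3: Chinese and English instruction data covering finance, open QA, encyclopedia, DBQA, medicine, and more, with human-written answers included | https://huggingface.co/datasets/Hello-SimpleAI/HC3-Chinese/tree/main |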
| Instruction tuning | MOSS: open-source SFT data, including dialogues that use plugins | https://github.com/OpenLMLab/MOSS |
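| Instruction tuning | InstructWild: ChatGPT instructions scraped from various sources are used as seeds and expanded with self-instruct; bilingual (Chinese and English) | https://github.com/XueFuzhao/InstructionWild/tree/main/data |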
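| Instruction tuning | BELLE: 1M instruction samples generated with ChatGPT following Alpaca's approach, including math, multi-turn dialogue, role-play dialogue, and more | https://github.com/LianjiaTech/BELLE |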
| Instruction tuning | PromptCLUE multi-task prompt dataset: built from templates, covers only standard NLP tasks | https://github.com/CLUEbenchmark/pCLUE |
| Instruction tuning | Instruction dataset used to fine-tune Tk-Instruct: 1,600+ NLP tasks, fully human-annotated | https://instructions.apps.allenai.org/ |
| Instruction tuning | Instruction dataset used to fine-tune T0 (P3) | https://huggingface.co/datasets/bigscience/P3 |
| Instruction tuning | Multilingual dataset in 46 languages derived from P3 (xmtf) | https://github.com/bigscience-workshop/xmtf |
| Instruction tuning | Unnatural Instructions: 240K samples generated with GPT-3 and then paraphrased | https://github.com/orhonovich/unnatural-instructions |
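| Instruction tuning | Alpaca-CoT: cleans multiple data sources, unifies their format, and publishes them on Hugging Face; the highlight is the manually curated CoT data | https://github.com/PhoebusSi/Alpaca-CoT |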
| Instruction tuning | Firefly: human-written instruction data covering 23 common Chinese NLP tasks, oriented toward Chinese writing | https://github.com/yangjianxin1/Firefly |
| Instruction tuning | Amazon CoT instruction samples, including various QA tasks, BIG-bench, math, and more | https://github.com/amazon-science/auto-cot |
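| Instruction tuning | CSL: metadata (title, abstract, keywords, discipline, category) for 396,209 papers from Chinese core journals; usable for pretraining or for building NLP instruction tasks | https://github.com/ydli-ai/CSL |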
| Instruction tuning | Code Alpaca: 20K code instruction samples | https://github.com/sahil280114/codealpaca#data-release |
| Instruction tuning | GPT4Tools: 71K GPT-4 instruction samples | https://github.com/StevenGrove/GPT4Tools |
| Instruction tuning | GPTeacher: GPT-4 instructions, role-play, and code instructions | https://github.com/teknium1/GPTeacher |
| Math | APE210k: math problems crawled from the web, released by Yuanfudao AI Lab | https://github.com/Chenny0808/ape210k |
| Math | Math23K: elementary-school word problems, open-sourced by Tencent AI Lab | https://github.com/SCNU203/Math23k/tree/main |
| Math | Grade School Math: OpenAI's grade-school math problems (GSM8K) converted into instruction samples, each with a 2-8 step reasoning process | https://huggingface.co/datasets/qwedsacf/grade-school-math-instructions |
| Math | MathQA: math question-answering dataset with reasoning steps and multiple-choice options | https://huggingface.co/datasets/math_qa/viewer/default/test?row=2 |
| Math | AMC competition math problems | https://huggingface.co/datasets/competition_math |
| Math | Pure-math computation problems such as linear algebra | https://huggingface.co/datasets/math_dataset |
| Code | APPS: problems collected from open-access coding sites such as Codeforces and Kattis | https://opendatalab.org.cn/APPS |
| Code | Lyra: Python programs with embedded SQL; carefully annotated database-manipulation programs with both Chinese and English comments | https://opendatalab.org.cn/Lyra |
| Code | CoNaLa: drawn from StackOverflow questions; 3K manually annotated examples, in English | https://opendatalab.org.cn/CoNaLa/download |
| Code | code-alpaca: 20K code instruction samples generated with ChatGPT | https://github.com/sahil280114/codealpaca.git |
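| Dialogue instructions | Open Instruction Generalist (OIG), curated by LAION, together with a manually selected subset of its components; released so far: 40M and 30K, with 100M planned | https://github.com/LAION-AI/Open-Instruction-Generalist |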
| Dialogue instructions | Baize: self-chat data built with ChatGPT | https://github.com/project-baize/baize-chatbot/tree/main/data |
| Dialogue instructions | Dialogue data used to train BlenderBot, open-sourced by Facebook, about 6K | https://huggingface.co/datasets/blended_skill_talk |
| Dialogue instructions | SODA: high-quality dataset of 385K dialogues, open-sourced by AllenAI | https://huggingface.co/datasets/allenai/soda |
| Dialogue instructions | InstructDial: instruction tuning focused on a single task type, dialogue | https://github.com/prakharguptaz/Instructdial |
| Dialogue instructions | UltraChat: two separate ChatGPT Turbo API instances converse with each other to generate multi-turn dialogue data | https://github.com/thunlp/UltraChat |
| Dialogue instructions | Awesome Open-domain Dialogue Models: provides multiple open-domain dialogue datasets | https://github.com/cingtiye/Awesome-Open-domain-Dialogue-Models#%E4%B8%AD%E6%96%87%E5%BC%80%E6%94%BE%E5%9F%9F%E5%AF%B9%E8%AF%9D%E6%95%B0%E6%8D%AE%E9%9B%86 |
| RLHF | PKU Beaver: open-source RLHF dataset with 10K samples; the 1M version requires an application | https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF-10K |
| RLHF | Anthropic hh-rlhf dataset | https://huggingface.co/datasets/Anthropic/hh-rlhf |
| RLHF | Stack Exchange questions, each paired with multiple answers that carry scores | https://huggingface.co/datasets/HuggingFaceH4/stack-exchange-preferences/tree/main |
| RLHF | Facebook Bot Adversarial Dialogues dataset, 5K | https://github.com/facebookresearch/ParlAI |
| RLHF | AllenAI RealToxicityPrompts | https://realtoxicityprompts.apps.allenai.org/ |
| RLHF | OpenAssistant Conversations: about 160K messages written by roughly 13,500 human contributors, mostly in English | https://huggingface.co/datasets/OpenAssistant/oasst1 |
| Evaluation sets | BIG-bench (Beyond the Imitation Game Benchmark) | https://github.com/google/BIG-bench |
| Evaluation sets | Complex QA: an instruction set for evaluating ChatGPT | https://github.com/tan92hl/Complex-Question-Answering-Evaluation-of-ChatGPT |
| Evaluation sets | LangChain open-source evaluation datasets | https://huggingface.co/LangChainDatasets |
| Evaluation sets | GAOKAO-Bench: questions from China's national college entrance exam (Gaokao) papers, 2010-2022 | https://github.com/OpenLMLab/GAOKAO-Bench |
| Evaluation sets | SuperCLUE: a comprehensive evaluation benchmark for general-purpose Chinese large models | https://github.com/CLUEbenchmark/SuperCLUE |
| Pretraining | RedPajama: an open-source pretraining dataset that reproduces LLaMA's training data | https://github.com/togethercomputer/RedPajama-Data |
| Pretraining | The Pile: an 800 GB pretraining dataset that mixes 22 high-quality datasets; fully open for download | https://pile.eleuther.ai/ |
| Pretraining | Compiled by UER: CLUECorpusSmall plus News Commentary, Chinese and English | https://github.com/dbiir/UER-py/wiki/%E9%A2%84%E8%AE%AD%E7%BB%83%E6%95%B0%E6%8D%AE |
| Pretraining | WuDao: 200 GB of pretraining data open-sourced by BAAI (Beijing Academy of Artificial Intelligence) | https://github.com/BAAI-WuDao/WuDaoMM |
| Multi-source dataset collections | OpenDataLab aggregates multiple data sources for the pretraining stage | https://opendatalab.org.cn/?industry=9821&source=JUU3JTlGJUE1JUU0JUI5JThF |
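
Several links in the table point at Hugging Face dataset pages, sometimes with extra path segments such as `tree/main` or `viewer/...`. As a minimal sketch, assuming you want the bare repository ID that tools such as the `datasets` library expect, the helper below strips those segments. The function name `hub_repo_id` and the set of UI segments are illustrative assumptions, not part of any library, and whether a given ID still resolves on the Hub is not checked.

```python
from typing import Optional
from urllib.parse import urlparse

# Path segments the Hub web UI appends after the repository ID (assumed list).
_UI_SEGMENTS = {"tree", "blob", "resolve", "viewer"}


def hub_repo_id(url: str) -> Optional[str]:
    """Return the dataset repository ID from a huggingface.co dataset URL.

    Returns None for URLs that are not Hugging Face dataset pages.
    """
    parsed = urlparse(url)
    if parsed.netloc != "huggingface.co":
        return None
    segments = [s for s in parsed.path.split("/") if s]
    if len(segments) < 2 or segments[0] != "datasets":
        return None
    repo_parts = []
    for segment in segments[1:]:
        # Stop at UI-only segments, or once we have "owner/name".
        if segment in _UI_SEGMENTS or len(repo_parts) == 2:
            break
        repo_parts.append(segment)
    return "/".join(repo_parts) or None


if __name__ == "__main__":
    for link in (
        "https://huggingface.co/datasets/Anthropic/hh-rlhf",
        "https://huggingface.co/datasets/Hello-SimpleAI/HC3-Chinese/tree/main",
        "https://github.com/google/BIG-bench",
    ):
        print(link, "->", hub_repo_id(link))
```

The resulting ID can then be passed to `datasets.load_dataset(repo_id)` if the `datasets` package is installed; some datasets need an extra configuration name or an access request, so treat the call as a starting point.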