InstructionZoo

A collection of open-source instruction-tuning datasets for training chat-based LLMs (ChatGPT, LLaMA, Alpaca).

This is an ongoing project. We will soon add tags to classify the datasets below, and we will continuously update the collection.

Table of Contents

The template

## [owner/project-name](https://github.com/link/to/project)

* Size:
* Language:
* Summary:
* Generation Method:
* Paper:
* HuggingFace: (if applicable)
* Demo: (if applicable)
* License:

The English Instruction Datasets

  • Empty for now. Soon to update.
Unnatural Instructions

  • Size: 240,000 instructions
  • Language: EN
  • Summary: Unnatural Instructions consists of a core dataset of 68,478 instruction-input-output triplets and an expanded full dataset.
  • Generation Method:
    • Step 1 (Core Dataset Generation): Collect 64,000 examples by prompting a language model with three seed examples of instructions and eliciting a fourth, following a strict instruction-input-output format.
    • Step 2 (Template Expansion): Prompt a language model to reformulate the tasks in the core dataset, collecting two alternative formulations for each generated task.
  • Paper: Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor
  • License:
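The core-generation step described above can be sketched as follows. This is a minimal illustration, not the authors' code: three seed instruction-input-output triplets are rendered into a prompt, a language model (stubbed here by the hypothetical `call_model`) is asked to complete a fourth example, and the completion is parsed back into a triplet.

```python
# Seed triplets in the strict instruction-input-output format (invented examples).
SEEDS = [
    {"instruction": "Translate to French.", "input": "Hello.", "output": "Bonjour."},
    {"instruction": "Sum the numbers.", "input": "2 3", "output": "5"},
    {"instruction": "Negate the sentence.", "input": "It rains.", "output": "It does not rain."},
]

def format_prompt(seeds):
    """Render seed triplets and ask the model to continue with a fourth example."""
    blocks = []
    for i, ex in enumerate(seeds, 1):
        blocks.append(
            f"Example {i}\nInstruction: {ex['instruction']}\n"
            f"Input: {ex['input']}\nOutput: {ex['output']}"
        )
    return "\n\n".join(blocks) + "\n\nExample 4\nInstruction:"

def parse_completion(text):
    """Parse a completion back into an instruction-input-output triplet."""
    instruction, rest = text.split("\nInput: ", 1)
    inp, out = rest.split("\nOutput: ", 1)
    return {"instruction": instruction.strip(), "input": inp.strip(), "output": out.strip()}

def call_model(prompt):
    """Hypothetical stand-in for a real language-model API call."""
    return " Reverse the word.\nInput: cat\nOutput: tac"

triplet = parse_completion(call_model(format_prompt(SEEDS)))
print(triplet)
```

In the real pipeline the elicited triplets are filtered for format compliance and deduplicated before entering the core dataset.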
FLAN 2021

  • Size: 62 tasks
  • Language: EN
  • Summary: FLAN 2021 aggregates 62 text datasets from TensorFlow Datasets into a single mixture. It is currently not public.
  • Generation Method: Map existing datasets into the instruction schema.
  • Paper: Finetuned Language Models Are Zero-Shot Learners
  • License:
InstructionWild

  • Size: 479 seed instructions, 52,191 Chinese instructions, 52,191 English instructions
  • Language: CH, EN
  • Summary: InstructionWild uses the same format as Alpaca for fast and easy usage. Its instructions have no input field.
  • Generation Method:
    • Pick 429 instructions from over 700 noisy instructions collected from Twitter.
    • Use a method similar to Alpaca's to generate the resulting instructions.
  • License:
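Since InstructionWild keeps Alpaca's record layout but leaves the input field empty, a record looks roughly like the following. The example values are invented for illustration.

```python
import json

# A hypothetical record in the Alpaca-style schema that InstructionWild reuses;
# the "input" field is present for compatibility but always left empty.
record = {
    "instruction": "Write a short poem about the sea.",
    "input": "",
    "output": "Waves rise and fall beneath a silver sky.",
}

print(json.dumps(record, ensure_ascii=False, indent=2))
```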

OPT-IML

  • Size: 1,667 tasks, 3,128 instructions
  • Language: EN
  • Summary: The OPT-IML dataset expands the Super-Natural-Instructions benchmark with task collections from multiple existing works on instruction tuning, cross-task transfer studies, and area-specific task consolidation.
  • Generation Method:
    • Benchmarks included in OPT-IML are Super-Natural-Instructions, PromptSource, CrossFit, FLAN, ExMix, T5, UnifiedSKG, and Reasoning. The authors kept only a subset of the tasks from CrossFit, ExMix, and T5 due to significant overlap.
    • To unify the instruction schema, the authors broadly classify the instructions in these benchmarks into two categories: dataset-level and instance-level.
  • Paper: OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization
  • License:
UltraChat

  • Size: 657K instructions
  • Language: EN
  • Summary: UltraChat is a multi-round dialogue dataset generated with Turbo APIs, composed of three sectors: Questions about the World, Writing and Creation, and Assistance on Existent Materials.
  • Generation Method:
    • Two separate ChatGPT Turbo APIs are used in generation: one plays the role of the user and generates queries, while the other generates responses.
    • The user model is instructed with carefully designed prompts to mimic human user behavior, and the two APIs are called iteratively.
  • HuggingFace: https://huggingface.co/datasets/stingning/ultrachat
  • License:
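The two-API loop described above can be sketched as follows. Both model functions here are hypothetical stubs standing in for the Turbo API calls; the point is the alternation, where each model conditions on the dialogue so far.

```python
def user_model(history):
    """Generate the next user query from the dialogue so far (stubbed)."""
    return f"user query {len(history) // 2 + 1}"

def assistant_model(history):
    """Generate a response to the latest user query (stubbed)."""
    return f"response to: {history[-1]['content']}"

def generate_dialogue(rounds=3):
    """Call the two models iteratively to build a multi-round dialogue."""
    history = []
    for _ in range(rounds):
        history.append({"role": "user", "content": user_model(history)})
        history.append({"role": "assistant", "content": assistant_model(history)})
    return history

dialogue = generate_dialogue()
print(len(dialogue))  # 6 messages = 3 rounds
```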
Dolly

  • Size: 7 tasks, 15,000 instructions
  • Language: EN
  • Summary: Dolly is a human-generated corpus whose categories are Creative Writing, Closed QA, Open QA, Summarization, Information Extraction, Classification, and Brainstorming.
  • Generation Method:
    • Databricks employees were invited to create prompt/response pairs in each of eight instruction categories.
    • For instruction categories that require the annotator to consult a reference text, contributors selected passages from Wikipedia.
  • HuggingFace: https://huggingface.co/datasets/databricks/databricks-dolly-15k
  • License:
  • Empty for now. Soon to update.

The Chinese Instruction Datasets

COIG

  • Size: 2K tasks, 191,191 instructions in total
  • Language: CH
  • Summary: Chinese Open Instruction Generalist (COIG) is a Chinese instruction dataset consisting of 5 sub-tasks.
  • Generation Method:
    • Task 1: Translated Instructions (67,798)
      • Translate the following datasets into Chinese: 1,616 task descriptions in Super-Natural-Instructions v2 along with a single instance for each of them; 175 seed tasks in Self-Instruct; and 66,007 instructions from Unnatural Instructions.
    • Task 2: Exam Instructions (63,532)
      • Exams include the Chinese National College Entrance Examination (高考), Middle School Entrance Examinations (中考), and the Civil Servant Examination (公务员考试).
      • Turn them into a Chain-of-Thought (CoT) corpus by extracting six informative elements from the original exam questions: instruction, question context, question, answer, answer analysis, and coarse-grained subject.
    • Task 3: Human Value Alignment Instructions (34,471)
      • Select a set of samples that present shared human values in the Chinese-speaking world, yielding 50 seed instructions and 3K resulting instructions.
      • Additional sets of samples that present regional-culture or country-specific human values are also added.
    • Task 4: Counterfactual Correction Multi-round Chat (13,653)
      • The aim is to alleviate the hallucination and factual-inconsistency problems of current LLMs.
      • Based on the CN-DBpedia knowledge graph, CCMC contains ~13,000 dialogues with an average of 5 rounds per dialogue, resulting in ~65,000 rounds of chat.
    • Task 5: Leetcode Instructions (11,737)
      • Built from 2,589 programming questions from Leetcode.
  • Paper: Chinese Open Instruction Generalist: A Preliminary Release
  • HuggingFace: https://huggingface.co/datasets/BAAI/COIG
  • License: MIT License
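The six informative elements extracted in COIG's exam task can be pictured as a simple record type. The class below is illustrative only (the field names follow the description above; the sample question is invented, and this is not COIG's actual code).

```python
from dataclasses import dataclass

@dataclass
class ExamCoTRecord:
    """One exam question turned into a Chain-of-Thought record."""
    instruction: str       # what the solver is asked to do
    question_context: str  # background material for the question
    question: str          # the question itself
    answer: str            # the final answer
    answer_analysis: str   # the step-by-step reasoning (the CoT part)
    subject: str           # coarse-grained subject label

rec = ExamCoTRecord(
    instruction="Answer the physics question.",
    question_context="A ball is dropped from rest near the ground.",
    question="How far does it fall in 2 s? (take g = 10 m/s^2)",
    answer="20 m",
    answer_analysis="d = g * t^2 / 2 = 10 * 4 / 2 = 20 m.",
    subject="physics",
)
print(rec.subject)
```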
CSL

  • Size: 4 tasks, 396,209 instructions
  • Language: CH
  • Summary: CSL is a large-scale Chinese scientific literature dataset.
  • Generation Method:
    • Obtain papers' meta-information, dated from 2010 to 2020, from the National Engineering Research Center for Science and Technology Resources Sharing Service (NSTR).
    • Label papers with categories and disciplines with the assistance of volunteers.
    • Each record in CSL has the form <T, A, K, c, d>, where T is the title, A the abstract, K a list of keywords, c the category label, and d the discipline label.
  • Paper: CSL: A Large-scale Chinese Scientific Literature Dataset
  • License:
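The <T, A, K, c, d> layout above maps naturally onto a named tuple. This sketch is illustrative only; the sample paper below is invented, and CSL's actual records are in Chinese.

```python
from typing import List, NamedTuple

class CSLRecord(NamedTuple):
    """One CSL record in the <T, A, K, c, d> layout."""
    title: str           # T: the title
    abstract: str        # A: the abstract
    keywords: List[str]  # K: a list of keywords
    category: str        # c: the category label
    discipline: str      # d: the discipline label

paper = CSLRecord(
    title="A Deep Learning Approach to Image Segmentation",
    abstract="We propose a segmentation method based on deep networks.",
    keywords=["image segmentation", "deep learning"],
    category="engineering",
    discipline="computer science",
)
print(paper.discipline)
```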

ZeroPrompt

  • Empty for now. Soon to update.

Chinese Alpaca

  • Size: 20,456 instructions
  • Language: CH
  • Generation Method: Machine-translate Alpaca into Chinese, then clean.
  • Size: 19,442 instructions
  • Language: CH
  • Generation Method: Translate Alpaca into Chinese with ChatGPT, then have humans check the results.
  • Size: 51,458 instructions
  • Language: CH
  • Generation Method: Translate Alpaca into Chinese with ChatGPT, discarding some instructions.
  • Size: 51,672 instructions
  • Language: CH
  • Generation Method: Translate the Stanford Alpaca dataset into Chinese with ChatGPT.
  • Size: 20,465 instructions
  • Language: TC
  • Generation Method: Translate the Stanford Alpaca dataset into Traditional Chinese using OpenCC.
  • Size: 124,469 instructions
  • Language: EN, TC
  • Generation Method: Combine the English instruction/input with Traditional Chinese output generated by ChatGPT.
  • Size: 52,002 instructions
  • Language: EN, TC
  • Generation Method: A Traditional-Chinese version of the Alpaca dataset whose instruction part is left in English.
  • Size: 52,002 instructions
  • Language: EN, TC
  • Generation Method: A Traditional-Chinese version of the Alpaca dataset with both English and Traditional Chinese versions of each instruction.
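The variants above all follow the same basic recipe: walk the original Alpaca records and machine-translate some or all fields. A minimal sketch, with `translate` as a hypothetical stand-in for ChatGPT or any other MT system:

```python
def translate(text, target="zh"):
    """Hypothetical machine-translation call, stubbed for illustration."""
    return f"[{target}] {text}"

def translate_record(record, target="zh"):
    """Translate every non-empty field of an Alpaca-style record.

    Some variants instead keep the English instruction/input and translate
    only the output; that is a one-line change to the comprehension below.
    """
    return {k: translate(v, target) if v else "" for k, v in record.items()}

src = {"instruction": "List three fruits.", "input": "", "output": "Apple, pear, plum."}
out = translate_record(src)
print(out["output"])  # [zh] Apple, pear, plum.
```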

The Multilingual Instruction Datasets

Guanaco

  • Size: 380,835 instructions in total
  • Language: CH, DE, EN, JA, TC
  • Summary: The Guanaco dataset builds upon the 175 tasks from Alpaca and comes in 3 versions with different sizes and generation methods.
  • Generation Method:
    • Original Version (48,967): Rewrite the 175 Alpaca seed tasks in different languages, and add new tasks specifically designed for English grammar analysis, natural language understanding, cross-lingual self-awareness, and explicit content recognition.
    • Mixed Version (279,644): The original 175 tasks were translated into 4 language versions and regenerated independently, excluding German.
    • Mini Version (52,224): A 52K-instruction subset of the Mixed Version.
  • HuggingFace: https://huggingface.co/datasets/JosephusCheung/GuanacoDataset/tree/main
  • License:
Paper/General-QA

  • Size: 205,999 instructions in total
  • Language: CH, DE, EN, JA
  • Summary: The Paper/General-QA dataset is a collection of questions and answers constructed from AI-generated papers or general texts in 4 languages. Its purpose is to generate paragraph-level answers to questions posed about lengthy documents such as PDFs.
  • Generation Method:
    • The question set contains 106,707 questions, and the answer set contains 99,292 answers.
    • Similar questions are combined into a tree-like structure, and graph-theoretic algorithms are used to process user questions, content summaries, and contextual logic.
  • HuggingFace: https://huggingface.co/datasets/JosephusCheung/GuanacoDataset/tree/main/additional
  • License:

The Code Instruction Datasets

  • Size: 20,023 instructions
  • Language: EN
  • Summary:
  • Generation Method: Self-instruct with prompts focused on code generation, editing, and optimization tasks, using text-davinci-003.
  • HuggingFace:
  • License:
