This repository contains the Unnatural Instructions dataset, a collection of instructions automatically generated by a large language model. For full details, see the paper "Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor".
The data folder contains two files: core_data.jsonl, containing the Unnatural Instructions core dataset of 68,478 instruction-input-output triplets, and full_data.jsonl, containing the full 240,670 Unnatural Instructions examples. The full data was constructed by expanding the core data with automatically generated instruction paraphrases.
Each line in core_data.jsonl is a JSON object with two fields: instruction, a natural language instruction describing a task, and instances, an array of JSON objects, each containing the following fields:
- input: An input for the task described by the instruction
- instruction_with_input: The instruction concatenated with the input
- constraints: The task's output space constraints
- output: The output of executing instruction with the given input
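The layout described above can be read with a few lines of Python. The following is a minimal sketch, not an official loader: the helper name `iter_examples` and the sample record are hypothetical and made up for illustration, and the way the sample joins instruction and input in `instruction_with_input` is an assumption.

```python
import json
from typing import Iterable, Iterator


def iter_examples(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one flat dict per instance from lines in the core_data.jsonl layout."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        for instance in record["instances"]:
            yield {
                "instruction": record["instruction"],
                "input": instance["input"],
                "instruction_with_input": instance["instruction_with_input"],
                "constraints": instance["constraints"],
                "output": instance["output"],
            }


# Hypothetical record, invented for illustration; it is NOT taken from the dataset.
sample_line = json.dumps({
    "instruction": "Decide whether the given number is even or odd.",
    "instances": [
        {
            "input": "7",
            # Assumption: the real separator between instruction and input may differ.
            "instruction_with_input": "Decide whether the given number is even or odd.\n7",
            "constraints": "The output should be 'even' or 'odd'.",
            "output": "odd",
        }
    ],
})

examples = list(iter_examples([sample_line]))
print(examples[0]["input"], "->", examples[0]["output"])
```

Because `iter_examples` accepts any iterable of lines, an open file handle can be passed directly, for example `open("data/core_data.jsonl", encoding="utf-8")`, assuming the file sits in the data folder mentioned above.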
full_data.jsonl has the same structure as core_data.jsonl, with one additional field: reformulations. reformulations is an array of JSON objects, each corresponding to an automatically generated paraphrase of the given instruction. Each reformulation contains the following fields:
- instruction: A paraphrase of the original instruction
- input: An input for the task described by the instruction
- instruction_with_input: The paraphrased instruction concatenated with the input
- output: The output of executing instruction with the given input
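The paraphrases can be flattened in a similar way. This is a rough sketch under one assumption that the description above does not settle: that reformulations is a top-level key of each record, next to instruction and instances. The helper name `iter_reformulations` and the sample record are hypothetical.

```python
import json
from typing import Iterable, Iterator


def iter_reformulations(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per paraphrase from lines in the full_data.jsonl layout."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        # Assumption: "reformulations" sits at the top level of the record.
        # Records without the key simply yield nothing.
        for reformulation in record.get("reformulations", []):
            yield {
                "original_instruction": record["instruction"],
                "instruction": reformulation["instruction"],
                "input": reformulation["input"],
                "instruction_with_input": reformulation["instruction_with_input"],
                "output": reformulation["output"],
            }


# Hypothetical record, invented for illustration; it is NOT taken from the dataset.
sample_full_line = json.dumps({
    "instruction": "Decide whether the given number is even or odd.",
    "instances": [],
    "reformulations": [
        {
            "instruction": "Is the following number even or odd?",
            "input": "7",
            "instruction_with_input": "Is the following number even or odd?\n7",
            "output": "odd",
        }
    ],
})

paraphrases = list(iter_reformulations([sample_full_line]))
print(paraphrases[0]["instruction"])
```

Keeping the original instruction alongside each paraphrase makes it easy to group paraphrases by task later, for instance when sampling one wording per task.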
If you make use of Unnatural Instructions, please cite the following paper:
@misc{honovich2022unnatural,
title = {Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor},
author = {Honovich, Or and Scialom, Thomas and Levy, Omer and Schick, Timo},
url = {https://arxiv.org/abs/2212.09689},
publisher = {arXiv},
year={2022}
}