This code produces the non-anonymized version of the CNN / Daily Mail summarization dataset for fine-tuning BART. It processes the dataset into the non-tokenized cased sample format expected by BPE preprocessing.

Instructions

1. Download data

Download and unzip the stories directories from here for both CNN and Daily Mail.

2. Process into .source and .target files

Run

python make_datafiles.py /path/to/cnn/stories /path/to/dailymail/stories

replacing /path/to/cnn/stories with the path to the cnn/stories directory that you downloaded, and likewise for dailymail/stories.

For each of the URL lists (all_train.txt, all_val.txt and all_test.txt), the corresponding stories are read from file and written to the text files train.source and train.target, val.source and val.target, and test.source and test.target, respectively. These files are placed in the newly created cnn_dm directory.
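As a rough illustration of the per-story conversion described above, the sketch below splits one raw story into an article line and a summary line. It assumes the stories use `@highlight` marker lines to introduce each summary sentence, and that each example is flattened to a single line. The helper name `split_story` and the choice of a space as the joining separator are hypothetical and not taken from make_datafiles.py.

```python
def split_story(text: str) -> tuple[str, str]:
    """Split raw story text into (article, summary), each as a single line.

    Assumption: every summary sentence follows a line starting with
    "@highlight", and everything before the first marker is the article.
    """
    article_lines = []
    highlight_lines = []
    in_highlights = False
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines between paragraphs
        if line.startswith("@highlight"):
            in_highlights = True  # subsequent text belongs to the summary
        elif in_highlights:
            highlight_lines.append(line)
        else:
            article_lines.append(line)
    # One example per line: join with spaces (separator is an assumption).
    return " ".join(article_lines), " ".join(highlight_lines)
```

The article string would go to a .source file and the summary string to the matching .target file, one line each, so that line i of both files refers to the same story.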

The output is now suitable for feeding to the BPE preprocessing step of BART fine-tuning.
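Before moving on to BPE preprocessing, it may be worth checking that each .source file has the same number of lines as its .target file. The sketch below is one way to do that, assuming one example per line; `count_lines` and `check_parallel` are hypothetical helper names and are not part of this repository.

```python
from pathlib import Path


def count_lines(path) -> int:
    """Count the lines in a UTF-8 text file."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)


def check_parallel(out_dir, splits=("train", "val", "test")) -> dict:
    """Return the example count per split; raise if source/target differ."""
    counts = {}
    for split in splits:
        n_src = count_lines(Path(out_dir) / f"{split}.source")
        n_tgt = count_lines(Path(out_dir) / f"{split}.target")
        if n_src != n_tgt:
            raise ValueError(
                f"{split}: {n_src} source lines but {n_tgt} target lines"
            )
        counts[split] = n_src
    return counts
```

For example, calling `check_parallel("cnn_dm")` from the directory where make_datafiles.py was run would return the number of examples in each split, or raise an error naming the first mismatched split.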
